Managing modern networked systems and applications is daunting because infrastructure is complex and things can go wrong in so many parts of the technology stack -- servers, storage, network devices, applications, hypervisors, APIs, DNS, etc. How can you address the challenge?
A good place to start: problems that can solve themselves, should.
This is called "self-healing" in the systems management space. As our systems are increasingly virtualized, the opportunity to have our systems work around and self-correct issues has grown greatly in recent years.
The simplest example of self-healing is automatically restarting a service or process that stops or otherwise becomes unresponsive. Keep in mind that this is a workaround, and that automated activity of all sorts needs to be logged and monitored in turn. If an application leaks memory badly enough that it must be restarted automatically several times a day, the restart is not the fix; it is a Band-Aid that mitigates the impact while the responsible developers fix the application.
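As a rough illustration of that idea, here is a minimal Python sketch of a restart watchdog that logs every restart and flags a service that keeps needing one. The class name, the `is_healthy` and `restart` callables, and the thresholds are all hypothetical; a real setup would plug in its own health check and service manager.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("self_heal")


class RestartWatchdog:
    """Restart an unhealthy service, log each restart, and flag chronic restarts."""

    def __init__(self, is_healthy, restart, max_restarts=3,
                 window_seconds=86400, clock=time.time):
        self.is_healthy = is_healthy          # callable: True when the service responds
        self.restart = restart                # callable: restarts the service
        self.max_restarts = max_restarts      # restarts tolerated per window
        self.window_seconds = window_seconds  # length of the sliding window
        self.clock = clock                    # injectable for testing
        self.restart_times = deque()

    def check_once(self):
        """Return 'ok', 'restarted', or 'escalate'."""
        if self.is_healthy():
            return "ok"
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self.restart_times and now - self.restart_times[0] > self.window_seconds:
            self.restart_times.popleft()
        self.restart()
        self.restart_times.append(now)
        logger.warning("service restarted (%d in window)", len(self.restart_times))
        if len(self.restart_times) > self.max_restarts:
            # The workaround is now hiding a real defect; tell a human.
            logger.error("chronic restarts: the underlying bug needs a real fix")
            return "escalate"
        return "restarted"
```

The point of the `escalate` result is that the automation reports on itself: the restart still happens, but repeated restarts become a visible signal rather than a silent habit.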
But since many applications span many systems, to make your systems self-healing you need to move past simple automation in response to alerts and onto full orchestration capabilities. Automation allows you to perform single tasks, but orchestration allows you to initiate an entire workflow to perform an entire process.
That orchestrated workflow should consist of composable automated tasks, but it often needs a perspective larger than that of a single server or service. One server may know it needs to restart a downed process on itself, but higher level orchestration would decide to throw away that old server, start up a new VM, tell it where to find its database for its application to point to, check to verify its services start correctly, update DNS and put it into the load balancer.
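The server-replacement workflow described above can be sketched as a list of small, composable steps run in order. This is only an outline under stated assumptions: every step function below is a hypothetical stand-in that records what it would do in a shared context dictionary, where a real step would call a cloud, DNS, or load balancer API.

```python
def run_workflow(steps, context):
    """Run (name, action) steps in order and stop at the first failure.

    Returns (names of completed steps, error message or None).
    """
    completed = []
    for name, action in steps:
        try:
            action(context)
        except Exception as exc:
            return completed, f"{name}: {exc!r}"
        completed.append(name)
    return completed, None


# Hypothetical steps; each one only updates the context to show the data flow.
def retire_old_server(ctx):
    ctx["retired"] = ctx["failed_server"]

def start_new_vm(ctx):
    ctx["new_server"] = ctx["failed_server"] + "-replacement"

def point_at_database(ctx):
    ctx["db_configured"] = ctx["database_url"]

def verify_services(ctx):
    if not ctx.get("db_configured"):
        raise RuntimeError("services did not start")

def update_dns(ctx):
    ctx["dns"] = ctx["new_server"]

def add_to_load_balancer(ctx):
    ctx.setdefault("pool", []).append(ctx["new_server"])


REPLACE_SERVER = [
    ("retire_old_server", retire_old_server),
    ("start_new_vm", start_new_vm),
    ("point_at_database", point_at_database),
    ("verify_services", verify_services),
    ("update_dns", update_dns),
    ("add_to_load_balancer", add_to_load_balancer),
]
```

Because the workflow stops at the first failing step, a replacement server that fails verification never reaches DNS or the load balancer.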
Using solid orchestration you can even shift services automatically to different data centers when one is down.
In the DevOps space, this kind of orchestration has been seen as especially valuable, since it is also necessary for performing complex software deployments frequently. A number of tools, both free and commercial, have been developed to address this need, including open source options such as Ansible, Salt, Rundeck, and extensions to Puppet and Chef.
Automating repetitive configuration and process tasks not only yields higher quality results (e.g., fewer mistakes), but also lets you free up manpower to work on higher-ROI issues instead of using valuable technical resources on what is effectively menial labor.
The most frequent reason cited for not implementing orchestration or even basic automation is "lack of time and resources to do so," but even a modest amount of automation can save enough time to prove its worth quickly.
Identify where you are spending low-value time on repetitive tasks -- code deploys, disk space cleanup, process restarts -- automate those first, and reinvest the time freed up to work on further orchestration. While orchestration code needs to be thought through carefully and can have bugs and unforeseen consequences, you will find that as more of your system is provisioned and adapts through orchestration, fewer issues will arise.
As you get more orchestration in place, you can take on tasks that you might not even set alerts for today because they are too much work for one person to stay on top of. Removing unused capacity, for example, is often a quarterly process when done manually. With automation configured, you can check all your VMs to see which have not been logged into in a certain amount of time, then snapshot and stop them according to a predefined policy. This frees up capacity, which in turn saves time otherwise spent in the never-ending search for spare capacity.
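A capacity-reclaim policy of that kind might look roughly like the following Python sketch. The VM record fields (`name`, `running`, `last_login`), the 90-day default, and the `snapshot` and `stop` callables are assumptions made for illustration; the real inventory and hypervisor calls depend on your platform.

```python
from datetime import datetime, timedelta


def find_idle_vms(vms, now, max_idle_days=90):
    """Return names of running VMs with no login since the cutoff."""
    cutoff = now - timedelta(days=max_idle_days)
    return [vm["name"] for vm in vms
            if vm["running"] and vm["last_login"] < cutoff]


def reclaim_idle_vms(vms, now, snapshot, stop, max_idle_days=90):
    """Snapshot, then stop, each idle VM. Returns the names reclaimed."""
    reclaimed = []
    for name in find_idle_vms(vms, now, max_idle_days):
        snapshot(name)  # preserve state first so the VM can be restored
        stop(name)
        reclaimed.append(name)
    return reclaimed
```

Taking the snapshot before stopping keeps the policy reversible if someone turns out to need the machine after all.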
Not all orchestration has to happen without human intervention; you might require approval for major orchestration events. But consider which is easier and more repeatable: answering an SMS that says "DC1 is down, shift services to DC2 [Y/N]" and having the rest happen via predefined automation, or executing a manual process to perform the same task.
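The approval step itself can be very small. The sketch below shows one way to gate a predefined failover on a Y/N reply; the function name and the `failover` callable are hypothetical, and how the reply actually arrives (SMS gateway, chat bot, pager) is left out.

```python
def handle_failover_reply(reply, failover):
    """Run the predefined failover only on an explicit yes.

    Returns 'executed', 'declined', or 'unrecognized'.
    """
    answer = reply.strip().upper()
    if answer in ("Y", "YES"):
        failover()
        return "executed"
    if answer in ("N", "NO"):
        return "declined"
    # Anything ambiguous is treated as no action rather than guessed at.
    return "unrecognized"
```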
Orchestration can also be made proactive; if you have your entire topology monitored you can use automation to shift capacity or make routing decisions well before an issue is encountered. If on Monday mornings you get a spike of ERP system usage, then proactive orchestration can add more application servers ahead of time instead of waiting for performance thresholds to be crossed.
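For the Monday-morning example, a proactive rule could be as simple as a schedule that raises the target server count before the spike. The function below is a sketch only: the baseline, the boost, and the hours are made-up numbers, and in practice they would come from your own monitoring history.

```python
from datetime import datetime


def desired_app_servers(now, baseline=4, boost=4, start_hour=7, end_hour=12):
    """Return the target number of application servers for a given time.

    Adds capacity on Monday mornings ahead of the expected usage spike.
    """
    is_monday = now.weekday() == 0  # datetime.weekday(): Monday is 0
    if is_monday and start_hour <= now.hour < end_hour:
        return baseline + boost
    return baseline
```

An orchestration job could call this periodically and add or remove servers to match the result, instead of waiting for a performance threshold to be crossed.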
Network self-healing has been a "nice to have" goal for many years, but increasing data center complexity is making it more necessary and increasing sophistication in tooling is making it much more achievable.
Mueller has spent twenty years in IT and managed a variety of engineering teams that have used monitoring and automation to manage large scale Web services. He lives in Austin, TX.