The business impact and cost of application outages and degradation have grown tremendously with the rise of digital business; Gartner estimates the cost at around US$5,600 per minute of downtime. The increasing complexity of underlying networks and systems only exacerbates the problem.
The process by which these problems are detected and resolved, however, hasn’t evolved. The network is often blamed first during application and infrastructure outages and performance degradation, but isn't always the culprit.
Not only is this assumption often wrong, but it also prolongs the resolution process. Outages and performance problems are generally caused by a mix of network, system and application effects.
The blame-network-first culture was born and perpetuated for several reasons:
- The network is viewed as an underlying utility and a black box, so becomes an easy target.
- When network outages do happen, they’re often well-publicised and affect many, resulting in negative press.
- Network teams have maintained a deep technical expertise that’s siloed from other parts of the IT operations organisation.
An organisation I recently spoke with deployed an update to a financial services subscription application delivered over a WAN to its paying clients. Performance immediately degraded compared with the test lab. The network team was blamed, even though the network looked fine, and the network operations team used its monitoring solutions to deflect the blame and show the fault wasn't its own.
After several days of finger pointing and war room meetings between the network and application teams, the network team used packet data to finally determine that the application update wasn’t written with WAN effects in mind. Analysis showed that the application itself was extremely "chatty" (many requests/responses) and that it was highly sensitive to increased network latency. The problem was finally assigned to the application team to investigate why the application was as chatty as it was.
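The effect is easy to quantify: a chatty application's transaction time is dominated by the number of serialised round trips multiplied by network latency. A minimal back-of-the-envelope sketch, where the request counts and latency figures are illustrative assumptions rather than numbers from the case above:

```python
# Rough model: total transaction time ~= serialised round trips x round-trip time (RTT).
# All numbers below are illustrative assumptions, not measurements from the case study.

def transaction_time_ms(round_trips: int, rtt_ms: float) -> float:
    """Time for a transaction that issues its requests one after another."""
    return round_trips * rtt_ms

chatty_round_trips = 200   # a "chatty" app: many small request/response pairs
lan_rtt_ms = 0.5           # typical LAN round trip
wan_rtt_ms = 40.0          # plausible WAN round trip

lan_time = transaction_time_ms(chatty_round_trips, lan_rtt_ms)   # 100 ms
wan_time = transaction_time_ms(chatty_round_trips, wan_rtt_ms)   # 8000 ms

print(f"LAN: {lan_time:.0f} ms, WAN: {wan_time:.0f} ms")
```

Under these assumed numbers, the same transaction that feels instant in a test lab becomes an eight-second wait over the WAN, which is why round-trip counts in the packet data, not bandwidth graphs, exposed the problem.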
To its credit, the organisation realised that the blame-network-first culture extended the problem resolution time, in this case by almost a week.
The risks of a blame-network-first culture are severe. Aside from the obvious morale issues, blaming the network first only prolongs the resolution effort, forcing the network team to prove its innocence instead of working collaboratively with other teams on a lasting fix.
We need to change this culture to improve uptime and reduce resolution times.
Dispel blame-network-first myths
Infrastructure and operations leaders must institute basic triage practices to roughly identify the general cause of a performance problem and determine the right specialists to hand it off to for root cause analysis. Unfortunately, this process is often skewed by some prevailing myths, so it’s important to dispel them before the network operations team is blamed.
- Network operations just needs to add more bandwidth to solve most problems
- Application works fine on the LAN, but not on the WAN, so must be the network
- Network team is responsible for any and all connectivity issues
Increasing bandwidth isn’t a panacea. WAN bandwidth has become less expensive, and network operations teams typically enforce strict utilisation thresholds, so they know well in advance when upgrades are needed. The network team must work with application teams through the development and test phases of new or upgraded applications to ensure excessive bandwidth consumption doesn’t arise in production.
Applications are still built in LAN environments and often aren’t tested under poor network conditions (higher latency, jitter and packet loss) before deployment. New application deployments and upgrades need to be written to withstand deployment to all remote users, as well as adverse network conditions. Network teams should point application teams toward specific transactions that are particularly latency-sensitive and then work together on a resolution.
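One lightweight way to surface latency sensitivity before production is to inject WAN-like delay into tests and compare a chatty design against a batched one. The sketch below is a hypothetical harness: the function names, latency and batch figures are illustrative assumptions, not a specific product's API.

```python
import random

# Hypothetical test harness: simulate a WAN-like round trip with jitter so
# latency-sensitive code paths show up before deployment, not after.
def simulated_rtt_ms(base_ms: float = 40.0, jitter_ms: float = 10.0) -> float:
    return base_ms + random.uniform(0.0, jitter_ms)

def fetch_items_one_by_one(n_items: int) -> float:
    """Chatty design: one round trip per item."""
    return sum(simulated_rtt_ms() for _ in range(n_items))

def fetch_items_batched(n_items: int, batch_size: int = 50) -> float:
    """Batched design: one round trip per batch of items."""
    n_batches = -(-n_items // batch_size)  # ceiling division
    return sum(simulated_rtt_ms() for _ in range(n_batches))

random.seed(0)  # deterministic for a repeatable test run
chatty_ms = fetch_items_one_by_one(200)
batched_ms = fetch_items_batched(200)
print(f"chatty: {chatty_ms:.0f} ms, batched: {batched_ms:.0f} ms")
```

Running a comparison like this in the test phase makes the latency cost of a chatty protocol visible on the LAN, before any remote user ever sees it.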
As ownership of the IT estate diversifies, ISPs, content delivery networks and cloud providers increasingly make up the chain of application delivery, meaning a perceived network problem can exist in any one of these areas. While the network can certainly be a factor in connecting to one of these providers, the provider itself should also be investigated.
Take a proactive approach
Nothing perpetuates the blame-network-first culture more than an IT operations unit that’s always reacting to events, trying to put out fires instead of anticipating and preventing them. Proactively monitoring for problems, rather than waiting for problems to find IT, allows IT operations to function more rationally and lets basic triage be completed before fingers are pointed.
Take these practical steps to become more proactive:
- Check if an upcoming application upgrade needs increased bandwidth.
- Determine if there are any planned changes to desktop environments that will generate unusual network traffic.
- Be aware of potential impacts of upcoming system or network changes, especially in light of new application architectures or cloud migration.
- Document network traffic patterns aligned with scheduled job runs, backups and others.
Another step is to use the data from network performance monitoring tools to unite teams rather than block the flow of information. That data can be used to prove innocence, but more importantly it can be leveraged to resolve issues.
Sanjit Ganguli is a research director in Gartner’s IT operations management team, focusing on network and application performance management. He will be speaking at the Gartner IT Infrastructure, Operations & Data Centre Summit in Sydney, 15-16 May 2017.