Getting to the root of the problem

Automated network management software, sophisticated stuff that promises an unprecedented ability to monitor a corporate network, is on the horizon.

But skeptical IT managers say the tools still aren't smart enough. They want artificial intelligence that can diagnose a network problem and get it right at least seven out of 10 times.

Such automation relies on event correlation and root-cause analysis tools. The concept behind the tools is simple: keep track of network devices and relationships, automatically fix minor problems and refer more complex snafus to the network manager.
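The idea of tracking devices and their relationships can be illustrated with a minimal Python sketch. It is not any vendor's method: the device names, the `upstream` dependency map and the `find_root_causes` function are all hypothetical, and the sketch assumes each device depends on at most one upstream device. An alarmed device is treated as a probable root cause only when the device it depends on is not also in alarm.

```python
def find_root_causes(upstream, alarmed):
    """Return alarmed devices whose upstream device is not itself in alarm."""
    roots = set()
    for device in alarmed:
        parent = upstream.get(device)
        # A device with no parent, or with a healthy parent, is a probable root cause.
        if parent is None or parent not in alarmed:
            roots.add(device)
    return roots


# Hypothetical topology: each device maps to the one device it depends on.
upstream = {
    "core-router": None,
    "switch-a": "core-router",
    "switch-b": "core-router",
    "server-1": "switch-a",
    "server-2": "switch-a",
    "server-3": "switch-b",
}

# The switch and both servers behind it are reporting alarms.
alarmed = {"switch-a", "server-1", "server-2"}
print(sorted(find_root_causes(upstream, alarmed)))  # prints ['switch-a']
```

The two server alarms are treated as symptoms because the switch they depend on is also down. A real network has redundant paths and many-to-many dependencies, which is part of why building the model is as laborious as the managers quoted here describe.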

Before they will buy such tools, those managers are demanding proof of better automation, bedrock interoperability and broader usefulness.

"We've looked at these tools," says Tom Revak, domain architect at pharmaceutical company GlaxoSmithKline PLC in Research Triangle Park, N.C. But "until the artificial intelligence exists that can automatically update a dynamically changing network, it's just one more pretty little map of what could be."

Historically, users have been "skeptical that software can meaningfully achieve what human expertise has achieved," says Dennis Drogseth, an analyst at IT consultancy Enterprise Management Associates Inc. in Boulder, Colo. Drogseth researched and wrote the consultancy's report on root-cause analysis and event correlation that was released in December.

Users have viewed automation for root-cause analysis and event correlation as "more work than it's worth, requiring too much labor and knowledge for rules to be appropriately defined for a specific environment," says Drogseth.

Exactly, says Kristina Victoreen, senior network engineer at the University of Pennsylvania in Philadelphia. "We tried building on the autodiscovery that Spectrum does, but we spent more time fixing what it had discovered," Victoreen says. "The guys who do [the model building] found it was quicker to build the topological model of the network by hand, which is very time-consuming." Spectrum is a management tool from Aprisma Management Technologies in Durham, N.H.

The tools "need to sense when something out of the norm occurs, such as a critical deadline that forces people to work around the clock, and normally noncritical failures become critical and require immediate response," says Revak. "If they can't do this automatically, the administrative overhead greatly outweighs the return on investment."

The moment of truth for users seems to come when the software tools "can successfully automate problem diagnostics 70 percent of the time or better," Drogseth says in the report. At that point, "users believe they are justified in the investment."

"That 70 percent mark is being met today by most of the better products," Drogseth says.

The benefits can be substantial, he says: smoother-running networks, better service-level delivery, reduced staff requirements and lower overhead. These benefits, together with advancements in the software and a reduction in the costs of deployment, are driving an increase in the use of root-cause analysis and event-correlation tools, Drogseth says in the report.

"There's no way we could manage without it," says Chris Vecchiolla, IT project manager at Royal Caribbean Cruises Ltd. in Miami. Each of Royal Caribbean's 18 ocean liners has an IT staff of two people, but most systems management is handled remotely from Miami via satellite.

Royal Caribbean uses Compaq Insight Manager and Unicenter from Islandia, N.Y.-based Computer Associates International Inc. to manage and monitor "about 170 items, such as SCSI card failure and out-of-threshold notices on servers," Vecchiolla says.

Escalating alarms notify both onboard and Miami-based IT staffers of problems. When the system detects a virus, it automatically destroys it and notifies onboard IT staff of the action via a banner on a monitor, Vecchiolla says. But should a server exceed a predetermined threshold, Miami staff could be paged to handle the problem, he says.

Because Royal Caribbean's ships cruise around the globe through every time zone, remote management from Miami sometimes occurs while onboard staffers are off-duty. When the Miami staff works on a ship's systems, "Unicenter automatically picks it up and generates a banner that goes to the onboard systems manager that tells them the date, time, workstation accessed, what was done," Vecchiolla says. "The [onboard IT staffers] like that a lot."

Drogseth says that more than half of the enterprise-level companies with which he spoke are beginning with automation's "lowest common denominator, alarm deduplication."

If a server goes down, each attempt by any user or device to access it can generate a separate alarm, which doesn't describe the root cause of the problem. Deduplication lets a network manager see a single, server-down alarm instead.
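A rough sketch of deduplication in Python, under stated assumptions: the alarm fields (`resource`, `condition`, `source`, `time`) and the `deduplicate` function are hypothetical and not taken from Spectrum or any other product. Alarms that name the same resource and condition are folded into one entry that carries a count.

```python
def deduplicate(alarms):
    """Collapse alarms sharing a (resource, condition) key into one entry with a count."""
    merged = {}
    for alarm in alarms:
        key = (alarm["resource"], alarm["condition"])
        if key not in merged:
            # Keep the first occurrence as the representative alarm.
            merged[key] = {
                "resource": alarm["resource"],
                "condition": alarm["condition"],
                "first_seen": alarm["time"],
                "count": 0,
            }
        merged[key]["count"] += 1
    return list(merged.values())


# Hypothetical raw alarms: three clients fail to reach the same server.
raw_alarms = [
    {"resource": "mail-1", "condition": "unreachable", "source": "pc-17", "time": 100},
    {"resource": "mail-1", "condition": "unreachable", "source": "pc-42", "time": 101},
    {"resource": "db-2", "condition": "disk-threshold", "source": "agent", "time": 102},
    {"resource": "mail-1", "condition": "unreachable", "source": "pc-08", "time": 103},
]

for entry in deduplicate(raw_alarms):
    print(entry["resource"], entry["condition"], entry["count"])
```

Here the three "unreachable" alarms for `mail-1` become a single line with a count of 3. Choosing the key is the real design decision: too coarse a key hides distinct problems, and too fine a key leaves the flood in place.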

A Kansas City, Mo.-based unit of Boston-based financial services company State Street Corp. isn't doing root-cause analysis, but it does use Spectrum for alarm deduplication, says David Lembke, State Street's network services manager.

The University of Pennsylvania has been using Spectrum for five years to reduce the number of alarms reported by a single event, Victoreen says. "It works reasonably well, assuming we've built the model correctly," she says. But "it turns out [that] a big map with red dots flashing to show alarms is not that useful for us."

Drogseth says that in his interviews with 40 midsize to large companies, most IT managers said they know they must start automating, because networks have grown too large and complex to manage without automation tools.

Though Revak is skeptical, "that's not to say we're not interested," he says.

"We're rethinking trying to model all of our networks and maybe moving to trap aggregators or event correlation engines," Victoreen says.

IT managers are looking beyond the network focus that most vendors have stressed and are seeing extended uses for the tools, such as to support performance, help desk functions, inventory and asset management, change management, and security. Not all tools support all such extensions, Drogseth says.

Vendors of most of the tools also claim some kind of predictive capabilities.

A network that learns over time can not only help prevent problems, but it can also increase job satisfaction by releasing IT staffers from grunt work and calling on them only for more difficult questions.

But where most such artificial intelligence efforts fall short is in detecting subtle changes, Revak says. "Through repeated small changes, the norm [can] shift very near the failure point, setting up a significant failure situation for the next small deviation from the newly established norm," he says.
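Revak's point about a shifting norm can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the readings, the 15 percent tolerance, the smoothing factor and both function names are hypothetical. The first check compares each reading with a moving baseline that adapts after every sample. The second compares each reading with a fixed failure point.

```python
def adaptive_alerts(readings, tolerance=0.15, alpha=0.5):
    """Return indexes of readings that deviate from a moving baseline by more than tolerance."""
    baseline = readings[0]
    alerts = []
    for i, value in enumerate(readings[1:], start=1):
        if abs(value - baseline) > tolerance * baseline:
            alerts.append(i)
        # The baseline drifts toward each new reading, so the "norm" keeps moving.
        baseline = alpha * value + (1 - alpha) * baseline
    return alerts


def headroom_alerts(readings, failure_point, min_headroom=0.15):
    """Return indexes of readings that leave less than min_headroom before the failure point."""
    limit = min_headroom * failure_point
    return [i for i, value in enumerate(readings) if failure_point - value < limit]


# Hypothetical utilization that creeps up three points per sample: 50, 53, ..., 95.
readings = list(range(50, 96, 3))

print(adaptive_alerts(readings))       # prints []
print(headroom_alerts(readings, 100))  # prints [12, 13, 14, 15]
```

Because each step is small relative to the drifting baseline, the adaptive check stays silent all the way to 95 percent, while the fixed-limit check starts flagging once less than 15 percent of headroom remains. A tool that relies only on a learned norm would need a second check of this kind to catch the situation Revak describes.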

Predictive capabilities vary greatly, and not all are based on sophisticated artificial intelligence techniques, Drogseth says.

At the bottom of the range is basic linear trending. An algorithm can determine how long it will take a server to reach capacity if usage increases by, say, 20 percent per month.
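As a minimal sketch of linear trending, assuming monthly usage samples in arbitrary units, the Python below fits a least-squares line and extrapolates it to a capacity limit. The function name and the sample figures are hypothetical, and it needs at least two samples. A fixed percentage increase per month would compound rather than grow in a straight line, so this sketch models the simpler straight-line case.

```python
def months_until_capacity(usage, capacity):
    """Fit a least-squares line to monthly samples; return months from the last sample to capacity.

    Returns None when usage is flat or falling, since the line never reaches capacity.
    """
    n = len(usage)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_month = (capacity - intercept) / slope
    # Measure from the most recent sample, and never report a negative wait.
    return max(0.0, crossing_month - (n - 1))


# Hypothetical disk usage in gigabytes over four months, on a 100 GB volume.
print(months_until_capacity([40, 44, 48, 52], 100))  # prints 12.0
```

With usage rising 4 GB a month from 52 GB, the fitted line reaches 100 GB twelve months after the last sample. The estimate is only as good as the assumption that the trend stays linear, which is the limitation the more elaborate tools claim to address.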

At the other end are sophisticated tools like CA's neural network-based Neugents, Drogseth says.

A Neugent can look at historical data about network resource usage and a company's business cycle, says a CA spokeswoman. By aggregating and correlating data on network infrastructure and business relationships, the Neugent might predict that a server would reach capacity in six weeks but drop back to 30 percent in the seventh week, she says.

Royal Caribbean plans to implement CA's Neugent for Windows NT networks, which "will take us to another level of management," Vecchiolla says.

Before the root-cause analysis industry achieves that new level of management, however, it must hurdle the stumbling block of standards, Drogseth says.

Part of why the University of Pennsylvania isn't "getting as much back as we'd hoped for is that we have a lot of different software from different vendors, and they have a lot of different proprietary schemes and interfaces," Victoreen says.

"The systems management industry must develop the standards and interoperability capabilities required for the tools in the event, problem, change, configuration, inventory, workload management, capacity [planning], performance [monitoring] and security areas to work together," Revak says. "Each of these disciplines contains some part of the overall equation."

Root-cause analysis and event-correlation tools aren't layered onto a network so much as they are woven into its fabric, Drogseth says.

De facto standards such as Java, HTML and XML help provide a cooperative interface between different vendors' products.

But true interoperability demands a common thread, "a standard structure of data maintained in the object store," which is a database of network devices, applications and relationships, Drogseth says.

"At its most esoteric, standards refers to that platonically perfect state that never gets achieved," he says. "What we're seeing is some adoption of some standards by some vendors."

Users should look for vendor partnerships to ease the deployment and management of root-cause tools, he says.

That's easier now than several years ago, when one New York-based financial services firm began doing root-cause analysis. "We had to build a lot of things ourselves because they weren't available at the time," says the firm's IT vice president, Gary Butler.

"We're using the [System Management ARTS Inc.] correlation engine, and we're feeding it with data from Tibco's smart agents" and San Francisco-based Micromuse Inc.'s NetCool presentation software, says Butler. "We can't always find the root cause 100 percent of the time, but we can at least find the more serious event, and that keeps us from wasting time with all the symptoms."

Revak says, "As the industry matures, the best bet is for companies to focus on developing their event infrastructure technology (a prerequisite for any advanced management), their people and their processes. Technology is not the most important. [Vendors] dislike it when I say this, but most important are the people and the processes" and the relationship between them.

Glossary

- Advanced correlative intelligence: A problem-isolation method cloaked in secrecy by most root-cause analysis tool vendors. This is where language is most likely to become obscure or insubstantial.

- Event correlation: Examines the relationship among events across an IT infrastructure to narrow the search for the cause of a problem.

- Object data store: Knowledge specific to devices, applications and connections that provides a database of codified detail for understanding objects and their relationships. An extensive object data store can contain object performance data for use in modeling routine interactions across device types such as servers and routers.

- Polling and instrumentation: Provide ongoing event data about infrastructure availability, performance and topology. They can include common availability metrics, as well as CPU utilization or even remote monitoring.

- Presentation and context: Encompass issues around what you see, how it looks and what it tells you. No matter how detailed the reporting, unless it's presented in a way that suggests a solution, it's just so much noise.

- Root-cause analysis: Isolates the cause of failure or poor performance.

- Topology: The map of where things are. It can detail both the physical (Layer 2) and logical (Layer 3) network, and move on up the Open Systems Interconnection stack to include configuration information relevant to systems and applications.