Root cause analysis (RCA) has been around as a technology for some time, but it has gained popularity of late thanks to an increase in the number of managed devices in data centers. Imagine a data center where 20,000 switch ports generate events and warnings every minute. At that volume, it is almost impossible to identify the underlying problem by hand. This is where RCA comes in as an ally.
In order to deliver high levels of IT infrastructure availability, organizations need tools that help them isolate these problems. For example, isolating problems takes up 65% of an administrator's time. RCA helps reduce this.
It's common to find that multiple events and alarms are generated during a device failure, not just from the malfunctioning network device but also from the devices attached to it. For example, if a switch generates an alarm that a particular port and card have gone down, then the servers attached to that switch also generate alarms. This is where RCA and event correlation come in: they relate the server alarms back to the switch and identify the failed port and card as the actual problem.
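The switch-and-server example can be sketched in a few lines of Python. This is a minimal illustration, not how any particular RCA product works: the element names, the `depends_on` map and the `find_root_causes` function are all hypothetical, and the rule it applies (an alarmed element is a root cause only if nothing it directly depends on is also alarmed) is a deliberate simplification.

```python
def find_root_causes(alarmed, depends_on):
    """Return alarmed elements whose direct upstream dependencies are all quiet.

    alarmed    -- iterable of element names currently raising alarms
    depends_on -- dict mapping an element to the elements it is attached to
    """
    alarmed = set(alarmed)
    roots = set()
    for element in alarmed:
        upstream = depends_on.get(element, [])
        # If something this element depends on is also alarmed, treat this
        # alarm as a symptom rather than a cause.
        if not any(parent in alarmed for parent in upstream):
            roots.add(element)
    return sorted(roots)


# Hypothetical topology: two servers attached to one switch.
depends_on = {
    "server-a": ["switch-1"],
    "server-b": ["switch-1"],
    "switch-1": [],
}

alarms = ["server-a", "switch-1", "server-b"]
print(find_root_causes(alarms, depends_on))  # ['switch-1']
```

Three alarms collapse into one probable cause. A real correlation engine would also need to handle indirect dependencies, timing and partial failures, which this sketch ignores.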
RCA is very useful for managing SLAs between the IT team and business users. By shortening the time needed to isolate problems, RCA helps deliver higher levels of IT infrastructure availability.
To carry out a successful RCA exercise, it is wise to follow certain steps. The first step is to perform detailed mapping. This should be supplemented by selection of the right management tool.
Mapping: In order to conduct effective RCA, you must identify what your organization wants to manage. You can achieve this by creating a managed object definition language (MODL) model.
The next step is to create a topology of the data center to be managed, which encompasses all the managed elements. These may include routers, switches, servers and storage equipment, or particular applications that run on this infrastructure. Thus, a map is created of the data center and infrastructure network, along with the relationships between elements that correlation relies on.
To explain this mapping, let's take the example of a core router in the centralized data center. Typically, multiple access routers connect to this core router, so the connectivity between the network's core routers and access routers will be shown in the mapping model. These router-based networks will be connected to switching devices at each of the locations. Different servers will be connected to the switches, and different applications will be installed on these servers. Thus, a map of this information is created, which is known as the topology.
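As a rough illustration of such a topology, the hierarchy described above can be held as a simple adjacency map. The sketch below is an assumption-laden toy, not a MODL model: the element names, the `topology` dictionary and the `downstream` helper are all invented for the example.

```python
from collections import deque

# Hypothetical topology: each element maps to the elements attached below it.
topology = {
    "core-router": ["access-router-1", "access-router-2"],
    "access-router-1": ["switch-1"],
    "access-router-2": ["switch-2"],
    "switch-1": ["server-a", "server-b"],
    "switch-2": ["server-c"],
    "server-a": ["app-billing"],
    "server-b": [],
    "server-c": ["app-mail"],
}


def downstream(topology, start):
    """List every element reachable below `start`, in breadth-first order."""
    seen = []
    queue = deque(topology.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(topology.get(node, []))
    return seen


print(downstream(topology, "switch-1"))
# ['server-a', 'server-b', 'app-billing']
```

With the map in place, a question such as "what is affected if switch-1 fails?" becomes a simple traversal, which is the kind of relationship data that correlation depends on.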
Select the right network management tool: Always ensure that your selected network management tool can auto-discover the topology or deployment scenario within your organization. It should be dynamic enough to cope with the attendant changes.
Data centers are now in a constant state of flux, since infrastructure continually changes to meet business needs. The tool should be able to discover all infrastructure elements the first time. It should then keep the topology updated on a periodic basis (or as and when required) so that it remains an exact replica of what has been deployed in the data center.
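Keeping the topology current amounts to comparing what was recorded earlier with what the latest discovery run finds. A minimal sketch of that comparison follows; the `diff_topology` function and the element names are hypothetical, and real tools would compare connections and attributes as well as element names.

```python
def diff_topology(stored, discovered):
    """Compare the recorded element list with a fresh discovery result."""
    stored, discovered = set(stored), set(discovered)
    return {
        "added": sorted(discovered - stored),    # deployed but not yet recorded
        "removed": sorted(stored - discovered),  # recorded but no longer found
    }


# Hypothetical example: one server was retired and another was added.
stored = ["switch-1", "server-a", "server-b"]
discovered = ["switch-1", "server-a", "server-c"]

changes = diff_topology(stored, discovered)
print(changes)  # {'added': ['server-c'], 'removed': ['server-b']}
```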
After the network management tool rollout: While RCA can achieve many wonderful things, it also introduces overhead into a network. This overhead is created during discovery of the network elements that have to be managed. To counter it, it's essential that you schedule the discovery timing so that you do not overload the network with discovery queries. A boundary may also be created, so that only a certain area needs to be discovered. That is one way of reducing the discovery time.
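The two ideas in this step, a discovery boundary and scheduled discovery, can be sketched as follows. This is an illustration only: the addresses, the `10.1.0.0/16` boundary, the batch size and the `in_boundary` and `plan_discovery` helpers are assumptions made for the example, and actual tools expose their own scoping and scheduling settings.

```python
import ipaddress


def in_boundary(address, boundary):
    """True if the address falls inside the discovery boundary (a CIDR range)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(boundary)


def plan_discovery(addresses, boundary, batch_size):
    """Keep only in-boundary addresses and split them into small batches.

    Each batch could then be run in its own scheduled slot, for example
    during off-peak hours, instead of querying everything at once.
    """
    scoped = [a for a in addresses if in_boundary(a, boundary)]
    return [scoped[i:i + batch_size] for i in range(0, len(scoped), batch_size)]


# Hypothetical candidate addresses; only the 10.1.0.0/16 area is in scope.
candidates = ["10.1.0.5", "10.1.0.9", "10.2.0.7", "10.1.3.20"]
batches = plan_discovery(candidates, "10.1.0.0/16", batch_size=2)
print(batches)  # [['10.1.0.5', '10.1.0.9'], ['10.1.3.20']]
```

The out-of-boundary address is never queried, and the remaining work is spread over two smaller runs rather than one large one.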
About the author
The director of sales for Ionix in India, Rajesh Awasthi is responsible for EMC's Ionix software business in India and the SAARC region. He has worked on sales, business development, consulting, partnership management, alliance management and product management for telecom software, data network, IT infrastructure, disaster recovery and business continuity solutions. Awasthi has worked on quality process building and implementation for increasing efficiency and has also won several awards.
(As told to Jasmine Desai.)