Your system administrators are looking at their screens, and all seems well. A sea of green denotes that every monitored system is performing optimally, and the system administrators’ thoughts are turning to the weekend and a 48-hour splurge of online gaming or code debugging.
On the helpdesk meanwhile, the screams of anguished users can be heard, bemoaning the fact that their productivity has plummeted – due to poor access to or poor performance of the IT platform.
Somehow, there has been a serious disconnect between the system administrators’ metrics and the users’ experience. From the IT department’s point of view, everything is running well – but are they looking at the wrong things, or perhaps not looking at enough of the right things? For the user, identifying blame is easy: it’s IT’s fault – but that is usually a gross, and often unjust, simplification.
Just how “healthy” is your datacentre? Is it doing what the business requires of it, or is it part of the problem? To answer this key question, datacentre managers must first identify what it is, exactly, that they need to measure.
To check the datacentre’s health and to minimise downtime, there is a need for multiple levels of measurement – from the highly granular, equipment-based monitoring and reporting, to the outside-in monitoring and reporting from the user’s viewpoint.
Checking if the IT equipment is in top shape
At the datacentre IT equipment level, monitoring the state and performance of equipment is no longer enough. Simply reacting to problems stores up trouble: an “N+1” redundancy approach (having one more item of equipment than is truly needed) turns into an “N” approach if an item fails, and if a second one then fails while the first is being fixed, IT disaster strikes.
It is far more strategic to use a predictive approach: monitor the temperature of key components such as central processing units (CPUs) and disk drives, monitor the power draw to see whether it changes suddenly and unexpectedly or is trending upwards, and replace the system before it fails.
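As a minimal sketch of the power-draw check described above, the following Python flags both a sudden step between consecutive readings and a sustained upward trend. The function names and the thresholds (`jump_watts`, `trend_watts`) are illustrative assumptions, not values taken from any particular monitoring product.

```python
from statistics import mean


def slope_per_sample(readings):
    """Least-squares slope of the readings against their sample index."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    x_mean = (n - 1) / 2
    y_mean = mean(readings)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(readings))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def power_alerts(readings, jump_watts=50.0, trend_watts=2.0):
    """Return alert labels for a series of power readings in watts.

    Both thresholds are hypothetical and would need tuning per device.
    """
    alerts = []
    # Sudden change: any step between consecutive samples above the threshold.
    if any(abs(b - a) > jump_watts for a, b in zip(readings, readings[1:])):
        alerts.append("sudden change")
    # Upward trend: fitted slope above the per-sample threshold.
    if slope_per_sample(readings) > trend_watts:
        alerts.append("upward trend")
    return alerts
```

A steadily climbing series such as `[300, 303, 306, 309, 312]` raises only the trend alert, which is the case a purely reactive dashboard would show as green.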
Datacentre managers should also understand that, during a replacement, an N+1 approach no longer provides any redundancy – so they should either go for an N+2 (or greater) approach, or ensure that key components are easily accessible so that replacement can be carried out rapidly. This will help to minimise the time during which redundancy is not in place.
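The redundancy arithmetic can be made concrete with a small sketch. The function below is a hypothetical helper: it takes the number of units installed, the number the load truly requires (N) and the number currently failed or out for replacement, and reports what redundancy is left.

```python
def redundancy_status(installed, required, failed):
    """Describe remaining redundancy for a pool of identical units.

    installed: units deployed; required: units needed to carry the load (N);
    failed: units currently failed or out for replacement.
    """
    working = installed - failed
    spares = working - required
    if spares < 0:
        return "outage"
    if spares == 0:
        return "no redundancy"
    return f"N+{spares}"
```

With four units required, five installed (N+1) drops to "no redundancy" after a single failure, while six installed (N+2) still reports "N+1" during the repair.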
Next is the environmental health of the datacentre. The use of monitoring tools for overall temperature, smoke and humidity, along with infrared heat sensors, will allow problems to be detected before they become issues. By linking these to the equipment monitoring systems, IT will be able to connect the presence of a datacentre hot spot (indicated by an infrared sensor) with a specific piece of equipment which can be swapped out or shut down to prevent the problem getting out of hand.
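One way to picture this linkage is a lookup from the zone reported by an infrared sensor to the equipment records held by the equipment monitoring system. The sketch below assumes each record is a dictionary with hypothetical `name`, `zone` and `cpu_temp_c` fields, and that 80°C is an acceptable illustrative limit; real tools will expose this data differently.

```python
def suspects_for_hot_spot(hot_zone, equipment, cpu_limit_c=80.0):
    """List equipment in the hot zone at or above the limit, hottest first.

    equipment: iterable of dicts with 'name', 'zone' and 'cpu_temp_c' keys
    (an assumed record layout for illustration).
    """
    in_zone = [e for e in equipment if e["zone"] == hot_zone]
    hot = [e for e in in_zone if e["cpu_temp_c"] >= cpu_limit_c]
    hot.sort(key=lambda e: e["cpu_temp_c"], reverse=True)
    return [e["name"] for e in hot]
```

The returned names are candidates for being swapped out or shut down, not a diagnosis; a hot spot may equally point to an airflow problem in that zone.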
The broader facility and its equipment also need to be monitored and assessed to maintain a datacentre in good health. Although facilities management may be using a building information modelling (BIM) tool, this will generally not be integrated into IT’s systems management tools.
Using DCIM tools and modular systems
The use of a datacentre infrastructure management (DCIM) suite may pull everything together, but that alone will not suffice. In addition to implementing DCIM tools, the health of the facility’s power distribution, uninterruptible power supplies (UPSs), auxiliary generators and cooling systems has to be linked in to the overall view of how the datacentre is performing.
By using modular systems throughout the datacentre infrastructure, from the IT equipment to the facility support equipment, failures of individual pieces of equipment can be allowed for.
Where possible, datacentre teams should use load balancing capabilities – for example, using intelligent virtualisation of servers, storage and networking equipment and intelligent workload management modes in UPSs and generators – to provide the maximum levels of business continuity.
Load balancing will provide much higher levels of availability than a direct, simple N+1 approach, as the failure of even two or more items can still be dealt with, even if the application performance is affected.
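The difference can be sketched numerically. Assuming a pool of identical units whose load is shared evenly (a simplification, with hypothetical parameter names), the fraction of demand that can still be served after a number of failures is:

```python
def served_fraction(units, unit_capacity, demand, failed):
    """Fraction of demand a load-balanced pool can serve after failures.

    Assumes identical units and perfectly even load sharing, which real
    load balancers only approximate.
    """
    working = max(units - failed, 0)
    if demand <= 0:
        return 1.0
    return min(1.0, working * unit_capacity / demand)
```

With five units of capacity 100 serving a demand of 400, one failure still leaves the full demand served, and two failures leave 75% served: performance is degraded, but the service stays up.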
Taking not just an IT but a business approach to datacentre health
But assessing the datacentre infrastructure’s health is not just an IT question; it is a business question.
The measures discussed above deal with the datacentre itself, and organisations that already have them in place may see that as being enough.
The problem is that the screens that the system administrators are looking at are generally part of that datacentre environment, and are connected to the systems through datacentre networks at datacentre speeds. So it is hardly surprising that everything looks as if it is working well while the helpdesk goes into meltdown.
The datacentre is generally connected to the rest of the organisation through local and wide area networks (LANs and WANs). Users access the datacentre through these networks. If there are problems anywhere along these connections, the user will have a poor experience – and will contact the helpdesk, often with the perception that it is an application or a datacentre problem, rather than a network one.
It is incumbent on the datacentre manager, therefore, to be able to monitor connectivity across the different types of network to ensure that the datacentre serves the business users’ IT needs.
Another challenge is that many of today’s workers are mobile and/or working remotely and will therefore be accessing the company’s datacentre services through public connectivity: ADSL lines, Wi-Fi or mobile wireless networks. Measuring network performance across these less predictable connections can be problematic, but giving the helpdesk tools to “ping” the user’s device and see whether latency, jitter or packet loss are causing issues will help IT to identify the root cause of any issue.
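To illustrate what such a tool would report, the sketch below summarises a series of round-trip times from repeated probes, with `None` marking a probe that received no reply. It does not send any probes itself, and it uses one simple definition of jitter (the mean absolute difference between consecutive replies) as an assumption; other definitions exist.

```python
from statistics import mean


def link_stats(rtts_ms):
    """Summarise probe results given round-trip times in milliseconds.

    rtts_ms: list of floats, with None for each probe that got no reply.
    Returns packet loss (%), mean latency (ms) and a simple jitter figure (ms).
    """
    if not rtts_ms:
        raise ValueError("no probe results supplied")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    latency_ms = mean(received) if received else None
    # Jitter here: mean absolute change between consecutive replies.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(diffs) if diffs else None
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}
```

For example, `link_stats([20.0, 30.0, None, 25.0])` reports 25% loss, 25 ms mean latency and 7.5 ms jitter, figures the helpdesk can compare against what the application needs.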
The final area is around the user’s device itself. A PC may have a disk drive that is full, a tablet may have a process that has hung at 100% CPU utilisation, or a virus may be affecting overall performance. Putting in place tools that enable the endpoint device to be monitored and fixed – automatically wherever possible, or through efficient and effective means via the helpdesk where automation cannot be used – will again make root cause analysis easier.
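As one small example of an endpoint check, the full-disk case can be detected with Python's standard `shutil.disk_usage`. The 90% threshold and the function names are illustrative assumptions; a real endpoint management tool would check far more than this.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Percentage of a volume's capacity that is in use."""
    return 100.0 * used_bytes / total_bytes


def disk_nearly_full(path=".", threshold_pct=90.0):
    """True if the volume holding `path` is at or above the threshold.

    The default threshold is a hypothetical value for illustration.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold_pct
```

A check like this could run on the device and raise a helpdesk ticket, or trigger an automatic clean-up, before the user notices a slowdown.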
The use of human mean opinion scoring (MOS) systems can also help. Rather than depending on a technical measurement, where a pseudo-transaction’s performance is compared against an old service level agreement (SLA) and returns a green signal, asking real users about their experience will be far more illuminating.
If users find performance to be too slow, it is no use pointing to the SLA and saying that it is within agreed limits – if the perception is that it is too slow, then it is down to IT to see if performance can be improved.
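A minimal sketch of this comparison follows, assuming users rate their experience on the usual 1 to 5 opinion scale and that a mean score below 3.5 counts as unacceptable. Both the threshold and the function names are assumptions for illustration.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of user ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


def perception_gap(sla_met, ratings, acceptable_mos=3.5):
    """True when the SLA shows green but users rate the service poorly."""
    return sla_met and mean_opinion_score(ratings) < acceptable_mos
```

A result of `True` is the situation described above: the SLA is met, yet users perceive the service as too slow, so IT still has work to do.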
Monitoring the health of the datacentre is like monitoring the health of a person: focusing on just one area can mean missing where the real issue is and thereby lead to the failure to properly treat the problem.
Determining the set of measurements that will provide an accurate assessment of datacentre health in terms of business performance requires a holistic approach.
Any datacentre manager looking to fully support the business should take an end-to-end systems management approach, so ensuring that the organisation is working against a healthy IT platform – not just a healthy datacentre.
Clive Longbottom is service director at analyst Quocirca. The datacentre consultancy firm has three papers covering IT lifecycle management (ITLM) and IT financing available for free download: Using ICT financing for strategic gain; Don’t sweat assets, liberate them; and De-risking IT lifecycle management.
This was first published in July 2013