That is the daily reality for Neil Forster, IT director at Attenda Ltd., a managed services company with data centres in the UK and Germany. And with business set to expand, Forster needs to find new ways of delivering a high level of service without adding more people to the support desk, which means introducing greater automation wherever possible.
Like many companies that run a large number of applications, Attenda has adopted VMware to make better use of its hardware resources. About 20% of its server estate currently runs in virtualised mode, and the company plans to raise that figure to 80% when a new data centre opens later this year.
To achieve greater automation, Forster has recently begun using the new Aegis system from Texas-based security company NetIQ Corp. Although the implementation is still in its early stages, he says the technology is already helping to automate many of the repetitive tasks that can take up the support team's time.
"When we started looking at automation tools about 18 months ago, we couldn't see anything that suited our exact needs, both in terms of agility and ability to integrate with our own systems," Forster said.
The 10-year-old company is a long-term user of the Information Technology Infrastructure Library (ITIL) -- a globally recognised collection of best practices -- for its IT service management. Attenda also uses a lot of homegrown software to support its operations, including its own configuration management database (CMDB).
"In 2001, when we first adopted ITIL, no one was producing CMDBs, so we built our own, as well as a tool to manage alerts," said Forster. "Our own CMDB has matured really well and is an essential part of our business." The CMDB is a fundamental component of ITIL; it collects data about all the devices and services that Attenda manages and stores it in a central repository.
Any new automation tool would need to integrate with the CMDB, since the Attenda environment is highly fluid, with servers constantly being brought into and out of production. The company was already a user of NetIQ's AppManager performance management system, so when NetIQ came to discuss its upcoming IT process automation product, called Aegis, Forster was pleased to see that it had many of the features he had been looking for. "Aegis has an open architecture, which meant we could integrate it not only with AppManager, but also with the tools we had built in-house -- our CMDB and alert management tool."
The implementation began in April, and Forster says he is already beginning to see some processes being automated.
"Our alert management system collects and correlates alerts from about a dozen different point-solution monitoring tools. Aegis can talk directly to the alert manager, rather than having to move everything into one console," Forster said.
"We can use Aegis to respond to and correlate alerts from a content switch, servers, URL monitoring tools and so on. So we can build quite sophisticated rules which tell it when to kick off a particular workflow," he said.
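The kind of correlation rule Forster describes can be illustrated with a minimal Python sketch. It is not Aegis's rule language or API; the `Alert` fields, the source names and the `should_trigger` function are all hypothetical, and the rule shown (every named source must raise an alert for the same service within a time window) is only one assumed way such a rule might be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # hypothetical source name, e.g. "content_switch"
    service: str      # the client service the alert refers to
    timestamp: float  # seconds since some reference point


def should_trigger(alerts, service, required_sources, window_seconds):
    """Return True if every required source alerted on `service`
    within a single window of `window_seconds`."""
    required = set(required_sources)
    relevant = sorted(
        (a for a in alerts if a.service == service and a.source in required),
        key=lambda a: a.timestamp,
    )
    # Slide a window forward from each alert and collect the sources seen.
    for i, first in enumerate(relevant):
        seen = set()
        for alert in relevant[i:]:
            if alert.timestamp - first.timestamp > window_seconds:
                break
            seen.add(alert.source)
        if seen == required:
            return True
    return False
```

A caller would start a workflow only when `should_trigger` returns True, so that a single isolated alert from one monitoring tool does not set off an automated response on its own.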
Aegis is also helping to automate many everyday processes. "Every client has an operations manual containing detailed responses to particular alerts," Forster said. "Before we used Aegis, we had to do all of this manually. Every time a support person got a particular alert for a client, they had to go to the manual and look in the 'known errors' section. If it was a known error, then they had to follow the response."
The new tool is now starting to lift the burden of much of that work. "Once we have identified a problem, and a workaround for that problem, we aim to automate the workaround so that we don't have engineers doing it every time. It may be something as simple as restarting a server. In another case, it could be a time-out on a Web transaction monitor, where our engineers have to go and make five different checks -- any of which will require a different response. With Aegis, we can automate the checks and the responses."
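The time-out case, where several checks each lead to a different response, can be sketched as an ordered list of checks. This is an illustrative Python outline only: the check names, the response names and the `diagnose` function are assumptions made for the example, not taken from Attenda's operations manuals or from Aegis.

```python
def diagnose(checks, context):
    """Run checks in order and return (check_name, response) for the
    first check that fails. If every check passes, hand over to a person."""
    for name, check, response in checks:
        if not check(context):
            return name, response
    return None, "escalate_to_engineer"


# Hypothetical checks for a Web transaction time-out. Each entry is
# (name, function returning True when healthy, response if it fails).
TIMEOUT_CHECKS = [
    ("dns_resolves", lambda ctx: ctx["dns_ok"], "flush_dns_cache"),
    ("web_service_running", lambda ctx: ctx["web_service_up"], "restart_web_service"),
    ("database_reachable", lambda ctx: ctx["db_ok"], "fail_over_database"),
]
```

Calling `diagnose(TIMEOUT_CHECKS, context)` with a dictionary of collected facts returns the first failing check and its mapped response. When nothing fails, the sketch deliberately falls back to a human rather than guessing at a fix.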
One of the most common problems is disk space filling up, which normally requires someone to check whether files can be deleted, and if so, which ones. Now Aegis analyses which folders are using disk space and sends that information back to the administrator. "In some cases, we will automate the fix fully, and in other cases we will collect the data and pass it back to a human being," Forster said.
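The data-gathering half of the disk space task, working out which folders are using the space, can be approximated with Python's standard library. This is a rough sketch under stated assumptions rather than a description of how Aegis does it: the function names are invented for the example, and it reports only the immediate subfolders of a given root.

```python
import os


def folder_sizes(root):
    """Return {subfolder name: total bytes of the files beneath it}
    for each immediate subfolder of `root`."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue  # skip loose files and symlinks at the top level
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; ignore it
            sizes[entry.name] = total
    return sizes


def largest_folders(root, limit=5):
    """Return the `limit` largest subfolders as (name, bytes), biggest first."""
    ranked = sorted(folder_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

The output of `largest_folders` is the sort of summary that could be passed back to an administrator, who then decides what can safely be deleted.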
Forster said the benefits will come gradually as his team builds workflow models for each process in Aegis. "It is early days. We have more than 300 known errors in our operations manuals, and ideally we'd like to automate all of them. We are focusing on the top 10%, starting with global problems such as disk space running out, which can occur on any server. They are the low-hanging fruit. Then we will move on to more customer-specific processes."
The business benefit, he said, is that much of the work becomes deskilled and can be pushed from third-line support down to second or even first-line support staff. That shift will allow skilled engineers to focus on more difficult problems, and let the business grow without an increase in the number of support staff.
"With our projected growth, if we can keep the headcount steady over the next two years, we will be very happy," he said.