By any measure, server virtualisation has been an amazing success. In 10 short years, virtualisation has moved from a desktop product to datacentre mainstream.
VMware has defined the market with its vSphere hypervisor and product ecosystem, followed closely by Microsoft Hyper-V and open source KVM and Xen.
Virtualisation provides the basis for cloud (private and public), and in terms of penetration around 80% of all datacentre workloads are virtualised, with virtual servers being the default option for new deployments.
Ease of use of virtualisation is a real boon. A physical server costs money, has to be budgeted, and takes time and effort to install and configure. By comparison, spinning up a virtual machine has only a marginal cost (once the hardware is deployed), making it much easier to create virtual machines on demand, with little fuss or effort.
However, as the ability to spin up a virtual machine to test a new operating system (OS) release, upgrade software or develop new software functionality becomes radically easier, so we introduce risk into the deployment model. That risk comes in the form of something called VM sprawl.
Sprawl in IT has been around for years. As soon as a technology becomes cheap and easy to use, sprawl tends to follow.
Think of how something as innocuous as disk space and file shares has changed over the years. When this technology was first introduced, disk space was relatively expensive, and so time and effort were put into curating and managing data content. This was even more the case in the mainframe days, when every megabyte allocation would be validated and accounted for.
Today, file serving is ubiquitous and made simple through the use of Dropbox and other similar file-sharing services. The result is that all of us have gigabytes of data we don’t need, and in most cases we have no idea what is being kept. Virtual machines are subject to the same problem in the form of VM sprawl – the uncontrolled creation of virtual machines in a virtual server environment.
What causes VM sprawl? There are obvious causes, such as a lack of management or control around the virtual machine creation process, but there are also technical reasons why it might occur.
Problems include disassociation of the virtual machine from the host inventory; this can happen if a virtual machine is deliberately or accidentally removed from the hypervisor inventory, which is easily achieved through VMware vSphere vCenter, for example.
VMs may be deliberately removed to move them to another host cluster, and in large environments with thousands of VMs, the risk of miscataloguing some of them is quite high.
Virtual machine fragmentation can also occur. This can happen when VMs are moved regularly around the storage environment and for some reason the copy process fails, leaving multiple VMDKs on disk in a partially copied state. The move may eventually complete, but old files are left on disk.
Finally, there are snapshots. Creating many snapshots across virtual machines can result in significant additional use of disk capacity.
So why should we care about VM sprawl? After all, disk space is cheap and plentiful today, so what’s the problem? Well, there are a number of issues:
Cost: Disk space may be cheap, but it’s not free. VMs that are powered up also consume CPU time and memory. In addition, these virtual machines could be incurring licence costs in the form of backup agent, OS and database licences. Where increased storage utilisation requires a capacity upgrade, space occupied by unused virtual machines could result in unnecessary additional purchases.
Management: More VMs means more management work for administrators. Storage administrators have to provision more disk space, while VM administrators have to manage more virtual machines and juggle physical resources.
Risk: Without adequate tracking of VMs, it can be difficult to determine which VMs form part of production systems. This is especially true when the number of VMs gets past what a single person can remember, or, even worse, when that person leaves the business. Imagine a disaster recovery scenario where administrators and business owners have to spend precious recovery time trying to work out whether a particular virtual machine is critical to running the business.
Of course, tackling VM sprawl reduces cost, management overhead and risk, but how exactly can these issues affect the work of the storage administrator? Here are a few thoughts:
- Lack of disk space: VMs can be big users of disk space. A Windows virtual machine can take upwards of 40GB if not thin provisioned, making it easy to quickly exhaust resources. VM sprawl can mask the true utilisation, resulting in unnecessary disk purchases and all the planning and deployment work that goes into that process. This can be exacerbated by heavy use of snapshots.
- Impact on performance: Running many VMs can impose a significant performance overhead on storage, especially when VMs have scheduled or automated tasks running, such as virus scanning or defragmentation. Moving VMs around an infrastructure on a frequent basis, simply to juggle storage capacity, also uses precious I/O cycles that could be better used delivering I/O to the host.
- Management: The management overhead can be significant for the storage administrator, including having to create additional datastores or volumes, rebalancing storage for capacity and performance, and ensuring all VMs are backed up in a timely fashion. Lack of storage may also result in excessive time wasted on clean-up tasks like recovering disk space.
- Data protection: Users expect all their VMs to be backed up. So, uncontrolled growth in primary disk utilisation means backup systems have to scale too and more pressure is put on the backup administrator to ensure everything gets safely backed up within agreed service levels. This can also have a direct effect on cost and wasted resources due to repeatedly backing up the same idle virtual machines. In a disaster scenario, the recovery process may be delayed by restoring virtual machines that have no practical production use.
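To put rough numbers on the disk-space point, the short Python sketch below estimates how much primary capacity a set of idle VMs might hold. The 40GB per-VM figure comes from the text above; the VM count and the snapshot overhead are invented inputs for illustration, not measurements.

```python
def sprawl_capacity_gb(idle_vms, gb_per_vm=40, snapshot_overhead=0.0):
    """Estimate primary capacity (in GB) held by idle VMs.

    gb_per_vm defaults to the 40GB thick-provisioned figure quoted above.
    snapshot_overhead is a hypothetical fraction (0.25 means 25% extra space).
    """
    if idle_vms < 0 or gb_per_vm < 0 or snapshot_overhead < 0:
        raise ValueError("inputs must be non-negative")
    return idle_vms * gb_per_vm * (1 + snapshot_overhead)


# Illustrative only: 150 idle VMs carrying 25% extra space in snapshots.
print(sprawl_capacity_gb(150, snapshot_overhead=0.25))  # 7500.0
```

Even with modest assumptions, the total quickly reaches several terabytes of primary storage that also has to be backed up.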
Tackling VM sprawl
So how do we go about tackling the problem of VM sprawl without compromising the benefits of on-demand virtual machine creation?
Audit VMs – This may seem like a simple suggestion, but the aim here is to ensure that VMs map to a hypervisor host or cluster. This can be achieved using simple PowerShell scripting that can check the on-disk contents against the inventory that the datastore or volume is assigned to. Any virtual machines found not to be associated with a hypervisor can be archived or deleted (subject to verification).
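The core of such an audit is a comparison between two lists: the VM configuration files found on a datastore and the VMs the hypervisor has registered. The sketch below shows that comparison in Python rather than PowerShell. It assumes both lists have already been exported by whatever tooling is in use; the function name and the sample paths are hypothetical.

```python
def find_orphaned_vms(on_disk, registered):
    """Return config files present on disk but missing from the inventory.

    Both arguments are iterables of datastore paths (strings). How they are
    collected is outside this sketch and depends on the hypervisor tooling.
    """
    registered_set = {path.strip() for path in registered}
    return sorted({path.strip() for path in on_disk} - registered_set)


# Hypothetical sample data for illustration.
on_disk = [
    "[datastore1] web01/web01.vmx",
    "[datastore1] test-old/test-old.vmx",
    "[datastore1] db01/db01.vmx",
]
registered = [
    "[datastore1] web01/web01.vmx",
    "[datastore1] db01/db01.vmx",
]

for path in find_orphaned_vms(on_disk, registered):
    print("Not in inventory:", path)
```

Anything the comparison flags is only a candidate for archiving or deletion and still needs to be verified with its owner.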
Implement good naming standards – Without decent naming standards, tracking down the owner of an unused VM can be an issue. Naming standards for virtual machines are probably in place anyway, as without them managing large numbers of VMs is impossible. Either the name or the description should reference the business owner or department, and/or a specific contact.
Implement data policies – Policies include using thin provisioning where possible (including cleanup and optimisation tasks) and snapshots where appropriate, with standards around snapshot-retention times.
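A snapshot-retention standard can be checked with very little logic. The following Python sketch flags snapshots older than a chosen retention period. The seven-day limit, the record layout and the sample names are assumptions for illustration; real snapshot metadata would come from the hypervisor's own reporting.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, max_age_days, now):
    """Return (vm_name, snapshot_name) pairs older than the retention period.

    snapshots is an iterable of (vm_name, snapshot_name, created) tuples,
    where created is a datetime. The layout is hypothetical.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [(vm, name) for vm, name, created in snapshots if created < cutoff]


# Hypothetical sample data; 'now' is fixed so the result is repeatable.
now = datetime(2024, 1, 31)
snapshots = [
    ("web01", "pre-patch", datetime(2024, 1, 29)),
    ("db01", "before-upgrade", datetime(2023, 11, 2)),
]

print(expired_snapshots(snapshots, max_age_days=7, now=now))
# [('db01', 'before-upgrade')]
```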
Implement VM archiving – Many backup products provide facilities to archive VMs. Deleting a currently unused virtual machine might be undesirable if that VM is needed again in the future. Therefore, simply having the ability to archive a VM off to cheaper storage or tape provides the ability to reduce primary disk sprawl. Obviously, the key requirement here is to ensure the VM can easily be found again when needed.
Implement VM lifecycle management tools – As the demand for virtual machines grows, many organisations will see benefits from implementing a management framework around virtual machine deployments, such as ServiceMesh Agility from CSC. These tools provide a framework to deploy virtual machines within projects, tracking all objects within a centralised database. These platforms are a stepping stone into a cloud management infrastructure and so are more likely to be useful in organisations with heavy development requirements and large-scale VM deployments.
One final observation: using server virtualisation means the storage administrator has to work more closely than ever with other teams, and needs an understanding of the value of the data being stored. In many cases, of course, the storage and virtualisation administrator may be one and the same person.