Troubleshooting VM performance, capacity management woes

Problems stemming from performance and capacity management are big worries for virtualisation administrators. But there are tool sets and metrics they can use for troubleshooting.

With increasing virtual machine (VM) performance and improved availability monitoring tools, IT pros may think that performance and capacity management will be less of a concern. But there will always be circumstances that cannot be resolved -- even with the latest feature-rich hypervisor-level functionality. When such circumstances occur, the hardware resources available to a virtual machine will decrease and end-user complaints about performance degradation will increase. The virtualisation administrator must resolve these problems with a suitable, easy-to-use tool set that can gather the most relevant statistics. In this tip, technology expert Daniel Eason offers advice on which tool sets and metrics professionals can use for troubleshooting.

Guest VM or host VM?

User complaints of performance degradation can be stressful for IT administrators. Although server virtualisation helps consolidate server workloads, virtualisation technology also has a hidden complexity that masks existing and potential configuration issues within any supporting external hardware, such as disk storage configuration, which could be the real issue.

Traditional server OS monitoring tools and performance counters used within a physical server infrastructure are now mostly redundant because of the change in approach towards consolidated workload monitoring.

Examples of such tool sets are:
- Windows PerfMon OS Monitoring counters (relevant to Windows)
- Microsoft System Center Operations Manager OS Monitoring
- HP OpenView-based OS Monitoring

Instead, monitoring strategies must focus on tool sets that monitor performance from outside the virtual machine (VM), at the hypervisor and host level. Examples include:

  • esxtop (covered in detail throughout this article)
  • Veeam Monitor
  • Quest vFoglight
  • VKernel Performance Analyser
  • VMware vCenter Operations

The host-level tools mentioned above give administrators an instant view of single-VM consumption and provide clarity on core hardware resources such as CPU, RAM and storage I/O.

Available monitoring tool sets

So, what tools should you be using and reviewing to troubleshoot performance and capacity management issues?

One important tool set for monitoring is the vSphere command-line tool esxtop, which is included within fully installed versions of ESX/vSphere, or resxtop, which is used to monitor remotely with the VMware vMA (vSphere Management Assistant). Both tools can provide an instant snapshot of counter information via the command-line interface, and results can also be archived to raw comma-separated values (CSV) format for review in a more simplified form or for use at a later point in time.

At first glance, switching between the different hardware resource statistics can be confusing, but pressing the letters displayed within the command-line interface session toggles the view to the desired hardware resource.

Newer versions of vSphere include esxtop batch support to export historical statistics into a CSV file for viewing outside the CLI session.

To begin a simplified fault-finding process, the following is a good command for capturing all performance counters into a single CSV file in batch mode:

CLI command line: esxtop -a -b > myresults.csv

Once the data is exported, the next sections list the various hardware-related counters that you may need to review within esxtop to identify an issue. I also have some tips on what you can do to remediate those problems.
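As a rough illustration of working with the exported file, the Python sketch below averages every column whose header ends with a given counter name. It assumes the batch output uses PerfMon-style headers of the form \\host\Group Cpu(id:vm)\% Ready; the host name, VM names and values in the sample are made up, and average_by_column is a hypothetical helper, not part of any VMware tool.

```python
import csv
import io
from statistics import mean

# Made-up two-sample excerpt in a PerfMon-style layout (illustrative only).
SAMPLE_LINES = [
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(101:web01)\% Ready","\\esx01\Group Cpu(102:db01)\% Ready"',
    '"01/01/2012 10:00:00","4.0","70.0"',
    '"01/01/2012 10:00:05","6.0","80.0"',
]
SAMPLE_CSV = "\n".join(SAMPLE_LINES) + "\n"


def average_by_column(csv_text, counter_suffix):
    """Return {column header: mean value} for columns ending in counter_suffix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Indexes of the columns that carry the requested counter.
    wanted = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    values = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                values[header[i]].append(float(row[i]))
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
    return {name: mean(vals) for name, vals in values.items() if vals}


if __name__ == "__main__":
    for column, avg in average_by_column(SAMPLE_CSV, r"\% Ready").items():
        print(f"{column}: {avg:.1f}")
```

To try it against a real export, read myresults.csv into a string (for example with open("myresults.csv", newline="").read()) and pass that in place of SAMPLE_CSV. Large captures may need a streaming approach instead.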

Identifying vCPU problems

Problems with CPU are usually due to the vCPU configuration for VMs and not so much a lack of available CPU power. Adding vCPUs to a VM simply because it is easy to do may have big consequences for multiple running VMs within a shared environment.

Table 1 (below) illustrates the counters shown within esxtop which can help you determine possible performance issues with vCPUs.

Table 1: Esxtop counters

| Counter | Reference | Symptom | Solution |
| --- | --- | --- | --- |
| %RDY | CPU ready time | A high number (>60%) indicates that there is contention and VMs are in a queue awaiting an opportunity to use CPU. | See if you can reduce the vCPU count of your VM. Also check if the VM has the correct HAL (hardware abstraction layer) version to match the number of processors. |
| %USED | Used processor resource | A high percentage (>60%) means that a particular VM is consuming a lot of CPU. | Investigate whether you can resource-control other VMs in order to redirect resources back to the VM that is suffering. |
| %CSTP | Co-scheduling volume | As with %RDY, a high percentage (>60%) shows that vCPU is used far too heavily across the complete host. | Establish if your VMs with multiple vCPUs are actually multithread capable. If not, reduce their vCPU count to provide more available scheduler time to VMs that do support multithreaded apps. |
| | | If this is high (>60%), it indicates that limits are set for this particular VM. | Review why there’s a limit imposed. Is it because it was placed at deployment stage? Was it added to a resource pool or were limits applied to the VM by mistake? |
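The checks in Table 1 can be sketched as a small Python filter. This is only an illustration: the 60% threshold is the one quoted in the table, while flag_cpu_counters, the advice strings and the sample readings are hypothetical and would need adapting to however you collect the counters.

```python
# Threshold quoted in Table 1; treat it as a starting point, not a hard rule.
CPU_THRESHOLD_PCT = 60.0

# Short reminders paraphrasing the "Solution" column of Table 1.
ADVICE = {
    "%RDY": "contention: consider reducing the VM's vCPU count",
    "%USED": "heavy CPU consumer: review resource controls on other VMs",
    "%CSTP": "co-scheduling pressure: check multi-vCPU VMs run multithreaded apps",
}


def flag_cpu_counters(readings, threshold=CPU_THRESHOLD_PCT):
    """Return (vm, counter, value, advice) tuples for readings above threshold."""
    findings = []
    for vm, counters in readings.items():
        for counter, value in counters.items():
            if counter in ADVICE and value > threshold:
                findings.append((vm, counter, value, ADVICE[counter]))
    return findings


if __name__ == "__main__":
    # Made-up readings for two hypothetical VMs.
    sample = {
        "web01": {"%RDY": 72.5, "%USED": 35.0},
        "db01": {"%RDY": 5.0, "%CSTP": 61.0},
    }
    for vm, counter, value, advice in flag_cpu_counters(sample):
        print(f"{vm} {counter}={value:.1f}% -> {advice}")
```

Passing a different threshold argument lets you tighten or relax the check without touching the advice table.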

Identifying memory problems

Sometimes, performance and capacity management issues arise because of a lack of memory in a particular VM. IT staff can identify this problem by reviewing the statistics that highlight when vSphere/ESX memory management technologies, such as paging and VM ballooning, are in use.

If there’s a memory overcommit level, within the console you will see, for example, “MEM overcommit avg: 0.40, 0.40, 0.40”. The three numbers are the 1-, 5- and 15-minute averages and show that over those periods you were 40% overcommitted on the physical memory available.
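To make the arithmetic concrete, the following Python sketch turns such a line into percentages. It assumes the line contains "overcommit avg:" followed by three comma-separated decimals; parse_mem_overcommit is a hypothetical helper written for this illustration.

```python
import re

# Matches the three averages that follow "overcommit avg:" (case-insensitive).
_OVERCOMMIT = re.compile(
    r"overcommit avg:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", re.IGNORECASE
)


def parse_mem_overcommit(line):
    """Return the 1-, 5- and 15-minute overcommit averages as percentages."""
    match = _OVERCOMMIT.search(line)
    if match is None:
        raise ValueError("no overcommit averages found in line")
    # A value of 0.40 means 40% overcommitted on physical memory.
    one, five, fifteen = (round(float(g) * 100, 1) for g in match.groups())
    return {"1min": one, "5min": five, "15min": fifteen}


if __name__ == "__main__":
    print(parse_mem_overcommit("MEM overcommit avg: 0.40, 0.40, 0.40"))
```

A rising 1-minute figure against a flat 15-minute figure would suggest a recent change rather than a long-standing shortfall.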

To establish whether you are really at the edge of exhausting memory, the counter SWCUR will highlight how much memory the problematic VM has swapped. Swapping to disk puts a lot of pressure on storage, so avoid swapping at all costs.

Counters (identified in Table 2) are what you should look for when trying to see if swapping is your potential demon in the closet.

Table 2: Counters

| Counter | Explanation |
| --- | --- |
| SWCUR | This shows the amount of memory (in MB) this VM has swapped to disk in the past. |
| r/s & w/s | High read and write levels of swap indicate large amounts of memory paging. If r/s is high, this could indicate that a large memory request was made by the application and it is still using swap memory. |

You can establish if ballooning is enabled for each VM under the counter MCTL. This provides a simple “Y” or “N” as to whether the balloon driver is present. If you see “N” and believe your application is memory-bound, ensure VMware Tools is installed to enable VM ballooning.
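The two memory checks described above (swapped memory under SWCUR and the Y/N flag under MCTL) could be combined roughly as follows. This is a sketch under assumptions: the per-VM dictionaries, the VM names and the memory_warnings helper are hypothetical, and the values would come from whatever you use to read esxtop output.

```python
def memory_warnings(vm_stats):
    """Return warning strings for VMs that have swapped or lack ballooning."""
    warnings = []
    for vm, stats in vm_stats.items():
        swapped_mb = stats.get("SWCUR", 0)
        if swapped_mb > 0:
            warnings.append(f"{vm}: {swapped_mb} MB swapped, check for memory overcommit")
        if stats.get("MCTL") == "N":
            warnings.append(f"{vm}: ballooning not available, check VMware Tools is installed")
    return warnings


if __name__ == "__main__":
    # Made-up values for two hypothetical VMs.
    sample = {
        "web01": {"SWCUR": 0, "MCTL": "Y"},
        "db01": {"SWCUR": 512, "MCTL": "N"},
    }
    for warning in memory_warnings(sample):
        print(warning)
```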

Overall, to resolve memory bottlenecks, ensure that you have not over-committed resources on VMs that need a guaranteed amount of memory. If you have carefully assessed the overall consumption on a per-VM basis, and you are certain that this is normal behaviour, then you should investigate increasing the physical RAM in the host in question, or increasing the number of hosts if this VM is within a Distributed Resource Scheduler cluster.

Identifying disk or I/O problems

A number of components and layers within the storage of a virtualised environment may be the root cause of your issue. But it is probably safe to suggest that for storage issues, the root cause is likely to be external storage configuration, i.e. the RAID level allocated or the total number of spindles.

Esxtop is capable of monitoring storage at the following levels:

  • HBA - Press d, f, b, c, d, e, h, j, s, 2, Enter.
  • LUN - Press u, f, b, c, f, h, s, 2, Enter.
  • VM - Press v, f, b, d, e, h, j, s, 2, Enter.

When inside each esxtop counter view, there are key metrics that users must review to establish if storage is the root cause of any potential issues (as shown in Table 3 below).

Table 3: Solutions for storage issues

| Column | Explanation | Solution |
| --- | --- | --- |
| CMDS/s | Total IOPS between storage device and VM. | If this is low, look to the external storage array configuration. You can increase the total number of disk spindles present within the relevant storage for more I/O, or review RAID configurations to establish if they are appropriate for the workload. |
| DAVG/cmd | Average response time in milliseconds for each command sent to the storage device. | Look for any configuration issues on the external storage connectivity devices, such as the SAN switch. |
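One simple way to use the Table 3 columns is to rank devices by DAVG/cmd so the slowest ones are examined first. The Python sketch below does only that. It deliberately avoids any pass/fail threshold because the article does not give one, and the device names, values and rank_by_latency helper are all hypothetical.

```python
def rank_by_latency(device_stats, top=3):
    """Return up to `top` (device, DAVG/cmd in ms, CMDS/s) tuples, slowest first."""
    ranked = sorted(
        device_stats.items(),
        key=lambda item: item[1]["DAVG/cmd"],
        reverse=True,  # highest average response time first
    )
    return [(name, stats["DAVG/cmd"], stats["CMDS/s"]) for name, stats in ranked[:top]]


if __name__ == "__main__":
    # Made-up per-LUN figures for illustration.
    sample = {
        "lun0": {"DAVG/cmd": 4.2, "CMDS/s": 900.0},
        "lun1": {"DAVG/cmd": 31.5, "CMDS/s": 120.0},
        "lun2": {"DAVG/cmd": 12.0, "CMDS/s": 450.0},
    }
    for name, latency_ms, cmds in rank_by_latency(sample, top=2):
        print(f"{name}: {latency_ms} ms per command at {cmds} commands/s")
```

A device that combines a high DAVG/cmd with a low CMDS/s is a reasonable first candidate for the array and SAN checks listed in the table.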

Storage problems are rarely at the hypervisor level and are usually down to external configuration at the storage subsystem layer. For external storage arrays, consult documentation from the incumbent storage vendor to ensure that you’re following best practices. Also remember that different storage media types -- FC against SATA, for example -- give different results, and these differences are easy to miss when you are trying to see why performance is lagging, because the hypervisor encapsulates the storage type.

Reviewing data with Windows PerfMon

All CSV data collected by esxtop in batch mode can be viewed within Windows PerfMon. This requires the CSV to be copied to an accessible drive and then opened as a log file data source. Doing this provides the added benefit of being able to point and click on counters and select a date range.

Reviewing data with vCenter

vCenter also shows the performance statistics found in esxtop through its graphical user interface (GUI) and provides some of the same counters as historical statistics. Consider the following when opting to use vCenter over esxtop directly:

  • You will need to collect additional statistic levels, which require more database performance, storage and maintenance.
  • You may find that vCenter bombards you with too much information, which may slow down fault finding.
  • Historical data collected in vCenter is not always shown as per the recommended guidelines from VMware.
  • Whereas esxtop is on a per-host basis, vCenter has the benefit of multi-host management.

Reviewing data with esxplot

The free viewer tool, esxplot, enables you to open esxtop CSV files without using PerfMon on Windows; it also supports Linux. This allows easier drill-down into each component, which is critical when you need to identify root causes quickly. You can download it from VMware Labs.


Check out these reporting tools and counters when you’re experiencing performance issues with virtual machines. They could help you resolve performance and capacity management issues within your virtualised infrastructure.

But one important takeaway for performance and capacity management is that you should always keep an open mind and never rush to conclusions about where a problem lies. Explore other potential factors, such as regular antivirus scanning, a backup job that has overrun or other similar services that may well be the root cause.

Daniel Eason is an infrastructure architect at a multinational company and is based in the UK.
