The Essentials Series: Resolving VMware vSphere's Six Biggest Performance Issues

by Greg Shields


VMware vSphere is a complicated beast. Because of its moving parts and deep integrations, maintaining vSphere's performance can become a never-ending exercise in capacity management, requiring constant monitoring. To monitor and maintain high performance, vSphere offers over a hundred metrics and counters exposed in the vCenter Client. The hard part, however, is in understanding which counters are useful for recognizing resource capacity, and which counters simply muddy up the water. In The Essentials Series: Resolving VMware vSphere's Six Biggest Performance Issues, author and virtualization expert Greg Shields introduces you to an important set of useful counters that will clearly quantify the behavior of vSphere. Furthermore, he will introduce actionable intelligence that an administrator can use to glean actual resolutions from the raw data.


Article 1: What Six vSphere Issues Most Impact VM Performance?

Monitoring Behaviors to Find Performance Issues

Before we can delve into the technical information, it is necessary to recognize the biggest issues vCenter environments face. When you step into your office on a Monday morning to find a dozen work orders and voicemails, it's your job to figure out why "the server is slow today."

That troubleshooting process has for too long been a subjective activity. Part of the reason for our gut‐feeling approach to performance and capacity management has centered on the server's lack of instrumentation. In the physical world, instrumenting a server required extra effort, additional and sometimes expensive software, and an advanced degree in statistics and data analysis. Today, even as virtualization complicates these activities through its collocation of virtual machines, it also eases performance and capacity management by automatically instrumenting virtual machine activities with a range of behavioral monitors.

Article 2: What Ten Counters Quantify Those Behaviors?

A virtual environment is by nature an invisible environment. You simply can't crack the case on a vSphere host and expect to "see" the behaviors going on inside. That's why its counters are so important. They represent your only way to understand the behaviors and quantify potential resolutions.

But counters by themselves are very scary things. A counter is by definition just a number. Put together enough of those numbers, and you'll create a graph not unlike Figure 1 in the previous article. Divining meaning from the points on that graph, however, is another thing entirely. Studying charts and graphs is an activity that can consume every part of your workday. With those graphs constantly evolving with a virtual environment's behaviors, just keeping up is a challenge all its own.

Yet monitoring virtual machine performance is a virtual environment's most important activity. That's why this series' previous article suggested that an unaided person can never effectively convert raw data into actionable intelligence. Oh, yes, in a tiny environment with just a few interdependencies, you probably could. But most of our VMware vSphere data centers are large and distributed. Finding the source of a performance issue isn't easy when you're staring at its unending integers.

Article 3: What Twelve Tools and Processes Turn Raw Data into Resolutions?

Keeping eyes on ten counters for a single virtual machine isn't easy. Doing the same for dozens or hundreds of virtual machines is functionally impossible for any human being. That's why assistive tools are necessary to convert those counters' raw data into actionable intelligence. Answering that all‐important question of "What should I do?" requires aligning what's going on with the range of possible resolutions.

This last article was written specifically to highlight how difficult that process is with counters alone. If net.usage.average is high today but so is disk.busResets.summation, what should you do? Is the bottleneck related to network oversubscription or to a situation in your storage layer? Even worse, are both subsystems experiencing a problem, or is one problem causing the other?
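To make that ambiguity concrete, here is a minimal Python sketch of naive threshold-based triage over the two counters named above. The threshold values, the `Sample` type, and the `flag_counters` and `triage` helpers are all hypothetical and chosen only for illustration; they are not VMware guidance or part of any vSphere API.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; real limits depend on the environment.
THRESHOLDS = {
    "net.usage.average": 100_000.0,   # assumed ceiling, in KBps
    "disk.busResets.summation": 0.0,  # assumption: any bus reset is worth a look
}


@dataclass
class Sample:
    counter: str
    value: float


def flag_counters(samples, thresholds=THRESHOLDS):
    """Return the names of counters whose sampled value exceeds its threshold."""
    return [
        s.counter
        for s in samples
        if s.counter in thresholds and s.value > thresholds[s.counter]
    ]


def triage(samples):
    """Summarize what a simple threshold check can (and cannot) conclude."""
    flagged = flag_counters(samples)
    if not flagged:
        return "no counters over threshold"
    if len(flagged) == 1:
        return f"investigate {flagged[0]}"
    # Two or more flags: thresholds alone cannot separate cause from effect.
    return "multiple subsystems flagged: " + ", ".join(flagged)


samples = [
    Sample("net.usage.average", 120_000.0),
    Sample("disk.busResets.summation", 3.0),
]
print(triage(samples))
```

When both counters exceed their limits, the sketch can only report that multiple subsystems are flagged. It has no way to say whether the network, the storage layer, or an interaction between the two is the root cause, which is exactly the gap that correlation-aware tooling is meant to close.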

Even more insidious is the situation where the issue isn't a hardware problem at all. Instead of stemming from some hardware shortcoming, perhaps the slowdown relates to another administrator's storage or networking activities. Maybe they've just begun a large and unthrottled migration of data over the network. Numbers lie. They do so particularly when no governance exists over the activities those numbers are measuring.