The more servers you manage, the less likely it is that you will be able to continually monitor their health all on your own. Systems administrators do actually sleep every now and then -- so an ever-vigilant computer that probes system health and can monitor performance issues on a large number of machines is a huge help.
My personal preference for monitoring tools happens to be Nagios. There are a number of good monitoring solutions out there (a number of them actually seem to be based on Nagios), but I've long liked Nagios for its price (free), its completeness, and the fact that it is an open source project.
The open source nature of Nagios, combined with the modular nature of its probes and the fact that the plug-ins themselves are pretty easy to write, means that if Nagios doesn't happen to check an attribute out of the box, I can either easily write a new script that does or, more likely, find that someone has already written it for me. There is a large set of third-party plug-ins available online that go far beyond system load and ping checks and move into SAN multipathing and more advanced Apache monitoring.
For the longest time, I just thought of Nagios as a monitoring tool. It would probe all of my servers and send alerts if a particular service was down or if system load or other statistics were outside the norm. Then one day, while I was looking for a good solution for trending performance data on my servers, I realized that while there are many other tools out there, like Cacti, that can poll system statistics and graph them for you, I already had a system in place that polled every server I cared about on my network. I should just figure out a way to extend it to graph all of the data it collected.
While a monitoring server mostly values data that falls outside of the norm, it still collects tons of valuable data every time it does a probe. Even though Nagios does not graph performance data by default, it does offer a mechanism to collect the data it does get from its probes. Basically, all a Nagios plug-in has to do to support Nagios's performance data collection is to output the extra performance data at the end of its standard output. The format for the output is pretty straightforward and is documented for Nagios 3.0. Once the plug-in outputs this data, Nagios can then be configured to simply dump this data into a file in certain formats for later parsing, or it can pass the data to a third-party program. There are a number of programs to manage this data but I settled on one called PNP. PNP stores the performance data to RRD (Round-Robin Database) files that can then easily be graphed.
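As a rough illustration of the output convention described above, here is a minimal Python sketch of a load-checking plug-in. The thresholds, the load1 label, and the function names are assumptions made up for this example and are not part of Nagios itself. The performance data layout (label=value;warn;crit;min;max, placed after a pipe character) follows the Nagios plug-in guidelines, and the exit codes 0, 1, and 2 are the usual OK, WARNING, and CRITICAL states.

```python
import os

# Conventional Nagios plug-in exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def format_perfdata(label, value, warn, crit, minimum=0):
    """Build one perfdata item: label=value;warn;crit;min;max (max left empty)."""
    return f"{label}={value:.2f};{warn:.2f};{crit:.2f};{minimum};"


def check_load(load1, warn=4.0, crit=8.0):
    """Return (exit_code, output_line) for a one-minute load average.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if load1 >= crit:
        code, state = CRITICAL, "CRITICAL"
    elif load1 >= warn:
        code, state = WARNING, "WARNING"
    else:
        code, state = OK, "OK"
    perfdata = format_perfdata("load1", load1, warn, crit)
    # Human-readable text first, then a pipe, then the performance data.
    return code, f"LOAD {state} - load average: {load1:.2f} | {perfdata}"


def main():
    """Check the local one-minute load average (Unix only) and print the result."""
    code, line = check_load(os.getloadavg()[0])
    print(line)
    return code
```

A real plug-in would finish with sys.exit(main()) so that Nagios sees the exit code. Everything before the pipe is what shows up in the alert text, and everything after it is what a tool such as PNP would store and graph.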
Using graphs for troubleshooting server systems performance
Now what exactly is the advantage of graphing all of this data? Graphs aren't just for vendor presentations; they can be invaluable when you are trying to identify and track performance problems on your network. While you could certainly just pore through the performance numbers by hand, you will find you can identify problem points more quickly when all the system stats for a machine are graphed and lined up according to time.
Whether you use Nagios and PNP or some other graphing tool, once your system is set up, how do you use these graphs to track down performance issues? Sometimes you get lucky, but it's not always as easy as finding that one graph with a spike. For example, let's take one of the most basic statistics you will likely monitor and graph: system load. On Linux and a number of other Unix systems, the system load is displayed with three numbers: the average number of running or uninterruptible processes over one-, five-, and 15-minute intervals. These numbers aren't normalized across multiple CPUs, so, for instance, a load average of one on a single-CPU machine means the processor has been 100% busy over that interval. But on a two-CPU machine, a load of one means you have one processor idle on average.
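To make the per-CPU arithmetic concrete, here is a small Python sketch that divides the three load averages by the number of CPUs so that machines of different sizes can be compared on one scale. The function names are hypothetical, and dividing by the CPU count is only a rule of thumb because, as noted above, the load figure also counts uninterruptible processes.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the CPU count to make hosts comparable."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


def describe(load_averages, cpu_count):
    """Format the normalized 1-, 5-, and 15-minute load averages as text."""
    one, five, fifteen = load_per_cpu(load_averages, cpu_count)
    return f"per-CPU load: {one:.2f} (1 min), {five:.2f} (5 min), {fifteen:.2f} (15 min)"


def local_load_per_cpu():
    """Normalized load for the local machine (Unix only)."""
    # os.cpu_count() can return None, so fall back to a single CPU.
    return load_per_cpu(os.getloadavg(), os.cpu_count() or 1)
```

With this, a load of one on a two-CPU machine comes out as 0.5 per CPU, which matches the "one processor idle on average" reading.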
However, spikes in load average can be misleading. While it's easy to point out a performance problem being caused by a high load, it's important to remember that all load averages really tell you is how many processes are running and potentially waiting. Load averages don't tell you why they are waiting. There are a number of different causes for high load averages on a system, and they can cause the system performance to degrade in different ways.
Probably the simplest cause of a high load average is a large number of processes on the system, many of which fully use a CPU. If all of your CPUs are currently completely busy and new processes spawn, each of those processes will have to wait for its turn with the CPU. This CPU-bound load can behave interestingly. Depending on how many of the waiting processes use the CPU heavily, you could have a very high load but still have a relatively responsive system. I've seen systems with CPU-bound loads in the hundreds that, while not exactly zippy, could still be logged into to check performance without much of a problem. I've also seen machines with relatively low CPU-bound loads bog down because there were enough CPU hogs running at the same time to more than tie up all CPUs.
Another common reason for high load is an I/O bottleneck. When processes compete for the same disk resources, some have to wait, and during high disk I/O the waiting processes can stack up. In my experience, high I/O-bound load can cause the system to become even more sluggish than CPU-bound load, even at lower load averages.
Since there are a number of different causes of load, it can sometimes take a bit of detective work to track down the root cause. However, good graphing tools can often help you pinpoint the cause much more quickly. For instance, a few metrics I monitor and graph on my systems are the load averages, RAM and swap utilization, disk I/O for each mount point, and network I/O. Once all the graphs are lined up, you can easily tell whether the spikes in load correspond to spikes in any of the other metrics. If I see high load but no spike in disk, then there's a good chance the load is CPU-bound. If I see high load that correlates with high disk I/O then I can be assured that the load is disk I/O-bound. If I also notice my overall RAM use increasing before the spike in load along with an increase in my system swap, then I would have a good hunch that the load could be caused by the system running out of available RAM and relying on swap, which would then cause a large increase in disk I/O.
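The reasoning I do by eye on the graphs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Sample fields, the "twice the baseline counts as a spike" rule, and the wording of the results are all made up for the example, and it compares a single moment against a baseline rather than following the trend over time the way a graph lets you.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    load: float       # load average
    disk_io: float    # disk throughput, in any consistent unit
    swap_used: float  # swap in use, in any consistent unit


def is_spike(value, baseline, factor=2.0):
    """Treat a value as a spike when it exceeds the baseline by `factor`."""
    return value > baseline * factor


def likely_cause(sample, baseline, factor=2.0):
    """Guess why load is high by checking which other metrics spiked with it."""
    if not is_spike(sample.load, baseline.load, factor):
        return "no load spike"
    disk = is_spike(sample.disk_io, baseline.disk_io, factor)
    swap = is_spike(sample.swap_used, baseline.swap_used, factor)
    if disk and swap:
        return "possibly RAM exhaustion driving swap and disk I/O"
    if disk:
        return "probably disk I/O-bound"
    return "probably CPU-bound"
```

For example, with a baseline of Sample(1.0, 100.0, 50.0), a reading of Sample(8.0, 900.0, 400.0) is flagged as possible RAM exhaustion, while Sample(8.0, 110.0, 50.0) is flagged as probably CPU-bound. The answers are hunches, just as they are when reading the graphs.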
What's even better about using graphs to track down performance problems is that you can do it after the fact. For whatever reason, sometimes you can't access a machine while it is experiencing a performance issue. By the time you get an alert and log into the system, it's possible that everything has returned to normal. There are a number of times I have been able to piece together the cause of a performance issue strictly from the graphs. I know on my graphs I can always tell when my nightly backup job has run by the series of spikes in disk, then network traffic. This has been particularly handy when I've needed to rule out the backup job as the cause of sluggish performance, as I can do it at a glance and not have to dig through backup logs.
Over time, your monitoring can also provide good baselines for trending. Whether it's something more complex, like the gradual increase in the number of Apache processes your Web servers use during peak times and how it correlates with spikes in your RAM usage, or something simple, like the rate at which your databases have consumed disk space over the past few months, good graphing tools tied into your monitoring can automatically provide you with reports that would otherwise require you to devote time to mundane data collection and manual graphing. Plus, with proper arrangements of your statistics, you can more easily see the relationship of spikes across a number of different systems.
The combination of a good monitoring tool that is reliable and extensible with automated graphing and trending tools makes yet another otherwise time-consuming and mundane process like gathering statistics and tracking performance bottlenecks manageable. When your downtime is measured in dollars, not seconds, you definitely need all the advantages you can get so that you can accurately and quickly isolate the cause of performance issues. Plus you get that added advantage of fancy graphs to throw into your next presentation in front of management.