The most commonly used metric for capacity planning does not address the fundamental questions that need to be answered to perform capacity planning.
The common assumption is that measuring utilization tells us whether a link has enough capacity, whether congestion is degrading service performance, and when to increase or decrease capacity. This is why every major performance management product measures interface throughput and calculates utilization as a metric, and why almost all capacity planning tools rely on the same numbers. However, this metric doesn't answer those questions.
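For reference, this is the calculation those tools perform: a minimal sketch of deriving utilization from two octet-counter polls. The counter values and interface speed are invented for illustration, and counter-wrap handling is omitted.

```python
def utilization_pct(octets_t0, octets_t1, interval_s, if_speed_bps):
    """Classic utilization metric: bits transferred between two counter
    polls, divided by the link's bit capacity over that interval.
    (Counter-wrap handling omitted for brevity.)"""
    bits = (octets_t1 - octets_t0) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)

# Hypothetical 5-minute poll of a 1 Gb/s interface:
# 15e9 octets moved in 300 s -> 120e9 bits of a 300e9-bit capacity = 40%
print(utilization_pct(0, 15_000_000_000, 300, 1_000_000_000))  # 40.0
```

Note that everything this calculation produces is an average over the whole polling interval, which is exactly the limitation discussed next.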
Before we look at what to measure, let's discuss why we need to measure it. IP is an asynchronous protocol; bandwidth is not allocated by circuit or service, but by demand. The physical media, however, delivers synchronously at a fixed rate. Cars on a road illustrate this: the road is only so wide (bandwidth) and can accommodate a certain number of cars at a certain speed. When there are too many cars during rush hour, we have traffic congestion. Likewise, during periods of high demand the fixed-rate circuit may become congested, causing packet loss, jitter, and service degradation. We want to prevent service degradation due to congestion, so we need a means to manage it. The fundamental questions we need to answer are: how much loss is congestion causing, and is there sufficient bandwidth to carry the offered traffic?
The assumption is that utilization indicates congestion and loss; this is false. Interface throughput measures the number of bits or packets through an interface over a polling period, usually five minutes or more, and utilization is then calculated against the interface speed. Every data point is therefore an average over the polling period. The problem is that the interface queue only holds about 300 milliseconds of traffic; a five-minute average spans 1,000 times the queue length. Traffic is bursty, so you really have no idea what is happening within any given 300 ms window. Even if you increase the polling rate, the device typically updates its statistics only every 2-3 seconds, and polling every 5 seconds (which would be burdensome) still averages over more than 16 queue lengths. Furthermore, utilization doesn't tell you whether there is congestion or discards; high utilization merely implies congestion under a certain set of assumptions.
Any relationship between interface utilization and congestion is an educated guess at best. Due to the polling frequency, it’s not a very accurate guess.
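A toy simulation makes the averaging problem concrete: one two-second-scale burst can overflow a 300 ms queue while the five-minute average still looks comfortable. All numbers here are invented for illustration.

```python
# Toy illustration: a 5-minute poll shows modest utilization even though
# a short burst overflowed a 300 ms output queue. All numbers are invented.
LINK_BPS = 1_000_000_000            # 1 Gb/s link
QUEUE_BITS = int(LINK_BPS * 0.3)    # queue holds ~300 ms of line-rate traffic

# Per-second offered load in bits: 299 quiet seconds, one 2x line-rate burst
load = [int(LINK_BPS * 0.35)] * 299 + [2 * LINK_BPS]

queue = 0
drops = 0
for bits in load:
    queue += bits
    queue -= min(queue, LINK_BPS)   # drain at line rate each second
    if queue > QUEUE_BITS:          # tail drop on overflow
        drops += queue - QUEUE_BITS
        queue = QUEUE_BITS

avg_util = 100 * sum(load) / (len(load) * LINK_BPS)
print(f"5-minute average utilization: {avg_util:.1f}%")  # ~35.6%
print(f"bits tail-dropped: {drops}")                     # nonzero
```

The interface reports roughly 36% utilization for the interval, yet real traffic was lost. Utilization alone would never reveal it.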
The most common approach to congestion management is to implement a differentiated services model, also known as Class of Service (CoS) or Quality of Service (QoS). This provides a mechanism to reduce congestion and improve performance of loss and jitter sensitive applications while allowing you to more fully utilize available bandwidth. This improves both performance and efficiency.
DiffServ also complicates the problem described above. If we don’t know how much congestion or discards due to insufficient bandwidth we had without CoS, how much less do we know when we add the complexity of a differentiated services model?
Almost all routers use an algorithm called Random Early Detection (RED) to improve the performance of TCP traffic. RED discards selected TCP packets before congestion occurs, which improves aggregate TCP performance. This is another factor that adds to the complexity of monitoring congestion.
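The classic RED behavior can be sketched in a few lines: below a minimum queue threshold nothing is dropped, above a maximum everything is, and in between the drop probability ramps up linearly. This is a simplified sketch of the textbook algorithm; the threshold values are invented, and real implementations add refinements such as weighted averaging of queue depth.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Simplified RED drop probability:
    - no drops while the average queue is below min_th
    - forced drop at or above max_th
    - linear ramp from 0 to max_p in between"""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# Hypothetical thresholds: min 20 packets, max 40 packets, max probability 0.1
print(red_drop_probability(30, 20, 40, 0.1))  # 0.05 -- halfway up the ramp
```

The point for monitoring is that RED drops packets deliberately, before the queue is full, so a raw discard count mixes intentional drops with genuine congestion loss.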
In a perfect world, using DiffServ and RED, we could maximize interface utilization while minimizing discards and jitter. Ideally this means 100% utilization and good performance. Of course, nothing is ideal, but where is that sweet spot for utilization? Can we get the most value from bandwidth and still not sacrifice service? This is the goal, but measuring utilization will never tell us how well we’re progressing toward that goal or how much congestion there is, because we really need to know about queuing.
The solution is to measure the number of discards in the output queue and the throughput of each queue instead. These metrics are available through SNMP on most full-featured platforms, but are not typically out-of-the-box in any performance management system.
Measuring queue discards answers the fundamental question. These discards are even separated into RED discards (not necessarily a congestion indicator) and tail drops, which indicate loss due to insufficient bandwidth. Furthermore, systems implementing CoS can measure discards by class of service. Discards in Best Effort, for example, are not necessarily bad; the service was designed to accommodate a certain rate of discard in some classes.
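A sketch of what this measurement looks like in practice: take two polls of per-class queue counters (available via SNMP on most full-featured platforms, e.g. through a vendor's class-based QoS MIB) and difference them. The class names, counter fields, and values below are all invented for illustration.

```python
# Two hypothetical polls of per-class queue counters, 5 minutes apart.
poll_t0 = {
    "VOICE":       {"tail_drops": 0,     "red_drops": 0,     "out_octets": 1_000_000},
    "BEST_EFFORT": {"tail_drops": 500,   "red_drops": 8_000, "out_octets": 90_000_000},
}
poll_t1 = {
    "VOICE":       {"tail_drops": 0,     "red_drops": 0,     "out_octets": 1_400_000},
    "BEST_EFFORT": {"tail_drops": 2_300, "red_drops": 9_500, "out_octets": 95_000_000},
}
INTERVAL_S = 300

for cls in poll_t0:
    tail = poll_t1[cls]["tail_drops"] - poll_t0[cls]["tail_drops"]
    red = poll_t1[cls]["red_drops"] - poll_t0[cls]["red_drops"]
    bps = (poll_t1[cls]["out_octets"] - poll_t0[cls]["out_octets"]) * 8 / INTERVAL_S
    # Tail drops signal insufficient bandwidth; RED drops are largely
    # intentional TCP pacing and must be judged against the class design.
    print(f"{cls}: {tail} tail drops, {red} RED drops, {bps:.0f} bps throughput")
```

Tail drops in the Voice class would be an immediate red flag; a modest RED rate in Best Effort may be exactly what the design intended.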
The other advantage is that this data serves engineering as well as operations. When the Class of Service model was designed, the service classifications, queuing methods, and queue capacities were designed to meet certain criteria. How well are they meeting the established criteria? Queue throughput and discard data should be used to evaluate the performance of the CoS design on a regular basis and can be leveraged for improvements to that design.
Choosing metrics that provide insight into the health and performance of systems and processes can be challenging. Metrics need to be aligned with the requirements of the systems and processes that they support. While many performance management systems provide useful metrics out-of-the-box, you will undoubtedly have to define others yourself and determine a means to collect and report them.
I break metrics down into two major categories: strategic and operational.
Strategic metrics provide broad insight into a service's overall performance. These are the types of metrics that are briefed at the managers' weekly meeting. They usually aren't directly actionable, but are very useful for trending.
Strategic metrics should be used to evaluate the overall effect of process or system improvements. Healthy organizations are involved in some manner of Deming-style continuous process improvement (CPI), which also applies to system and service design. As changes are implemented, metrics are monitored to determine if the changes improved the system or process as expected.
Some examples of strategic metrics are: system availability, homepage load time, and incidents identified through ITSM vs. those identified by customers. These provide a high level indicator of performance more closely related to business objectives than to specific system or process operation and design criteria.
Operational metrics provide detail and are useful for identifying service disruptions, isolating problems, planning capacity, and finding areas for improvement. These metrics are often directly actionable. Operations can use them to proactively identify potential service disruptions, isolate the cause of a problem, and evaluate the effectiveness of the team. Engineering uses them to determine if the service design is meeting the design requirements, identify areas for design improvements, and provide data necessary for planning new services and upgrades.
Good metrics should be aligned with operational factors that indicate the health of the service and the design requirements. Metrics, just like every other aspect of a system design, are driven by requirements. The specific design requirements and criteria should be used to define metrics that measure how that aspect of the service is meeting the specified design objective. Historical metrics are valuable to baseline performance and can be used to configure thresholds or historical reference in problem isolation and forecasting.
For example, if you have employed a differentiated services strategy, you should be monitoring the traffic volume and queue discards for each class of service you've defined. This will help you understand whether your traffic projections are accurate and whether the QoS design is meeting the system requirements. Historical data can help identify the traffic trends behind a change in demand and determine whether it was due to growth, a new application or service, or a "Mother's Day" traffic anomaly.
Sometimes metrics are more valuable when correlated with other metrics. This is true for both strategic and operational metrics. In such cases it is often useful to create a composite metric.
Google, for example, has a health score composed from page load time and other metrics that is briefed to the senior execs daily. In another example, perhaps the calls between the web front end and the SSO are only of concern if they are not directly related to the number of users connecting. In this case a composite metric may provide operations a key piece of information to proactively identify a potential service disruption or reduce MTTR.
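The SSO example above can be sketched as a simple ratio metric: calls to the single sign-on service normalized by concurrent users, compared against a historical baseline. A rising ratio flags trouble even when raw call volume looks ordinary. The function name, baseline, threshold, and sample values are all invented for illustration.

```python
# Sketch of a composite health metric: SSO call volume normalized by
# concurrent users. All names, baselines, and samples are hypothetical.
def sso_calls_per_user(sso_calls: int, active_users: int) -> float:
    return sso_calls / max(active_users, 1)

BASELINE = 2.0       # historical norm: ~2 SSO calls per connected user
ALERT_FACTOR = 1.5   # alert when 50% above baseline

samples = [(2_100, 1_000), (2_050, 980), (6_400, 1_050)]  # (calls, users)
for calls, users in samples:
    ratio = sso_calls_per_user(calls, users)
    status = "ALERT" if ratio > BASELINE * ALERT_FACTOR else "ok"
    print(f"{ratio:.2f} calls/user -> {status}")
```

Neither raw metric alone tells the story; the composite gives operations an early, actionable signal.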
Few performance management systems have the capability to create composite metrics within the application. There are always ways around that, but they usually involve writing custom glueware.
Metrics should have a specific purpose. The consumers of the metrics should find value in the data – both the data itself and the way it is presented. Like every aspect of the service, metrics should be in a Demingesque continual improvement cycle. Metric definitions, the mechanism to collect them, and how they are communicated to their audience need to be constantly evaluated.
Metrics often become useless when the metric becomes the process objective. Take time-to-resolve for an incident. This metric can provide valuable insight into the effectiveness of the operations staff and processes; however, it seldom does, because most operations managers know it is being watched and continually press their staff to close tickets as soon as possible to keep MTTR low. The objective of the operations process is not to close tickets quickly, but to support customer satisfaction by maintaining the service. Because the metric becomes the objective, it loses its value. This is difficult enough to address when the service is managed in-house; when operations are outsourced, it is even more troublesome. Operations SLAs often specifically address MTTR, and if the service provider is contractually obligated to keep MTTR low, they will focus on closing tickets even if the issue remains unresolved.