The most commonly used metric for capacity planning does not address the fundamental questions that need to be answered to perform capacity planning.
The common assumption is that we need to measure utilization to understand link capacity, determine if there is congestion affecting your network service performance, and plan for capacity increases or decreases. This is why all major performance management products measure interface throughput and calculate utilization as a metric. Almost all capacity planning tools also rely on interface throughput or utilization. However, this metric doesn’t address the question.
Before we look at what to measure, let’s discuss the reason why we need to measure it. IP is an asynchronous protocol; therefore, bandwidth is not allocated by circuit or service, but by demand. The physical media; however, uses a synchronous delivery at a fixed rate. This can be illustrated by cars on a road. The road is only so wide (bandwidth) and can accommodate a certain number of cars at a certain speed. When there are too many cars during rush hour, we have traffic congestion. Likewise, during periods of high demand the fixed rate circuit may become congested causing packet loss, jitter, and service degradation. We want to prevent service degradation due to congestion; so we need a means to manage it. The fundamental questions we need to answer are:
The assumption is that utilization indicates congestion and loss; however, this is false. Interface throughput measures the number of bits or packets through an interface over a polling period, usually 5 minutes or more. Utilization is then calculated based on interface speed. This means that every data point is an average over the polling period, which is rarely less than 5 minutes. The problem is that the interface queue only holds 300 milliseconds of traffic. A five minute average is 1000 times the queue length. Traffic is bursty, so you really have no idea what is happening within that 300ms window. Even if you increase the polling rate, the statistics are only updated every 2-3 seconds. If you were polling every 5 seconds (which would be burdensome), the polling rate would still be 20 times the queue length. Furthermore, utilization doesn’t tell you if there is congestion or discards – high utilization implies congestion based on a certain set of assumptions.
Any relationship between interface utilization and congestion is an educated guess at best. Due to the polling frequency, it’s not a very accurate guess.
The most common approach to congestion management is to implement a differentiated services model, also known as Class of Service (CoS) or Quality of Service (QoS). This provides a mechanism to reduce congestion and improve performance of loss and jitter sensitive applications while allowing you to more fully utilize available bandwidth. This improves both performance and efficiency.
DiffServ also complicates the problem described above. If we don’t know how much congestion or discards due to insufficient bandwidth we had without CoS, how much less do we know when we add the complexity of a differentiated services model?
Almost all routers utilize and algorithm called Random Early Detection (RED) to improve the performance of TCP traffic. This algorithm discards certain TCP packets before congestion occurs to improve TCP performance. This is another factor that adds to the complexity of monitoring congestion.
In a perfect world, using DiffServ and RED, we could maximize interface utilization while minimizing discards and jitter. Ideally this means 100% utilization and good performance. Of course, nothing is ideal, but where is that sweet spot for utilization? Can we get the most value from bandwidth and still not sacrifice service? This is the goal, but measuring utilization will never tell us how well we’re progressing toward that goal or how much congestion there is, because we really need to know about queuing.
The solution is to measure the number of discards in the output queue and the throughput of each queue instead. These metrics are available through SNMP on most full-featured platforms, but are not typically out-of-the-box in any performance management system.
Measuring queue discards answers the fundamental question. These discards are even separated by RED discards (not necessarily a congestion indicator) and tail drop discards, which indicate loss due to insufficient bandwidth. Furthermore, systems implementing CoS have the capability of measuring the discards by class of service. Discards in Best Effort, for example, are not necessarily bad, the service was designed to accommodate a certain rate of discard in some classes.
The other advantage this has is not only does it provide insight into operations, but provides data necessary for engineering. When the Class of Service model was designed, the service classifications, queuing methods, and queue capacities were designed to meet certain criteria. How well are they meeting the established criteria? Queue throughput and discard data should be used to evaluate the performance of the CoS design on a regular basis and can be leveraged for improvements to that design.