The most commonly used metric for capacity planning does not address the fundamental questions that need to be answered to perform capacity planning.
The common assumption is that we need to measure utilization to understand link capacity, determine whether congestion is affecting network service performance, and plan for capacity increases or decreases. This is why all major performance management products measure interface throughput and calculate utilization as a metric. Almost all capacity planning tools also rely on interface throughput or utilization. However, this metric doesn’t answer the questions that capacity planning depends on.
Before we look at what to measure, let’s discuss why we need to measure it. IP is an asynchronous protocol; therefore, bandwidth is not allocated by circuit or service, but by demand. The physical media, however, delivers data synchronously at a fixed rate. This can be illustrated by cars on a road. The road is only so wide (bandwidth) and can accommodate a certain number of cars at a certain speed. When there are too many cars during rush hour, we have traffic congestion. Likewise, during periods of high demand, the fixed-rate circuit may become congested, causing packet loss, jitter, and service degradation. We want to prevent service degradation due to congestion, so we need a means to manage it. The fundamental questions we need to answer are:
The assumption is that utilization indicates congestion and loss; however, this is false. Interface throughput measures the number of bits or packets through an interface over a polling period, usually 5 minutes or more. Utilization is then calculated by dividing that throughput by the interface speed. This means that every data point is an average over the polling period. The problem is that the interface queue only holds 300 milliseconds of traffic, so a five-minute average spans 1,000 times the queue length. Traffic is bursty, so you really have no idea what is happening within that 300 ms window. Even if you increase the polling rate, the statistics are only updated every 2-3 seconds. If you were polling every 5 seconds (which would be burdensome), the polling interval would still be roughly 17 times the queue length. Furthermore, utilization doesn’t tell you whether there is congestion or discards; high utilization only implies congestion under a certain set of assumptions.
Any relationship between interface utilization and congestion is an educated guess at best. Due to the polling frequency, it’s not a very accurate guess.
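To make the averaging problem concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a 100 Mbit/s link, a queue that holds 300 ms of traffic, a 100 ms simulation tick, and a made-up load pattern (a one-second burst at three times line rate every 30 seconds, 20% load otherwise). It is not a model of any particular device.

```python
# Illustrative numbers only (all assumed, none taken from a real device).
LINK_BPS = 100_000_000                       # 100 Mbit/s link
TICK_MS = 100                                # simulation step
BITS_PER_TICK = LINK_BPS * TICK_MS // 1000   # bits the link can send per tick
QUEUE_LIMIT_BITS = LINK_BPS * 300 // 1000    # queue holds 300 ms of traffic
POLL_SECONDS = 300                           # one 5-minute polling interval


def utilization(bits_sent, interval_s, link_bps):
    """Average utilization over one polling interval (0.0 to 1.0)."""
    return bits_sent / (interval_s * link_bps)


def offered_bits(tick):
    """Hypothetical load: 3x line rate for 1 s every 30 s, otherwise 20%."""
    second = tick * TICK_MS // 1000
    if second % 30 == 0:
        return 3 * BITS_PER_TICK
    return BITS_PER_TICK // 5


def simulate(poll_seconds=POLL_SECONDS):
    """Return (bits_sent, bits_dropped) for one polling interval."""
    queue = sent = dropped = 0
    for tick in range(poll_seconds * 1000 // TICK_MS):
        queue += offered_bits(tick)          # arrivals join the queue
        out = min(queue, BITS_PER_TICK)      # link drains at its fixed rate
        queue -= out
        sent += out
        if queue > QUEUE_LIMIT_BITS:         # overflow is tail-dropped
            dropped += queue - QUEUE_LIMIT_BITS
            queue = QUEUE_LIMIT_BITS
    return sent, dropped


sent, dropped = simulate()
avg_util = utilization(sent, POLL_SECONDS, LINK_BPS)
print(f"average utilization: {avg_util:.1%}, bits dropped: {dropped:,}")
```

Under these assumptions the five-minute average comes out at roughly 24%, yet the queue overflows during every burst. A utilization graph for this interval would look healthy while the link is discarding traffic.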
The most common approach to congestion management is to implement a differentiated services model, also known as Class of Service (CoS) or Quality of Service (QoS). This provides a mechanism to reduce congestion and improve performance of loss- and jitter-sensitive applications while allowing you to more fully utilize available bandwidth. This improves both performance and efficiency.
DiffServ also complicates the problem described above. If we don’t know how much congestion there was, or how many discards were due to insufficient bandwidth, without CoS, how much less do we know when we add the complexity of a differentiated services model?
Almost all routers use an algorithm called Random Early Detection (RED) to improve the performance of TCP traffic. This algorithm discards certain TCP packets before congestion occurs to improve TCP performance. This is another factor that adds to the complexity of monitoring congestion.
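For readers unfamiliar with RED, the sketch below shows the simplified textbook form of its drop decision: a smoothed average queue depth is compared against a minimum and maximum threshold, and the drop probability rises linearly between them. The function names, the default weight, and the thresholds used in the example are assumptions for illustration; real router implementations differ in detail and add refinements that are omitted here.

```python
def ewma_queue(avg, sample, weight=0.002):
    """Smooth the instantaneous queue depth with an exponentially weighted
    moving average, so that short bursts do not trigger early drops."""
    return (1 - weight) * avg + weight * sample


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Simplified RED drop probability for a given average queue depth.

    Below min_th nothing is dropped; at or above max_th everything is
    dropped; in between the probability rises linearly up to max_p.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical thresholds: start dropping at 20 packets, drop all at 40.
for depth in (10, 30, 40):
    print(depth, red_drop_probability(depth, min_th=20, max_th=40, max_p=0.1))
```

The point for monitoring is that the drops in the middle region are deliberate: they happen while the queue still has room, so they should not be counted the same way as drops from a full queue.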
In a perfect world, using DiffServ and RED, we could maximize interface utilization while minimizing discards and jitter. Ideally this means 100% utilization and good performance. Of course, nothing is ideal, but where is that sweet spot for utilization? Can we get the most value from bandwidth and still not sacrifice service? This is the goal, but measuring utilization will never tell us how well we’re progressing toward that goal or how much congestion there is, because we really need to know about queuing.
The solution is to measure the number of discards in each output queue, along with the throughput of each queue. These metrics are available through SNMP on most full-featured platforms, but they are not typically collected out of the box by performance management systems.
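Whatever collector is used, queue statistics arrive as ever-increasing counters, so the useful values are the differences between two polls. The sketch below shows that arithmetic only. It does not perform any SNMP requests, and the dictionary keys (`octets`, `discards`) are hypothetical names standing in for whichever per-queue counters your platform exposes.

```python
def counter_delta(previous, current, bits=64):
    """Difference between two readings of a monotonically increasing
    counter, allowing for a single wrap of a 32- or 64-bit counter."""
    if current >= previous:
        return current - previous
    return current + (1 << bits) - previous


def queue_rates(prev, curr, interval_s, bits=64):
    """Turn two polls of one queue into throughput and discard rates.

    prev and curr are dicts with hypothetical keys 'octets' and
    'discards' holding the raw counter values from each poll.
    """
    octets = counter_delta(prev["octets"], curr["octets"], bits)
    discards = counter_delta(prev["discards"], curr["discards"], bits)
    return {
        "throughput_bps": octets * 8 / interval_s,
        "discards_per_s": discards / interval_s,
    }


# Two made-up polls of the same queue, taken 60 seconds apart.
first = {"octets": 1_000, "discards": 0}
second = {"octets": 3_751_000, "discards": 60}
print(queue_rates(first, second, interval_s=60))
```

Unlike an averaged utilization figure, a discard counter does not lose information between polls: every drop that occurred during the interval is reflected in the delta.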
Measuring queue discards answers the fundamental question. These discards are even separated into RED discards (not necessarily a congestion indicator) and tail drop discards, which indicate loss due to insufficient bandwidth. Furthermore, systems implementing CoS can measure discards by class of service. Discards in Best Effort, for example, are not necessarily bad; the service was designed to accommodate a certain rate of discard in some classes.
The other advantage is that this provides not only insight into operations, but also the data needed for engineering. When the Class of Service model was designed, the service classifications, queuing methods, and queue capacities were chosen to meet certain criteria. How well are they meeting those criteria? Queue throughput and discard data should be used to evaluate the performance of the CoS design on a regular basis and can be leveraged to improve that design.
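One way to put this into practice is to compare each class's tail-drop ratio against the loss target it was designed for, while keeping RED drops out of the comparison. The sketch below assumes per-class counts have already been collected for some interval; the class names and target values are invented for illustration and are not recommendations.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    cos_class: str
    packets_sent: int
    red_drops: int
    tail_drops: int


# Hypothetical design targets: maximum acceptable tail-drop ratio per class.
TAIL_DROP_TARGETS = {"voice": 0.0, "business": 0.001, "best-effort": 0.02}


def tail_drop_ratio(sample):
    """Tail drops as a fraction of all packets offered to the queue."""
    offered = sample.packets_sent + sample.red_drops + sample.tail_drops
    return sample.tail_drops / offered if offered else 0.0


def classes_out_of_target(samples, targets=TAIL_DROP_TARGETS):
    """Return the classes whose tail-drop ratio exceeds the design target.

    RED drops are deliberately not compared against the target, since
    they are not necessarily a sign of insufficient bandwidth.
    """
    return [s.cos_class for s in samples
            if tail_drop_ratio(s) > targets[s.cos_class]]


samples = [
    QueueSample("voice", packets_sent=10_000, red_drops=0, tail_drops=5),
    QueueSample("business", packets_sent=100_000, red_drops=200, tail_drops=50),
    QueueSample("best-effort", packets_sent=50_000, red_drops=1_000, tail_drops=500),
]
print(classes_out_of_target(samples))
```

With these made-up numbers only the voice class is flagged, even though best-effort discards far more packets in absolute terms, because best-effort is still within the loss rate it was designed to tolerate.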
If you’re looking at implementing capacity planning or hiring someone to do capacity planning, there are a few things you should consider.
Capacity planning should be an ongoing part of the lifecycle of any network (or any IT service for that matter). The network was designed to meet a certain capacity, knowing that demand may grow as the network gets larger and supports more users and services. There are several ways to go about this, and the best approach depends on your situation. There should be some fairly specific plans on how to measure utilization, forecast, report, make decisions, and increase or decrease capacity. There are also many aspects to capacity. Link utilization is one obvious capacity limitation, but processor utilization may not be so obvious, and where VPNs are involved there are logical limits to the volume of traffic that can be handled by each device. There are also physical limitations such as port and patch panel connections, power consumption, UPS capacity, etc. These should all be addressed as an integral part of the network design, and if they have been overlooked, the design needs to be re-evaluated in light of the capacity management program. There are also the programmatic aspects: frequency of evaluation, control gates, decision points, who to involve where, etc. This is all part of the lifecycle.
There is a wide variety of tools available for capacity planning and analysis. Which ones you select will be determined by the approach you’re taking to manage capacity, how the data is to be manipulated, reported, and consumed, as well as architectural factors such as hardware capabilities, available data, and other network management systems in use. One simple approach is to measure utilization through SNMP and use linear forecasting to predict future capacity requirements. This is very easy to set up, but doesn’t provide the most reliable results. A much better approach is to collect traffic data, overlay it on a dynamic model of the network, then use failure analysis to predict capacity changes as a result of limited failures. This can be combined with linear forecasting; however, failure scenarios will almost always be the determining factor. Many organizations use QoS to prioritize certain classes of traffic over others. This adds yet another dimension to the workflow. There is also traffic engineering design, third-party and carrier capabilities, and the behavior of the services supported by the network. It can become more complicated than it might appear at first glance.
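The simple linear-forecasting approach mentioned above amounts to fitting a straight line through historical measurements and extrapolating to a planning threshold. The sketch below does this with an ordinary least-squares fit; the monthly utilization figures and the 70% threshold are invented for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(threshold, slope, intercept):
    """Period index at which the fitted line reaches the threshold,
    or None if the trend is flat or decreasing."""
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Made-up monthly peak utilization (%) for months 0 through 5.
months = [0, 1, 2, 3, 4, 5]
peak_util = [40, 42, 44, 46, 48, 50]

slope, intercept = fit_line(months, peak_util)
print(f"growth per month: {slope:.2f} points")
print(f"reaches 70% at month: {periods_until(70, slope, intercept):.1f}")
```

A fit like this only extrapolates the trend of the input metric. It says nothing about how traffic shifts when a link or node fails, which is why failure analysis tends to drive the final capacity decision.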
Some understanding of the technologies is necessary to evaluate the data and make recommendations on any changes. If dynamic modeling is used to forecast, that requires another set of skills. The tools may produce much of the reporting; however, some analysis will need to be captured in a report that is evaluated by other parts of the organization, which requires communication and presentation skills.
It’s highly unlikely that the personnel responsible for defining the program, gathering requirements, selecting COTS tools, writing middleware, and implementing all this will be the same as those who use the tools, produce the reports, or even read and evaluate the reports. The idea of “hiring a capacity management person” to do all this isn’t really feasible. Those with the skills and motivation to define the program and/or design and implement it will not likely be interested in operating the system or creating the reports. One approach is to bring in someone with the expertise to define the approach, design and implement the tools, then train the personnel who will be using them. These engagements are usually relatively short and provide great value.