Choosing metrics that provide insight into the health and performance of systems and processes can be challenging. Metrics need to be aligned with the requirements of the systems and processes that they support. While many performance management systems provide useful metrics out of the box, you will undoubtedly have to define others yourself and determine a means to collect and report them.
I break metrics down into two major categories: strategic and operational.
Strategic metrics provide a broad insight into a service’s overall performance. These are the type of metrics that are briefed at the manager’s weekly meeting. They usually aren’t directly actionable, but are very useful for trending.
Strategic metrics should be used to evaluate the overall effect of process or system improvements. Healthy organizations are involved in some manner of Deming-style continuous process improvement (CPI), which also applies to system/service design. As changes are implemented, metrics are monitored to determine whether the changes improved the system or process as expected.
Some examples of strategic metrics are system availability, homepage load time, and incidents identified through ITSM vs. those identified by customers. These provide a high-level indicator of performance more closely related to business objectives than to specific system or process operation and design criteria.
Operational metrics provide detail and are useful for identifying service disruptions and problems, for capacity planning, and for finding areas for improvement. These metrics are often directly actionable. Operations can use these metrics to proactively identify potential service disruptions, isolate the cause of a problem, and evaluate the effectiveness of the team. Engineering uses these metrics to determine if the service design is meeting the design requirements, identify areas for design improvements, and provide data necessary for planning new services and upgrades.
Good metrics should be aligned with operational factors that indicate the health of the service and the design requirements. Metrics, just like every other aspect of a system design, are driven by requirements. The specific design requirements and criteria should be used to define metrics that measure how that aspect of the service is meeting the specified design objective. Historical metrics are valuable for baselining performance and can be used to configure thresholds or serve as a historical reference in problem isolation and forecasting.
For example, if you have employed a differentiated services strategy, you should be monitoring the traffic volume and queue discards for each class of service you’ve defined. This will help you understand if your traffic projections are accurate and the QoS design is meeting the system requirements. Historical data can help identify the traffic trends behind a change and determine if it was due to growth, a new application or service, or a “Mother’s Day” traffic anomaly.
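As a rough illustration of using historical data to set thresholds for per-class queue discards, here is a minimal Python sketch. The class names, the sample values, and the choice of "mean plus three standard deviations" as the threshold are all assumptions made for the example, not a recommendation for any particular network.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Threshold derived from history: mean plus k sample standard deviations."""
    return mean(history) + k * stdev(history)


def discard_ratio(discards, packets):
    """Fraction of packets discarded by a queue during one polling interval."""
    return discards / packets if packets else 0.0


def classes_over_baseline(history, latest, k=3.0):
    """Return the classes whose latest discard ratio exceeds their baseline."""
    flagged = []
    for cls, (discards, packets) in latest.items():
        if discard_ratio(discards, packets) > baseline_threshold(history[cls], k):
            flagged.append(cls)
    return flagged


# Hypothetical historical discard ratios per class of service.
history = {
    "voice": [0.000, 0.001, 0.000, 0.001, 0.000],
    "best_effort": [0.010, 0.012, 0.011, 0.009, 0.013],
}

# Hypothetical counters from the latest interval: (discards, packets).
latest = {
    "voice": (50, 10_000),
    "best_effort": (120, 10_000),
}

flagged = classes_over_baseline(history, latest)
print(flagged)
```

With these made-up numbers only the voice class is flagged: its discard ratio is small in absolute terms but well outside its own history, while the best-effort class discards more packets yet stays within its normal range.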
Sometimes metrics are more valuable when correlated with other metrics. This is true for both strategic and operational metrics. In such cases it is often useful to create a composite metric.
Google, for example, has a health score composed from page load time and other metrics that is briefed to the senior execs daily. In another example, perhaps the calls between the web front end and the SSO are only of concern if they are not directly related to the number of users connecting. In this case a composite metric may provide operations a key piece of information to proactively identify a potential service disruption or reduce MTTR.
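The web front end and SSO example could be sketched as a simple ratio. This is only an illustration: the function names, the expected ratio of two SSO calls per user, and the 50% tolerance are hypothetical values chosen for the example, and a real system would derive them from its own baseline.

```python
def calls_per_user(sso_calls, active_users):
    """Composite metric: SSO calls per connected user (None if no users)."""
    if active_users == 0:
        return None
    return sso_calls / active_users


def is_anomalous(sso_calls, active_users, expected_ratio, tolerance=0.5):
    """Flag the interval when calls stop tracking the number of users.

    expected_ratio and tolerance are assumed values; tolerance is the
    allowed relative deviation from expected_ratio.
    """
    ratio = calls_per_user(sso_calls, active_users)
    if ratio is None:
        # SSO calls with nobody connected are suspicious in themselves.
        return sso_calls > 0
    return abs(ratio - expected_ratio) > tolerance * expected_ratio


# Call volume doubles because the user count doubled: not a concern.
print(is_anomalous(4000, 2000, expected_ratio=2.0))  # False
# Call volume jumps while the user count stays flat: worth a look.
print(is_anomalous(9000, 1000, expected_ratio=2.0))  # True
```

Neither raw counter would trigger a sensible alert on its own; the ratio is what separates ordinary growth from a front end that is hammering the SSO.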
Few performance management systems have the capability to create composite metrics within the application. There are always ways around that, but they usually involve writing custom glueware.
Metrics should have a specific purpose. The consumers of the metrics should find value in the data – both the data itself and the way it is presented. Like every aspect of the service, metrics should be in a Demingesque continual improvement cycle. Metric definitions, the mechanism to collect them, and how they are communicated to their audience need to be constantly evaluated.
Metrics often become useless if the metric becomes the process objective. Take the time to resolve an incident, for example. This metric can provide valuable insight into the effectiveness of the operations staff and processes; however, it seldom does. This is because most operations managers know this and continually press their staff to close tickets as soon as possible to keep MTTR low. The objective of the operations process is not to close tickets quickly, but to support customer satisfaction by maintaining the service. Because the metric becomes the objective, it loses its value. This is difficult enough to address when the service is managed in-house, but it is even more troublesome when the service is outsourced. Operations SLAs often specifically address MTTR. If the service provider is contractually obligated to keep MTTR low, they will focus on closing tickets even if the issue remains unresolved.
If you’re looking at implementing capacity planning or hiring someone to do capacity planning, there are a few things you should consider.
Capacity planning should be an ongoing part of the lifecycle of any network (or any IT service for that matter). The network was designed to meet a certain capacity, knowing that demand may grow as the network gets larger and/or supports more users and services. There are several ways to go about this, and the best approach depends on your situation. There should be some fairly specific plans on how to measure utilization, forecast, report, make decisions, and increase or decrease capacity. There are also many aspects to capacity. Link utilization is one obvious capacity limitation, but processor utilization may not be so obvious, and where VPNs are involved there are logical limits to the volume of traffic that can be handled by each device. There are also physical limitations such as port and patch panel connections, power consumption, UPS capacity, etc. These should all be addressed as an integral part of the network design, and if any have been overlooked, the design needs to be re-evaluated in light of the capacity management program. There are also the programmatic aspects: frequency of evaluation, control gates, decision points, who to involve where, etc. This is all part of the lifecycle.
There is a wide variety of tools available for capacity planning and analysis. Which ones you select will depend on the approach you’re taking to manage capacity, how the data is to be manipulated, reported, and consumed, as well as architectural factors such as hardware capabilities, available data, and other network management systems in use. One simple approach is to measure utilization through SNMP and use linear forecasting to predict future capacity requirements. This is very easy to set up, but doesn’t provide the most reliable results. A much better approach is to collect traffic data, overlay it on a dynamic model of the network, then use failure analysis to predict capacity changes as a result of limited failures. This can be combined with linear forecasting; however, failure scenarios will almost always be the determining factor. Many organizations use QoS to prioritize certain classes of traffic over others. This adds yet another dimension to the workflow. There is also traffic engineering design, third party and carrier capabilities, and the behavior of the services supported by the network. It can become more complicated than it might appear at first glance.
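To make the simple approach concrete, here is a minimal sketch of linear forecasting over utilization samples. It assumes the samples have already been collected (for instance, weekly peak link utilization derived from SNMP interface counters); the data, the 80% planning threshold, and the function names are hypothetical. It fits an ordinary least-squares line and estimates how long until the trend crosses the threshold.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until(threshold, xs, ys):
    """Periods after the last sample until the trend line reaches threshold.

    Returns None when the trend is flat or decreasing, and 0.0 when the
    trend line is already at or above the threshold.
    """
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical weekly peak utilization (percent) for one link.
weeks = [0, 1, 2, 3, 4]
utilization = [40.0, 42.0, 44.0, 46.0, 48.0]

remaining = periods_until(80.0, weeks, utilization)
print(remaining)
```

This kind of forecast only extrapolates the existing trend on a single link. It says nothing about where traffic lands when a link or node fails, which is why the failure-analysis approach described above tends to drive the actual capacity decisions.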
Some understanding of the technologies is necessary to evaluate the data and make recommendations on any changes. If dynamic modeling is used to forecast, that requires another set of skills. The tools may produce much of the reporting; however, some analysis will need to be captured in a report that will be evaluated by other elements in the organization, which requires communication and presentation skills.
It’s highly unlikely that the personnel responsible for defining the program, gathering requirements, selecting COTS tools, writing middleware, and implementing all this will be the same as those who use the tools or produce the reports, or maybe even those who read the reports and evaluate them. The idea of “hiring a capacity management person” to do all this isn’t really feasible. Those with the skills and motivation to define the program and/or design and implement it will not likely be interested in operating the system or creating the reports. One approach to this is to bring in someone with the expertise to define the approach, design and implement the tools, then train the personnel who will be using them. These engagements are usually relatively short and provide great value.