Choosing metrics that provide insight into the health and performance of systems and processes can be challenging. Metrics need to be aligned with the requirements of the systems and processes that they support. While many performance management systems provide useful metrics out of the box, you will undoubtedly have to define others yourself and determine a means to collect and report them.
I break metrics down into two major categories: strategic and operational.
Strategic metrics provide a broad insight into a service’s overall performance. These are the type of metrics that are briefed at the manager’s weekly meeting. They usually aren’t directly actionable, but are very useful for trending.
Strategic metrics should be used to evaluate the overall effect of process or system improvements. Healthy organizations are involved in some manner of Deming-style continuous process improvement (CPI), which also applies to system and service design. As changes are implemented, metrics are monitored to determine whether the changes improved the system or process as expected.
Some examples of strategic metrics are system availability, homepage load time, and the number of incidents identified through ITSM versus those identified by customers. These provide a high-level indicator of performance that is more closely related to business objectives than to specific system or process operation and design criteria.
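As a rough illustration, two of these strategic metrics can be computed from a simple incident log. This is only a sketch: the `Incident` record, its `source` values ("monitoring" or "customer"), and the function names are hypothetical and would need to be mapped onto whatever your ITSM tool actually exports.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record; "source" is who first noticed the problem.
    source: str               # "monitoring" or "customer"
    downtime_minutes: float   # service downtime attributed to the incident


def availability_pct(incidents, period_minutes):
    """Percentage of the reporting period during which the service was up."""
    down = sum(i.downtime_minutes for i in incidents)
    return 100.0 * (period_minutes - down) / period_minutes


def monitoring_detection_pct(incidents):
    """Share of incidents caught by monitoring rather than reported by customers."""
    if not incidents:
        return None  # no incidents, so the ratio is undefined
    caught = sum(1 for i in incidents if i.source == "monitoring")
    return 100.0 * caught / len(incidents)
```

Trended week over week, both numbers say little about what to fix, but they do show whether the service is getting better or worse.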
Operational metrics provide detail and help identify service disruptions, diagnose problems, plan capacity, and find areas for improvement. These metrics are often directly actionable. Operations can use these metrics to proactively identify potential service disruptions, isolate the cause of a problem, and evaluate the effectiveness of the team. Engineering uses these metrics to determine if the service design is meeting the design requirements, identify areas for design improvements, and provide data necessary for planning new services and upgrades.
Good metrics should be aligned with operational factors that indicate the health of the service and the design requirements. Metrics, just like every other aspect of a system design, are driven by requirements. The specific design requirements and criteria should be used to define metrics that measure how that aspect of the service is meeting the specified design objective. Historical metrics are valuable for baselining performance and can be used to configure thresholds or as a historical reference in problem isolation and forecasting.
For example, if you have employed a differentiated services strategy, you should be monitoring the traffic volume and queue discards for each class of service you’ve defined. This will help you understand whether your traffic projections are accurate and the QoS design is meeting the system requirements. Historical data can help identify the traffic trends behind a change in those numbers and determine whether it was due to growth, a new application or service, or a “Mother’s Day” traffic anomaly.
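One simple way to turn historical data into a threshold is to flag any class of service whose current queue discards exceed the historical mean by some number of standard deviations. The sketch below assumes you already have per-class discard counts per polling interval; the class names, the `k=3.0` default, and the function names are illustrative assumptions, not a recommendation for any particular platform.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Alert threshold: historical mean plus k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * stdev(history)


def classes_over_threshold(history_by_class, current_by_class, k=3.0):
    """Return the classes of service whose current discards exceed their baseline."""
    flagged = []
    for cos, history in history_by_class.items():
        # A class missing from the current sample is treated as zero discards.
        if current_by_class.get(cos, 0) > baseline_threshold(history, k):
            flagged.append(cos)
    return flagged
```

A mean-plus-deviation baseline is only one option; it ignores daily and weekly seasonality, so a real deployment would probably compare against the same hour on previous days or weeks.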
Sometimes metrics are more valuable when correlated with other metrics. This is true for both strategic and operational metrics. In such cases it is often useful to create a composite metric.
Google, for example, has a health score composed of page load time and other metrics that is briefed to the senior execs daily. In another example, perhaps the volume of calls between the web front end and the SSO is only of concern when it does not track the number of users connecting. In this case a composite metric may provide operations a key piece of information to proactively identify a potential service disruption or reduce MTTR.
Few performance management systems have the capability to create composite metrics within the application. There are always ways around that, but they usually involve writing custom glueware.
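The glueware for the SSO example can be quite small. The sketch below normalizes SSO calls by connected users and flags the ratio when it drifts too far from an expected value. The function names, the `expected_ratio` parameter, and the 50% default tolerance are all assumptions made for illustration; in practice the expected ratio would come from your own historical baseline.

```python
def calls_per_user(sso_calls, connected_users):
    """Composite metric: SSO calls normalized by connected users."""
    if connected_users <= 0:
        return None  # ratio is undefined with no users connected
    return sso_calls / connected_users


def ratio_is_abnormal(sso_calls, connected_users, expected_ratio, tolerance=0.5):
    """True when calls per user deviates from expected_ratio by more than
    the given fraction (tolerance) of expected_ratio."""
    ratio = calls_per_user(sso_calls, connected_users)
    if ratio is None:
        # Any SSO traffic with no users connected is suspicious on its own.
        return sso_calls > 0
    return abs(ratio - expected_ratio) > tolerance * expected_ratio
```

With this composite, a tenfold rise in SSO calls that is matched by a tenfold rise in users stays quiet, while the same rise in calls with a flat user count is flagged.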
Metrics should have a specific purpose. The consumers of the metrics should find value in the data – both the data itself and the way it is presented. Like every aspect of the service, metrics should be in a Demingesque continual improvement cycle. Metric definitions, the mechanism to collect them, and how they are communicated to their audience need to be constantly evaluated.
Metrics often become useless if the metric becomes the process objective. Take the time to resolve an incident, for example. This metric can provide valuable insight into the effectiveness of the operations staff and processes; however, it seldom does. This is because most operations managers know that the metric is used this way and continually press their staff to close tickets as soon as possible to keep MTTR low. The objective of the operations process is not to close tickets quickly, but to support customer satisfaction by maintaining the service. Because the metric becomes the objective, it loses its value. This is difficult enough to address when the service is managed in-house, but it is even more troublesome when the service is outsourced. Operations SLAs often specifically address MTTR. If the service provider is contractually obligated to keep MTTR low, they will focus on closing tickets even if the issue remains unresolved.
Change Management is an important function in most organizations. It carries more weight than many of the other ITIL functions because it’s the biggest pain point. A commonly cited figure is that upwards of 80% of all outages are self-inflicted. IT managers are constantly getting heat over deployments that didn’t go exactly as planned. When you boil that down to lost productivity or missed business opportunities, it amounts to a sizable amount of money. These are just some of the reasons Change Management gets so much well-deserved attention.
So, you establish a Change Advisory Board. There is a lot of preparation and documentation that has to go into any change before it’s presented to the board for approval. Each change is categorized, analyzed, scrutinized, until everyone involved is thoroughly mesmerized. The time required to get a change approved may also have increased five-fold. The process is controlled through some rather expensive management software, well documented, well planned, and hopefully well executed.
The question is: After putting all this effort into the Change Management process, expending resources on additional planning and documentation, spending all that time in meetings, and prolonging the time required to get a task accomplished, did CM reduce service disruptions and save more money than was invested in the process? Let’s face it; if not, then throw the whole thing out and go back to shootin’ from the hip.
The stakeholders on the board probably didn’t review the detailed documentation that was prepared. There are probably only a few people in the entire organization who will ever read it. The stakeholders only have a few important questions: why is this change necessary, what’s the impact, who or what will be affected, what are the risks, are they adequately mitigated, and is there a viable back-out process? There are probably a few key people in each business unit who could review the implementation details and provide their respective stakeholders with a recommendation and/or a list of concerns and remediations.
Is the CAB keeping any metrics? Are you aware of how many changes of each category are being implemented? Were they on schedule? Were the impacts more or less than expected? Is there a way in your incident reporting system to relate an incident caused by a change back to that change? Is all this management making an improvement, or have you just spent more resources managing with no real gain? When you make a change to the process, does it streamline the process and/or improve the results?
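Answering those questions does not take much once each change record carries a category, a schedule outcome, and links to any incidents traced back to it. The following is a minimal sketch under that assumption; the `Change` record, its field names, and the category labels are hypothetical rather than taken from any particular ITSM product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Change:
    change_id: str
    category: str       # e.g. "standard", "normal", "emergency"
    on_schedule: bool   # implemented within its approved window
    incident_ids: list = field(default_factory=list)  # incidents traced to this change


def cab_summary(changes):
    """Summarize change volume, schedule adherence, and change-caused incidents."""
    total = len(changes)
    on_time = sum(1 for c in changes if c.on_schedule)
    caused_incident = sum(1 for c in changes if c.incident_ids)
    return {
        "changes_by_category": dict(Counter(c.category for c in changes)),
        "on_schedule_pct": 100.0 * on_time / total if total else None,
        "change_failure_pct": 100.0 * caused_incident / total if total else None,
    }
```

Comparing these numbers before and after a process change is one way to see whether the added overhead is actually reducing change-related incidents.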
Change Management is good. CM in the context of the ITIL framework is excellent … but we must always keep focused on the end objective – becoming more efficient and effective. CM for the sake of CM is a common ill and needs to be tempered with CS (common sense).