I’ve been engineering, maintaining, and managing network and IT systems for numerous organizations, large and small, for longer than I care to elaborate on. In all but three cases, all of which were small with little structure, the organization had a change management process that held to most of the ITIL process framework to some extent. All had many of the service operations processes and were working to improve what was lacking. All the CTO/ITOs understood the value of service transition and service operations processes. However, few had a service catalog or any of the ITIL service design processes when I first began working with them, and nobody really gave it much thought. Almost all of the articles and discussions on the internet related to ITIL are about service transition or operation. Rarely, if ever, is anything written about service design or strategy. Half of the ITIL service lifecycle gets all the attention, while the other half is relegated to the academic exercise of certification. If it isn’t important or necessary, why is it there?
Perhaps it’s that most people who are proponents of the ITIL framework have either an operations or project management background and don’t understand the engineering process and how it relates to ITIL. How many engineers or developers do you know who embrace ITIL? Most see it as a burden forced upon them by PMs and operations. Isn’t the CAB a stick used to pummel engineering into writing documents that nobody, not even the installer, will ever read? What if they saw themselves as a vital part of the service lifecycle and key to cost saving measures?
Service transition and operations are where cost recovery is most evident. I’ve often heard it said that service operations and transition are where ITIL realizes its ROI. I argue that a well-executed service design provides even more ROI than operations and transition, though that return does not show up in the design part of the lifecycle.
To illustrate this point, consider an auto manufacturer. A lot goes into the design of the auto. The customer doesn’t see the design or the manufacturing process, but they see the O&M processes. Do you know anyone who had a vehicle that was always needing repair? The repair was costly, but how much of the need for repair was due to poor design? I had a Bronco II that would constantly break the ring gear, which often led to a transmission rebuild. The aluminum ring gear couldn’t handle the torque and would rip into pieces. I had several friends who owned minivans that would throw a transmission every 30,000 miles. It wasn’t bad luck or abuse; it was a bad design. The manufacturer fixed that in later years, but it gave them a bad reputation and caused sales of that model to fall. How about recalls? They are very costly. First there’s the issue of diagnosing a problem with a product that’s already in the field, then the redesign, and then the retrofit. The point I’m trying to illustrate is that design flaws are very costly, but that cost shows up in the operations and transition part of the lifecycle, not the design stage.
The Rogers Cellular outage in Oct 2013 is one example. Rogers has not had a very good record for service and availability. They suffered an outage impacting their entire network for a few hours that made national news. How do you suppose this outage affected sales? An inadequate design can have some very expensive consequences.
The business case for change management is built on reducing the cost associated with service disruptions due to change. While change management is good, the real problem is unexpected service disruption as a result of change. Planned service disruption can be scheduled to minimize the impact on customers. It’s the unintended consequences that are the trouble. A well-executed service design process produces a transition plan that correctly identifies the impact of the change (change evaluation). Change management has nothing to do with this; in fact, change management relies on this being done correctly. A large part of what most organizations are using change management to correct isn’t even addressed by change management; it’s addressed by service design. This may be counter-intuitive, but it’s true nonetheless.
CloudFlare made the news when they experienced an hour-long outage affecting their entire worldwide network in March 2013. This outage was due to a change that caused the Juniper routers to become resource starved after a firewall rule was applied. Juniper received a bad rap for this; however, it was the network engineering team at CloudFlare that was to blame, not Juniper. Although the outage was triggered by a JUNOS bug, Juniper had identified the bug and released a patch in October 2012. CloudFlare made a change to the network (a service design change) that was released immediately to ward off a DDoS attack (this would be a service patch in ITIL terms). The change was not tested adequately and the behavior was not as expected. It was the service design process at fault here, and there was nothing in the change management process to check this. This is because change management attempts to manage the risk associated with change by controlling “how” the change is executed. Change management does nothing with the content of the change. It is presupposed that the “what” being changed has been adequately designed and tested as part of the service design process.
IT services such as Windows domain, Exchange, and databases seldom bear any resemblance to an engineering practice; they typically function as a product implementation center. Implementing a well-defined service design program requires a major paradigm shift for an organization. Most organizations don’t approach IT engineering with the same discipline found in other areas of industry. Because most networks and IT systems are composed of COTS products that need relatively little configuration, the configuration details of those COTS products, and how they integrate with other systems, are not viewed as a system that should be subject to the same engineering processes as any other system that needs to be designed. This is a very costly assumption.
In mathematics, a limit is the value that a function “approaches” as the input approaches some value. Limits are essential to mathematical analysis and are used to define continuity, derivatives, and integrals. If we were to take the limit of service availability as service design approaches perfection, we would see no unexpected outcomes during service transition: everything would behave exactly as expected, eliminating the cost of unintended service disruptions due to change. The service would operate perfectly, and incidents would be reduced to component failures occurring at the rate predicted by each component’s MTBF. There would be no problems requiring workarounds or resolutions. This would greatly increase service availability and performance and produce a substantial ROI.
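To make the limit argument concrete, here is a rough sketch in Python. It assumes the common steady-state approximation of availability as MTBF / (MTBF + MTTR) and simply subtracts the fraction of the year lost to design-induced outages. The function name, the parameters, and every number below are hypothetical and chosen only for illustration; they are not measurements from any real network.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours, mttr_hours, design_outage_hours_per_year=0.0):
    """Approximate service availability as a fraction between 0 and 1.

    mtbf_hours / (mtbf_hours + mttr_hours) is the usual steady-state figure
    for component failures. Subtracting the design-induced downtime fraction
    is a simplification made for this sketch, not a standard formula.
    """
    inherent = mtbf_hours / (mtbf_hours + mttr_hours)
    design_loss = design_outage_hours_per_year / HOURS_PER_YEAR
    return inherent - design_loss


# Hypothetical figures: the same hardware with progressively better design.
for outage_hours in (8.76, 4.0, 1.0, 0.0):
    print(outage_hours, availability(999.0, 1.0, outage_hours))
```

As the design-induced outage hours shrink toward zero, the result approaches the figure set by component failures alone, which is the limit described above.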
Network instrumentation is another area where the design is seldom on target. Most networks are poorly instrumented until after they’ve grown to a point where the lack of visibility is causing problems. This applies to event management: what events are being trapped or logged, and how those events are filtered, enriched, and/or suppressed in the event management system to provide the NOC with the most useful data. It also applies to performance management: what metrics are collected, how thresholds are set, and how metrics are correlated to produce indicators useful to the consumers of that data. It also applies to traffic statistics such as NetFlow or similar technologies. This should all be part of the network design from the beginning, because the network has to be maintained as part of the lifecycle.
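One way to treat instrumentation as a design deliverable is to write the event handling rules and metric thresholds down as data at design time, rather than tuning them after the fact. The following is a minimal sketch of that idea in Python. The class names, event strings, rule actions, and threshold values are all hypothetical; it does not represent any particular event management product or its API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EventRule:
    """Hypothetical rule describing how one class of event is handled."""
    match: str       # substring to look for in the raw event text
    action: str      # "forward", "suppress", or "enrich"
    note: str = ""   # context appended to the event when action == "enrich"


@dataclass
class Threshold:
    """Hypothetical warning/critical limits for one collected metric."""
    metric: str
    warning: float
    critical: float

    def classify(self, value: float) -> str:
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


def process_event(raw: str, rules: List[EventRule]) -> Optional[str]:
    """Apply the first matching rule; return what the NOC sees, or None."""
    for rule in rules:
        if rule.match in raw:
            if rule.action == "suppress":
                return None
            if rule.action == "enrich":
                return f"{raw} [{rule.note}]"
            return raw
    return raw  # events that match no rule are forwarded unchanged


# Illustrative values only; real rules and limits come out of the design work.
RULES = [
    EventRule(match="LINK_FLAP test-port", action="suppress"),
    EventRule(match="BGP_PEER_DOWN", action="enrich", note="check upstream circuit"),
]
LINK_UTILIZATION = Threshold(metric="link_utilization_pct", warning=70.0, critical=90.0)

print(process_event("BGP_PEER_DOWN peer=192.0.2.1", RULES))
print(LINK_UTILIZATION.classify(85.0))
```

Keeping the rules in a reviewable form like this means they can travel with the rest of the design documentation and be tested before the network goes into service.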
The service design aspect of the ITIL service lifecycle is greatly undervalued and often overlooked entirely. Take a look at what ITIL recommends should be in the Service Design Package: the topology is only one of many aspects of the design. Poorly executed service design results in increased costs due to unexpected service disruptions during service transition and decreased system availability due to flaws in the design. All of these require additional effort to identify the problem, compensate for the service disruption, and then rework the design and execute the transition again.
A wise approach to ITSM is to expend the effort in service design rather than expending much more cost to deal with the fallout from a poorly designed network. The engineering staff should see themselves as having a key role in the service design process and know the value they contribute to the entire system lifecycle.