I’ve noticed a trend with almost every company I’ve consulted for: most network engineering does not use abstract design, but instead provisions each element in a concrete, one-off manner. This is not a cost-effective approach, for many reasons.
There is a paradigm associated with designing a network from COTS products that causes network engineering production centers to disregard the conventional engineering process. Consider your automobile. The last time you went to the repair shop, did the mechanic go through the entire car to ensure all the correct parts were installed? Of course not; the VIN told them which build was used, and all cars of that build were identical except for a few items that made them unique. Even the options were identical to those of the same model with those options. This was not done solely for the benefit of the consumer, but because it is the most cost-effective way to manufacture and maintain the vehicle throughout its lifecycle. This principle can be seen in most industries … except IT. Why is compliance software for network systems so popular and valuable? Because network devices are seldom configured according to a standard. Even where they are to some extent, they are configured using templates that have to be applied manually, with the variables entered manually, so they still vary. This would be like automotive engineers assembling cars by hand. It isn’t cost effective, for a number of reasons.
I was demonstrating a proactive change validation process for a large enterprise customer. They provided the configlets and the change documentation for an upcoming change. I modeled the current network, applied the proposed changes in the simulation, and found several errors that would have left the modified network unable to route traffic. They had used templates to create the configlets for the change, but one template was incorrect and some of the variables were populated incorrectly. The change was an upgrade that had been performed at many locations and was standardized to some extent. Had it been implemented as-is, the implementation engineer would have made whatever modifications were necessary to get the system operational, introducing yet more variance.
This paradigm exists in service operations as well as in design and provisioning. Tier III often changes the configuration of a device to resolve an incident. This would be like an auto repair shop altering your automobile's design to fix a problem: it isn’t running correctly, so they install spark plugs that differ from the manufacturer’s specification. The repair shop wouldn’t do that. Why is it commonplace in IT? This is essentially ad-hoc system redesign. If the configuration of a device needs to be changed to resolve a problem, then the system was designed wrong. There are only a few exceptions to this. For example, if a router at a remote site has a hot-spare interface, it is often configured and disabled. In the event of a failure, the spare interface is enabled, and possibly readdressed, to take the place of the failed one. This isn’t really a redesign; it is an operational procedure used in a failure scenario.
This problem has a snowball effect. Because there are so many variations in the network design, there is no feasible way to test design improvements. If there were standardization, each variation of the standard systems and sub-systems could be tested in the lab/QA environment. But because there is really no standard, and too many exceptions to any that exist, the only way to adequately test anything would be to replicate the entire network in QA. Because of this, the rate of unsuccessful changes and unexpected impacts from change is extremely high. Management attempts to mitigate this through more rigorous change management, which cannot solve the problem and only adds delay and effort to the change process. In the end it costs the organization lost productivity due to system downtime, plus unnecessary labor to manage change and to resolve incidents caused by change.
The ITIL Service Design process treats a service, such as a network service, much the way the automotive industry does in the previous examples. When the network is treated as a service that must be subject to the same rigorous engineering process, the result is improved efficiency and a high degree of predictability, which reduces service disruptions caused by unexpected problems encountered during changes. This requires a great deal more engineering effort during the design and release processes, but the ROI is improved availability and reduced effort during implementation. Implementing the release package becomes a turn-key operation that should be performed by the operations or provisioning team rather than engineering. This paradigm shift often takes some time for an organization to grasp and function efficiently in, but it improves performance and efficiency and paves the way toward automated provisioning.
In order to accomplish this, the design must be abstracted in such a manner as to express the level of detail necessary to drive physical assembly and logical provisioning, such as naming, addressing, routing configuration, policy, management configuration, VLAN assignment, etc. This is most certainly possible, because all of these things follow a system of logic; they are not arbitrarily assigned.
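To make this concrete, here is a minimal sketch of deriving device parameters by rule from an abstract site design. The naming convention, addressing plan, and VLAN scheme below are entirely invented for illustration; a real abstraction would carry far more detail.

```python
import ipaddress

# Hypothetical illustration: concrete device parameters derived by rule
# from an abstract design, rather than assigned by hand per device.
def site_design(site_id: int, region: str) -> dict:
    # Each site gets a /24 carved deterministically from 10.0.0.0/8.
    site_net = ipaddress.ip_network(f"10.{site_id // 256}.{site_id % 256}.0/24")
    return {
        "hostname": f"{region.lower()}-rtr-{site_id:04d}",
        "loopback": str(site_net[1]),          # first host address in the /24
        "user_vlan": 100 + (site_id % 50),     # VLAN IDs follow a fixed rule
        "voice_vlan": 200 + (site_id % 50),
    }

print(site_design(37, "EAST"))
```

Because every value is derived, two sites of the same build are identical by construction, and the same rules can feed an automated provisioning system later.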
An example of this can be seen in Windows system deployment and management. In the ’90s, if you wanted to install a Windows server, you would insert a disk into the server and go through an installation process. If you were really on your game, you could create an installer init file that answered most of the questions the install utility would ask. Any custom configuration would need to be accomplished manually, one machine at a time. The advent of system images and group policy provided a means to abstract the system design in a way that lets an enterprise easily provision new systems identically and manage them very efficiently.
While there is no out-of-the-box product that provides a mechanism to abstract the network design in the manner that Windows uses images and GPO, it is certainly not out of reach. The mechanisms to design networks using abstract constructs can be developed/integrated and are worth the effort in large environments.
The larger problem is changing the paradigm. I worked on a project where we developed an Operational Support System (OSS) that provided automated provisioning. The customer entered the service order into a CRM system, which caused the downstream provisioning system to push out all the necessary config changes to provision the service on the network devices. The system development took us 7 years, but it took just as long to change the organizational mindset to be able to see network design using abstract constructs.
The most commonly used metric for capacity planning does not address the fundamental questions that need to be answered to perform capacity planning.
The common assumption is that we need to measure utilization to understand link capacity, determine if there is congestion affecting network service performance, and plan for capacity increases or decreases. This is why all major performance management products measure interface throughput and calculate utilization as a metric. Almost all capacity planning tools also rely on interface throughput or utilization. However, these metrics don't answer those questions.
Before we look at what to measure, let’s discuss the reason why we need to measure it. IP is an asynchronous protocol; therefore, bandwidth is not allocated by circuit or service, but by demand. The physical media, however, delivers at a synchronous, fixed rate. This can be illustrated by cars on a road. The road is only so wide (bandwidth) and can accommodate a certain number of cars at a certain speed. When there are too many cars during rush hour, we have traffic congestion. Likewise, during periods of high demand the fixed-rate circuit may become congested, causing packet loss, jitter, and service degradation. We want to prevent service degradation due to congestion, so we need a means to manage it. The fundamental question we need to answer is: where and when is congestion causing packet loss and service degradation?
The assumption is that utilization indicates congestion and loss; however, this is false. Interface throughput measures the number of bits or packets through an interface over a polling period, usually 5 minutes or more. Utilization is then calculated based on interface speed. This means that every data point is an average over the polling period, which is rarely less than 5 minutes. The problem is that the interface queue only holds about 300 milliseconds of traffic. A five-minute average spans 1,000 times the queue depth. Traffic is bursty, so you really have no idea what is happening within that 300 ms window. Even if you increase the polling rate, the statistics are only updated every 2-3 seconds. If you were polling every 5 seconds (which would be burdensome), the polling interval would still be nearly 17 times the queue depth. Furthermore, utilization doesn’t tell you whether there is congestion or discards; high utilization implies congestion only under a certain set of assumptions.
Any relationship between interface utilization and congestion is an educated guess at best. Due to the polling frequency, it’s not a very accurate guess.
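The scale mismatch between polling averages and queue depth can be shown with a toy simulation: a link reports a modest five-minute average utilization while its queue overflows during bursts. The link speed, buffer depth, and traffic pattern below are invented for illustration.

```python
# Toy model: a 100 Mb/s link with a queue holding 300 ms of traffic,
# fed by bursty traffic (short bursts above line rate, otherwise light).
LINK_BPS = 100e6
QUEUE_BITS = LINK_BPS * 0.300          # 300 ms of buffering = 30 Mb
POLL_SECONDS = 300                     # typical 5-minute SNMP poll

# One second at 150 Mb/s every 10 seconds, otherwise 20 Mb/s.
offered = [150e6 if t % 10 == 0 else 20e6 for t in range(POLL_SECONDS)]

queue = 0.0
dropped = 0.0
sent = 0.0
for rate in offered:
    queue += rate                      # bits arriving this second
    drained = min(queue, LINK_BPS)     # link drains at most line rate
    queue -= drained
    sent += drained
    if queue > QUEUE_BITS:             # tail drop once the buffer is full
        dropped += queue - QUEUE_BITS
        queue = QUEUE_BITS

utilization = sent / (LINK_BPS * POLL_SECONDS)
print(f"5-min average utilization: {utilization:.0%}")
print(f"bits tail-dropped: {dropped:,.0f}")
```

In this scenario the five-minute average works out to about 31% utilization, yet 600 million bits were tail-dropped during the interval. A utilization graph would show a link with plenty of headroom while users experience loss.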
The most common approach to congestion management is to implement a differentiated services model, also known as Class of Service (CoS) or Quality of Service (QoS). This provides a mechanism to reduce congestion and improve performance of loss and jitter sensitive applications while allowing you to more fully utilize available bandwidth. This improves both performance and efficiency.
DiffServ also complicates the problem described above. If we didn’t know how much congestion and loss due to insufficient bandwidth we had without CoS, how much less do we know once we add the complexity of a differentiated services model?
Almost all routers utilize an algorithm called Random Early Detection (RED) to improve the performance of TCP traffic. RED randomly discards TCP packets as the queue builds, before it overflows, which keeps TCP flows from synchronizing and improves aggregate TCP throughput. This is another factor that adds to the complexity of monitoring congestion.
In a perfect world, using DiffServ and RED, we could maximize interface utilization while minimizing discards and jitter. Ideally this means 100% utilization and good performance. Of course, nothing is ideal, but where is that sweet spot for utilization? Can we get the most value from bandwidth and still not sacrifice service? This is the goal, but measuring utilization will never tell us how well we’re progressing toward that goal or how much congestion there is, because we really need to know about queuing.
The solution is to measure the discards in the output queue and the throughput of each queue instead. These metrics are available through SNMP on most full-featured platforms, but they are not typically collected out of the box by performance management systems.
Measuring queue discards answers the fundamental question. These counters even distinguish RED discards (not necessarily a congestion indicator) from tail-drop discards, which indicate loss due to insufficient bandwidth. Furthermore, systems implementing CoS can measure discards by class of service. Discards in Best Effort, for example, are not necessarily bad; the service was designed to accommodate a certain rate of discard in some classes.
This approach not only provides insight into operations, but also provides the data necessary for engineering. When the Class of Service model was designed, the service classifications, queuing methods, and queue capacities were designed to meet certain criteria. How well are they meeting those criteria? Queue throughput and discard data should be used to evaluate the performance of the CoS design on a regular basis and can be leveraged for improvements to that design.
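As a sketch of that evaluation, the snippet below compares per-class counters against design criteria. The class names, thresholds, and counter values are hypothetical; real counters would come from the platform's per-class SNMP drop and throughput objects.

```python
# Hypothetical design criteria: acceptable tail-drop rate per class.
DESIGN_MAX_DROP_RATE = {
    "voice": 0.0,                 # loss-sensitive: zero tolerance
    "business": 0.001,
    "best_effort": 0.05,          # some loss is acceptable by design
}

def evaluate_class(name, sent_pkts, tail_drops, red_drops):
    # Tail drops indicate insufficient bandwidth; RED drops are TCP
    # pacing and are not counted against the design criterion here.
    total = sent_pkts + tail_drops + red_drops
    drop_rate = tail_drops / total if total else 0.0
    ok = drop_rate <= DESIGN_MAX_DROP_RATE[name]
    return drop_rate, ok

rate, ok = evaluate_class("best_effort", sent_pkts=97_000,
                          tail_drops=2_000, red_drops=1_000)
print(f"best_effort tail-drop rate {rate:.1%} within design: {ok}")
```

Run on a schedule, this kind of check turns queue counters into a direct answer to "is the CoS design meeting its criteria?" rather than an inference from utilization.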
Choosing metrics that provide insight into the health and performance of systems and processes can be challenging. Metrics need to be aligned with the requirements of the systems and processes that they support. While many performance management systems provide useful metrics out of the box, you will undoubtedly have to define others yourself and determine a means to collect and report them.
I break metrics down into two major categories: strategic and operational.
Strategic metrics provide a broad insight into a service’s overall performance. These are the type of metrics that are briefed at the manager’s weekly meeting. They usually aren’t directly actionable, but are very useful for trending.
Strategic metrics should be used to evaluate the overall effect of process or system improvements. Healthy organizations are involved in some manner of Deming-style continuous process improvement (CPI), which also applies to system/service design. As changes are implemented, metrics are monitored to determine if the changes improved the system or process as expected.
Some examples of strategic metrics are: system availability, homepage load time, and incidents identified through ITSM vs. those identified by customers. These provide a high level indicator of performance more closely related to business objectives than to specific system or process operation and design criteria.
Operational metrics provide detail and are useful for identifying service disruptions and problems, for capacity planning, and for finding areas for improvement. These metrics are often directly actionable. Operations can use them to proactively identify potential service disruptions, isolate the cause of a problem, and evaluate the effectiveness of the team. Engineering uses them to determine whether the service design is meeting the design requirements, identify areas for design improvement, and provide data necessary for planning new services and upgrades.
Good metrics should be aligned with operational factors that indicate the health of the service and the design requirements. Metrics, just like every other aspect of a system design, are driven by requirements. The specific design requirements and criteria should be used to define metrics that measure how that aspect of the service is meeting the specified design objective. Historical metrics are valuable to baseline performance and can be used to configure thresholds or historical reference in problem isolation and forecasting.
For example, if you have employed a differentiated services strategy, you should be monitoring the traffic volume and queue discards for each class of service you’ve defined. This will help you understand whether your traffic projections are accurate and the QoS design is meeting the system requirements. Historical data can help identify the traffic trends behind a change and determine whether it was due to growth, a new application or service, or a “Mother’s Day” traffic anomaly.
Sometimes metrics are more valuable when correlated with other metrics. This is true for both strategic and operational metrics. In such cases it is often useful to create a composite metric.
Google, for example, has a health score composed from page load time and other metrics that is briefed to the senior execs daily. In another example, perhaps the calls between the web front end and the SSO are only of concern if they are not directly related to the number of users connecting. In this case a composite metric may provide operations a key piece of information to proactively identify a potential service disruption or reduce MTTR.
Few performance management systems have the capability to create composite metrics within the application. There are always ways around that, but they usually involve writing custom glueware.
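Such glueware can be quite small. The sketch below builds a composite metric along the lines of the front-end/SSO example: the call volume matters only relative to the number of connected users, so the composite is a ratio checked against a historical baseline. All names, numbers, and the threshold are hypothetical.

```python
# Composite metric: SSO calls per active user. Raw SSO call volume alone
# is misleading because it scales with user count; the ratio does not.
def sso_calls_per_user(sso_calls: int, active_users: int) -> float:
    return sso_calls / max(active_users, 1)

BASELINE = 3.2                 # established from historical data
THRESHOLD = 2.0 * BASELINE     # alert when the ratio drifts well above it

# (calls, users) samples; the third shows calls decoupling from users.
samples = [(3200, 1000), (6400, 2000), (9000, 1200)]
for calls, users in samples:
    ratio = sso_calls_per_user(calls, users)
    if ratio > THRESHOLD:
        print(f"anomaly: {ratio:.1f} SSO calls/user (baseline {BASELINE})")
```

The first two samples look alarming in absolute terms (call volume doubles, then spikes), but only the third, where calls rise while users fall, trips the composite. That is the kind of signal operations can act on before a disruption.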
Metrics should have a specific purpose. The consumers of the metrics should find value in the data – both the data itself and the way it is presented. Like every aspect of the service, metrics should be in a Demingesque continual improvement cycle. Metric definitions, the mechanism to collect them, and how they are communicated to their audience need to be constantly evaluated.
Metrics often become useless when the metric becomes the process objective. Take time to resolve an incident, for example. This metric can provide valuable insight into the effectiveness of the operations staff and processes; however, it seldom does. That is because most operations managers know this and continually press their staff to close tickets as soon as possible to keep MTTR low. The objective of the operations process is not to close tickets quickly, but to support customer satisfaction by maintaining the service. Because the metric becomes the objective, it loses its value. This is difficult enough to address when the service is managed in-house, but it is even more troublesome when the service is outsourced. Operations SLAs often specifically address MTTR. If the service provider is contractually obligated to keep MTTR low, they will focus on closing tickets even if the issue remains unresolved.
In this article Kevin makes the argument that KPIs and thresholds are inadequate indicators of system faults because of the complexity of the systems. This is especially true with application monitoring systems, but some networks have become complex enough that no one indicator is adequate to provide an actionable alert. Mature event management systems utilize a system of logic that enables correlation between events and event enrichment as well as suppression to produce a cause-effect relationship between events. Root Cause Analysis (RCA) is an underutilized function in many event management systems (even though many have the capability). In most cases the rules and logic have to be developed because many of the necessary relationships are not present out-of-the-box.
In this lecture, “Software-Defined Networking at the Crossroads”, Scott Shenker of the University of California, Berkeley discusses SDN: its evolution, principles, and current state.
I’d like to solicit comments on their presumptions. Are networks really difficult to manage? If so, is it because of the technology, or because management is often an afterthought rather than an integral part of the system design?
Pay particular attention to the term “operator”. What department or role is Dr. Shenker referring to as operator? Is it the NOC or the Network Engineering department?
If you’re looking at implementing capacity planning or hiring someone to do capacity planning there are a few things you should consider.
Capacity planning should be an ongoing part of the lifecycle of any network (or any IT service, for that matter). The network was designed to meet a certain capacity, knowing that demand may grow as the network gets larger and/or supports more users and services. There are several ways to go about this, and the best approach depends on your situation. There should be some fairly specific plans on how to measure utilization, forecast, report, make decisions, and increase or decrease capacity. There are also many aspects to capacity. Link utilization is one obvious capacity limitation, but processor utilization may not be so obvious, and where VPNs are involved there are logical limits to the volume of traffic that can be handled by each device. There are also physical limitations such as port and patch panel connections, power consumption, UPS capacity, etc. These should all be addressed as an integral part of the network design, and if they have been overlooked, the design needs to be re-evaluated in light of the capacity management program. There are also the programmatic aspects: frequency of evaluation, control gates, decision points, who to involve where, etc. This is all part of the lifecycle.
There are a wide variety of tools available for capacity planning and analysis. Which tools are selected will be determined by the approach you’re taking to manage capacity; how the data is to be manipulated, reported, and consumed; and architectural factors such as hardware capabilities, available data, and other network management systems in use. One simple approach is to measure utilization through SNMP and use linear forecasting to predict future capacity requirements. This is very easy to set up, but doesn’t provide the most reliable results. A much better approach is to collect traffic data, overlay it on a dynamic model of the network, then use failure analysis to predict capacity changes as a result of limited failures. This can be combined with linear forecasting; however, failure scenarios will almost always be the determining factor. Many organizations use QoS to prioritize certain classes of traffic over others. This adds yet another dimension to the workflow. There is also traffic engineering design, third-party and carrier capabilities, and the behavior of the services supported by the network. It can become more complicated than it might appear at first glance.
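The simple SNMP-plus-linear-forecasting approach can be sketched in a few lines: fit a trend to periodic peak-utilization samples and estimate when a planning threshold will be crossed. The utilization samples and the 80% trigger below are invented for illustration.

```python
# Ordinary least-squares fit of a line to utilization samples, in pure
# Python (numpy.polyfit would do the same job).
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical daily peak utilization samples for one link.
days = list(range(10))
peak_util = [0.52, 0.53, 0.55, 0.54, 0.57, 0.58, 0.60, 0.59, 0.62, 0.63]
slope, intercept = linear_fit(days, peak_util)

THRESHOLD = 0.80   # planning trigger for a capacity increase
days_to_threshold = (THRESHOLD - intercept) / slope
print(f"~{days_to_threshold:.0f} days until {THRESHOLD:.0%} utilization")
```

This illustrates both the appeal of the simple approach (trivial to implement) and its weakness: a straight line through averages says nothing about bursts, failure scenarios, or per-class behavior, which is why the model-based approach usually dominates.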
Some understanding of the technologies is necessary to evaluate the data and make recommendations on changes. If dynamic modeling is used to forecast, yet another set of skills is required. The tools may produce much of the reporting; however, some analysis will need to be captured in a report that will be evaluated by other elements in the organization, which requires communication and presentation skills.
It’s highly unlikely that the personnel responsible for defining the program, gathering requirements, selecting COTS tools, writing middleware, and implementing all this will be the same as those that use the tools or produce the reports or maybe even read the reports and evaluate them. The idea of “hiring a capacity management person” to do all this isn’t really feasible. Those with the skills and motivation to define the program and/or design and implement it will not likely be interested in operating the system or creating the reports. One approach to this is to bring in someone with the expertise to define the approach, design and implement the tools, then train the personnel who will be using them. These engagements are usually relatively short and provide a great value.
This is a follow-on to the post Well Executed Service Design Provides Substantial ROI. The previous post discusses the rationale for investing in the service design activity. This provides some indicators that the service design processes are ineffective in the context of ITSM best practices. Take the quiz below and see how your organization scores:
If you answered “E”, just follow the “Contact Us” link at the bottom of the page; you’re in trouble. If you don’t have design standards, then do likewise. If less than 95% of your network is in compliance with established standards, then they aren’t standards; they’re suggestions. It’s very likely that you are provisioning devices without a detailed design. This is the most common indicator of an ineffective service design process.
Ideally you should build and provision your network by design and not make changes to it that aren’t defined by subsequent releases of the design. There are many compliance validation systems on the market that do a great job of validating IOS compliance, configuration compliance against a template, and so on. While these tools have great value in an organization with a loose design program, they address the symptom rather than the problem.
There is some merit to engineering working operational issues. It keeps the engineers sharp on troubleshooting skills and helps them understand the details of the incidents in the field. Often an engineer will identify a design fault easily and open a problem case. However, if the engineers can’t spend adequate time doing engineering because they’re troubleshooting operational issues, then something is terribly amiss. If the network is so complicated that Tier III can’t identify a problem, then engineering needs to develop better tools to identify those problems rapidly. If engineering is constantly needed to make changes to devices to resolve failures, then the network design is lacking. Engineering being involved in incident resolution is an indicator of a poor or ineffective network service design.
Did you have to install the speedometer, oil pressure gauge, etc. as aftermarket add-ons to your automobile? If you’re designing something, and there are performance criteria, the design needs to include a means to measure that performance to ensure the system meets the requirements. Your service design program should be developing metrics and alerts as an integral part of the design.
The configuration of a device is part of the system/service design. Does the auto repair shop ever redesign your car in order to repair it? There are a few circumstances in a network system where config change may be a valid repair action. For example: Bouncing a port on a device requires the configuration to be changed momentarily, then changed back. Perhaps there are hot spare devices or interfaces that are configured but disabled, and the repair plan is to enable the secondary and disable the primary (this is actually part of the system design plan). Aside from these few exceptions any modification to a device configuration is a modification to the network design. If the design has a flaw, there should be a problem case opened, the root cause identified, and the fix rolled into the next release of the service design. When design change is a mechanism used to correct incidents, it indicates a lack of a cohesive design activity.
The design of the network must continually be improved to add new features, support additional IT services, improve existing aspects of the system, fix problems, etc. A best practice is to release these design changes on a scheduled cycle. Everything in ITIL is in the context of the “service”. In the case of network services, this is the entire network viewed as a cohesive system. IOS updates, software updates to support systems, etc. are released by the manufacturer. Although these are part of the network system design, they do not constitute a release of the network service. For example: Network service version 2.2 may define JUNOS 10.4.x on all routers and IOS 12.4.x on all switches. Release 2.3 of the network service may include a JUNOS update as well as an enhancement to the QoS design and a configuration modification to fix a routing anomaly discovered during the previous release. It is the service that is subject to release management. The updated service design is then provisioned on the applicable devices using a schedule that is dependent on multiple factors, mostly operational. The provisioning schedule is not the same as the release schedule. Release 2.3 may be approved on March 1 but not provisioned across the entire network until August, through a series of change orders. A well-established network service design program uses release management.
Network service design and provisioning are too often blurred or indistinguishable. A design is abstract and applies to no particular device, but contains enough detail to be provisioned on any particular device in a turn-key fashion. Most organizations design particular devices, thus skipping the design and incorporating it into provisioning. When this happens, the design process must be repeated for each similar instance. Few CABs make the distinction between the two activities, causing change management to become very labor-intensive: the provisioning activity becomes subject to all the testing and scrutiny of the design activity, and the design activity to all the operational concerns of provisioning. This is another indicator of a poorly functioning network service design activity. Note: This is the only question where “E” is the best answer.
I’ve been engineering, maintaining, and managing network and IT systems for numerous organizations, large and small, for longer than I care to elaborate on. In all but three cases, all of which were small with little structure, the organization had a change management process that held to most of the ITIL process framework to some extent. All had many of the service operations processes and were working to improve what was lacking. All the CTOs/ITOs understood the value of service transition and service operations processes. However, few had a service catalog or any of the ITIL service design processes when I first began working with them, and nobody really gave it much thought. Almost all of the articles and discussions on the internet related to ITIL are about service transition or operation. Rarely if ever is anything written about service design or strategy. Half of the ITIL service lifecycle gets all the attention, while the other half is relegated to part of the academic exercise of certification. If it isn’t important or necessary, why is it there?
Perhaps it’s that most people who are proponents of the ITIL framework have either an operations or project management background and don’t understand the engineering process and how it relates to ITIL. How many engineers or developers do you know who embrace ITIL? Most see it as a burden forced upon them by PMs and operations. Isn’t the CAB a stick used to pummel engineering into writing documents that nobody, not even the installer, will ever read? What if they saw themselves as a vital part of the service lifecycle and key to cost saving measures?
Service transition and operations are where cost recovery is most evident. I’ve often heard it said that service operations and transition are where ITIL realizes its ROI. I argue that a well executed service design provides even more ROI than operations and transition, though it is not evident in the design part of the lifecycle.
To illustrate this point, consider an auto manufacturer. A lot goes into the design of the auto. The customer doesn’t see the design or the manufacturing process, but they do see the O&M processes. Do you know anyone who had a vehicle that was always in need of repair? The repairs were costly, but how much of the need for repair was due to poor design? I had a Bronco II that would constantly break the ring gear, which often led to a transmission rebuild. The aluminum ring gear couldn’t handle the torque and would rip into pieces. I had several friends who owned minivans that would throw a transmission every 30,000 miles. It wasn’t bad luck or abuse; it was a bad design. The manufacturer fixed that in later years, but it gave them a bad reputation and caused sales of that model to fall. How about recalls? They are very costly. First there’s the issue of diagnosing a problem with a product that’s already in the field, then the redesign, and then the retrofit. The point I’m trying to illustrate is that design flaws are very costly, but that cost shows up in the operations and transition parts of the lifecycle, not the design stage.
The Rogers Cellular outage in Oct 2013 is one example. Rogers has not had a very good record for service and availability. They suffered an outage impacting their entire network for a few hours that made national news. How do you suppose this outage affected sales? An inadequate design can have some very expensive consequences.
The business case for change management is built on reducing the cost associated with service disruptions due to change. While change management is good, the real problem is unexpected service disruption as a result of change. Planned service disruption can be scheduled in a manner to least impact customers. It’s the unintended consequences that are the trouble. A well executed service design process produces a transition plan that correctly identifies the impact of the change (change evaluation). Change management has nothing to do with this; in fact, change management relies on this being done correctly. A large part of what most organizations are using change management to correct isn’t even addressed by change management; it’s addressed by service design. This may be counter-intuitive, but it’s true nonetheless.
CloudFlare made the news when they experienced an hour-long outage affecting their entire worldwide network in March 2013. The outage occurred when a firewall rule applied during a change left their Juniper routers resource-starved. Juniper took the bad rap for this; however, it was the network engineering team at CloudFlare that was to blame, not Juniper. Although the failure was triggered by a JUNOS bug, Juniper had identified the bug and released a patch in October 2012. CloudFlare made a change to the network (a service design change) that was released immediately to ward off a DDoS attack (a service patch in ITIL terms). The change was not tested adequately and the behavior was not as expected. The service design process was at fault here, and there was nothing in the change management process to catch it. This is because change management attempts to manage the risk associated with change by controlling “how” the change is executed. Change management does nothing with the content of the change. It is presupposed that the “what” being changed has been adequately designed and tested as part of the service design process.
IT services such as Windows domains, Exchange, and databases seldom bear any resemblance to an engineering practice; the teams supporting them typically function as product implementation centers. Implementing a well-defined service design program requires a major paradigm shift for an organization. Most organizations don’t view the engineering process with the same discipline as other areas of industry. Because most networks and IT systems are composed of COTS products that need relatively little configuration, the configuration details of the COTS system and how the components integrate with other systems are not viewed as a system that should be subject to the same engineering processes as any other system that needs to be designed. This is a very costly assumption.
In mathematics, a limit is the value that a function “approaches” as the input approaches some value. Limits are essential to mathematical analysis and are used to define continuity, derivatives, and integrals. If we were to take the limit of service availability as service design approaches perfection, we would see that there were no unexpected outcomes during service transition – everything would behave exactly as expected, thus eliminating the cost of unintended service disruptions due to change. The service would operate perfectly, and incidents would be reduced to the happenstance of a component failing as predicted by its MTBF. There would be no problems to identify workarounds or resolutions for. This would greatly increase service availability and performance and produce a substantial ROI.
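The analogy can be restated informally as a limit. This is a rhetorical device, not a rigorous model; the symbols are invented here for illustration:

```latex
% As design quality D approaches perfection P, the number of
% unexpected disruptions during service transition approaches zero.
\lim_{D \to P} \mathrm{UnexpectedDisruptions}(D) = 0
```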
Network instrumentation is another area where the design is seldom on target. Most networks are poorly instrumented until after they’ve grown to a point where the lack of visibility is causing problems. This applies to event management – what events are trapped or logged, and how those events are filtered, enriched, and/or suppressed in the event management system to provide the NOC with the most useful data. It also applies to performance management – what metrics are collected, how thresholds are set, and how metrics are correlated to produce indicators useful to the consumers of that data. It also applies to traffic statistics such as NetFlow or similar technologies. This should all be part of the network design from the beginning, because the network has to be maintained as part of the lifecycle.
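To make the event-management design decisions concrete, here is a minimal sketch of the kind of filter/suppress/enrich logic a design might specify up front. Every name, trap, and lookup table below is invented for illustration, not taken from any particular product:

```python
# Hypothetical event pipeline: suppress noise, enrich with CMDB data,
# and assign a severity before the event reaches the NOC console.

SUPPRESSED = {"IF-MIB::linkUp"}          # informational events the design says to drop
SITE_LOOKUP = {"rtr-nyc-01": "NYC-POP"}  # enrichment data, e.g. from a CMDB

def process_event(event):
    """Filter, suppress, and enrich a raw event for the NOC."""
    if event["trap"] in SUPPRESSED:
        return None                       # drop noise before it reaches the NOC
    enriched = dict(event)
    enriched["site"] = SITE_LOOKUP.get(event["device"], "UNKNOWN")
    enriched["severity"] = "critical" if event["trap"].endswith("linkDown") else "info"
    return enriched

raw = {"device": "rtr-nyc-01", "trap": "IF-MIB::linkDown"}
print(process_event(raw))
```

The point of designing this logic in advance is that the NOC’s view of the network is a deliberate output of the design, not an accident of whatever the devices happen to emit.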
The service design aspect of the ITIL service lifecycle is greatly undervalued and often overlooked entirely. Take a look at what ITIL recommends should be in the Service Design Package – the topology is only one of many aspects of the design. Poorly executed service design results in increased costs due to unexpected service disruptions during service transition and decreased system availability due to flaws in the design. All of these require additional effort to identify the problem, compensate for the service disruption, rework the design, and execute the transition again.
A wise approach to ITSM is to invest the effort in service design rather than spend far more dealing with the fallout from a poorly designed network. The engineering staff should see themselves as having a key role in the service design process and know the value they contribute to the entire system lifecycle.
The network engineering process in most organizations is upside down: network engineers provision first, and then perhaps go back and address the design.
Consider an auto manufacturer. The engineering department designs an automobile, or a series of autos, with various options. This design doesn’t apply to any particular auto; it is abstracted to address all autos built using that design. Manufacturing then outfits the factory using that design and builds a variety of autos in different colors and options, each with a unique VIN.
Consider a software developer. They develop an application with certain requirements and features and develop it to run on a variety of platforms. They then use an installation packager to deploy that application to the various platforms and configure it correctly.
Have you ever had a problem with your car and brought it to the shop? Did the mechanic first go over the entire auto to make sure it was built correctly? Was there ever any doubt that the auto was built according to the design, just like all other cars of that make and model? No. The repair center can look up the model/VIN and determine exactly what parts and options were used on that particular auto. The design is standard, and each auto carries information pointing to its design. Have you ever had the repair shop tell you that a particular part needed to be replaced with one that had different specifications (changing the design) to repair your problem? Of course not. Have you ever had a problem with a software application and called tech support? Did they run a tool to verify that it had all the correct libraries and executables? No; they asked for the version number. The version number tells them what design was used to create that specific instance of the application.
Why, then, do network engineers have to constantly check routers and switches to see if they conform to standard templates (if standards actually exist)? Why do network engineers build (provision) the network and then go back and try to put together a design (in the sense described above)? Why does tier three often reconfigure (change the design of) a network device in order to resolve an incident? Why can’t many network changes be tested without building the entire production network in dev? Because the production network is the closest thing to a design that exists. The fact is that the common approach to network engineering is completely backwards – and this paradigm is so common that it’s rarely questioned.
Does anyone remember how MS Windows networks were built and maintained 15 years ago? Each computer (desktop or server) was built individually using an install program and managed individually. This created a great deal of diversity and was very labor intensive. Two innovations completely changed this: system images and Active Directory. Using these two design constructs, the software could be centrally managed with much more consistency and less labor. It improved system availability and reduced TCO.
Would you like to learn how to design your network and then provision it using the design? Would you like some inexpensive tools to assist in that process?
Change Management is an important function in most organizations. It carries more weight than many of the other ITIL functions because it addresses the largest pain point – the large percentage of service disruption caused by change. IT managers are constantly getting heat over deployments that didn’t go exactly as planned. When you boil that down to lost productivity or missed business opportunities, it amounts to a sizable amount of money. These are just some of the reasons Change Management gets so much well-deserved attention.
There is a lot of preparation and documentation that has to go into any change before it’s presented to the board for approval. Each change is categorized, analyzed, and scrutinized until everyone involved is thoroughly mesmerized. The time required to get a change approved may have increased five-fold by the time the change management process is fully mature. The process is controlled through some rather expensive management software, well documented, well planned, and hopefully well executed.
The question is: After expending all this effort into the Change Management process, expending the resources in additional planning and documentation, and spending all the time in meetings, and prolonging the time required to get a task accomplished, did CM reduce service disruptions and save more money than was invested in the process? If not, the program was a failure and we’re spending our money in the wrong place.
Change Management can’t solve unanticipated problems caused by change because it addresses execution, not content. Change Management is about managing risk, not improving the quality of changes. There is nothing in the Change Management process that addresses the particulars of the change – that is addressed in the Service Design process. There are very few organizations supporting network services that have a comprehensive design process. More often the process is abbreviated, if accomplished at all, and systems are provisioned directly. Looking at processes from an ITIL perspective, most organizations have strong Service Operation processes and the big Service Transition process – Change Management – but Service Strategy and Service Design are usually lacking.
I’ve been designing and installing telecommunications systems for almost 30 years. I’ve held a variety of positions supporting small to large networks and seen a variety of approaches to engineering and provisioning. Although labels and pigeonholes don’t adequately capture the wide variety of approaches in use, we can use a few broad categories to generally describe them.
This approach was typical 20 years ago. It relies on a small team of highly qualified network engineers who solve problems on the back of a napkin and provision systems directly. If there is a problem with the network service or a new capability needs to be added, a network engineer will come up with a solution and implement it on the devices in question. This isn’t to imply that there is no planning – on the contrary, there is planning, but each implementation is planned and executed individually. Sure there are standards, but they’re more often informal.
This isn’t necessarily a bad approach; it works well for small networks. With a highly skilled staff that communicates frequently, it can be managed on an informal, ad-hoc basis. The trouble is that as the network grows and management tries to save money by staffing the engineering department with less experienced engineers, mistakes start to appear from unexpected non-standard configurations and errors. At this point management steps in in an attempt to rein in the boys.
This approach is similar to the previous one, with the addition of a change management program. In an attempt to reduce unexpected service disruptions caused by change, a formal change management process is established to control how changes are executed and to manage the impact of change disruptions. Changes are well documented and scrutinized by a Change Advisory Board (CAB). Impact assessments are presented to the CAB, and the change is categorized based on its risk and impact. Specific change windows are established and the implementations are managed. This forces the engineering staff to develop a more thorough implementation plan, but it doesn’t address the fundamental problem.
In my opinion, this approach is a complete waste of time because it doesn’t address the problem – it addresses the symptoms. Not that Change Management is bad – it has its place and is necessary. What causes unexpected service disruptions after a change implementation? Unless your installers are under-qualified, it’s not how the implementation is executed; it’s what is being done. All this approach does is impose a great deal of management oversight and increase the service order-to-delivery time by adding control gates.
Change Management can’t control unexpected behavior because Change Management focuses on the execution of the change. If the impact of every change were known for certain, then the implementation could be managed without unexpected consequences. How can the impact be known with a high degree of certainty? By designing the network service as a system rather than designing each implementation individually – which is actually skipping the design process and jumping straight to provisioning. This is putting the cart before the horse. It is the most common practice in use, and it is why IT managers look to outsourcing network services. Herding cats is difficult if not impossible.
In addition to Change Management, standardized templates and compliance checking are often implemented in an attempt to standardize configuration across larger, more complex networks. Often an IT management framework such as ITIL is embraced; however, seldom is the network service subject to the ITIL Service Design and Release Management processes. In this model, standard IOS images and configuration templates are developed to describe each element of the network configuration. These templates may be broken down into smaller, more manageable sub-components such as base, network management, interface configuration, routing protocols, traffic engineering, tributary provisioning, performance, etc. These templates are then used as a standard against which network device configurations are checked through some compliance checking mechanism such as Riverbed NetDoctor or HPNA.
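The mechanics of template-based compliance checking can be sketched in a few lines: render a standard template with the device’s variables, then diff the result against the configuration actually running on the device. The template text, variable names, and addresses below are all invented for illustration; real tools like NetDoctor or HPNA are far richer, but the principle is the same:

```python
# Hypothetical compliance check: render the standard, diff against reality.
import difflib
from string import Template

# A fragment of a "base" configuration standard (invented for this sketch).
NTP_TEMPLATE = Template("ntp server $ntp_server\nlogging host $syslog_host\n")

def check_compliance(template, variables, running_config):
    """Return a unified diff; an empty result means the device complies."""
    expected = template.substitute(variables)
    diff = difflib.unified_diff(
        expected.splitlines(), running_config.splitlines(),
        fromfile="standard", tofile="device", lineterm="")
    return "\n".join(diff)

vars_ = {"ntp_server": "10.0.0.1", "syslog_host": "10.0.0.2"}
drifted = "ntp server 10.9.9.9\nlogging host 10.0.0.2\n"
print(check_compliance(NTP_TEMPLATE, vars_, drifted) or "compliant")
```

Note that this only detects drift after the fact; it does nothing to prevent the drift from being introduced in the first place, which is the point of the paragraphs that follow.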
This is a large step in the right direction, but it still fails to address the fundamental problem. Configuration Management is important, but it addresses a symptom, not the problem. There will often be a large number of devices out of compliance, and bringing them into compliance is a burdensome process in a large network with a tight change management process. This is because the design process is still being skipped, and the operations managers have little confidence in the design – because in the larger context of the entire network, the design is non-existent.
It’s interesting to note that most large managed service providers are at this stage. This is partially because they have little control over the customer’s service strategy and design processes. The service contract primarily addresses service transition and operation, and the metrics used to evaluate provider performance are largely operations-related – availability, packet loss, etc. Providers are able to productize operations and transition processes to fit any environment. This makes it difficult to get the provider to implement changes; it’s in their best interest to keep the network stable and reliable.
There is a paradigm associated with designing a network using COTS products that causes network engineering workcenters to disregard the conventional engineering process. Consider the design of an aircraft platform. Engineers don’t go out and build aircraft from scratch, making each one slightly different. Engineers produce a blueprint that addresses all aspects of the design and system lifecycle. The production team takes that blueprint and outfits the factory to build many aircraft using the design. Retrofits follow a similar process. Consider a software engineering project. The developer builds the application for each platform it is to be released on and ships an installation package for each platform. That installation package takes into account any variation in the platform. One package may be released to install on various Windows OSs, another on various Linux OSs, another for the supported Mac OSs. This has been thoroughly tested prior to release, so the installation package installs with a high degree of certainty. Enhancements and fixes are bundled into release packages. Patches that have a time constraint are released outside the regular release schedule. Imagine if the developer released a collection of executables, libraries, and support files and expected the installer to configure it correctly based on the system it was being installed on. The results wouldn’t be very certain, and there would be a large number of incidents reported for failed installations. Imagine if the aircraft designer released a set of guidelines and expected the factory to design each aircraft to order. I’d be taking the train! If designing first seems only logical, then why do most organizations skip the design process for IT/telecom systems and jump straight to provisioning? Because the system is a collection of COTS products and the design consists primarily of topology and configuration. This doesn’t make the design process any less vital.
Under this model, the network is considered a service, and the design process creates a blueprint that is applied wherever the service is provisioned. Standards and templates are part of that design, but there is much more. The entire topology and system lifecycle are addressed in a systematic way that ensures that each device in the network is a reflection of that design. There is a system of logic that describes how these standards and templates are used to provision any component in the network. Enhancements and fixes are released on regular cycles across the entire network, and the version of the network configuration is managed. This approach takes most of the guesswork out of the provisioning process.
The ITIL Service Design process treats a service much the way the aircraft and the software application are treated in the examples above. When the network is treated as a service subject to this same rigorous engineering process, the result is improved efficiency and a high degree of predictability that reduces service disruptions caused by unexpected problems encountered during changes. This requires a great deal more engineering effort during the design and release processes, but the ROI is improved availability and reduced effort during implementation. Implementing the release package becomes a turn-key operation that can be performed by the operations or provisioning team rather than engineering. This paradigm shift often takes an organization some time to grasp and function efficiently in, but it improves performance and efficiency and paves the way toward automated provisioning.
This is the Zen of the network service design continuum. It can’t be achieved unless there is a fundamental shift in the way the network engineering staff approaches network design. Engineering produces a blueprint that is implemented with a high degree of certainty. The network service is designed as a system, with a well-developed design package that addresses all aspects of the network topology and system lifecycle. Network hardware is standardized and standard systems are defined. Standards are developed in great detail. Configurations are designed from a systemic perspective in a manner that can be applied to standard systems using the other standards as inputs. The CMDB or some other authoritative data source contains all the network configuration items and the relationships between them. A logical system is developed that addresses how these standards and relationships will be applied to any given implementation. This is all tested, on individual components and as a system, to ensure the system meets the design requirements and to assess the impact of any changes that will have to be applied as a result of the release.
At this point the logic that has been developed to translate the design into an implementation (provisioning) can be turned into an automated routine that produces the required configurations to provision all the devices involved in any given change.
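A minimal sketch of that idea, design-driven provisioning, is shown below: a single routine takes an authoritative data source (a stand-in for the CMDB) plus a standard template and emits the configuration for any device. The device names, interfaces, and addresses are all invented for illustration:

```python
# Hypothetical design-driven provisioning: configuration is generated
# from the authoritative design data, never hand-crafted per device.
from string import Template

# A standard interface template from the design package (invented here).
INTERFACE_TEMPLATE = Template(
    "interface $name\n description $description\n ip address $ip\n")

# Stand-in for the CMDB: the authoritative record of each device's design.
CMDB = {
    "rtr-den-01": [
        {"name": "Gi0/0", "description": "uplink to core", "ip": "10.1.0.1 255.255.255.252"},
        {"name": "Gi0/1", "description": "LAN", "ip": "10.1.1.1 255.255.255.0"},
    ],
}

def provision(device):
    """Generate the full interface configuration for a device from the design."""
    records = CMDB[device]
    return "".join(INTERFACE_TEMPLATE.substitute(r) for r in records)

print(provision("rtr-den-01"))
```

Because every configuration is a pure function of the design data, re-running the routine for any device reproduces its intended state, which is what makes provisioning repeatable and auditable.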
Some of the controls, such as compliance checking, become more of a spot check to verify that the automation is working effectively. Network engineers are no longer involved in provisioning, but in designing the service in a larger context. Provisioning becomes a repeatable process with a high degree of certainty. This greatly reduces the risk that Change Management is attempting to control and makes it a workable process.
Most organizations with large or complex networks would benefit greatly from automated network provisioning.