In this lecture, “Software-Defined Networking at the Crossroads”, Scott Shenker of the University of California, Berkeley, discusses SDN: its evolution, principles, and current state.
I’d like to solicit comments on their presumptions. Are networks really difficult to manage? If so, is it because of the technology, or because management is often an afterthought rather than an integral part of the system design?
Pay particular attention to the term “operator”. What department or role is Dr. Shenker referring to as operator? Is it the NOC or the Network Engineering department?
If you’re looking at implementing capacity planning or hiring someone to do capacity planning there are a few things you should consider.
Capacity planning should be an ongoing part of the lifecycle of any network (or any IT service for that matter). The network was designed to meet a certain capacity, knowing that demand may grow as the network gets larger and/or supports more users and services. There are several ways to go about this, and the best approach depends on your situation. There should be some fairly specific plans on how to measure utilization, forecast, report, make decisions, and increase or decrease capacity. There are also many aspects to capacity. Link utilization is one obvious capacity limitation, but processor utilization may not be so obvious, and where VPNs are involved there are logical limits to the volume of traffic that can be handled by each device. There are also physical limitations such as port and patch panel connections, power consumption, UPS capacity, etc. These should all be addressed as an integral part of the network design, and if they have been overlooked, the design needs to be re-evaluated in light of the capacity management program. There are also the programmatic aspects – frequency of evaluation, control gates, decision points, who to involve where, etc. This is all part of the lifecycle.
There are a wide variety of tools available for capacity planning and analysis. Which are selected will be determined by the approach you’re taking to manage capacity, how the data is to be manipulated, reported, and consumed, as well as architectural factors such as hardware capabilities, available data, and other network management systems in use. One simple approach is to measure utilization through SNMP and use linear forecasting to predict future capacity requirements. This is very easy to set up, but doesn’t provide the most reliable results. A much better approach is to collect traffic data, overlay it on a dynamic model of the network, then use failure analysis to predict capacity changes as a result of limited failures. This can be combined with linear forecasting; however, failure scenarios will almost always be the determining factor. Many organizations use QoS to prioritize certain classes of traffic over others. This adds yet another dimension to the workflow. There is also traffic engineering design, third party and carrier capabilities, and the behavior of the services supported by the network. It can become more complicated than it might appear at first glance.
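The simple SNMP-plus-linear-forecasting approach mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the sample data, the 80% planning threshold, and the function names are hypothetical, and the utilization figures are assumed to have already been collected (for example, as periodic peak values derived from SNMP interface counters). It does not cover the failure-analysis approach, which needs a model of the network.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(samples, threshold_pct):
    """Estimate how many days after the last sample a link reaches threshold_pct.

    samples: list of (day_number, peak_utilization_percent) tuples.
    Returns None when the fitted trend is flat or declining.
    """
    xs = [day for day, _ in samples]
    ys = [util for _, util in samples]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    # Never report a negative lead time if the threshold is already exceeded.
    return max(0.0, crossing_day - xs[-1])


# Hypothetical monthly peak utilization for one link (day number, percent).
samples = [(0, 40.0), (30, 45.0), (60, 50.0), (90, 55.0)]
lead_time = days_until_threshold(samples, threshold_pct=80.0)
```

With this made-up data the trend is 5 points per 30 days, so the estimate comes out to roughly 150 days after the last sample. As the text notes, a forecast like this is easy to produce but says nothing about how load shifts when a link or device fails.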
Some understanding of the technologies is necessary to evaluate the data and make recommendations on any changes. If dynamic modeling is used to forecast, that requires another set of skills. The tools may produce much of the reporting; however, some analysis will need to be captured in a report that will be evaluated by other elements in the organization, which requires communication and presentation skills.
It’s highly unlikely that the personnel responsible for defining the program, gathering requirements, selecting COTS tools, writing middleware, and implementing all this will be the same as those that use the tools or produce the reports or maybe even read the reports and evaluate them. The idea of “hiring a capacity management person” to do all this isn’t really feasible. Those with the skills and motivation to define the program and/or design and implement it will not likely be interested in operating the system or creating the reports. One approach to this is to bring in someone with the expertise to define the approach, design and implement the tools, then train the personnel who will be using them. These engagements are usually relatively short and provide a great value.
This is a follow-on to the post “Well Executed Service Design Provides Substantial ROI”. The previous post discusses the rationale for investing in the service design activity. This post provides some indicators that the service design processes are ineffective in the context of ITSM best practices. Take the quiz below and see how your organization scores:
If you answered “E” just follow the “Contact Us” link at the bottom of the page; you’re in trouble. If you don’t have design standards then do likewise. If less than 95% of your network is in compliance with established standards, then they aren’t standards – they’re suggestions. It’s very likely that you are provisioning devices without a detailed design. This is the most common indicator of an ineffective service design process.
Ideally you should build and provision your network by design and not make changes to it that aren’t defined by subsequent releases of the design. There are many compliance validation systems on the market that do a great job of validating IOS compliance, configuration compliance against a template, and so on. While these tools have great value in an organization with a loose design program, they address the symptom rather than the problem.
There is some merit to engineering working operational issues. It keeps the engineers sharp on troubleshooting skills and helps them understand the details of the incidents in the field. Often an engineer will identify a design fault easily and open a problem case. However, if the engineers can’t spend adequate time doing engineering because they’re troubleshooting operational issues, then something is terribly amiss. If the network is so complicated that Tier III can’t identify a problem, then engineering needs to develop better tools to identify those problems rapidly. If engineering is constantly needed to make changes to devices to resolve failures, then the network design is lacking. Engineering being involved in incident resolution is an indicator of a poor or ineffective network service design.
Did you have to install the speedometer, oil pressure gauge, etc. as aftermarket add-ons to your automobile? If you’re designing something, and there are performance criteria, the design needs to include a means to measure that performance to ensure the system meets the requirements. Your service design program should be developing metrics and alerts as an integral part of the design.
The configuration of a device is part of the system/service design. Does the auto repair shop ever redesign your car in order to repair it? There are a few circumstances in a network system where config change may be a valid repair action. For example: Bouncing a port on a device requires the configuration to be changed momentarily, then changed back. Perhaps there are hot spare devices or interfaces that are configured but disabled, and the repair plan is to enable the secondary and disable the primary (this is actually part of the system design plan). Aside from these few exceptions any modification to a device configuration is a modification to the network design. If the design has a flaw, there should be a problem case opened, the root cause identified, and the fix rolled into the next release of the service design. When design change is a mechanism used to correct incidents, it indicates a lack of a cohesive design activity.
The design of the network must continually be improved to add new features, support additional IT services, improve existing aspects of the system, fix problems, etc. A best practice is to release these design changes on a scheduled cycle. Everything in ITIL is in the context of the “service”. In the case of network services this is the entire network viewed as a cohesive system. IOS updates, software updates to support systems, etc. are released by the manufacturer. Although these are part of the network system design, they do not constitute a release of the network service. For example: Network service version 2.2 may define JUNOS 10.4.x on all routers and IOS version 12.4.x on all switches. Release 2.3 of the network service may include a JUNOS update as well as an enhancement to the QoS design and a configuration modification to fix a routing anomaly discovered during the previous release. It is the service that is subject to release management. The updated service design is then provisioned on the applicable devices using a schedule that is dependent on multiple factors – mostly operational. The provisioning schedule is not the same as the release schedule. Release 2.3 may become approved on March 1 but not be provisioned across the entire network until August through a series of change orders. A well established network service design program uses release management.
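The distinction between releasing the service design and provisioning it can be made concrete with a small data model. This is a minimal sketch, not a prescribed ITIL structure: the class names, field names, and device names are hypothetical, and the release contents simply restate the 2.2 and 2.3 example from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class ServiceRelease:
    """One release of the network service design (hypothetical structure)."""
    version: str     # version of the service, not of any vendor software
    contents: dict   # what this release of the design defines


@dataclass
class Device:
    """A provisioned device and the service release it currently reflects."""
    name: str
    provisioned_release: str


def pending_devices(devices, release):
    """Names of devices not yet provisioned to the given service release."""
    return [d.name for d in devices if d.provisioned_release != release.version]


release_2_2 = ServiceRelease("2.2", {"routers": "JUNOS 10.4.x", "switches": "IOS 12.4.x"})
release_2_3 = ServiceRelease(
    "2.3",
    {
        "routers": "JUNOS update",            # exact version not given in the text
        "qos": "enhanced QoS design",
        "routing": "fix for routing anomaly",
    },
)

# Hypothetical inventory partway through the provisioning schedule.
devices = [Device("rtr1", "2.3"), Device("rtr2", "2.3"), Device("sw1", "2.2")]
still_on_old_design = pending_devices(devices, release_2_3)
```

Here the release is approved once, while `pending_devices` shrinks over time as change orders are executed; the release schedule and the provisioning schedule are tracked separately.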
Network service design and provisioning are too often blurred or indistinguishable. A design is abstract and applies to no particular device, but contains enough detail to be provisioned on any particular device in a turn-key fashion. Most organizations design particular devices, thus skipping the design and folding it into provisioning. When this happens the design process must be repeated for each similar instance. Few CABs make the distinction between the two activities, which makes change management very labor intensive: the provisioning activity becomes subject to all the testing and scrutiny of the design activity, and the design activity becomes subject to all the operational concerns of provisioning. This is another indicator of a poorly functioning network service design activity. Note: This is the only question where “E” is the best answer.
I’ve been engineering, maintaining, and managing network and IT systems for numerous organizations, large and small, for longer than I care to elaborate on. In all but three cases, all of which were small with little structure, the organization had a change management process that held to most of the ITIL process framework to some extent. All had many of the service operations processes and were working to improve what was lacking. All the CTO/ITOs understood the value of service transition and service operations processes. However, few had a service catalog or any of the ITIL service design processes when I first began working with them, and nobody really gave it much thought. Almost all of the articles and discussions on the internet related to ITIL are about service transition or operation. Rarely if ever is anything written about service design or strategy. Half of the ITIL service lifecycle gets all the attention, while the other half is relegated to part of the academic exercise of certification. If it isn’t important or necessary, why is it there?
Perhaps it’s that most people who are proponents of the ITIL framework have either an operations or project management background and don’t understand the engineering process and how it relates to ITIL. How many engineers or developers do you know who embrace ITIL? Most see it as a burden forced upon them by PMs and operations. Isn’t the CAB a stick used to pummel engineering into writing documents that nobody, not even the installer, will ever read? What if they saw themselves as a vital part of the service lifecycle and key to cost saving measures?
Service transition and operations are where cost recovery is most evident. I’ve often heard it said that service operations and transition are where ITIL realizes its ROI. I argue that a well executed service design provides even more ROI than operations and transition, though it is not evident in the design part of the lifecycle.
To illustrate this point, consider an auto manufacturer. A lot goes into the design of the auto. The customer doesn’t see the design or the manufacturing process, but they see the O&M processes. Do you know anyone who had a vehicle that was always needing repair? The repair was costly but how much of the need for repair was due to poor design? I had a Bronco II that would constantly break the ring gear which often led to a transmission rebuild. The aluminum ring gear couldn’t handle the torque and would rip into pieces. I had several friends who owned minivans that would throw a transmission every 30,000 miles. It wasn’t bad luck or abuse, it was a bad design. The manufacturer fixed that in later years, but it gave them a bad reputation and caused sales on that model to fall. How about recalls? They are very costly. First there’s the issue of diagnosing a problem with a product that’s already in the field, then the redesign, and the retrofit. The point I’m trying to illustrate is that design flaws are very costly, but that cost shows up in the operations and transition part of the lifecycle, not the design stage.
The Rogers Cellular outage in Oct 2013 is one example. Rogers has not had a very good record for service and availability. They suffered an outage impacting their entire network for a few hours that made national news. How do you suppose this outage affected sales? An inadequate design can have some very expensive consequences.
The business case for change management is built on reducing the cost associated with service disruptions due to change. While change management is good, the real problem is unexpected service disruption as a result of change. Planned service disruption can be scheduled in a manner to least impact customers. It’s the unintended consequences that are the trouble. A well executed service design process produces a transition plan that correctly identifies the impact of the change (change evaluation). Change management has nothing to do with this; in fact, change management relies on this being done correctly. A large part of what most organizations are using change management to correct isn’t even addressed by change management; it’s addressed by service design. This may be counter-intuitive, but it’s true nonetheless.
CloudFlare made the news when they experienced an hour-long outage affecting their entire worldwide network in March 2013. This outage was due to a change that caused the Juniper routers to become resource starved after a firewall rule was applied. Juniper received the bad rap on this; however, it was the network engineering team at CloudFlare that was to blame, not Juniper. Although this was due to a JUNOS bug, Juniper had identified the bug and released a patch in October 2012. CloudFlare made a change to the network (service design change) that was released immediately to ward off a DDoS attack (this would be a service patch in ITIL terms). The change was not tested adequately and the behavior was not as expected. It was the service design process at fault here, and there was nothing in the change management process to check this. This is because change management attempts to manage the risk associated with change by controlling “how” the change is executed. Change management does nothing with the content of the change. It is presupposed that the “what” being changed has been adequately designed and tested as part of the service design process.
IT services such as Windows domain, Exchange, and databases seldom have any semblance of an engineering practice, but typically function as a product implementation center. Implementing a well defined service design program requires a major paradigm shift for an organization. Most organizations don’t view the engineering process with the same discipline as other areas of industry. Because most networks and IT systems are composed of COTS products that need relatively little configuration, the configuration details of the COTS system and how they integrate with other systems are not viewed as a system that should be subject to the same engineering processes as any other system that needs to be designed. This is a very costly assumption.
In mathematics, a limit is the value that a function “approaches” as the input approaches some value. Limits are essential to mathematical analysis and are used to define continuity, derivatives, and integrals. If we were to take the limit of service availability as service design approaches perfection, we would see that there were no unexpected outcomes during service transition – everything would behave exactly as expected, thus eliminating the cost of unintended service disruptions due to change. The service would operate perfectly and incidents would be reduced to the happenstance where a component malfunctioned at the rate expected from its MTBF. There would be no problems to identify workarounds or resolutions for. This would greatly increase service availability and performance and produce a substantial ROI.
Network instrumentation is another area where the design is seldom on target. Most networks are poorly instrumented until after they’ve grown to a point where the lack of visibility is causing problems. This applies to event management – what events are being trapped or logged, how those events are filtered, enriched, and/or suppressed in the event management system to provide the NOC with the most useful data. It also applies to performance management – what metrics are collected, how thresholds are set, how metrics are correlated to produce indicators useful to the consumers of that data. It also applies to traffic statistics such as NetFlow or similar technologies. This should all be part of the network design from the beginning, because the network has to be maintained as part of the lifecycle.
The service design aspect of the ITIL service lifecycle is greatly undervalued and often overlooked entirely. Take a look at what ITIL recommends should be in the Service Design Package – the topology is only one of many aspects of the design. Poorly executed service design results in increased costs due to unexpected service disruptions during service transition and decreased system availability due to flaws in the design. All these require additional effort to identify the problem, compensate for the service disruption, and then rework the design and execute the transition again.
A wise approach to ITSM is to expend the effort in service design rather than expending much more cost to deal with the fallout from a poorly designed network. The engineering staff should see themselves as having a key role in the service design process and know the value they contribute to the entire system lifecycle.
The network engineering process in most organizations is usually upside down. Network engineers are more often involved in provisioning and then perhaps go back and address the design.
Consider an auto manufacturer. The engineering department designs an automobile or series of autos, with various options. This design doesn’t apply to any particular auto, but it is abstracted to address all autos built using that design. The manufacturing then fits the factory using that design and builds a variety of autos in different colors and options each with unique VINs.
Consider a software developer. They develop an application with certain requirements and features and develop it to run on a variety of platforms. They then use an installation packager to deploy that application to the various platforms and configure it correctly.
Have you ever had a problem with your car and brought it to the shop? Did the mechanic first go over the entire auto to make sure it was built correctly? Was there ever any doubt that the auto was built according to the design just like all other cars of that make and model? No. The repair center can look up the model/VIN and determine exactly what parts and options were used on that particular auto. The design is standard and each auto has information pointing to the design. Have you ever had the repair shop tell you that a particular part needed to be replaced with one that had different specifications (change the design) to repair your problem? Of course not. Have you ever had a problem with a software application and called tech support? Did they ever run an application to ensure that it had all the correct libraries and executables? No, they ask for the version number. The version number tells them what design was used to create that specific instance of the application.
Why, then, do network engineers have to constantly check routers and switches to see if they conform to standard templates (if standards actually exist)? Why do network engineers build (provision) the network then go back and try to put together a design (in the context described above)? Why does tier three often reconfigure (change the design of) a network device in order to resolve an incident? Why can’t many network changes be tested without building the entire production network in dev? Because the production network is the closest thing to a design that exists. The fact is that the common approach to network engineering is completely backwards – and this paradigm is so common that it’s rarely ever questioned.
Does anyone remember how MS Windows networks were built and maintained 15 years ago? Each computer (desktop or server) was built individually using an install program. Each was managed individually. This caused a great deal of diversity and was very labor intensive. Two significant changes that completely changed this were system images and Active Directory. Using these two design constructs the software could be centrally managed with much more consistency and less labor. It improved system availability and reduced TCO.
Would you like to learn how to design your network and then provision it using the design? Would you like some inexpensive tools to assist in that process?
Change Management is an important function in most organizations. It carries more weight than many of the other ITIL functions because it addresses the largest pain point – that large percentage of service disruption caused by change. IT managers are constantly getting heat over deployments that didn’t go exactly as planned. When you boil that down to lost productivity or missed business opportunities it amounts to a sizable amount of money. These are just some of the reasons Change Management gets so much well deserved attention.
There is a lot of preparation and documentation that has to go into any change before it’s presented to the board for approval. Each change is categorized, analyzed, scrutinized, until everyone involved is thoroughly mesmerized. The time required to get a change approved may also have increased five-fold by the time the change management process is fully matured. The process is controlled through some rather expensive management software, well documented, well planned, and hopefully well executed.
The question is: After expending all this effort on the Change Management process, spending the resources on additional planning and documentation, sitting through all the meetings, and prolonging the time required to get a task accomplished, did CM reduce service disruptions and save more money than was invested in the process? If not, the program was a failure and we’re spending our money in the wrong place.
Change Management can’t solve unanticipated problems due to change because it addresses execution not content. Change Management is about managing risk, not improving the quality of changes. There is nothing in the Change Management process that addresses the particulars of the change – that is addressed in the Service Design process. There are very few organizations supporting network services that have a comprehensive design process. More often the process is abbreviated if accomplished at all, and systems are provisioned directly. If looking at processes from an ITIL perspective most organizations have strong Service Operations processes, the big Service Transition process – Change Management, but Service Strategy and Design are usually lacking.
I’ve been designing and installing telecommunications systems for almost 30 years. I’ve held a variety of positions supporting small to large networks and seen a variety of approaches to engineering and provisioning. Although labels and pigeon holes don’t adequately explain the wide variety of approaches in use, we can use a few broad categories to generally describe them.
This approach was typical 20 years ago. It relies on a small team of highly qualified network engineers who solve problems on the back of a napkin and provision systems directly. If there is a problem with the network service or a new capability needs to be added, a network engineer will come up with a solution and implement it on the devices in question. This isn’t to imply that there is no planning – on the contrary, there is planning, but each implementation is planned and executed individually. Sure there are standards, but they’re more often informal.
This isn’t necessarily a bad approach; it works well for small networks. If there is a highly skilled staff that communicates frequently, this can be managed on an informal, ad-hoc basis. The trouble is that as the network grows and management tries to save money by staffing the engineering department with less experienced engineers, mistakes start to appear from unexpected non-standard configurations and errors. At this point management steps in to try to rein in the boys.
This approach is similar to the previous approach with the addition of a change management program. In an attempt to reduce unexpected service disruptions caused by change, a formal change management process is established to control how changes are executed and manage the impact of change disruptions. Changes are well documented and scrutinized by a Change Advisory Board (CAB). Impact assessments are presented to the CAB and the change is categorized based on the risk and impact. Specific change window periods are established and the implementations are managed. This forces the engineering staff to develop a more thorough implementation plan, but it doesn’t address the fundamental problem.
In my opinion, this approach is a complete waste of time because it doesn’t address the problem – it addresses the symptoms. Not that Change Management is bad – it has its place and is necessary. What causes unexpected service disruptions during a change implementation? Unless your installers are under-qualified, it’s not how the implementation is executed. It’s what is being done. All this approach does is impose a great deal of management oversight and increase the service order to delivery time by adding control gates.
Change Management can’t control unexpected behavior because Change Management focuses on the execution of the change. If the impact of every change was known for certain, then the implementation could be managed without unexpected consequences. How can the impact be known with a high degree of certainty? By designing the network service as a system rather than designing each implementation, which is actually skipping the design process and jumping straight to provisioning. This is putting the cart before the horse. This is the most common practice in use and is why IT managers look to outsourcing network services. Herding cats is difficult if not impossible.
In addition to Change Management, standardized templates and compliance checking are often implemented in an attempt to standardize configuration across larger, more complex networks. Often an IT management framework such as ITIL is embraced; however, seldom is the network service subject to the ITIL Service Design and Release Management processes. In this model, standard IOS images and configuration templates are developed to describe each element of the network configuration. These templates may be broken down into smaller, more manageable sub-components such as base, network management, interface configuration, routing protocols, traffic engineering, tributary provisioning, performance, etc. These templates are then used as a standard to check network device configurations against through some compliance checking mechanism such as Riverbed NetDoctor or HPNA.
This is a large step in the right direction, but it still fails to address the fundamental problem. Configuration Management is important, but it addresses a symptom rather than the problem. There will often be a large number of devices out of compliance, and bringing them into compliance is a burdensome process in a large network with a tight change management process. This is because they’re still skipping the design process and the operations managers have little confidence in the design – because in the larger context of the entire network, the design is non-existent.
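To illustrate what checking a device configuration against a standard template means at its simplest, here is a minimal sketch. It is not how NetDoctor or HPNA work internally; it only tests whether every line of a hypothetical base template appears in each device's configuration. The template lines, device names, and function names are all assumptions made for the example.

```python
def missing_lines(template, device_config):
    """Return the template lines that do not appear in the device configuration."""
    present = {line.strip() for line in device_config.splitlines() if line.strip()}
    return [
        line.strip()
        for line in template.splitlines()
        if line.strip() and line.strip() not in present
    ]


def compliance_rate(template, configs):
    """Fraction of devices whose configuration contains every template line.

    configs: dict mapping device name -> configuration text.
    """
    compliant = sum(1 for cfg in configs.values() if not missing_lines(template, cfg))
    return compliant / len(configs)


# Hypothetical "base" sub-template and two hypothetical device configurations.
BASE_TEMPLATE = """
service password-encryption
no ip http server
logging host 192.0.2.10
"""

configs = {
    "sw1": "hostname sw1\nservice password-encryption\nno ip http server\nlogging host 192.0.2.10\n",
    "sw2": "hostname sw2\nservice password-encryption\nlogging host 192.0.2.10\n",
}

gaps = missing_lines(BASE_TEMPLATE, configs["sw2"])
rate = compliance_rate(BASE_TEMPLATE, configs)
```

In this example `sw2` lacks one template line, so half the devices are compliant. A check like this reports drift after the fact; it does not explain why the drift was introduced.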
It’s interesting to note that most large managed service providers are at this stage. This is partially because they have little control over the customer’s service strategy and design processes. The service contract primarily addresses service transition and operation. The metrics used to evaluate provider performance are largely operations related – availability, packet loss, etc. Providers are able to productize operations and transition processes to fit any environment. This contributes to difficulty getting the provider to implement changes. It’s in their best interest to keep the network stable and reliable.
There is a paradigm associated with designing a network using COTS products that causes network engineering workcenters to disregard the conventional engineering process. Consider the design of an aircraft platform. Engineers don’t go out and build aircraft from scratch and create each one slightly different. Engineers design a blueprint that addresses all aspects of the design and system lifecycle. The production team takes that blueprint and fits the factory to build many aircraft using this design. Retrofits follow a similar process. Consider a software engineering project. The developer develops the application for each platform it is to be released on and releases an installation package for each platform. That installation package takes into account any variation in the platform. One package may be released to install on various Windows OSs, another on various Linux OSs, another for the supported Mac OSs. This has been thoroughly tested prior to release. The installation package installs with a high degree of certainty. Enhancements and fixes are packaged into release packages. Patches that have a time constraint are released outside the regular release schedule. Imagine if the developer released a collection of executables, libraries, and support files and expected the installer to configure it correctly based on the system it was being installed on. The results wouldn’t be very certain and there would be a large number of incidents reported for failed installations. Imagine if the aircraft designer released a set of guidelines and expected the factory to design each aircraft to order. I’d be taking the train! If this seems logical, then why do most organizations skip the design process for IT/telecom systems and jump straight to provisioning? This is because the system is a collection of COTS products and the design consists primarily of topology and configuration. This doesn’t make the design process any less vital.
Under this model, the network is considered a service and the design process creates a blueprint that will be applied wherever the service is provisioned. Standards and templates are part of that design, but there is much more. The entire topology and system lifecycle are addressed in a systematic way that ensures that each device in the network is a reflection of that design. There is a system of logic that describes how these standards and templates are used to provision any component in the network. Enhancements and fixes are released on regular cycles across the entire network and the version of the network configuration is managed. This approach takes most of the guess work out of the provisioning process.
The ITIL Service Design process treats a service similar to the way the aircraft and software application are handled in the above examples. When the network is treated as a service that must be subject to this same rigorous engineering process, the result is improved efficiency and a high degree of predictability that reduces service disruptions caused by unexpected problems encountered during changes. This requires a great deal more engineering effort during the design and release processes, but the ROI is improved availability and reduced effort during implementation. Implementing the release package becomes a turn-key operation that should be performed by the operations or provisioning team rather than engineering. This paradigm shift often takes some time for an organization to grasp and function efficiently in, but it will improve performance and efficiency and pave the way toward automated provisioning.
This is the Zen of the network service design continuum. It can’t be achieved unless there is a fundamental shift in the way the network engineering staff approaches network design. Engineering produces a blueprint that is implemented with a high degree of certainty. The network service is designed as a system with a well developed design package that addresses all aspects of the network topology and system lifecycle. Network hardware is standardized and standard systems are defined. Standards are developed in great detail. Configurations are designed from a systemic perspective in a manner that can be applied to standard systems using the other standards as inputs. The CMDB or some other authoritative data source will contain all the network configuration items and the relationships between them. A logical system is developed that addresses how these standards and relationships will be applied to any given implementation. This is all tested on individual components and as a system to ensure the system meets the desired design requirements and to assess the impact of any changes that will have to be applied as a result of the release.
At this point the logic that has been developed to take the design and translate it to an implementation (provisioning) can be turned into an automated routine that can produce the required configurations to provision all devices necessary to make any given change.
Some of the controls such as compliance checking become more of a spot check to verify that the automation is working effectively. Network engineers are no longer involved in provisioning, but in designing the service in a larger context. Provisioning becomes a repeatable process with a high degree of certainty. This greatly reduces the risk that Change Management is attempting to control and makes this a workable process.
Most organizations with large or complex networks would benefit greatly from automated network provisioning.
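As a rough illustration of such a routine, the sketch below combines a design template with per-interface records of the kind a CMDB might hold and renders the configuration text. Everything specific here is an assumption for the example: the template content, the record fields, and the function name are hypothetical, and a real design would carry far more logic than simple substitution. It uses only Python's standard `string.Template`.

```python
from string import Template

# Hypothetical interface template from the service design.
INTERFACE_TEMPLATE = Template(
    "interface $port\n"
    " description $description\n"
    " ip address $address $netmask\n"
    " no shutdown\n"
)


def render_interfaces(records):
    """Render configuration text for a list of interface records.

    Each record is a dict supplying every field the template names.
    substitute() raises KeyError when a field is missing, so incomplete
    source data fails loudly instead of producing a partial configuration.
    """
    return "".join(INTERFACE_TEMPLATE.substitute(record) for record in records)


# Hypothetical records, standing in for data pulled from the CMDB.
records = [
    {
        "port": "GigabitEthernet0/1",
        "description": "uplink to core",
        "address": "192.0.2.1",
        "netmask": "255.255.255.252",
    },
]

config_text = render_interfaces(records)
```

Because the same template and logic produce every device's configuration, the output is repeatable: the same design release and the same source data always yield the same configuration.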