
The Power of Abstraction

I’ve noticed a trend with almost every company I’ve consulted for: most network engineering does not use abstract design, but rather provisions each element in a concrete manner. This is not a cost-effective approach, for many reasons.

Prevailing Paradigm

There is a paradigm associated with designing a network using COTS products that causes network engineering production centers to disregard the conventional engineering process.  Consider your automobile.  The last time you went to the repair shop, did the mechanic go through the entire car to ensure all the correct parts were installed?  Of course not; the VIN told them which build was used, and all cars of that build were identical except for a few items that made them unique.  Even the options were identical to those on the same model with those options.  This was not done solely for the benefit of the consumer, but because it is the most cost-effective way to manufacture and maintain the vehicle throughout its lifecycle.  This principle can be seen in most industries... except IT.  Why is compliance software for network systems so popular and valuable?  Because network devices are seldom configured according to a standard.  In the cases where they are, to some extent, they are configured using templates that have to be applied manually with the variables entered manually, so they still vary.  This would be like automotive engineers assembling cars by hand.  It isn’t cost-effective, for a number of reasons.

I was demonstrating a proactive change validation process for a large enterprise customer.  They provided me the configlets and the change documentation for an upcoming change.  I modeled the current network, applied the proposed changes in the simulation, and found several errors that would have made the modified network unable to route traffic.  They used templates to create the configlets used in the change, but they used one incorrect template and populated the templates with some incorrect variables.  The change was an upgrade that had been accomplished at many locations and was standardized to some extent.  If this had been implemented as-is, the implementation engineer would have made the necessary modifications to make the system operational and, in doing so, introduced some degree of variance.

This paradigm exists in service operations as well as in design and provisioning.  Tier III often changes the configuration of a device to resolve an incident.  This would be like an auto repair shop making changes to your automobile’s design to fix a problem.  It’s not running correctly, so they install spark plugs that are different from the manufacturer’s specification.  The repair shop wouldn’t do that.  Why is it commonplace in IT?  This is basically ad-hoc system redesign.  If the configuration of a device needs to be changed to resolve a problem, then the system was designed wrong.  There are only a few exceptions to this.  For example, if a router at a remote site has a hot spare interface, it is often configured and disabled.  In the event of a failure, the spare interface is enabled, and possibly readdressed, to take the place of the failed one.  This isn’t really a redesign; it is an operational procedure used in a failure scenario.

This problem has a snowball effect.  Because there are so many variations in the network design, there is no feasible way to test design improvements.  If there were standardization, each variation of the standard systems and sub-systems could be tested in the lab/QA environment.  But because there is really no standard, and too many exceptions to any that exist, the only way to adequately test anything would be to replicate the entire network in QA.  Because of this, the rate of unsuccessful changes and of unexpected impacts from changes is extremely high.  Management attempts to mitigate this through more rigorous change management, which cannot solve this problem and adds delay and effort to the change process.  In the end it costs the organization lost productivity due to system downtime, plus unnecessary labor to manage change and resolve incidents caused by change.

ITSM Framework

The ITIL Service Design process treats a service, such as a network service, much the way the automotive industry treats a vehicle in the previous examples.  When the network is treated as a service that must be subject to this same rigorous engineering process, the result is improved efficiency and a high degree of predictability, which reduces service disruptions caused by unexpected problems encountered during changes.  This requires a great deal more engineering effort during the design and release processes, but the ROI is improved availability and reduced effort during implementation.  Implementing the release package becomes a turn-key operation that should be performed by the operations or provisioning team rather than engineering.  This paradigm shift often takes some time for an organization to grasp and function efficiently in, but it will improve performance and efficiency, and it paves the way toward automated provisioning.

In order to accomplish this, the design must be abstracted in a way that still expresses the level of detail necessary to drive physical assembly and logical provisioning: naming, addressing, routing configuration, policy, management configuration, VLAN assignment, etc. This is most certainly possible because all of these things follow a system of logic; they are not arbitrarily assigned.
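As a rough illustration of what "following a system of logic" can mean, here is a minimal Python sketch in which a site is described only by a code and a numeric ID, and the hostname, VLANs, subnets, and gateways are all derived from that. The naming pattern, the 10.<site_id>.<vlan>.0/24 addressing rule, and the VLAN numbers are assumptions invented for this example, not a recommended standard.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical design rules. A real standard would define its own.
VLAN_ROLES = {"data": 10, "voice": 20, "mgmt": 99}


@dataclass(frozen=True)
class Site:
    code: str     # short site code, e.g. "nyc"
    site_id: int  # unique index (0-255) that drives all addressing


def derive_site_config(site: Site) -> dict:
    """Derive concrete values for one site from the abstract rules."""
    if not 0 <= site.site_id <= 255:
        raise ValueError("site_id must fit in one octet under this rule")
    vlans = {}
    for role, vlan_id in VLAN_ROLES.items():
        # Assumed rule: subnet is 10.<site_id>.<vlan_id>.0/24
        subnet = ipaddress.ip_network(f"10.{site.site_id}.{vlan_id}.0/24")
        vlans[role] = {
            "vlan": vlan_id,
            "subnet": str(subnet),
            # Assumed rule: the gateway is the first host address
            "gateway": str(subnet.network_address + 1),
        }
    return {"hostname": f"{site.code}-rtr01", "vlans": vlans}


if __name__ == "__main__":
    print(derive_site_config(Site(code="nyc", site_id=12)))
```

Because nothing is typed in by hand per device, two sites built from the same rules can differ only in the inputs, which is the property the automotive analogy is after.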

An example of this can be seen in Windows system deployment and management.  In the ’90s, if you wanted to install a Windows server, you would insert a disk into the server and go through an installation process.  If you were really on your game, you could create an installer init file that answered most of the questions the install utility would ask.  Any custom configuration would need to be accomplished manually, one machine at a time.  The advent of system images and group policy provided a means to abstract the system design in a way that lets an enterprise easily provision new systems identically and manage them very efficiently.

Conclusion

While there is no out-of-the-box product that provides a mechanism to abstract the network design in the manner that Windows uses images and GPO, it is certainly not out of reach.  The mechanisms to design networks using abstract constructs can be developed or integrated, and are worth the effort in large environments.

The larger problem is changing the paradigm.  I worked on a project where we developed an Operational Support System (OSS) that provided automated provisioning.  The customer entered the service order into a CRM system, which caused the downstream provisioning system to push out all the config changes necessary to provision the service on the network devices.  The system development took us 7 years, but it took just as long to change the organizational mindset to be able to see network design in terms of abstract constructs.

Meaningful Metrics

Performance Management

Choosing metrics that provide insight into the health and performance of systems and processes can be challenging.  Metrics need to be aligned with the requirements of the systems and processes that they support.  While many performance management systems provide useful metrics out of the box, you will undoubtedly have to define others yourself and determine a means to collect and report them.

I break metrics down into two major categories: strategic and operational.

 

Strategic Metrics

 

Strategic metrics provide a broad insight into a service’s overall performance.  These are the type of metrics that are briefed at the manager’s weekly meeting.  They usually aren’t directly actionable, but are very useful for trending.

Strategic metrics should be used to evaluate the overall effect of process or system improvements.  Healthy organizations are involved in some manner of Deming-style continuous process improvement (CPI), which also applies to system/service design.  As changes are implemented, metrics are monitored to determine whether the changes improved the system or process as expected.

Some examples of strategic metrics are: system availability, homepage load time, and incidents identified through ITSM vs. those identified by customers.  These provide a high level indicator of performance more closely related to business objectives than to specific system or process operation and design criteria.
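Two of these are simple enough to sketch in a few lines of Python. The function names and the choice of minutes as the unit are assumptions made for this illustration; the arithmetic is just the usual ratio of uptime to total time, and of internally detected incidents to all incidents.

```python
def availability_pct(period_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the reporting period the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def internal_detection_pct(found_by_itsm: int, found_by_customers: int) -> float:
    """Share of incidents caught by ITSM before a customer reported them."""
    total = found_by_itsm + found_by_customers
    # With no incidents at all there is nothing to report; return 0.0 here.
    return 100.0 * found_by_itsm / total if total else 0.0


if __name__ == "__main__":
    # One week is 10,080 minutes.
    print(round(availability_pct(10_080, 30), 3))
    print(internal_detection_pct(found_by_itsm=45, found_by_customers=5))
```

Neither number tells anyone what to fix, which is the point: they are trended week over week to see whether improvements are having the intended effect.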

 

Operational Metrics

 

Operational metrics provide detail; they help identify service disruptions and problems, support capacity planning, and highlight areas for improvement.  These metrics are often directly actionable.  Operations can use these metrics to proactively identify potential service disruptions, isolate the cause of a problem, and evaluate the effectiveness of the team.  Engineering uses these metrics to determine if the service design is meeting the design requirements, identify areas for design improvements, and provide data necessary for planning new services and upgrades.

Good metrics should be aligned with operational factors that indicate the health of the service and the design requirements.  Metrics, just like every other aspect of a system design, are driven by requirements. The specific design requirements and criteria should be used to define metrics that measure how well that aspect of the service is meeting the specified design objective.  Historical metrics are valuable for baselining performance and can be used to configure thresholds or serve as a historical reference in problem isolation and forecasting.
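One simple way to turn history into a threshold is to alert when a value rises well above its historical mean. The Python sketch below uses mean plus k standard deviations; that rule, the default k of 3, and the function name are assumptions chosen for illustration, and many environments will need something more seasonal-aware.

```python
import statistics


def baseline_threshold(history: list[float], k: float = 3.0) -> float:
    """Return mean + k * sample standard deviation of the history.

    Needs at least two samples, since a spread cannot be estimated from one.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return statistics.mean(history) + k * statistics.stdev(history)


if __name__ == "__main__":
    # Hypothetical daily peak utilization (%) for one link.
    peaks = [41.0, 44.5, 39.8, 43.2, 42.1, 40.7, 45.0]
    limit = baseline_threshold(peaks)
    today = 58.3
    print(f"threshold={limit:.1f} breached={today > limit}")
```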

For example, if you have employed a differentiated services strategy, you should be monitoring the traffic volume and queue discards for each class of service you’ve defined.  This will help you understand whether your traffic projections are accurate and the QoS design is meeting the system requirements.  When the traffic profile changes, historical data can help identify the trends behind the change and determine whether it was due to growth, a new application or service, or a “Mother’s Day” traffic anomaly.
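A minimal sketch of that check in Python, assuming the per-class counters have already been collected (for example via SNMP or streaming telemetry; the collection step is out of scope here). The class names, counter fields, and discard budgets are hypothetical stand-ins for whatever the QoS design actually specifies.

```python
from dataclasses import dataclass


@dataclass
class ClassCounters:
    name: str            # class of service, e.g. "voice"
    forwarded_pkts: int  # packets transmitted from the queue
    discarded_pkts: int  # packets dropped by the queue


def discard_ratio(c: ClassCounters) -> float:
    """Fraction of offered packets that the queue discarded."""
    offered = c.forwarded_pkts + c.discarded_pkts
    return c.discarded_pkts / offered if offered else 0.0


def classes_over_budget(samples: list[ClassCounters],
                        budget: dict[str, float]) -> list[str]:
    """Names of classes whose discard ratio exceeds the design budget.

    A class with no budget entry is treated as having a budget of zero.
    """
    return [c.name for c in samples
            if discard_ratio(c) > budget.get(c.name, 0.0)]


if __name__ == "__main__":
    samples = [
        ClassCounters("voice", forwarded_pkts=9_990, discarded_pkts=10),
        ClassCounters("best-effort", forwarded_pkts=900, discarded_pkts=100),
    ]
    # Hypothetical design requirement: max tolerated discard ratio per class.
    budget = {"voice": 0.0001, "best-effort": 0.2}
    print(classes_over_budget(samples, budget))
```

Tying the alert to the design budget, rather than to a generic "discards > 0" rule, keeps the metric aligned with the requirement it is supposed to verify.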

 

Composite Metrics

 

Sometimes metrics are more valuable when correlated with other metrics.  This is true for both strategic and operational metrics.  In such cases it is often useful to create a composite metric.

Google, for example, has a health score composed from page load time and other metrics that is briefed to the senior execs daily.  In another example, perhaps the calls between the web front end and the SSO are only of concern if they are not directly related to the number of users connecting.  In this case a composite metric may provide operations a key piece of information to proactively identify a potential service disruption or reduce MTTR.
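The SSO case can be sketched as a ratio of two raw metrics compared against an expected value. In the Python below, the expected calls-per-user figure and the 50% tolerance are assumptions that would come from a baseline of the actual service; the function names are likewise hypothetical.

```python
from typing import Optional


def sso_calls_per_user(sso_calls: int, active_users: int) -> Optional[float]:
    """Composite metric: SSO calls normalized by connected users."""
    if active_users <= 0:
        return None  # ratio is undefined with no users connected
    return sso_calls / active_users


def is_anomalous(sso_calls: int, active_users: int,
                 expected: float, tolerance: float = 0.5) -> bool:
    """True when SSO load is out of proportion to the user count.

    expected  -- baseline calls per user (assumed known from history)
    tolerance -- allowed relative deviation from that baseline
    """
    ratio = sso_calls_per_user(sso_calls, active_users)
    if ratio is None:
        # Any SSO traffic with zero users connected is itself suspicious.
        return sso_calls > 0
    return abs(ratio - expected) > expected * tolerance


if __name__ == "__main__":
    print(is_anomalous(2_000, 1_000, expected=2.0))  # in proportion
    print(is_anomalous(9_000, 1_000, expected=2.0))  # far above baseline
```

Neither raw counter would trip an alert on its own during a busy period; the ratio is what separates "more users" from "something is looping on authentication."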

Few performance management systems have the capability to create composite metrics within the application.  There are always ways around that, but they usually involve writing custom glueware.

 

Keeping Focus

 

Metrics should have a specific purpose.  The consumers of the metrics should find value in the data – both the data itself and the way it is presented.  Like every aspect of the service, metrics should be in a Demingesque continual improvement cycle.  Metric definitions, the mechanism to collect them, and how they are communicated to their audience need to be constantly evaluated.

Metrics often become useless if the metric becomes the process objective.  Take the time to resolve an incident, for example.  This metric can provide valuable insight into the effectiveness of the operations staff and processes; however, it seldom does.  This is because most operations managers know they are judged on it and continually press their staff to close tickets as soon as possible to keep MTTR low.  The objective of the operations process is not to close tickets quickly, but to support customer satisfaction by maintaining the service.  Because the metric becomes the objective, it loses its value.  This is difficult enough to address when the service is managed in-house, but when it is outsourced, it is even more troublesome.  Operations SLAs often specifically address MTTR.  If the service provider is contractually obligated to keep MTTR low, they will focus on closing tickets even if the issue remains unresolved.

 
