Friday 4 September 2015

There is no excuse for predictable and preventable downtime.

So in my last post I said that I thought the ability to scale a network was its second most important property, and challenged the reader to think about what the most important property might be (or at least what they thought my opinion of it would be).

So the answer to the challenge is availability.

It seems to me that if the network isn't available whenever its users want to use it, then literally everything else about the network, or done for or to the network, is a waste of time. It doesn't matter how scalable the network is, how well it performs, or how cost effective it is if it isn't available when people expect to use it.

I came to this conclusion after considering both an early professional experience and a vendor's former slogan.

In the early 1990s, I was working in the central office, and more specifically the IT section of the Department of TAFE (Technical and Further Education) in South Australia. Previously I'd been working part time at one of the metropolitan TAFE colleges.

This was what I'd call my first full time industry job (or rather, my first full time two month contract in a corporate environment). I was brought in to set up the department's first Novell Netware 3.11 server and to help add the department's PCs to the network so they could store their files on the new Novell server. This allowed the PCs' users to share files, to be sure their files were being backed up, and to have access to shared printers (if you ever hear the term "shared file and print" services, this is the scenario being referred to).

The reason I was brought in was that I had been setting up the same sort of file and print environment for classrooms at the TAFE college I'd been working at. Although I didn't realise it at the time, the difference was that in a classroom environment system or service availability wasn't quite as critical: lessons were limited to a few hours, so the impact of an outage was reasonably limited. In the central office of the department, however, availability was far more critical, because without the network users wouldn't have been able to do their jobs. Thinking back, I don't think I really registered the distinction, because I mostly just thought things should work regardless.

So after we in the IT section had set up the Novell file server, and the backbone of the 10Base-T network had been deployed through the building, we went around to "sell" access to the file and print services on the network.

I'm not completely sure why, but rather than there being a corporate direction to "network everything", with associated budgets, it was a more piecemeal approach, where each department would decide whether or not to connect to the network. If they did, they would cover the costs of doing so (I remember it was $220 per PC in 1990/1991, which was quite a lot of money, and covered the network card, the cabling and the Novell Netware client/connection license).

Perhaps it was because the idea of "networking everything" would have been too radical at the time for the Department of TAFE, and an incremental deployment, with each department covering its own costs, was a better approach. Or maybe there was no definite plan at all, and the idea of shared file and print services was so compelling to the IT section that they decided to "go ahead anyway" for their own benefit, with other departments able to have that benefit too if they were interested.

Once a department had decided to connect to the network, we would then walk around and schedule a time with each individual PC user to attach their PC to the network. As part of that discussion we would also describe why we were doing it, and what they would be able to do after we did it. In particular, we would say something like "if you store your files on the network, they're backed up."

"But what if the network fails?"

Groan. Compared to the single hard disk drive in the desktop PC, which either wasn't being backed up at all or required manual backing up by its user, our network was far more reliable.

However, considering the question from their perspective, it was understandable: instead of their data being stored in the tangible box right in front of them on their desk, we were proposing it be stored somewhere else, somewhere intangible that they couldn't easily see or touch if they wanted to. (In the "cloud". Hah, see, not a new idea, just a new name for it!)

This question felt like a personal affront. We'd worked hard to make sure things were reliable. Our Novell server had mirrored disks and was being backed up each night. Our network was built with quality equipment from Cisco and SynOptics by very competent people. It felt to me like the PC user was questioning my competence, my ability to do my job.

Admittedly the PC user didn't have a choice, as their boss had made that decision. Even so, this indirect and inadvertent questioning of my ability to deliver what I said I'd deliver still irked me.

So that is the experience that made me consider availability to be the most important property of a network. The network should be available enough that its users can have confidence in it being there and working whenever they try to use it.

The now abandoned vendor slogan was "The network works. No excuses."(tm), which was Cisco's during part of the 1990s.

Having considered it over the years, I really liked the slogan (and even had it on a Cisco baseball cap that I'd got from somewhere or other - sadly left at a pub in North Adelaide), because I thought it summed up the approach I generally had, and in particular the approach we'd had when rolling out that early Novell network.

However, I think there actually are valid excuses for why the network "doesn't work" ... which I'll get to.

So, the title of this blog post, "There is no excuse for predictable and preventable downtime.", is something I came up with in around 2006. I think it is a slightly more in-depth version of the "The network works. No excuses." slogan, yet allows for some qualification.

The key term in the title is "predictable and preventable".

"predictable" means that you can foresee something possibly occurring, even though it may not occur often or may never occur at all.

"preventable" means that you have the resources (e.g., time or money) to ensure that it doesn't.

In the context of service or network downtime, if you are able to both predict and prevent downtime, then you have no valid excuse if it occurs.

On the other hand, if downtime occurs because of an unpredictable occurrence, then that is a valid excuse.

Or, if you could predict the downtime, but you didn't have the resources to prevent it, well that is also a valid excuse. If your request for redundancy resources is declined and then downtime occurs because there is no redundancy, then you are not to blame.

So sometimes there are valid excuses for downtime, but only if you can't predict and/or prevent it.
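
To make the rule explicit, here is a toy sketch of the logic as code (purely illustrative; the function and its name are mine, not any real API):

    # Toy formalisation of this post's rule: downtime is excusable
    # unless it was both predictable and preventable.
    def excuse_is_valid(predictable: bool, preventable: bool) -> bool:
        return not (predictable and preventable)

    # Enumerate the four cases discussed above:
    for predictable in (False, True):
        for preventable in (False, True):
            print(f"predictable={predictable}, preventable={preventable}: "
                  f"valid excuse = {excuse_is_valid(predictable, preventable)}")
    # Only predictable=True, preventable=True prints False:
    # there is no excuse for predictable and preventable downtime.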

Monday 31 August 2015

"The only real problem is scaling."

"The only real problem is scaling. All others inherent are from that one. If you can scale, everything else must be working." - Mike O'Dell, Chief Scientist, UUNet, MPLS Conference, Nov 1998.

The above was quoted in the book "MPLS - Technology and Applications", by Bruce Davie and Yakov Rekhter, in chapter 8, "Virtual Private Networks", under the topic of "Scaling".

I was reading this book in around 2003. The quote really struck me, as in my previous 13 years of working I'd encountered enough instances where I'd either hit scaling limits or seen scaling limits overcome.

The other example of a scaling issue described in the book was the problem of maintaining router control plane routing protocol neighbor adjacencies across a large ATM network.

In this case, the large ATM network was built to perform traffic engineering at layer 2, while many routers on the edge of the ATM network then formed layer 3 routing protocol neighbor adjacencies across the layer 2 network.

The trouble with this model is that a large number of routers can be attached to a large ATM network, and there are limits to how many routing protocol adjacencies a router can maintain. For example, in a normal network a router would typically have no more than perhaps 3 to 5 routing protocol neighbor adjacencies, whereas in this ATM model the adjacencies may number in the 100s, because there are 100s of routers attached to the ATM network.
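
To put rough numbers on that (a sketch in Python; the 300-router figure is my illustrative assumption, as all we know is "100s of routers"):

    # Adjacency arithmetic for routers fully meshed over a layer 2 cloud.
    def adjacencies_per_router(routers: int) -> int:
        # In a full mesh overlay, each router peers with every other router.
        return routers - 1

    def total_sessions(routers: int) -> int:
        # Total point-to-point routing protocol sessions in the full mesh.
        return routers * (routers - 1) // 2

    routers = 300  # assumption: "100s of routers attached to the ATM network"
    print(adjacencies_per_router(routers))  # 299 adjacencies per router
    print(total_sessions(routers))          # 44850 sessions network-wide
    # Compare with the 3 to 5 adjacencies per router once the layer 2 and
    # layer 3 domains are flattened into a single network.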

These two things from the book were also significant to me because I'd fairly recently finished working at UUNet (August 2000 - July 2002) and knew the ATM example had come from UUNet.

More significantly, while at UUNet I worked on their UUsecure product, an IPsec based VPN product that operated over the top of UUNet's Internet service backbone. There we had encountered the same sort of router control plane scaling problem.

One of the VPNs being built using the product had 10 000 sites (yes, 10K - I think an unimaginable number for most people who have built or are building IPsec based VPNs). For redundancy, each site had two point-to-point IPsec tunnels, one to each of two central hubs, meaning 20 000 tunnels and 20 000 routes (one route for each site, times two because each was announced twice).

With IPsec tunnels, the only way to detect whether they're working is to send traffic over them. IPsec keep-alives weren't available at the time, and anyway, we needed dynamic routing to support failover between the central hubs and any multihomed sites. BGP was the choice, because OSPF wouldn't scale to 10 000 sites and 20 000 routes (so yes, we were using BGP as an IGP, and also had 20 000 BGP sessions, because BGP sessions are also point-to-point).

The trouble was that although the routers in question (Cisco 7206VXRs with NPE-400s, if I recall correctly) could handle up to 600 IPsec tunnels with the addition of a hardware crypto accelerator, once we added the overhead of running BGP, and in particular BGP keep-alives, over the tunnels, we could only support 200 IPsec tunnels per 7206VXR. If you do the maths, that means a total of 100 7206VXRs were required at the hub sites to build this VPN. I've seen pictures of them... racks and racks of 5 RU 7206VXRs. (Very costly, but still cheaper and more secure than the alternative VPN technologies available at the time.)

If we could have somehow avoided all the BGP sessions and corresponding keep-alives, we would have only needed 32 instead of 100 7206VXRs to support 20 000 IPsec tunnels, meaning much less money spent on 7206VXRs, much less power and much less rack space.
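
Here's a quick back-of-the-envelope check of those numbers (a sketch; the per-chassis capacities are the figures quoted above, the ceiling arithmetic is mine):

    import math

    sites = 10_000
    tunnels = sites * 2        # two point-to-point IPsec tunnels per site
    routes = sites * 2         # one route per site, announced over both tunnels
    bgp_sessions = tunnels     # BGP sessions are point-to-point, one per tunnel

    per_chassis_with_bgp = 200     # 7206VXR limit with BGP keep-alive overhead
    per_chassis_crypto_only = 600  # 7206VXR limit with hardware crypto alone

    print(math.ceil(tunnels / per_chassis_with_bgp))     # 100 chassis at the hubs
    print(math.ceil(tunnels / per_chassis_crypto_only))  # 34 chassis - roughly a
    # third of the money, power and rack space (in the same ballpark as the 32
    # figure above)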

In "MPLS - Technology and Applicatons", the authors say that the solution to this router neighbor adjacency scaling problem in the ATM network scenario would be to have the layer 3 routers become adjacent directly with the ATM switches, and have the routers and ATM switches communicate with each other about topology and path setup, including for traffic engineering purposes. This would reduce the number of router routing protocol neighbor adjacencies back down to the normal numbers of no more than 3 to 5.

In other words, the ATM switches effectively become members of the layer 3 routing domain, and the routers effectively become members of the ATM forwarding domain. The layer 2 and layer 3 networks have been flattened into a single network, rather than one network overlaid on top of another.

This is one of the problems that MPLS solves.

During the development of MPLS, it was realised that it wouldn't be all that hard to build MPLS based VPNs using a label stack. This probably put an end to large scale IPsec VPNs, despite losing the Confidentiality, Integrity and Authenticity that IPsec provides and MPLS VPNs don't. I'm sure certain government agencies welcomed the success of unencrypted and unauthenticated MPLS VPNs.

Over the years I've occasionally looked to see if there is work on methods of encrypting LSP traffic, and occasionally there has been. Looking just now, there is a current Internet Draft to do so - Opportunistic Security in MPLS Networks. Still not quite as secure as IPsec: it would be PE-to-PE and opportunistic, rather than mandatory CE-to-CE/site-to-site encryption (unless the CEs were involved in setting up encrypted LSPs). Also note the draft is Experimental rather than Standards Track.

Hmm, well that has ended up a lot longer than I expected. I was going to use that story as an intro to some observations and thoughts on scaling that I've had both before and since reading Mike O'Dell's quote. I'll keep them for another blog post.

To finish on at least some sort of moral, lesson or observation, "flatten to reduce neighbors"! :-)

Actually, I'll finish on something else. I've come to consider the second most important property of a network to be the ability to scale it. I'll leave the reader to have a think about what the most important property might be (or at least, what you think my opinion of the most important property is).


