Friday 4 September 2015

There is no excuse for predictable and preventable downtime.

In my last post I said that I thought the ability to scale was the second most important property of a network, and challenged the reader to think about what might be the most important property, or at least what they thought I might think it is.

So the answer to the challenge is availability.

It seems to me that if the network isn't available to use whenever its users want to use it, then literally everything else about the network, or that is done for or to the network, is a waste of time. It doesn't matter how scalable the network is, how well it performs, or how cost effective it is if it isn't available to serve people when they expect to use it.

I came to this conclusion after considering both an early professional experience and a vendor's former slogan.

In the early 1990s, I was working in the central office, and more specifically the IT section of the Department of TAFE (Technical and Further Education) in South Australia. Previously I'd been working part time at one of the metropolitan TAFE colleges.

This was what I'd call my first full time industry job (or rather, my first full time two-month contract in a corporate environment). I was brought in both to set up the department's first Novell Netware 3.11 server and to help add the department's PCs to the network so that they could store their files on the new Novell server. This would allow the PCs' users to share files, to be sure that those files were being backed up, and to have access to shared printers (if you ever hear the term "shared file and print" services, this scenario is what is being referred to).

The reason I was brought in was that I had been setting up the same sort of file and print environment for classrooms at the TAFE college I'd been working at. Although I didn't realise it at the time, the difference was that in a classroom environment, system or service availability wasn't quite as critical, because lessons were limited to a few hours, so the impact of an outage was reasonably limited. However, in the central office of the department, availability was far more critical, because without the network users wouldn't have been able to do their jobs. Thinking back, I don't think I really saw or thought about the distinction, because I mostly just thought things should work regardless.

So after we in the IT section had set up the Novell file server, and the backbone of the 10Base-T network had been deployed through the building, we went around to "sell" access to the file and print services on the network.

I'm not completely sure why, but rather than there being a corporate direction to "network everything", with associated budgets, it was a more piecemeal approach, where other departments would decide for themselves whether or not to connect to the network. If they did, they would cover the costs of doing so (I remember it was $220 per PC in 1990/1991, which was quite a lot of money, and covered the cost of the network card, the cabling and the Novell Netware client/connection license).

Perhaps it was because "networking everything" would have been too radical an idea at the time for the department of TAFE, and an incremental deployment with each department covering its own costs was a better approach. Or maybe there was no definite plan at all, and it was just that the idea of shared file and print services was so compelling to the IT section that they decided to "go ahead anyway" for their own benefit, and other departments could have that benefit too if they were interested.

Once a department had decided to connect to the network, we would then walk around and schedule a time with each individual PC user to attach their PC to the network. As part of that discussion we would also describe why we were doing it, and what they would be able to do after we did it. In particular, we would say something like "if you store your files on the network, they're backed up."

"But what if the network fails?"

Groan. Compared to the single hard disk drive in the desktop PC, that either wasn't being backed up at all, or required manual backing up by its user, our network was far more reliable.

However, considering their question from their perspective, it was understandable. Instead of their data being stored in the box right in front of them on their desk, which was something tangible, we were proposing that it be stored somewhere else, somewhere intangible that they couldn't easily see or touch if they wanted to. (In the "cloud". Hah, see, not a new idea, just a new name for it!)

This question felt like a personal affront. We'd worked hard to make sure things were reliable. Our Novell server had mirrored disks and was being backed up each night. Our network was built with quality equipment from Cisco and Synoptics by very competent people. Personally it felt to me like the PC user was questioning my competence; my ability to do my job.

Admittedly the PC user didn't have a choice as their boss had made that decision, however, it still irked me, this indirect and inadvertent questioning of my ability to deliver what I say I'll deliver.

So that is the experience that made me consider availability to be the most important property of a network. The network should be available enough that its users can have confidence in it being there and working whenever they try to use it.

The now abandoned vendor slogan was "The network works. No excuses."(tm), and was Cisco's during some part of the 1990s.

Having considered it over the years, I really liked the slogan (and even had it on a Cisco baseball cap that I'd got from somewhere or other - sadly left at a pub in North Adelaide) because I thought it summed up the approach that I generally had, and in particular the approach we had had when we were rolling out that early Novell network.

However, I think there actually are valid excuses for why the network "doesn't work" ... which I'll get to.

So the title of this blog post is "There is no excuse for predictable and preventable downtime.", which I came up with in around 2006. I think it is a slightly more in depth version of the "The network works. No excuses." slogan, yet allows for some qualification.

The key term in the title is "predictable and preventable".

"predictable" means that you can foresee something possibly occurring, even though it may not occur often or may never occur at all.

"preventable" means that you have the resources (e.g., time or money) to ensure that it doesn't.

In the context of service or network downtime, if you are able to both predict and prevent the downtime, then you have no valid excuse for it if it occurs.

On the other hand, if downtime occurs because of an unpredictable occurrence, then that is a valid excuse.

Or, if you could predict the downtime, but you didn't have the resources to prevent it, well that is also a valid excuse. If your request for redundancy resources is declined and then downtime occurs because there is no redundancy, then you are not to blame.

So sometimes there are valid excuses for downtime, but only if you can't predict and/or prevent it.
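
For the programmers reading, the rule above can be sketched as a tiny truth table. This is only an illustration of the reasoning: the function name `has_valid_excuse` and its two parameters are made up for this example and don't come from any real tool.

```python
from itertools import product


def has_valid_excuse(predictable: bool, preventable: bool) -> bool:
    """Hypothetical helper: downtime is excusable unless it was
    both predictable and preventable."""
    return not (predictable and preventable)


if __name__ == "__main__":
    # Walk through every combination of the two properties.
    for predictable, preventable in product([True, False], repeat=2):
        if has_valid_excuse(predictable, preventable):
            verdict = "valid excuse"
        else:
            verdict = "no excuse"
        print(f"predictable={predictable!s:5} preventable={preventable!s:5} -> {verdict}")
```

Only one of the four combinations leaves you without an excuse: the one where the downtime was both predictable and preventable.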