Well so far 2008 has been incredibly busy. I’ve been involved with putting a major project live whilst also completing work on the new Xense Profiler for DFC. What with trying to have a life outside work that has left me with very little time to look at the support forums or to post here. Hopefully I’ll have a bit more time from March onwards.
Anyway February was looking a little empty so here’s a small thought. I’ve been spending time recently thinking about system availability. Now traditionally when you ask systems people to start thinking about these sorts of issues they immediately start thinking of resilience, load-balancing and clustering solutions. Now these definitely have their place but it’s useful to step back a bit and think about what we are trying to achieve.
What is availability?
If you look for a definition in a text book or somewhere on-line you are likely to find something similar to the following:
Availability is a measure of the continuous time a system is servicing user requests without failure. It is often measured in terms of a percentage uptime, with 100% being continuously available without failure.
In reality 100% isn’t possible and you will see requirements quoted in terms of 99%, 99.9%, 99.99% and so on. The mythical uptime requirement is usually the ‘5 nines’, 99.999%, which allows only around 5 minutes of downtime per year. If you find yourself being asked for this then the requesting department had better have deep pockets.
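To put numbers on those targets, here is a quick sketch of the arithmetic. It assumes a 365.25-day year and counts every minute of the year as in scope (no agreed maintenance windows); the function name is just mine for illustration.

```python
# Minutes in an average year, assuming 365.25 days (an assumption, not a standard).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime budget per year, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# The targets quoted above, from two nines to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: roughly 3.7 days at 99%, under 9 hours at 99.9%, under an hour at 99.99% and a little over 5 minutes at 99.999%.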
Causes of unavailability
Behind this apparently clear and simple definition are a load of questions. If you think about it there are all sorts of reasons why a service could be unavailable:
- Failure of a hard disk
- Network failure
- Operating system crash
- Software bug
- Operator error (I usually call this the ‘del *.*’ problem)
Which of these could resilience protect against? The first 3 could probably be addressed by:
- hardware resilience (RAID)
- load-balancing and resilient network infrastructure
But what about the last 2? Most of the schemes mentioned above operate below the application layer, so problems like software bugs are not likely to be solved by load-balancing or clustering. In general these sorts of problem need to be addressed by a monitoring and alerting system.
Operator errors, of course, are rather more difficult to cope with. Hopefully you have an adequate Disaster Recovery procedure that minimises the damage, although in certain situations you are likely to lose some data. Even so, recovery times for typical Disaster Recovery procedures are usually measured in hours, possibly extending to days in the worst case scenario. As usual, the more money you spend the lower the impact on the availability target.
So which of these is the most common? I would bet that the last one (the one it is most difficult to successfully protect against) is far more common than you might think. This is certainly the view of Hennessy and Patterson in their book Computer Architecture: A Quantitative Approach.
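To see what a recovery measured in hours does to a target, here is a rough sketch that turns a single outage into the best availability figure you could still report for that year. The 4, 24 and 72 hour recovery times are made-up examples, and it again assumes a 365.25-day year with no other downtime at all.

```python
# Hours in an average year, assuming 365.25 days.
HOURS_PER_YEAR = 365.25 * 24


def achieved_availability(outage_hours: float) -> float:
    """Percentage uptime for a year containing a single outage of the given length."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


# Hypothetical recovery times: a quick restore, a full day, a three-day rebuild.
for outage in (4, 24, 72):
    print(f"{outage} hour outage -> {achieved_availability(outage):.3f}% for the year")
```

Under these assumptions even the 4 hour recovery leaves you short of 99.99% for the year, and a three-day rebuild leaves you only just above 99%.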
Well it’s nearly my dinner time and I have promised my long-suffering family to finish here. When presented with those bland ‘the system must be available for xx.xxx%’ requirements, make sure you (and more importantly your customer/business) realise the implications of what they are asking for.