Some thoughts about availability

February 29, 2008 at 9:19 pm | Posted in Architecture, Performance | 1 Comment

Well, so far 2008 has been incredibly busy. I’ve been involved with putting a major project live whilst also completing work on the new Xense Profiler for DFC. Between that and trying to have a life outside work, I’ve had very little time to look at the support forums or to post here. Hopefully I’ll have a bit more time from March onwards.

Anyway, February was looking a little empty so here’s a small thought. I’ve been spending time recently thinking about system availability. Traditionally, when you ask systems people to start thinking about these sorts of issues they immediately reach for resilience, load-balancing and clustering solutions. These definitely have their place, but it’s useful to step back a bit and think about what we are actually trying to achieve.

What is availability?

If you look for a definition in a text book or somewhere on-line you are likely to find something similar to the following:

Availability is a measure of the continuous time a system is servicing user requests without failure. It is often measured as a percentage uptime, with 100% being continuously available without failure.

In reality 100% isn’t possible and you will see requirements quoted in terms of 99%, 99.9%, 99.99% and so on. The mythical uptime requirement is usually the ‘5 nines’, 99.999%, which works out at around 5 minutes of downtime per year. If you find yourself being asked for this then the requesting department had better have deep pockets.
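To put those percentages into perspective, here’s a quick back-of-the-envelope calculation (a small Python sketch; the figures are just the standard arithmetic of ‘nines’):

```python
# How much downtime does each availability target actually allow over a year?
MINUTES_PER_YEAR = 365.25 * 24 * 60   # ~525,960 minutes

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} uptime -> about {downtime_minutes:,.0f} minutes "
          f"({downtime_minutes / 60:.1f} hours) of downtime per year")
```

Even three nines only leaves you around nine hours a year to cover every patch, failed disk and fat-fingered command.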

Causes of unavailability

Behind this apparently clear and simple definition lie a load of questions. If you think about it there are all sorts of reasons why a service could be unavailable:

  • Failure of a hard disk
  • Network failure
  • Operating system crash
  • Software bug
  • Operator error (I usually call this the ‘del *.*’ problem)

Which of these could be protected against by resilience? The first three could probably be addressed by:

  • hardware resilience (RAID)
  • load-balancing and resilient network infrastructure
  • clustering

But what about the last two? Most of the schemes mentioned above operate below the application layer, so problems like software bugs are not likely to be solved by load-balancing or clustering. In general these sorts of problems need to be addressed by a monitoring and alerting system.
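As a rough illustration of what application-level monitoring means, here is a minimal sketch in Python that polls a hypothetical health-check URL and raises an alert after a few consecutive failures. A real monitoring system would do far more than this, but it makes the point that the check happens above the infrastructure layer, where RAID and clustering can’t see the problem:

```python
# Minimal application-level health check sketch. The URL and thresholds
# below are illustrative assumptions, not part of any particular product.
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://docbase.example.com/health"  # hypothetical endpoint
CHECK_INTERVAL_SECS = 60
FAILURES_BEFORE_ALERT = 3  # don't page anyone for a single blip


def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the application answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def monitor() -> None:
    consecutive_failures = 0
    while True:
        if check_once(HEALTH_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_ALERT:
                # In practice this would send an email, page an operator, etc.
                print(f"ALERT: {HEALTH_URL} failed {consecutive_failures} checks in a row")
        time.sleep(CHECK_INTERVAL_SECS)


if __name__ == "__main__":
    monitor()
```

The important design point is that the check exercises the application itself (can it actually answer a request?) rather than just confirming the box is up.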

Operator errors, of course, are rather more difficult to cope with. Hopefully you have an adequate Disaster Recovery procedure that minimises the damage, although in certain situations you are still likely to lose some data. Even so, typical Disaster Recovery times usually start at hours and can extend to days in the worst case scenario. As usual, the more money you spend the lower the impact on the availability target.

So which of these is the most common? I would bet that the last one (the one that is most difficult to successfully protect against) is far more common than you might think. This is certainly the view of Hennessy and Patterson in their book Computer Architecture: A Quantitative Approach.

Parting words

Well, it’s nearly my dinner time and I have promised my long-suffering family that I will finish here. When presented with those bland ‘the system must be available for xx.xxx%’ requirements, make sure you (and more importantly your customer/business) realise the implications of what they are asking for.
