Monday 25 February 2008

The Five Whys

A little while ago I read this, and was very interested in the application of the five whys.

So here goes nothing...

This morning, we were faced with a number of servers down, at our main site. It took until 10:25 to get everything important back up and running. We know yesterday that there was some problem, but our monitoring platform was one of the downed machines, so our view of the problem was somewhat clouded.

Two issues; first,

why were the servers down?


Why? The UPS supplying the servers in question had run it's batteries flat.

Why? There has been a planned power outage on the site, and the cutout on the supply to the offending UPS had tripped, so no power in to UPS, battery go flat.

Why? and at this point I am stuck, and need to talk to our power people.

But I feel I ought to continue...

Why? The servers in question either don't have redundant power supplies, or they're not connected.

Why? Cos we've not surveyed what we've got and planned our power connections. yet.

And here we have a plan, both to lessen our exposure to this risk in the short term, and to better understand how to avoid it in the longer term.


Why did monitoring fail;


Why? cos the monitoring server was on the UPS that died.

Why? cos we've only got one monitoring server, and we have to put it somewhere.

Why? cos whenever I've considered duplicating the monitoring server I've rejected it because I don't want to double the number of messages we get, and hadn't really thought through how to avoid this. And because I've not been able to spare the hardware.


I suppose the question is what do we do to make this more robust? I'm gonna duplicate our nagios installation, but mess with notification in two key ways - it will notify using jabber instead of mail, and it will only notify if the main monitor server is down. I'll put it on a VM at the same site for now, and will migrate it off site when we've a vmware platform to move it to.

No comments: