Friday, 29 February 2008

Nagios and Twitter

Following on from this post, I've got a second Nagios server set up now, monitoring all the same stuff.
It's running in a VM on hardware connected to a different UPS, so that's one weakness mitigated. The other improvement is using Twitter as a notification channel, as opposed to the mail used on the primary monitor. So if our mail service goes down, we'll still know about it.
We weren't finding out before, cos the monitor was mailing us and the mail wasn't getting through.
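The Twitter hook is just a small notification command. Here's a minimal sketch of the sort of script involved, assuming the Basic Auth status-update API Twitter offers; the account name, password, and the Nagios macros in the comment are placeholders, not our real setup.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios "notify by Twitter" command.
# Assumes Twitter's Basic Auth status-update API; credentials
# and the API URL are illustrative placeholders.
import base64
import sys
import urllib.parse
import urllib.request

TWITTER_USER = "our-nagios-account"  # placeholder
TWITTER_PASS = "secret"              # placeholder
API_URL = "http://twitter.com/statuses/update.json"

def tweet(message):
    # Twitter caps a status at 140 characters, so truncate first.
    data = urllib.parse.urlencode({"status": message[:140]}).encode()
    req = urllib.request.Request(API_URL, data)
    token = base64.b64encode(
        "{0}:{1}".format(TWITTER_USER, TWITTER_PASS).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Nagios hands us the alert text as arguments, via a notification
    # command along the lines of:
    #   notify-by-twitter "$HOSTNAME$ $SERVICESTATE$ $SERVICEOUTPUT$"
    tweet(" ".join(sys.argv[1:]))
```

Wire something like that up as a notification command and point the relevant contact at it, and the alerts no longer depend on our own mail service being up.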
Reminds me of what my partner pointed out (and that I'd not considered) when I outlined our VoIP plans. "But doesn't that mean that when the network is down, no-one will be able to call you to complain?". Every cloud's got a silver lining.

Monday, 25 February 2008

The Five Whys

A little while ago I read this, and was very interested in the application of the Five Whys.

So here goes nothing...

This morning, we were faced with a number of servers down at our main site. It took until 10:25 to get everything important back up and running. We knew yesterday that there was some problem, but our monitoring platform was one of the downed machines, so our view of the problem was somewhat clouded.

Two issues. First:

Why were the servers down?


Why? The UPS supplying the servers in question had run its batteries flat.

Why? There had been a planned power outage on the site, and the cutout on the supply to the offending UPS had tripped, so no power into the UPS, and the batteries went flat.

Why? And at this point I'm stuck, and need to talk to our power people.

But I feel I ought to continue...

Why? The servers in question either don't have redundant power supplies, or they have them but they're not connected.

Why? Cos we've not yet surveyed what we've got and planned our power connections.

And here we have a plan, both to lessen our exposure to this risk in the short term, and to better understand how to avoid it in the longer term.


Second: why did monitoring fail?


Why? Cos the monitoring server was on the UPS that died.

Why? Cos we've only got one monitoring server, and we have to put it somewhere.

Why? Cos whenever I've considered duplicating the monitoring server, I've rejected the idea: I didn't want to double the number of messages we get, and hadn't really thought through how to avoid that. And because I've not been able to spare the hardware.


I suppose the question is: what do we do to make this more robust? I'm gonna duplicate our Nagios installation, but mess with notification in two key ways: it will notify using Jabber instead of mail, and it will only notify if the main monitoring server is down. I'll put it on a VM at the same site for now, and will migrate it off site when we've a VMware platform to move it to.
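The "only notify if the primary is down" part could be as simple as a wrapper around the secondary's notification command. A rough sketch, under some assumptions: the primary's hostname and the Jabber recipient below are placeholders, and it leans on the sendxmpp command-line tool (with credentials in ~/.sendxmpprc) rather than a Python XMPP library.

```python
#!/usr/bin/env python
# Sketch of the secondary monitor's notification wrapper: relay the
# alert over Jabber only when the primary monitor is unreachable.
import subprocess
import sys

PRIMARY_MONITOR = "nagios1.example.com"         # placeholder hostname
JABBER_RECIPIENT = "oncall@jabber.example.com"  # placeholder JID

def primary_is_up():
    # One ping with a short timeout; if the primary answers, let it
    # do the notifying and stay quiet here.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", PRIMARY_MONITOR],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def notify(message):
    # sendxmpp reads credentials from ~/.sendxmpprc and the message
    # body from stdin.
    subprocess.run(["sendxmpp", JABBER_RECIPIENT], input=message.encode())

if __name__ == "__main__":
    if not primary_is_up():
        notify(" ".join(sys.argv[1:]))
```

The nice property is that the secondary stays silent day to day, so we don't double the number of messages we get.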

Brainwaves, man

Popped into town to meet up with H & C at the Symphony Hall. It was mostly over by the time I arrived; C was sat with electrodes on his head, trying to do a Rubik's cube and watching a visualisation of his EEG. Running on a Mac, it looked very cool.

Turns out it's an EEG-over-Bluetooth package, about 1500 dollars. But, as these things do, it got me looking.

OpenEEG looks very cool.

And gnaural, which basically lets you build your own brain-control music. Woohooh.

Saturday, 23 February 2008

Not Friday gone but the Friday before, we had a job to do. For reasons I won't go into, we needed to replace the chassis on one of our core switches. It took several hours, 'cos we took the opportunity to tidy the cabling. Now it looks like this.

switchblade

We replaced pretty much every cable, and now, apart from a few that are known to be temporary and are visually very obvious, they're all 'patchsee' cables. (It's really rather neat: they run a couple of strands of optical fibre through the cable, see, and shining a special little torch on one end makes the other end light up. V. handy.)

This took 6 hours, all told (including lunch and fag breaks), and we're pretty happy with it.

I'm gonna try to remember to take monthly pics, and watch the entropy.