Friday 29 February 2008

Nagios and Twitter

Following on from this post, I've got a second nagios server set up now, monitoring all the same stuff.
It's running in a VM on hardware connected to a different UPS, so that's one weakness mitigated. The other improvement is using Twitter as a notification channel, rather than mail as on the primary monitor. So if our mail service goes down, we'll know about it.
We weren't finding out before, cos the monitor was mailing us, but the mail wasn't getting through.
Reminds me of what my partner pointed out (and that I'd not considered) when I outlined VoIP: "But doesn't that mean that when the network is down, no-one will be able to call you to complain?" Every cloud's got a silver lining.
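Back to the Twitter notifications: there's nothing clever involved. As a rough sketch of the sort of thing (the account details, script path and message format here are made up, and it assumes Twitter's basic-auth status-update API):

######################## notifyByTwitter.sh ########################
#!/bin/sh
## sketch only - account details and message format are placeholders
TWITUSER="ourmonitor"
TWITPASS="secret"
MESSAGE="$1"

## post the alert as a status update over the basic-auth API
curl -s -u "${TWITUSER}:${TWITPASS}" \
    --data-urlencode "status=${MESSAGE}" \
    http://twitter.com/statuses/update.xml > /dev/null

## the matching nagios command definition would be along the lines of:
##   define command{
##     command_name  notify-service-by-twitter
##     command_line  /usr/local/bin/notifyByTwitter.sh "$NOTIFICATIONTYPE$ $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$"
##   }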

Monday 25 February 2008

The Five Whys

A little while ago I read this, and was very interested in the application of the five whys.

So here goes nothing...

This morning we were faced with a number of servers down at our main site. It took until 10:25 to get everything important back up and running. We knew yesterday that there was some problem, but our monitoring platform was one of the downed machines, so our view of the problem was somewhat clouded.

Two issues. First: why were the servers down?


Why? The UPS supplying the servers in question had run its batteries flat.

Why? There had been a planned power outage on the site, and the cutout on the supply to the offending UPS had tripped, so no power into the UPS, and the batteries went flat.

Why? And at this point I'm stuck, and need to talk to our power people.

But I feel I ought to continue...

Why? The servers in question either don't have redundant power supplies, or they're not connected.

Why? Cos we've not yet surveyed what we've got and planned our power connections.

And here we have a plan, both to lessen our exposure to this risk in the short term, and to better understand how to avoid it in the longer term.


Second: why did monitoring fail?


Why? Cos the monitoring server was on the UPS that died.

Why? Cos we've only got one monitoring server, and we have to put it somewhere.

Why? Cos whenever I've considered duplicating the monitoring server, I've rejected it: I didn't want to double the number of messages we get, and hadn't really thought through how to avoid that. And because I've not been able to spare the hardware.


I suppose the question is: what do we do to make this more robust? I'm gonna duplicate our nagios installation, but mess with notification in two key ways: it will notify using Jabber instead of mail, and it will only notify if the main monitoring server is down (roughly along the lines of the sketch below). I'll put it on a VM at the same site for now, and migrate it off site when we've a VMware platform to move it to.
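Something like this ought to do for the secondary's notification script (just a sketch: the hostname and Jabber ID are placeholders, and it assumes sendxmpp with its credentials in ~/.sendxmpprc):

######################## notifyByJabber.sh ########################
#!/bin/sh
## sketch only - hostname and JID below are placeholders
PRIMARY="nagios1.example.org"      ## the main monitoring server
RECIPIENT="netadmin@example.org"   ## who gets the jabber message
MESSAGE="$1"                       ## alert text passed in by nagios

## if the primary monitor still answers pings, leave the notifying to it
if ping -c 3 -q "$PRIMARY" > /dev/null 2>&1; then
    exit 0
fi

## otherwise pass the alert on over jabber
echo "$MESSAGE" | sendxmpp "$RECIPIENT"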

Brainwaves, man

Popped into town to meet up with H & C at the Symphony Hall. It was mostly over by the time I arrived; C was sat with electrodes on his head, trying to do a Rubik's cube and watching a visualisation of his EEG. It was running on a Mac and looked very cool.

Turns out it's an EEG-over-Bluetooth package, about 1500 dollars. But, as these things do, it got me looking.

OpenEEG looks very cool.

And gnaural. Basically lets you build your own brain control music. Woohooh.

Saturday 23 February 2008

Not Friday gone but the Friday before, we had a job to do. For reasons I won't go into, we needed to replace the chassis on one of our core switches. It took several hours, 'cos we took the opportunity to tidy the cabling. Now it looks like this.

switchblade

We replaced pretty much every cable, and now, apart from a few that are known to be temporary and are visually very obvious, they're all 'patchsee' cables. (They're really rather neat: they run a couple of strands of optical fibre through the cable, and shining a special little torch on one end makes the other end light up. Very handy.)

This took 6 hours, all told (including lunch and fag breaks), and we're pretty happy with it.

I'm gonna try to remember to take monthly pics, and watch the entropy.

Sunday 3 February 2008

Roll your own VoIP Analysis, it's not that hard.

In the previous post, we were trying to debug a problem with our phones. Now, we're in education, and money's tight. IT systems purchasing goes like this:

Technical, management and sales agree on a service and a price. In this instance, it was a fully managed VoIP service, with training for our people, full redundancy, call reporting, the works.

Technical staff go back to work, leaving management and sales to iron out final details.

We end up with some boxes installed, no redundancy, no reporting or training, and call-center monkey support. 'Have you got QOS?'

So we sure as s**t can't afford call quality analysis software.

I'm rolling me own.

The basic Mitel system uses proprietary MiNET protocols to control stuff, but plain G.711 over RTP for the audio. Wireshark can split this out and save the audio streams, as well as doing jitter/latency analysis, but I needed something less manual.

tshark doesn't do this. But it's possible to script something similar.

######################## analyseCalls.sh ########################
#!/bin/sh
# tcpdump file as argument

## first, identify distinct RTP streams in input
for ff in $(tshark -r $1 "udp.port == 9000" -d "udp.port == 9000,rtp" -T fields -e "rtp.ssrc" 2>/dev/null | sort | uniq | cut -f1); do
    ## count the non-silent payload bytes - this is, ahem, 'heuristic':
    ## the sed strips the low-amplitude A-law codes, and 160 bytes is one
    ## 20ms G.711 packet, so NONSILENCE is roughly packets' worth of speech
    NSB=$(tshark -r $1 -d "udp.port==9000,rtp" "rtp.ssrc==${ff}" -T pdml 2>/dev/null | grep payload | cut -f10 -d'"' | sed -e 's/[d5][54761032]://g; s/[^:]//g' | wc -c)
    NONSILENCE=$(echo "scale = 3; print $NSB / 160" | bc)
    ## suck out the audio payload
    tshark -r $1 -d "udp.port==9000,rtp" "rtp.ssrc==${ff}" -T pdml 2>/dev/null | grep payload | cut -f10 -d'"' | grabAudio.pl > ${ff}.raw
    ## convert it to WAV (mono, 8kHz, A-law)
    sox -c 1 -r 8000 -L -A -t raw ${ff}.raw ${ff}.wav
    ## get a timestamp for the first packet
    TD=$(tshark -r $1 -d "udp.port==9000,rtp" "rtp.ssrc==$ff" -tad | head -n1 | cut -f2,3 -d" ")
    echo -n "$TD ${ff} $NONSILENCE "
    ## and do some call analysis
    tshark -r $1 -d "udp.port==9000,rtp" "rtp.ssrc==$ff" -td 2>/dev/null | qual.pl
done | sort -n

######################## grabAudio.pl ########################
#!/usr/bin/perl
## translate the ascii-hex payload from tshark to actual binary data
while(<>) {
    $line = $_;
    chop $line;
    foreach $char ( split(/:/,$line) ) {
        print chr(hex($char));
    }
}

######################## qual.pl #######################
#!/usr/bin/perl

while (<>) {
    $line = $_;
    ## collapse runs of spaces and strip the leading one
    $line =~ s/ +/ /g;
    $line =~ s/^ //;
    ## separate the fields
    ( $pkt, $delta, $sip, $dummy, $dip, $dummy, $dummy, $dummy, $dummy, $dummy, $dummy, $ssrc, $dummy, $seq, $dummy, $time ) = split(/[ ,=]+/, $line);
    ## save the inter-packet arrival time in an array
    push(@deltas,$delta);
    ## and keep the final RTP sequence number
    $lastseq = $seq;
}
## compare number of packets seen to sequence number for loss..
$pkts = @deltas - 1;
$loss = $lastseq - $pkts;

## get the mean inter-packet gap
foreach $delta ( @deltas ) {
    $dsum += $delta;
}
$dmean = $dsum / $pkts;

## and calculate the standard deviation for the whole set
## ( not sure if this is the right calculation )
foreach $delta ( @deltas ) {
    $dsquared += ( $delta - $dmean ) * ( $delta - $dmean );
}
$jitter = sqrt($dsquared / $pkts);

print "$sip $dip $pkts $loss $dmean $jitter\n";
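To give an idea of how it all hangs together, the capture-and-run looks roughly like this (interface and file names are placeholders, the udp port 9000 filter matches the one hard-coded above, and grabAudio.pl and qual.pl need to be executable and on the PATH):

## capture the voice traffic, e.g. from a mirrored switch port
tcpdump -i eth1 -s 0 -w calls.pcap udp port 9000

## then pull out the per-stream WAVs and call stats
./analyseCalls.sh calls.pcap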

Now I'm not claiming that this is great. It'll do me for free, and it will make my VoIP guy a bit happier. Especially when I roll it into a nice mini-itx box that he can pop under a problem handset for a week, with a nice web interface, call playback, etc.

The point is, it ain't that hard.

Troubleshooting lessons.

OK. For a couple of weeks we've had major phone problems. We run a couple of Mitel VoIP boxes and a few hundred phones off them. Each Mitel box has 30 ISDN lines (I don't really get ISDN, otherwise I'd describe it better).

We've been getting one-way calls to our enquiries people, so our operators can't hear the caller.

This is the story of the diagnosis.

First off, we questioned users and first line support folks. It appeared to be happening on calls coming in to one of the Mitel boxes, destined for a group of handsets on a distant campus.

I took some packet captures from the switch port one of the phones is plugged into. I saw normal background stuff, some Mitel control traffic (ports 6800 and 6900), and two RTP streams, one in each direction. So I pulled the streams out to audio files (Wireshark is lovely) and listened to them. Sure enough, the outgoing side was fine: "Hello, ???, can I help you?... Hello?... Hello?... click". The incoming side was silent. Not quiet, silent.

So, it's not the LAN, I reasoned. The audio stream is getting to the phone, there's just nothing in it.

So we talked to our Mitel reseller, who remotely looked over stuff and said nothing was wrong. And we talked to our ISDN provider, ditto. Our Mitel reseller sent someone out.

And while he was looking, I got a notification that one of our internet routers was down. I got a colleague to look it over and restart it. It was fine, but unreachable. So I went and looked at the layer 3 switch it's connected to. The interface was up, but there was no ARP entry and no response to pings. Oh bum.

So I tried pinging the Mitel box from the core switch. Uh-uh. ARP entry? Nope. OK, add a static ARP entry, and all of a sudden all was well.

Turns out it was the network all along: switch hardware. We're working it out with the vendor right now.

The point being, I was fully satisfied it wasn't the network. I could talk through my analysis with capable colleagues, who agreed. And we were wrong.

The moral of the story? When there's several components to a problem, and all of them check out fine, then someone doesn't understand the problem.

And it's probably you.
