Friday 14 March 2008

Automating nagios configurations.

At the last count, we run something like 140 print queues, and as offices move, and printers get replaced, and 'stuff changes', queues are created and deleted and renamed. This post is about how I've addressed ensuring that nagios is monitoring all our queues, and minimising the opportunity for operator error.

A little background. We use CUPS to queue print jobs, and our technicians are free to create and delete queues as need be. They do not have access to the nagios configs.

So, the basic idea is that we periodically run a script on the nagios server that:

  • Queries each of our print servers for a list of existing queues
  • Creates a nagios config file for all print queues in the list
  • Signals nagios to restart and re-read its configuration

So we get a monitoring configuration that doesn't miss any print queues, and doesn't raise alarms about queues that no longer exist. And no-one has to remember.

Which is nice.

So (and I apologise in advance for the code. I'm a sysadmin. Whaddya expect?), the following is a perl script called from cron, once for each CUPS server. We pass it the server address and a human-readable site name, and it writes nagios configuration to stdout, which is redirected into a file in the appropriate nagios config directory. It depends on lpstat, which queries the CUPS server.



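#!/usr/bin/perl
use strict;
use warnings;

# Arguments: the CUPS server to query, and a human-readable site name.
die "usage: $0 <cups-server> <site-name>\n" unless @ARGV == 2;
my ( $cupsServer, $site ) = @ARGV;

# List the printers on the server, drop the status continuation lines
# that also mention "printer", and keep the second field (the queue name).
my @queues = `lpstat -h $cupsServer -p | grep printer | grep -iv "sent" | grep -iv "off-line" | grep -iv "unable" | grep -iv "attempt" | cut -f2 -d" "`;
chomp @queues;    # chomp strips only the trailing newline; chop strips any last character

foreach my $queue ( @queues ) {
    # The service definition: one check per print queue.
    print "define service{\n";
    print "\tuse generic-service\n";
    print "\thost_name $cupsServer\n";
    print "\tservice_description CUPS_" . $queue . "\n";
    print "\tservicegroups " . $site . "PrintQueues\n";
    print "\tcontact_groups " . $site . "-printer-admins\n";
    print "\tcheck_command check_cups_queue!" . $queue . "\n";
    print "\tregister 1\n}\n\n";

    # Extended info: links to the wiki notes page and to the CUPS queue itself.
    print "define serviceextinfo{\n";
    print " host_name " . $cupsServer . "\n";
    print " service_description CUPS_" . $queue . "\n";
    print " notes_url http://wiki.example.com/wiki/index.php?title=Nagios/" . $queue . "&action=edit&preload=Nagios/NewServiceTemplate\n";
    print " action_url http://" . $cupsServer . ".example.com:631/printers/" . $queue . "\n";
    print " icon_image HPlj4550p.gif\n}\n\n";
}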
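
A note on the lpstat pipeline in the script. The chain of greps is there because "lpstat -p" prints a status line starting with "printer" for each queue, sometimes followed by an indented continuation line that can also contain the word "printer". One side effect of filtering on words like "sent" is that a queue whose name happens to contain one of them would be dropped too. The following is a rough sketch of an alternative filter, not what the script above does: it keeps only lines whose first field is exactly "printer". The function name queue_names is made up for illustration, and it assumes lpstat is producing its usual English output.

```shell
#!/bin/sh
# queue_names: read `lpstat -p` output on stdin and print one queue name
# per line. Hypothetical helper, not part of the original script.
queue_names() {
    # Status lines look like "printer NAME is idle. ..." so the queue
    # name is field 2. Indented continuation lines start with some
    # other word and are skipped.
    awk '$1 == "printer" { print $2 }'
}
```

It would be used as: lpstat -h cups1 -p | queue_names (where cups1 stands in for one of your CUPS servers).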



Coupla notes - the nagios action_url shows a clickable icon taking the user to the CUPS queue in question. The notes_url points to a wiki page. We use this to keep notes about the service.
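
Since the script only writes to stdout, something has to put its output in the right place. Here is a minimal sketch of a wrapper that cron could call for each server. The names are assumptions, not our actual setup: refresh_site is a made-up function, GENERATOR points at wherever you saved the perl script (the default path is hypothetical), and the output filename is just one possible convention. It writes to a temporary file first and only replaces the live config if the generator succeeded and produced something, so a CUPS server that is briefly unreachable doesn't wipe out its checks.

```shell
#!/bin/sh
# refresh_site SERVER SITE OUTDIR
# Regenerate the print-queue config for one site. Hypothetical sketch.
refresh_site() {
    rs_server=$1
    rs_site=$2
    rs_outdir=$3

    # The temp file has no .cfg suffix, so nagios will not read it
    # from a cfg_dir while it is half-written.
    rs_tmp=$(mktemp "$rs_outdir/.${rs_site}.XXXXXX") || return 1

    # GENERATOR is assumed to be the perl script from this post.
    if "${GENERATOR:-/usr/local/bin/cups2nagios.pl}" "$rs_server" "$rs_site" > "$rs_tmp" \
        && [ -s "$rs_tmp" ]; then
        chmod 644 "$rs_tmp"    # mktemp creates the file readable by the owner only
        mv "$rs_tmp" "$rs_outdir/${rs_site}-printqueues.cfg"
    else
        rm -f "$rs_tmp"
        return 1
    fi
}
```

A call might look like: refresh_site cups1 London /etc/nagios2/conf.d (server, site and directory all being examples).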

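This is all very well, but nagios won't pick up the changes without a restart. So once cron has built the config file, it does this (the backslashes before each % are there because the commands live in a crontab, where an unescaped % is treated as a newline):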


export now=$( /bin/date "+\%s" ); #get the current time into a format nagios understands
export commandfile='/var/lib/nagios2/rw/nagios.cmd'; #identify the file nagios reads for external commands
/usr/bin/printf "[\%lu] RESTART_PROGRAM\n" $(( now + 30 )) > $commandfile #tell nagios to restart in 30 seconds
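
One thing the snippet above doesn't do is check the generated config before asking for the restart. What follows is a hedged sketch of how that could look as a plain shell function (so no crontab escaping of % is needed). The function name restart_if_valid is made up; it takes the command file as its first argument and treats the remaining arguments as a command that verifies the configuration. On a nagios2 install that command would probably be something like "nagios2 -v /etc/nagios2/nagios.cfg", but the binary name and path are assumptions about your system.

```shell
#!/bin/sh
# restart_if_valid CMDFILE VERIFY_CMD [ARGS...]
# Run the verify command; only if it succeeds, write a RESTART_PROGRAM
# line to the nagios external command file. Hypothetical sketch.
restart_if_valid() {
    ri_cmdfile=$1
    shift

    # Bail out without touching the command file if the check fails.
    if ! "$@" > /dev/null 2>&1; then
        echo "config check failed; not restarting nagios" >&2
        return 1
    fi

    # Same line format as the cron snippet: [epoch seconds] RESTART_PROGRAM,
    # stamped 30 seconds ahead as in the original.
    ri_when=$(( $(date +%s) + 30 ))
    printf '[%s] RESTART_PROGRAM\n' "$ri_when" > "$ri_cmdfile"
}
```

For example: restart_if_valid /var/lib/nagios2/rw/nagios.cmd nagios2 -v /etc/nagios2/nagios.cfg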


And Bob's yer uncle. Monitoring our CUPS queues with nagios means we become aware of problems quicker, and respond quicker. And automating the config makes this practical.
