Nagios / The DICE Monitoring System
Last updated: 2016-07-06
1. Overview
Our Nagios monitoring system employs two servers, nagios.inf.ed.ac.uk and nagios2.inf.ed.ac.uk. The first of these monitors all other machines requesting such service (including nagios2, note); the second (nagios2) monitors only the first. (The details of exactly how a machine requests to be monitored are described elsewhere - see Existing documentation below; that documentation also describes the significance of the 'Spanning map subscribed to' field in the first table in Appendix A.)
Both nagios and nagios2 regenerate their Nagios configurations in a continuous loop - see the script /usr/bin/lcfg-monitor. This allows the Nagios configuration to change dynamically as machines either request monitoring via their source profiles, or similarly request monitoring to cease. Should the newly-generated configuration be incorrect in any way, the Nagios system continues to run with its existing configuration, and also generates an alert to its nominated manager(s). (See section 2.1 below for more on this.)
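For illustration, the effective behaviour of that loop is roughly as follows. This is a conceptual bash sketch only, not the actual contents of /usr/bin/lcfg-monitor; the helper commands and the sleep interval are hypothetical:

#!/bin/bash
# Conceptual sketch only - the real logic lives in /usr/bin/lcfg-monitor.
while true; do
    tmpdir=$(mktemp -d /tmp/nagios_XXXXXX)
    generate_nagios_config "$tmpdir"       # hypothetical: build a candidate config
    if /usr/sbin/nagios -v "$tmpdir/etc/nagios/nagios.cfg" >/dev/null 2>&1; then
        install_and_reload "$tmpdir"       # hypothetical: adopt the new config
    else
        alert_managers "Configuration in $tmpdir is corrupt"   # hypothetical
    fi
    sleep 60                               # hypothetical interval
done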
Should either nagios or nagios2 detect a problem with any machine it is monitoring, it will generate appropriate alerts. All such alerts are sent both via the Jabber service and via email.
Both nagios and nagios2 have remote serial consoles (implemented via IPMI SOL and KVM respectively), so both can be remotely rebooted when necessary. In addition, nagios can be remotely power-cycled via either IPMI or the rfe-able power bars. See Appendix A for the configuration details.
2. What can go wrong
2.1 Invalid Nagios configuration
The Nagios configuration of nagios is ultimately driven by headers and resources declared in the source profiles of all machines which have requested monitoring. It is possible - owing either to incorrect specifications in source profiles, or to inconsistencies in the LCFG system's idea of timestamps - that nagios's auto-generated Nagios configuration can become invalid. In this case:
- nagios's self-monitoring will generate alerts to be sent to that machine's nominated managers; and
- tail -f /var/lcfg/log/nagios_server on nagios will display the error condition.
The end effect of such problems is that, whilst the Nagios system will continue to run using its current configuration, no changes to that configuration will be made. In particular, no machines will be added to, or removed from, the monitoring system.
Depending on the exact cause of the problem, there are a few options for fixing it:
- Identify the client machine whose source profile has caused the problem, and edit its source profile to fix the problem. To get more information about a failing configuration, and to identify the client machine causing it, ssh to nagios, nsu to root, and run
/usr/sbin/nagios -v <config file name>
where <config file name> will be <directory>/etc/nagios/nagios.cfg, with <directory> having been given in the alert messages being produced (e.g. 'Configuration in /tmp/nagios_WRlEzH is corrupt'.) (The -v flag tells Nagios to verify the configuration file without subsequently trying to start the Nagios daemon.)
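For example, given the sample alert above, the verification run would look like this (the temporary directory name differs on each run):
ssh nagios
nsu
/usr/sbin/nagios -v /tmp/nagios_WRlEzH/etc/nagios/nagios.cfg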
- Make a cosmetic change to nagios's profile and submit it, in order to 'kick' the LCFG system's idea of nagios's timestamp.
- Restart the Nagios server on nagios:
ssh nagios
om nagios_server stop
om nagios_server start
(Comment: there have been reports that the more obvious command om nagios_server restart can fail owing to some as-yet-not-understood race condition - hence the explicit stop and start above.)
2.2 Drop-out of the Jabber connection
The Nagios system sends out all alerts via both Jabber and email. On both nagios and nagios2, the daemon which interfaces the Nagios system to our Jabber service can drop out if either the Jabber server itself temporarily goes down, or if there is an interruption of the network between nagios/nagios2 and the Jabber server. If this happens, then all alerts sent via Jabber will be lost. The fix is to restart the interfacing daemon:
ssh nagios / ssh nagios2
om jnotify stop
om jnotify start
One way to confirm that the Nagios/Jabber connection is working correctly is to look at your Pidgin 'Buddy List': you should see nagios@inf.ed.ac.uk marked as 'Available' if you're expecting alerts from nagios, as well as nagios2@inf.ed.ac.uk if you're expecting alerts from nagios2. (The latter applies only to managers of the Nagios system.)
2.3 Hardware failures
In the case of a complete hardware failure of either nagios or nagios2, the obvious fix is simply to reinstall the Nagios system on a new machine or machines.
Almost everything required for the configuration of both machines is contained in their source profiles; the only hand configuration required is to create a principal and corresponding keytab entry allowing nagios to access the FRIEND.INF.ED.AC.UK KDC in order to be able to monitor it. This can't be automated, since the iFriend KDC doesn't contain a host principal for nagios (or any of our servers, for that matter). Using me (idurkacz) as an example, the process goes as follows:
Note: when typing the following instructions, replace <nagios> with the actual canonical hostname of the nagios server, and <iFriend KDC> with the actual canonical hostname of the iFriend KDC server. See Appendix A for those hostnames.
- Give yourself a principal that will be used for admin purposes on the iFriend KDC:
ssh <iFriend KDC>
nsu
kadmin.local -r FRIEND.INF.ED.AC.UK
Authenticating as principal idurkacz/admin@INF.ED.AC.UK with password.
kadmin.local: addprinc idurkacz/admin@FRIEND.INF.ED.AC.UK
kadmin.local: exit
- Make this a true 'admin' principal (in our usual sense) by adding the following to the profile of <iFriend KDC>:
!kerberos.acls mADD(idurkaczadmin)
!kerberos.acl_idurkaczadmin mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *")
Those additions make an entry in the file /var/kerberos/krb5kdc/kadm5.acl. To effect this change, the kadmind daemon needs to be restarted; so, on the iFriend KDC:
om kerberos restart
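For reference, since <%kerberos.kdcrealm%> expands to FRIEND.INF.ED.AC.UK on that machine, the resulting kadm5.acl entry should look like:
idurkacz/admin@FRIEND.INF.ED.AC.UK * *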
- Add the following to the profile of <nagios>:
!kerberos.realms mSET(ifriend)
kerberos.name_ifriend FRIEND.INF.ED.AC.UK
kerberos.admin_ifriend <iFriend KDC>.inf.ed.ac.uk:749
Those resources add the following entry to the [realms] section of the /etc/krb5.conf file:
FRIEND.INF.ED.AC.UK = {
admin_server = <iFriend KDC>.inf.ed.ac.uk:749
}
These resources are necessary; without them, the next step fails with a cryptic error message.
- On nagios, create the principal in the iFriend KDC which will be used by the monitoring system to check that KDC, and extract the corresponding keytab entry:
ssh nagios
nsu
kadmin -r FRIEND.INF.ED.AC.UK -s <iFriend KDC>.inf.ed.ac.uk:749 -p idurkacz/admin@FRIEND.INF.ED.AC.UK
Authenticating as principal idurkacz/admin@FRIEND.INF.ED.AC.UK with password.
Password for idurkacz/admin@FRIEND.INF.ED.AC.UK:
kadmin: addprinc nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK
kadmin: ktadd -k /etc/nagios.keytab nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK
kadmin: exit
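To confirm that the keytab entry has been extracted correctly, it can be listed with the standard klist tool:
klist -k /etc/nagios.keytab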
- On the iFriend KDC, push the nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK principal to all iFriend KDC slave servers, so that Nagios monitoring will also start working for them:
om kerberos push -q 0
(A cron job on the iFriend KDC does such a push every 15 minutes in any case.)
- Finally, revert the various ad-hoc changes made in the course of this set-up work:
- Remove the following from the profile of <nagios>:
!kerberos.realms mSET(ifriend)
kerberos.name_ifriend FRIEND.INF.ED.AC.UK
kerberos.admin_ifriend <iFriend KDC>.inf.ed.ac.uk:749
- Remove the following from the profile of <iFriend KDC>:
!kerberos.acls mADD(idurkaczadmin)
!kerberos.acl_idurkaczadmin mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *")
If the primary nagios server has been moved to new hardware then, before starting the service, its /var/log/nagios directory should be populated with the contents of the same directory on the old server. (As well as containing service logs, this directory contains the current state information for the nagios service.) Since these instructions assume that the old server has completely failed, the contents of this directory should be restored from its rsync mirror: get help from the Services Unit if necessary. (If the contents of this directory can't be restored, then the service can still be started, but it will be missing any state information entered by service managers: e.g. service downtimes, comments, etc.)
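As a sketch only - the actual mirror host and path should be confirmed with the Services Unit - the restore would look something like:
rsync -a <mirror host>:<path to mirror>/var/log/nagios/ /var/log/nagios/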
Once the nagios server(s) are successfully running on new hardware, the header live/nagios_client.h should be updated to reflect the new hostname(s) and IP address(es). That header tells nagios clients the server to which they should send their nagios 'passive checks' (e.g. the status reports for network bonding, h/w RAID, etc.)
As a (very) short-term fix for a hardware failure of the primary monitoring machine nagios, and in order to reinstate monitoring for very critical client machines, it would be possible to move the monitoring of clients from nagios to nagios2 by mutating the resource nagios_client.monitoringcluster from nagios/all to nagios/slave in the source profile of any such machine. If this is done, it should be reverted as soon as the primary monitoring server is back on-line.
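A sketch of that change in a client's source profile (assuming the usual mSET mutator, as used elsewhere on this page):
!nagios_client.monitoringcluster mSET(nagios/slave)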
3. Existing documentation
- The DICE Monitoring System
This set of documents describes the overall design of the Informatics monitoring system, as well as the specific resources which a client machine needs to set in order to request monitoring. For reference, a 'minimal' set of resources - where the client is requesting that its ssh service be monitored for liveness - is as follows:
#include <dice/options/nagios_client.h>
...[snip]...
!nagios_client.manager mSET(username or capability)
!nagios_client.components mADD(openssh)
!openssh.nagios_groups mADD(username or capability)
The username or capability field specifies to whom Nagios alerts should be sent in the event of problems; it should therefore be set to the nominal machine 'manager'. If a capability is specified rather than an individual username, it will be expanded by the monitoring system into the corresponding set of usernames. A specific example of such a setting is:
#include <dice/options/nagios_client.h>
...[snip]...
!nagios_client.manager mSET(nagios/inf-unit)
!nagios_client.components mADD(openssh)
!openssh.nagios_groups mADD(nagios/inf-unit)
- On-line Nagios documentation
This is the official set of documentation maintained on-line by the Nagios project itself. It describes Nagios in some detail and so is a useful reference, but of course says nothing about the specifics of the Informatics monitoring system.
Appendices
A. Details of current servers: hostnames, location, power, networking, etc.
-- IanDurkacz