
Nagios / The DICE Monitoring System

Last updated: 2015-03-19


1. Overview

Our Nagios monitoring system employs two servers, cockerel and capon:

Server                   | Physical location                        | Spanning map subscribed to
cockerel (a.k.a. nagios) | IF B.02, rack 9                          | nagios/all
capon (a.k.a. nagios2)   | N/A (a KVM guest on AT KVM host waterloo) | nagios/slave

The first of these monitors all other machines requesting such service (including, note, capon); the second (capon) monitors only the first. (Exactly how a machine requests monitoring is described in detail elsewhere - see Existing documentation below; that documentation also explains the significance of the 'Spanning map subscribed to' field in the above table.)

Both cockerel and capon regenerate their Nagios configurations in a continuous loop - see the script /usr/bin/lcfg-monitor. This allows the Nagios configuration to change dynamically as machines either request monitoring via their source profiles, or similarly request that monitoring cease. Should the newly-generated configuration be incorrect in any way, the Nagios system continues to run with its existing configuration, and also generates an alert to its nominated manager(s). (See section 2.1 below for more on this.)
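The cycle performed by /usr/bin/lcfg-monitor can be sketched as follows. The helper functions here are hypothetical stand-ins: the real script regenerates the configuration from LCFG data and validates it with /usr/sbin/nagios -v before installing it.

```shell
#!/bin/sh
# Sketch of the regenerate/validate/reload cycle (stand-in functions;
# the real /usr/bin/lcfg-monitor drives nagios(8) and LCFG components).

generate_config() {   # stand-in: write a candidate config to $1
    echo "cfg_file=objects.cfg" > "$1"
}

validate_config() {   # stand-in for: /usr/sbin/nagios -v "$1"
    grep -q '^cfg_file=' "$1"
}

candidate=$(mktemp) || exit 1
generate_config "$candidate"

if validate_config "$candidate"; then
    # Real system: install the new configuration and reload Nagios.
    echo "config OK: would install $candidate and reload Nagios"
else
    # Real system: keep running with the existing configuration,
    # and alert the nominated manager(s).
    echo "config INVALID: keeping current config, alerting managers"
fi
rm -f "$candidate"
```

The key point the sketch illustrates is that an invalid candidate configuration is never installed: Nagios keeps running with whatever configuration it last accepted.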

Should either cockerel or capon detect a problem with any machine they're monitoring, they will generate appropriate alerts. All such alerts are sent both via the Jabber service, and via email.

Both cockerel and capon have remote serial consoles (implemented via IPMI SOL and KVM, respectively), so both can be remotely rebooted when necessary. In addition, cockerel can be remotely power-cycled via either IPMI, or the rfe-able power bars (note that cockerel has two power supplies.) The details are as follows:

Server   | Managing console server    | Power bar/outlet
cockerel | blatiere (a.k.a. consoles) | &

2. What can go wrong

2.1 Invalid Nagios configuration

The Nagios configuration of cockerel is ultimately driven by headers and resources declared in the source profiles of all machines which have requested monitoring. It is possible - owing either to incorrect specifications in source profiles, or to inconsistencies in the LCFG system's idea of timestamps - that cockerel's auto-generated Nagios configuration can become invalid. In this case:

  • cockerel's self-monitoring will generate alerts to be sent to that machine's nominated managers; and
  • tail -f /var/lcfg/log/nagios_server on cockerel will display the error condition.

The end effect of such problems is that, whilst the Nagios system will continue to run using its current configuration, no changes to that configuration will be made. In particular, no machines will be added to, or removed from, the monitoring system.

Depending on the exact cause of the problem, there are a few options for fixing it:

  1. Identify the client machine whose source profile has caused the problem, and edit its source profile to fix the problem. To get more information about a failing configuration and to identify the client machine causing the problem, ssh to cockerel, nsu to root, and run

      /usr/sbin/nagios -v <config file name>

    where <config file name> is <directory>/etc/nagios/nagios.cfg, and <directory> is the directory given in the alert messages produced (e.g. 'Configuration in /tmp/nagios_WRlEzH is corrupt'.)

    (The -v flag tells Nagios to verify the configuration file without subsequently trying to start the Nagios daemon.)

  2. Make a cosmetic change to cockerel's profile and submit it, in order to 'kick' the LCFG system's idea of cockerel's timestamp.
  3. Restart the Nagios server on cockerel:

      ssh cockerel
      om nagios_server stop
      om nagios_server start

    (Comment: there have been reports that the more obvious command om nagios_server restart can fail owing to some as-yet-not-understood race condition - hence the explicit stop and start above.)

2.2 Drop-out of the Jabber connection

The Nagios system sends out all alerts via both Jabber and email. On both cockerel and capon, the daemon which interfaces the Nagios system to our Jabber service can drop out if the Jabber server itself temporarily goes down, or if the network between cockerel/capon and the Jabber server is interrupted. If this happens, all alerts sent via Jabber will be lost. The fix is to restart the interfacing daemon:

  ssh cockerel / ssh capon
  om jnotify stop
  om jnotify start

One way to confirm that the Nagios/Jabber connection is working correctly is to look at your Pidgin 'Buddy List': you should see cockerel's Jabber account marked as 'Available' if you're expecting alerts from cockerel, and likewise capon's if you're expecting alerts from capon. (The latter applies only to managers of the Nagios system.)

2.3 Hardware failures

In the case of a complete hardware failure of either cockerel or capon, the obvious fix is simply to reinstall the Nagios system to a new machine or machines.

Almost everything required for the configuration of both machines is contained in their source profiles; the only hand configuration that's required is to create a principal and corresponding keytab entry for cockerel to access the FRIEND.INF.ED.AC.UK KDC in order to be able to monitor it. This can't be automated since the iFriend KDC (currently, hanlon) doesn't contain a host principal for cockerel (or any of our servers, for that matter.) Using me (idurkacz) as an example, the process goes as follows:

  1. Give yourself a principal that will be used for admin purposes on the iFriend KDC:
      ssh hanlon
      kadmin.local -r FRIEND.INF.ED.AC.UK
      Authenticating as principal idurkacz/admin@INF.ED.AC.UK with password.
      kadmin.local:  addprinc idurkacz/admin@FRIEND.INF.ED.AC.UK
      kadmin.local:  exit

  2. Make this a true 'admin' principal (in our usual sense) by adding the following to the profile of hanlon:
      !kerberos.acls                mADD(idurkaczadmin)
      !kerberos.acl_idurkaczadmin   mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *") 

    Those additions make an entry in the file /var/kerberos/krb5kdc/kadm5.acl. To effect this change, the kadmind daemon needs to be restarted. So, on hanlon:

      om kerberos restart

  3. Add the following to the profile of cockerel:
      !kerberos.realms                mSET(ifriend)
      kerberos.name_ifriend           FRIEND.INF.ED.AC.UK

    Those resources add the following entry to the [realms] section of the /etc/krb5.conf file:

        admin_server =
    These resources are necessary; without them, the next step fails with a cryptic error message.
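For reference, the resulting [realms] entry presumably looks something like the following. The admin_server value here is an assumption, based on hanlon being the current iFriend KDC as noted above; the actual value is filled in by the resources.

```
[realms]
 FRIEND.INF.ED.AC.UK = {
  admin_server = hanlon.inf.ed.ac.uk
 }
```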

  4. On cockerel, create the principal in the iFriend KDC which will be used by the monitoring system to check that KDC, and extract the corresponding keytab entry:
      ssh cockerel
      kadmin -r FRIEND.INF.ED.AC.UK -s -p idurkacz/admin@FRIEND.INF.ED.AC.UK
      Authenticating as principal idurkacz/admin@FRIEND.INF.ED.AC.UK with password.
      Password for idurkacz/admin@FRIEND.INF.ED.AC.UK: 
      kadmin:  addprinc nagios/
      kadmin:  ktadd -k /etc/nagios.keytab nagios/
      kadmin:  exit
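Having extracted the keytab, it may be worth verifying it before relying on it. The following is a sketch to be run on cockerel; the principal name is an assumption (the exact principal used is as created in step 4 above), and the script falls back to a message if the Kerberos tools or keytab aren't available where it's run.

```shell
#!/bin/sh
# Hypothetical keytab check, to be run on cockerel.
keytab=/etc/nagios.keytab
# Assumption: the principal created in step 4 is host-specific.
princ="nagios/$(hostname -f)@FRIEND.INF.ED.AC.UK"

if command -v klist >/dev/null 2>&1 && [ -r "$keytab" ]; then
    klist -k "$keytab"                      # list the keytab's entries
    kinit -k -t "$keytab" "$princ" && echo "keytab OK"
else
    echo "klist or $keytab not available here; run this on cockerel"
fi
```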

  5. On hanlon, push the nagios/ principal to all iFriend KDC slave servers so that Nagios monitoring will also start working for them:
      om kerberos push -q 0
    (A cron job on hanlon does such a push every 15 minutes in any case.)

  6. Remove the following from the profile of cockerel:
      !kerberos.realms                mSET(ifriend)
      kerberos.name_ifriend           FRIEND.INF.ED.AC.UK

If the primary nagios server has been moved to new hardware then, before starting the service, its /var/log/nagios directory should be populated with the contents of the same directory on the old server. (As well as containing service logs, this directory contains the current state information for the nagios service.) Since these instructions assume that the old server has completely failed, the contents of this directory should be restored from its rsync mirror: get help from the Services Unit if necessary. (If the contents of this directory can't be restored, then the service can still be started, but it will be missing any state information entered by service managers: e.g. service downtimes, comments, etc.)
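A restore along these lines might look as follows. The mirror host and path here are assumptions - confirm the real mirror location with the Services Unit - and the echo acts as a dry-run guard.

```shell
#!/bin/sh
# Hypothetical restore of the Nagios state directory onto a freshly
# installed primary server. MIRROR is an assumption: confirm the real
# rsync mirror location with the Services Unit before running this.
MIRROR="mirrorhost:/mirror/cockerel/var/log/nagios/"   # assumption
DEST="/var/log/nagios/"

# -a preserves permissions, ownership and timestamps; the trailing
# slashes copy directory contents. Remove 'echo' to perform the actual
# transfer, and do so before starting nagios_server.
echo rsync -a "$MIRROR" "$DEST"
```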

Once the nagios server(s) are successfully running on new hardware, the header live/nagios_client.h should be updated to reflect the new hostname(s) and IP address(es). That header tells nagios clients the server to which they should send their nagios 'passive checks' (e.g. the status reports for network bonding, h/w raid, etc.)

As a (very) short-term fix for a hardware failure of the primary monitoring machine cockerel, in order to reinstate monitoring for critical client machines, it would be possible to move the monitoring of clients from cockerel to capon by mutating the resource nagios_client.monitoringcluster from nagios/all to nagios/slave in the source profile of any such machine. If this is done, it should be reverted as soon as the primary monitoring server is back on-line.
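In the affected client's source profile, that mutation would look something like the following sketch, using the resource named above:

```
/* Short-term failover: have capon rather than cockerel monitor this host */
!nagios_client.monitoringcluster    mSET(nagios/slave)    /* normally nagios/all */
```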

3. Existing documentation

  1. The DICE Monitoring System

    This set of documents describes the overall design of the Informatics monitoring system, as well as the specific resources which a client machine needs to set in order to request monitoring. For reference, a 'minimal' set of resources - where the client is requesting that its ssh service be monitored for liveness - is as follows:

      #include <dice/options/nagios_client.h>
      !nagios_client.manager                  mSET(username or capability)
      !nagios_client.components               mADD(openssh)
      !openssh.nagios_groups                  mADD(username or capability)

    The username or capability field specifies to whom Nagios alerts should be sent in the event of problems; it should therefore be set to the nominal machine 'manager'. If a capability is specified rather than an individual username, it will be expanded by the monitoring system into the corresponding set of usernames. A specific example of such a setting is:

      #include <dice/options/nagios_client.h>
      !nagios_client.manager                  mSET(nagios/inf-unit)
      !nagios_client.components               mADD(openssh)
      !openssh.nagios_groups                  mADD(nagios/inf-unit)
  2. On-line Nagios documentation

    This is the official set of documentation maintained on-line by the Nagios project itself. It describes Nagios in some detail and so is a useful reference, but of course says nothing about the specifics of the Informatics monitoring system.

-- IanDurkacz
