
Nagios / The DICE Monitoring System

Last updated: 2016-07-06


1. Overview

Our Nagios monitoring system employs two servers, nagios and nagios2.

The first of these monitors all other machines which request such a service (including, note, nagios2); the second (nagios2) monitors only the first. (Exactly how a machine requests monitoring is described elsewhere - see Existing documentation below; that documentation also describes the significance of the 'Spanning map subscribed to' field in the first table in Appendix A.)

Both nagios and nagios2 regenerate their Nagios configurations in a continuous loop - see the script /usr/bin/lcfg-monitor. This allows the Nagios configuration to change dynamically as machines either request monitoring via their source profiles, or similarly request monitoring to cease. Should the newly-generated configuration be incorrect in any way, the Nagios system continues to run with its existing configuration, and also generates an alert to its nominated manager(s). (See section 2.1 below for more on this.)
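As a rough picture of that 'generate, verify, then swap' behaviour, here is a minimal shell sketch. It is not the contents of /usr/bin/lcfg-monitor: the function name regen_once, its three arguments and the use of a scratch directory are all assumptions made purely for illustration.

```shell
# Hypothetical sketch: one pass of a "generate, verify, then swap" cycle.
# Usage: regen_once <generate-cmd> <verify-cmd> <live-dir>
#   <generate-cmd> is called with a scratch directory to populate;
#   <verify-cmd>   is called with the same directory, and exits 0 if valid.
regen_once() {
    gen=$1; verify=$2; live=$3
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/nagios_XXXXXX") || return 2

    if "$gen" "$scratch" && "$verify" "$scratch"; then
        # The new configuration is valid: replace the live one with it.
        rm -rf "$live"
        mv "$scratch" "$live"
        echo "updated"
    else
        # Invalid: carry on with the existing configuration, and leave
        # the scratch directory in place so that it can be inspected.
        echo "Configuration in $scratch is corrupt" >&2
        echo "kept"
        return 1
    fi
}
```

The point to note is that the live configuration is only ever touched after verification has succeeded, which is why a bad source profile stops further changes but does not stop monitoring.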

Should either nagios or nagios2 detect a problem with any machine they're monitoring, they will generate appropriate alerts. All such alerts are sent both via the Jabber service, and via email.

Both nagios and nagios2 have remote serial consoles (implemented via IPMI SOL and KVM, respectively), so both can be remotely rebooted when necessary. In addition, nagios can be remotely power-cycled via either IPMI, or the rfe-able power bars. See Appendix A for the configuration details.

2. What can go wrong

2.1 Invalid Nagios configuration

The Nagios configuration of nagios is ultimately driven by headers and resources declared in the source profiles of all machines which have requested monitoring. It is possible - owing either to incorrect specifications in source profiles, or to inconsistencies in the LCFG system's idea of timestamps - that nagios's auto-generated Nagios configuration can become invalid. In this case:

  • nagios's self-monitoring will generate alerts to be sent to that machine's nominated managers; and
  • tail -f /var/lcfg/log/nagios_server on nagios will display the error condition.

The end effect of such problems is that, whilst the Nagios system will continue to run using its current configuration, no changes to that configuration will be made. In particular, no machines will be added to, or removed from, the monitoring system.

Depending on the exact cause of the problem, there are a few options for fixing it:

  1. Identify the client machine whose source profile has caused the problem, and edit its source profile to fix the problem. To get more information about a failing configuration and to identify the client machine causing the problem, ssh to nagios, nsu to root, and run

      /usr/sbin/nagios -v <config file name>

    where <config file name> is <directory>/etc/nagios/nagios.cfg, and <directory> is the directory named in the alert messages being produced (e.g. 'Configuration in /tmp/nagios_WRlEzH is corrupt').

    (The -v flag tells Nagios to verify the configuration file without subsequently trying to start the Nagios daemon.)

  2. Make a cosmetic change to nagios's profile and submit it, in order to 'kick' the LCFG system's idea of nagios's timestamp.
  3. Restart the Nagios server on nagios:

      ssh nagios
      om nagios_server stop
      om nagios_server start

    (Comment: there have been reports that the more obvious command om nagios_server restart can fail owing to some as-yet-not-understood race condition - hence the explicit stop and start above.)
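For option 1 above, the path to hand to nagios -v can be derived mechanically from the text of the alert. The helper below is a small sketch of that; the function name config_from_alert is hypothetical, and it assumes the alert wording matches the example quoted above.

```shell
# Hypothetical helper: turn an alert message such as
#   "Configuration in /tmp/nagios_WRlEzH is corrupt"
# into the path of the configuration file to be verified.
config_from_alert() {
    dir=$(printf '%s\n' "$1" |
          sed -n 's/.*Configuration in \([^ ]*\) is corrupt.*/\1/p')
    # Fail if the message does not have the expected form.
    [ -n "$dir" ] || return 1
    printf '%s/etc/nagios/nagios.cfg\n' "$dir"
}

# Example (as root on nagios), with $alert_text holding the alert message:
#   /usr/sbin/nagios -v "$(config_from_alert "$alert_text")"
```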

2.2 Drop-out of the Jabber connection

The Nagios system sends out all alerts via both Jabber and email. On both nagios and nagios2, the daemon which interfaces the Nagios system to our Jabber service can drop out if either the Jabber server itself temporarily goes down, or if there is an interruption of the network between nagios/nagios2 and the Jabber server. If this happens, then all alerts sent via Jabber will be lost. The fix is to restart the interfacing daemon:

  ssh nagios        (or: ssh nagios2, as appropriate)
  om jnotify stop
  om jnotify start
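Since the same explicit stop-then-start sequence is used for both jnotify and nagios_server, it can be wrapped in a small helper. This is only a sketch: the function name restart_component and the OM override variable are assumptions introduced here so that the sequence can be dry-run; by default the real om command is used.

```shell
# Restart an LCFG component via an explicit stop followed by a start,
# rather than relying on "om <component> restart".
# OM is a hypothetical override (e.g. OM=echo for a dry run).
restart_component() {
    component=$1
    # A failed stop (e.g. the daemon has already died) is reported
    # but does not prevent the subsequent start.
    ${OM:-om} "$component" stop ||
        echo "warning: stop of $component failed" >&2
    ${OM:-om} "$component" start
}

# Example, on nagios or nagios2:
#   restart_component jnotify
```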

One way to confirm that the Nagios/Jabber connection is working correctly is to look at your Pidgin 'Buddy List': the Jabber account used by nagios should be marked as 'Available' if you're expecting alerts from nagios, as should the account used by nagios2 if you're expecting alerts from nagios2. (The latter applies only to managers of the Nagios system.)

2.3 Hardware failures

In the case of a complete hardware failure of either nagios or nagios2, the obvious fix is simply to reinstall the Nagios system to a new machine or machines.

Almost everything required for the configuration of both machines is contained in their source profiles; the only hand configuration that's required is to create a principal and corresponding keytab entry for nagios to access the FRIEND.INF.ED.AC.UK KDC in order to be able to monitor it. This can't be automated since the iFriend KDC doesn't contain a host principal for nagios (or any of our servers, for that matter.) Using me (idurkacz) as an example, the process goes as follows:

Note: when typing the following instructions, replace <nagios> with the actual canonical hostname of the nagios server, and <iFriend KDC> with the actual canonical hostname of the iFriend KDC server. See Appendix A for those hostnames.

  1. Give yourself a principal that will be used for admin purposes on the iFriend KDC:
      ssh <iFriend KDC>
      kadmin.local -r FRIEND.INF.ED.AC.UK
      Authenticating as principal idurkacz/admin@INF.ED.AC.UK with password.
      kadmin.local:  addprinc idurkacz/admin@FRIEND.INF.ED.AC.UK
      kadmin.local:  exit

  2. Make this a true 'admin' principal (in our usual sense) by adding the following to the profile of <iFriend KDC>:
      !kerberos.acls                mADD(idurkaczadmin)
      !kerberos.acl_idurkaczadmin   mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *") 

    Those additions make an entry in the file /var/kerberos/krb5kdc/kadm5.acl. To effect this change, the kadmind daemon needs to be restarted. So, on the iFriend KDC:

      om kerberos restart

  3. Add the following to the profile of <nagios>:
      !kerberos.realms                mSET(ifriend)
      kerberos.name_ifriend           FRIEND.INF.ED.AC.UK
      kerberos.admin_ifriend          <iFriend KDC>

    Those resources add the following entry to the [realms] section of the /etc/krb5.conf file:

        admin_server = <iFriend KDC>

    These resources are necessary; without them, the next step fails with a cryptic error message.

  4. On nagios, create the principal in the iFriend KDC which will be used by the monitoring system to check that KDC, and extract the corresponding keytab entry:
      ssh nagios
      kadmin -r FRIEND.INF.ED.AC.UK -s <iFriend KDC> -p idurkacz/admin@FRIEND.INF.ED.AC.UK
      Authenticating as principal idurkacz/admin@FRIEND.INF.ED.AC.UK with password.
      Password for idurkacz/admin@FRIEND.INF.ED.AC.UK: 
      kadmin:  addprinc nagios/<nagios>
      kadmin:  ktadd -k /etc/nagios.keytab nagios/<nagios>
      kadmin:  exit

  5. On the iFriend KDC, push the nagios/<nagios> principal to all iFriend KDC slave servers so that Nagios monitoring will also start working for them:
      om kerberos push -q 0
    (A cron job on the iFriend KDC does such a push every 15 minutes in any case.)

  6. Finally, revert the various ad-hoc changes made in the course of this set-up work:

    1. Remove the following from the profile of <nagios>:
        !kerberos.realms                mSET(ifriend)
        kerberos.name_ifriend           FRIEND.INF.ED.AC.UK
        kerberos.admin_ifriend          <iFriend KDC>

    2. Remove the following from the profile of <iFriend KDC>:
        !kerberos.acls                mADD(idurkaczadmin)
        !kerberos.acl_idurkaczadmin   mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *") 
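Before moving on, it may be worth confirming that step 4 really did write the expected entry to /etc/nagios.keytab. The sketch below does this by searching the output of klist -k for the principal; the two function names are hypothetical, and the hostname passed in must be the canonical name of the nagios server from Appendix A.

```shell
# Build the name of the principal created in step 4.
# The realm is the one used throughout the steps above.
nagios_principal() {
    printf 'nagios/%s@FRIEND.INF.ED.AC.UK\n' "$1"
}

# Succeeds if the klist output supplied on stdin lists the given principal.
has_principal() {
    grep -F -q -- "$1"
}

# Example (as root on nagios); <nagios> is a placeholder, as above:
#   klist -k /etc/nagios.keytab | has_principal "$(nagios_principal '<nagios>')" \
#       && echo "keytab entry present"
```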

If the primary nagios server has been moved to new hardware then, before starting the service, its /var/log/nagios directory should be populated with the contents of the same directory on the old server. (As well as containing service logs, this directory contains the current state information for the nagios service.) Since these instructions assume that the old server has completely failed, the contents of this directory should be restored from its rsync mirror: get help from the Services Unit if necessary. (If the contents of this directory can't be restored, then the service can still be started, but it will be missing any state information entered by service managers: e.g. service downtimes, comments, etc.)
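Once the mirrored copy of the directory is accessible on the new server, putting it into place amounts to a copy which preserves ownership, permissions and timestamps. The following is a sketch only: the function name restore_state is hypothetical, the location of the mirror is not given here and must be supplied by whoever does the restore, and cp -a is used simply because it is universally available (rsync -a would serve equally well).

```shell
# Copy a mirrored state directory into place, preserving ownership,
# permissions and timestamps. Both paths are supplied by the caller.
restore_state() {
    src=$1; dest=$2
    [ -d "$src" ] || { echo "no such mirror directory: $src" >&2; return 1; }
    # "src/." copies the directory's contents, dotfiles included.
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}

# Example (as root, before starting the service); the source path is a placeholder:
#   restore_state <mirror directory>/var/log/nagios /var/log/nagios
```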

Once the nagios server(s) are successfully running on new hardware, the header live/nagios_client.h should be updated to reflect the new hostname(s) and IP address(es). That header tells nagios clients the server to which they should send their nagios 'passive checks' (e.g. the status reports for network bonding, h/w raid, etc.)

As a (very) short-term fix for a hardware failure of the primary monitoring machine nagios, and in order to reinstate monitoring for very critical client machines, it would be possible to move the monitoring of clients from nagios to nagios2 by mutating the resource nagios_client.monitoringcluster from nagios/all to nagios/slave in the source profile of any such machine. If this is done, it should be reverted as soon as the primary monitoring server is back on-line.

3. Existing documentation

  1. The DICE Monitoring System

    This set of documents describes the overall design of the Informatics monitoring system, as well as the specific resources which a client machine needs to set in order to request monitoring. For reference, a 'minimal' set of resources - where the client is requesting that its ssh service be monitored for liveness - is as follows:

      #include <dice/options/nagios_client.h>
      !nagios_client.manager                  mSET(username or capability)
      !nagios_client.components               mADD(openssh)
      !openssh.nagios_groups                  mADD(username or capability)

    The username or capability field specifies to whom Nagios alerts should be sent in the event of problems; it should therefore be set to the nominal machine 'manager'. If a capability is specified rather than an individual username, it will be expanded by the monitoring system into the corresponding set of usernames. A specific example of such a setting is:

      #include <dice/options/nagios_client.h>
      !nagios_client.manager                  mSET(nagios/inf-unit)
      !nagios_client.components               mADD(openssh)
      !openssh.nagios_groups                  mADD(nagios/inf-unit)

  2. On-line Nagios documentation

    This is the official set of documentation maintained on-line by the Nagios project itself. It describes Nagios in some detail and so is a useful reference, but of course says nothing about the specifics of the Informatics monitoring system.


A. Details of current servers: hostnames, location, power, networking, etc.

| Logical nagios server | Actual hostname | Physical location | Spanning map subscribed to |
| nagios | klaxon | IF B.02 rack 9, slot 21 | nagios/all |
| nagios2 | capon | N/a - a KVM guest on AT KVM host waterloo | nagios/slave |

Server Managing console server Power bar/outlet Network connections
klaxon & sr06/4 & sr07/4
capon N/a - a KVM guest on AT KVM host waterloo N/a

| Logical server | Actual hostname |
| iFriend KDC | hanlon |

-- IanDurkacz

Topic revision: r19 - 06 Jul 2016 - 11:06:46 - IanDurkacz