---+!!Nagios / The DICE Monitoring System

_Last updated: 2016-07-06_

---++!!Contents

%TOC%

---++1. Overview

Our Nagios monitoring system employs two servers, =nagios.inf.ed.ac.uk= and =nagios2.inf.ed.ac.uk=. The first of these monitors all other machines requesting such service (including =nagios2=, note); the second (=nagios2=) monitors only the first.

(The details of exactly how a machine requests to be monitored are described elsewhere - see [[#3_Existing_documentation][Existing documentation]] below; that documentation also describes the significance of the 'Spanning map subscribed to' field in the first table in [[#A_Details_of_current_servers_loc][Appendix A]].)

Both =nagios= and =nagios2= regenerate their Nagios configurations in a continuous loop - see the script =/usr/bin/lcfg-monitor=. This allows the Nagios configuration to change dynamically as machines either request monitoring via their source profiles, or similarly request monitoring to cease. Should the newly-generated configuration be incorrect in any way, the Nagios system continues to run with its existing configuration, and also generates an alert to its nominated manager(s). (See [[#2_1_Invalid_Nagios_configuration][section 2.1 below]] for more on this.)

Should either =nagios= or =nagios2= detect a problem with any machine they're monitoring, they will generate appropriate alerts. All such alerts are sent both via the Jabber service and via email.

Both =nagios= and =nagios2= have remote serial consoles (implemented via IPMI SOL and KVM, respectively), so both can be remotely rebooted when necessary. In addition, =nagios= can be remotely power-cycled via either IPMI or the <code>rfe</code>-able power bars. See [[#A_Details_of_current_servers_loc][Appendix A]] for the configuration details.

---++2. What can go wrong

---+++2.1 Invalid Nagios configuration

The Nagios configuration of =nagios= is ultimately driven by headers and resources declared in the source profiles of all machines which have requested monitoring. It is possible - owing either to incorrect specifications in source profiles, or to inconsistencies in the LCFG system's idea of timestamps - that <code>nagios</code>'s auto-generated Nagios configuration can become invalid. In this case:

   * <code>nagios</code>'s self-monitoring will generate alerts to be sent to that machine's nominated managers; and
   * =tail -f /var/lcfg/log/nagios_server= on =nagios= will display the error condition.

The end effect of such problems is that, whilst the Nagios system will continue to run using its current configuration, no changes to that configuration will be made. In particular, no machines will be added to, or removed from, the monitoring system.

Depending on the exact cause of the problem, there are a few options for fixing it:
<ol>
<li>Identify the client machine whose source profile has caused the problem, and edit its source profile to fix the problem.

To get more information about a failing configuration and to identify the client machine causing the problem, =ssh= to =nagios=, =nsu= to root, and run
<verbatim>
/usr/sbin/nagios -v <config file name>
</verbatim>
where =config file name= will be =<directory>/etc/nagios/nagios.cfg=, with =<directory>= having been given in the alert messages being produced (e.g. ' =Configuration in /tmp/nagios_WRlEzH is corrupt= '). (The =-v= flag tells Nagios to verify the configuration file _without_ subsequently trying to start the Nagios daemon.) A combined sketch of these steps is given after this list.
</li>
<li>Make a cosmetic change to <code>nagios</code>'s profile and submit it, in order to 'kick' the LCFG system's idea of <code>nagios</code>'s timestamp.
</li>
<li>Restart the Nagios server on =nagios=:
<verbatim>
ssh nagios
om nagios_server stop
om nagios_server start
</verbatim>
(_Comment:_ there have been reports that the more obvious command =om nagios_server restart= can fail owing to some as-yet-not-understood race condition - hence the explicit =stop= and =start= above.)
</li>
</ol>
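For convenience, the identification steps in option 1 can be combined into a single check. The following is only a sketch: it assumes that the candidate configuration directories follow the =/tmp/nagios_*= naming pattern quoted in the alert messages.
<verbatim>
ssh nagios
nsu
# Pick the most recently generated candidate configuration directory
# (its name is also quoted in the alert messages):
dir=$(ls -dt /tmp/nagios_* | head -1)
# Verify that configuration without starting the Nagios daemon; any errors
# reported should identify the offending host or service definition:
/usr/sbin/nagios -v "$dir"/etc/nagios/nagios.cfg
</verbatim>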
---+++2.2 Drop-out of the Jabber connection

The Nagios system sends out all alerts via both Jabber and email. On both =nagios= and =nagios2=, the daemon which interfaces the Nagios system to our Jabber service can drop out if either the Jabber server itself temporarily goes down, or if there is an interruption of the network between <code>nagios</code>/<code>nagios2</code> and the Jabber server. If this happens, then all alerts sent via Jabber will be lost.

The fix is to restart the interfacing daemon:
<verbatim>
ssh nagios / ssh nagios2
om jnotify stop
om jnotify start
</verbatim>

One way to confirm that the Nagios/Jabber connection is working correctly is to look at your =Pidgin= 'Buddy List': you should see =nagios@inf.ed.ac.uk= marked as 'Available' if you're expecting alerts from =nagios=, as well as =nagios2@inf.ed.ac.uk= if you're expecting alerts from =nagios2=. (The latter applies only to managers of the Nagios system.)

---+++2.3 Hardware failures

In the case of a complete hardware failure of either =nagios= or =nagios2=, the obvious fix is simply to reinstall the Nagios system to a new machine or machines. Almost everything required for the configuration of both machines is contained in their source profiles; the only hand configuration that's required is to create a principal and corresponding keytab entry for =nagios= to access the FRIEND.INF.ED.AC.UK KDC in order to be able to monitor it. This can't be automated since the iFriend KDC doesn't contain a host principal for =nagios= (or any of our servers, for that matter).

Using me (idurkacz) as an example, the process goes as follows:

<table border="1" cellspacing="0" cellpadding="10"><tr><td>
*Note:* when typing the following instructions, replace =<nagios>= with the actual canonical hostname of the =nagios= server, and =<iFriend KDC>= with the actual canonical hostname of the iFriend KDC server. See [[#A_Details_of_current_servers_loc][Appendix A]] for those hostnames.
<ol>
<li>Give yourself a principal that will be used for admin purposes on the iFriend KDC:
<verbatim>
ssh <iFriend KDC>
nsu
kadmin.local -r FRIEND.INF.ED.AC.UK
Authenticating as principal idurkacz/admin@INF.ED.AC.UK with password.
kadmin.local: addprinc idurkacz/admin@FRIEND.INF.ED.AC.UK
kadmin.local: exit
</verbatim>
</li>
<li>Make this a true 'admin' principal (in our usual sense) by adding the following to the profile of =<iFriend KDC>=:
<verbatim>
!kerberos.acls               mADD(idurkaczadmin)
!kerberos.acl_idurkaczadmin  mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *")
</verbatim>
Those additions make an entry in the file =/var/kerberos/krb5kdc/kadm5.acl=. To effect this change, the =kadmind= daemon needs to be restarted. So, on the iFriend KDC:
<verbatim>
om kerberos restart
</verbatim>
</li>
<li>Add the following to the profile of =<nagios>=:
<verbatim>
!kerberos.realms         mSET(ifriend)
kerberos.name_ifriend    FRIEND.INF.ED.AC.UK
kerberos.admin_ifriend   <iFriend KDC>.inf.ed.ac.uk:749
</verbatim>
Those resources add the following entry to the =[realms]= section of the =/etc/krb5.conf= file:
<verbatim>
FRIEND.INF.ED.AC.UK = {
  admin_server = <iFriend KDC>.inf.ed.ac.uk:749
}
</verbatim>
These resources are necessary; without them, the next step fails with a cryptic error message.
</li>
<li>On =nagios=, create the principal in the iFriend KDC which will be used by the monitoring system to check that KDC, and extract the corresponding keytab entry:
<verbatim>
ssh nagios
nsu
kadmin -r FRIEND.INF.ED.AC.UK -s <iFriend KDC>.inf.ed.ac.uk:749 -p idurkacz/admin@FRIEND.INF.ED.AC.UK
Authenticating as principal idurkacz/admin@FRIEND.INF.ED.AC.UK with password.
Password for idurkacz/admin@FRIEND.INF.ED.AC.UK:
kadmin: addprinc nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK
kadmin: ktadd -k /etc/nagios.keytab nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK
kadmin: exit
</verbatim>
</li>
<li>On the iFriend KDC, push the =nagios/<nagios>.inf.ed.ac.uk@FRIEND.INF.ED.AC.UK= principal to all iFriend KDC slave servers so that Nagios monitoring will also start working for them:
<verbatim>
om kerberos push -q 0
</verbatim>
(A cron job on the iFriend KDC does such a push every 15 minutes in any case.)
</li>
<li>Finally, revert the various ad-hoc changes made in the course of this set-up work:
<ol type="I">
<li>Remove the following from the profile of =<nagios>=:
<verbatim>
!kerberos.realms         mSET(ifriend)
kerberos.name_ifriend    FRIEND.INF.ED.AC.UK
kerberos.admin_ifriend   <iFriend KDC>.inf.ed.ac.uk:749
</verbatim>
</li>
<li>Remove the following from the profile of =<iFriend KDC>=:
<verbatim>
!kerberos.acls               mADD(idurkaczadmin)
!kerberos.acl_idurkaczadmin  mSETQ("idurkacz/admin\@<%kerberos.kdcrealm%> * *")
</verbatim>
</li>
</ol>
</li>
</ol>
</td></tr></table>

If the primary nagios server has been moved to new hardware then, before starting the service, its <code>/var/log/nagios</code> directory should be populated with the contents of the same directory on the old server. (As well as containing service logs, this directory contains the current state information for the nagios service.) Since these instructions assume that the old server has completely failed, the contents of this directory should be restored from its rsync mirror: get help from the Services Unit if necessary. (If the contents of this directory can't be restored, then the service <em>can</em> still be started, but it will be missing any state information entered by service managers: e.g. service downtimes, comments, etc.)
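The following is an illustrative sketch only: the actual location of the rsync mirror is held by the Services Unit, so both =<mirror host>= and =<mirror path>= below are placeholders, not real values.
<verbatim>
# On the newly-installed nagios server, before the nagios service is first started.
# Ask the Services Unit for the real <mirror host> and <mirror path> values.
rsync -a <mirror host>:<mirror path>/var/log/nagios/ /var/log/nagios/
</verbatim>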
Once the nagios server(s) are successfully running on new hardware, the header <code>live/nagios_client.h</code> should be updated to reflect the new hostname(s) and IP address(es). That header tells nagios clients to which server they should send their nagios 'passive checks' (e.g. the status reports for network bonding, h/w raid, etc.).

As a (very) short-term fix for a hardware failure of the primary monitoring machine =nagios=, and in order to reinstate monitoring for very critical client machines, it would be possible to move the monitoring of clients from =nagios= to =nagios2= by mutating the resource =nagios_client.monitoringcluster= from =nagios/all= to =nagios/slave= in the source profile of any such machine. If this _is_ done, it should be reverted as soon as the primary monitoring server is back on-line.
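For illustration, that mutation might look as follows in the affected client's source profile. This is a sketch following the =mSET= convention used elsewhere on this page, not a tested recipe.
<verbatim>
/* Short-term fix only: send this client's monitoring to nagios2.
   Revert as soon as the primary server nagios is back on-line. */
!nagios_client.monitoringcluster  mSET(nagios/slave)
</verbatim>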
---++3. Existing documentation

<ol>
<li>[[http://www.dice.inf.ed.ac.uk/units/infrastructure/Documentation/Monitoring][The DICE Monitoring System]]

This set of documents describes the overall design of the Informatics monitoring system, as well as the specific resources which a client machine needs to set in order to request monitoring.

For reference, a 'minimal' set of resources - where the client is requesting that its =ssh= service be monitored for liveness - is as follows:
<pre>
#include <dice/options/nagios_client.h>
...[snip]...
!nagios_client.manager     mSET(<em>username or capability</em>)
!nagios_client.components  mADD(openssh)
!openssh.nagios_groups     mADD(<em>username or capability</em>)
</pre>
The <em>username or capability</em> field specifies to whom Nagios alerts should be sent in the event of problems; it should therefore be set to the nominal machine 'manager'. If a capability is specified rather than an individual username, it will be expanded by the monitoring system into the corresponding set of usernames.

A specific example of such a setting is:
<verbatim>
#include <dice/options/nagios_client.h>
...[snip]...
!nagios_client.manager     mSET(nagios/inf-unit)
!nagios_client.components  mADD(openssh)
!openssh.nagios_groups     mADD(nagios/inf-unit)
</verbatim>
</li>
<li>On-line [[http://go.nagios.com/nagioscore/docs][Nagios documentation]]

This is the official set of documentation maintained on-line by the Nagios project itself. It describes Nagios in some detail and so is a useful reference, but of course says nothing about the specifics of the Informatics monitoring system.
</li>
</ol>

---++Appendices

---+++A. Details of current servers: hostnames, location, power, networking, etc.

%TABLE{cellpadding="3" tablerules="all"}%
|*Logical nagios server*|*Actual hostname*|*Physical location*|*Spanning map subscribed to*|
|=nagios.inf.ed.ac.uk=|=klaxon=|IF B.02 rack 9, slot 21|=nagios/all=|
|=nagios2.inf.ed.ac.uk=|=capon=|N/a - a KVM guest on AT KVM host =waterloo=|=nagios/slave=|

%TABLE{cellpadding="3" tablerules="all"}%
|*Server*|*Managing console server*|*Power bar/outlet*|*Network connections*|
|=klaxon=|=consoles.inf.ed.ac.uk=|=s18.pdu.f.net/9= & =s19.pdu.f.net/9=|=sr06/4= & =sr07/4=|
|=capon=|=atconsoles.inf.ed.ac.uk=|N/a - a KVM guest on AT KVM host =waterloo=|N/a|

%TABLE{cellpadding="3" tablerules="all"}%
|*Logical server*|*Actual hostname*|
|iFriend KDC|=hanlon=|

-- Main.IanDurkacz