Nagios Alerts for defective hardware

#include <dice/options/hwmon.h>

Including this header enables some simple Nagios checks on a DICE server's hardware:

  1. RAID disks are checked as follows:
    1. MegaSAS: If a disk has gone "critical" or "failed", or a virtual drive has gone "offline", Nagios is sent a critical error. If a virtual drive is "degraded", Nagios gets a warning. This RAID check is performed on machines using dice/options/raid_megaraid_sas.h.
    2. HP: HP RAID drives trigger an error when they are not "OK". This RAID check will be performed on machines using HP RAID.
    3. SAS 5i/R: A virtual drive that has "failed" triggers a Nagios error. One that is "degraded", or that has any other status except "Optimal", triggers a Nagios warning. This RAID check is performed on machines using dice/options/raid_sas5iR.h.
    4. H200: A virtual drive that has "failed" triggers a Nagios error. One that is "degraded", or that has any other status except "Optimal", triggers a Nagios warning. This RAID check is performed on machines using dice/options/raid_h200.h.
  2. Power supplies are checked using IPMI. Nagios gets an error if any failure or lack of redundancy is detected.
  3. Nagios gets an error if the script detects any disks mounted read-only, other than those mounted under /media or from /dev/loop* devices.
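
The status rules for SAS 5i/R and H200 virtual drives, and the read-only mount check, can be sketched roughly as follows. This is an illustration only, not the actual check script: the function names, the case-insensitive status matching, the mapping of a Nagios "error" to the standard CRITICAL plugin exit code, and the use of /proc/mounts-style input are all assumptions made for the example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def virtual_drive_state(status):
    """Map a SAS 5i/R or H200 virtual drive status to a Nagios state.

    Assumption: statuses are compared case-insensitively, and a Nagios
    "error" corresponds to CRITICAL. The real script may differ.
    """
    s = status.strip().lower()
    if s == "optimal":
        return OK
    if s == "failed":
        return CRITICAL
    # "degraded" and any other non-Optimal status give a warning.
    return WARNING


def readonly_mounts(mounts_text,
                    ignore_dirs=("/media",),
                    ignore_dev_prefixes=("/dev/loop",)):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "ro" not in options.split(","):
            continue  # not mounted read-only
        if device.startswith(ignore_dev_prefixes):
            continue  # loop devices are expected to be read-only
        if any(mount_point == d or mount_point.startswith(d + "/")
               for d in ignore_dirs):
            continue  # removable media under /media is ignored
        found.append((device, mount_point))
    return found


if __name__ == "__main__":
    # Hypothetical sample input, not taken from a real DICE server.
    sample = (
        "/dev/sda1 / ext3 rw,relatime 0 0\n"
        "/dev/sdb1 /data ext3 ro,relatime 0 0\n"
        "/dev/loop0 /mnt/iso iso9660 ro 0 0\n"
        "/dev/sr0 /media/cdrom iso9660 ro 0 0\n"
    )
    bad = readonly_mounts(sample)
    print("CRITICAL" if bad else "OK", bad)
```

On a real system /proc/mounts also lists pseudo-filesystems, some of which may legitimately be read-only, so a production check would probably restrict itself to real block devices before raising an error.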

-- ChrisCooke - 17 Oct 2011

Topic revision: r6 - 17 Oct 2011 - 14:12:31 - ChrisCooke
DICE.NagiosHardwareChecks moved from DICE.NagiosRAID on 28 May 2010 - 15:34 by ChrisCooke
 