Nagios Alerts for defective hardware
#include <dice/options/hwmon.h>
This performs some simple Nagios checks on a DICE server's hardware:
- RAID disks are checked as follows:
  - MegaSAS: If a disk has gone "critical" or "failed", Nagios will be sent a critical error. The same happens if a virtual drive goes "offline". If a virtual drive is "degraded", Nagios will get a warning. This RAID check will be performed on machines using dice/options/raid_megaraid_sas.h.
  - HP: HP RAID drives trigger an error when they are not "OK". This RAID check will be performed on machines using HP RAID.
  - SAS 5i/R: A virtual drive that has "failed" triggers a Nagios error. One that is "degraded", or in any other status except "Optimal", triggers a Nagios warning. This RAID check will be performed on machines using dice/options/raid_sas5iR.h.
  - H200: A virtual drive that has "failed" triggers a Nagios error. One that is "degraded", or in any other status except "Optimal", triggers a Nagios warning. This RAID check will be performed on machines using dice/options/raid_h200.h.
- Power supplies are checked using IPMI. Nagios gets an error if any failure or lack of redundancy is detected.
- Nagios gets an error if the script detects any disks mounted read-only on anything other than /media or /dev/loop*.
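
To make the rules above concrete, here is a minimal Python sketch of two of the checks: the SAS 5i/R and H200 virtual drive status mapping, and the read-only mount check. It is not the script that hwmon.h installs. The function names are hypothetical, the "Nagios error" in the text is assumed to mean the standard CRITICAL plugin state, and the exemptions are assumed to mean "mount points at or under /media" and "devices matching /dev/loop*". The mount data is assumed to be in /proc/mounts format.

```python
import fnmatch

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def virtual_drive_state(status):
    """Map a SAS 5i/R or H200 virtual drive status string to a Nagios state."""
    normalized = status.strip().lower()
    if normalized == "optimal":
        return OK
    if normalized == "failed":
        # Assumption: the page's "Nagios error" corresponds to CRITICAL.
        return CRITICAL
    # "degraded" and any other non-Optimal status both produce a warning.
    return WARNING


def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is expected in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "ro" not in options.split(","):
            continue  # mounted read-write
        # Assumed reading of the exemptions: anything at or under /media ...
        if mount_point == "/media" or mount_point.startswith("/media/"):
            continue
        # ... and any loop device.
        if fnmatch.fnmatch(device, "/dev/loop*"):
            continue
        flagged.append((device, mount_point))
    return flagged
```

A wrapper plugin could pass the contents of /proc/mounts to read_only_mounts and exit with CRITICAL when the returned list is non-empty, or with OK otherwise.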
-- ChrisCooke - 17 Oct 2011
Topic revision: r6 - 17 Oct 2011 - 14:12:31 - ChrisCooke