Server Hardware Interaction

(devproj 134)

The project was allocated two weeks of effort, which is about what it took. It was somewhat open-ended: it started with a list of possible deliverables, as many of which as possible were to be delivered in the time available. The two most important items on the list were delivered. They are simple solutions rather than gold-plated ones, but they do the job.

Toohot

The toohot script runs on all compatible DICE servers (all but the most elderly hardware). It shuts down a server when the ambient temperature reaches a dangerously high level. This is intended as one stage of a multi-level high temperature warning system, other components of which will act with greater intelligence to send out warnings, save data and so on. This part acts as a last-ditch defence when other high temperature warnings have failed: it simply shuts the machine down cleanly before the hardware starts automatically shutting down its own components, which can damage data.
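The decision toohot makes can be illustrated with a minimal sketch. This is not the actual toohot script, whose implementation is not described in this report. The 40 degree threshold, the function names, and the use of ipmitool as the sensor source are all assumptions made for illustration, as is the sensor output format the parser expects.

```python
import re
import subprocess

# Hypothetical threshold: the real toohot limit is not stated in this report.
SHUTDOWN_THRESHOLD_C = 40.0


def parse_ambient_celsius(sensor_output):
    """Return the first ambient temperature found in the sensor text, or None.

    Assumes lines such as "Ambient Temp | 0Eh | ok | 7.1 | 22 degrees C";
    real sensor names and layouts vary between hardware models.
    """
    for line in sensor_output.splitlines():
        if "ambient" not in line.lower():
            continue
        match = re.search(r"(-?\d+(?:\.\d+)?)\s*degrees C", line)
        if match:
            return float(match.group(1))
    return None


def should_shut_down(temperature_c, threshold_c=SHUTDOWN_THRESHOLD_C):
    """Decide whether to shut down; a missing reading never triggers it."""
    return temperature_c is not None and temperature_c >= threshold_c


def main():
    """Read the sensors once and shut down if the ambient reading is too high.

    Assumes ipmitool is installed and the caller has permission to shut down.
    """
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=False,
    )
    if should_shut_down(parse_ambient_celsius(result.stdout)):
        subprocess.run(["shutdown", "-h", "now"], check=False)
```

In this sketch main() would be called periodically, for example from cron. Treating a missing reading as "do not shut down" is a deliberate choice here: a sensor glitch should not take a server offline, and the earlier stages of the warning system are expected to cover that case.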

Nagios Alerts for defective hardware

Again on most DICE servers, warnings are sent via Nagios when certain hardware problems are detected:
  • When a power connection fails in a server with two power connections
  • When a high speed RAID disk or volume fails or reports degraded health
The project has established a simple framework making it straightforward to add other tests in the future where necessary and practical.
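One way such a framework could be organised is sketched below. This is an illustration under stated assumptions, not the DICE implementation: the registry, the check functions, the "facts" dictionary and the RAID state names are all hypothetical. Only the status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Ranking used to pick the overall status (a choice made for this sketch).
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}

# Hypothetical registry: adding a new test means adding one decorated function.
CHECKS = {}


def hardware_check(name):
    """Register a check function under the given name."""
    def register(func):
        CHECKS[name] = func
        return func
    return register


@hardware_check("power")
def check_power(facts):
    # "power_supplies_ok" is an assumed key holding the number of live feeds.
    working = facts.get("power_supplies_ok")
    if working is None:
        return UNKNOWN, "power state unavailable"
    if working < 2:
        return CRITICAL, f"only {working} of 2 power connections working"
    return OK, "both power connections working"


@hardware_check("raid")
def check_raid(facts):
    # "raid_state" values ("optimal", "failed", ...) are assumed names.
    state = facts.get("raid_state")
    if state is None:
        return UNKNOWN, "RAID state unavailable"
    if state == "failed":
        return CRITICAL, "RAID disk or volume failed"
    if state != "optimal":
        return WARNING, f"RAID health is {state}"
    return OK, "RAID volume optimal"


def run_checks(facts):
    """Run every registered check and return (worst status, one-line summary)."""
    results = [(name,) + check(facts) for name, check in CHECKS.items()]
    worst = max((status for _, status, _ in results), key=SEVERITY.get, default=OK)
    summary = "; ".join(f"{name}: {message}" for name, _, message in results)
    return worst, f"HARDWARE {STATUS_NAMES[worst]} - {summary}"
```

A plugin built this way would gather the facts from the hardware tools, print the summary line, and exit with the returned status. For example, run_checks({"power_supplies_ok": 1, "raid_state": "optimal"}) returns the CRITICAL status, because one of the two power connections has failed.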

If I had to make a criticism, it would be that I would probably have made quicker progress if I had talked to colleagues more at the start of the project and thrashed out the technical details with them, rather than trying to tackle the whole thing on my own. Despite that, the project did deliver its two most urgent goals.

A number of areas remain for future development:

  • Adding Nagios alerts for other types of RAID hardware
  • Adding Nagios alerts for other detectable hardware problems
  • Detection (and alerts?) of out-of-date RAID firmware, BIOS, etc.
  • Automated updates for firmware, BIOS, etc.

-- ChrisCooke - 28 Jul 2010

Topic revision: r1 - 28 Jul 2010 - 18:18:17 - ChrisCooke
 