Replacing a failed or failing disk

Here's how to replace a failed or failing disk on a Dell PowerEdge R720.

RAID tool

The RAID tool on the R720 is called MegaCli. The actual executable is at
/opt/MegaRAID/MegaCli/MegaCli64
The MegaCli Cheat Sheet is an invaluable reference.

How do you know that a disk has a problem?

When a disk fails the status of its RAID array will change to "degraded".

Assuming that your server is using hwmon - and it should be because hwmon is included in server.h and in small-server.h - this status change will show up in the Nagios hwmon check, which will change to a "warning" status (yellow on the web page). The hwmon checks run every 15 minutes. As soon as four checks in a row have reported the same degraded RAID status, Nagios will send a notification via Jabber. It will send another such notification 15 minutes later, and a third 15 minutes after that. Subsequent Nagios notifications will go by email, and will carry on being sent every 15 minutes until they are acknowledged or the problem has gone away.

If a disk is failing but has not yet failed, its RAID array will still have the status "Optimal", but hwmon will report a "critical" status (red on the web page).

Which disk is affected?

For a failed disk:

If you're physically present then look at the lights on the front of the server. The light on the failed disk should no longer be a steady green. The disk slots on an R720 have small numbers marked just above them: these are the disk slot numbers. Note the slot number of the failed disk. Now log in, nsu, then run
/opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL
and look for the enclosure's "Device ID" number. This is the "Enclosure ID" which you will need later, so note it down.
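
The EncInfo output can be quite long. If you only want the ID numbers, a quick grep (just a convenience pipeline, not part of the official procedure) will pick them out:
/opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL | grep -i 'Device ID'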

If you're remote from the server then check the logical disk status with

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
The "State" of a logical disk should be "Optimal", but a logical disk with one failed drive will instead be in a "Degraded" state. You can also check the physical disk information with
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
A healthy physical disk will normally have a "Firmware state" of "Online, Spun Up". Once you've spotted your failed disk note down its "Enclosure Device ID" and "Slot Number" as you will need them later.
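
The PDList output runs to several screenfuls on a server with many disks. If you just want a quick overview, a grep such as this (a convenience only; the field names are those shown in the full output) gives one line of each kind per disk:
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state'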

For a failing disk:

Check the physical disk information as above with
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
The failing disk may still have the normal "Firmware state" of "Online, Spun Up", but look for a "Media Error Count" or "Predictive Failure Count" greater than zero, and "Drive has flagged a S.M.A.R.T. alert" of "Yes" rather than the usual "No".
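
As before you can grep the output down to just the interesting counters, for example:
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Slot Number|Media Error Count|Predictive Failure Count|S.M.A.R.T'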

If you're physically present, the disk's usual solid green light may instead be flashing and changing colour (e.g. alternating between green and orange).

Get the (RAID controller) Events Log

More detail on errors will be available in the events log. Also, if a problem disk is being reported to a supplier, and especially if it has logged errors but has not yet failed, the supplier may ask for the events log. This is how to get a copy of it (in this case in the file /tmp/megacli-events.log):
/opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -f /tmp/megacli-events.log -aALL
A disk which is about to fail may have logged errors such as these:
Event Description: Unexpected sense: PD 02(e0x20/s2) Path 5000c50072624591, CDB: 2f 00 0c 69 50 b0 00 10 00 00, Sense: 3/11/00
Event Description: Patrol Read corrected medium error on PD 02(e0x20/s2) at c6950b3
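
Once you have the log file you can search it for error-related entries in the usual way, for example:
grep -iE 'error|failure' /tmp/megacli-events.log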

Removing a disk ...

You should by now have an Enclosure ID number and a slot number for the failed drive. You'll also need the ID number of the adapter, but this will almost certainly be 0.
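
If you want to double-check that there really is only one adapter, MegaCli can count them for you:
/opt/MegaRAID/MegaCli/MegaCli64 -adpCount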

... the recommended way

This is the textbook way to do it; you should follow this procedure if you can, especially if you're removing a healthy disk. In these commands replace E with your enclosure ID number and S with the slot number.
  1. Take the disk offline:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv '[E:S]' -a0
  2. Mark the disk as missing:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDMarkMissing -PhysDrv '[E:S]' -a0
  3. Prepare the disk for removal:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDPrpRmv -PhysDrv '[E:S]' -a0
At each stage you should wait for the command to exit successfully. Having done all this you can safely remove the disk.
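
As an illustration, with a made-up enclosure ID of 32 and slot number 2 the three commands would be:
/opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv '[32:2]' -a0
/opt/MegaRAID/MegaCli/MegaCli64 -PDMarkMissing -PhysDrv '[32:2]' -a0
/opt/MegaRAID/MegaCli/MegaCli64 -PDPrpRmv -PhysDrv '[32:2]' -a0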

... the cavalier way

Just remove the disk. You'll probably get away with this if the disk has failed completely, because failed disks are automatically removed from the RAID set. Don't risk it for disks which are still functioning in some fashion. They should be removed the recommended way (see above).

Adding a replacement disk

Pop the replacement disk into the slot which contained the failed disk, and press it home in the normal way. The RAID hardware will see the new disk and will straight away incorporate it into the array and start to rebuild onto it. Its lights should start flashing rapidly in sync with another disk: this shows that the rebuilding is underway. You can confirm it by checking the physical disk status (see above). The new disk should have a "Firmware state" of "Rebuild" during the rebuilding process. Once the process has finished it will change to the normal "Online, Spun Up".
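
If you want to follow the rebuild, MegaCli can report its progress as a percentage (again replace E and S with your enclosure ID and slot number):
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv '[E:S]' -a0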

Disks which have almost failed and are being sent back to the supplier should first be wiped, for example with DBAN.
