Replacing a failed or failing Dell server disk

Here's how to replace a failed or failing disk on a Dell PowerEdge server with a MegaRAID disk controller (which is most of them).

Diagnosis and information-gathering

RAID tool(s)

The RAID tool for MegaRAID controllers is called MegaCli. The actual executable is at

/opt/MegaRAID/MegaCli/MegaCli64
It's not the friendliest software in the world so the MegaCli Cheat Sheet is an invaluable help.

Other (largely non-Dell) servers might use different controllers and utilities, though variation is less likely on modern servers. There's a utility in the CO utils area called checkraid which attempts to guess which controller you're using and will produce a nominal status output for any supported controller:

/afs/inf.ed.ac.uk/group/cos/utils/checkraid [all]

The advice below on interpreting output applies equally to checkraid, though for more in-depth analysis you might still need to call MegaCli manually.

How do you know that a disk has a problem?

When a disk fails, MegaCli will report the status of its RAID array as "Degraded" (instead of "Optimal").

Assuming that your server is using hwmon - and it should be because hwmon is included in server.h and in small-server.h - this status change will show up in the Nagios hwmon check, which will change to a "warning" status (yellow on the web page). The hwmon checks run every 15 minutes. As soon as four checks in a row have reported the same degraded RAID status, Nagios will send a notification via Jabber. It will send another such notification 15 minutes later, and a third 15 minutes after that. Subsequent Nagios notifications will go by email, and will carry on being sent every 15 minutes until they are acknowledged or the problem has gone away.
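
As a rough worked example of that timing, here is a minimal sketch that counts minutes from the first hwmon check to see the problem. It assumes the checks really do land exactly 15 minutes apart and that the fourth consecutive bad check triggers the first notification; the function name is made up for illustration.

```shell
# Rough timeline of Nagios notifications, in minutes after the first hwmon
# check that reports the problem. Figures are taken from the text above.
notification_timeline() {
    interval=15       # minutes between hwmon checks
    checks_needed=4   # consecutive bad checks before the first notification
    first=$(( (checks_needed - 1) * interval ))
    n=1
    while [ "$n" -le 3 ]; do
        echo "Jabber notification $n: +$(( first + (n - 1) * interval )) min"
        n=$(( n + 1 ))
    done
    echo "Email notifications: every $interval min from +$(( first + 3 * interval )) min"
}
```

The disk may of course have failed up to 15 minutes before the first check noticed, so add that to get the worst case.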

If a disk is failing but has not yet failed, its RAID array will still have the status "Optimal", but hwmon will report a "critical" status (red on the machine's Nagios service status detail web page).

In either case, the disk should be replaced as soon as possible.

Which disk is affected?

For a failed disk:

If you're physically present then look at the lights on the front of the server. The light on the failed disk should no longer be a steady green. The disk slots on an R720 have small numbers marked just above them. These are the disk slot numbers. Note the slot number of the failed disk. Now log in, nsu, then run
/opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL
and look for the enclosure's "Device ID" number. This is the "Enclosure ID" which you will need later so note it down.
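
If the output is long, a small filter can pick the number out for you. This is only a sketch: it assumes the field appears on a line of its own in the form "Device ID : 32" (possibly indented), which may not hold for every MegaCli version, and the function name is hypothetical.

```shell
# Print the "Device ID" of each enclosure found in -EncInfo output on stdin.
# Assumes one "Device ID : N" line per enclosure (layout is an assumption).
enclosure_ids() {
    awk -F':' '
        /^[ \t]*Device ID[ \t]*:/ {
            gsub(/[ \t\r]/, "", $2)   # strip padding around the number
            print $2
        }'
}

# Usage (as root):
#   /opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL | enclosure_ids
```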

If you're remote from the server then check the logical disk status with

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
The "State" of a logical disk should be "Optimal", but a logical disk with one failed drive will instead be in a "Degraded" state. You can also check the physical disk information with
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
A healthy physical disk will normally have a "Firmware state" of "Online, Spun Up". Once you've spotted your failed disk note down its "Enclosure Device ID" and "Slot Number" as you will need them later.
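
On a server with many disks it can help to filter the physical disk list down to the ones that are not healthy. A minimal sketch follows, assuming each disk's record lists "Enclosure Device ID", then "Slot Number", then "Firmware state", each as "Field: value" at the start of a line. The function name is made up for this example.

```shell
# Read -PDList output on stdin and print "ENCLOSURE:SLOT state" for every
# disk whose firmware state is anything other than "Online, Spun Up".
unhealthy_disks() {
    awk -F':[ \t]*' '
        { sub(/[ \t\r]+$/, "") }              # drop trailing whitespace
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/ {
            if ($2 != "Online, Spun Up")
                print enc ":" slot " " $2
        }'
}

# Usage (as root):
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | unhealthy_disks
```

Not every disk listed will necessarily have failed: anything deliberately kept out of an array (a hot spare, say) would also show a different state, so read the result with that in mind.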

For a failing disk:

Check the physical disk information as above with
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
The failing disk may still have the normal "Firmware state" of "Online, Spun Up", but look for a "Media Error Count" or "Predictive Failure Count" greater than zero, and "Drive has flagged a S.M.A.R.T. alert" of "Yes" rather than the usual "No".
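
Those three fields can be checked for every disk in one pass. The sketch below assumes the same "Field: value" layout as the fields named above, with the enclosure and slot lines appearing before the counters in each disk's record; the function name is hypothetical.

```shell
# Read -PDList output on stdin and report disks showing signs of failing:
# non-zero media error or predictive failure counts, or a S.M.A.R.T. alert.
failing_disks() {
    awk -F':[ \t]*' '
        { sub(/[ \t\r]+$/, "") }              # drop trailing whitespace
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Media Error Count:/ {
            if ($2 + 0 > 0) print enc ":" slot " media errors=" $2
        }
        /^Predictive Failure Count:/ {
            if ($2 + 0 > 0) print enc ":" slot " predictive failures=" $2
        }
        /^Drive has flagged a S\.M\.A\.R\.T\.? alert/ {
            if ($NF == "Yes") print enc ":" slot " SMART alert"
        }'
}

# Usage (as root):
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | failing_disks
```

No output means none of the three warning signs was found.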

If you're physically present, the usual solid green light on the disk may be flashing and changing its colour (e.g. flashing on and off and alternately green and orange).

Get the (RAID controller) Events Log

More detail on errors will be available in the events log. Also, if a problem disk is being reported to Dell, and especially if it has logged errors but has not yet failed, Dell may ask for the events log. This is how to get a copy of it (in this case in the file /tmp/megacli-events.log):
/opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -f /tmp/megacli-events.log -aALL
A disk which is about to fail may have logged errors such as these:
Event Description: Unexpected sense: PD 02(e0x20/s2) Path 5000c50072624591, CDB: 2f 00 0c 69 50 b0 00 10 00 00, Sense: 3/11/00
Event Description: Patrol Read corrected medium error on PD 02(e0x20/s2) at c6950b3
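
The events log can be large, so it may be worth pulling out just the entries for the disk you care about. This sketch matches the "PD 02(e0x20/s2)" style shown in the examples above, where the number after "s" looks like the slot number; that reading of the format, and the function name, are assumptions.

```shell
# Print the event descriptions in a saved events log that mention a slot.
# Usage: events_for_slot SLOT [LOGFILE]
events_for_slot() {
    slot=$1
    logfile=${2:-/tmp/megacli-events.log}
    case $slot in
        ''|*[!0-9]*)
            echo "usage: events_for_slot SLOT [LOGFILE]" >&2
            return 2
            ;;
    esac
    # The closing bracket after the slot stops "s2" from matching "s20".
    grep -E "^Event Description:.*PD [0-9a-fA-F]+\(e0x[0-9a-fA-F]+/s${slot}\)" "$logfile"
}

# Usage:
#   events_for_slot 2
```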

Get the disk's details

Once you know which slot the problem disk is in, you can find out more about it. Various fields in the output of the physical device info will give you clues, but for a proper answer use smartctl. Where a MegaRAID controller is in use, the command to use is
smartctl -i -d megaraid,slot /dev/sda
Where slot is the slot occupied by the disk - you got this already from the physical disk info or from looking at the machine's front panel. Technically the last thing on the smartctl command line should be the device name of the logical disk of which the dodgy physical disk is a part - but in tests /dev/sda seems to work for all physical disks on the system whichever logical disk they were actually part of; so /dev/sda will probably do. Here's some of what smartctl might tell you about an example disk:
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3600057SS
Revision:             ES68
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Form Factor:          3.5 inches
Device type:          disk
Transport protocol:   SAS
In this example, the disk is a Seagate 600GB 3.5" 15K SAS disk, and its firmware revision is ES68. Transport protocol will generally be either SAS or SATA; speed will usually be either 15K (15000 rpm) or 7.2K (7200 rpm); the form factor will be either 3.5 inches or 2.5 inches.
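
If you want those details as a single line, for pasting into a report for instance, something like the following might do. It is a sketch that assumes the field names shown in the SAS example above; other disk types may well label their fields differently, in which case parts of the summary will come out blank. The function name is made up.

```shell
# Condense "smartctl -i" output (on stdin) into a one-line description.
disk_summary() {
    awk -F':[ \t]*' '
        { sub(/[ \t\r]+$/, "") }              # drop trailing whitespace
        /^Vendor:/             { vendor = $2 }
        /^Product:/            { product = $2 }
        /^Revision:/           { rev = $2 }
        /^Rotation Rate:/      { rate = $2 }
        /^Form Factor:/        { form = $2 }
        /^Transport protocol:/ { proto = $2 }
        END {
            print vendor " " product " (firmware " rev "), " form ", " rate ", " proto
        }'
}

# Usage, for the disk in slot 2 (as root):
#   smartctl -i -d megaraid,2 /dev/sda | disk_summary
```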

Check the machine's support status

See ReportingDellHardwareIssues.

Getting a replacement disk

Contacting Dell

See ReportingDellHardwareIssues.

Once your Dell contact agrees to send a new disk, it's probably best to ask for it to be sent to the Informatics Forum, 10 Crichton Street, EH8 9AB. That way it'll be delivered to reception. This becomes convenient later on in the process.

These days Dell sends out 2.5" disks as replacements for 3.5" disks, but in a wee 3.5" adapter frame. This works OK.

Replacing the disk

Removing a disk ...

You should by now have an Enclosure ID number and a slot number for the failed drive. You'll also need the ID number of the adapter but this will probably always be 0.

... the recommended way

This is the textbook way to do it; you should follow this procedure if you can, especially if you're removing a healthy disk. In these commands replace E with your enclosure ID number and S with the slot number.
  1. Take the disk offline:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv '[E:S]' -a0
  2. Mark the disk as missing:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDMarkMissing -PhysDrv '[E:S]' -a0
  3. Prepare the disk for removal:
    /opt/MegaRAID/MegaCli/MegaCli64 -PDPrpRmv -PhysDrv '[E:S]' -a0
At each stage you should wait for the command to exit successfully. Having done all this you can safely remove the disk.
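
The three steps could be wrapped up so that each one runs only if the previous one exited successfully. This is a sketch rather than a tested tool: the function name and the MEGACLI override variable are inventions for this example, and it assumes the enclosure, slot and adapter IDs are plain decimal numbers.

```shell
# Run the three removal steps in order, stopping at the first failure.
# Usage: remove_disk ENCLOSURE SLOT [ADAPTER]     (adapter defaults to 0)
# Set MEGACLI=echo to print the arguments instead of running MegaCli.
remove_disk() {
    enc=$1
    slot=$2
    adapter=${3:-0}
    megacli=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}
    for value in "$enc" "$slot" "$adapter"; do
        case $value in
            ''|*[!0-9]*)
                echo "usage: remove_disk ENCLOSURE SLOT [ADAPTER]" >&2
                return 2
                ;;
        esac
    done
    for step in -PDOffline -PDMarkMissing -PDPrpRmv; do
        if ! "$megacli" "$step" -PhysDrv "[$enc:$slot]" -a"$adapter"; then
            echo "$step failed; not continuing" >&2
            return 1
        fi
    done
}

# Preview first, then run for real (as root):
#   MEGACLI=echo remove_disk 32 2
#   remove_disk 32 2
```

Running the preview first is a cheap way to check that you have the enclosure and slot the right way round before anything is taken offline.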

... the cavalier way

Just remove the disk. You'll probably get away with this if the disk has failed completely, because failed disks are automatically removed from the RAID set. Don't risk it for disks which are still functioning in some fashion. They should be removed the recommended way (see above).

Swapping the disk caddy

The replacement disk will have come without a caddy. Unscrew the caddy from the problem disk and fit it to the replacement. This only takes a minute or two, and it's fairly obvious how to do it. Each server room has a screwdriver in the toolbox.

Adding a replacement disk

Pop the replacement disk into the slot which contained the failed disk, and press it home in the normal way. The RAID hardware will see the new disk and will straight away incorporate it into the array and start to rebuild onto it. Its lights should start flashing rapidly in sync with another disk: this shows that the rebuilding is underway. You can confirm it by checking the physical disk status (see above). The new disk should have a "Firmware state" of "Rebuild" during the rebuilding process. Once the process has finished it will change to the normal "Online, Spun Up".
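
To keep an eye on the rebuild without reading the whole disk list each time, you could extract just the state of the new disk. A minimal sketch, assuming each disk's record has a "Slot Number" line before its "Firmware state" line; the function name is hypothetical.

```shell
# Print the "Firmware state" of the disk in the given slot, reading
# -PDList output on stdin. Usage: slot_state SLOT
slot_state() {
    awk -F':[ \t]*' -v want="$1" '
        { sub(/[ \t\r]+$/, "") }              # drop trailing whitespace
        /^Slot Number:/    { slot = $2 }
        /^Firmware state:/ { if (slot == want) print $2 }'
}

# Usage (as root): wait until slot 2 is no longer rebuilding, then report.
#   while [ "$(/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | slot_state 2)" = "Rebuild" ]; do
#       sleep 300
#   done
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | slot_state 2
```

On a machine with more than one enclosure the same slot number may appear more than once, in which case this prints one line per match.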

Sending back the problem disk

Unless your server was bought with the Keep Your Hard Drive option, when Dell sends you a replacement disk it expects the problem disk to be sent back.

Keep Your Hard Drive?

Was your server bought with the Keep Your Hard Drive option? To find out, go back to the detailed inventory listing for your server. One of the fields returned by the inventory command will be ticket. This will give you the number of an RT ticket. With any luck that ticket will have all the details of the purchase of the machine. Look at the documents which are attached to the ticket. One of them should be a detailed quote for the purchase. Download and open that document and look to see if one of the items on the quote mentions Keep Your Hard Drive. If it does, you're in luck, and you can stop reading at this point.

How to wipe it

Disks which have not yet actually failed should first be wiped before being sent to Dell. Put the disk into a disk caddy and find a machine with a spare slot for it. Then either PXE-boot into DBAN and use that to wipe the disk, or more simply, identify the disk's device name in the machine you're using (e.g. /dev/sdb) and fill the disk with zeroes, e.g.
dd if=/dev/zero of=/dev/sdb
Once your disk has been wiped, remove it from the caddy again.
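
Since dd will cheerfully overwrite whichever device you name, a quick sanity check beforehand is worth having. The sketch below only looks for mounted filesystems on the device or its partitions; it knows nothing about swap, LVM or software RAID, so treat a pass as a hint and not as proof. The function name is made up, and the bs=1M in the usage line is an optional addition to speed the copy up.

```shell
# Succeed (exit 0) if the device, or any partition of it, appears as a
# mounted filesystem in the mount table.
# Usage: disk_in_use DEVICE [MOUNT_TABLE]    (table defaults to /proc/mounts)
disk_in_use() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$dev" '
        index($1, dev) == 1 { found = 1 }     # /dev/sdb also matches /dev/sdb1
        END { exit !found }' "$mounts"
}

# Usage (as root): only wipe if nothing on the disk is mounted.
#   if disk_in_use /dev/sdb; then
#       echo "/dev/sdb is in use; not wiping" >&2
#   else
#       dd if=/dev/zero of=/dev/sdb bs=1M
#   fi
```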

It is more important to ensure that the disk has been wiped than to return it promptly.

How to send it

The courier which delivered your replacement disk will automatically make three attempts to pick up the old disk, one attempt per day, with the first attempt happening on the next working day after the delivery. The courier will try to collect from wherever the replacement was delivered to, so giving the address of the Informatics Forum is a good idea, so that the receptionist can deal with the courier.

If it's not convenient for you to make the old disk available for collection so soon, just explain this to your Dell representative: they're generally very understanding and relaxed about this. If you miss one or more of the automated courier pickups, you will automatically receive a stern message from the courier company. Again, just contact your Dell representative and check that they're happy with your delay.

To send the disk, put it in the same packaging that the replacement was delivered in; close the box; tape it shut. You won't need to change the shipping label. Just hand the box back to reception with an explanation. The receptionist will be able to tell you when the disk has been collected.

Topic revision: r15 - 01 Jun 2018 - 11:29:37 - ChrisCooke
 