Fixing problems with AFS servers

At the time of writing (27/4/09) we have twice seen the new Linux-based file servers lose contact with the new EVO disk arrays. It is not yet clear why this is happening, but the effects are obvious: some users lose the ability to access their AFS home directories. This page explains how to recognise that this is happening and how to restore normal service.

The symptoms

Normally the first indication that this is happening will be a user complaining that they cannot access their AFS file space. When this happens, run vos examine on their home directory volume. For example, if the user has the username cms:

# vos examine user.cms

user.cms                          536871009 RW    6003395 K  On-line
    squonk.inf.ed.ac.uk /vicepa 
    RWrite  536871009 ROnly  536871010 Backup  536871011 
    MaxQuota    8000000 K 
    Creation    Thu Jan  1 01:00:00 1970
    Copy        Wed Jan 14 16:49:06 2009
    Backup      Fri Mar 27 07:17:39 2009
    Last Update Fri Mar 27 14:54:35 2009
    22249 accesses in the past day (i.e., vnode references)

    RWrite: 536871009     ROnly: 536871010     Backup: 536871011 
    number of sites -> 3
       server squonk.inf.ed.ac.uk partition /vicepa RW Site 
       server wyvern.inf.ed.ac.uk partition /vicepa RO Site 
       server squonk.inf.ed.ac.uk partition /vicepa RO Site 

If the command returns an error and only the second part of the above output (the VLDB information) is produced, it's a fair bet that the server has lost contact with the partition containing the RW volume.
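As a quick cross-check, vos listvldb queries only the VLDB (not the file server), so it should still respond and show the same site information as the second part of the output above:

# vos listvldb user.cms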

Log into the server which hosts the RW volume (in the above example this is squonk). You will probably find that the partition containing the affected volume (in the above example vicepa) is now mounted read-only. There will also be error messages in /var/lcfg/log/syslog. As final confirmation, run vos listvol against the server and partition of the RW volume. Once again, in the above example we would type

# vos listvol squonk vicepa

Total number of volumes on server squonk partition /vicepa: 247 
backup.root                       536871305 RW       2009 K On-line
gdir.readonly                     536871111 RO         40 K On-line
pkgsdir.readonly                  536873334 RO        189 K On-line
project.readonly                  536870928 RO          2 K On-line
restore.volume                    536874829 RW    1686274 K On-line
restore.volume.readonly           536878292 RO    1686274 K On-line
root.afs.readonly                 536870913 RO          4 K On-line
root.cell.readonly                536870916 RO          8 K On-line
                                    .
                                    .
user.v1swils2                     536873534 RW    4207475 K On-line
user.v1swils2.backup              536873536 BK    4207475 K On-line
user.v1swils2.readonly            536873535 RO    4207475 K On-line
user.v1yabush                     536872207 RW    3480377 K On-line
user.v1yabush.backup              536872209 BK    3480377 K On-line
user.v1yabush.readonly            536872208 RO    3480377 K On-line

Total volumes onLine 247 ; Total volumes offLine 0 ; Total busy 0

Examine the output from this command, especially the Total volumes offLine field. If there are any volumes offline, we need to get them back online. Run the vos listvol command on the other partitions on this server to see whether any of them also have offline volumes, and take a note of these. Also check that the other AFS servers at that site are not affected. At present we have eight Linux AFS file servers: squonk, crocotta, bunyip and cameleopard in the Forum server area, and unicorn, lammasu, pyrolisk and cockatrice in JCMB-2501. We still have two Solaris file servers, sphinx and wyvern, in 2501, but they are not attached to the EVO.
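A quick way to sweep every partition on a server is a small shell loop like the sketch below (run with AFS admin rights; adjust the server list to the affected site). It only uses vos listpart and vos listvol, and prints just the summary line for each partition so that any non-zero offLine count stands out:

for server in squonk crocotta bunyip cameleopard; do
    # vos listpart prints the vice partitions on the server, e.g. /vicepa
    for part in $(vos listpart $server | grep vicep); do
        echo "== $server $part =="
        # only the summary line is interesting: look for a non-zero offLine count
        vos listvol $server $part | grep 'Total volumes'
    done
done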

The first thing we need to do is to get the AFS vice partitions remounted read-write. The simplest and safest way to do this is to reboot the server. It's possible that the server will fsck the partitions, so settle down for a bit of a wait as these partitions are 250GB each.
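Before and after the reboot you can check the mount state of the vice partitions directly. The device names and filesystem type below are made up for illustration; what matters is whether the options field shows ro or rw:

# grep vicep /proc/mounts

/dev/sdb1 /vicepa ext3 ro 0 0
/dev/sdc1 /vicepb ext3 rw 0 0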

When the server has rebooted, check that the AFS partitions are mounted read-write. Next we need to bring the affected volumes back online. It's possible that this is already happening. Run the bos status command on the server:

# bos status squonk 
Instance fs, currently running normally.
    Auxiliary status is: file server running.

This is what you would normally expect to see. If instead the Auxiliary status is salvaging file system, then the AFS server has detected errors on the file system and is correcting them for you. All you need to do is wait (on average the Linux file servers take just under an hour to salvage all their partitions). You can follow progress by monitoring /usr/afs/logs/SalvageLog* on the server.
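For example, to keep an eye on the salvage, list the salvage logs and tail the most recent one (shown here as SalvageLog, the usual name):

# ls -lrt /usr/afs/logs/SalvageLog*
# tail -f /usr/afs/logs/SalvageLog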

If the salvager isn't already running, we need to salvage the partitions we noted earlier to bring the volumes online. Use the bos salvage command:

# bos salvage squonk vicepa

This will shut down the file server and start the salvage of the partition. Once again, you can track progress by looking at /usr/afs/logs/SalvageLog* on the affected server. When the salvage completes, use the vos listvol command again to check that the volumes are now online. Remember to salvage every partition which had offline volumes.
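For example, if the earlier check showed offline volumes on both vicepa and vicepb (vicepb is just a hypothetical second partition here), salvage them one after the other and then recheck each:

# bos salvage squonk vicepa
# bos salvage squonk vicepb
# vos listvol squonk vicepa | grep 'Total volumes'
# vos listvol squonk vicepb | grep 'Total volumes'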

Evolution SAN boxes

Just a quick note that both Evolution SAN boxes have web interfaces, at http://ifevo1a.inf.ed.ac.uk/ and http://kbevo1a.inf.ed.ac.uk/ (there are also "b" versions). They also have serial consoles: for the Forum box, just run "console ifevo1" (no "a" or "b"). For the KB one it's a bit trickier: log in to kleiber and, as root, run minicom, then set the speed to 115200 with CTRL-a z p i (obviously).

-- CraigStrachan - 27 Mar 2009
