Fixing problems with AFS servers
At the time of writing (27/4/09) we have twice seen the new Linux-based file servers lose contact with the new EVO disk arrays. It's not yet clear why this happens, but the effects are obvious: some users lose the ability to access their AFS home directories. This page explains how to recognise that it is happening and how to restore normal service.
The symptoms
The first indication that this is happening will normally be a user complaining that they cannot access their AFS file space. When this happens, run
vos examine
on their homedir volume. For example, if the user has username cms:
# vos examine user.cms
user.cms 536871009 RW 6003395 K On-line
squonk.inf.ed.ac.uk /vicepa
RWrite 536871009 ROnly 536871010 Backup 536871011
MaxQuota 8000000 K
Creation Thu Jan 1 01:00:00 1970
Copy Wed Jan 14 16:49:06 2009
Backup Fri Mar 27 07:17:39 2009
Last Update Fri Mar 27 14:54:35 2009
22249 accesses in the past day (i.e., vnode references)
RWrite: 536871009 ROnly: 536871010 Backup: 536871011
number of sites -> 3
server squonk.inf.ed.ac.uk partition /vicepa RW Site
server wyvern.inf.ed.ac.uk partition /vicepa RO Site
server squonk.inf.ed.ac.uk partition /vicepa RO Site
If you get an error from the command and only the second part of the above output is produced, it's a fair bet that the server has lost contact with the partition containing the RW volume.
Log into the server which hosts the RW volume (in the above example that is squonk). You will probably find that the partition containing the affected volume (in the above example /vicepa) is now mounted read-only, and there will be error messages in /var/lcfg/log/syslog; a quick check for both is sketched below.
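For example (a rough sketch; the partition name is illustrative and the exact log messages will vary):
# mount | grep vicep
lists the vice partitions and their mount options; a partition that has dropped to read-only will show "ro" rather than "rw". Then something like
# tail -n 100 /var/lcfg/log/syslog
should show the I/O or filesystem errors that triggered the remount.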
As final confirmation, run the
vos listvol
command on the server and partition of the RW volume. Once again, in the above example we would type
# vos listvol squonk vicepa
Total number of volumes on server squonk partition /vicepa: 247
backup.root 536871305 RW 2009 K On-line
gdir.readonly 536871111 RO 40 K On-line
pkgsdir.readonly 536873334 RO 189 K On-line
project.readonly 536870928 RO 2 K On-line
restore.volume 536874829 RW 1686274 K On-line
restore.volume.readonly 536878292 RO 1686274 K On-line
root.afs.readonly 536870913 RO 4 K On-line
root.cell.readonly 536870916 RO 8 K On-line
.
.
user.v1swils2 536873534 RW 4207475 K On-line
user.v1swils2.backup 536873536 BK 4207475 K On-line
user.v1swils2.readonly 536873535 RO 4207475 K On-line
user.v1yabush 536872207 RW 3480377 K On-line
user.v1yabush.backup 536872209 BK 3480377 K On-line
user.v1yabush.readonly 536872208 RO 3480377 K On-line
Total volumes onLine 247 ; Total volumes offLine 0 ; Total busy 0
Examine the output from this command, especially the
Total volumes offLine
field. If there are any volumes offline, then we need to get them online again. Run the
vos listvol
command on the other partitions on this server and see if any of them have offline volumes; take a note of these. Also check that the other AFS servers at that site are not affected (a quick way to sweep them all is sketched below). At present we have four Linux AFS file servers: squonk and crocotta in the Forum server area, and unicorn and lammasu in JCMB-2501. We still have two Solaris file servers, sphinx and wyvern, in 2501, but they are not attached to the EVO.
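A rough sketch of such a sweep (run with suitable AFS admin tokens; it assumes the usual /vicepXX partition naming and the summary line that vos listvol prints):

for server in squonk crocotta unicorn lammasu; do
  for part in $(vos listpart $server | grep -o '/vicep[a-z]*'); do
    echo "== $server $part =="
    vos listvol $server $part | grep 'Total volumes'
  done
done

Any partition whose summary shows a non-zero "Total volumes offLine" count needs attention.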
The first thing we need to do is to get the AFS vice partitions remounted read-write. The simplest and safest way to do this is to reboot the server. It's possible that the server will fsck the partitions on the way back up, so settle down for a bit of a wait as these partitions are 250GB each.
When the server has rebooted, check that the AFS partitions are mounted read-write (the same mount check as above will do). Next we need to bring the affected volumes back online. It's possible that this is already happening; run the
bos status
command on the server:
# bos status squonk
Instance fs, currently running normally.
Auxiliary status is: file server running.
This is what you would normally expect to see. If instead the Auxiliary status is
salvaging file system.
then the AFS server has detected errors on the file system and is correcting them for you. All you need to do is wait (on average the Linux file servers take just under an hour to salvage all their partitions). You can follow progress by monitoring /usr/afs/logs/SalvageLog* on the server.
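For example (just a sketch; pick whichever log file is most recent):
# ls -lrt /usr/afs/logs/SalvageLog*
# tail -f /usr/afs/logs/SalvageLog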
If the salvager isn't already running, we need to salvage the partitions we noted earlier to bring the volumes online. Use the
bos salvage
command:
# bos salvage squonk vicepa
This will shut down the file server and start a salvage of all the volumes on that partition. Once again, you can track progress by looking at
/usr/afs/logs/SalvageLog*
on the affected server. When the salvage completes, use the
vos listvol
command again to check that the volumes are now online. Remember to salvage every partition which had offline volumes; a sketch for doing several in turn is below.
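If several partitions on a server were affected, something along these lines works (a sketch only; the server and partition names are examples, and it relies on bos salvage waiting for each salvage to finish before returning, which is worth confirming on the first one):

for part in vicepa vicepb; do
  bos salvage squonk $part
done

Follow it with the vos listvol check above on each partition.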
Evolution SAN boxes
Just a quick note that both Evolution SAN boxes have web interfaces:
http://ifevo1a.inf.ed.ac.uk/ and
http://kbevo1a.inf.ed.ac.uk (there are also the "b" versions). They also have serial consoles; for the Forum one, just run "console ifevo1" (no "a" or "b"). For the KB one it's a bit trickier. Log in to
kleiber
and as root run
minicom
and set the speed to 115200 with
CTRL-a z p i
(obviously).
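If you prefer to skip the menus, minicom can also be started with the speed preset (an untested sketch; the serial device name on kleiber is a guess and may well differ):
# minicom -b 115200 -D /dev/ttyS0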
--
CraigStrachan - 27 Mar 2009