TiBS top questions

How do I know if the backups are running as expected?

Everyone on the TiBS pandemic team will shortly receive all the necessary TiBS e-mails. The routine ones are a "TiBS afsgen report" (usually sent around 11pm) and a "TiBS-Backup-Report", which can arrive any time from about 7am, depending on the backup load.

The "TiBS afsgen report" should not show any errors. The "TiBS-Backup-Report" is a little more complex, but if you have received it, you at least know that TiBS has backed up all that it could and has completed. The report will list volumes that it has not been able to back up, but we'll deal with those later on.

You will also receive e-mails telling you about full tapes. In addition, we have written a script that sends an e-mail when the number of spare tapes in the T680 tape library falls below 5 and/or a tape has been marked with "AUTO-BAD-LABEL". The script is called count_blanks.sh and lives in /usr/tibs/sbin.

One e-mail to watch out for has the subject "cachemgr max balance error". It alerts us when one of the /tibscache partitions has run out of space or is about to. How to deal with this is described at ServicesUnitFullTibsCache.

There are other checks that you can do by logging on to the backup server, currently pergamon, which has the alias backups. Probably the most useful command is tibstat. TiBS commands must be run as root, and before you run any of them you need to run:

. /etc/tibs.conf

which sets up necessary paths and variables.

Running

tibstat
while a backup is not running should show something like:

[alexandria]root: tibstat
root     26336     1  0 Sep01 ?        00:00:00 tibswd
root     26286     1  0 Sep01 ?        00:00:00 /usr/tibs/rlm/rlm -c /usr/tibs/rlm/tibs.lic -nows -dlog /usr/tibs/reports/rlm.txt
root     26288 26286  0 Sep01 ?        00:00:01 /usr/tibs/rlm/tibsrlm -s 4 -p "/usr/tibs/rlm"
root     26384     1  0 Sep01 ?        00:00:30 tapemgr
root     26339     1  0 Sep01 ?        00:00:53 cachemgr

Running

tibstat
while a backup is running should show something like:

[alexandria]root: tibstat
WRITING: afs group.blah
PENDING:  afs group.blah on afs-incr.tmp.2
PENDING:  afs group.blah2 on afs-incr.tmp.2
WRITING: afs group.blah3
PENDING:  afs group.blah3 on afs-incr.tmp.3
PENDING:  afs group.blah4 on afs-incr
root     12683 29416  0 06:10 ?        00:00:00 afsbackup -l afs-incr -w
root     20783     1  0 13:00 ?        00:00:00 tibstaped -t /dev/nst1
root     20808 20783  0 13:01 ?        00:00:00 sh -c afsvcheck -f /tibscache1/cache/afs/inf.ed.ac.uk/group/blah/full0.dat > /dev/null 2>&1
root     20859     1  0 13:09 ?        00:00:17 tibstaped -t /dev/nst2
root     22082 20859  0 15:14 ?        00:00:00 sh -c afsvcheck -f /tibscache1/cache/afs/inf.ed.ac.uk/group/blah3/full0.dat > /dev/null 2>&1
root     26336     1  0 Sep01 ?        00:00:00 tibswd
root     26286     1  0 Sep01 ?        00:00:00 /usr/tibs/rlm/rlm -c /usr/tibs/rlm/tibs.lic -nows -dlog /usr/tibs/reports/rlm.txt
root     26288 26286  0 Sep01 ?        00:00:00 /usr/tibs/rlm/tibsrlm -s 4 -p "/usr/tibs/rlm"
root     26384     1  0 Sep01 ?        00:00:24 tapemgr
root     26339     1  0 Sep01 ?        00:00:42 cachemgr
root     29415 29412  0 Sep07 ?        00:00:00 -c infauto >> /usr/tibs/reports/infauto.txt 2>&1
root     29416 29415  0 Sep07 ?        00:00:00 infauto
root     22083 22082 17 15:14 ?        00:01:28 afsvcheck -f /tibscache1/cache/afs/inf.ed.ac.uk/group/blah3/full0.dat
root     20809 20808  4 13:01 ?        00:06:38 afsvcheck -f /tibscache1/cache/afs/inf.ed.ac.uk/group/blah/full0.dat

For TiBS to run correctly, three processes must be running: tibswd, tapemgr and cachemgr. Details of the tibsconf component can be found at https://wiki.inf.ed.ac.uk/DICE/TiBSandlcfgtibs
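As a rough sketch, the check for the three required processes could be scripted by piping the tibstat output through a small filter. The function name tibs_missing_daemons is made up for this example and is not part of TiBS. It assumes that tibstat prints ps-style process lines with the command in the eighth column, as in the listings above.

```shell
# Hypothetical helper (not part of TiBS): reads tibstat output on stdin and
# prints the name of each required daemon that has no process line.
# Assumes ps-style lines: UID PID PPID C STIME TTY TIME CMD...
tibs_missing_daemons() {
    awk '
        # Record the command column (field 8) with any directory stripped.
        NF >= 8 { cmd = $8; sub(/^.*\//, "", cmd); seen[cmd] = 1 }
        END {
            n = split("tibswd tapemgr cachemgr", need, " ")
            for (i = 1; i <= n; i++)
                if (!(need[i] in seen)) print need[i]
        }
    '
}
```

Run as root after sourcing /etc/tibs.conf, "tibstat | tibs_missing_daemons" should print nothing when all three processes are present.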

The automatic backups are run from cron:

00 22  * * * /usr/tibs/bin/infauto >> /usr/tibs/reports/infauto.txt 2>&1

so, while the backup is running, tibstat should show this process. If it shows more than one infauto, there might be a problem. If a backup takes more than 24 hours, which sometimes happens, the second infauto simply waits until the first completes; if the first never completes, the second will wait indefinitely. There is now a semaphore system in operation which should prevent multiple instances of infauto from running. You may occasionally receive an e-mail saying that the execution of infauto has been aborted because a previous instance is still running.

Also, if you see the word "CORRUPT" at the top of the output from tibstat, then there is definitely a problem - more about that later!
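Both warning signs can be checked in one pass over the tibstat output. The sketch below is an illustration only; tibs_backup_health is a made-up name. It assumes the ps-style layout shown in the listings above, and it matches "CORRUPT" anywhere in the output rather than only at the top.

```shell
# Hypothetical helper (not part of TiBS): reads tibstat output on stdin and
# prints "infauto=N corrupt=yes|no".
tibs_backup_health() {
    awk '
        # Approximation: flag CORRUPT wherever it appears in the output.
        /CORRUPT/ { corrupt = "yes" }
        # Count only lines whose command (field 8) is infauto itself, so the
        # cron wrapper line "-c infauto >> ..." is not counted as a second run.
        NF >= 8 { cmd = $8; sub(/^.*\//, "", cmd); if (cmd == "infauto") n++ }
        END { printf "infauto=%d corrupt=%s\n", n, (corrupt == "" ? "no" : corrupt) }
    '
}
```

With a single backup in progress, "tibstat | tibs_backup_health" would be expected to print "infauto=1 corrupt=no"; anything else is worth a closer look.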

What do I do if TiBS is stuck?

We have seen the backup server kernel panic on several occasions. A remote power cycle usually fixes this; follow the instructions at

https://wiki.inf.ed.ac.uk/view/DICE/IPMISOLConsoleConfiguration#7_Remote_power_control

However, TiBS does not currently start automatically after a reboot. To start TiBS, as root, first check that /dev/atli0 exists. It should be created automatically on boot and should be a symbolic link to /dev/changer. /dev/changer, in turn, should also be a symbolic link. To check what /dev/changer should be linked to, see:

https://wiki.inf.ed.ac.uk/DICE/ServicesUnitTibs

under the heading General Stuff.
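The link checks described above can be sketched as a small shell function. The name check_link is hypothetical. The function compares only the immediate target of the link, as reported by readlink, so a relative link would need a relative expected value.

```shell
# Hypothetical helper: succeeds if $1 is a symbolic link whose immediate
# target is exactly $2; otherwise reports what it found and fails.
check_link() {
    if [ ! -L "$1" ]; then
        echo "$1 is missing or is not a symbolic link" >&2
        return 1
    fi
    target=$(readlink "$1")
    if [ "$target" != "$2" ]; then
        echo "$1 points to $target, expected $2" >&2
        return 1
    fi
}
```

For example, "check_link /dev/atli0 /dev/changer" checks the first link. The expected target of /dev/changer itself has to be taken from the wiki page mentioned above.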

You then run:

. /etc/tibs.conf
runtibs

You should see a lot of messages scroll past, followed by a pause while it checks the contents of the tape library. This does take some time, so don't panic! If you see an error at this point, it's probably because /dev/changer or /dev/atli0 is not pointing to the correct place. If this is the case, you need to stop TiBS by running 'stoptibs' and then restart it. You should get 3 confirmatory e-mails showing that tibswd, tapemgr and cachemgr have started successfully. NOTE: you still get these e-mails even if /dev/changer is not pointing to the correct place, so do check the links carefully, as the backups will not work otherwise!

Once you have started TiBS, a quick tibsmnt -q will confirm that the tape library can be seen.

If you see 2 (or more!) infauto processes running in the morning, then that suggests that there is a problem. It's quite likely caused by an AFS server hang so first of all talk to the "AFS team".

If in doubt, we do have a support contract with Teradactyl who have been extremely helpful in the past. You can contact them by e-mailing support@teradactylREMOVE_THIS.com but bear in mind that they are in the States so don't expect a response before 3pm!

How do I change tapes?

The T680 tape library currently has 350 tapes in it and can hold up to 512. Adding or removing tapes is done via the front panel of the library, though it's unlikely you'll ever need to do this. More details about the T680 can be found at https://wiki.inf.ed.ac.uk/DICE/ServicesUnitSpectraT680 along with contact details should an engineer be needed.

How do I add/remove partitions (including exceptions) to/from the backup?

It's unlikely that you'll need to add NFS partitions, and new AFS volumes should be picked up automatically. We may, however, want to remove NFS partitions if they are no longer in use and the associated server has been switched off. To do this:

hostdel -c afs_class -g solaris_group -n pegasus.inf.ed.ac.uk -p /disk/ptn051 -y

To find out which partitions are being backed up:

hostaudit -n pegasus.inf.ed.ac.uk  -l

If you don't want a particular AFS volume to be backed up (e.g. some of the group spaces), it needs a line in /usr/tibs/state/afs/inf.ed.ac.uk/omit.afs_group. That file is maintained by the tibsconf component, so make the change through the component rather than editing the file by hand. An example line:

|group.ANC|

If you don't want a particular NFS partition backed up, but don't want to remove it completely using hostdel, you need to add skip before solaris on the appropriate line in the file /usr/tibs/state/classes/afs_class/solaris_group.txt

sphinx.inf.ed.ac.uk|/disk/ptn068|/tibscache2/cache|skip solaris
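To see at a glance which partitions are currently marked skip, a one-line awk filter is enough. This is a sketch only: list_skipped is a made-up name, and the assumption that the fourth "|"-separated field holds the skip marker is inferred from the single example line above.

```shell
# Hypothetical helper: reads a group file on stdin and prints "host partition"
# for each entry whose fourth "|"-separated field starts with the word skip.
# The field layout is an assumption based on the one example line.
list_skipped() {
    awk -F'|' '$4 ~ /^skip( |$)/ { print $1, $2 }'
}
```

For example: list_skipped < /usr/tibs/state/classes/afs_class/solaris_group.txt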

How do I restore files?

Full details can be found at https://wiki.inf.ed.ac.uk/DICE/ServicesUnitTibs under the heading Restoring Files. In short, you use the command filesearch to identify which tape is needed, then afsrest to restore an AFS volume or tibsrest to restore NFS files. Examples of searching for AFS volumes and NFS files are:

filesearch -n afs -p user.bill -s group.getent
filesearch -n pegasus.inf.ed.ac.uk -p /disk/ptn041 -s myfile

To restore an AFS volume using a particular timestamp:

afsrest -n user.bill -q -f -t "2008/12/04/10/09/35"

Note the -q in the above example just does a query and tells you what it would do. Remove the -q to actually restore the volume.
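The timestamp in the example appears to be year/month/day/hour/minute/second separated by slashes. Assuming that reading is right, a small wrapper around GNU date can build the string from a more familiar date format. The name tibs_timestamp is made up for this sketch.

```shell
# Hypothetical helper: converts a date string that GNU date understands
# into the slash-separated form used in the afsrest example.
# Assumption: the fields are year/month/day/hour/minute/second.
tibs_timestamp() {
    date -d "$1" '+%Y/%m/%d/%H/%M/%S'
}
```

For example, afsrest -n user.bill -q -f -t "$(tibs_timestamp '2008-12-04 10:09:35')" would pass the same timestamp as the example above.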

To restore an NFS file:

tibsrest -n pegasus.inf.ed.ac.uk -p /disk/ptn041  -s bill/myfile  -r pegasus.inf.ed.ac.uk -a /tmp

The above example would restore bill/myfile to /tmp on pegasus from the most recent backup (since no timestamp is given).

What configuration files does TiBS use?

| File | Purpose | Editable |
| /usr/tibs/state/tibs.conf | main config file | component |
| /usr/tibs/state/caches.txt | gives location of cache directories | component |
| /usr/tibs/state/drives.txt | defines available tape drives | NO |
| /usr/tibs/state/classes.txt | lists the TiBS classes | component |
| /usr/tibs/state/clients.txt | lists non-AFS TiBS clients | shouldn't need to |
| /usr/tibs/state/subnets.txt | info about the local network | component |
| /usr/tibs/state/barcodes.txt | matches barcode labels with tape label | only to sort AUTO-BAD-LABEL |
| /usr/tibs/state/classes/afs_class/groups.txt | defines the groups for that class | component |
| /usr/tibs/state/classes/afs_class/linux_group.txt | defines membership of linux_group | NO |
| /usr/tibs/state/classes/afs_class/solaris_group.txt | defines membership of solaris_group | if you want to add skip to a partition |
| /usr/tibs/state/classes/afs_class/vicep_group.txt | defines membership of vicep_group | NO |
| /usr/tibs/state/classes/afs_class/afs_group.txt | defines membership of afs_group | NO |
| /usr/tibs/state/afs/inf.ed.ac.uk/omit.afs_group | defines exclusion rules for AFS | component |
| /usr/tibs/state/rules/solaris.txt | defines exclusions for solaris | edit with caution |

Most of these are managed using the tibsconf component. See https://wiki.inf.ed.ac.uk/DICE/TiBSandlcfgtibs for more details.

-- AlisonDownie - 03 Sep 2009

-- CraigStrachan - 27 Feb 2019

Topic revision: r18 - 27 Feb 2019 - 08:37:49 - CraigStrachan
 