LCFG Disaster Recovery Procedures

Don't Panic!

Force an LCFG client to swap servers

If you have an LCFG slave server running, but it is not one of those currently listed in the client.url resource, you can force the change manually.

Firstly rfe/edit the LCFG source profile and mutate the resource:

!client.url mSET(

Wait for the modified profile(s) to successfully compile. Then, on the machine you want to switch LCFG servers, do the following:

% om client stop
[OK] client: stop
% nsu
% /usr/sbin/rdxprof -u
% qxprof client.url
% om client start
[OK] client: start
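As a quick check that the change has taken effect, the qxprof output can be inspected. A minimal sketch, assuming qxprof prints the resource in the key=value form `client.url=...` (the parse_url helper is a hypothetical convenience, not part of LCFG):

```shell
# Extract the value from a "client.url=..." line, e.g. from qxprof
# output. The key=value output format is an assumption; adjust the
# pattern if your qxprof output differs.
parse_url() {
  sed -n 's/^client\.url=//p'
}
```

Usage would be `qxprof client.url | parse_url`, which should now print the URL of the new slave server.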

In either situation (whether using the LCFG slave on the DR server or the test server), the recommended approach is to use this method to make the slave server self-managing as an initial step. It is also a good idea to change the client URL for the server hosting the master data (this will be the same machine as the slave if using the DR server).

Scenario 1: LCFG Master Failure

In the situation where the LCFG master server has been lost but everything else is still functional (e.g. a hardware failure has occurred), the LCFG DR server can be used to keep the service running. It is important to remember that the DR server is not intended as a permanent replacement for the master server, but rather as a temporary facility to keep the LCFG service running. That means it does not provide a complete replica of all services.

The DR server uses the LCFG rmirror component to take a backup of the most important data (e.g. source profiles, headers, package lists) from the master every 15 minutes. To avoid the potential for any corruption of the mirrored data the first step is to stop the rmirror component on the DR server:

% om rmirror stop

This will result in regular reports of failed cron jobs but that is preferable to losing data. It's probably a good idea to warn the other COs that some changes may have been lost.

To make this change permanent across reboots, remove the rmirror component from the relevant resource:

! mREMOVE(rmirror)

The next step is to alter the DNS so that the aliases lcfg-master and lcfgsvn both point at the DR server (currently this is sauce). To speed up propagation of the changes on the LCFG slaves, give the LCFG dns component a kick:

% om dns update

and then check (e.g. with the host command) on the LCFG slaves that the aliases have updated. The slaves should now be working again.
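The alias check on each slave can be scripted. A small sketch; the check_alias helper and the injectable RESOLVER command are hypothetical conveniences for illustration, not LCFG tools:

```shell
# Check that an alias resolves to the expected address.
# RESOLVER defaults to "getent hosts" but can be overridden (e.g. to
# use "host" instead, or a stub for testing).
RESOLVER="${RESOLVER:-getent hosts}"

check_alias() {
  local alias="$1" expected="$2"
  local actual
  actual=$($RESOLVER "$alias" | awk '{print $1; exit}')
  [ "$actual" = "$expected" ]
}
```

For example, `check_alias lcfg-master <DR server IP> || echo "DNS not yet propagated"` run on each slave shows which machines still have the stale record.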

Note that the various LCFG rfe maps, e.g. for source profiles, also refer to the lcfg-master alias, so until the DNS updates have reached all machines the use of rfe might fail.

The LCFG master also hosts the LCFG svn web service and the DICE orders host. The ordershost data would have to be restored from backup if the disk has failed. There is a backup of the LCFG svn service on the DR server. Because the apache configuration is managed via LCFG it is not as simple as just altering the DNS entry. Instead there is an alternative service available from which the LCFG "core" tree can be checked out as usual; changes made there will be fed into any slaves using the DR server as a master. If this service is only going to be used temporarily then any changes made to this repository will need to be copied back to the normal repository afterwards, so be careful...

A checkout can be done like this:

% svn co lcfgbackup

Scenario 2: LCFG Slave Failures

We have multiple LCFG slaves in different physical locations, in the hope that we will never lose them all at the same time. In case we do, here are two possible solutions: the first is simpler but relies on the normal LCFG master still being functional; the second is necessary if we are using the DR server as a temporary master and slave server.

With either solution the easiest option is to use the profile of an existing LCFG slave server which has failed (currently these are mousa and trondra). If you try to create a new profile you will have to modify some LCFG resources to open up rsync access to the master server (done using live/lcfg-slave-servers-list.h).

Note that in an emergency we do not need to immediately replace the LCFG slave servers, the server running on the DR machine is sufficient. In the short term it is possible to just change the various CNAMEs (lcfg1, lcfg2, lcfg3, lcfg4, lcfg5) to point at the DR server.
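The CNAME change amounts to pointing each alias at the DR server. A sketch of the records needed, in zone-file syntax purely for illustration (in practice the DNS is managed through the usual LCFG/rfe mechanisms; the hostname sauce comes from the text above):

```shell
# Emit CNAME records pointing each slave alias at the DR server.
# Zone-file syntax is used for illustration only; adapt to however
# your DNS data is actually maintained.
for n in 1 2 3 4 5; do
  printf 'lcfg%s IN CNAME sauce\n' "$n"
done
```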

Solution 1: Use the test server

Firstly, ensure that the test server is compiling your profile; this is controlled by the lcfgtesthosts rfe map, which limits the set of source profiles copied using rsync.

The next step is to alter the client URL for that profile by mutating the client.url resource in the relevant LCFG source profile:

!client.url    mSET(

Doing the install requires that the DHCP entry for the machine is correct. The simplest option might be to create a virtual machine which uses the same MAC address as the server which has been lost. If that is not an option then the dhclient.mac resource will have to be modified in the source profile. As the LCFG service is not functional, the relevant /etc/dhcpd.conf file will also have to be modified manually on the DHCP server (this is on the netinf server for each site, see the Inf Unit Kit List). You MUST talk to the Infrastructure Unit before doing this. The following steps will be necessary:

% om dhcpd stop
% nsu
% edit /etc/dhcpd.conf
% /etc/init.d/dhcpd restart

If you use the component to restart the server, your manual changes will be overwritten.
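If the dhclient.mac route is taken, the manual edit of /etc/dhcpd.conf amounts to replacing the hardware ethernet line in the machine's host stanza. A minimal sketch, assuming the standard ISC dhcpd host-stanza layout (the update_mac helper is hypothetical, and a .bak copy of the file is kept):

```shell
# Replace the MAC address in a host stanza of an ISC dhcpd.conf.
# Assumes the usual "host NAME { ... }" layout; sed keeps a .bak copy
# of the original file. update_mac is an illustrative helper only.
update_mac() {
  local file="$1" host="$2" newmac="$3"
  sed -i.bak "/host $host/,/}/ s/hardware ethernet .*/hardware ethernet $newmac;/" "$file"
}
```

Remember that any such manual edit only survives until the dhcpd component next regenerates the file.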

Once that profile has compiled you can start the install process. If it is possible to edit the DNS then alter the verbatim records for lcfg and lcfghost so that they refer to the IP address of the test server.

Alternatively, at the boot prompt you will need to append the lcfg.url option to override the standard LCFG server url which is provided by DHCP.

For the CD install this would be like:

sr0 lcfg.url=

(changing the location of the CD drive to suit your hardware, of course). With PXE you will need to edit the boot entry.

The install should now progress as normal and you will soon have a working LCFG slave server.

Solution 2: Use the DR server

If the LCFG master has also been lost then firstly the DR server needs to be switched to work as the master following the instructions in "Scenario 1".

Once that is done the LCFG slave running on the DR server can be used to install more machines. This is done in the same way as for "Solution 1" (above), except that you should refer to the slave server instead of lcfgtest.


The MPU DR server, sauce, carries a full set of packages for the supported server platforms (as of Jan 2011: SL5, SL5_64, SL6 and SL6_64). This copy is updated nightly using rmirror.

The DR package repository is available via{sites|rpms}/... The following can be used to make a machine use the DR package repository instead of the normal AFS-based repository:

       !updaterpms.rpmpath     mSUBST(cache.pkgs,dr.pkgs)
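To illustrate the effect of the mSUBST mutation, here is a sketch using sed on a made-up rpmpath value (the substrings cache.pkgs and dr.pkgs come from the mutation above; the rest of the example value is invented for illustration):

```shell
# Simulate mSUBST(cache.pkgs,dr.pkgs): every occurrence of cache.pkgs
# in the resource value is replaced with dr.pkgs. The example value
# below is hypothetical, not the real rpmpath.
rpmpath="http://cache.pkgs/sl6/base http://cache.pkgs/sl6/updates"
echo "$rpmpath" | sed 's/cache\.pkgs/dr.pkgs/g'
```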

Also on the DR server are backups for the LCFG master and slave. These are configured through the dice/options/lcfg-dr-server.h header. That header combines the dice/options/lcfg-master-server.h and dice/options/lcfg-slave-server.h headers. It acts in pretty much the same way as the standard master and slave server except that the slave server uses rsync to fetch data files from itself. There is a regular rmirror process which fetches from the normal master server.

In the case of losing the LCFG master and slave service you should stop the rmirror process on sauce:

        om rmirror stop

and then, to ensure consistency, stop the slave server, delete the caches (as root) and then restart:

        om server stop
        rm -f /var/lcfg/conf/server/cache/*.db
        om server start

-- AlastairScobie - 11 Jan 2011

Topic revision: r8 - 13 Sep 2011 - 13:02:11 - StephenQuinney