LCFG Disaster Recovery Procedures

Don't Panic!

Force an LCFG client to swap servers

If an LCFG slave server is running but is not one of those currently listed in the client.url resource, you can force the change manually.

First, edit the LCFG source profile (e.g. with rfe) and mutate the resource:

!client.url mSET(http://lcfgtest.inf.ed.ac.uk/profiles)

Wait for the modified profile(s) to compile successfully. Then, on the machine whose LCFG server you want to switch, do the following:

% om client stop
[OK] client: stop
% nsu
% /usr/sbin/rdxprof -u http://lcfgtest.inf.ed.ac.uk/profiles
% qxprof client.url
url=http://lcfgtest.inf.ed.ac.uk/profiles
% om client start
[OK] client: start
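
The check in the middle of that sequence can be scripted. The following is a minimal sketch, assuming qxprof prints the resource as a `url=` line exactly as shown above; the function name client_url_matches is hypothetical and not part of LCFG.

```shell
#!/bin/sh
# Hypothetical helper (not an LCFG tool): read qxprof output on stdin and
# succeed only if its first "url=" line equals the expected profile URL.
client_url_matches() {
    expected="$1"
    # Strip the "url=" key from matching lines and keep the first value.
    actual=$(sed -n 's/^url=//p' | head -n 1)
    [ "$actual" = "$expected" ]
}

# Intended use on the client, as root (not run here):
#   qxprof client.url | client_url_matches http://lcfgtest.inf.ed.ac.uk/profiles
```

A non-zero exit status means the profile fetched by rdxprof does not yet carry the new URL, in which case it is worth checking that the profile has compiled before starting the client again.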

In either situation (using the LCFG slave on the DR server or on the test server), the recommended approach is to use this method as an initial step to make the slave server self-managing. It is also a good idea to change the client URL for the server hosting the master data (this will be the same machine as the slave if using the DR server).

Scenario 1: LCFG Master Failure

In the situation where the LCFG master server has been lost but everything else is still functional (e.g. a hardware failure has occurred) the LCFG DR server (lcfg-dr.inf.ed.ac.uk) can be used to keep the service running. It is important to remember that the DR server is not intended to be used as a permanent replacement for the master server but rather as a temporary facility to keep the LCFG service running. That means that it does not provide a complete replica of all services.

The DR server uses the LCFG rmirror component to take a backup of the most important data (e.g. source profiles, headers, package lists) from the master every 15 minutes. To avoid the potential for any corruption of the mirrored data the first step is to stop the rmirror component on the DR server:

% om rmirror stop

This will result in regular reports of failed cron jobs but that is preferable to losing data. It's probably a good idea to warn the other COs that some changes may have been lost.

To make this permanent across reboots take the rmirror component out of the systemd.wanted_units_lcfgmultiuser resource:

!systemd.wanted_units_lcfgmultiuser mREMOVE(lcfg-rmirror.service)

The next step is to alter the DNS so that the aliases lcfg-master and lcfgsvn are both associated with the DR server (currently this is lcfg-dr). To speed up the propagation of the changes on the LCFG slaves give the LCFG dns component a kick:

% om dns update

and then check (e.g. with the host command) on the LCFG slaves that the aliases have updated. The slaves should now be working again.
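
The host check can be automated on each slave. This is a sketch only, assuming the aliases are implemented as CNAME records and that host prints them in its usual "is an alias for" form; the function name alias_points_to is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `host` output on stdin and succeed only if
# some line reports an alias for the expected target name.
alias_points_to() {
    target="$1"
    # host prints CNAMEs as: "<name> is an alias for <target>."
    awk -v t="${target}." '
        $2 == "is" && $3 == "an" && $4 == "alias" && $5 == "for" && $6 == t { found = 1 }
        END { exit !found }
    '
}

# Intended use on a slave (not run here):
#   for a in lcfg-master lcfgsvn; do
#       host "$a" | alias_points_to lcfg-dr.inf.ed.ac.uk || echo "$a not updated yet"
#   done
```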

Note that the various lcfg rfe maps, e.g. for source profiles, also refer to the lcfg-master alias so until the DNS updates have reached all machines the use of rfe might fail.

The LCFG master also hosts the LCFG svn web service (svn.lcfg.org). There is a backup of the LCFG svn service on the DR server. Because the apache configuration is managed via LCFG, it is not as simple as just altering the DNS entry. Instead there is an alternative service available at backup.lcfg.org. The LCFG "core" tree can be checked out as usual, and changes made will be fed into any slaves using the DR server as a master. If this service is only going to be used temporarily, any changes made to this repository will later need to be copied to the repository on svn.lcfg.org, so be careful.

A checkout can be done like this:

% svn co https://backup.lcfg.org/svn/lcfg/core lcfgbackup

Scenario 2: LCFG Slave Failures

We have multiple LCFG slaves in different physical locations in the hope that we will never lose them all at the same time. In case we do, here are two possible solutions. The first is simpler but relies on the normal LCFG master still being functional. The second is necessary if we are using the DR server as a temporary master and slave server.

With either solution the easiest option is to use a profile for an existing LCFG slave server (currently these are lcfg1 and lcfg3) that has failed. If you try to create a new profile you will have to modify some LCFG resources to open up access to rsync to the master server (done using live/lcfg-slave-servers-list.h).

Note that in an emergency we do not need to replace the LCFG slave servers immediately; the server running on the DR machine is sufficient. In the short term it is possible to just change the various CNAMEs (lcfg1, lcfg2, lcfg3, lcfg4, lcfg5) to point at the DR server.

Solution 1: Use the test server

First, ensure that the test server is compiling your profile. This is controlled by the lcfgtesthosts rfe map, which limits the set of source profiles that are copied using rsync.

The next step is to alter the client URL for that profile by mutating the client.url resource in the relevant LCFG source profile:

!client.url    mSET(http://lcfgtest.inf.ed.ac.uk/profiles)

The install requires that the DHCP entry for the machine is correct. The simplest option might be to create a virtual machine which uses the same MAC address as the server which has been lost. If that is not an option then the dhclient.mac resource will have to be modified in the source profile. As the LCFG service is not functional, the relevant /etc/dhcpd.conf file will also have to be modified manually on the DHCP server (this is on the netinf server for each site, see the Inf Unit Kit List). You MUST talk to the Infrastructure Unit before doing this. The following steps will be necessary:

% om dhcpd stop
% nsu
% edit /etc/dhcpd.conf
% /etc/init.d/dhcpd restart

If you use the LCFG component to restart the server, your manual changes will be overwritten.
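
Since the file is being edited by hand, it may be worth keeping a copy and checking the syntax before restarting. The following is a sketch, not part of the documented procedure: backup_config is a hypothetical helper, and the syntax check assumes the server is ISC dhcpd, whose -t option tests a configuration file without starting the daemon.

```shell
#!/bin/sh
# Hypothetical helper: keep a timestamped copy of a config file before
# hand-editing it, and print the path of the copy on success.
backup_config() {
    conf="$1"
    copy="${conf}.$(date +%Y%m%d%H%M%S).bak"
    # -p preserves mode and timestamps so the copy can be restored as-is.
    cp -p "$conf" "$copy" && printf '%s\n' "$copy"
}

# Intended use as root on the DHCP server (not run here):
#   backup_config /etc/dhcpd.conf
#   ... edit /etc/dhcpd.conf ...
#   dhcpd -t -cf /etc/dhcpd.conf    # syntax check only, assuming ISC dhcpd
```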

Once that profile has compiled you can start the install process. If it is possible to edit the DNS then alter the verbatim records for lcfg and lcfghost so that they refer to the IP address of the test server.

Alternatively, at the boot prompt you will need to append the lcfg.url option to override the standard LCFG server url which is provided by DHCP.

For a CD install this would look like:

sr0 lcfg.url=http://lcfgtest.inf.ed.ac.uk/profiles

(changing the location of the CD drive to suit your hardware, of course). With PXE you will need to edit the boot entry.

The install should now progress as normal and you will soon have a working LCFG slave server.

Solution 2: Use the DR server

If the LCFG master has also been lost then firstly the DR server needs to be switched to work as the master following the instructions in "Scenario 1".

Once that is done, the LCFG slave running on the DR server can be used to install more machines. This is done in the same way as for "Solution 1" (above), except that you should refer to the lcfg-dr.inf.ed.ac.uk slave server instead of lcfgtest.

Packages

The MPU DR server, salamanca, carries a full set of packages for the supported server platforms. This copy is updated nightly from rsync.pkgs.inf.ed.ac.uk using rmirror.

The DR package repository is available via http://dr.pkgs.inf.ed.ac.uk/{sites|rpms}/... The following can be used to make a machine use the DR package repository instead of the normal AFS-based repository:

       !updaterpms.rpmpath     mSUBST(cache.pkgs,dr.pkgs)
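
Once the profile has recompiled and reached the client, the effect of the substitution can be checked there. This is a sketch under assumptions: it supposes that qxprof updaterpms.rpmpath prints a `rpmpath=` line in the same key=value form shown earlier for client.url, and uses_dr_repo is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read qxprof output on stdin and succeed only if
# the rpmpath value mentions dr.pkgs and no longer mentions cache.pkgs.
uses_dr_repo() {
    value=$(sed -n 's/^rpmpath=//p')
    case "$value" in
        *cache.pkgs*) return 1 ;;   # substitution has not (fully) applied
        *dr.pkgs*)    return 0 ;;   # DR repository is in use
        *)            return 1 ;;   # no recognisable rpmpath value
    esac
}

# Intended use on the client (not run here):
#   qxprof updaterpms.rpmpath | uses_dr_repo || echo "still using the normal repository"
```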

Also on the DR server are backups for the LCFG master and slave. These are configured through the dice/options/lcfg-dr-server.h header. That header combines the dice/options/lcfg-master-server.h and dice/options/lcfg-slave-server.h headers. It acts in pretty much the same way as the standard master and slave server except that the slave server uses rsync to fetch data files from itself. There is a regular rmirror process which fetches from the normal master server.

LCFG install ISO images

LCFG install ISO images are stored on the MPU DR server (dr.pkgs.inf.ed.ac.uk) under the directory /disk/dr/cdroms.

-- StephenQuinney - 24 Jan 2019

Topic revision: r12 - 23 May 2019 - 10:51:51 - StephenQuinney
 