www.inf.ed.ac.uk Off site Disaster Recovery Plan

July 2019 - This page needs a review. In summary: www.inf is currently skelp, www-dr is currently hobgoblin, and Plone is a dying part of www.inf, with most or all of its content replaced by redirects to web.inf.ed.ac.uk.

Clarifications

www.inf.ed.ac.uk refers to the content hosted (currently) on the server skelp.inf.ed.ac.uk. Since the introduction of Polopoly, some content originally hosted on http://www.inf.ed.ac.uk now redirects you to the equivalent content hosted at http://www.ed.ac.uk/schools-departments/informatics/. This document does not cover the content hosted on www.ed.ac.uk. It's assumed IS have their own DR plan for www.ed.ac.uk.

skelp.inf.ed.ac.uk also hosts some institute sites, eg www.dice.inf.ed.ac.uk, but we are primarily concerned with the main www.inf.ed.ac.uk site. Happily, as we'll see, the institute sites are also covered by this plan.

Overview

There are currently 3 main technologies responsible for www.inf.ed.ac.uk content.

  • The first, and longest standing, is raw HTML published via CVS.
  • Second, Plone authored content.
  • Third, a simple redirect off to new www.ed.ac.uk Polopoly hosted content.

Following a disaster that destroys the existing www.inf.ed.ac.uk, recreating the service would require recovering three things: the data for the CVS and Plone content; the configuration that defines the web server setup (including the redirects to Polopoly); and the system configuration.

Traditional LCFG configuration, together with accessible and current backups of the data, would be enough to restore the service in the event of a disaster; the details are documented at ServicesUnitWwwInfRestore. However, that would take time, and given the high profile of the www.inf.ed.ac.uk site, something more dynamic and automated is required. So we now have a copy of www.inf which is never more than 11 hours out of date with the live site (the worst case being changes made between 11pm and 10am).

The Plan

The basic plan is to replicate skelp's (www.inf) setup on another, off site machine, in this case hobgoblin (www-dr.inf), and to have that machine take regular copies of the web content from skelp. If there's a problem with skelp, it should then just be a matter of updating the DNS so that www.inf resolves to hobgoblin's IP address.

More detail

The off site machine, hobgoblin.inf, includes the header file infweb-dr-server.h. This includes the same header files used to configure skelp (www.inf), plus some LCFG overrides and additions which:

  • Disable the regular housekeeping jobs that would run on the real www. These would generate duplicate (and possibly confusing) mail, and may overwrite mirrored data with incomplete or inaccurate data.
  • Update file component symlinks to paths more relevant on the DR machine.
  • Disable firewall holes, so http is only accessible internally.
  • Configure a new cron job to take regular copies of the web content data from www.inf. Currently this runs at 30 minutes past 10am, 2pm, 6pm and 11pm. This is the update-dr script.
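
As a quick check of the "no more than 11 hours" figure, the longest interval between two consecutive copies can be worked out from the cron times listed above. This is only an illustrative sketch: the function name is made up for the example, and it ignores the time the rsync and fixups themselves take.

```python
from datetime import timedelta

# Copy times taken from the cron schedule described above (24-hour clock).
COPY_TIMES = [(10, 30), (14, 30), (18, 30), (23, 30)]


def max_gap(times):
    """Return the longest interval between consecutive daily copies."""
    minutes = sorted(h * 60 + m for h, m in times)
    # Pair each copy with the one after it; the last copy of the day
    # pairs with the first copy of the following day.
    following = minutes[1:] + [minutes[0] + 24 * 60]
    return timedelta(minutes=max(b - a for a, b in zip(minutes, following)))


print(max_gap(COPY_TIMES))  # 11:00:00
```

The overnight gap, from the 23:30 copy to the 10:30 copy, is the one that sets the worst case.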

The update-dr script has two main parts. The first rsyncs the data from www.inf. The second fixes up the apache and zope config (eg changing IP addresses and paths), and rebuilds the zope content database and the postgresql database (for the ISDD) from the dump files mirrored across.

We can't just point zope and postgres at the raw database files that were copied across: they are not in a consistent state, and the daemon processes give errors when pointed at them. So on www.inf regular, consistent dumps of the databases are taken, and it is the mirrors of these dumps that are used to recreate the databases the DR copy of the server will use. The "fixup" scripts do this. They require apache and zope to be shut down while they run, but that only takes a few moments.
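
The order of operations described above can be sketched roughly as follows. This is not the real update-dr script: the step names are hypothetical labels, `run` is a caller-supplied function standing in for whatever actually carries out each step, and restarting the daemons even after a failed fixup is a choice made for this sketch rather than documented behaviour.

```python
def update_dr(run):
    """Run the two parts of a DR refresh in order.

    `run` carries out one named step and raises an exception if it
    fails. The step names are hypothetical labels, not real commands.
    """
    # Part 1: mirror the web content and the database dump files.
    run("rsync data and dump files from www.inf")

    # Part 2: the fixups, which need apache and zope to be down.
    run("stop apache and zope")
    try:
        run("fix up apache and zope config")
        run("rebuild zope database from dump")
        run("rebuild postgresql database from dump")
    finally:
        # Bring the daemons back even if one of the fixups failed.
        run("start apache and zope")


# Dry run: record the steps instead of executing anything.
log = []
update_dr(log.append)
print("\n".join(log))
```

Passing `log.append` as `run` gives a dry run that just lists the steps in the order they would happen.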

At this point www-dr.inf.ed.ac.uk behaves just like www.inf.ed.ac.uk: the redirects to Polopoly are in place where needed, the Plone content is accessible (eg www-dr.inf.ed.ac.uk/school-services/), and so are the usual CVS pages (eg www-dr.inf.ed.ac.uk/systems/). In fact, as xinetd isn't disabled by the DR header file, you can CVS publish to it, though any changes would be lost at the next rsync. The only thing that doesn't currently work is Cosign authentication: the apache config says "I'm www.inf.ed.ac.uk" but the request to authenticate the user comes from "www-dr.inf.ed.ac.uk", and the Cosign server isn't happy about that. This means that Cosign protected pages are inaccessible via a www-dr.inf URL.

It should be possible to fix this, but there should be no need. The only time users should be accessing the www-dr hardware is when www.inf has been repointed at it in the DNS. Requests will then come from www.inf, and the authentication will work again.

How to fail over to the DR server

Currently there's nothing automatic that will switch www.inf over to the off site DR machine (hobgoblin). If the real www.inf develops a fault, or is otherwise inaccessible, but hobgoblin is accessible, then follow these steps.

  1. edit live/infweb-dr-server.h and #define BE_WWW_INF. This will enable the firewall holes and stop the mirroring of the down www.inf.
  2. update the following 'rfe dns/inf' verbatim entries
    #verbatim inf.ed.ac.uk www          300 IN A 129.215.33.176
    #verbatim inf.ed.ac.uk mainweb-web  300 IN A 129.215.33.176
    #verbatim inf.ed.ac.uk mainweb-dice 300 IN A 129.215.33.177
    #verbatim inf.ed.ac.uk mainweb-lfcs 300 IN A 129.215.33.178
    
    updating the IP address to www-dr's IP address, currently 129.215.216.70. Strictly speaking you only need to do www if that's all you're concerned about bringing back.
  3. Tell users not to make any changes to the DR site, unless they are happy for them to be lost when the real www.inf returns.
  4. Wait for DNS to propagate.
  5. Depending on the SSL certificate situation, you may want to manually copy the certificates (if you can) from the live site. If we're using Let's Encrypt by then, running the x509 component once the DNS change has propagated to the LE servers should generate new certificates. Note that if you try this too soon (so that it fails) and too often, the LE service will block your connection for some period of time.
  6. That should be it
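
To make the DNS change in step 2 concrete, the sketch below shows which field of the verbatim entries changes. It is illustrative only: the real edit is made by hand through 'rfe dns/inf', the function and regular expression are made up for this example, and the entry layout is assumed to match the four lines quoted above.

```python
import re

DR_IP = "129.215.216.70"  # www-dr's address as quoted in step 2

# Matches: #verbatim inf.ed.ac.uk <name> <ttl> IN A <address>
VERBATIM_A = re.compile(
    r"^(\s*#verbatim\s+inf\.ed\.ac\.uk\s+(\S+)\s+\d+\s+IN\s+A\s+)(\S+)\s*$"
)


def point_at_dr(lines, names, dr_ip=DR_IP):
    """Return the lines with the A record of each listed name set to dr_ip."""
    out = []
    for line in lines:
        m = VERBATIM_A.match(line)
        if m and m.group(2) in names:
            # Keep everything up to the address, then swap in the DR address.
            line = m.group(1) + dr_ip
        out.append(line)
    return out


entries = [
    "#verbatim inf.ed.ac.uk www          300 IN A 129.215.33.176",
    "#verbatim inf.ed.ac.uk mainweb-web  300 IN A 129.215.33.176",
    "#verbatim inf.ed.ac.uk mainweb-dice 300 IN A 129.215.33.177",
    "#verbatim inf.ed.ac.uk mainweb-lfcs 300 IN A 129.215.33.178",
]

# Bring back only www, the minimal case mentioned in step 2.
for line in point_at_dr(entries, {"www"}):
    print(line)
```

Passing all four names instead of just `{"www"}` would point the institute sites at the DR machine as well.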

Notes

The above assumes that the switch to the DR machine is only temporary, and that longer term the original www.inf will be restored. If this is not the case, then the various crons that are disabled in the DR header file may want to be turned back on, and the DR machine itself should then be mirrored and backed up as it now contains the golden copy of the web content. Normally the DR machine isn't backed up, as it is a backup of the normal www.inf machine.

If you are allowing changes to the DR stand-in version of www.inf and then want to move back to the normal location/server, then the basic steps are:

  1. new machine should have all the regular infweb-server.h headers and be ready and waiting, but disable cvs pserver and web crons
  2. disable changes on DR
  3. stop plone (if it is still a thing) and postgres (on both machines)
  4. rsync /disk/data/mirror/wwwinf/ to /disk/data/ on new machine
  5. copy any /var/lcfg/log/apacheconf.* logs across. If the new machine has done no web that you care about up until this point, then you could just copy the lot and overwrite any existing ones.
  6. restart plone and postgres. This is only necessary on the new machine; if you also restart them on the DR machine, any further changes made there via those routes will not be preserved.
  7. fixup any LIVEROOT/conf/*.conf files that the DR scripts fudge, currently just infweb.conf
  8. check new machine is running apache on the IP address you want (also check firewall holes are open on that port)
  9. start cvs and web crons - check cvs works
  10. switch the DNS for www and mainweb to new machine
  11. if you were unable to preserve the HTTPS certificates, you'll have to wait for the DNS to update before using x509 Let's Encrypt to regenerate them.
  12. after some period of time (a day) if the backups and mirrors of new machine are OK, then return DR machine back to DR duties.

The DR copy is taken 4 times a day. This could be increased, but the scripts on the real www.inf that dump the zope and postgresql databases should then be updated to the same frequency, and scheduled to complete before each DR copy is taken.
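
If the schedule is ever changed, a small check along these lines could confirm that every DR copy has a fresh dump that started after the previous copy and finished before this one. The copy times are those given earlier for the update-dr cron job; the dump times and the dump duration are hypothetical placeholders, as the real values are not recorded on this page.

```python
DAY = 24 * 60

COPY_TIMES = ["10:30", "14:30", "18:30", "23:30"]  # from the update-dr cron job
DUMP_TIMES = ["10:00", "14:00", "18:00", "23:00"]  # hypothetical dump start times
DUMP_MINUTES = 15                                  # hypothetical worst-case dump duration


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def uncovered_copies(copy_times, dump_times, dump_minutes):
    """Return the copy times that have no completed, fresh dump to pick up."""
    copies = sorted(to_minutes(t) for t in copy_times)
    starts = [to_minutes(t) for t in dump_times]
    bad = []
    for i, copy in enumerate(copies):
        # For the first copy of the day, the previous copy was yesterday's last.
        previous = copies[i - 1] - (DAY if i == 0 else 0)
        covered = any(
            previous < start - shift and start - shift + dump_minutes <= copy
            for start in starts
            for shift in (0, DAY)  # consider today's and yesterday's dumps
        )
        if not covered:
            bad.append(f"{copy // 60:02d}:{copy % 60:02d}")
    return bad


print(uncovered_copies(COPY_TIMES, DUMP_TIMES, DUMP_MINUTES))  # []
```

With these placeholder values every copy is covered. Raising the dump duration to 45 minutes would leave all four copies without a completed dump.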

As well as this off site DR service, the main www.inf is also mirrored nightly to one of the regular mirror servers and that is backed up to tape.

Note that both the DR and live machines have AFS ids and are part of the system:infmainweb group, as there are some things shared via AFS.

Just noting here until it's fixed: while the DR machine is standing in for the normal server, downloads of packages from the ISDD will fail due to path issues in the ISDD scripts. The scripts should be fixed, or a fudge would be to symlink the expected dir to the mirror copy.

-- NeilBrown - 01 Feb 2012

Topic revision: r7 - 13 May 2020 - 13:20:40 - NeilBrown
 