web.inf.ed.ac.uk disaster recovery

The VM dripped.inf, at KB, is configured to take a mirror of the live web.inf.ed.ac.uk site every 2 hours: cron runs the script /disk/data/scripts/update-edweb-dr. The master copy of that script lives at the same location on the live web.inf site, so make any changes there and then either copy them to the DR machine or wait for the DR machine to copy the script itself at the next cron run.

While it is not standing in for web.inf, it responds as web-dr.inf.ed.ac.uk. The external firewall holes are closed by default.

If the main web.inf.ed.ac.uk site becomes unavailable for some reason, switch to the DR mirror as follows:

  1. edit dripped's profile and #define I_AM_WEB_INF_STANDIN
  2. update the DNS to point web.inf at dripped.inf

Do the above steps in that order: doing step 1 first (and waiting until the profile has pushed) stops dripped from trying to do further mirrors from web.inf, which it is about to become.
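
Before and after the DNS change it can be useful to confirm where web.inf actually resolves. The following is a minimal sketch, not part of the documented procedure. It assumes dig is installed and that dripped.inf expands to dripped.inf.ed.ac.uk; the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: report whether web.inf currently resolves to the DR machine.
# Assumptions: dig is available; dripped.inf.ed.ac.uk is the DR machine's
# fully qualified name (the page only gives "dripped.inf").

# Print the final A record for a name (the last line follows any CNAME chain).
resolve_a() {
    dig +short "$1" A | tail -n 1
}

# Succeed only if both addresses are non-empty and identical.
same_address() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Print which way web.inf points, according to the local resolver.
check_standin() {
    web=$(resolve_a web.inf.ed.ac.uk)
    dr=$(resolve_a dripped.inf.ed.ac.uk)
    if same_address "$web" "$dr"; then
        echo "web.inf resolves to the DR machine ($dr)"
    else
        echo "web.inf does not resolve to the DR machine (web=$web dr=$dr)"
    fi
}
```

Source the file and call check_standin. The answer reflects only the resolver you query, so other clients may still see the old address until the DNS change has propagated.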

The content should be no more than 2 hours out of date, and it will be a full service, i.e. editors will be able to edit content etc.
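
The 2-hour bound follows from the cron interval. A minimal sketch of a freshness check, assuming you have some timestamp for the last successful mirror. No such marker is documented on this page, so where the timestamp comes from is left to you.

```shell
#!/bin/sh
# Sketch: is the last mirror within the expected 2-hour window?
MAX_AGE_SECONDS=7200   # 2 hours, matching the cron interval

# Usage: mirror_is_fresh LAST_MIRROR_EPOCH NOW_EPOCH
# Succeeds if the mirror is between 0 and MAX_AGE_SECONDS old.
mirror_is_fresh() {
    age=$(( $2 - $1 ))
    [ "$age" -ge 0 ] && [ "$age" -le "$MAX_AGE_SECONDS" ]
}
```

For example, with GNU stat the first argument could be the modification time of a file the mirror is known to update, e.g. mirror_is_fresh "$(stat -c %Y /path/to/some/mirrored/file)" "$(date +%s)", where the path is a placeholder.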

As the standin will be a full service, either people will have to be asked NOT to make changes, so that you can simply switch back, or you will have to migrate changes made on the DR machine back to the regular live machine.

It is assumed that you are switching because the regular web.inf is unavailable. You don't want it to be possible for people to update both the live and standin sites, as content will be lost when you switch back. If you are switching to the DR for convenience, e.g. scheduled maintenance affecting the live service (rather than an actual disaster), then make sure you stop apache on the live service.

If you need to migrate changes back to live service

If changes have been made on the DR machine that need to be migrated back to the live service, the basic steps are:

  • stop apacheconf and mysql on both machines
  • copy /var/lib/mysql/ to /disk/data/mysql/ on DR machine (this is because /var/lib/mysql is where the DB lives on the DR machine)
  • copy /disk/data from DR machine back to the live machine
  • start apacheconf and mysql on the live machine
  • update the DNS so web.inf points at the live machine
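
The steps above can be sketched as a script run on the DR machine. This is a dry-run outline under stated assumptions, not a tested procedure: it assumes the components are controlled with the LCFG om command (om apacheconf stop and so on), that rsync over ssh is an acceptable way to copy the data, and that LIVE_HOST is set to the live machine's name (the default below is a placeholder). By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: migrate DR changes back to the live machine. Run on the DR machine.
# Prints the commands by default; set DRY_RUN=0 to execute them.
LIVE_HOST="${LIVE_HOST:-live.example.org}"   # hypothetical placeholder name
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
migrate_back() {
    # Stop the web server and database on both machines.
    run om apacheconf stop &&
    run om mysql stop &&
    run ssh "$LIVE_HOST" om apacheconf stop &&
    run ssh "$LIVE_HOST" om mysql stop &&
    # The DB lives in /var/lib/mysql on the DR machine; put it under /disk/data.
    run rsync -a /var/lib/mysql/ /disk/data/mysql/ &&
    # Copy the whole data tree back to the live machine.
    run rsync -a /disk/data/ "$LIVE_HOST:/disk/data/" &&
    # Restart services on the live machine (database first is an assumption).
    run ssh "$LIVE_HOST" om mysql start &&
    run ssh "$LIVE_HOST" om apacheconf start
    # The DNS update for web.inf remains a manual step.
}
```

Calling migrate_back with the defaults prints the eight commands for review. The trailing slashes on the rsync paths matter: they copy the contents of each directory rather than nesting a new directory inside the destination.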

While the DNS change propagates, browsers still resolving to the DR machine will get a "not responding" error. If this is an issue, you could restart apache on the DR machine, but any edits made on the DR machine will be lost when you switch it back into standby mode. Alternatively, you could put the "old" site into maintenance mode, so that people do at least get a web page:

cd /disk/data/edweb
drush vset maintenance_mode 1
Set it back to "0" to exit maintenance mode.

Once it all seems to be working back on the live site, remove the #define I_AM_WEB_INF_STANDIN from dripped's profile so it starts mirroring the live site again. Note that at this point any DR data will be overwritten, so make sure you have kept copies of anything you need. Also, if you have put the standin site into maintenance mode as above, this will be undone by the first new copy of the data from the live site.

Bare "metal" restore

If you need to restore the service, and for some reason the DR site is also unavailable, then you will need to use the last regular mirror and/or tape backup of the data. In the steps below it is assumed the lost data lived in /disk/data/ (the usual location), and would be restored to the same place. This can be changed by setting #define EDWEB_DATA_DIR to the actual path.

The basic steps would be:

  • Install a new machine (or VM) with a profile similar to the lost machine's, but without the live/edweb-school.h header active. A standard small server with at least a 50GB disk should be fine.

  • Once the new machine is up and running, restore the last mirror or tape backup of the previous machine's /disk/data/

  • It should now be safe to activate the live/edweb-school.h header.

  • Run updaterpms. Try starting mysql and make sure things like om mysql runcommand work. Once they do, continue.

  • Reboot the new machine, as it is the simplest way to make sure all the correct components are started and configured.

  • That should be it: assuming the DNS for web.inf points at the new machine, you should have your site back. If not, check that mysql and apache are up and running, and check their error logs if they are not.
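
A quick way to confirm that the site is answering after the restore is to fetch the front page and look at the HTTP status. The following is a minimal sketch, assuming curl is installed; the function names are illustrative and the URL scheme in the usage note is an assumption.

```shell
#!/bin/sh
# Sketch: fetch a URL and classify the HTTP status code.

# Print the HTTP status code for a URL (curl prints 000 if there is no response).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Succeed for 2xx and 3xx status codes.
status_is_ok() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the result for one URL; returns non-zero on failure.
check_site() {
    code=$(http_status "$1")
    if status_is_ok "$code"; then
        echo "OK: $1 returned $code"
    else
        echo "FAIL: $1 returned $code" >&2
        return 1
    fi
}
```

For example, check_site http://web.inf.ed.ac.uk/ tests whatever machine the DNS currently points at. A failure here only says the front page is not being served, so the mysql and apache error logs are still the place to look for the cause.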

-- NeilBrown - 05 Jun 2018

Topic revision: r1 - 05 Jun 2018 - 13:26:26 - NeilBrown