Tobermory sl6_64 Upgrade Plan

This is the detailed plan for the upgrade of tobermory to SL6.

Now with added post-upgrade comments.

Preparation

  • Open a bunch of logins to tobermory, both by ssh and on the console.
  • nsu in some of them, including the console one.
  • Nagios has already been pacified.

Shut Down The Service

  • Uncomment the following in lcfg/tobermory:
/* Restrict access to subversion to MPU only */
!subversion.authzentries_lcfg                mSET(root)
!subversion.authzallow_lcfg_root               mSET(mpu ALL)
!subversion.authzallowperms_lcfg_root_mpu  mSET(rw)

!subversion.authzentries_dice                mSET(root)
!subversion.authzallow_dice_root               mSET(mpu ALL)
!subversion.authzallowperms_dice_root_mpu  mSET(rw)

!subversion.authzentries_source                mSET(root)
!subversion.authzallow_source_root               mSET(mpu ALL)
!subversion.authzallowperms_source_root_mpu  mSET(rw)

/* Stop rsync and rfe from starting automatically */
!boot.services mREMOVE(lcfg_rsync)
!boot.services mREMOVE(lcfg_rfe)
  • Wait for tobermory's new profile to reach it.

At this point non-MPU access to svn stops.

  • Stop the LCFG client on tobermory:
    • om client stop
    • less /var/lcfg/log/client

This protects tobermory from the following step.

  • Alter lcfg/tobermory to use os/sl6_64.h.

I should have also added #define DICE_STICK_WITH_SL61 to the profile. Since I didn't, the install (later) tried to install SL6.2 and fell over with 180 package conflicts.
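
For the record, the profile change would look roughly like this; the exact #include path is whatever lcfg/tobermory already uses for its OS header, so treat this as a sketch rather than something to paste in verbatim:
#define DICE_STICK_WITH_SL61      /* keep the install at SL6.1 and avoid the SL6.2 package conflicts */
#include <dice/os/sl6_64.h>       /* include path assumed - use whatever form the profile already has */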

  • Wait for pxeserver (/var/lcfg/log/pxeserver on schiff) and dhcpd (/etc/dhcpd.conf on abbado) to update.

Next time I'd add in the actual commands to view those logs, ready for cutting and pasting. Not important but every little helps.
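
Something like this would have done (assuming the files are readable over an ordinary ssh login; nsu first if not):
ssh -t schiff less /var/lcfg/log/pxeserver
ssh -t abbado less /etc/dhcpd.conf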

Logins to tobermory may break at this point (though probably not, since we've stopped the client) but existing services & sessions will keep running.

  • om rfe stop on tobermory

Access to rfe lcfg/foo has now stopped.

  • om server stop on all slaves (mousa, trondra, vole, circlevm12, sernabatim, benaulim) - but keep apacheconf running

Again, commands to cut and paste would improve things.
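
Something along these lines would be handy to paste in next time (assuming the om host.component form used for the restarts later in this plan also works for stop):
for h in mousa trondra vole circlevm12 sernabatim benaulim; do
    om $h.server stop
done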

Profile building has now stopped. Slaves will continue to serve profiles but the profiles won't change.

Make Final Backups

  • on tobermory as root:
    • /usr/bin/om subversion dumpdb -- -r lcfg -d /var/lcfg/svndump/lcfg -o lcfg.sl5_final
    • /usr/bin/om subversion dumpdb -- -r dice -d /var/lcfg/svndump/dice -o dice.sl5_final
    • /usr/bin/om subversion dumpdb -- -r source -d /var/lcfg/svndump/source -o source.sl5_final

Dumping the lcfg repository took several minutes. The dice one was very quick as it's tiny as yet. The source repo dump took a couple of minutes.

All three subversion repositories will now have been dumped to a copyable format.

  • on tobermory as postgres:
    • pg_dump orders > /var/rfedata/orders.sl5_final_backup

To do this dump successfully I first had to create a directory and make it writeable by the postgres account. The dump took a couple of minutes.
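
Roughly what was needed first, as root (a sketch from memory; the directory is just the one the dump above is written to, and the ownership change is an assumption - check what /var/rfedata should look like before copying this):
mkdir -p /var/rfedata          # where the dump above is written
chown postgres /var/rfedata    # or otherwise make it writeable by the postgres account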

The orders database is now dumped.

  • Run rmirror on sauce.
    • om rmirror run lcfghdrs lcfgrfedata lcfgstablerelease lcfgtestingrelease svndatadir lcfgsvn autocheckout lcfgreleases
    • less /var/lcfg/log/rmirror

This will have backed up these directories (rsync modules) from tobermory to sauce:

/var/rfedata (lcfgrfedata)
/var/lcfg/releases/stable (lcfgstablerelease)
/var/lcfg/releases/testing (lcfgtestingrelease)
/var/svn (svndatadir)
/var/lcfg/svndump (lcfgsvn)
/var/lib/autocheckout (autocheckout)
/var/cache/lcfgreleases (lcfgreleases)

  • Stop the DR mirroring.
    • ssh sauce
    • nsu
    • crontab -e
      • and remove the '0,15,30,45 * * * * /usr/bin/om rmirror run' line.
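
A non-interactive equivalent, if you prefer (a sketch; eyeball the output of the grep before piping it back in):
crontab -l | grep -v 'om rmirror run' | crontab -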

  • Back up tobermory's / and /var partitions completely.
    They have rsync modules defined as follows:
[root]
readonly=yes
hosts allow=sauce.inf.ed.ac.uk
hosts deny=*
path=/
uid=0

[var]
readonly=yes
hosts allow=sauce.inf.ed.ac.uk
hosts deny=*
path=/var
uid=0
using these resources:
!rsync.modules 			mEXTRA(root)
rsync.mentries_root 		readonly allow deny path uid
rsync.mentry_root_readonly	readonly=yes
rsync.mentry_root_allow		hosts allow=sauce.inf.ed.ac.uk
rsync.mentry_root_deny		hosts deny=*
rsync.mentry_root_path		path=/
rsync.mentry_root_uid		uid=0

!rsync.modules 			mEXTRA(var)
rsync.mentries_var			readonly allow deny path uid
rsync.mentry_var_readonly	readonly=yes
rsync.mentry_var_allow		hosts allow=sauce.inf.ed.ac.uk
rsync.mentry_var_deny		hosts deny=*
rsync.mentry_var_path		path=/var
rsync.mentry_var_uid		uid=0
and are being backed up to sauce:/disk/useful/tobermory/backups using this script:
  • as root on sauce:
    • /disk/useful/tobermory/backups/run-backups
The script contains this:
#!/bin/bash
/usr/bin/rsync -v -a -A -X -x -x -S tobermory.inf.ed.ac.uk::root/ /disk/useful/tobermory/backups/root/
/usr/bin/rsync -v -a -A -X -x -x -S tobermory.inf.ed.ac.uk::var/ /disk/useful/tobermory/backups/var/

# -v    verbose
# -a    do the sensible stuff
# -A    preserve ACLs
# -X    preserve extended attributes
# -x    don't cross filesystem boundaries
# -x    and omit mountpoints
# -S    handle sparse files properly

After running the script, all data has now been backed up.

The complete backups of root and var worked well and were useful.

Installation

  • Install SL6_64 on tobermory.

The install failed on the first attempt with 180 package conflicts! We could have avoided this by defining DICE_STICK_WITH_SL61 in tobermory's profile above to stop it from trying to install the newer SL 6.2. As it was, we recovered by restarting the server component on mousa and trondra and editing the tobermory source file on each; they both then made a new profile for tobermory with SL 6.1. The server components were stopped again, and the tobermory install worked on the second attempt.

Stephen noticed at this point that IPMI wasn't configured correctly and fixed it.

Recover The Data

  • Login to tobermory
  • Make /var/lcfg/svndump (and other important dirs) if not already done:
    • om file configure
    • less /var/lcfg/log/file
  • Restore all of /var/lcfg/svndump:
    • rsync  -v -a -A -X -x -x -S sauce::tobermoryvar/lcfg/svndump/ /var/lcfg/svndump/

Subversion dumps are now present.

  • Restore /var/rfedata:
    • rsync  -v -a -A -X -x -x -S sauce::tobermoryvar/rfedata/ /var/rfedata/

The LCFG source files exist again.

  • Restore /var/lcfg/releases:
    • rsync  -v -a -A -X -x -x -S sauce::tobermoryvar/lcfg/releases/ /var/lcfg/releases/

The stable and testing releases are now there.

  • Restore /var/cache/lcfgreleases:
    • rsync  -v -a -A -X -x -x -S sauce::tobermoryvar/cache/lcfgreleases/ /var/cache/lcfgreleases/

Stable releases cache restored.

  • Start the subversion component if not already started:
    • om subversion start

This didn't work. The subversion component was unable to make the repositories because the file component had already created the top-level repository directories in order to plant the links to the hook scripts, and since those directories already existed subversion refused to create the repositories inside them. To fix this I moved the repository directories out of the way and then started the subversion component; it started successfully and made the repositories. I then ran the file component's configure again to remake the links to the hook scripts. We could avoid this problem in future by removing the "mkdir" option from the file component resources which make the links to the hook scripts.
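
For next time, the workaround boiled down to something like this (a sketch reconstructed from memory; the -aside names are just illustrative):
cd /var/svn
mv lcfg lcfg-aside             # likewise dice and source if they are also in the way
om subversion start            # the component now creates the repositories itself
om file configure              # then re-plant the links to the hook scripts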

The repository exists once more, though it's empty as yet.

  • Reload the lcfg repository:
    • svnadmin load /var/svn/lcfg < /var/lcfg/svndump/lcfg/lcfg.sl5_final

This took two and three quarter hours! A test load of the same repository on a VM on northern had taken 20 minutes to complete.

The lcfg repository now contains our data.

  • Recreate the pre-commit and post-commit hooks:
    • om file configure
This should make links from /var/svn/lcfg/hooks/pre-commit to /usr/lib/lcfg/lcfg-svn-hooks/pre-commit, and from /var/svn/lcfg/hooks/post-commit to /usr/lib/lcfg/lcfg-svn-hooks/post-commit.
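
A quick way to confirm the links are in place (nothing clever, just ls):
ls -l /var/svn/lcfg/hooks/pre-commit /var/svn/lcfg/hooks/post-commit
# each should be a symlink into /usr/lib/lcfg/lcfg-svn-hooks/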

The lcfg repository hooks have now been restored.

Start The Service Running

  • Restart apacheconf on tobermory.
    • om apacheconf restart and check the log.
Apache may have failed to start because of the svn repositories' absence. Also, apacheconf failing to start at this point may be an indication of a problem with the restored data.

  • Check that autocheckout is working.
    • Check out something from the repository; change it; commit
    • Look for your change (or anything) in /var/lib/autocheckout/lcfg.
    • If this doesn't work, check that the permissions and ownership on the autocheckout directory match those on the svn repository sufficiently for the apache account to be able to do a checkout, e.g.:
[tobermory]root: ls -ld /var/svn/lcfg /var/lib/autocheckout/lcfg /var/lib/autocheckout/lcfg/lcfg
drwxrwsr-x 3 apache lcfgsvn 4096 Mar 17  2008 /var/lib/autocheckout/lcfg
drwxrws--- 7 root   apache  4096 Mar 17  2008 /var/svn/lcfg
drwxrwxr-x 5 apache lcfgsvn 4096 Mar 17  2008 /var/lib/autocheckout/lcfg/lcfg

Also check /var/lib/autocheckout itself - it should have this owner, group & permissions:

drwxrwxr-x 3 root lcfg 4096 Mar 17  2008 /var/lib/autocheckout
If it doesn't, make a functioning new one as follows:
cd /var/lib/
mv autocheckout autocheckout-aside
mkdir autocheckout
chown root:lcfg autocheckout
chmod 755 autocheckout
om file configure

/var/lib/autocheckout should now be populated. The develop and default releases should now be there.

The MPU should now have full access to svn.

  • Start the rfe component on tobermory
    • om rfe start

CO access to lcfg/foo is restored.

  • Start the rsync component on tobermory
    • om rsync start

Changes are now available to slaves again.

  • Remove the boot.services alterations from lcfg/tobermory
  • Remove subversion.authzallow restrictions from lcfg/tobermory

This will set the repository access back to normal once tobermory has its new profile.

Removal of the authz restrictions was premature. It was right to remove them from the lcfg repository at this point, but we had not yet restored the source and dice repositories, so their restrictions had to stay in place until those repositories had been reloaded.

  • Delete caches on main LCFG slaves mousa and trondra to speed up rebuilds:
    • rm -f /var/lcfg/conf/server/cache/*

  • Gentlemen, start your engines.
    • om mousa.server start
    • om trondra.server start
    • om vole.server start
    • om sernabatim.server start
    • om benaulim.server start
    • om circlevm12.server start

Mammoth rebuilds now hopefully start.

The rebuilds on mousa and trondra were disappointingly slow. They took 3 hours and 25 minutes to finish. A run of iostat -xd 5 on one of them showed that the disk was at 100% utilisation, confirming the notion that they are being held back by slow disks. The simultaneous rebuild on the circlevm12 virtual machine took 55 minutes. During the rebuild a similar iostat on circle showed the disk utilisation hovering around 3%.

This at least gave us plenty of time to reload the other two repositories then remove the remaining authz restrictions from lcfg/tobermory.

  • Mail a progress report to cos

COs will now have access to rfe but not yet to subversion, not until tobermory has its new profile.

Restore the dice and source repositories too

  • Reload the dice repository:
    • svnadmin load /var/svn/dice < /var/lcfg/svndump/dice/dice.sl5_final

The dice repository has now been restored. There are no commit hooks to restore.

  • Reload the source repository:
    • svnadmin load /var/svn/source < /var/lcfg/svndump/source/source.sl5_final

  • Restore svn hooks:
    • om file configure
    • Check the hooks in /var/svn/source/hooks.

Final Touches

  • Wait for the rebuilds to finish. This is expected to take 80 minutes or so, provided the caches are empty.
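
To keep an eye on progress, watching the server logs on the slaves is the obvious thing (the log path is an assumption, by analogy with the other component logs mentioned above):
ssh mousa tail -f /var/lcfg/log/server
ssh trondra tail -f /var/lcfg/log/server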

CO access to svn has been restored. The LCFG service is now functional. Profile-building is now back to normal.

  • Re-enable the 15 minute rmirror cron job on sauce if LCFG hasn't already done it.
    • ssh sauce
    • nsu
    • crontab -l
      • and check that the 0,15,30,45 rmirror line is there.

The rmirror run had not been restored by LCFG, but a configure of the cron component restored it.
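
For the record, that was just the usual om invocation on sauce (as root):
om cron configure
crontab -l        # check that the 0,15,30,45 rmirror line has come back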

  • Check the logs for errors after the first rmirror run.

The DR arrangements are back in place.

  • Announce to COs & LCFG deployers

  • Ask Alastair to restore the ordershost database.
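
The restore should be more or less the reverse of the pg_dump above. A sketch only, as the postgres account on whichever host the database lives on (Alastair may well do it differently):
createdb orders                                      # if the database doesn't already exist
psql orders < /var/rfedata/orders.sl5_final_backup   # the plain-SQL dump made earlier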

-- ChrisCooke - 10 Apr 2012
