---+ Tobermory sl6_64 Upgrade Plan

This is the detailed plan for the upgrade of tobermory to SL6.

---++++ Now with added post-upgrade comments.

---++ Preparation

   * Open a bunch of logins to tobermory, both by ssh and on the console.
   * ==nsu== in some of them, including the console one.
   * Nagios has already been pacified.

---++ Shut Down The Service

   * Uncomment the following in lcfg/tobermory:
<pre>
/* Restrict access to subversion to MPU only */
!subversion.authzentries_lcfg mSET(root)
!subversion.authzallow_lcfg_root mSET(mpu ALL)
!subversion.authzallowperms_lcfg_root_mpu mSET(rw)
!subversion.authzentries_dice mSET(root)
!subversion.authzallow_dice_root mSET(mpu ALL)
!subversion.authzallowperms_dice_root_mpu mSET(rw)
!subversion.authzentries_source mSET(root)
!subversion.authzallow_source_root mSET(mpu ALL)
!subversion.authzallowperms_source_root_mpu mSET(rw)

/* Stop rsync and rfe from starting automatically */
!boot.services mREMOVE(lcfg_rsync)
!boot.services mREMOVE(lcfg_rfe)
</pre>
   * Wait for tobermory's new profile to reach it.
At this point non-MPU access to svn stops.
   * Stop the LCFG client on tobermory:
      * ==om client stop==
      * ==less /var/lcfg/log/client==
This protects tobermory from the following step.
   * Alter lcfg/tobermory to use os/sl6_64.h.
---++++ I should also have added ==#define DICE_STICK_WITH_SL61== to the profile. Since I didn't, the install (later) tried to install SL6.2 and fell over with 180 package conflicts. Since I'd just shut down the entire LCFG system this was a nasty moment.
   * Wait for pxeserver (=/var/lcfg/log/pxeserver= on schiff) and dhcpd (=/etc/dhcpd.conf= on abbado) to update.
---++++ Next time I'd add the actual commands to view those logs to the plan, ready for cutting and pasting (see the sketch at the end of this section). Not important, but every little helps.
Logins to tobermory may break at this point (though probably not, since we've stopped the client) but existing services & sessions will keep running.
   * ==om rfe stop== on tobermory
Access to rfe lcfg/foo has now stopped.
   * ==om server stop== on all slaves (mousa, trondra, vole, circlevm12, sernabatim, benaulim) - but keep apacheconf running
---++++ Again, commands to cut and paste would improve things (again, see the end of this section).
Profile building has now stopped. Slaves will continue to serve profiles but the profiles won't change.
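For cutting and pasting, a minimal sketch of the log-watching commands mentioned above. It assumes you can ssh to schiff and abbado and have enough privilege there to read the files; adjust the hostnames and paths if the services ever move.
<pre>
# Watch for tobermory's new profile reaching the install infrastructure
ssh schiff tail -f /var/lcfg/log/pxeserver         # pxeserver log on schiff
ssh abbado grep -A4 tobermory /etc/dhcpd.conf      # check the dhcpd entry has updated on abbado
</pre>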
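Similarly, a cut-and-paste sketch for stopping the server component on every slave. This assumes the remote ==om host.component== form used later in this plan also works for ==stop==; if not, ssh to each slave and run ==om server stop== there instead.
<pre>
# Stop profile building on all LCFG slaves (leave apacheconf running on them)
for h in mousa trondra vole circlevm12 sernabatim benaulim; do
    om $h.server stop
done
</pre>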
---++ Make Final Backups

   * on tobermory as root:
      * ==/usr/bin/om subversion dumpdb -- -r lcfg -d /var/lcfg/svndump/lcfg -o lcfg.sl5_final==
      * ==/usr/bin/om subversion dumpdb -- -r dice -d /var/lcfg/svndump/dice -o dice.sl5_final==
      * ==/usr/bin/om subversion dumpdb -- -r source -d /var/lcfg/svndump/source -o source.sl5_final==
---++++ Dumping the lcfg repository took several minutes. The dice one was very quick as it's still tiny. The source repository dump took a couple of minutes.
All three subversion repositories will now have been dumped to a copyable format.
   * on tobermory as postgres:
      * ==pg_dump orders > /var/rfedata/orders.sl5_final_backup==
---++++ To do this dump successfully I first had to create a directory and make it writeable by the postgres account (a sketch of that preparation is at the end of this section). The dump took a couple of minutes.
The orders database is now dumped.
   * Run rmirror on sauce.
      * ==om rmirror run lcfghdrs lcfgrfedata lcfgstablerelease lcfgtestingrelease svndatadir lcfgsvn autocheckout lcfgreleases==
      * ==less /var/lcfg/log/rmirror==
---++++ This ran quickly, but only because I had already run it on previous days, so only changes had to be copied over this time.
This will have backed up these directories (rsync modules) from tobermory to sauce:
<pre>
/var/rfedata               (lcfgrfedata)
/var/lcfg/releases/stable  (lcfgstablerelease)
/var/lcfg/releases/testing (lcfgtestingrelease)
/var/svn                   (svndatadir)
/var/lcfg/svndump          (lcfgsvn)
/var/lib/autocheckout      (autocheckout)
/var/cache/lcfgreleases    (lcfgreleases)
</pre>
   * Stop the DR mirroring.
      * ==ssh sauce==
      * ==nsu==
      * ==crontab -e==
      * and remove the '0,15,30,45 * * * * /usr/bin/om rmirror run' line.
   * Back up tobermory's / and /var partitions completely.<br>They have rsync modules defined as follows:
<pre>
[root]
readonly=yes
hosts allow=sauce.inf.ed.ac.uk
hosts deny=*
path=/
uid=0

[var]
readonly=yes
hosts allow=sauce.inf.ed.ac.uk
hosts deny=*
path=/var
uid=0
</pre>
using these resources:
<pre>
!rsync.modules mEXTRA(root)
rsync.mentries_root readonly allow deny path uid
rsync.mentry_root_readonly readonly=yes
rsync.mentry_root_allow hosts allow=sauce.inf.ed.ac.uk
rsync.mentry_root_deny hosts deny=*
rsync.mentry_root_path path=/
rsync.mentry_root_uid uid=0

!rsync.modules mEXTRA(var)
rsync.mentries_var readonly allow deny path uid
rsync.mentry_var_readonly readonly=yes
rsync.mentry_var_allow hosts allow=sauce.inf.ed.ac.uk
rsync.mentry_var_deny hosts deny=*
rsync.mentry_var_path path=/var
rsync.mentry_var_uid uid=0
</pre>
and they are being backed up to sauce:/disk/useful/tobermory/backups using this script:
   * as root on sauce:
      * ==/disk/useful/tobermory/backups/run-backups==
The script contains this:
<pre>
#!/bin/bash
/usr/bin/rsync -v -a -A -X -x -x -S tobermory.inf.ed.ac.uk::root/ /disk/useful/tobermory/backups/root/
/usr/bin/rsync -v -a -A -X -x -x -S tobermory.inf.ed.ac.uk::var/ /disk/useful/tobermory/backups/var/
# -v verbose
# -a do the sensible stuff
# -A preserve ACLs
# -X preserve extended attributes
# -x don't cross filesystem boundaries
# -x and omit mountpoints
# -S handle sparse files properly
</pre>
After running the script, all the data has now been backed up.
---++++ The complete backups of root and var worked well and were useful.
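The orders dump above needs its destination to exist and be writable by the postgres account. The exact directory and permissions used on the day are not recorded above, so the following is only a sketch, assuming the destination shown in the plan; grant write access in whatever way suits local policy.
<pre>
# As root on tobermory, before the dump:
mkdir -p /var/rfedata                  # should already exist on the live server
chown postgres /var/rfedata            # or otherwise make it writable by postgres
# Then run the dump as the postgres user:
su - postgres -c 'pg_dump orders > /var/rfedata/orders.sl5_final_backup'
</pre>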
---++ Installation

   * Install SL6_64 on tobermory.
---++++ The install failed on the first attempt with 180 package conflicts! We could have avoided this by defining DICE_STICK_WITH_SL61 in tobermory's profile above to stop it from trying to install the newer SL 6.2. As it was, we recovered from this by restarting the server component on mousa and trondra and editing the tobermory source file on both machines to include DICE_STICK_WITH_SL61. Thankfully they both made a new profile for tobermory with SL 6.1. The server components were then stopped again, and the tobermory install worked on the second attempt.
---++++ Stephen noticed at this point that IPMI wasn't configured correctly and fixed it.

---++ Recover The Data

   * Login to tobermory
---++++ At this point we should have restored the local home directories. This matters because we depend on local personal bash startup files to add the directory containing the weekly release scripts to our bash command search path; so for a short while the next Monday morning's testing release could not be installed, as =installtestingrelease= was not found. The homedirs were eventually restored using this command:<br> ==rsync -v -a -A -X -x -x -S sauce::tobermoryroot/localhomes/ /localhomes/==
   * Make =/var/lcfg/svndump= (and other important dirs) if not already done:
      * ==om file configure==
      * ==less /var/lcfg/log/file==
   * Restore all of =/var/lcfg/svndump=:
      * ==rsync -v -a -A -X -x -x -S sauce::tobermoryvar/lcfg/svndump/ /var/lcfg/svndump/==
Subversion dumps are now present.
   * Restore =/var/rfedata=:
      * ==rsync -v -a -A -X -x -x -S sauce::tobermoryvar/rfedata/ /var/rfedata/==
The LCFG source files exist again.
   * Restore =/var/lcfg/releases=:
      * ==rsync -v -a -A -X -x -x -S sauce::tobermoryvar/lcfg/releases/ /var/lcfg/releases/==
The stable and testing releases are now there.
   * Restore =/var/cache/lcfgreleases=:
      * ==rsync -v -a -A -X -x -x -S sauce::tobermoryvar/cache/lcfgreleases/ /var/cache/lcfgreleases/==
The stable releases cache is restored.
   * Start the subversion component if not already started:
      * ==om subversion start==
---++++ This didn't work. The subversion component was unable to make the repositories because the file component had already made the top-level repository directories so that it could plant the links to the hook scripts. Since the repository directories already existed, subversion refused to make the repositories. To fix this I moved the repository directories out of the way, then started the subversion component; it started successfully and made the repositories. Then I configured the file component to remake the links to the hook scripts (a cut-and-paste sketch of this recovery is at the end of this section). We might avoid this problem in future by removing the "mkdir" option from the file component resources which make the links to the hook scripts.
The repository exists once more, though it's empty as yet.
   * Reload the lcfg repository:
      * ==svnadmin load /var/svn/lcfg < /var/lcfg/svndump/lcfg/lcfg.sl5_final==
---++++ This took two and three-quarter hours! A test load of the same repository on a VM on northern had taken 20 minutes to complete.
The lcfg repository now contains our data.
   * Recreate the pre-commit and post-commit hooks:
      * ==om file configure==
---++++ This should have been ==om file start==, since the component had failed to start earlier. The configure did solve the immediate problem and make the necessary files, but the component remained stopped until the problem was spotted at the next weekly LCFG Check.
This should make links from =/var/svn/lcfg/hooks/pre-commit= to =/usr/lib/lcfg/lcfg-svn-hooks/pre-commit= and from =/var/svn/lcfg/hooks/post-commit= to =/usr/lib/lcfg/lcfg-svn-hooks/post-commit=.
The lcfg repository hooks have now been restored.
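The recovery from the subversion component failure described above, as a cut-and-paste sketch. It assumes the repositories live in the standard =/var/svn/lcfg=, =/var/svn/dice= and =/var/svn/source= locations used elsewhere in this plan; move aside whichever of those directories already exist.
<pre>
# Move the prematurely-created repository directories out of the way,
# let the subversion component create the repositories itself, then
# re-plant the links to the hook scripts:
cd /var/svn
mv lcfg lcfg-aside && mv dice dice-aside && mv source source-aside
om subversion start
om file configure
</pre>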
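Given how long the load took, a quick sanity check afterwards might be worth adding to future plans. This is not part of the original procedure, just a suggestion using standard subversion tools.
<pre>
svnlook youngest /var/svn/lcfg     # should report the newest revision from the dump
svnadmin verify /var/svn/lcfg      # thorough integrity check, but slow on a repository this size
</pre>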
---++ Start The Service Running

   * Restart apacheconf on tobermory.
      * ==om apacheconf restart== and check the log. Apache may have failed to start because of the svn repositories' absence; apacheconf failing to start at this point may also be an indication of a problem with the restored data.
   * Check that autocheckout is working.
      * Check out something from the repository; change it; commit it (a sketch of this test is at the end of this section).
      * Look for your change (or anything) in =/var/lib/autocheckout/lcfg=.
      * If this doesn't work, check that the permissions and ownership on the autocheckout directory match those on the svn repository sufficiently to allow the apache account to do a checkout, e.g.:
<pre>
[tobermory]root: ls -ld /var/svn/lcfg /var/lib/autocheckout/lcfg /var/lib/autocheckout/lcfg/lcfg
drwxrwsr-x 3 apache lcfgsvn 4096 Mar 17 2008 /var/lib/autocheckout/lcfg
drwxrws--- 7 root apache 4096 Mar 17 2008 /var/svn/lcfg
drwxrwxr-x 5 apache lcfgsvn 4096 Mar 17 2008 /var/lib/autocheckout/lcfg/lcfg
</pre>
Also check =/var/lib/autocheckout= itself - it should have this owner, group & permissions:
<pre>
drwxrwxr-x 3 root lcfg 4096 Mar 17 2008 /var/lib/autocheckout
</pre>
If necessary, make a functioning new one as follows:
<pre>
cd /var/lib/
mv autocheckout autocheckout-aside
mkdir autocheckout
chown root:lcfg autocheckout
chmod 755 autocheckout
om file configure
</pre>
=/var/lib/autocheckout= should now be populated. The develop and default releases should now be there.
The MPU should now have full access to svn.
   * Start the rfe component on tobermory
      * ==om rfe start==
CO access to lcfg/foo is restored.
   * Start the rsync component on tobermory
      * ==om rsync start==
Changes are now available to slaves again.
   * Remove the boot.services alterations from lcfg/tobermory
   * Remove the subversion.authzallow restrictions from lcfg/tobermory
This will set repository access back to normal once tobermory has its new profile.
---++++ Removal of the authz restrictions was premature. It was right to remove them for the lcfg repository at this point, but we had not yet restored the source and dice repositories, so we had to keep the restrictions on those repositories until they had been reloaded.
   * Delete the caches on the main LCFG slaves mousa and trondra to speed up the rebuilds:
      * ==rm -f /var/lcfg/conf/server/cache/*==
   * Gentlemen, start your engines.
      * ==om mousa.server start==
      * ==om trondra.server start==
      * ==om vole.server start==
      * ==om sernabatim.server start==
      * ==om benaulim.server start==
      * ==om circlevm12.server start==
Mammoth rebuilds now hopefully start.
---++++ The rebuilds on mousa and trondra were disappointingly slow: they took 3 hours and 25 minutes to finish. A run of ==iostat -xd 5== on one of them showed the disk at 100% utilisation, confirming the suspicion that they are being held back by slow disks. The simultaneous rebuild on the circlevm12 virtual machine took 55 minutes; during that rebuild a similar ==iostat== on circle showed the disk utilisation hovering around 3%. Admittedly I didn't clear the server cache on circlevm12 first, so perhaps it's not an entirely fair comparison. We should try a complete rebuild on this VM again, this time with a cleared cache, to see how long it takes.
---++++ This at least gave us plenty of time to reload the other two repositories and then remove the remaining authz restrictions from lcfg/tobermory.
   * Mail a progress report to COs.
COs will now have access to rfe but not yet to subversion, not until tobermory has its new profile.
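A sketch of the autocheckout test described above. The repository URL and test file here are hypothetical placeholders - use whatever URL and file you would normally use for an lcfg checkout and a harmless commit.
<pre>
# Check out, make a trivial change, commit, then look for it in autocheckout.
svn checkout https://svn.example.inf.ed.ac.uk/lcfg/trunk ~/lcfg-test   # hypothetical URL
cd ~/lcfg-test
echo '/* autocheckout test */' >> some/harmless/file                   # hypothetical file
svn commit -m 'Test commit to exercise the post-commit hook and autocheckout'
# Back on tobermory, the change should appear shortly:
ls -l /var/lib/autocheckout/lcfg
</pre>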
---++ Restore the dice and source repositories too

   * Reload the dice repository:
      * ==svnadmin load /var/svn/dice < /var/lcfg/svndump/dice/dice.sl5_final==
The dice repository has now been restored. There are no commit hooks to restore.
   * Reload the source repository:
      * ==svnadmin load /var/svn/source < /var/lcfg/svndump/source/source.sl5_final==
   * Restore the svn hooks:
      * ==om file configure==
      * Check the hooks in =/var/svn/source/hooks=.

---++ Final Touches

   * Wait for the rebuilds to finish. This is expected to take 80 minutes or so, provided the caches are empty.
CO access to svn has been restored. The LCFG service is now functional. Profile building is back to normal.
   * Re-enable the 15-minute rmirror cron job on sauce if LCFG hasn't already done it.
      * ==ssh sauce==
      * ==nsu==
      * ==crontab -l==
      * and check that the =0,15,30,45 rmirror= line is there.
---++++ The rmirror run had not been restored by LCFG, but a configure of the cron component restored it.
   * Check the logs for errors after the first rmirror run.
---++++ I can't have checked the log, as there was an error in it: the mirror of the repositories, tagged =svndatadir=, was failing because 98% of the files needed to be replaced. I solved this a couple of days later by moving the existing =svndatadir= out of the way and then re-running the mirror. For the omission I can only blame mental exhaustion at the end of a tense eight-hour upgrade process. It would help to note the probable failure of this mirror in future plans.
The DR arrangements are back in place.
   * Announce to COs & LCFG deployers.
   * Ask Alastair to restore the ordershost database.

-- Main.ChrisCooke - 10 Apr 2012