Yesterday I upgraded the master LCFG server, tobermory, to SL5. Here's the plan I worked to, with comments noting what went wrong along the way.

Tobermory SL5 Upgrade Plan

  • No need to alter any fstab resources or to #define anything to prepare for the SL5 install

  • Alter lcfg/tobermory to use os/sl5.h, and add the following to lcfg/tobermory as well:
/* Restrict access to subversion */
!subversion.authzallow_lcfg_coreinc mSET(mpu)
!subversion.authzallow_lcfg_corepack mSET(mpu)
!subversion.authzallow_lcfg_liveinc mSET(mpu)
!subversion.authzallow_lcfg_livepack mSET(mpu)
/* Stop rsync and rfe from starting automatically */
!boot.services mREMOVE(lcfg_rsync)
!boot.services mREMOVE(lcfg_rfe)
Then wait for tobermory's new profile to reach it.

When the machine gets its new profile with the new OS version's resources, the resource changes are enough to screw up logins; you're then locked out of the machine unless you already have a window open or you reverse the changes. Luckily I had a window open on tobermory.


CO access to svn stops ----------

  • "om rfe stop" on tobermory


access to rfe lcfg/foo stops -------

  • "om rsync stop" on tobermory


no more LCFG changes -------------

  • "om server stop" on all slaves (mousa, trondra, bressay, illustrious, dresden) - but keep apacheconf running


profile building stops ---------------


slaves serve unchanging profiles ------

  • "om subversion dumpdb -- -r lcfg -g -k 30" on tobermory

  • scp ALL of /var/lcfg/svndump/lcfg/ to another machine

I should have stopped the server process on all the slave servers but kept rsync on tobermory running, to help with copying off all the data. As it was, with rsync down, I pushed the data from tobermory rather than pulling it from the other machine, and pushed it to my own account there rather than to root, so I lost the owner and group information for each file. It would have been better to pull the data from root to root and keep the owner/group information.


subversion repository data saved ------

  • rsync /var/rfedata to another machine


lcfg source files now saved -----------

  • rsync /var/lcfg/releases to another machine


stable and testing releases saved -------

  • copy /var/svn/lcfg/hooks/* to another machine (these are not dumped by svnadmin!)


autocheckout mechanism saved --------

  • copy /var/lcfg/lcfgrelease.sh to another machine


release script saved -----------------

This is the point at which I should have shut down the rsync component on tobermory.

  • install SL5 on tobermory

  • Restore all of /var/lcfg/svndump from elsewhere


svn dumps now present --------------

  • Restore /var/rfedata from elsewhere

This is where I started noticing the file group and ownership problem. All the LCFG files were owned by me - whoops!


lcfg source files exist again -----------

  • Start the subversion component if not already started


repository exists once more ----------

  • Reload the repository: "svnadmin load" the dump file

On the test machine it had taken about a minute to load all 11000 revisions, but on the real server it took 15-20 minutes.
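For the record, the load itself boils down to something like this, assuming the subversion component has already created an empty repository at /var/svn/lcfg (the dump filename here is made up; substitute whatever "om subversion dumpdb" actually wrote):

svnadmin load /var/svn/lcfg < /var/lcfg/svndump/lcfg/lcfg.dump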


repository now contains our data ------

  • Restore the post-commit and pre-commit hooks from elsewhere (after doing the load!)


autocheckout now enabled -----------

  • Check out something from the repository; change it; commit
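The smoke test was just the usual round trip, along these lines (the repository URL is purely illustrative):

svn checkout http://tobermory/svn/lcfg lcfg-check
cd lcfg-check
(edit a file)
svn commit -m "post-upgrade test commit"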

Access denied! It turned out that my subversion.authzallow resource changes, intended to restrict read/write access to the repository to just the MPU, hadn't worked as intended. Stephen explains: normally, access to these areas is granted in terms of higher-level groups (e.g. cos, csos). The problem was that the access permissions weren't set for the mpu group specifically, so they defaulted to "read-only". Note:

authzallowperms_lcfg_coreinc_mpu=r
authzallowperms_lcfg_corepack_mpu=r
authzallowperms_lcfg_liveinc_mpu=r
authzallowperms_lcfg_livepack_mpu=r
authzallowperms_lcfg_root_all=r
Each of those resources would need to be set to 'rw'.
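In resource terms, the fix would presumably mirror the mSET lines earlier in the plan, something like:

!subversion.authzallowperms_lcfg_coreinc_mpu mSET(rw)
!subversion.authzallowperms_lcfg_corepack_mpu mSET(rw)
!subversion.authzallowperms_lcfg_liveinc_mpu mSET(rw)
!subversion.authzallowperms_lcfg_livepack_mpu mSET(rw)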

Stephen solved the problem by editing /var/svn/lcfg/conf/authz directly.

After this was solved there was another problem: my svn commit would succeed, but nothing would appear in the autocheckout directory. After some digging and experimenting I found that the ownership and permissions on the autocheckout directory needed to match those of the repository exactly, or nothing would be checked out. When I compared them I found that the group and the permissions didn't match. Once I had adjusted them and tried another test commit, the autocheckout succeeded.
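On the real machine the comparison and fix were along these lines (paths from the text; the exact commands are my reconstruction); below them is a self-contained local demonstration with made-up paths:

```shell
# On tobermory: compare, then make the autocheckout directory match the
# repository (commands are a sketch, not a transcript):
#   ls -ld /var/svn/lcfg /var/lib/autocheckout
#   chgrp --reference=/var/svn/lcfg /var/lib/autocheckout
#   chmod --reference=/var/svn/lcfg /var/lib/autocheckout
# Local demonstration of making one directory's mode match another's:
mkdir -p /tmp/repo_demo /tmp/autocheckout_demo
chmod 750 /tmp/repo_demo
chmod 755 /tmp/autocheckout_demo
chmod --reference=/tmp/repo_demo /tmp/autocheckout_demo
stat -c %a /tmp/repo_demo /tmp/autocheckout_demo
```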


/var/lib/autocheckout now populated ---
develop and default releases now there --

  • Restore /var/lcfg/releases from elsewhere

And correct the file ownership


stable and testing releases now there ----

  • restore /var/lcfg/lcfgrelease.sh

And correct the file ownership


release script restored ----------------

  • restore /var/cache/lcfgreleases from rsync.lcfg.org::lcfgreleases

And correct the file ownership


stable releases cache restored ---------

  • Start the rfe component on tobermory


CO access to lcfg/foo restored ---------
MPU access to svn restored ------------

  • Start the rsync component on tobermory


changes now available to slaves again ----

  • Remove the boot.services alterations from lcfg/tobermory

  • Remove subversion.authzallow restrictions from lcfg/tobermory

  • om bressay.server start

  • om illustrious.server start

  • om dresden.server start

  • om mousa.apacheconf stop

  • om mousa.server start


mammoth rebuilds now hopefully start ---

  • Mail a progress report to cos

I told the COs that they could now once again use rfe and also the subversion repository. They could indeed use rfe, but subversion wasn't actually going to be available to them until tobermory had got its new profile, which it hadn't yet. Oops.


mousa rebuild finishes ----------------

A complete rebuild has recently been taking an enormous amount of time, two to three hours, and crippling the slave server involved. That's why I shut down apacheconf: to minimise the load on the machine. However, with no existing spanning maps, this complete rebuild took precisely 35 minutes! Normally it takes 30 minutes to build 1000 profiles when the stable release is installed, and 35 minutes is on a par with that: we currently have a total of 1143 profiles on the slave servers. The total number of hosts to process was 2813; the difference between the two figures comes from hosts which have no XML profile associated with them.

  • rfe dns/inf, point lcfg alias at mousa

  • om mousa.apacheconf start


CO access to svn restored -------------


LCFG service is now functional ----------

  • om trondra.apacheconf stop

  • om trondra.server start


mammoth trondra rebuild starts ---------


mammoth trondra rebuild finishes -------

  • om trondra.apacheconf start


LCFG service is back to normal ----------

  • Announce to cos

-- ChrisCooke - 18 Mar 2008
