SL6 Upgrade for MPU Services
This is the final report for the SL6 Upgrade for MPU Services, which is
devproj.inf project number 203.
The aim of the project was to upgrade all MPU-run services from SL5 to SL6.
The project was planned and tracked at SL6ServerUpgradeList.
Service By Service
Packages Master
We waited a long time before upgrading the packages master because the SL5 version used perl-AFS and for a long time this did not support 64 bit. When perl-AFS 2.6.3 became available we decided to use it on a 64 bit machine but with OpenAFS 1.4.14. In the longer term we will rewrite the refreshpkgs code to use the AFS::Command module, but the priority was to get away from SL5.
- Before the upgrade we noted these tips:
- Use AFS 1.4, not 1.6
- Use Toby's newer version of the Perl AFS module package. Make sure it's in the world bucket rather than in devel.
- Test everything on an SL6 box first.
- Use a copy of the keytab from brendel rather than generating a new one, as a new one would automatically invalidate the one currently in use! (A sketch of this is given after this list.)
- When we tested refreshpkgs on an SL6 VM we did this (a sketch of the test sequence also follows this list):
- Touch an rpmlist to trigger refreshpkgs.
- If that worked, try a package submission.
- Check the contents of rpmlist.
- Check yum. yum clean all should be run first to clear the local machine's cache.
- There may be a missing dependency on createrepo.
- Expect failures involving perl-AFS.
- Create a new test bucket and check that that works.
- We also found that the testing VM hadn't been given enough memory to handle large runs of refreshpkgs.
- We found an issue with the createrepo checkts option used by refreshpkgs. We're running a locally fixed version until the fix appears upstream.
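As promised above, here is a rough sketch of the keytab-copying tip. It is illustrative only: the keytab path is a placeholder rather than the real file, and the point is simply that copying preserves the key version number whereas extracting a fresh key would invalidate the keytab brendel is still using.

    # Illustrative only: the keytab path is a placeholder, not the real file.
    # Copying the existing keytab keeps the key version number (kvno) unchanged.
    scp root@brendel.inf.ed.ac.uk:/etc/pkgs.keytab /etc/pkgs.keytab

    # Check that the principals and kvno match what brendel is using.
    klist -k /etc/pkgs.keytab

    # Extracting a fresh key instead (e.g. with kadmin's ktadd) would bump the
    # kvno and silently invalidate the keytab still in use on brendel.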
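The refreshpkgs test sequence above can be summarised as the following sketch. The bucket path and package name are placeholders; the real rpmlist locations are whatever refreshpkgs watches on the test bucket.

    # Illustrative smoke test for refreshpkgs on the SL6 VM; the bucket path
    # and package name are placeholders, not the real locations.
    BUCKET=/afs/inf.ed.ac.uk/pkgs/testbucket      # hypothetical test bucket

    # Touch an rpmlist to trigger a refreshpkgs run.
    touch "$BUCKET/rpmlist"

    # After a package submission, check that the package appears in the rpmlist.
    grep example-package "$BUCKET/rpmlist"        # hypothetical package name

    # Check yum, clearing the local cache first so stale metadata can't mask problems.
    yum clean all
    yum list available example-package

    # createrepo may be a missing dependency on a fresh SL6 install.
    rpm -q createrepo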
- Soon after the upgrade was completed an issue appeared wherein there was a delay of up to 25 minutes between package submission and availability. This was caused by a software upgrade which happened shortly after brendel was upgraded to SL6; so it had no chance of showing up in our testing beforehand. It was fixed by upgrading the OpenAFS version on the package buckets' AFS server to 1.6.2-0.pre3.
Packages Slaves and Export
The packages export server provides exporthttp.pkgs.inf.ed.ac.uk and rsync.pkgs.inf.ed.ac.uk, both of which are used for exporting packages to other LCFG users in the University.
mod_waklog on SL6 was a significant requirement of this and a number of other services. When it came to the upgrade itself there were no real problems; the service was already in its current form and virtualised, so it just needed an OS upgrade.
Package Cache and PXE
The package cache servers were among the first MPU machines to be
upgraded to SL6. Before the upgrade we identified a need to use disk-layout.h for SL6. Since that introduces a default size of 40GB for the root filesystem - far too much for many servers - we also introduced small-server.h, which has a far smaller default size. At the same time we introduced an automatic swap size algorithm similar to that used upstream. SL6 also introduced changes to squid, e.g. to netmasks.
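The automatic swap sizing mentioned above can be illustrated roughly as follows. This is not the exact algorithm in our headers; it just follows the commonly quoted upstream (RHEL6 installer) guideline of twice RAM on small machines and RAM plus 2GB on larger ones, which is the style of rule we adopted.

    # Rough illustration of upstream-style automatic swap sizing; the thresholds
    # are the commonly quoted RHEL6 guideline, not necessarily our exact values.
    ram_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)

    if [ "$ram_mb" -le 2048 ]; then
        swap_mb=$(( ram_mb * 2 ))        # small machines: twice RAM
    else
        swap_mb=$(( ram_mb + 2048 ))     # larger machines: RAM plus 2GB
    fi

    echo "RAM ${ram_mb}MB -> swap ${swap_mb}MB"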
Packages Mirrors AFS
This upgrade was held back for a long time whilst we resolved various issues with nexsan storage arrays and multipath fibre-channel support on SL6.
This was the first AFS fileserver to be upgraded to SL6 and to run the OpenAFS 1.6 release, and it served as a good test case for the upgrade of the main Informatics fileservers. There were concerns that the lack of a perl-AFS module would be an issue, but the LCFG openafs component still worked fine.
The important thing to remember when upgrading this server is that the value of the updaterpms.rpmpath resource has to be altered to use dr.pkgs instead of cache.pkgs, otherwise it doesn't work. This clearly demonstrated the usefulness of having the DR server available for more than just disaster situations.
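A quick way to confirm the change on the machine itself is to query the resource; the sketch below assumes the standard LCFG qxprof utility, and the value shown in the comment is illustrative rather than the real resource setting.

    # Confirm which package server updaterpms will fetch from.
    qxprof updaterpms.rpmpath
    # Expect something mentioning dr.pkgs rather than cache.pkgs, e.g. (illustrative):
    #   rpmpath=http://dr.pkgs.inf.ed.ac.uk/...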
LCFG Master
Since almost all aspects of our configuration are done using LCFG, upgrading the LCFG Master itself had to be done very carefully. The MPUpgradingMasterLCFGServerSL664 page lists details of the entire upgrade procedure, with some discussion on what went wrong and what could have been done better.
We rehearsed this upgrade well in advance by building up a parallel LCFG Master on a VM and having it master a parallel LCFG system with a test LCFG slave of its own. This was a good way of showing up problems which might not otherwise have been spotted in advance of the real upgrade.
The LCFG Master is also the ordershost. This functionality was moved to an SL6 VM before the upgrade, then moved back to the LCFG master after the upgrade. The manual steps for bootstrapping the ordershost are in dice/options/ordershost.h (at the bottom).
Problems encountered before the upgrade included:
- an issue involving web-svn.
- a lot of LCFG components did not have sl6_64 versions of their defaults RPMs. It took some months to have the missing RPMs produced or establish that components without them could be dropped.
- a certificate error involving rfe servers.
The upgrade to SL6 was done on the then LCFG master tobermory. The service's move from tobermory to schiff was done at a later date.
LCFG Slaves
We upgraded our slaves mousa and trondra one at a time, looking for problems after the first of the upgrades. A problem with rsync was discovered this way. The move to the virtual machines bol and metsu came later in the year. It was prompted by the unacceptable time taken by mousa and trondra to do complete LCFG profile rebuilds following the upgrades. Moving them to KVM gave us slaves with faster CPUs and much faster disks, which prevented the extreme slowdown previously experienced with full profile rebuilds.
The inf level LCFG slave was first duplicated on an SL6 VM. This allowed us to compare SL6-generated and SL5-generated profiles. When the results were satisfactory the SL5 (virtual) server was decommissioned.
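One way of doing that comparison is sketched below. Each slave publishes its compiled XML profiles over HTTP, so fetching the same host's profile from the SL5 slave and from the SL6 duplicate and diffing them shows any differences that need explaining; the hostnames and URL paths here are placeholders, not the real slaves.

    # Compare the profile one host gets from the SL5 slave against the SL6 duplicate.
    # Hostnames and URL paths are placeholders; adjust to the real slaves.
    HOST=somehost.inf.ed.ac.uk
    wget -qO sl5-profile.xml "http://sl5-slave.example.org/profiles/${HOST}.xml"
    wget -qO sl6-profile.xml "http://sl6-slave.example.org/profiles/${HOST}.xml"

    # Any differences here need explaining before trusting the SL6 slave.
    diff -u sl5-profile.xml sl6-profile.xml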
DIY DICE
DIY DICE was the first LCFG service to be hosted on SL6. Before we did the upgrade we tested the LCFG server code on SL6.
Disaster Recovery Service
The DR server sauce was upgraded to SL6 after the other LCFG servers. We noted these points before the upgrade:
- It's important to remember to preserve the current contents of the sauce disk space as it would take a long time to regenerate it.
- test slave using sauce as a master server
- test client using sauce as a slave server
- test client using sauce as a package server
After the upgrade these tests showed up a couple of problems, as noted at the 29 May MPU meeting. These were subsequently fixed.
PkgForge master
This upgrade was fairly straightforward. Although PkgForge has a large set of package dependencies these had already been prepared when the client tools were upgraded to SL6 some time before. The only issue was a dependency on mod_waklog, which needed patching before it could be used on SL6 with OpenAFS 1.6.
LCFG Web Services
This upgrade was held back for a long time whilst we resolved various issues with nexsan storage arrays and multipath fibre-channel support on SL6.
The original plan was to move this service from budapest to bakerloo, but in the end we decided that hosting it on real hardware was not justified and instead we moved it to KVM (polecat). This was a useful process since it gave us a chance to learn how to configure extra disks for VMs.
The locally-written software used to generate the main LCFG website all worked fine on SL6. As part of the upgrade we decided to switch to the latest version of the TWiki software. This presented a few problems since it was a major version change (4 to 5); in particular we had problems with theming until Alastair worked out the magic to make version 5 use the same skin as version 4.
LCFG Bug Tracking
On SL5 all of the LCFG web services were hosted on the same
machine. To simplify the configuration we moved the LCFG Bugzilla service to a separate KVM for SL6.
At the same time Bugzilla itself had to be
upgraded from the outdated 3.0 series to the new 4.2 series. This was
the latest stable release at the time of the upgrade and is likely to
be maintained and receive security fixes for some time to come.
The main points to note were how to get working RPMs for Bugzilla 4.2 and the
steps involved in upgrading the Bugzilla setup from the old version to
the new. Both of these points are covered on the
ManagingBugzilla page.
VMware hosting
Since the free version of VMware we used ceased to be supported some time ago there was no possibility of our upgrading it to SL6 - indeed we had to maintain an outdated version of SL5 specially to support the VMware service. Instead VMware is being replaced by a KVM hosting service.
KVM Hosting
This service started on SL6 so didn't need an upgrade.
Storage Array Monitoring
The old VM zig used for monitoring the IBM storage array previously had to be a self-managed machine because it required a specific version of mysql which conflicted with that installed as part of DICE SL5. With the much smaller SL6 server installations, and the resulting reduced dependencies, it is now simple to get the monitoring software installed. The new VM giz runs DICE SL6 so now benefits from standard DICE nagios monitoring.
Summary of Machine Moves
The MPU decided to combine the SL6 upgrades with hardware upgrades and moves, and a number of the upgrades were done by preparing a replacement SL6 service on a separate machine then switching over to it, so many MPU services moved host during this project. Excluding the new KVM servers, our physical host count reduced by 3.
- Packages Master
- This service stayed on brendel, a Dell R200 from 2009.
- Packages Slaves and Export
- This moved from its SL5 KVM cochin to a new SL6 KVM porto.
- Package Cache and PXE
- The package cache service, and PXE service, moved from split and schiff to two new Dell R210 II machines hare and wildcat. split was retired and schiff redeployed as our new LCFG master server.
- Packages Mirrors AFS
- The packages mirrors AFS service stayed on telford, a Dell 1950 from 2008, though the data itself is on the MPU SAN.
- LCFG Master
- This moved to schiff from our elderly Dell 860 tobermory which was retired.
- LCFG Slaves
- These moved from elderly and slow Dell 860s mousa and trondra (now retired) to KVM guests metsu and bol. The test slave moved from KVM alboran (SL5) to vole (SL6) and the inf level slave moved from ashkenazy (SL5) to KVM barents (SL6).
- DIY DICE
- This service was already virtualised. It moved from KVM madurai (SL5) to KVM sernabatim (SL6).
- Disaster Recovery Service
- The LCFG and Packages DR service stayed on sauce, an HP DL180 from 2010.
- PackageForge master
- This moved from VMware ardbeg (SL5) to KVM pinemarten (SL6).
- LCFG Web Services
- These moved from budapest, a Dell 1950 from 2007, to two KVM virtual machines. The bugzilla service moved to heda and the other services moved to polecat.
- VMware hosting
- This service could not be upgraded to SL6. It has been deprecated and at the time of writing has one guest VM left. As VMware host servers were no longer needed they were redeployed as additional KVM host servers.
- KVM hosting
- This service started on SL6 so needed no upgrading. The number of KVM host servers has grown thanks both to purchases of new hardware and to redeployment of former VMware host servers as additional KVM host servers.
- Storage Array Monitoring
- This was already virtualised; it moved from KVM domain zig to giz for the SL6 upgrade.
Hours Taken
| Period | Hours |
| 2011 T3 | 27 |
| 2012 T1 | 168 |
| 2012 T2 | 82 |
| 2012 T3 | 125 |
| 2013 T1 | 14 |
| Total | 416 |