SL6 Upgrade for MPU Services

This is the final report for the SL6 Upgrade for MPU Services, which is devproj.inf project number 203.

The aim of the project was to upgrade all MPU-run services from SL5 to SL6.

The project was planned and tracked at SL6ServerUpgradeList.

Service By Service

Packages Master

We waited a long time before upgrading the packages master because the SL5 version used perl-AFS, which for a long time did not support 64 bit. When perl-AFS 2.6.3 became available we decided to use it on a 64 bit machine, but with OpenAFS 1.4.14. In the longer term we will rewrite the refreshpkgs code to use the AFS::Command module, but the priority was to get away from SL5.

  • Before the upgrade we noted these tips:
    • Use AFS 1.4, not 1.6
    • Use Toby's newer version of the Perl AFS module package. Make sure it's in the world bucket rather than in devel.
    • Test everything on an SL6 box first.
    • Use a copy of the keytab from brendel rather than generating a new one as the new one would automatically invalidate the one currently in use!
  • When we tested refreshpkgs on an SL6 VM we did this (a rough sketch of these checks appears after this list):
    • Touch an rpmlist to trigger refreshpkgs.
    • If that worked, try a package submission.
    • Check the contents of the rpmlist.
    • Check yum on a client; yum clean all should be run first to clear the local machine's cache.
    • There may be a missing dependency on createrepo.
    • Expect failures involving perl-AFS.
    • Create a new test bucket and check that it works.
  • We also found that the testing VM hadn't been given enough memory to handle large runs of refreshpkgs.
  • We found an issue with the createrepo checkts option used by refreshpkgs. We're running a locally fixed version until the fix appears upstream.
  • Soon after the upgrade was completed an issue appeared in which there was a delay of up to 25 minutes between package submission and availability. This was caused by a software upgrade which happened shortly after brendel was upgraded to SL6, so it had no chance of showing up in our testing beforehand. It was fixed by upgrading the OpenAFS version on the package buckets' AFS server to 1.6.2-0.pre3.
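
As referenced in the testing checklist above, the smoke test amounts to roughly the following shell sketch (the rpmlist path, bucket name and package name here are illustrative only, not the real ones):

    # Trigger refreshpkgs by touching an rpmlist in a test bucket
    # (this path is hypothetical).
    touch /afs/inf.ed.ac.uk/pkgs/rpmlists/testbucket/rpmlist

    # On a client, clear the local yum cache first, then check that the
    # repository metadata has been regenerated and the package appears.
    yum clean all
    yum list available | grep newly-submitted-package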

Packages Slaves and Export

The packages export server provides exporthttp.pkgs.inf.ed.ac.uk and rsync.pkgs.inf.ed.ac.uk, both of which are used for exporting packages to other LCFG users in the University. mod_waklog on SL6 was a significant requirement of this and a number of other services. When it came to the upgrade itself there were no real problems; the service was already in its current form and virtualised, so it just needed an OS upgrade.
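
A quick way to confirm that both export routes are working after an upgrade is sketched below; the exact URL paths and module names are site-specific, so treat these as illustrative:

    # Check that the HTTP export is answering.
    curl -I http://exporthttp.pkgs.inf.ed.ac.uk/

    # List the modules offered by the rsync export.
    rsync rsync.pkgs.inf.ed.ac.uk::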

Package Cache and PXE

The package cache servers were among the first MPU machines to be upgraded to SL6. Before the upgrade we identified a need to use disk-layout.h for SL6. Since that introduces a default size of 40GB for the root filesystem - far too much for many servers - we also introduced small-server.h, which has a far smaller default size. At the same time we introduced an automatic swap size algorithm similar to that used upstream (sketched below). SL6 also introduced changes to squid, e.g. to netmasks.
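
For illustration, an upstream-style swap sizing rule looks roughly like the following shell sketch; this is a hedged approximation, not the exact algorithm encoded in the LCFG headers:

    # Scale swap with RAM but cap it so small servers keep their disk
    # (the thresholds here are illustrative).
    ram_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
    if [ "$ram_mb" -le 2048 ]; then
        swap_mb=$(( ram_mb * 2 ))    # small machines: twice RAM
    elif [ "$ram_mb" -le 8192 ]; then
        swap_mb=$ram_mb              # mid-sized machines: equal to RAM
    else
        swap_mb=8192                 # larger machines: cap swap at 8GB
    fi
    echo "RAM ${ram_mb}MB -> swap ${swap_mb}MB"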

Packages Mirrors AFS

This upgrade was held back for a long time whilst we resolved various issues with nexsan storage arrays and multipath fibre-channel support on SL6.

This was the first AFS fileserver to be upgraded to SL6 and to run the openafs 1.6 release, and it served as a good test case for the upgrade of the main Informatics fileservers. There were concerns that the lack of a perl-AFS module would be an issue but the LCFG openafs component still worked fine.

The important thing to remember when upgrading this server is that the value of the updaterpms.rpmpath resource has to be altered to use dr.pkgs instead of cache.pkgs, otherwise the upgrade does not work. This clearly demonstrated the usefulness of having the DR server available for more than just disaster situations.
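
A simple pre-flight check, assuming the standard LCFG qxprof utility is available on the machine, is to inspect the resource before starting (the value itself is site-specific):

    # The reported path should reference dr.pkgs rather than cache.pkgs
    # before the reinstall is started.
    qxprof updaterpms.rpmpath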

LCFG Master

Since almost all aspects of our configuration are done using LCFG, upgrading the LCFG Master itself had to be done very carefully. The MPUpgradingMasterLCFGServerSL664 page lists details of the entire upgrade procedure, with some discussion on what went wrong and what could have been done better.

We rehearsed this upgrade well in advance by building up a parallel LCFG Master on a VM and having it master a parallel LCFG system with a test LCFG slave of its own. This was a good way of showing up problems which might not otherwise have been spotted in advance of the real upgrade.

The LCFG Master is also the ordershost. This functionality was moved to an SL6 VM before the upgrade, then moved back to the LCFG master after the upgrade. The manual steps for bootstrapping the ordershost are in dice/options/ordershost.h (at the bottom).

Problems encountered before the upgrade included:

  • an issue involving web-svn.
  • a lot of LCFG components did not have sl6_64 versions of their defaults RPMs. It took some months to have the missing RPMs produced or to establish that components without them could be dropped.
  • a certificate error involving rfe servers.

The upgrade to SL6 was done on the then LCFG master tobermory. The service's move from tobermory to schiff was done at a later date.

LCFG Slaves

We upgraded our slaves mousa and trondra one at a time, looking for problems after the first of the upgrades. A problem with rsync was discovered this way. The move to the virtual machines bol and metsu came later in the year. It was prompted by the unacceptable time taken by mousa and trondra to do complete LCFG profile rebuilds following the upgrades. Moving them to KVM gave us slaves with faster CPUs and much faster disks, which eliminated the extreme slowdown experienced with full profile rebuilds.

The inf level LCFG slave was first duplicated on an SL6 VM. This allowed us to compare SL6-generated and SL5-generated profiles. When the results were satisfactory the SL5 (virtual) server was decommissioned.
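
One way of making that comparison, purely as a sketch (the hostnames and profile paths are hypothetical), is to fetch the same host's compiled profile from each slave and diff the results:

    # Fetch the profile published by the SL5 and SL6 slaves and compare.
    wget -qO /tmp/profile.sl5.xml http://old-inf-slave/profiles/somehost.xml
    wget -qO /tmp/profile.sl6.xml http://new-inf-slave/profiles/somehost.xml
    diff -u /tmp/profile.sl5.xml /tmp/profile.sl6.xml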

DIY DICE

DIY DICE was the first LCFG service to be hosted on SL6. Before we did the upgrade we tested the LCFG server code on SL6.

Disaster Recovery Service

The DR server sauce was upgraded to SL6 after the other LCFG servers. We noted these points before the upgrade:

  • It's important to remember to preserve the current contents of the sauce disk space as it would take a long time to regenerate it.
  • test a slave using sauce as a master server
  • test a client using sauce as a slave server
  • test a client using sauce as a package server

After the upgrade these tests showed up a couple of problems, as noted at the 29 May MPU meeting. These were subsequently fixed.

PkgForge master

This upgrade was fairly straightforward. Although PkgForge has a large set of package dependencies, these had already been prepared when the client tools were upgraded to SL6 some time before. The only issue was a dependency on mod_waklog, which needed patching before it could be used on SL6 with openafs 1.6.

LCFG Web Services

This upgrade was held back for a long time whilst we resolved various issues with nexsan storage arrays and multipath fibre-channel support on SL6.

The original plan was to move this service from budapest to bakerloo but in the end we decided that hosting it on real hardware was not justified and instead we moved it to KVM (polecat). This was a useful process since it gave us a chance to learn how to configure extra disks for VMs.

The locally-written software used to generate the main LCFG website all worked fine on SL6. As part of the upgrade we decided to switch to the latest version of the TWiki software. This presented a few problems since it was a major version change (4 to 5); in particular we had problems with theming until Alastair worked out the magic to make version 5 use the same skin as version 4.

LCFG Bug Tracking

On SL5 all of the LCFG web services were hosted on the same machine. To simplify the configuration we moved the LCFG Bugzilla service to a separate KVM guest for SL6.

At the same time Bugzilla itself had to be upgraded from the outdated 3.0 series to the new 4.2 series. This was the latest stable release at the time of the upgrade and is likely to be maintained and receive security fixes for some time to come.

The main points to note were how to get working RPMs for Bugzilla 4.2 and the steps involved in upgrading the Bugzilla setup from the old version to the new. Both of these points are covered on the ManagingBugzilla page.
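
The broad shape of such an upgrade, as a hedged sketch only (the real procedure, paths and database names are documented on the ManagingBugzilla page), is:

    # Dump the existing database, install the Bugzilla 4.2 RPMs, restore
    # the dump on the new host, then let Bugzilla migrate its schema.
    mysqldump bugs > bugs-3.0.sql
    # ... install the 4.2 RPMs and restore the dump on the new host ...
    cd /usr/share/bugzilla && ./checksetup.pl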

VMware hosting

Since the free version of VMware we used ceased to be supported some time ago there was no possibility of our upgrading it to SL6 - indeed we had to maintain an outdated version of SL5 specially to support the VMware service. Instead VMware has been replaced by the simple KVM hosting service.

KVM Hosting

This service started on SL6 so didn't need an upgrade.

Storage Array Monitoring

The old VM zig, used for monitoring the IBM storage array, had to be a self-managed machine because it required a specific version of mysql which conflicted with the one installed as part of DICE SL5. With the much smaller SL6 server installations, and the resulting reduced dependencies, it is now simple to get the monitoring software installed. The new VM giz runs DICE SL6 so it now benefits from standard DICE nagios monitoring.

Summary of Machine Moves

The MPU decided to combine the SL6 upgrades with hardware upgrades and moves, and a number of the upgrades were done by preparing a replacement SL6 service on a separate machine then switching over to it, so many MPU services moved host during this project. Excluding the new KVM servers, our physical host count dropped by three.

Packages Master
This service stayed on brendel, a Dell R200 from 2009.
Packages Slaves and Export
This moved from its SL5 KVM cochin to a new SL6 KVM porto.
Package Cache and PXE
The package cache service, and PXE service, moved from split and schiff to two new Dell R210 II machines hare and wildcat. split was retired and schiff redeployed as our new LCFG master server.
Packages Mirrors AFS
The packages mirrors AFS service stayed on telford, a Dell 1950 from 2008, though the data itself is on the MPU SAN.
LCFG Master
This moved to schiff from our elderly Dell 860 tobermory which was retired.
LCFG Slaves
These moved from elderly and slow Dell 860s mousa and trondra (now retired) to KVM guests metsu and bol. The test slave moved from KVM alboran (SL5) to vole (SL6) and the inf level slave moved from ashkenazy (SL5) to KVM barents (SL6).
DIY DICE
This service was already virtualised. It moved from KVM madurai (SL5) to KVM sernabatim (SL6).
Disaster Recovery Service
The LCFG and Packages DR service stayed on sauce, an HP DL180 from 2010.
PkgForge master
This moved from VMware ardbeg (SL5) to KVM pinemarten (SL6).
LCFG Web Services
These moved from budapest, a Dell 1950 from 2007, to two KVM virtual machines. The bugzilla service moved to heda and the other services moved to polecat.
VMware hosting
This service could not be upgraded to SL6. It has been deprecated and has now been shut down. Former VMware host servers are being redeployed to host other services.
KVM hosting
This service started on SL6 so needed no upgrading. The number of KVM host servers has grown thanks both to purchases of new hardware and to redeployment of former VMware host servers as additional KVM host servers.
Storage Array Monitoring
This was already virtualised; it moved from KVM domain zig to giz for the SL6 upgrade.

Discussion

We thought it would be useful to look through the list of upgraded services and split them into two groups - those for which the upgrade to SL6 was necessary or very beneficial in the short-term and those which could have been left until we came around to replacing hardware or reinstalling for other normal operational reasons.

The LCFG slaves (and thus also the DIY DICE and DR servers) had to be upgraded to SL6 to allow the switch to the new LCFG compiler. It was just not possible to build the packages on SL5 without an enormous amount of backporting of Perl modules. There was also a critical issue related to the use of the Safe module in the version of perl installed in SL5 which meant we had been pinning it back to an ancient version for quite a long time.

Upgrading the PkgForge service to SL6 was of great benefit since it greatly reduced the number of Perl modules which we had to build and maintain ourselves.

The LCFG web server itself did not specifically require upgrading to SL6 but we were already overdue to upgrade bugzilla, websvn and twiki to much newer versions with security support. Upgrading to SL6 meant we had access to far more new Perl modules which reduced the effort of upgrading that software.

The packages service (master, slaves and AFS server) could have remained on SL5 for quite some time. There was no pressing need to upgrade and if anything upgrading at that point required more effort since we had to work out solutions for the perl-AFS problems.

It is clear that in some cases the work of upgrading to SL6 could have been left until we reached a point where we needed to replace the hardware. Also, in many cases we rolled into the "SL6 Upgrade" process the upgrading of the hardware and the updating of additional software. This undoubtedly means the effort attributed to this project has been inflated somewhat by what would otherwise have been classified as operational work.

Hours Taken

Period Hours
2011 T3 27
2012 T1 168
2012 T2 82
2012 T3 125
2013 T1 14
Total 416