SL6 Upgrade for Services Unit

Overview

This is the final report for project 219 - SL6 upgrade for the services unit. An OS upgrade is typically a roughly biennial event and, as can be seen later, consumes a lot of effort. This report only tracks the server/service SL6 effort; effort to provide SL6 versions of services unit components used more widely (eg the mail component) was not tracked as a project.

The list of machines we had to deal with was tracked on the wiki page ServicesUnitServers2SL62, some 50 machines.

This work was also seen as an opportunity to replace older physical machines with Virtual Machines, and it somewhat overlapped with the move from the deprecated VMware VM service to the new KVM based VM service.

Work started around May 2012 with the goal of having all our services/servers upgraded by the end of 2012; however, this overran into April 2013. During this time we spent about 14 FTE weeks, against an estimate of 10.

The Plan

Due to the number of machines, the basic plan was to do as little as possible to migrate each machine/service to SL6. We did not upgrade the applications unless we had to. However, one thing forced upon us was migrating any use of the apache component to the apacheconf component. This made moving some of the web based services less straightforward, though now that it's done it should be of benefit in the long run.

Obviously we also tried to cause as little disruption to the end users as possible (this includes the outside world in the case of external facing web pages).

Generally the task was left to Neil to just get on with the bulk of the machines, with (roughly) Gordon dealing with printing, Roger with the blog, samba, roombooking and ifile services, and Craig with the rfe and TiBS services.

AFS File servers

Where possible these were upgraded to SL6 (or, in the case of the older Forum hardware, the physical machines were replaced with newer machines) without the users noticing. All user volumes (and most group ones) were migrated off the server to be upgraded/replaced, so that the actual upgrade could take place without causing interruptions to users.

Where there were large group volumes on a server that would take a long time to move (if we even had the space), the owners were contacted to arrange a convenient time to upgrade/replace the server. In some cases this would only be for a few minutes while we unmounted the SAN volume from the machine to be upgraded/replaced and remounted it on another machine (one that had already been SL6'd, so we wouldn't have to move it back). This process led us to formulate our AFSMovingPartitions wiki page.

Once a server had been SL6'd it became the home of the volumes to be migrated from the next SL5 machine to be done, so volumes ended up shuffling around. Typically a volume was only moved once, ie it did not end up back where it started.

The moves could take a day or more to empty a server of all its volumes, but a script (~neilb/bin/share/afsmigrate-partition) was found and modified for our purposes, which made moving the contents of a partition fairly straightforward.
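
At its core that kind of partition evacuation boils down to listing the volumes on the source partition and moving each one with vos. The following is only a minimal sketch of the idea, not the actual afsmigrate-partition script, and the server/partition names are made up:

    # List the RW volumes on the source partition and move each one to the
    # (already SL6'd) destination server/partition.
    SRC=oldafsserver;  SRCPART=vicepa
    DST=newafsserver;  DSTPART=vicepa

    for vol in $(vos listvol $SRC $SRCPART -quiet | awk '/RW/ {print $1}'); do
        vos move $vol $SRC $SRCPART $DST $DSTPART -verbose
    done

The real script presumably adds the housekeeping you would want (space checks, handling of read-only clones, logging), which is why reusing and adapting it was worthwhile.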

NFS File servers

We have a small number of NFS file servers, and some of the AFS servers also have NFS exports. In these cases we don't have the AFS benefit of being able to move data without people noticing. Fortunately this data is usually only used by a few people for research purposes, and so contacting them and just arranging for a suitable break in service was acceptable.

Web Servers

Luckily a lot of the smaller web servers were being virtualised, so we were able to create a new SL6 (apacheconf) version of the service on a new VM before just switching the DNS (or moving the IP) to point at the new machine. Similarly, the homepages/wiki service was moving to new hardware. This also meant that some amount of testing by the users of their content on the new service was possible before going live. Generally content was not an issue unless it used PHP (the version increased from 5.1 to 5.3) or was a CGI.

More problematic were the main physical web servers that were not being virtualised/replaced: www.inf and wcms.inf (Plone). In their case a spare machine was used to develop and test the SL6 version of the service. When it came to upgrading the real machine, the test machine stood in for the live service while the regular hardware was reinstalled, and then the DNS or IPs were moved back to the original machine.
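
For cutovers like these the sanity checks are simple; as a rough illustration (the IP address below is a made-up placeholder, not a record of the exact commands run at the time), one can confirm where the name currently resolves and fetch the front page from the new or stand-in box directly by overriding the Host header:

    # Where does the service name point right now?
    dig +short www.inf.ed.ac.uk

    # Does the new/stand-in machine serve the site as expected?
    curl -sI -H 'Host: www.inf.ed.ac.uk' http://192.0.2.10/ | head -1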

Other web services

The likes of roombooking, blog, ifile, password portal, etc were managed pretty much the same way as the plain web servers. One typical difference, though, was the consistency/migration of an underlying database to the new machine. For example, the roombooking service uses a MySQL DB running on the server. If we simply installed and prepared the SL6 VM version to run alongside the physical SL5 version, we couldn't just change the DNS to point at the new SL6 server and leave it at that: as the SL6 version would have been created using a copy of the SL5 MySQL DB at the time, any bookings made on the SL5 service prior to the switch over would be lost. So in situations like this we needed to stop any further updates to the SL5 service and DB, dump the database and restore it to the new SL6 service, then update the DNS to point at the new machine. This was usually straightforward, but if the software had changed, eg the version of Wordpress on the SL6 machine, then we also had to run some sort of migration step on the data.
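
A minimal sketch of that dump-and-restore step, using a made-up database name and generic commands (on DICE the web service is managed by its component, so the stop/start mechanics differed in practice):

    # On the old SL5 server: stop the web front end so no further updates
    # arrive, then take a final dump of the database.
    service httpd stop
    mysqldump roombooking > roombooking.sql

    # On the new SL6 server: the database already exists from the initial
    # build, so reload it from the fresh dump, then switch the DNS.
    mysql roombooking < roombooking.sql

Where the application version had also changed (eg Wordpress), its own upgrade/migration routine would then be run against the freshly restored data.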

Printing

The upgrade of the student print server was fairly straightforward; it was already running a newer version of CUPS than the staff server, to support Kerberos. The staff server, however, caused problems when it was deployed, as it became clear that Mac users could not print. Further testing showed that the newer, locally compiled version of CUPS used on the student server was the problem when installed on the staff server (the student server only supports DICE clients, not Windows or Mac). Reverting to the stock SL6 supplied CUPS RPM solved the problem.
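
In package terms the fix amounts to something like the following, given only as a hedged illustration of the idea rather than the exact commands used:

    # Check which cups package is installed, then fall back to the stock
    # SL6 package if a locally built one is in place.
    rpm -q cups
    yum downgrade cups

Since RPMs on DICE servers are managed centrally, in practice this would be a change to the machine's package list rather than a hand-run yum command.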

TiBS

The upgrade of the backups server was complicated by two issues: the need to upgrade TiBS to the latest version to support the 64-bit architecture, and the introduction of new physical hardware to run the backup server on. At first the intention was simply to install the new version of TiBS on the new hardware and migrate the service across, but it soon became clear that this would involve an unacceptable risk of major interruptions to the backup service. Instead, the existing backup server was upgraded to SL6 and TiBS upgraded on it. Once we were happy that the new version of TiBS was behaving correctly under SL6, the new backup hardware was commissioned and the migration completed. This process, although safer, obviously took longer than had been anticipated.

Mail

The virtual mail relay and authenticated smtp service had already been SL6'd with the move from VMware to KVM, leaving only the main mail.inf service. As this was being virtualised too, we were able to have the new service running alongside the old one. The switch then involved a short outage while the mail (and lists) services were stopped on the old hardware, the last of the data was rsynced across to the new machine, some mailman migration steps were run, the IP address was moved to the new machine and the mail services were restarted.
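
The final-sync part of that cutover is essentially an rsync of the mail and list data once everything is stopped. A minimal sketch, assuming the stock spool/mailman locations and a made-up destination host name (the mailman migration steps themselves are omitted):

    # With mail and mailman stopped on the old server, push across anything
    # changed since the last pre-copy, preserving ownership and hard links.
    rsync -aHv --delete /var/spool/mail/  newmailhost:/var/spool/mail/
    rsync -aHv --delete /var/lib/mailman/ newmailhost:/var/lib/mailman/

Doing a full copy while the old service was still live, and only this short final pass during the outage, is what kept the downtime short.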

Problems, Gotchas, etc

* Where multipathing was used for SAN mounts, SL6 changed the path to the devices, so we had to remember to replace /dev/mpath/ with /dev/mapper/ in the fstab (see the sketch after this list).

* There was also a quirk where we used multipath (for reasons of consistency, and it also seems to help with SAN oddities) on machines with only a single FC connection: the initial SL6 multipath daemon did not work with single-path multipaths!

* Plone RPM. There is no standard Plone RPM shipped with SL6 (as there was for SL5). Part of the problem is that we are using the old version 3 of Plone, where the current version is 4. Even so, it seems even newer versions of 3 are virtually unpackageable, and the old SRPM won't build as it has a dependency on an old version of python. We managed to come up with something that packages, and we upgraded to a slightly newer version of Plone 3 at the same time, though really we should see if we can move to the current Plone 4.
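
For the first item above, the fstab change is a simple path substitution; a sketch, with a made-up device name and mount point:

    # SL5 fstab entry for a multipathed SAN filesystem:
    #   /dev/mpath/mpath0p1   /disk/san0  ext3  defaults  0 0
    # SL6 equivalent after the device path change:
    #   /dev/mapper/mpath0p1  /disk/san0  ext3  defaults  0 0
    sed -i 's|/dev/mpath/|/dev/mapper/|g' /etc/fstab

However fstab is managed on a given machine, the key point was to make sure the entries referred to /dev/mapper/ before rebooting into SL6.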

Effort

Time spent per period:

    Period     Minutes    Hours   FTE days
    T1 2012       1420    23.67       3.38
    T2 2012       5415    90.25      12.89
    T3 2012      12806   213.43      30.49
    T1 2013      10173   169.55      24.22
    Totals       29814   496.90      70.98

That works out at 14.2 FTE weeks (70.98 FTE days at 5 days per week, where an FTE day is 7 hours) crammed into one year. This is purely time spent by the Unit COs; CSOs were not involved in any of the work.

Conclusion

As other groups have mentioned, some of our servers/services could quite happily have been left at SL5, perhaps only needing to be upgraded to SL6 when hardware replacement, or a specific version of some piece of software, required it. Notable exceptions are the web servers that serve user content or use CGIs: users only have the chance to develop and test their content in the prevailing desktop environment, so those servers should track that version of the OS.

Having said that, it's nice to have a homogeneous set of servers, so that they all behave the same way, have the same tool set, etc. For example, if a server is using multipath then the devices will all be under /dev/mapper, and there's no need to remember "on this type of server it's like this, and on that server it's like that".

The new minimal server RPM set is certainly a big gain for installation times.

Having the spare AFS space to evacuate AFS servers of all their volumes is great for taking the pressure off trying to schedule downtime with the users.
