SL6 upgrade for RAT Unit: Final Report

This report covers the work required to upgrade RAT unit servers and services from SL5 to SL6. Desktop software was dealt with elsewhere.

Work Done

A scan of RAT services and header files produced an upgrade list, stored at RATSL6Upgrade. An initial pass classified which services could be retired, which could be easily upgraded and which were likely to cause problems. We also identified services which could be moved off bare metal onto KVM. Finally, we identified older servers which should be decommissioned.

Theon

Theon services (Portal, Database, UI, Trac "basecamp", supporting headers and LCFG components) were upgraded incrementally in advance of this project as part of developing test Theon services, so it is difficult to ascribe effort directly to the project for these. However, the upgrade was straightforward: most of the Theon software requirements on SL5 were already custom-built (PostgreSQL, Trac, etc.), so in many cases upgrades involved very little more than testing and reinstallation with the same RPMs (or in some cases reverting to upstream RPMs). The bulk of the work directly incurred against this project was scheduling the incremental OS switchover of the primary Theon servers; this posed no problems.

A large number of general-purpose headers were upgraded to support SL6 as part of this process, including Python and Apache WSGI infrastructure, PostgreSQL, pgluser and other widely-used services.

Coltex

The upgrade initially went smoothly; however, having ported all the software and configured Apache, we found some kind of authentication issue with SVN/cosign. Work on this was delayed to deal with RT. On switching back, and after a great deal of investigation, the problem turned out to be a bug hidden deep in mod_dav_svn (see Project206FinalReport).

Exam Preparation Desktops

Not strictly a server, but conversion was tied into this project. The exam preparation desktops converted easily to SL6, but required changes to the firewall configuration and NFS, as well as improvements to the SL6 LaTeX distribution.

Lab Exam Environment

Again, not solely a server, but conversion was tied into this project. This required widespread configuration changes, including support for the switch from GDM to KDM, significant changes to firewall and component configuration, and porting / rebuilding of the exam-specific software to accommodate new package versions. The SL5 exam submission server was also replaced by two SL6 servers; this was relatively trivial as much of the configuration work had been completed by this point.

ISSRT/RT

For SL6 the version of RT used with ISSRT was upgraded to RT4, which involved building a large number of Perl modules. After a couple of initial attempts, including using RT's own built-in CPAN module install script, a repeated recursive brute-force build, along with a few tailored hand tweaks (generally bootstrapping specific modules by installing them from CPAN as root and then building a module from an RPM), managed to satisfy RT4's voracious Perl dependency requirements.
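The brute-force pass amounted to a blunt loop: attempt to build every outstanding module RPM, keep whatever succeeds, and repeat until a pass makes no progress, at which point the stragglers need hand tweaks. A minimal sketch of that idea follows; the SPECS layout and rpmbuild invocation are illustrative placeholders, not the actual build scripts used.

<verbatim>
# Illustrative sketch only: repeatedly attempt to build a set of Perl module
# RPMs until a pass completes with no new successes. Paths are hypothetical.
import glob
import subprocess

def build(spec):
    # A module whose dependencies are not yet installed will simply fail
    # this pass and be retried on the next one.
    return subprocess.call(["rpmbuild", "-ba", spec]) == 0

pending = set(glob.glob("SPECS/perl-*.spec"))  # hypothetical spec layout
while pending:
    built = set(spec for spec in pending if build(spec))
    if not built:
        # No progress: the remainder need hand bootstrapping (e.g. install
        # the module from CPAN first, then rebuild the RPM).
        print("Stuck on: " + ", ".join(sorted(pending)))
        break
    pending -= built
</verbatim>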

Since we were upgrading for ISSRT we also took on the work of upgrading the CO RT service. This involved removing all code customisations, either by using built-in configuration changes, by defining a customised RT lifecycle, by using RT4 modules, or by dropping the embedding of the RT number in the plussed part of the email address. A number of other outstanding issues were resolved (and almost as many new ones were raised). The RT headers for ISSRT and RT were consolidated, and it is now possible to more or less throw together a generic RT service. The services themselves were moved to KVM instances and the old hardware retired. A development RT service was installed, and it is planned to run it with a daily sync from rt4. There was a first pass at switching the back-end database to PostgreSQL, but this was deemed non-trivial and has been put off for now for rt4. As a result of this preliminary work the most recently instantiated RT server is kmrt.inf.ed.ac.uk.

Research servers

A number of GPU servers (roswell, bonnybridge and rendlesham) were upgraded to SL6. General-purpose (largely institute-specific) compute servers were converted following user approval, largely without difficulty, though in many cases this triggered software upgrade / port requests.

Webmark and other web services

Updating Webmark for SL6 required widespread improvements to server software headers for the LaTeX distribution, along with changes to install the appropriate PHP modules. The software itself was largely unchanged.

All PHP-based web services required tweaks for stricter PHP compliance, such as using the full <?php open tag at all times rather than the short <? form.
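One way to find the offending files up front is a simple scan for short open tags. The sketch below is hypothetical rather than the script actually used: the docroot path is a placeholder and the regex is only a rough heuristic.

<verbatim>
# Illustrative sketch: flag PHP short open tags ("<?" without "php"), which
# break when short_open_tag is disabled. The docroot path is a placeholder.
import os
import re

SHORT_TAG = re.compile(r"<\?(?!php|=|xml)")  # rough heuristic only

for dirpath, _dirs, files in os.walk("/var/www/html"):  # hypothetical docroot
    for name in files:
        if not name.endswith(".php"):
            continue
        path = os.path.join(dirpath, name)
        with open(path) as fh:
            for lineno, line in enumerate(fh, 1):
                if SHORT_TAG.search(line):
                    print(path + ":" + str(lineno) + ": short open tag")
</verbatim>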

License Servers

Work was done to port the flexlm component to SL6, which required a little effort to retain compatibility with the binary distribution of FlexLM itself. In some cases newer licence vendor daemons were sought.

Cluster

The cluster upgrade was staged in order to maintain some level of service.

First, the GPFS data and metadata nodes were upgraded in sequence, with the filesystem remaining available and data being copied off nodes as they went out of service. As the hardware did not support a 64-bit OS, most of the GPFS data nodes remained on 32-bit SL6 rather than SL6_64.

Next, the Hadoop nodes were upgraded to SL6_64, with the Hadoop version remaining at 0.22.0 at the course director's request. An initial batch was decommissioned from the cluster and the data was rebalanced across the remaining nodes. This group of nodes was then set up to form the core of a new SL6_64-based cluster and, following a tidy-up of user data on HDFS, as many nodes as possible were transferred to the new SL6_64-based cluster. A cluster-to-cluster HDFS copy was then performed (sketched below) and, once all the user data was on the new cluster, the remains of the SL5 cluster were decommissioned and added in turn to the SL6_64 cluster. At the same time, much older desktops were replaced with desktops coming out of the labs to improve the cluster performance and reduce its carbon footprint.

In the midst of this it was decided not to upgrade gridengine because the number of users had fallen away; some preliminary work was abandoned and the service was removed as the nodes were upgraded. Some specific servers were reallocated as GPFS data nodes and the very old 850s were retired. Unfortunately, in this process we had several independent hardware failures and this time the filesystem could not be transferred. Given that we had data loss on the filesystem, that the filesystem was by definition risky and unarchived, and that only one user had expressed a passing interest in any data, a new filesystem was created.
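The cluster-to-cluster HDFS copy mentioned above is the kind of job Hadoop's distcp tool handles. A minimal sketch of driving it is given below; the namenode URIs and path are placeholders rather than the actual cluster addresses, and this is an illustration, not the exact procedure used.

<verbatim>
# Illustrative sketch: copy user data between two HDFS clusters with distcp.
# The namenode URIs and the /user path are hypothetical placeholders.
import subprocess

SRC = "hdfs://sl5-namenode:8020/user"   # old SL5 cluster (placeholder)
DST = "hdfs://sl6-namenode:8020/user"   # new SL6_64 cluster (placeholder)

# distcp runs as a MapReduce job and copies the tree in parallel
# across the datanodes.
ret = subprocess.call(["hadoop", "distcp", SRC, DST])
if ret != 0:
    raise SystemExit("distcp failed with exit status %d" % ret)
</verbatim>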

Misc

Time Spent

Approximately 18 FTE weeks.

Conclusions/observations

This upgrade was fairly painless, with a significant amount of work being done by CSOs and with most of the problem cases being unforeseeable without trialling an install on SL6. If we're running services, particularly complicated ones, we probably shouldn't rely on OS upgrades to drive us to upgrade the software; whilst an OS upgrade does generate hard deadlines, these are not always desirable.

In retrospect, swapping attention from Coltex to RT and then back was not ideal, and probably added 5 FTE days to the workload for the project in context switching alone.

Some of the workload could have been pre-empted. For example, some of the RT Perl modules were either upgrades of modules used on desktops or had dependencies with earlier versions on the desktop. Also, building the RT RPMs required recreating an intensive build environment used for building the Haskell modules, which incurred a small overhead. It may be worth building RPMs for servers/services, even if just as a first cut, at the time we do the desktop build.

-- IainRae - 08 Aug 2013
