SL7 Upgrade for MPU Services

This is the final report for the SL7 Upgrade for MPU Services, which is project #357.

The aim of the project was to upgrade all MPU-run services from SL6 to SL7.

The project was planned and tracked at SL7MPUServersUpgrade.

Service By Service

Local Package Master

This was a fairly straightforward upgrade; the only changes required were for Apache 2.4. Some changes were made so that fewer manual steps are required to reinstall the server, which has since proved useful when we merged the package mirror and master servers into a single machine.

Package Mirror

This was a fairly straightforward upgrade; the only changes required were for Apache 2.4.

Package Slave

This was a straightforward upgrade once apache-waklog had been upgraded to SL7. The only changes required were for Apache 2.4.

Package Caches

The Squid caching server version changed from 3.1 to 3.3 with the upgrade to SL7; this required a bit of investigation and various changes to the component. To simplify future upgrades, the component templates were changed to reflect the intended Squid version rather than the OS version.

These servers also provide tftp and NFS for the PXE service. Porting tftpd to SL7 (and systemd) proved to be more difficult than expected due to a bug in the TFTP server (Red Hat bug #1023645): it cannot listen for IPv4 and IPv6 on the same port, and unless instructed to listen for just one or the other it defaults to listening for only IPv6. We got round this by restricting it to IPv4.
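
For illustration, a systemd drop-in of the following form is one way of forcing an IPv4-only listener (this is a sketch only; the override we actually ship is generated by LCFG and may differ in detail):

    # /etc/systemd/system/tftp.socket.d/ipv4.conf -- illustrative drop-in only
    [Socket]
    # clear the stock listener and bind explicitly to IPv4
    ListenDatagram=
    ListenDatagram=0.0.0.0:69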

LCFG Master

This was the most complex upgrade. To simplify the process the ordershost was moved to a separate VM. The upgrade process requires many steps, so an exhaustive plan was produced based on the previous plans for SL6 and SL5, and the process was practised several times on a spare machine to ensure it would all run smoothly. On completion of the upgrade we had a problem with access to the inventory data because the rsync process preferred IPv6; thankfully we had a complete copy of the data from before the reinstall, so we quickly restored it. Unfortunately, this had a knock-on effect for the LCFG slaves, which meant we had to restart them with empty caches.

The rfe daemon had to be converted to work with systemd, and the associated component was completely rewritten in Perl with the templates converted over to Template Toolkit. The GSSAPI and Authen::Krb5 Perl modules also required some attention.
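
For a rough idea of what the conversion involves, a minimal unit for the daemon looks something like the following; the real unit is generated by the rewritten component, and the daemon path and options shown here are assumptions, not the actual values:

    # rfe.service -- minimal sketch only, not the component-generated unit
    [Unit]
    Description=rfe daemon
    After=network.target

    [Service]
    # path and options are illustrative
    ExecStart=/usr/sbin/rfed
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target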

The websvn web interface has not seen any maintenance in a very long time, so it had to be replaced with viewvc. This required the creation of a new component; thankfully viewvc is easier to configure than websvn.
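
As an indication of why it is simpler, most of the viewvc setup reduces to a few lines in viewvc.conf along these lines (the repository name and path here are made up, not our real values):

    # viewvc.conf fragment -- illustrative values only
    [general]
    svn_roots = lcfg: /var/svn/lcfg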

The Apache config had to be updated for 2.4, and mod_krb5 was replaced with mod_gssapi. The locally-written mod_user_rewrite needed to be ported to the "new" LCFG build tools; thankfully it still worked with Apache 2.4.
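
Assuming the replacement module is the standard mod_auth_gssapi, the authentication part of a protected location ends up looking roughly like this (the location and keytab path are illustrative only):

    # sketch of GSSAPI authentication under Apache 2.4 -- not our actual config
    <Location /private>
        AuthType GSSAPI
        AuthName "LCFG master"
        GssapiCredStore keytab:/etc/httpd/httpd.keytab
        Require valid-user
    </Location>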

A huge number of component defaults packages had to be built, and many versions in the package lists had to be updated to match the components. A few old, unused defaults packages were removed, and there was some shuffling of packages between the lcfg and dice lists.

LCFG and DIY DICE Slaves

The upgrade of the LCFG slaves was fairly straightforward; the only changes required were for Apache 2.4 support. A new feature was added to the LCFG server code to use templates to support the new-style Apache 2.4 authorization; this also allows other sites to have greater control over the authorization for their local infrastructure.
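
The change in question is the standard Apache 2.2 to 2.4 authorization rework; roughly speaking, the templates now generate directives of the second form below rather than the first (the network shown is the documentation range, not our real ACL):

    # Apache 2.2 style (old)
    Order allow,deny
    Allow from 192.0.2.0/24

    # Apache 2.4 style (new)
    Require ip 192.0.2.0/24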

We also took the opportunity to increase the memory in the two main slave servers from 2GB to 8GB so that all the LCFG profile information could be held in memory at once; this led to a substantial performance improvement.

The test server and DIY DICE server were upgraded first, which gave us confidence that the upgrade of the main servers would go smoothly. The DIY DICE server was also tidied up to remove files belonging to long-departed users and to update all the installbase/installroot profiles for the latest platforms.

LCFG inf-level test server

This is an inf-level LCFG server which compiles test profiles for the weekly testing process. To ensure that both SL6 and SL7 work correctly we created a new SL7 VM to run alongside the current SL6 VM. Since we had already done the main LCFG slave servers, this was a fairly simple upgrade. Upgrading this server was a useful stepping stone towards upgrading the LCFG master since it allowed us to check easily for component defaults packages which had not yet been built for SL7.

DR Server

This server is a combination of LCFG master, slave and package mirror, which means the Apache config is fairly complex; it took quite a bit of effort to get all the virtual hosts configured correctly for Apache 2.4. Upgrading this server was a very useful test for the upgrade of the live LCFG master server. To improve the DR mirroring process we added some new features to the rmirror component.

PkgForge

As part of the OS upgrade the PostgreSQL DB was upgraded to 9.6, and the pkgforge software was modified to utilise some of the new features (e.g. being able to store JSON data). This opens the way for a reworking of the web interface at some point to improve access to the job information. Various longstanding bugs were resolved; in particular, the timestamp handling was improved. Some minor code changes were also required to cope with changes to the Moose and DBIx::Class Perl modules.
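
For example, with 9.6 a job record can carry a JSON document in a jsonb column and be queried on its contents directly; something along these lines, although the real pkgforge schema and column names will differ:

    -- illustrative only, not the actual pkgforge schema
    ALTER TABLE job ADD COLUMN report jsonb;
    SELECT id, report->>'status'
      FROM job
     WHERE report @> '{"platform": "sl7"}';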

LCFG website

This was a fairly straightforward upgrade; the only changes required were for Apache 2.4. To keep the wiki secure it was upgraded to TWiki version 6, which took a bit of effort; it also looks prettier than the previous version.

LCFG bugzilla

When we started looking at this we aimed to combine the move to SL7 with an upgrade to the latest stable release of Bugzilla, version 5.0.3. However, there were package dependency issues, so we also looked into a possible SL7 upgrade of the version we were using, 4.4.12; that proved to have package dependency issues of its own on SL7. Eventually we solved the package issues of 5.0.3 on SL7, so we decided to go with that.

A test site was brought up at testbugs.lcfg.org, which allowed us to tackle problems one by one. A failure to send email was traced to a new configuration rule that all mail messages must be sent from a valid mail address (instead of just from "bugzilla-daemon" as before); a new bugzilla.mailfrom resource was added to the LCFG bugzilla component to support setting this. iFriend access was configured and tested with the help of the LCFG community. LCFG branding was installed using a simpler configuration than on SL6. A weekly Apache failure was traced to one Apache configuration file inadvertently being managed by both the apacheconf and the file components: the file component would win this battle when the machine booted, and the apacheconf component would reassert its version at each weekly logrotate run.

Once these problems had been solved, the database contents were transferred from the old VM, the DNS was changed to point bugs.lcfg.org at the new VM, and the various bits of configuration were changed to implement the change of DNS alias from testbugs.lcfg.org to bugs.lcfg.org.
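
As an illustration of the new bugzilla.mailfrom resource mentioned above, a site would set it in an LCFG source profile along these lines (the address here is made up):

    /* illustrative only -- the actual address is site-specific */
    !bugzilla.mailfrom      mSET(bugzilla-daemon@example.org)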

SSH Servers

This was the first user-accessible MPU service to be upgraded, so we had the usual problem of a small number of packages which users need being missing after the upgrade. There were a number of problems with autofs which had not been seen on the desktop; these required tweaks to the systemd config and some effort to be put into improving the component. A couple of users reported that they could not log in due to broken IPv6 support at some ISPs; the only workaround found was to set the ssh client AddressFamily option to inet.
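
The client-side workaround is a one-line ssh configuration change, for example in the user's ~/.ssh/config (the host name here is illustrative):

    # force the ssh client to use IPv4 only for this host
    Host ssh.example.org
        AddressFamily inet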

NX Servers

The NX service was upgraded early so that it would be ready for the start of teaching in semester 1. The main issue was that Gnome 3 does not work with NX, so we had to change the config to provide MATE as an alternative when a user asked for Gnome.

Wake on Lan Service

For the web service, the LCFG configuration was tidied and refactored, package versions were updated and the web configuration was updated to Apache 2.4. For the backend wake service, the wake script was altered to use different command locations on SL7.

IBM DS3524 monitor

As the IBM disk array is no longer providing critical storage and is likely to be decommissioned shortly, it was decided not to upgrade the monitor server (giz) to SL7. We will leave it on SL6 until that platform dies.

BuzzSaw and Log Cabin

The BuzzSaw service upgrade was straightforward. The PostgreSQL DB was upgraded to 9.6, which allowed the use of new features that improved performance (in particular, parallel workers). It was very useful to be able to run the upgraded service on SL7 alongside the old SL6 service for a while to compare results. Some problems with input validation were discovered and fixed.
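
The parallel query support in 9.6 is off by default; enabling it amounts to something like the following in postgresql.conf (the worker counts here are illustrative, not the values we settled on):

    # postgresql.conf -- illustrative settings for 9.6 parallel query
    max_worker_processes = 8
    max_parallel_workers_per_gather = 4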

The Log Cabin web interface required a newer version of Django; thankfully, porting the code to the new version did not require as much effort as feared. There was also an issue related to the removal of the local LDAP server: for some long-running reports the connection would time out after a period of inactivity. This was easily fixed by tweaking the Python code to reconnect whenever necessary.
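
The essence of the fix was to retry with a fresh connection whenever the idle connection had been dropped. A minimal sketch of the idea using python-ldap follows; the real Log Cabin code and server details differ:

    # Sketch only: re-open the LDAP connection if the old one has timed out.
    import ldap

    LDAP_URI = "ldap://ldap.example.org"  # illustrative URI

    _conn = None

    def _connect():
        global _conn
        _conn = ldap.initialize(LDAP_URI)
        return _conn

    def search(base, filterstr):
        global _conn
        if _conn is None:
            _connect()
        try:
            return _conn.search_s(base, ldap.SCOPE_SUBTREE, filterstr)
        except ldap.SERVER_DOWN:
            # the connection was dropped after a period of inactivity; reconnect once
            _connect()
            return _conn.search_s(base, ldap.SCOPE_SUBTREE, filterstr)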

Computing Help

Three new computing help VMs were created (lagun - master, hulp - hot slave, ayuda - devel). The only significant issue was related to the Cosign configuration for Apache 2.4.

KVM Servers

We first upgraded our test KVM servers to SL7. We found that the SL7 qemu-kvm packages do not allow VM migration between servers which do not share storage; none of our KVM servers share storage, so this was a problem. We solved it by replacing the qemu-kvm packages with the qemu-kvm-rhev packages. We also found that a VM which was first migrated from an SL6 server to one running SL7, then power-cycled, could not then be migrated back to an SL6 server; we decided to live with this limitation.

Two of our KVM servers, jubilee and hammersmith, were due for replacement, and the two new replacement servers, girassol and gaivota, were installed with SL7 from the start. After a period of running production VMs on them it was decided to go ahead and upgrade all the other KVM servers to SL7. This was done in a rolling programme of VM migrations, and all the servers were upgraded with very little VM downtime. The spare disks from the replaced KVM servers jubilee and hammersmith were redeployed to the KB servers to provide extra space for temporarily migrated VMs, allowing us to free up the servers for upgrade one at a time. The upgrade of the KVM servers was complete at the start of December 2016.
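
With qemu-kvm-rhev in place, a live migration between two servers with no shared storage copies the disk images as part of the migration; roughly (guest and host names are illustrative):

    # live-migrate a guest, copying its disk images since there is no shared storage
    virsh migrate --live --persistent --undefinesource \
        --copy-storage-all guestname qemu+ssh://desthost.example.org/system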

Software

  • Moose and Catalyst - three days were spent building these Perl frameworks for SL7, including their many Perl module dependencies.

Discussion

Without a doubt, the biggest change was the switch to systemd, which required reworking the configuration for many daemons and tweaking many components. Hopefully now that this is done we won't have to do it again for a long time. We are still learning about the best way to start some services (e.g. autofs) and about systemd's interaction with other systems in general.

IPv6 caught us out on a number of occasions. Access to many of our internal services is restricted by IP address, and those ACLs were only for IPv4; meanwhile, a lot of our software (e.g. rsync and Postgres DB clients) prefers IPv6 whenever it is available, which led to some confusing situations immediately post-upgrade.
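
Where the software has an option, the simplest fix is to force IPv4 explicitly so that connections match the IPv4-only ACLs; for rsync that is just the -4 (--ipv4) flag, for example:

    # force IPv4 for this transfer (source and destination are illustrative)
    rsync -4 -av rsync://pkgs.example.org/module/ /srv/mirror/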

The upgrade of Apache from 2.2 to 2.4 definitely generated a lot of extra work, particularly for the switch to the new-style authorization.

We were caught out by the significantly increased disk space requirement for logging (due to journald); we had to reinstall both of the SSH servers.
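
The journal's disk usage can be capped in /etc/systemd/journald.conf, for example (the limit shown is illustrative, not the value we chose):

    # /etc/systemd/journald.conf -- cap journal disk usage
    [Journal]
    SystemMaxUse=500M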

As well as peer-reviewing critical code, we should also consider reviewing critical configuration, particularly when major changes are being made and/or when fixing the resulting problems would be expensive in effort.

Effort expended

Period      Effort
2016 - T1   1 week
2016 - T2   4 weeks
2016 - T3   11 weeks
2017 - T1   4 weeks
Total       20 weeks

The equivalent project to migrate to SL6 took a total of 12 weeks. Some of this total can be offset by the cost of migrating some services to new hardware.

-- StephenQuinney - 12 Apr 2017
