SL7 server upgrades -- Research and Teaching Unit

Final Report for CompProj:359

Description

The aim of the project was to upgrade all RAT Unit servers - and services - from SL6 to SL7.

This project specifically excluded the complex configuration comprising the lab exam environment (which involved many changes to components of the desktop environment), but did include the lab exam file servers. It also excluded the Desktop and Research & Teaching package upgrades, which were separate projects.

Work was tracked using ServerUpgradeSL7 and the SL7 RT (internal link).

Deliverables

Broadly:

  • All LCFG SL6 services (sourced from DICE-level headers, and their dependencies) to be reviewed and either decommissioned or upgraded to SL7.
  • All LCFG SL6 servers (physical and virtual, sourced from LCFG profiles) to be reviewed and either decommissioned or upgraded to SL7.

Customer

Internal, which is to say all users of any Informatics system, DICE or otherwise.

Time

The project took ~48 FTE weeks(!) of combined CO and CSO time. It is likely that CSO time is under-reported. The project was undertaken primarily in 2016, with some work taking place in early 2017 and (discussions over) the final legacy servers carrying on until 2018.

Though various exclusions are defined above, it's hard to tell to what extent other SL7 upgrade projects overlapped with this one. In many cases SL7 upgrades were done piecemeal as part of other (software or host) upgrades, and effort allocation wasn't always clear in those cases. In other cases, time spent decommissioning a service - which might normally have gone unrecorded - would have been allocated to this project.

It appears that the bulk of this mammoth upgrade effort was simply due to the time-consuming nature of the upgrade process and the sheer number of servers we manage - numbers which do not normally weigh so heavily upon us, thanks to the power of our configuration management system. Unlike the SL5 → SL6 upgrade, the software and services were substantially similar and no major changes were required to most processes. The single exception was probably the change from Apache 2.2 → 2.4 (which was not strictly part of the SL7 upgrade), which drove a great deal of new software and configuration work.

It's also clear that there was a unit tendency to allocate routine operational effort to this project if it in any way assisted the migration to SL7. This is not to say that the figure is an overestimate; it is more a note that reporting time and effort in a uniform way over multiple years is a difficult task in itself.

Observations

Most services were ported without issue, though late project reporting has lost many of these details to time. Interesting details, where they can be recovered from notes, are listed here:

Space: The biggest single bottleneck (apart from hands) was storage. In many cases servers held data which had to be preserved between installs, and good practice dictates that the primary copy be manually moved from the old to the new incarnation. Where hardware or VM replacement could take place this could be done in a single operation, but where the physical hardware could not be replaced, or very large quantities of data were involved, we had to decant the data off the server, reinstall, then decant it back. This involved heavy reliance on (and extreme contention for) transfers to and from /disk/huge/, a large partition provided by the services unit. It should be noted that in many cases upgrades and moves were eased by running concurrently with hardware upgrades, or made use of semi-retired / out-of-warranty servers. The benefit of having hardware "lying around" with no specific purpose shouldn't be underestimated!
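
A minimal sketch of this decant / restore cycle, assuming rsync over SSH was the transfer mechanism (the report does not record the tooling actually used); the hostname, source path and staging directory under /disk/huge are hypothetical:

    #!/usr/bin/env python3
    # Hypothetical sketch of the "decant off, reinstall, decant back" cycle.
    # Hostnames and paths (other than /disk/huge) are illustrative only.
    import os
    import subprocess
    import sys

    STAGING = "/disk/huge/sl7-migration/examplehost"   # hypothetical staging area
    SOURCE = "examplehost.inf.ed.ac.uk:/srv/data/"     # hypothetical server data path

    def rsync(src, dst, verify=False):
        """Copy src to dst with rsync; with verify=True, only report differences."""
        cmd = ["rsync", "-aH", "--numeric-ids"]
        if verify:
            cmd += ["--checksum", "--dry-run", "--itemize-changes"]
        subprocess.run(cmd + [src, dst], check=True)

    if __name__ == "__main__":
        step = sys.argv[1] if len(sys.argv) > 1 else "decant"
        if step == "decant":                    # before the SL7 reinstall
            os.makedirs(STAGING, exist_ok=True)
            rsync(SOURCE, STAGING + "/")
            rsync(SOURCE, STAGING + "/", verify=True)   # checksum comparison pass
        elif step == "restore":                 # after the SL7 reinstall
            rsync(STAGING + "/", SOURCE)
            rsync(STAGING + "/", SOURCE, verify=True)

Run with "decant" before the reinstall and "restore" afterwards; the verification pass prints any files whose checksums differ rather than silently trusting the copy.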

It transpired that while /disk/huge may have been appropriately named from a CO perspective, ~10Tb at a maximum of 1Gb/s was (a) a vanishingly small resource when compared to massive research datasets^1, (b) under heavy contention from several units and (c) quite appropriately not named /disk/fast, quickly dropping below network speed in most copies^2. While we thank the services unit for this resource, without which the upgrades would in fact have been impossible, we believe that many of these transfers would have been significantly hastened by the purchase of just a couple of inexpensive HDDs^3.
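
For a sense of scale, a back-of-the-envelope best case, taking the headline figures above at face value, assuming "~10Tb" means ~10 terabytes, and ignoring contention, disk throughput and the smaller real sizes of most transfers:

    # Best-case time to push ~10TB through a 1Gb/s link at line rate.
    # The data makes the trip twice: off the old install and back onto the new one.
    size_bytes = 10e12              # ~10 terabytes (assumed reading of "~10Tb")
    link_bytes_per_s = 1e9 / 8      # 1Gb/s = 125MB/s
    one_way_hours = size_bytes / link_bytes_per_s / 3600
    print(f"one way: ~{one_way_hours:.0f}h, round trip: ~{2 * one_way_hours / 24:.1f} days")
    # => one way: ~22h, round trip: ~1.9 days

Even in this idealised case a full decant and restore of a /disk/huge-sized dataset occupies the link for the better part of two days, before contention or slow source disks are taken into account.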

Servers with CO-run services: in general these were the easiest to schedule; most had good levels of redundancy, or other availability / DR strategies, which allowed us either to upgrade seamlessly or to schedule reinstalls for periods where disruption would be kept to a minimum. In my experience this meant that these were often done first, which I believe was helpful in establishing a good "bed" of basic services and templates that allowed us to schedule research servers with more confidence.

Servers run for research groups: in general these were the most time-consuming (though not the most effort-consuming); even with an "iron fist", scheduling reinstalls is a fiddly operation. The pinnacle of complexity transpired to be the flybrain servers -- a small cluster of hosts running research services which included live (and very popular) web services with unique constraints on uptime, storage and service complexity. This upgrade was incredibly difficult to schedule as it required an unspecified period (estimated at several days) of downtime once all aspects of data transfer and service checks had been performed. It should, clearly in retrospect, have been prioritised higher.

Cross-unit dependencies: RAT spent virtually no time blocked by other units' dependencies. In some cases we were able to make provisional upgrades of components, etc. but in most cases such things were already RAT responsibilities. In other cases other units were able to prioritise work to assist. The SL7 tracker almost certainly helped ensure we didn't forget anything, but didn't feel as if it was particularly essential for cross-unit cooperation.

Systemd: Probably the biggest structural change between SL6 and SL7, this caused much less trouble than anticipated (except for Lab Exam development, covered in a separate project).

^1(many of which do not require backup from us but somewhat paradoxically do require best efforts to preserve under circumstances other than hardware failure)
^2(however it's hard to establish how often this was due to contention on this disk, since in many cases source disks were older and the data more fragmented / in a suboptimal format for such transfers.)
^3(This is not necessarily a recommendation for future projects -- the relative costs and transfer speeds may well not align in the same way next time -- but we strongly recommend that this be evaluated before beginning).

Status

Complete, with exceptions noted above.

-- GrahamDutton
