SL7 server upgrades -- Research and Teaching Unit

Final Report for CompProj:359

Description

The aim of the project was to upgrade all RAT Unit servers - and services - from SL6 to SL7.

This project specifically excluded the complex configuration comprising the lab exam environment, which involved many changes to components of the desktop environment -- though it did include the lab exam file servers. It also excluded the Desktop and Research & Teaching package upgrades, which were separate projects.

Work was tracked using ServerUpgradeSL7 and the SL7 RT.

Customer

Internal, which is to say all users of any Informatics system, DICE or otherwise.

Deliverables

  • All LCFG SL6 services to be reviewed and either decommissioned or upgraded to SL7.
  • All LCFG SL6 servers (physical and virtual) to be reviewed and either decommissioned or upgraded to SL7.


Documentation

Time

The project took ~48 FTE weeks(!) of combined CO and CSO time. It is likely that CSO time is under-reported. The project was undertaken primarily in 2016, with some work taking place in early 2017 and (discussions over) the final legacy servers carrying on until 2018.

Though various exclusions are defined above, it's hard to tell to what extent other SL7 upgrade projects overlapped with this one. In many cases SL7 upgrades were done piecemeal as part of other (software or host) upgrades, and effort allocation wasn't always clear in those cases. In other cases, time spent decommissioning a service - which might normally have gone unrecorded - would have been allocated to this project.

It appears that the bulk of this mammoth upgrade effort was simply due to the time-consuming nature of the upgrade process and the sheer number of servers we manage - numbers which do not normally weigh so heavily upon us, thanks to the power of our configuration management system.

It's also clear that there was a unit tendency to allocate routine operational effort to this project if it in any way assisted the process of migrating to SL7. This is not to say that the figure is an overestimate; it is more a note that reporting time and effort in a uniform way over multiple years is a difficult task in itself.

Observations

Most services were ported without issue, though late project reporting has lost many of these details to time. Interesting details, where they can be recovered from notes, are listed here:

space: by far the biggest single bottleneck (apart from hands) was storage. In many cases servers held data which had to be preserved between installs, and good practice dictates that the primary copy be moved manually from the old incarnation to the new. Where hardware or VM replacement could take place this could be done in a single operation, but where immutable physical hardware or very large quantities of data were involved we had to decant the data off the server, reinstall, then decant it back. This involved heavy reliance on (and extreme contention for) transfers to and from /disk/huge/, a large partition provided by the services unit.
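
For illustration only, a minimal sketch of the decant-off / reinstall / decant-back workflow as it might be scripted; the hostname, paths and the choice of rsync below are assumptions for the example, not a record of the procedure actually used:

    #!/usr/bin/env python3
    # Hypothetical sketch only: hostname, paths and the use of rsync are
    # assumptions for illustration, not the procedure actually followed.
    import subprocess

    HOST = "srv01"                      # hypothetical server being reinstalled
    DATA = f"{HOST}:/export/data/"      # primary copy on the old incarnation
    STAGING = f"/disk/huge/{HOST}/"     # temporary home on the shared partition

    def rsync(src: str, dst: str) -> None:
        """Copy src to dst, preserving attributes and verifying by checksum."""
        subprocess.run(["rsync", "-aH", "--checksum", src, dst], check=True)

    if __name__ == "__main__":
        rsync(DATA, STAGING)   # 1. decant off the server before the reinstall
        # ... reinstall the host under SL7, then ...
        rsync(STAGING, DATA)   # 2. decant the data back onto the new install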

It transpired that while /disk/huge may have been appropriately named from a CO perspective, ~10TB at a maximum of 1Gb/s was (a) a vanishingly small resource when compared to massive research datasets*, (b) in heavy contention among several units and (c) quite appropriately not named /disk/fast, quickly dropping below network speed in most copies**. While we thank the services unit for this resource, without which the upgrades would in fact have been impossible, we believe that many of these transfers would have been significantly hastened by the purchase of just a couple of inexpensive HDDs*** (a rough transfer-time estimate is sketched after the footnotes below).

* (many of which do *not* require backup from us but somewhat paradoxically do require best efforts to preserve under circumstances other than hardware failure)
** (however it's hard to establish how often this was due to contention on this disk, since in many cases source disks were older and data more fragmented / in a suboptimal format for such transfers)
*** (this is not necessarily a recommendation for future projects -- the relative costs and transfer speeds may well not align in the same way next time -- but we *strongly* recommend that this be evaluated before beginning)
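
As a rough back-of-envelope figure for the transfer times involved (assuming the ~10TB capacity and 1Gb/s link quoted above, and ignoring contention, protocol overhead and slow source disks):

    # Lower bound on moving ~10 TB over a 1 Gb/s link, one direction.
    size_bytes = 10 * 10**12        # ~10 TB of data to decant
    link_bps = 10**9                # 1 Gb/s network link
    hours = size_bytes * 8 / link_bps / 3600
    print(f"~{hours:.0f} hours per direction")   # roughly 22 hours, at best

Each round trip (off and back on again) therefore costs the best part of two days of wall-clock time even before contention is considered.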

servers with CO-run services: in general these were the easiest to schedule; most had good levels of redundancy, or other availability / DR strategies, which allowed us either to upgrade seamlessly or to schedule reinstalls for periods where disruption would be kept to a minimum. In my experience this meant that these were often done first, which I believe was helpful in establishing a good "bed" of basic services that allowed us to schedule research servers with more confidence.

servers run for research groups: in general these were the most time-consuming (though not the most effort-consuming); even with an "iron fist", scheduling reinstalls is a fiddly operation. The pinnacle of complexity transpired to be the flybrain servers -- a small cluster of hosts running research services, including live (and very popular) web services with unique constraints on uptime, storage and service complexity. This upgrade was incredibly difficult to schedule as it required an unspecified period (estimated at several days) of downtime once all aspects of data transfer and service checks had been performed.

Status

Complete, with exceptions noted above.

Future Work

Debian, one would assume.

-- GrahamDutton
