Final report for project 362 - SL7 DICE servers upgrade

This project is tracked at https://computing.projects.inf.ed.ac.uk/#362.

Its goal was to "Upgrade all DICE servers from SL6 to SL7. This project will involve highlighting inter-dependencies between the various unit-specific upgrade projects and assisting in prioritising and monitoring the various deliverables of the individual projects."

As the goal suggests, the upgrading of DICE servers was covered by subsidiary projects specific to each unit. This project concentrated on providing coordination between these projects, so that facilities required for upgrades could be provided soon enough, and so that the providers of those facilities understood how much those facilities were needed and by whom.

This report is in two parts. The first part (dependency tracking) covers the inter-project coordination which was the real work of this particular project - the tracking of dependencies and the progress reports. The second part (the upgrade overall) covers the overall upgrade of DICE servers to SL7.

Part 1: dependency tracking

Software choice

We understood that our main task would be to model the dependencies between the deliverables in the entire DICE SL7 server upgrade, in order to make it possible to highlight the cross-unit dependencies, helping them to be tackled according to an agreed timetable. Given that, our first big task seemed to be to identify suitable software.

We could either use software we already had or install and configure something we didn't have. We searched for suitable planning software - how do other people do such searches? - but didn't find anything compelling, so we decided to explore what was already available on DICE. As far as we could see, this boiled down to Bugzilla, Redmine and RT.

For the previous two major upgrades we used Bugzilla to model dependencies. In theory it has all that's required - the "Depends on" and "Blocks" links to other tickets, and the ability to draw graphs of these links. We asked around about how effective that project monitoring had been thought to be. To some extent it had met its goals. However, those efforts generated such legendary blizzards of interlinked tickets that using Bugzilla in this way had become something of a byword for dispiriting drudgery. The projects seem to have suffered because some people got so fed up with updating Bugzilla tickets that they largely ceased to do so. This meant that Bugzilla's picture of the overall upgrade process fell significantly behind reality - leading to confusion and delays, and to other users losing confidence in the system. There seemed to be a general feeling left over from the previous upgrade projects of "please - not Bugzilla again".

Redmine is designed with project planning in mind. It can draw Gantt charts. We could do Critical Path Analysis! It looked interesting. We tried it out. Two things put us off it. Firstly, it seemed to be so widely configurable as to leave us a little lost: we needed local expertise to guide us, but it was a new system to everybody. We worried that the effort of dealing with it would be too much for people to (want to) cope with in the midst of a major upgrade project. We were aiming for as lightweight a process as possible. Secondly, useful Gantt charts can only be produced once every task has been entered and every dependency has been identified and plotted and (crucially) each task has had its expected duration calculated and entered. Given the quiet rebellion there had seemed to be over updating Bugzilla tickets, we thought that the need to calculate an expected duration for each task would probably be a step too far in terms of effort - definitely not a lightweight process.

We thought that, if we were going to model the dependencies between every service, server and software component somehow, the least we could do would be to lessen the necessary learning curve. We reckoned that the software with which people were most familiar, and which could do the job, was RT. Rather than swamp the computing support RT server with hundreds of tickets, we explored the possibility of setting up a dedicated server. This proved a lot easier than we had expected. RAT was really helpful with setting up sl7rt.inf.ed.ac.uk - by a happy coincidence we were considering a new RT server at just the time when they were looking for a test case for their new RT server configuration system.

We entered the MPU tickets as a test and modelled all the dependencies between them. We found that modelling dependencies seemed easy - just link two tickets together in the right way (we chose "Depends on") in a ticket's "Links" section. Graphing those relationships was more fiddly, but we thought that with a bit of practice it could produce very helpful-looking graphs.
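To illustrate the sort of graphing involved, here is a minimal sketch (in Python) of pulling "Depends on" links out of RT through its REST 1.0 interface and emitting a Graphviz graph. It assumes REST 1.0 is enabled on the server; the hostname, credentials and ticket range are placeholders, and this is not the actual tooling we used.

  import re
  import requests

  RT_BASE = "https://sl7rt.example/REST/1.0"       # placeholder, not the real server URL
  AUTH = {"user": "someuser", "pass": "somepass"}  # placeholder credentials

  def depends_on(ticket_id):
      """Return the ids of the tickets that this ticket depends on."""
      r = requests.get(f"{RT_BASE}/ticket/{ticket_id}/links", params=AUTH)
      r.raise_for_status()
      ids, in_field = [], False
      for line in r.text.splitlines():
          if line.startswith("DependsOn:"):
              in_field = True
          elif not line.startswith((" ", "\t")):
              in_field = False  # a new field header ends a wrapped value
          if in_field:
              ids += re.findall(r"/ticket/(\d+)", line)
      return ids

  # Emit a Graphviz digraph of the dependency edges for a range of tickets.
  print("digraph deps {")
  for t in range(1, 201):  # placeholder ticket range
      for dep in depends_on(t):
          print(f'  "{t}" -> "{dep}";')
  print("}")

Piping the output through Graphviz's dot command then gives the kind of dependency picture described above.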

SL7RT

We produced a guide for computing staff.

We thought that we had planned carefully the sorts of relationships which needed to be modelled in sl7rt, but in practice perhaps we weren't precise enough about exactly what data we wanted units to enter, and how we wanted it to be represented.

Some units were faced with an enormous and very time-consuming data entry task, and this led to occasional cutting of corners. For instance, instead of creating links between tickets representing individual services and tickets representing their dependencies, some dependency tickets just had a text comment noting that "eleven services depend on this". Since the whole point of sl7rt was to tease out all of the individual strands of dependency data, this later proved unhelpful.

The point of the project was to expose dependencies, especially cross-unit dependencies, to try to ensure that units wouldn't be left waiting for months for the porting of a service or component which its owners thought unimportant. It did expose some cross-unit dependencies, and it did provide a crude measure of progress which was useful at times. However, it also imposed an extra burden - it made it necessary for each unit to analyse every single one of its services and expose and declare every single dependency, right at the start of the upgrade projects.

Monthly reports

A lot of time was spent creating reports. Each report consisted of a graph (created in LibreOffice using figures generated in SL7RT) showing how many tickets were new, open, resolved or rejected. Each report also listed the tickets that had been resolved since the last report. In an attempt to find bottlenecks, we searched for the tickets that were most depended on by other tickets. The assumption was that resolving these would be a priority, but since some heavily depended-on tickets hung around for quite a while, that did not necessarily seem to be the case.
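As an illustration of how such figures can be extracted, here is a sketch using RT's REST 1.0 search interface to count tickets per status, and its links interface to rank unresolved tickets by how many other tickets depend on them. Again the hostname and credentials are placeholders, and this is not the actual reporting script.

  from collections import Counter
  import re
  import requests

  RT_BASE = "https://sl7rt.example/REST/1.0"       # placeholder, not the real server URL
  AUTH = {"user": "someuser", "pass": "somepass"}  # placeholder credentials

  def search_ids(query):
      """Return the ids of tickets matching a TicketSQL query (format=i gives bare ids)."""
      r = requests.get(f"{RT_BASE}/search/ticket",
                       params={**AUTH, "query": query, "format": "i"})
      r.raise_for_status()
      return re.findall(r"ticket/(\d+)", r.text)

  def blocks_count(ticket_id):
      """Count the tickets that depend on this one (its DependedOnBy links)."""
      r = requests.get(f"{RT_BASE}/ticket/{ticket_id}/links", params=AUTH)
      r.raise_for_status()
      n, in_field = 0, False
      for line in r.text.splitlines():
          if line.startswith("DependedOnBy:"):
              in_field = True
          elif not line.startswith((" ", "\t")):
              in_field = False
          if in_field:
              n += len(re.findall(r"/ticket/\d+", line))
      return n

  # Headline figures for the report: tickets in each status.
  for status in ("new", "open", "resolved", "rejected"):
      print(status, len(search_ids("Status = '%s'" % status)))

  # Possible bottlenecks: unresolved tickets ranked by how many tickets they block.
  counts = Counter({t: blocks_count(t)
                    for t in search_ids("Status = 'new' OR Status = 'open'")})
  for ticket, n in counts.most_common(10):
      print("ticket %s blocks %s other ticket(s)" % (ticket, n))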

Hindsight

We wonder whether the SL7RT infrastructure achieved very much. It did cast light on some cross-unit dependencies, but wouldn't those have come to light anyway?

We managed to avoid creating another blizzard of Bugzilla tickets, but we may have focused on the wrong bit: we eliminated Bugzilla, instead of the blizzard of tickets. Perhaps the natural way to tackle large upgrade programmes is for each unit to have a simple list of services to upgrade, leaving the detail of each upgrade's requirements to be tackled piecemeal.

The blizzard-of-tickets approach has two problems: firstly, it forces units to look in some detail at every one of their services, all at once, perhaps months before those services come to be upgraded - a huge effort; and secondly, the resulting mass of information is hardly used: each unit found it far easier to track its upgrade on a single web or wiki page than in SL7RT.

We don't think that the reports were very useful either. As all units tracked their upgrades on their own pages, much of the content of both the reports and of SL7RT seems to have been superfluous. The reports were time-consuming to compile. A report that focuses primarily on RT statistics, with a large part dedicated to last month's closed tickets, may be less useful than simply reporting which machines or services are left to do.

Recommendations for tracking and reporting

For tracking the next upgrade:

  1. Avoid the blizzard-of-tickets approach.
  2. Units should merely list the machines/services they plan to upgrade (as they all did anyway this time - see the unit upgrade links below) rather than modelling all dependencies in detail at the start of the upgrade effort.
  3. Dependencies only need to be mentioned where they require action from another unit.
  4. Cross-unit dependencies can still be tracked using tickets where desired (whether in RT or Bugzilla).
  5. To ensure that cross-unit dependencies are kept track of and tackled in a timely manner, just make a wiki page listing cross-unit upgrade dependencies, with links to tickets, and have a regular agenda item at meetings to discuss and agree on any problem dependencies.

Reports:

  1. In place of regular reports, a simpler solution might be in order, such as a wiki page reporting each unit's number of machines upgraded, number still to do, and anything blocking further progress, updated at least monthly by each unit. This would make report creation less time-consuming; the resulting report would more accurately represent progress made, and it would flag up pertinent issues directly, rather than leaving problems to be inferred from the previous month's ticket numbers. An example layout is sketched below.
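For illustration only, such a page might be as simple as this (the figures and the blocker are invented):

  Unit                   Upgraded  Still to do  Blocked by
  Managed platform       34        5            nothing
  Services               28        9            awaiting SL7 port of component X
  Research and teaching  40        12           nothing
  User support           15        3            nothing
  Infrastructure         22        7            nothing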


Time spent (dependencies)

  Period                      Hours
  January - April 2016        77 (Chris), 21 (Graham), 17 (Ross)
  May - August 2016           61 (Chris), 37 (Ross)
  September - December 2016   6 (Chris), 21 (Ross)
  January - April 2017        12 (Ross)
  May - August 2017           9 (Ross)
  September - December 2017   21 (Chris)
  Total                       286

That's slightly more than 8 FTE weeks of effort (286 hours at roughly 35 hours per week). This doesn't include the time spent by units discovering, entering and linking most of the SL7RT tickets.

Part 2: the upgrade overall

The total effort includes the following DICE SL7 server upgrade projects:

357: SL7 server upgrade project - managed platform unit

358: SL7 server upgrade project - services unit

359: SL7 server upgrade project - research and teaching unit

360: SL7 server upgrade project - user support unit

361: SL7 server upgrade project - infrastructure unit

Total effort

The time counted against this project and the five unit projects totals ~144 weeks - slightly more than three 46-week working years. That doesn't include the SL7-related projects listed below: some of those split off from these projects, and others either included general upgrade work in addition to SL7 server upgrade work or were relevant to a wider upgrade of all of DICE to SL7.

Other SL7-related projects
