Final Report : Ubuntu 20.04 port of LCFG desktop platform (546)

This project followed on from the work done on the Investigate alternative DICE desktop platform #474 project. The aim was to deliver a fully managed DICE desktop environment based on Ubuntu 20.04 (Focal Fossa) which could be used for teaching in September 2020.

Areas of Work

Package Management

As part of the initial investigation a prototype package manager, named apteryx, was developed which uses the apt Python libraries to manage the installed packages according to the requirements in the LCFG profile. As a prototype this had a number of shortcomings and two main changes had to be made. The first was to insert a custom progress monitor so we can bail out if there are problems that would otherwise cause loops. The second was to use the underlying apt conflict resolver to ensure that routine problems don't result in changes, avoiding the loops in the first place.

The new package manager now seems to function as required and to be fairly robust. There's unfinished work in terms of getting control of the terminal I/O in all three cases (normal use, capturing conflict details, and during a crash), and future work might also involve patching or requesting upstream fixes to prevent segfaults in some severe conflict scenarios. There's also unexplored future work in terms of providing custom answers to debconf questions.

This is a part of the project which both Graham and Stephen worked on. As the package manager is so critical to the successful management of all our machines it is definitely a "good thing" that more than one person has an understanding of the code. Also, it means that the code has been reasonably well-reviewed and thus is less likely to have any major issues lurking.

Package Repositories

For the initial investigation project, the reprepro tool was used to manage repositories for locally built packages, with upstream packages being fetched directly from Ubuntu servers. During that project, we discovered that the reprepro tool has a number of shortcomings, the biggest of which is that it does not support multiple versions of a package being available within a single repository. This is a particular issue for us as we need to be able to pin packages on old versions (e.g. if a newer version contains a serious bug). We also want to be able to use the same repository from machines on different LCFG releases (e.g. develop or stable) which will require different software versions. We decided to switch to using aptly which is much more powerful and flexible. An LCFG component was developed and, to ease the management of the service, the software was built for SL7. aptly is a self-contained application with no real dependencies so getting it working on SL7 was, thankfully, quite straightforward. Somewhat frustratingly, since that decision was made it has become clear that aptly is no longer being maintained. This is unlikely to cause us problems unless there are changes to the functionality of the upstream package repositories, but it is something we need to consider for the future. There are currently no obvious alternatives that are as capable as aptly so the hope is that someone else will take on the maintenance work.

The intention was to integrate the whole service onto our primary package server - deneb - but that proved to be quite awkward so to accelerate progress an older machine with large amounts of disk space was used. Initially, we were a bit concerned as to what sort of load we could sustain when installing lots of lab machines but it has worked well. The entire service was finally migrated to deneb in late 2020. To complete the provision of the Ubuntu package service we want the clients to be fetching packages via the local squid package cache servers and we also need to provide an off-site disaster recovery service. Work on both of those is almost complete and we expect to switch clients over to using the cache service later in 2021.

Package Lists

One of the most time-consuming parts of ongoing maintenance for an LCFG platform is dealing with the lists of packages. In particular, when packages and their associated long lists of dependencies are specified in the LCFG header files there is the potential for considerable duplication and keeping them up-to-date becomes really awkward. With that in mind, a significant effort was made whilst porting to Ubuntu to move as many package specifications as possible out of the headers and into the package list files (the .pkgs files). Headers can now just declare high-level requirements for particular package options (via the profile.pkgcppopts resource) and need not be concerned with precisely which dependencies are required. In the future we will only need to manage those package options in one single place and there will be less "churn" for the headers.

Further to this, the way in which the dice package lists are included has been redesigned so that package options (i.e. the CPP macros) can be used in many of the standard package lists. Again, this helps avoid duplication and reduces ongoing maintenance efforts. These structural changes have already proved to be a massive improvement and the hope is that we can continue to build on this work and improve how complex package sets are managed, which will particularly help the RAT unit with managing teaching requirements.

The particular advantage of moving packages into the package list files is that those files can be managed using the excellent soy tool which was created by Magnus Hagdorn in Geosciences. This can generate package list files from YAML specifications which record required packages and their high-level dependencies. This will make porting to future platforms considerably easier and quicker. Currently, although we are using soy, the way we manage the package lists is still almost entirely manual, but there are opportunities to build tools that automatically rebuild all the package list files whenever any of the YAML input files change (perhaps using Make or similar). There are also a number of trickier package lists that have not yet been converted. It would be nice if we could get them all into a state where they can be automatically generated.
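The Make-style rebuild idea mentioned above could be sketched roughly as follows. This is only an illustration of the timestamp check: the function name, the directory layout and the assumption that foo.yaml maps to foo.pkgs are all hypothetical, and the step that would actually run soy is deliberately left out because its command line is not described here.

```python
from pathlib import Path
from typing import List


def stale_specs(spec_dir: Path, out_dir: Path) -> List[Path]:
    """Return the YAML specs whose package list is missing or out of date.

    Assumes (hypothetically) that each NAME.yaml in spec_dir produces
    NAME.pkgs in out_dir, in the same way a Make rule would.
    """
    stale = []
    for spec in sorted(spec_dir.glob("*.yaml")):
        target = out_dir / (spec.stem + ".pkgs")
        # Rebuild if the output does not exist or is older than its input.
        if not target.exists() or target.stat().st_mtime < spec.stat().st_mtime:
            stale.append(spec)
    return stale


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    for spec in stale_specs(Path("specs"), Path("packages")):
        print(f"needs regenerating: {spec}")
```

A wrapper like this would only decide what needs regenerating. The generation itself would still be done by soy.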

There were a couple of other unexpected issues related to the management of Ubuntu package lists. The version strings for some Ubuntu packages contain characters that broke the parser in the LCFG core libraries, and the separate, very simplistic parser in the LCFG server code was also badly affected. The decision was made to introduce a new package specification format, name=version-release/arch, which uses an equals sign as the separator between name and version. This is much safer than the previous underscore, which may be used in package names. The LCFG server code was also modified to use the core libraries to parse package specifications so there is no longer any duplication. Further to this, we discovered that using all as an architecture string confused the server, which only handles noarch as a special case. Some work has been done to resolve that issue but, for now, we require that all specifications for architecture-independent packages use noarch. At some point we need to revisit the LCFG server code and completely overhaul how the package lists are handled so that all of the work is done within the core libraries. This will fully resolve the issues and mean we only need to maintain a single code-base.
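To illustrate why the equals sign is a safer separator, here is a minimal sketch of a parser for the name=version-release/arch form. It is not the LCFG core library parser: the PackageSpec type and parse_spec function are hypothetical names, splitting the release at the last hyphen is an assumption, and mapping all to noarch is shown only as one possible way of applying the noarch requirement.

```python
from typing import NamedTuple, Optional


class PackageSpec(NamedTuple):
    name: str
    version: str
    release: Optional[str]
    arch: Optional[str]


def parse_spec(spec: str) -> PackageSpec:
    """Parse 'name=version-release/arch'; release and arch are optional."""
    # Split on the FIRST '=' so that underscores in the name are harmless.
    name, sep, rest = spec.partition("=")
    if not sep or not name or not rest:
        raise ValueError(f"expected name=version in {spec!r}")

    # The architecture, if given, follows the LAST '/'.
    head, slash, tail = rest.rpartition("/")
    version_part, arch = (head, tail) if slash else (tail, None)

    # Assumption: treat 'all' as the architecture-independent 'noarch'.
    if arch == "all":
        arch = "noarch"

    # Assumption: the release is whatever follows the LAST '-'.
    head, dash, tail = version_part.rpartition("-")
    version, release = (head, tail) if dash else (tail, None)

    return PackageSpec(name, version, release, arch)


if __name__ == "__main__":
    # Hypothetical package names, for illustration only.
    print(parse_spec("tool=1:2.0~rc1-0ubuntu2/all"))
    print(parse_spec("my_tool=1.0"))
```

Because the name is taken from everything before the first equals sign, characters such as underscores, colons and tildes elsewhere in the specification cannot be mistaken for the name/version boundary.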

Installer

The previous Ubuntu project produced a basic PXE installer so, in this project, other than the disk partitioning which is described separately below, it was mostly a case of refining the functionality and making it more robust. Generally, the installer has worked well. It is nicely scriptable and almost everything can be controlled from the LCFG profile, but we have found a few limitations which mean it will need more work in the future. In particular, it does not work well with our local package mirrors, apparently because we retain all versions of packages. It seems that some of the tools assume there will only ever be one version of a package in a repository. This means we are currently tied to using an external upstream mirror. One solution would be to create our own simple view of our local mirrors for the first part of the bootstrap, prior to apteryx being used. It is also rather difficult to persuade the installer to trust our gpg key during the bootstrap phase, which means dropping back to an unauthenticated mode.

The biggest issue we have is that the minimal netboot technology will not be supported by Ubuntu for future LTS releases. Ubuntu has a new technology named autoinstall which we will need to investigate. The alternative options are to continue using the minimal netboot, since it is well supported by Debian, and just patch it to suit our needs, or to create our own in a similar way to how we do it for SL7. The advantage of the latter would be that we could have a much richer bootable Ubuntu environment that can be used for diagnosing and fixing problems with systems that have become unbootable. We also currently do not have a bootable ISO for doing installs. This is rarely an issue but if there were major network problems we might not be able to reinstall machines.

Networking

So that we could get all the lab desktop machines installed on time we decided to initially keep the networking very simple, just using DHCP and not running a local DNS server. Similarly for servers, although a lack of bonding was less than ideal we decided it was acceptable in the short term. We always intended to revert to using static addresses but, as this had not caused any problems, we had considered the work to replace the LCFG network component on desktops to be a low priority. During November 2020 we suffered a problem triggered by IPv6 router advertisements which broke DHCP clients due to a bug in systemd-networkd. As the actual cause of the problem was unclear, the only solution was to raise the priority of reconfiguring the networking on all Ubuntu machines. Consequently, a new network component was introduced that configures systemd-networkd (via netplan) for static addresses and bonded interfaces. More work is still required to provide full support for all features of the previous network component, in particular VLANs and bridging. That work will be done in a separate project.
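As a rough illustration of the kind of netplan data such a component has to produce for a statically addressed bond, the sketch below builds the structure as a Python dictionary. It is not the LCFG network component: the function name, the bond0 and eno2 interface names, the active-backup mode and the 192.0.2.x documentation addresses are all assumptions. netplan itself reads YAML files, and the result is printed as JSON here only to avoid depending on a third-party YAML library.

```python
import json
from typing import Dict, List


def netplan_config(bond: str, members: List[str], address: str,
                   gateway: str, nameservers: List[str]) -> Dict:
    """Build a netplan-style structure for one statically addressed bond."""
    return {
        "network": {
            "version": 2,
            "renderer": "networkd",
            # Member interfaces carry no addresses of their own.
            "ethernets": {name: {"dhcp4": False} for name in members},
            "bonds": {
                bond: {
                    "interfaces": list(members),
                    "addresses": [address],
                    "gateway4": gateway,
                    "nameservers": {"addresses": list(nameservers)},
                    # Assumption: a simple failover bond.
                    "parameters": {"mode": "active-backup"},
                }
            },
        }
    }


if __name__ == "__main__":
    # All values below are placeholders, not real site configuration.
    cfg = netplan_config("bond0", ["eno1", "eno2"], "192.0.2.10/24",
                         "192.0.2.1", ["192.0.2.53"])
    print(json.dumps(cfg, indent=2))
```

Generating the structure from profile data, rather than templating text, would make it easier to extend later to the VLAN and bridge cases.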

Recently we experienced another issue related to IPv6 with the PXE installer. Attempts to fetch some package-related files from an upstream Ubuntu mirror were hanging for long periods of time (up to an hour) or eventually failing completely. The only solution we've found so far is to disable IPv6 support in the installer kernel. Clearly more investigations into IPv6 support on Ubuntu are required.

Notably (yet again) all the network interfaces have new names (e.g. eno1 rather than em1). Hopefully this is the last time they will need to change.

Filesystems

Getting the disk partitioning scheme correct for the Ubuntu installer was a huge challenge involving a lot of trial and error. In some ways the installer is quite clever with how it computes the sizes for the partitions but the algorithm is clearly very sensitive to the disk size and the partition requirements. Most people installing Ubuntu machines will just select one of a few ready-made layouts (e.g. atomic or multi) so don't experience any problems. For us, it doesn't really match with the LCFG way of doing things where we expect to be able to specify exact sizes. This is definitely one of the downsides of using the software provided rather than our own. Where possible we want to avoid having to maintain local software, but local software does at least give us the option to make it suit our own needs.

The installer can support configuring multiple disks but so far we have only used it to manage the primary disk. There is also no support for configuring the mount options for extra entries (fstab.entries resource). This is a backward step from SL7 but rectifying it is not currently a high priority since it can be done manually quite easily. We will need to either improve the way we preseed the installer or create a new LCFG fstab component to handle the extra disks.

One big benefit of the Ubuntu installer is the support for LVM. Now that we have all our disks configured this way we have the option to resize if/when the root partition becomes full. We didn't have time to fully investigate encrypted LVM partitions so that is another step backwards from SL7, where we have encrypted swap and /tmp. That is something we definitely want to improve as soon as possible. It's not clear if that can be done with the Ubuntu installer or whether we will need to write our own tools to do it.

Mostly the various filesystem support just works (it's certainly nice to finally be able to support exfat) but we've experienced a few minor issues with openafs. Packages for openafs are provided in Ubuntu, which is better than RHEL, but the version is a bit old for us (1.8.4 rather than 1.8.6). Also, the systemd config provided doesn't interact nicely with our strategy of only upgrading the package at boot time, which leads to long delays waiting on the service to stop before a final reboot can be done. This could be a change in systemd behaviour since SL7. We need to investigate whether the problem is avoidable or otherwise work to minimise the delay.

Desktop Environment

We attempted to make the graphical environment similar to that on SL7. As it is easier to configure, the lightdm desktop manager was again chosen in preference to gdm. Also, we chose MATE as the default desktop environment rather than Gnome as it is a better match for our requirements. In particular, it works much better through XRDP. The new version of MATE causes a few backward compatibility issues: once a user's desktop settings have been upgraded, MATE doesn't behave nicely on SL7. To avoid ongoing compatibility issues new student users were denied access to SL7 machines. The dice-desktop selector software (in particular, switchdesk) was upgraded to Python 3. Thankfully, we have not so far needed to hack the code for the accountsservice in a similar way to SL7. Testing it all remotely was a bit of a pain and it took us a while to realise that new users would get a very different MATE environment from those who had used it before. In the future we need to use separate test accounts for testing how users will experience a new platform. That will also avoid us trashing our own settings, which can massively slow down progress.

Project Management

As this project had a large number of varied requirements the tracking of progress was done through a board in Trello. This provided a really clear overview that could easily be consulted by everyone involved in the project. It's also very simple to add, remove and edit entries which means there is little administrative burden involved. I would definitely choose to manage other large projects in a similar way in the future.

Discussion

At the time of writing (June 2021) we have almost completed a full year of teaching using the new DICE Ubuntu platform. After the inevitable initial teething troubles it has settled down nicely and can now be considered a robust and reliable system.

As originally hoped, the availability of software in Ubuntu is much better than for RHEL which definitely helped with the work to fulfil our Teaching Software requirements (for example, python modules). Third-party software providers also generally favour providing Debian packages which further reduces the amount of packaging work required. There is still always some software that needs to be packaged locally. With Red Hat-based platforms, that cost had become fairly low as we were able to reuse specfiles at each platform upgrade. As this was the first port to Ubuntu there was a lot of learning and extra work involved that we will not need to do for subsequent upgrades. Generally, the Debian and Ubuntu package ecosystem is of a higher standard. In particular, the repositories are more consistent and the package dependencies are much more likely to be correct. Being able to quickly install additional software and provide more recent versions has resulted in our user community being much more satisfied with the environment.

This project was delivered by a small team and, consequently, not all members of the Computing Team are yet sufficiently knowledgeable about the Debian/Ubuntu environment. We now need to ensure that everyone feels confident with working on the new platform and productivity is not inhibited. The SL7 and Focal platforms do have many similarities. For example, they both rely on systemd to manage services. The biggest, most noticeable difference is in the packaging of the software. We must provide training to ensure all computing staff know how to use the apt package repositories and also how to create and build packages.

Both this project and its predecessor involved a huge amount of learning and development work. The scale of these projects was, at times, totally overwhelming and I felt under significant pressure to deliver a usable platform on time so it could be tested and rolled out to all the student labs before the start of Semester 1 in September 2020. The Covid-19 pandemic resulted in computing staff having to attend to many other tasks which left this project woefully understaffed for too long. I think under normal circumstances that would have been noticed much sooner. Alastair had originally been slated to put considerable effort into this project but, due to a combination of demands, was unable to contribute anywhere near as much time as planned. The dramatic increase in University level change initiatives has effectively reduced development effort available to the School. Being the sole worker on a huge project like this can be a very lonely and stressful experience, and it is not something I ever wish to repeat. The whole lockdown situation definitely didn't help, with everyone being isolated and very few opportunities for informal chats and collaboration, but we need to consider how we regularly manage big projects. In my opinion, unavoidably large projects should always be allocated to multiple people with the aim of splitting the work and having regular progress meetings. Once we moved to that model for this project everything began to be a lot easier. Just having a scheduled opportunity to discuss problems and collaborate on solutions makes a project feel much more achievable and less stressful.

Various work following on from this project has been identified - UbuntuFollowOn

These have been proposed as projects:

Total Effort

-- StephenQuinney - 02 Dec 2020

Topic revision: r11 - 08 Jul 2021 - 09:26:23 - StephenQuinney
 