Lab Exam Environment port to SL7

Description

Port the Lab Exam environment to SL7. This environment incorporates the exam-desktop.h header(s), the lcfg-examlock component, the getpapers and dice-examutils software, and all associated infrastructure (greeter and session management, local home directory control, exam network filesystems, daemon and firewall control and logging) and documentation. In many cases this project was the first use of general-purpose features of DICE SL7.

Customer

Informatics Teaching.

Deliverables

The final deliverable was an exam environment usable by support, teaching staff and students for the purpose of sitting laboratory exams. Progress was tracked on LabExamsSL7.

The environment is documented technically in the LCFG header dice/options/exam-desktop.h and procedurally in the Wiki page LabExamsProc - these two files link to all deliverables.

Time

The project took between two and three weeks' FTE of CO time, and approximately one week's FTE of CSO time. It was undertaken in a fairly intensive fashion over the course of four and a half weeks, some of which was spent directly supervising and monitoring progress during the first exams of the diet (and dealing with "teething problems").

The uncertainty in the above figures reflects effort shared across multiple projects (for example, changes to the greeter infrastructure which were driven by lab exams but benefit the wider SL7 environment and other projects).

This effort does not include time spent by other COs working on their own software to assist the port to SL7 - for instance, additional development time spent on the autofsutils or (lcfg|dice)-iptables software and header infrastructure. In all cases, however, I'm confident the exam project simply "brought forward" effort which would eventually have been required for SL7 servers, services or desktops. I'd roughly estimate this time at between one and two weeks' FTE, spread across several COs.

Observations

Progress was tracked through the LabExamsSL7 wiki history (note that this was not necessarily updated in real time).

Unusually, the original effort estimate seems to have been accurate; however, the header development required additional work outwith the SL7 project, costing more real time than expected. See the Problems section for details.

The development style for this project was fairly different from our (RAT) norm, since it required a CO and a CSO working concurrently, and co-located, for virtually the entire development process.

In theory much of the back-end software could have been developed in isolation (and tested in situ later), but it was identified early on that the majority of the header port would require ongoing, iterative testing in contiguous "chunks" for maximum efficiency. As a result, CSO hours had to be allocated in advance, at times removing the CSO from the support desk. This arrangement provided a significant boost to development efficiency, and without it the project would not have met its deadline.

Testing header changes can be an extremely time-consuming process, and assigning a CSO offloaded much of this effort, allowing development and testing to continue in parallel. This strategy had the side benefits of keeping frontline support informed and providing continuous feedback, which allowed documentation to be updated "live" as the project progressed and significantly reduced the time spent exclusively on training.

As mentioned above, it would have been possible to work offsite (for example using VMs as virtual lab clients), but most of the development (of the headers, at least) was undertaken on-site in a realistic lab-like environment. Direct console access to the test machines brought substantial gains, since the process required multiple reboots, logins and state changes which had to be observed "live". It was also necessary to test at some stage on "real" hardware (e.g. for display manager greeter changes). Accordingly, four lab PCs of differing specifications were removed from live labs and used for the majority of the development. Without these PCs development would have been significantly slower.

Problems

Timing: It will come as no surprise that the real time (though not the effort) for this project was underestimated. However, I was pleased to note that enough contingency time had been left that several full-scale tests could be undertaken.

Though some real-time delays were caused by missing DICE functionality (see below), very little time was spent "idle" pending these changes, so I suspect none of our combined effort was wasted. More contingency time would have reduced the pressure on other units to reprioritise, but the environment proved an unexpectedly good test-bed for much of the planned DICE SL7 functionality, so I would not suggest that the associated work needed to be completed beforehand.

OS Support: It became apparent during development that SL7 (and DICE SL7) was at a comparatively early stage: as well as obvious missing functionality (such as iptables support), there were significant limitations in our fine control of system processes (e.g. in our systemd and autofs configuration), some edge-case bugs in our LCFG context handling, and stability problems in certain pieces of upstream software. Thanks to excellent support from other units I was able to resolve most OS problems very quickly.

The apparent reason for the incomplete DICE / LCFG-level software was that many aspects of the system had been considered 'server only', and were not yet expected to be ready. For future development it should be made clearer that the lab environment is not simply a "special desktop", and perhaps that the 'desktop' and 'server' layers are closer in development terms than we'd previously thought.

LCFG Slaves: a significant amount of time was spent testing lab profile changes (indeed, the lockdown mechanism is entirely profile-driven; "lock" and "unlock" are simple profile changes). As such we relied heavily on the LCFG slaves to churn out repeated configuration builds.
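
By way of illustration only - the real resources are defined in exam-desktop.h, and the names below are invented - a "lock" amounts to little more than a one-line mutation in the lab's source profile:

   /* hypothetical resource, for illustration only */
   !examlock.locked   mSET(true)

which the slaves then recompile into every lab client's profile; "unlock" is simply the reverse mutation.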

Performance of the compilers, particularly in a frenzied and typo-ridden testing process, left a great deal to be desired and probably contributed to more 'idle' time (in 30- to 120-second chunks) and frustration than any other part of the process (apart from the flip-desks; see below).

Even fractional gains in compile speed would have saved real development time here, so I'd suggest we take all possible steps to improve our LCFG slaves (for example solid-state storage or yet faster cores) as appropriate.

I understand this is not news to the server managers, so for future extended development I would recommend setting up fast, dedicated LCFG slaves on which development profiles can be isolated from regular churn and large spanning maps, and where "unsafe" performance improvements such as desktop CPUs and RAM-based storage can be used. *post-signoff note:* MPU commented that the test LCFG slave would have been appropriate for this purpose.

Flip-desks: the lockable lab chosen for development unfortunately consisted of a block of "flip-desks". Having now used a "flip-desk" PC for dozens of hours, I have come to see them as a nuisance, a hazard to fingers and, after a few days' use, an ergonomic disaster.

I'm now seeking an opportunity to feed back my experiences - and would moreover suggest collecting the opinions of staff and students who have had to maintain and use the desks for longer than I have - should Informatics ever consider installing or replacing a "multi-purpose" lab in future.

Status

The project is functionally complete, having allowed several exams to be held without incident. Its first test uncovered some bugs (affecting the lock/unlock process, but with no impact on the exams themselves); these were fixed by changes to the lock/unlock software and by improved documentation for supporting staff.

Experience tells us that it's not feasible to expect entire labs to lock down without some faulty machines, but the system is designed to detect these, and procedures have been written to allow quick (student) recovery without complex decision-making. The improvements in software (and procedure) allowed the most recent exams to be held with zero disruption to examinees, zero CO effort and reduced CSO effort.

Future Work

Though the system has proven stable in practice, problems discovered through testing highlighted that it would be highly desirable for the system to detect an even broader variety of errors automatically, keeping student disruption and "live" CSO / CO intervention to an absolute minimum.

As with all previous exam environment releases, it's important to check that the modifications made by the system are still taking effect as our configuration "drifts". The improved testing procedures should catch such regressions, but it is always worth scheduling contingency time before each diet to deal with any issues that arise.
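
A minimal sketch of such a pre-diet check - assuming a hypothetical examlock resource name, and using qxprof to read the values a client has actually received - might look like:

   #!/bin/sh
   # Sketch only: the "examlock" resource name is hypothetical and the
   # exact qxprof output format should be checked against the release.
   # Run on a locked test client before each diet to confirm that the
   # lockdown settings still reach the machine as configuration drifts.
   if qxprof examlock | grep -q '^locked=true'; then
       echo "lockdown resource present: OK"
   else
       echo "lockdown resource missing: investigate before the diet" >&2
       exit 1
   fi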

Recent JANET / EaStMAN outages have also highlighted that the system ought to be tested under a broader range of failure conditions: though not under the remit of this project, I produced Resilience documentation which addresses many of these issues and should guide future testing efforts.

The project itself spawned a few specific bugs and TODOs, tracked at the LabExamsSL7 page.

-- GrahamDutton - 18 Nov 2015
