Final Report for Server Hardware Interaction Phase 2

https://devproj.inf.ed.ac.uk/project/show/171

Time Taken

Lots.

T1 2012: 167 hours
T2 2012: 82 hours
T3 2012: 67 hours, plus the time taken to write this report.

Total: 316 hours, or eight and a half weeks.

Two weeks were allocated to the project so it overran by more than six weeks. A lot of this time was taken up exploring the large and frequently promising-looking world of Dell system management software. These explorations often seemed to become stuck in complicated technical messes. This can be a dispiriting experience: this report suggests some ways of avoiding the problem.

Aims

The project was a second phase of our efforts to improve our interaction with server hardware. This phase concentrated on helping the computing staff to keep the servers' firmware and BIOS more up to date than hitherto.

The project's case statement identified several options:

  • A manual process - storing firmware files in a common repository and documenting procedures for applying them.
  • An automatically populated repository of BIOS/firmware files. Manual application.
  • A system for monitoring BIOS/firmware levels and flagging servers with out-of-date levels. Manual application.
  • Fully automated application of BIOS/firmware (manage as if RPMs).

What the project has delivered

The project has delivered:

  • a database record, updated daily, of firmware and BIOS details for the major hardware components of our servers.
  • a collection of tested and recommended firmware updates for our servers.
  • a web page which lists the servers in need of firmware or BIOS updates, with details of those updates.
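
As an illustration only, the comparison behind a page like this might be sketched as follows. Python and its built-in sqlite3 module are used purely as a self-contained stand-in for the real database; the table names, column names, hosts and version numbers are all hypothetical, and the sketch assumes that versions are dotted numbers.

```python
import sqlite3


def version_key(version):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically.

    Assumption: versions are dotted numbers. Other schemes would need
    their own comparison rule.
    """
    return tuple(int(part) for part in version.split("."))


def servers_needing_updates(conn):
    """Return (host, component, installed, recommended) for out-of-date items."""
    rows = conn.execute(
        """
        SELECT i.host, i.component, i.version, r.version
        FROM installed_firmware AS i
        JOIN recommended_firmware AS r
          ON r.model = i.model AND r.component = i.component
        ORDER BY i.host, i.component
        """
    ).fetchall()
    return [row for row in rows if version_key(row[2]) < version_key(row[3])]


def build_demo_database():
    """Create an in-memory database with made-up hosts and versions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE installed_firmware
            (host TEXT, model TEXT, component TEXT, version TEXT);
        CREATE TABLE recommended_firmware
            (model TEXT, component TEXT, version TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO installed_firmware VALUES (?, ?, ?, ?)",
        [
            ("alpha", "model-a", "BIOS", "1.9.0"),
            ("alpha", "model-a", "NIC", "7.2.0"),
            ("beta", "model-a", "BIOS", "1.10.2"),
        ],
    )
    conn.executemany(
        "INSERT INTO recommended_firmware VALUES (?, ?, ?)",
        [("model-a", "BIOS", "1.10.2"), ("model-a", "NIC", "7.2.0")],
    )
    return conn


if __name__ == "__main__":
    for row in servers_needing_updates(build_demo_database()):
        print(row)
```

Comparing versions as tuples of integers rather than as strings matters here: as plain text, 1.9.0 would sort after 1.10.2.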

The system is documented in two wiki pages:

Future Work

  • The project's case statement pointed out that memory error detection (for example using ipmi-sel) would be highly desirable. This is still the case.
  • Now that we have an automated machine-specific list of appropriate updates, we could have these updates copied to the machine's local disk. (It is strongly recommended that updates be applied while the machine is in single-user mode, when the network is not available.) Some instructions could also be copied to the local disk. In principle this could be a fairly quick extension: a small script could run a simple SQL query to produce a list of filenames, and those files could then be copied to the local disk. This work seems small enough to be done in spare development time rather than in a project.
  • Now that we have a collection of recommended firmware and BIOS updates, it will need regular maintenance. This will be a regular part of the MPU's operational work.
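
The quick extension described above (query for filenames, then copy them to the local disk) might look something like the following. This is a minimal sketch under stated assumptions, not a finished tool: Python's sqlite3 stands in for the real database, and the pending_updates table, its columns and the directory arguments are all hypothetical.

```python
import shutil
import sqlite3
from pathlib import Path


def update_filenames(conn, host):
    """Return the update filenames recorded for one host.

    Assumption: a hypothetical table pending_updates(host, filename).
    """
    rows = conn.execute(
        "SELECT filename FROM pending_updates WHERE host = ? ORDER BY filename",
        (host,),
    ).fetchall()
    return [row[0] for row in rows]


def copy_updates(conn, host, repository, destination):
    """Copy this host's update files from the repository to a local directory.

    Files listed in the database but missing from the repository are
    skipped. Returns the names of the files actually copied.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in update_filenames(conn, host):
        source = Path(repository) / name
        if source.is_file():
            shutil.copy2(source, destination / name)
            copied.append(name)
    return copied
```

Binding the host name as a query parameter, rather than pasting it into the SQL text, keeps the query safe whatever the host is called.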

Dell's Management Software

Dell has put a lot of effort into its server management software. Many sites manage to get a lot of use out of it and seem fairly happy with it. Nevertheless, it doesn't seem to suit our situation. Several factors make it hard for us to use:

  • Software installation is made simple for the user by means of extensive and multi-layered use of multiple yum repositories. Installation goes something like this:
    • The user downloads a shell script from Dell.
    • That shell script configures some yum repositories and downloads some installation software from them.
    • The installation software then runs. It installs some Dell system admin software, but it also configures some more yum repositories, this time model-specific ones. (Counting the model-specific ones, Dell seems to have a lot of yum repositories.)
    • The system admin software can then (amongst other things) compare the machine's firmware with what the model-specific yum repositories have available. That is, if the software is working this week, and if the machine is running standard unmodified RHEL or if the user has by some means found out and applied whatever eldritch modifications are necessary in order to fool this version of the software into thinking that it's running on an unmodified RHEL system rather than on a RHEL clone.

Dell's management software can be wonderfully helpful where a machine's configuration is managed manually, but for our automatically configured machines, having multiple levels of install scripts altering a machine's RPMs and yum configuration in complex and unpredictable ways is difficult to support.

  • Where it caters for Linux, Dell's software concentrates on support for RHEL and SUSE. RHEL clones are unsupported but tend to be compatible enough to mostly work most of the time provided the user makes the right adjustments, most of which seem to be passed around as hearsay on blogs or mailing lists. These quirks and tweaks also seem to change from one version of Dell's software to the next.
  • Dell now has apparently vastly improved management software of great power and capability. Sadly for us the software's designers made the assumption that all data centres have a few Windows PCs lying about which can be used as management consoles for Linux machines. They don't seem to have encountered or allowed for entirely Linux-based sites like ours. As a result the central management control function around which this software is structured is not easily available to us, making the whole software solution far more difficult to adopt. We could consider setting up a Windows machine for this software, if the software otherwise seemed dependable and easy to install on our machines. For us though that doesn't seem to be the case.
  • Multiple generations of overlapping Dell software solutions are available; these are documented and commented upon in various different places. It can be hard to tell which generation of which sort of software might be the best in which situation; advice on this can be found online but this advice can be out of date. Dell's website helpfully tries to provide many links to useful information about the software, but these links are often not appropriate: for instance they can point to Windows-specific information or to documentation on a version not yet available for a vaguely compatible type of Linux.

Lessons learned: technical

The project initiated me into the joy of SQL and database programming, which I had somehow managed to avoid for too long. I'm glad that it did: SQL and database programming are interesting, very useful and well worth knowing about. In no particular order, here are some pointers which might be of use to someone else wanting to gain a smattering of database programming knowledge.

  • We have two database systems in common use: PostgreSQL and MySQL. The advice I received can be summarised as "PostgreSQL is better".
  • The PostgreSQL documentation is clear, comprehensive, comprehensible and basically very good indeed.
  • Nevertheless there is also a good PostgreSQL book available for free online. The same site has several other SQL-themed books too, including a MySQL one.
  • For Perl database programming there is unsurprisingly More Than One Way To Do It, but the classic way to interact with databases in Perl is to use DBI. The equally classic way to learn about DBI is to read the O'Reilly book Programming the Perl DBI. The book was most recently updated in 2000 but it's still available and still relevant, and I can't recommend it highly enough. The chapters which I read had explanations of enviable clarity and readability: I learned a lot from it with what felt like very little effort. I used DBI for this project's database programming needs and for small programs I would not hesitate to use it again.

Lessons learned: social

This project has given me some useful lessons in how to improve my working methods. There are several tips here, but they all come at the same problem from different angles. That problem is getting hopelessly stuck - letting my investigations take me deeper and deeper into an unproductive technical tangle. This has two effects. The first is obvious: I fail to make progress towards the objectives of the project. The second may be less obvious but can be at least as serious: the lack of progress on the project, if it continues for a while, can itself become profoundly disheartening, and the lack of confidence which results from that can become debilitating, making subsequent progress far more difficult than it need be.

I have found some solutions. I don't yet use all of them as much as I would like, but I'm working on it. In no particular order:

Blog

I spent a long time trying to find my way around the quirks in Dell's Linux server support software. See above for an idea of the complication I encountered. At one point, not particularly expecting anything to come of it, I blogged about the problems I was having. I was startled to get, not many days later, a very useful reply to my blog post. My posts normally attract helpful advice on stopping snoring or making money fast, but this was from a staff member at Dell who was able to explain what was going on, give me some pointers and generally bring some sanity to the situation. It seems that Dell staff take it in turns to search the internet for blogs which mention Dell, and contact and offer help to the bloggers. It helps that our blog server seems to be enviably well placed in Google search results: I've long since become accustomed to finding my own rather despairing blog entries showing up as a top result in web searches when I'm looking for help on an obscure topic. The high search rank means that our blogs can be found easily and read by a lot of people; our humble blog service can be well worth using.

Write

This is old advice but it's something I've only started doing recently. During this project I've started to keep in a notebook a record of everything I do. I (aim to) write down things needing done, things I have done, problems, thoughts, possible theories and solutions. This has had two benefits, only one of which I foresaw. The obvious one is that it gives me a record of my work to look back on, a useful memory aid. The less obvious benefit, but for me just as useful, is that the process of noting down problems and half-formed notions seems to help me to form a clearer mental picture of the situation, making the way forward more obvious.

Meet

My unit, the MPU, aims to meet every week. One of the purposes of that meeting is to go over project work, share experiences and advice, and help each other out of sticky situations. Sometimes we haven't managed to have a meeting because someone has been away (there are only three of us). However, one lesson we've learned this year is that we should hold these meetings weekly without fail: going too long without someone else's perspective on my work can lead to my getting unhelpfully side-tracked into an unproductive morass. (Some people don't seem to be like this; I am. We're all different.) A fresh perspective can often be astonishingly helpful, and can quickly show a way round what has hitherto seemed like an impassable barrier. What I'm trying to promote here is the idea of regular contact with at least one colleague, in which you take time to describe to each other exactly how your projects are going and to give each other opinions and advice. It's important to be honest. If things aren't going well, don't gloss over it or try to hide the fact; describe exactly how bad the situation has become. When you're stuck is precisely when you need help the most.

-- ChrisCooke - 26 Nov 2012

Topic revision: r3 - 27 Nov 2012 - 14:23:37 - ChrisCooke