Forum server room UPS-related power work, July 2019

This page collects each Unit's detailed expectations for the UPS-related power shutdown on Saturday July 20th. Please add your own details in whatever format suits you best.

Blog posts:

Infrastructure Unit

We expect to keep the core Forum network up and running throughout. There will be a few short breaks for individual items of kit as we apply firmware upgrades, but the usual failover mechanisms should cover for these. We also expect the IF console server to remain up and running.

On-site authentication and directory services will be shut down for the duration. Client systems will fail over to off-site servers. Password changes will not be possible.

The nagios master is in the Forum, and so nagios monitoring, including for other sites, will be down.

The full list of our servers and their locations is here.

Managed Platform Unit

These user services will be down:
  • (waterloo)
  • half of (lute)
  • (hare)
  • (beaver)
This will remain up:
  • aka
Also down:
  • the master packages server (deneb)
  • one of the two package cache and PXE servers (regulus)
  • the master LCFG server (steen) and also a slave LCFG server (altair)
  • These KVM servers:
  • pkgforge.inf (pinemarten) and one of the builders (shrew)
  • tartarus.inf (tomares)

KVM note

Before the three Forum KVM servers are shut down, the MPU will suspend all VMs still running on them. The MPU will resume these VMs once the LCFG service is back.
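
As a rough illustration of the suspend/resume step, the sketch below builds the commands that could be run on each KVM server. It assumes that "suspend" means saving guest state to disk with `virsh managedsave` and that "resume" means `virsh start`; this page does not say which mechanism the MPU actually uses. The guest names in the usage note are hypothetical.

```python
from typing import List


def suspend_commands(running_guests: List[str]) -> List[List[str]]:
    """Build one command per guest to save its state to disk.

    ASSUMPTION: 'virsh managedsave' is used so that guest state
    survives the host being powered off.
    """
    return [["virsh", "managedsave", guest] for guest in sorted(running_guests)]


def resume_commands(saved_guests: List[str]) -> List[List[str]]:
    """Build one command per guest to bring it back after the host is up.

    ASSUMPTION: 'virsh start' is used; it restores a guest from its
    managed save image when one exists.
    """
    return [["virsh", "start", guest] for guest in sorted(saved_guests)]
```

For example, `suspend_commands(["vm-b", "vm-a"])` yields the `managedsave` command for `vm-a` followed by the one for `vm-b`. The functions only build command lists, so they can be reviewed before anything is executed.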

Powering on MPU servers

(Please let MPU do this if possible.)
  • To get the packages service back (for updaterpms), turn on deneb (rack 1, slot 34) then regulus (rack 6, slot 24). regulus is also PXE.
  • To get the LCFG service back, turn on steen (rack 11, slot 21) and altair (rack 6, slot 34).
  • KVM servers: azul (rack 5, slot 18), gaivota (rack 3, slot 14), girassol (rack 4, slot 16).
  • XRDP servers: lute (rack 5, slot 16), waterloo (rack 0, slot 8).
  • SSH: hare (rack 7, slot 33).
  • Others: jubilee (rack 4, slot 20), juice (rack 1, slot 31).
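
The list above can also be kept as data and turned into a printable checklist for whoever is in the server room. This is a minimal sketch: the only ordering this page states is deneb before regulus, and the rest of the sequence simply follows the order of the list, which is an assumption. The `Server` type and the role labels are illustrative, not part of any MPU tooling.

```python
from typing import List, NamedTuple


class Server(NamedTuple):
    name: str
    rack: int
    slot: int
    role: str


# Rack and slot values are copied from the list above.
# ASSUMPTION: beyond "deneb then regulus", the sequence is just list order.
POWER_ON_ORDER = [
    Server("deneb", 1, 34, "packages master"),
    Server("regulus", 6, 24, "package cache and PXE"),
    Server("steen", 11, 21, "LCFG master"),
    Server("altair", 6, 34, "LCFG slave"),
    Server("azul", 5, 18, "KVM"),
    Server("gaivota", 3, 14, "KVM"),
    Server("girassol", 4, 16, "KVM"),
    Server("lute", 5, 16, "XRDP"),
    Server("waterloo", 0, 8, "XRDP"),
    Server("hare", 7, 33, "SSH"),
    Server("jubilee", 4, 20, "other"),
    Server("juice", 1, 31, "other"),
]


def checklist(servers: List[Server]) -> List[str]:
    """Return numbered lines giving each server's rack and slot."""
    return [
        f"{i}. {s.name}: rack {s.rack}, slot {s.slot} ({s.role})"
        for i, s in enumerate(servers, start=1)
    ]
```

Calling `print("\n".join(checklist(POWER_ON_ORDER)))` gives a numbered list that can be ticked off rack by rack.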

Research and Teaching Unit

For the avoidance of doubt assume that all RAT managed servers and services hosted in the Forum will be down. A complete host/alias list is included below.

Hosts scheduled for priority poweron are annotated in bold. Priority hosts will be supervised for poweron, though there are no guarantees in case of hardware failure. The remainder will be powered on if there is time, and/or they respond to remote commands.

With the exception of (some) priority machines we will not schedule poweroffs of virtual machines, relying instead on the normal host suspend/resume procedure. These machines run an additional risk of needing manual intervention if something interrupts the host poweroff.

Poweroff will be done on a best-effort basis and we'll double-check that the priority hosts are off.

Some reboots will take longer than normal due to scattered SL7.6 package upgrades, and there will almost certainly be a couple of periodic fsck runs; however these delays will only be notable on the few remaining ext3 hosts.

We do not intend to schedule firmware upgrades during the poweroff.

adamski, alecto (dream, www.dream), arcsim (svn.arcsim, test-64.arcsim), arcsimvm1 (freescalelm), arcsimvm3, bakerstreet (wrbsrt, wrbadminrt), barclay (webmark), barham, barre, blackburn, blanik (vfbsandbox, braintrapdev, vfbdev, vfbsandbox1), bloor, bocian (flybrain, vfbsandbox2, vfbsandbox3, vfbaligner), bollin, bonnybridge, bowden, bravas, broma, broom, buccleuch, buck, bumbo (sl7rt), burnaby, chatelet, cheetah, cheshire, clafoutis, clulow, commonrail (cay), crabbe (project-archive), craggy, cup02 (demitasse), daifuku, davie, dechmont, deltic (storm), draco (infros), dufferin, dyatlov, edwards, eglington (eglington-atopen), elion, ellsworth, escience7 (escience7-o), filius (coltex), flapjack (pgteach), fluffy (oldinv), gibbs, giger, gladiator (weaslesink, rt4ngpg, rt4test), gorman, goyle (projsubs), greekie (mlp1), greenbeach (rt4ng, rt4, rt, multidesk), greider, griselda (project-submit), groomlake, hannah, hardwell, harnoncourt, henwen, hessdalen, hodgkin, horsforth, hydra, hynek, inflexible, invincible, islington (webots-lserv, mathlm), jantar (voicebank), jerry, kapok, karenin (braintrap, vfbwiki, flytrap, fruitfly), kensal, kinloch (www.cohort, cohort), landonia18, landonia19, landonia20, landonia21, landonia22, landonia23, landonia24, landonia25, lazar, levi, livy, lubbock, majestic, malaya (enragedpando), marzipan (ui.theon, portal.theon, student, course, computing.projects), mayer, mcclintock, mcgill, melmac, meringue (infdb), moorgate (issrt, issrt4), morris, moser, mustard, newt (dicedesk), nicolson, nuesslein, osiris, ostrom, pataphysique, plaistow (seeker), pug01, pug02, pyramid, quanticol, quarry, quarter, queensway (costco, rdmpubs, kmrt), rat1 (nn.exctest), rat2 (rm.exctest, hadoop.exctest), rat3, rat4, rat5, rat6, rat7, redsea, rockall, s1443541, s1509375, schaffner, shortbread, sloth, snippy, southwark, sponge (pgresearch), sprinkles, spume, starariel, stardale, starleader, stkilda, stoner (otptestportal.theon, otptestui.theon), stonesoup, strudel, summastorage, swift, tambo, tangmere, teacake, teasel (beetle-grow), telford, templeton (dpmt), teurgoule, tiramisu, tom (synthsys), tomintoul, tonks (referee), tulips (survey.tulips), tullibardine, tunguska, vfbbs3, vfbbs4, vfbbs5, vfbbs6, vfbbs7, vfbbs8, vfbbs9, wasserboxer (wbx), wilbur, wolfburn, yonath, youyou, yuecheng, zamora, zho, zutism
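
For anyone who wants to check hosts off programmatically after poweron, the host/alias list can be parsed into a lookup table. The sketch below is only an illustration: it assumes the "host (alias, alias)" layout used in the list, and the function name `parse_hosts` is made up for this example.

```python
import re
from typing import Dict, List


def parse_hosts(text: str) -> Dict[str, List[str]]:
    """Map each host name to its list of aliases (possibly empty)."""
    hosts: Dict[str, List[str]] = {}
    # A host name, optionally followed by "(alias, alias, ...)".
    pattern = re.compile(r"([\w.-]+)(?:\s*\(([^)]*)\))?")
    for match in pattern.finditer(text):
        name, aliases = match.group(1), match.group(2)
        hosts[name] = [a.strip() for a in aliases.split(",")] if aliases else []
    return hosts


sample = "adamski, alecto (dream, www.dream), rat2 (rm.exctest, hadoop.exctest), rat3"
table = parse_hosts(sample)
# table["alecto"] is ["dream", "www.dream"]; table["rat3"] is []
```

Because entries are recognised by the name-then-parentheses pattern rather than by splitting on commas, commas inside an alias group do not break the parse.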

Services Unit

  • Most noteworthy things that will remain working: mail, printing, AFS space in AT, web.inf, www.inf (readonly), groups webserver (mostly)
  • Things that will be down: jabber, NFS and AFS in the Forum, homepages, wiki, rfehost, roombooking, lots of web sites
  • A complete list of Services Unit machines that will be down: ForumSRPowerWork2019ServicesUnit

We expect to start shutting things down at 10:30am, and to have them all down by 11am. We presume there will be enough infrastructure to do that. What deps do LCFG/package servers have on AFS?

We'll turn on file servers first, then the likes of jabber, etc when VM hosts come back.

During the downtime we will be shuffling some hardware, and machines that were powered down will be switched to SL7.6 when they come back up. Will the package servers cope when they all start coming back up? Our ServicesUnitJuly2019PowerWork

User Support Unit

All US managed servers will be down during the UPS work. We will try to shut down as many as possible on Friday afternoon. None of the servers are high priority. The staff.compute server is normally considered medium priority, but it has not seen much activity recently, so it is not time critical. The main users of the 'Institute' servers have been emailed.

  • student.login
  • selma (already off)
  • castor (already off)
  • staff.compute (added motd)
  • catzilla
  • joule (already off)
  • vetinari
  • cup01
  • dalfaber
  • hubel
  • jekyll
  • salmon

Possible additional work: remove nsl101-nsl108 (9 Dell servers) from SMSR - see RT ticket

-- GeorgeRoss - 15 Jul 2019

Topic revision: r15 - 19 Jul 2019 - 08:50:42 - AlisonDownie