This page records discussions and actions regarding short-, medium- and long-term solutions for addressing the shortfall in space for hosting our growing GPU estate.

  • IF-B.01
  • IF rain tank room
    • We estimate that there is space in this room for 1000 GPUs, assuming 4U 8-GPU servers, but this would need to be confirmed (see the rough sizing sketch below)
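    • A rough sizing sketch for the above, based purely on figures quoted elsewhere on this page (4U 8-GPU servers, 2.4kW per server, 42U racks); it ignores cooling, power distribution and floor loading, which are the real unknowns and would need Estates input:

# Rough sizing for "1000 GPUs" in the rain tank room (all figures unconfirmed).
target_gpus = 1000         # target quoted above
gpus_per_node = 8          # 4U 8-GPU servers, as assumed above
node_height_u = 4
rack_u = 42                # 42U racks, as used elsewhere on this page
node_kw = 2.4              # per-node power figure quoted elsewhere on this page

nodes = target_gpus // gpus_per_node            # 125 servers
racks = -(-nodes * node_height_u // rack_u)     # ceil(500U / 42U) = 12 racks
power_kw = nodes * node_kw                      # ~300 kW of IT load

print(f"{nodes} servers, ~{racks} racks, ~{power_kw:.0f} kW IT load")
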
  • IF-B.02/B.Z14
    • in-rack cooling
      • this seems like the most promising short-medium route to increasing capacity in the School's server rooms
      • it is expected to cost substantially less than replacing/upgrading the existing AHUs and backup chiller, but we have no firm evidence of this
      • We need to be clear on whether the expectation is that GPUs installed in this way would be powered via the server room UPS. If so, we need to check that the UPS can support the load; if not, we need additional non-UPS-backed circuits (and distribution boards?) installed
      • Currently two known options:
        • HP ARCS racks (suggested by EPCC) - see https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00062312en_us
          • logically installed as a set of 4 equipment racks plus one cooling rack, each 1600mm deep
          • it is likely that the only home for an in-rack cooled row would be opposite the comms racks
          • we estimate that we could install 32 8-GPU nodes (at 2.4kW each, 76.8kW in total) in one row (4 active racks, 1 cooling rack) - see the per-rack power sketch after this list of options
          • we might have problems installing in-rack units due to access limitations (lift, doors) en route to IF-B.02
        • ColdLogik CL20 - see https://www.usystems.com/data-centre-products/cl20/
          • more modular
          • racks less deep
          • possibly (?) easier to install
          • physically might (?) be able to install in location of current shelving near B.02 entrance door, as well as opposite the comms racks
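      • A per-rack power sketch for an in-rack cooled row, purely arithmetic on the figures above; the per-rack cooling capacity of the ARCS/CL20 units would still need to be checked against this:

# One in-rack cooled row: 32 x 8-GPU nodes at 2.4kW each,
# spread over 4 active racks plus 1 cooling rack (figures from above).
nodes, active_racks, node_kw = 32, 4, 2.4

row_kw = nodes * node_kw                   # 76.8 kW for the whole row
nodes_per_rack = nodes // active_racks     # 8 nodes per active rack
rack_kw = nodes_per_rack * node_kw         # 19.2 kW of heat per active rack

print(f"row: {row_kw:.1f} kW; per active rack: {nodes_per_rack} nodes, {rack_kw:.1f} kW")
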
    • replace / upgrade existing AHUs and backup chiller
      • Given the indicative costings, this is unlikely to be feasible in the short-medium term
    • enclosed hot/cold aisles (as at the AT DC)
      • this wouldn't increase capacity (would it?) but should improve our PUE
      • could we apply for funding from the University's Sustainability funds?
    • Heat batteries (discussed with Estates and more-or-less dismissed)
      • One suggestion for increasing cooling provision is to use heat batteries (eg from sunamp.com).
        • But: does the University have any use for the heat released from a battery? (The heat is released at a relatively low temperature.)
        • But: we may need to "replace" the battery on a weekly (or even daily?) basis
      • ACTION ask Estates for their views on the heat battery idea
    • consultant
      • EPCC have suggested that we call on the consultancy company that advised on the ACF to provide advice
      • ACTION: Confirm with Estates whether they are happy to engage with EPCC's consultants. They are, and in addition were directly involved with the ACF rear-door cooling installation
    • Questions we need answers to
      • ACTION: Alastair to ask EPCC how they arrange for conditioned power (to protect against voltage spikes) where not using a UPS
      • ACTION: Ian to ask Iain for his views on whether 8 GPUs in a 2U chassis is practical/advisable. Iain favours 4U 8-GPU servers over 2U ones - see https://rt4.inf.ed.ac.uk/Ticket/Display.html?id=107931#txn-2130643
      • ACTION: Need to decide whether we're sticking with 8-GPU 4U nodes
      • ACTION: Alastair to take to Procurement/RAT - we need to measure the PSU behaviour of any new model (current draw when running on 3 PSUs vs. 2 PSUs)

  • IF-3.44 (now discounted as a viable option)
    • ACTION: ask Estates if they can think of any way to make this space usable
    • It is an awkward shape; it's not clear that it would really be useful.
    • Images: looking in the door; from inside

  • CSR
    • additional rack (taking the place of the old tape library)
    • Information from Estates: the generator provides 120kW. Max historical demand is 76kW.
    • Ian's guesstimate: the backup cooling is rated somewhere between 80 and 96kW
    • ACTION: Ian to continue to press Estates for an official rating of the main and backup cooling
    • Agreed that we could install a rack with four 8-GPU servers now (see the power headroom sketch at the end of this CSR section)
      • but we'd need another rack and 3 PDUs
      • ACTION: George/Ian order up rack and 3 PDUs for CSR https://rt4.inf.ed.ac.uk/Ticket/Display.html?id=108002
        • Confirmed that switches cs0.kb.net and cs1.kb.net have sufficient free ports (15 and 18 respectively)
        • What we would need to order is 1x 42U 600x1000 Prism PI server cabinet, 1x matching plinth, and 3x APC AP8959EU3 rack PDUs (see RT#86848 for the most recent similar order)
        • BUT, before ordering a rack and 3 PDUs, need to confirm that three unused underfloor bus-bar pick-off points are available within physical reach of the proposed rack location (we do have sufficient spare flying leads)
        • Possible alternative: current rack space and power usage suggest it might be feasible to install one GPU server in each of the existing racks 0, 1 and 3, making a total of three 8-GPU servers rather than four.
        • Need to physically inspect the CSR to confirm some of the above before proceeding
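    • Power headroom sketch for the proposed rack of four 8-GPU servers, using the figures above; note that the backup-cooling range is still Ian's guesstimate, which is why the official rating matters:

# CSR headroom check for one extra rack of four 8-GPU servers.
generator_kw = 120                 # from Estates
max_demand_kw = 76                 # max historical demand, from Estates
backup_cooling_kw = (80, 96)       # Ian's guesstimated range, not confirmed
extra_kw = 4 * 2.4                 # four 8-GPU servers at 2.4 kW each

new_demand_kw = max_demand_kw + extra_kw   # 85.6 kW
print(f"new peak demand ~{new_demand_kw:.1f} kW "
      f"(generator {generator_kw} kW, backup cooling {backup_cooling_kw[0]}-{backup_cooling_kw[1]} kW?)")
# ~85.6 kW is well within the generator rating but sits above the low end
# of the guessed backup-cooling range, hence the ACTION above.
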
  • Bayes basement
    • We had planned to deploy some of Tim Hospedales's servers in Bayes basement as they are Citydeal funded.
    • Unfortunately, all five existing racks are comms racks and thus unsuitable for GPU servers. Starting from the southernmost rack: racks 1 and 2 are populated with network termination and IS network switches, rack 3 is empty, rack 4 has some CCTV recording equipment, and rack 5 is empty.
      • To release the originally-planned-for space for 3 server racks, the CCTV kit will need to be moved to rack 3 and racks 4 and 5 replaced with proper server racks
      • ACTION: Alastair to check with CIS (Fraser) that they won't need rack 3 as part of the EdLAN replacement programme. Malcolm has confirmed that CIS are happy with this
      • ACTION: Alastair to arrange for the CCTV kit to be moved from rack 4 to rack 3, assuming IS don't need rack 3
      • ACTION: Replace rack 5, and possibly rack 4, with server racks.
      • It is unlikely that the above can be completed in the short term, so Tim H's servers will need to be located elsewhere.
  • AT offices
    • Potential for short term (until Sept 2021) use of AT-3.09, AT-4.05, AT-4.07, AT-4.09, AT-4.12, AT-4.14 and AT-4.14A.
      • Between 1 and 2 GPU servers per room could temporarily be located on table tops in these rooms
      • We're not aware of any power constraints that would cause problems
      • ACTION: Alastair to confirm with Neil Heatley which of these rooms are available (emailed). Neil has given the go-ahead for use until Semester 1
      • It is possible that some of these spaces will continue to be available past Sept 2021
      • 29/04/21 - Likely that we could use parts of Level 8 past Sept 2021, and even now.
      • ACTION: Alastair to identify rooms in Level 8 that we could use now for servers
  • AT basement
    • ACTION: Ask Joy if we could use this space (assuming funds are available for kitting it out)
  • Wilkie
    • It is possible that we could temporarily locate GPU servers on tables in offices on the upper floors of Wilkie
      • There is a concern that transporting GPU servers to these floors might not be practicable (no lift, narrow stairs)
    • ACTION: Alastair to ask Joy whether we can ask for space on upper floors in Wilkie (Joy is checking)
    • ACTION: ?? to check that transporting a GPU server to the upper floors is practicable
  • Mary Somerville Data Centre (KB)
    • We have been allocated two racks in the MSDC. Each rack is supplied with two 32A supplies (†). CIS would need to be consulted/involved regarding any inter-rack networking.
      († Comment: each 8-GPU node is rated at 2.4kW, i.e. ~9.6A. Two 32A PDUs limit the equipment we might plan to install to six such nodes per rack, and those would have no power redundancy; see the sketch at the end of this section.)
    • ACTION: We need discussion between Inf, RAT and Paul Hutton (and potentially CIS)
    • ACTION: Ian to consider the PSU-behaviour implications of locating servers in MSDC
    • Unlikely that we will be able to make use of this space until after KBS DRT migration
    • Could possibly use spare(?) fibre between CSR and MSDC to connect our racks direct to our switches in CSR
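    • A sketch of the per-rack limit behind the (†) comment. The supply voltage here is an assumption: the comment's 9.6A per node corresponds to roughly 250V, while 230V nominal gives nearer 10.4A, but the per-rack conclusion is the same either way:

import math

# MSDC rack capacity: two 32A feeds per rack, 2.4 kW per 8-GPU node.
# Assumption: 230V nominal supply.
feed_amps, feeds_per_rack = 32, 2
node_kw, volts = 2.4, 230

node_amps = node_kw * 1000 / volts                  # ~10.4 A per node
nodes_per_feed = math.floor(feed_amps / node_amps)  # 3 nodes per 32A feed
no_redundancy = nodes_per_feed * feeds_per_rack     # 6 nodes/rack, both feeds fully loaded
with_redundancy = nodes_per_feed                    # 3 nodes/rack if one feed must carry all

print(f"{node_amps:.1f} A/node: {no_redundancy} nodes per rack without power redundancy, "
      f"{with_redundancy} with")
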
  • ACTION: ??? to sum up short term capacity in MSDC, CSR, AT, (Wilkie)
  • ACTION: Alastair to ask RAT/Procurement what current GPU spending requests are already known about (and what servers are currently delivered in boxes, and for whom)

-- AlastairScobie - 20 Apr 2021
