RAT Unit Meeting -- 09-July-2018

Development Projects Awaiting Completion

Teaching Software Requirements (16/17)

  • Final report being written tomorrow (iainr)
  • timc: supply numbers

Procure 200 GPUs for taught students

  • final report finished bar the numbers (iainr)
  • timc: supply numbers, Iain to note Ian's contribution on report

Teaching Software 2017/2018

Mock REF Review System

  • service now in use - more monitoring would be desirable
  • Final report pending (timc)

Teaching Support Forms and Workflow Enhancements

Development Projects

SL7 Server Project

  • Final report to commence this afternoon (gdutton)
  • Flybrain servers:
    • vfbblog has left SL6 and DICE altogether (wordpress)
    • disk copy has completed (13TB - 2 out of 3 servers) and now incrementally cycling
    • working on new servers vfb 7,8 and 9; porting one to SL7 for reassurance
    • temporarily blocked waiting on filesystem

Free workstation service

  • Service now in use
  • prototype Android app - requires authentication backend
  • may try to hook into AT lab usage display

DICE desktop review

  • gdutton talked to ascobie; direction is unclear, so gdutton will produce a brain dump this week on how to proceed

Production Hadoop Service

  • Currently managing a test service (forked from existing headers).
    • Plan to keep a test cluster.
  • TODO:
    • attempting to kerberise - still tricky. Investigating keystore involvement. Look at whether it is easier to use a different Hadoop version.
    • improve startup process - looking at systemd
    • investigating newer Hadoop versions / branches - might be tied by teaching requirements
      • ideally want to use newest/easiest Hadoop version - will use 2.9.1
      • problems due to security changes in newest Java; fixed, now awaiting deployment - 1.8 is only now getting security fixes
        • caused by a Java packaging problem: the package stopped distributing cryptography policy manifest files, which have been manually added back
    • list of ongoing Hadoop problems (best just search https://rt4.inf.ed.ac.uk/User/Summary.html?id=372166 - iainr to create master list)
  • important to check which courses are planning on using Hadoop this (upcoming) year
    • contacted EXC lecturers - Volker, Pramod - happy with any Hadoop version that works
  • need to think about how to manage the data (1TB+) in HDFS on current Hadoop cluster

Teaching Support Forms etc. Pt 2

  • Tim and Graham met with Alex to spec/agree requirements.
  • Going to work on a development branch of Webmark
  • Graham/Tim to liaise over prioritisation, Richard can pick up some of the form changes

Live Chat Service

  • pending - DICE desktop review first

Investigate Managed Mac Service

  • pending - DICE desktop review first

Production Gluster Service

Teaching Software 2018/19

  • initial mailshot done, software page created
  • a few biggies done - AGDA, Java, Eclipse ongoing, PyCharm happening, looking at SDP requirements
  • Haskell going to be more difficult

Merged MLP/MSC Teaching Clusters

  • slurm up and running, looking at how to configure and control access
  • all hardware has arrived, switches in place, need longer power cable
  • DOA file server node reported
  • file system still on MSC GPU nodes; hoping to get the file service up and running on the file servers
    • then need to look at how to transfer the data, which may need a shutdown for the duration of the transfer
    • alternatively, users could be asked to copy their own data across within a limited time frame
  • new head node in place and running fine
  • better advice needed for users wanting to run full IDEs (such as Atom), which are not really suitable for clusters
  • testing NVMe card
  • also looking at a support environment/forum

Unit Small Development Projects

  • Projects to be starred(*) for inclusion as development

Yubikey deployment for Theon

  • Following on from Yubikey implementation:
    • frontend GUI needs to be improved: gdutton working on test server (BAD-RAT)
  • Theon rollout:
    • per-directory / conduit permissions to be deployed: DB/LCFG split responsibility...
    • (2nd-factor) failure mode in each case needs to be approved/documented.
  • timc: re-raise at CEG whether to go ahead with existing implementation

Projects DB changes

  • Major changes underway this year so no meeting yet.
  • New version of DPMT to be migrated to PHP7 / test server on quarter

infdb krb5curl backend 'incoming' service

  • first guinea pig: trialled successfully in sl7rt
  • second guinea pig: webmark -> theon remctl replacement - TODO
  • documentation -- and explanation(!) -- required
    • more of a framework / wrapper than a (very small) tool
    • can be invoked directly by remote server, eliminating mail (sometimes)
    • can be invoked directly by mail server, eliminating remctl
  • some basic configuration required to attach this to 'Incoming'.
    • until this is done, hash strings should maybe live in machine profiles?

infdb backup strategy

  • need to review points of failure / tape strategy, timc/gdutton
    • GDPR gives hard deadline of May
  • related issues to be discussed in meeting to be set up by cms.
  • must discuss backups / retention / continuity of logs (on UI machine, also)
  • TIBS3 will allow this
    • tibs3 component configuration will allow fine-grained streaming of data
  • timc/gdutton to make more detailed action plan this week

postgresql component

  • added replication monitoring support, but it won't work as it needs superuser access
    • new strategy is to run a named function instead (and require client-side changes)
    • knock-on changes required to pgluser (per-rule connections).

Admin Staff / AFS

  • Tied to Windows 10 MDP.
  • changes to webmark to write into CIFS/datastore
    • AFS transition group is investigating group mapping with IS.
      • toby looking into prometheus group mapping conduit
      • webmark functional account now exists
    • gdutton: move files (coordinating with cms/alisond)

UG4 project requests

  • how to elicit complex requirements / major resources (beyond DPMT report)?
    • timc: CEG discussion perhaps

Projects Submission process improvements

  • a few changes required to reduce computing support workload due to exceptional / failed submissions
  • see RT:89190 for details.

Operational Matters

  • Cluster tracking
    • Merged MLP / MSc CompProj:463 pending
    • Physical setup:
      • Rationalised disk / network cabling
      • IPMI mostly configured, consoles all working (probably no BIOS)
      • 3/4 fileservers configured; one DOA (heatsink fault, not our only such fault!)
      • Servers came preconfigured UEFI despite requesting BIOS -- can't boot from RAID controller using legacy BIOS.
      • Head node arrived, installed.
      • Faster storage:
        • 10G switch purchased.
        • Intel 10G server cards handle full offloading and are more than twice as fast as trial ASUS 10G cards.
        • Trialling some 400GB NVMe SSDs for gluster tiering.
    • Cluster management
      • sinfo and squeue give useful overview information.
      • data from sacctmgr will need to be automated and/or procedurally documented.
      • considering fairshare for longer term resource management
  • CDT
    • some server, GPU rationalisation
    • new hardware on its way
    • being converted to slurm whilst not in (heavy?) use.
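
A minimal sketch of how the sacctmgr data noted under Cluster management might be captured automatically. It assumes Slurm's sacctmgr is on the PATH and that a pipe-delimited association listing is the data wanted. The function names, the field selection and the sample rows are illustrative assumptions, not an agreed procedure.

```python
import subprocess

# Fields requested from sacctmgr; this selection is an assumption for illustration.
FIELDS = ["Cluster", "Account", "User", "Fairshare"]


def parse_assoc(text, fields=FIELDS):
    """Turn pipe-delimited sacctmgr output (no header) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        values = line.split("|")
        rows.append(dict(zip(fields, values)))
    return rows


def fetch_assoc(fields=FIELDS):
    """Run sacctmgr and return parsed associations (needs a Slurm install)."""
    cmd = ["sacctmgr", "--noheader", "--parsable2",
           "show", "associations", "format=" + ",".join(fields)]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_assoc(out)


if __name__ == "__main__":
    # Hypothetical sample rows, so the parser can be tried without a cluster.
    sample = "mlp|teaching||1\nmlp|teaching|s1234567|1\n"
    for row in parse_assoc(sample):
        print(row)
```

Keeping the parser separate from the sacctmgr call means it can be checked against saved output, and the result could be dumped to a file on a schedule as a starting point for the procedural documentation.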

  • GPU machines
    • our first GPU "server" was lost last week.
    • Lots of new orders in process.
      • Space is an issue, as usual.
      • Large (Tim H) order hitting all sorts of procurement snags.
        • Harder to order sole cards without assured compatibility.
        • Currently attempting to find a framework (and supplier) to suit (and reuse for all purchases for a period of time).
    • iainr: script in progress to insert GPU data into inventory
      • (iainr suggests CPD core count web page)
    • iainr writing some documentation
      • "how to order" for staff - in progress - computing.help link pending
      • "how to order" for COs - not yet - advice to be added to MPU Ordering page?
    • Ordering GPUs now that licensing is OK.
      • first 3.5" T640 on its way shortly. Two more to come.

  • meringue decant (R730 kernel issues)
    • kernel still pinned
    • good chance to test/document procedures

  • videoconferencing installation (5.02)
    • working - mostly plug and play (on Mac...)
    • Documentation available at http://computing.help.inf.ed.ac.uk/av-502 and in situ
    • Further testing / debugging / documentation required.
    • laptops only just now - podium with MDP required

  • role expiry for students
    • in progress. edge-cases such as multi-year students are tricky. Should be pinned to semester?
    • iainr thinking about automated filespace cleanup - actually, simplistically, purely file age?
      • robinhood trial service running, seems to do what's required
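
A minimal sketch of the "purely file age" idea above, using only the Python standard library. It is not how robinhood works; the function name and the use of modification time as the age measure are assumptions for illustration. It only reports candidates and deletes nothing.

```python
import os
import time


def files_older_than(root, max_age_days, now=None):
    """Yield paths under root whose mtime is more than max_age_days old."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path
```

For example, list(files_older_than("/some/filespace", 365)) would list files untouched for over a year; the path here is a placeholder.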

  • pgteach replacement spec
    • on order.

  • new operational items?

This Week

whole unit

  • RT Purge: TBA


  • Final report for REF
  • free workstation service: android app
  • WFH Tue


  • Cluster wrangling
  • Project reports
  • trying to acquire rwb time
  • WFH Thu


  • Theon & spec
  • DICE desktop review meeting
  • A final report
  • WFH Tue, prob. off Thu.




Topic revision: r283 - 06 Aug 2018 - 13:31:41 - Main.TimColles