RAT Unit Meeting -- 27-August-2018 (Unattended Update)

Development Projects Awaiting Completion

Teaching Software Requirements (16/17)

  • Final report being written tomorrow (iainr)
  • timc: numbers supplied

Procure 200 GPUs for taught students

Teaching Software 2017/2018

Mock REF Review System

  • service now in use - more monitoring would be desirable
  • Final report pending (timc)

Teaching Support Forms and Workflow Enhancements

Development Projects

SL7 Server Project

  • Final report to commence this afternoon (gdutton)
  • Flybrain servers:
    • vfbblog has left SL6 and DICE altogether (wordpress)
    • disk copy has completed (13TB - 2 out of 3 servers) and now incrementally cycling
    • working on new servers vfb 7,8 and 9; porting one to SL7 for reassurance
    • temporarily blocked waiting on filesystem

Free workstation service

  • Service now in use
  • prototype Android app - requires authentication backend
  • may try to hook into the AT lab usage display

DICE desktop review

  • gdutton talked to ascobie; direction still unclear. gdutton will produce a brain dump this week on how to proceed

Production Hadoop Service

  • Currently managing a test service (forked from existing headers).
    • Plan to keep a test cluster.
    • now have a functioning kerberos configuration
    • settled on Hadoop 2.9.1
    • extra resources and keystore handling mechanism added to component
    • test kerberos cluster (rat1 aka nn.exctest, rat2 aka rm.exctest and hadoop.exctest, rat3 to rat7 standard nodes running data node and node manager) released for end user testing
  • TODO:
    • improve startup process - looking at systemd
    • list of ongoing Hadoop problems (best just search https://rt4.inf.ed.ac.uk/User/Summary.html?id=372166 - iainr to create master list)
    • need to think about how to manage the data (1TB+) in HDFS on current Hadoop cluster

Teaching Support Forms etc. Pt 2

  • Graham/Tim to liaise over prioritisation, Richard can pick up some of the form changes

Live Chat Service

  • pending - DICE desktop review first

Investigate Managed Mac Service

  • pending - DICE desktop review first

Production Gluster Service

Teaching Software 2018/19

  • initial mailshot done, software page created
  • a few biggies done - AGDA, Java, loads of Python, Eclipse ongoing, PyCharm happening, looking at SDP requirements
  • Haskell (platform) going to be more difficult - source compilation issues
  • Android Studio: workflow is hard to test, as it wants to update itself within the home directory
  • Deploy the (built and tested) GHC, Haskell Platform, Agda, etc.
  • Decide how to deploy the (built and tested) Java 10; deploy the (built and tested) scenebuilder.
  • Rebuild Xilinx 2015.2 -> 2015.3
  • Ask Garry to test "Zybo" boards.

Merged MLP/MSC Teaching Clusters

  • slurm up and running, looking at how to configure and control access
  • 10gig switch up and running, has separate management port so could either access web page direct or link into infrastructure systems
  • iainr inadvertently tested data recovery for disk partition scenario
  • DOA file server node fixed, OS running (UEFI)
  • file system still on MSC GPU nodes, hoping to get fileserver up and running on file servers
    • transfer of data should not need a shutdown
    • space currently being used for Flybrain decant, but could ultimately share space
  • data from sacctmgr will need to be automated and/or procedurally documented.
    • writing code to generate accounts in the slurm config database
    • considering fairshare for longer term resource management
  • better advice for users wanting to run full IDEs (such as Atom) which are not really suitable for clusters
  • trialling some 400GB NVMe SSDs for gluster tiering.
  • IPMI mostly configured, consoles all working (probably no BIOS)
  • also looking at a support environment/forum
  • still waiting on committee decision on prioritisation rules
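
The account-generation item above (sacctmgr data, fairshare) could be approached along the lines of the following minimal sketch. It only builds sacctmgr command lines from a course-to-students mapping and does not run them. The account names, usernames and fairshare values are hypothetical, and the code actually being written may work quite differently.

```python
import shlex


def sacctmgr_commands(courses, fairshare=None):
    """Build sacctmgr command lines for accounts and their users.

    courses:   dict mapping account name -> list of usernames (hypothetical data).
    fairshare: optional dict mapping account name -> fairshare value.
    """
    fairshare = fairshare or {}
    commands = []
    for account in sorted(courses):
        # -i (immediate) commits without the interactive confirmation prompt.
        cmd = ["sacctmgr", "-i", "add", "account", account]
        if account in fairshare:
            cmd.append(f"Fairshare={fairshare[account]}")
        commands.append(shlex.join(cmd))
        # De-duplicate and sort users so repeated runs give identical output.
        for user in sorted(set(courses[account])):
            commands.append(shlex.join(
                ["sacctmgr", "-i", "add", "user", user, f"Account={account}"]))
    return commands


if __name__ == "__main__":
    # Hypothetical example data: one account with two students.
    for line in sacctmgr_commands({"mlp": ["s1", "s2"]}, {"mlp": 10}):
        print(line)
```

Emitting the commands as text, rather than running them, means the output can be reviewed and kept alongside any written procedure.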

Unit Small Development Projects

  • Projects to be starred(*) for inclusion as development

Yubikey deployment for Theon

  • Following on from Yubikey implementation:
    • frontend GUI needs to be improved: gdutton working on test server (BAD-RAT)
  • Theon rollout:
    • per-directory / conduit permissions to be deployed: DB/LCFG split responsibility...
    • (2nd-factor) failure mode in each case needs to be approved/documented.
  • timc: re-raise at CEG whether to go ahead with existing implementation

Projects DB changes

  • Major changes underway this year so no meeting yet.
  • New version of DPMT to be migrated to PHP7 / test server on quarter

infdb krb5curl backend 'incoming' service

  • first guinea pig: trialled successfully in sl7rt
  • second guinea pig: webmark -> theon remctl replacement - TODO
  • documentation -- and explanation(!) -- required
    • more of a framework / wrapper than a (very small) tool
    • can be invoked directly by remote server, eliminating mail (sometimes)
    • can be invoked directly by mail server, eliminating remctl
  • some basic configuration required to attach this to 'Incoming'.
    • until this is done, hash strings should maybe live in machine profiles?

infdb backup strategy

  • need to review points of failure / tape strategy, timc/gdutton
    • GDPR gives hard deadline of May
  • related issues to be discussed in meeting to be set up by cms.
  • must discuss backups / retention / continuity of logs (on UI machine, also)
  • TIBS3 will allow this
    • tibs3 component configuration will allow fine-grained streaming of data
  • timc/gdutton to make more detailed action plan this week
  • look at implementing ISSRT retention rules

postgresql component

  • added replication monitoring support, but it won't work as it needs superuser access
    • new strategy is to run a named function instead (and require client-side changes)
    • knock-on changes required to pgluser (per-rule connections).

Admin Staff / AFS

  • Tied to Windows 10 MDP.
  • changes to webmark to write into CIFS/datastore
    • AFS transition group is investigating group mapping with IS.
      • toby looking into prometheus group mapping conduit
      • webmark functional account now exists
    • gdutton: move files (coordinating with cms/alisond)

UG4 project requests

  • how to elicit complex requirements / major resources (beyond DPMT report)?
    • timc: refer decision back to Tom (mandatory drop-down for default resources)

Projects Submission process improvements

  • a few changes required to reduce computing support workload due to exceptional / failed submissions
  • see RT:89190 for details.
  • look at wishlist (gdutton/timc to meet Wed 8th 2PM)

Operational Matters

  • CDT
    • some server, GPU rationalisation
    • new hardware arrived, head node up and running, file servers installed but need to be redone as UEFI, juggling to cover data transfers
    • switch still to be ordered - waiting to see what performance improvements are on MLP cluster
    • now running Slurm on all nodes with 4 GPUs, should eventually fall down to all but 4 of the (older) GPU nodes
    • need to discuss queues and prioritisation with Amos
    • GPU cluster fileservers are up and data is being copied onto them: http://812nas.inf.ed.ac.uk/ganglia

  • Facebook Gift
    • may have a fault, Ian does not know who to contact
      • iainr to investigate this week
      • timc: discussed with Alastair
      • passed over to Stephen; it seems to be some kind of twin unit, with one of the twins suffering an NMI error. Nothing further heard back.

  • GPU machines
    • our first GPU "server" was lost last week.
    • Large stack of GPU servers waiting for installation in server room
      • Richard/Iain will do as possible
    • Lots of new orders in process.
      • Space is an issue, as usual.
      • Large (Tim H) order - now have quotes, need to contact procurement on how to proceed, will go out this week
        • Harder to order sole cards without assured compatibility.
        • Currently attempting to find a framework (and supplier) to suit (and reuse for all purchases for a period of time).
      • Someone else looking for machines that can deal with 8 GPUs
      • Investigating options for installing a further set of GPU servers for teaching
    • iainr: script in progress to insert GPU data into inventory
      • (iainr suggests CPD core count web page)
    • iainr writing some documentation
      • "how to order" for staff - in progress - computing.help link pending
      • "how to order" for COs - not yet - advice to be added to MPU Ordering page?
    • Ordering GPUs now that licensing is OK.
      • first 3.5" T640 on its way shortly. Two more to come.
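
As an illustration of the GPU inventory script mentioned above, here is a minimal sketch that turns nvidia-smi CSV output into records ready for insertion. The record layout, the hostname and the choice of queried fields are assumptions; the real inventory schema and iainr's script are not described in these notes.

```python
import csv
import io

# Assumed collection command, run on each GPU host:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits


def parse_gpu_report(hostname, report):
    """Turn nvidia-smi CSV output into a list of inventory records (dicts)."""
    records = []
    reader = csv.reader(io.StringIO(report), skipinitialspace=True)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, memory_mib = (field.strip() for field in row)
        records.append({
            "host": hostname,  # hypothetical field names throughout
            "gpu_index": int(index),
            "model": name,
            "memory_mib": int(memory_mib),
        })
    return records
```

Keeping the parsing separate from both the collection command and the inventory insert means it can be tested against captured output without needing a GPU host.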

  • meringue decant (R730 kernel issues)
    • kernel still pinned
    • good chance to test/document procedures

  • videoconferencing installation (5.02)
    • working - mostly plug and play (on Mac...)
    • Documentation available at http://computing.help.inf.ed.ac.uk/av-502 and in situ
    • Further testing / debugging / documentation required.
    • laptops only just now - podium with MDP required
    • need to order a sound bar
    • possibly same setup now needed in 4.02 following ATI move to Bayes

  • role expiry for students
    • in progress. edge-cases such as multi-year students are tricky. Should be pinned to semester?
    • iainr thinking about automated filespace cleanup - actually, simplistically, purely file age?
      • robinhood trial service running, seems to do what's required
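
The "purely file age" idea above can be sketched in a few lines. This is not how robinhood works, only a simplistic stand-in that lists files whose modification time is older than a threshold; the threshold and the choice of mtime (rather than atime) are assumptions, and nothing is deleted.

```python
import os
import time


def stale_files(root, max_age_days, now=None):
    """Yield paths under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged by their own age.
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

A report-only pass like this would allow the candidate list to be reviewed (for example against multi-year students) before any cleanup policy is applied.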

  • pgteach replacement spec
    • installed (teacake), service to be migrated (rwb)

This Week

whole unit

  • RT Purge: Done, very successful.
  • BAD-RAT-VII: TBA

timc

  • Final report for REF
  • Hadoop
  • course work timings and Learn

iainr

  • final report
  • android studio
  • slinging servers
  • sorting out mscteaching cluster fileservers
  • WFH Tue/Thu

gdutton

  • teaching software
  • TSP forms
  • WFH Tue

rwb

  • helping iainr
  • resits prep

aburford

AOCB

Topic revision: r285 - 03 Sep 2018 - 08:00:58 - Main.TimColles
 