RAT Unit Meeting -- 17-September-2018 (Unattended Update)

Development Projects Awaiting Completion

Procure 200 GPUs for taught students

  • finished - final report
  • timc: tick off remaining deliverables

Teaching Software 2017/2018

Mock REF Review System

  • Final report pending (timc)

Teaching Support Forms and Workflow Enhancements

Development Projects

SL7 Server Project

  • Final report to commence this afternoon (gdutton)
  • Flybrain servers:
    • vfbblog has left SL6 and DICE altogether (wordpress)
    • disk copy has completed (13TB - 2 out of 3 servers) and now incrementally cycling
    • working on new servers vfb 7,8 and 9; porting one to SL7 for reassurance
    • temporarily blocked waiting on filesystem

Free workstation service

  • Service now in use
  • prototype Android app - requires authentication backend
    • working on auth still, though not a rollout requirement
  • may try and hook into AT lab usage display
  • find out how to deploy through Uni Google Play account, or get School account

DICE desktop review

  • gdutton talked to ascobie; direction unclear. gdutton will produce a brain dump this week on how to proceed

Production Hadoop Service

  • Currently managing a test service (forked from existing headers).
    • Plan to keep a test cluster.
    • test kerberos cluster (rat1 aka nn.exctest, rat2 aka rm.exctest and hadoop.exctest, rat3 to rat7 standard nodes running data node and node manager) released for end user testing
      • still awaiting user testing and feedback
  • TODO:
    • improve startup process - looking at systemd
    • list of ongoing Hadoop problems (best just search https://rt4.inf.ed.ac.uk/User/Summary.html?id=372166 - iainr to create master list)
    • need to think about how to manage the data (1TB+) in HDFS on current Hadoop cluster
      • could try hdfs copy or try upgrading to 2.9.1
      • for now iain will copy off content (for users still here), use msc cluster for temporary storage, then try upgrade in place

Teaching Support Forms etc. Pt 2

  • Richard can pick up some of the form changes
  • Tim to create data sets for some forms

Live Chat Service

  • pending - DICE desktop review first

Investigate Managed Mac Service

  • pending - DICE desktop review first

Production Gluster Service

Teaching Software 2018/19

  • Haskell platform done but not deployed (S2 requirement)
  • AndroidStudio done - students still likely to need increased disk quota
  • GHC done
  • Java 10 done; scenebuilder built and tested, still to be deployed.
  • Rebuild Xilinx 2015.2 -> 2015.3 - done, looks OK
  • Garry has test "Zybo" boards - all ok

Merged MLP/MSC Teaching Clusters

  • slurm up and running, looking at how to configure and control access
  • 10gig switch up and running, has separate management port so could either access web page direct or link into infrastructure systems
  • iainr inadvertently tested data recovery for disk partition scenario
  • file system still on MSC GPU nodes, hoping to get fileserver up and running on file servers
    • transfer of data should not need a shutdown
    • space currently being used for Flybrain decant, but could ultimately share space
  • data from sacctmgr will need to be automated and/or procedurally documented.
    • writing code to generate accounts in the slurm config database
    • considering fairshare for longer term resource management
  • better advice for users wanting to run full IDEs (such as Atom) which are not really suitable for clusters
  • trialling some 400GB NVMe SSDs for gluster tiering.
  • IPMI mostly configured, consoles all working (probably no BIOS)
  • also looking at a support environment/forum
  • still waiting on committee decision on prioritisation rules
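The account-generation item above (code to create accounts in the Slurm accounting database) could look roughly like the sketch below. It is only an illustration, not the code being written: the mapping of course groups to usernames and the description text are hypothetical, and the commands are printed for review rather than executed. It relies only on the standard `sacctmgr add account` and `sacctmgr add user` forms, with `-i` to skip the confirmation prompt.

```python
import shlex


def sacctmgr_commands(accounts):
    """Build sacctmgr command lines from a mapping of account name -> usernames.

    The commands are returned as strings rather than executed, so they can be
    reviewed (or logged) before being run on the Slurm controller.
    """
    commands = []
    for account, users in sorted(accounts.items()):
        # -i (--immediate) commits without the interactive confirmation prompt
        commands.append(
            "sacctmgr -i add account {} Description={}".format(
                shlex.quote(account), shlex.quote("teaching cluster account")
            )
        )
        for user in sorted(users):
            commands.append(
                "sacctmgr -i add user {} Account={}".format(
                    shlex.quote(user), shlex.quote(account)
                )
            )
    return commands


# Hypothetical course groups and usernames, for illustration only
example = {"mlp": ["s1234567", "s7654321"], "msc": ["s1111111"]}
for line in sacctmgr_commands(example):
    print(line)
```

Generating the commands as text keeps the step easy to document procedurally: the same output can be checked by hand, kept in a log, or piped to a shell once access rules are agreed.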

Unit Small Development Projects

  • Projects to be starred(*) for inclusion as development

Yubikey deployment for Theon

  • Following on from Yubikey implementation:
    • frontend GUI needs to be improved: gdutton working on test server (BAD-RAT)
  • Theon rollout:
    • per-directory / conduit permissions to be deployed: DB/LCFG split responsibility...
    • (2nd-factor) failure mode in each case needs to be approved/documented.
  • timc: re-raise at CEG whether to go ahead with existing implementation

Projects DB changes

  • Major changes underway this year so no meeting yet.
  • New version of DPMT to be migrated to PHP7 / test server on quarter

infdb krb5curl backend 'incoming' service

  • first guinea pig: trialled successfully in sl7rt
  • second guinea pig: webmark -> theon remctl replacement - TODO
  • documentation -- and explanation(!) -- required
    • more of a framework / wrapper than a (very small) tool
    • can be invoked directly by remote server, eliminating mail (sometimes)
    • can be invoked directly by mail server, eliminating remctl
  • some basic configuration required to attach this to 'Incoming'.
    • until this is done, hash strings should maybe live in machine profiles?

infdb backup strategy

  • need to review points of failure / tape strategy, timc/gdutton
    • GDPR gives hard deadline of May
  • related issues to be discussed in meeting to be set up by cms.
  • must discuss backups / retention / continuity of logs (on UI machine, also)
  • TIBS3 will allow this
    • tibs3 component configuration will allow fine-grained streaming of data
  • timc/gdutton to make more detailed action plan this week
  • look at implementing ISSRT retention rules

postgresql component

  • added replication monitoring support, but it won't work as it needs superuser access
    • new strategy is to run a named function instead (and require client-side changes)
    • knock-on changes required to pgluser (per-rule connections).

Admin Staff / AFS

  • Tied to Windows 10 MDP.
  • changes to webmark to write into CIFS/datastore
    • AFS transition group is investigating group mapping with IS.
      • toby looking into prometheus group mapping conduit
      • webmark functional account now exists
    • gdutton: move files (coordinating with cms/alisond)

UG4 project requests

  • how to elicit complex requirements / major resources (beyond DPMT report)?
    • timc: refer decision back to Tom (mandatory drop-down for default resources)

Projects Submission process improvements

  • a few changes required to reduce computing support workload due to exceptional / failed submissions
  • see RT:89190 for details.
  • webmark data connection fixed for msc, should do for ug4 as well

Operational Matters

  • CDT
    • file servers installed with UEFI and running as home directory providers, still some data transfers
    • switch still to be ordered - waiting to see what performance improvements are on MLP cluster
    • now running Slurm on all nodes apart from two which have old homedir data
    • need to discuss queues and prioritisation with Amos
    • http://812nas.inf.ed.ac.uk/ganglia
    • all charles nodes up to 7.5, needed re-installation due to lack of space - now have 160GB root partition, some still re-installing
    • hit odd problem which turns out to be kernel bug in cgroups, Slurm configuration fix applied
    • either because of the above or because of 7.5, nodes are now being randomly drained ... needs a periodic check and a switch out of drain state
    • ready to go - some support infrastructure to do
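The periodic check for drained nodes mentioned above could be sketched as follows. This is an assumption about how such a check might be done, not a description of what is in place: it expects text in the shape produced by `sinfo -h -N -t drain -o "%N %E"` (node name followed by the drain reason) and builds the matching `scontrol update ... State=RESUME` commands without running them. The sample input and node names are made up.

```python
def resume_commands(sinfo_output):
    """Turn 'node reason' lines into scontrol commands that clear the drain state.

    Returns a list of (node, reason, command) tuples, one per distinct node.
    """
    commands = []
    seen = set()
    for line in sinfo_output.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        node = parts[0]
        if node in seen:
            continue  # a node in several partitions is listed once per partition
        seen.add(node)
        reason = parts[1].strip() if len(parts) > 1 else "none"
        command = "scontrol update NodeName={} State=RESUME".format(node)
        commands.append((node, reason, command))
    return commands


# Made-up sample input; node names and reasons are hypothetical
sample = "nodeA sample reason\nnodeB sample reason\nnodeA sample reason\n"
for node, reason, command in resume_commands(sample):
    print("{} ({}): {}".format(node, reason, command))
```

Keeping the reason alongside each command makes it easier to spot a node that is drained for a genuine fault rather than the cgroups issue before resuming it.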

  • Facebook Gift
    • Stephen has both nodes up and running 7.5 but no other software as yet, waiting for response from Amos on what he needs

  • GPU machines
    • rendlesham GPU "tower server" was lost last week.
      • Stephen to recover disk and GPUs
    • Large stack of GPU servers waiting for installation in server room
      • Richard/Iain done 9 so far, 3 more to do
    • Lots of new orders in process.
      • Large (Tim H) order - gone out - some supplier issues with components
        • Harder to order sole cards without assured compatibility.
        • Currently attempting to find a framework (and supplier) to suit (and reuse for all purchases for a period of time).
      • Someone else looking for machines that can deal with 8 GPUs - gone for t630s
        • procurement process exists for this, bigger orders should probably be aggregated
      • Investigating options for installing a further set of GPU servers for teaching
    • iainr: script in progress to insert GPU data into inventory
      • (iainr suggests CPD core count web page)
    • iainr writing some documentation
      • "how to order" for staff - in progress - computing.help link pending
      • "how to order" for COs - not yet - advice to be added to MPU Ordering page?
    • iain, in a consultancy role, fixed another school's GPU problem

  • meringue decant (R730 kernel issues)
    • kernel still pinned
    • good chance to test/document procedures

  • videoconferencing installation (5.02)
    • working - mostly plug and play (on Mac...)
    • Documentation available at http://computing.help.inf.ed.ac.uk/av-502 and in situ
    • Further testing / debugging / documentation required.
    • laptops only just now - podium with MDP required
    • need to order a sound bar
    • possibly same setup now needed in 4.02 following ATI move to Bayes

  • role expiry for students
    • in progress. edge-cases such as multi-year students are tricky. Should be pinned to semester?
    • iainr thinking about automated filespace cleanup - actually, simplistically, purely file age?
      • robinhood trial service running, seems to do what's required
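The simplistic "purely file age" idea above can be illustrated with a short report-only sketch. It is not how robinhood works and is not a proposed implementation; the one-year threshold is an arbitrary example value, and nothing is deleted.

```python
import os
import stat
import tempfile
import time


def files_older_than(root, max_age_days, now=None):
    """Yield (path, age_in_days) for regular files under root whose
    modification time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets and other special files
            if info.st_mtime < cutoff:
                yield path, (now - info.st_mtime) / 86400


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo:
    stale = os.path.join(demo, "stale.txt")
    open(stale, "w").close()
    long_ago = time.time() - 400 * 86400
    os.utime(stale, (long_ago, long_ago))  # pretend the file is 400 days old
    for path, age in files_older_than(demo, 365):
        print("{:.0f} days  {}".format(age, os.path.basename(path)))
```

Modification time is used here as the age measure; access time would be an alternative but is often unreliable on filesystems mounted with `noatime`.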

  • pgteach replacement spec
    • installed (teacake), service to be migrated (rwb)

  • AWS EC2
    • investigating costs, may be worth checking with Kenneth

  • Macro DICE_INHIBIT_LARGE_RAT_PACKAGES added to reduce the RAT package footprint by pulling large packages out, with an easy way to add them back

  • New minor Postgres release on stable next week

This Week

whole unit

  • RT Purge: Done, very successful.
  • BAD-RAT-VII: TBA

timc

  • Final report for REF
  • AWS EC2
  • course work pages and Learn
  • android app deployment

iainr

  • slinging servers (Wed)
  • sorting out mscteaching and CDT cluster nodes
  • WFH Tue/Thu

gdutton

  • TSP forms
  • WFH Thu

rwb

  • helping iainr AND gdutton

aburford

AOCB

Topic revision: r286 - 17 Sep 2018 - 14:18:52 - Main.TimColles
 