Slurm: Care and Feeding

Note that this page is currently being developed; contact iainr@infREMOVE_THIS.ed.ac.uk for more information.

This page covers the operation of Slurm on the clusters in use in the School.

Overview

We are running a scheduler on the CDT and teaching clusters in order to manage use. The compute nodes run a daemon (slurmd) which runs the jobs handed out by the scheduler running on a separate admin machine. The admin machine also runs a database to log job data for reporting and for calculating job priorities. The admin nodes are currently 902nas for the teaching cluster and 812nas for the CDT cluster. User access is via head nodes (mlp, mlp1 and mlp2 for the teaching cluster; cdt1 and cdt2 for the CDT cluster). The nodes are only sshable from a select number of DICE machines, including the admin machines.

Authentication

Slurm authentication is based on [[https://dun.github.io/munge/][munge]]. This is set up to restart on boot, so there should be a munged daemon running. The only configuration required is the installation of a key file in /etc/munge/munge.key, which should be owned by user munge and group munge. Lack of the key will stop the daemons communicating with the scheduler, stop jobs from running on that node, and stop the command line tools from working. Obviously lack of a key on the scheduler will mean that nothing works. All the nodes use the same key.
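
A quick way to check all of this on a node (assuming the systemd unit is called munge) is:
systemctl status munge
ls -l /etc/munge/munge.key     # expect something like: -r-------- 1 munge munge
munge -n | unmunge             # round-trip test, should end with STATUS: Success (0)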

Aggregate syntax

Throughout the configuration and the utilities Slurm may use an aggregate syntax like landonia[01-04,15]; this refers to nodes landonia01 through landonia04 plus landonia15.
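
If you need to expand one of these expressions, scontrol can do it for you (the quotes just stop the shell interpreting the brackets):
scontrol show hostnames 'landonia[01-04,15]'   # prints landonia01..landonia04 and landonia15, one per line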

Scheduler

The scheduler process running on the admin machine is /usr/sbin/slurmctld. It should be restarted on boot via systemd. Logging should go to /var/log/slurm/slurmctld.log (NB it's /var/log/slurmctld.log on 812nas... awaiting a change). The scheduler will halt if restarted with an incorrect configuration file. The logging is quite good and can be used to track what's happening in real time.
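
If the scheduler needs restarting after a config change, a rough sketch of the usual sequence on the admin machine (assuming the systemd unit is called slurmctld) is:
systemctl restart slurmctld
tail -f /var/log/slurm/slurmctld.log   # watch for errors; the daemon halts on a bad slurm.conf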

Database

The job database is a MySQL database and data is injected via the slurmdbd daemon. If this is not running jobs will still be scheduled, but the scheduler will probably fall back to FIFO mode, or continue trying to do fair share scheduling without historical data; to be honest I'm not sure which.
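
A quick health check for the accounting side (run on the admin machine, assuming the systemd unit is called slurmdbd) is to make sure the daemon is up and that the accounting tools can reach the database through it:
systemctl status slurmdbd
sacctmgr show cluster    # goes via slurmdbd; it will complain if the database connection is broken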

Configuration

There are a number of configuration files, all of which should be in /etc/slurm on all nodes. The configuration should be identical on all nodes apart from the slurmdbd.conf file, which is only needed on the admin machine. All of the configuration files we're actually using are currently deployed using the file component (teachingcluster.conf and cdtscheduler.h).

slurm.conf

This is a large configuration file which defines the configuration of the cluster. Specifically it defines the expected hardware configuration of the compute nodes and the partitions (queues):

# COMPUTE NODES
NodeName=landonia[01-25] CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=96000 Gres=gpu:8
NodeName=letha[01-04] CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 Gres=gpu:4 
NodeName=letha05 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000
NodeName=letha06 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128569
NodeName=marax CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=64000
PartitionName=Interactive Nodes=landonia[01,25] Default=NO MaxTime=2:0:0 State=UP
PartitionName=Standard Nodes=landonia[04-09,11-17,20,22-24] Default=YES MaxTime=8:0:0 State=UP
PartitionName=Short Nodes=landonia[02,18] Default=NO MaxTime=4:0:0 State=UP
PartitionName=LongJobs Nodes=landonia[03,04,10,11,19,21] Default=NO MaxTime=3-8 State=UP
PartitionName=MSC Nodes=letha[01-06],marax  Default=NO MaxTime=3-8 State=UP
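
After changing slurm.conf (remember it has to be the same on every node) the running daemons can usually be told to reread it rather than being restarted:
scontrol reconfigure    # asks slurmctld and the slurmd daemons to reread slurm.conf
Some changes, such as adding or removing nodes, may still need the daemons restarted.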

The slurmd daemon on a compute node will check the node's hardware against this configuration and, if they don't match, will either stop running or refuse to run jobs when they arrive. The short-term fix is to down the node (see later) until the configuration can be fixed.
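
To see what slurmd actually detects on a compute node (for comparison with the NodeName lines above) you can run, on the node itself:
slurmd -C    # prints the detected hardware in slurm.conf NodeName= format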

gres.conf

This is used for requestable resources that we define ourselves. In our case this means the GPUs on the cluster, and it looks like:

NodeName=landonia[01-19] Name=gpu File=/dev/nvidia0 CPUs=0-4
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia1 CPUs=4-8
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia2 CPUs=8-11
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia3 CPUs=0-4
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia4 CPUs=4-8
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia5 CPUs=8-11
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia6 CPUs=0-4
NodeName=landonia[01-19] Name=gpu File=/dev/nvidia7 CPUs=8-11
NodeName=letha[01-06] Name=gpu File=/dev/nvidia0 CPUs=0-3
NodeName=letha[01-06] Name=gpu File=/dev/nvidia1 CPUs=4-7
NodeName=letha[01-06] Name=gpu File=/dev/nvidia2 CPUs=8-11
NodeName=letha06 Name=gpu File=/dev/nvidia3 CPUs=12-15
NodeName=letha06 Name=gpu File=/dev/nvidia4 CPUs=16-19
NodeName=letha06 Name=gpu File=/dev/nvidia5 CPUs=20-23
NodeName=letha06 Name=gpu File=/dev/nvidia6 CPUs=24-27
NodeName=letha06 Name=gpu File=/dev/nvidia7 CPUs=28-32


Note that all nodes have the configuration for the whole cluster, and that you can aggregate nodes using the [] syntax. The syntax of the file is:

NodeName=<name of node> Name=<name of resource> File=<file associated with resource> CPUs=<cpu affinity for resource>
The CPU affinity should match the NUMA affinity of the PCI slot that the GPU uses. At the moment this may not be correct, but it won't cause errors; it's just suboptimal.
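
Where the NVIDIA tools are installed, the affinity the CPUs= ranges should ideally follow can be read off the topology report on the node:
nvidia-smi topo -m    # the "CPU Affinity" column shows which cores sit closest to each GPU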

slurmdbd.conf

This contains the configuration for the slurmdbd daemon to interact with the MySQL database; notably, the username and password used to connect are stored here.
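
As a rough sketch (the values here are placeholders rather than our real settings), the relevant entries look something like:
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=XXXXXXXX
Because of the password the file should only be readable by the slurm user, which is also why it only lives on the admin machine.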

Information Commands

There are a number of commands that give information about the cluster; these are available on all nodes.

sinfo

This lists information about the cluster hardware and the partitions
[802nas]iainr: sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up    2:00:00      2   idle landonia[01,25]
Standard*      up    8:00:00     16    mix landonia[04-09,11-17,20,23-24]
Standard*      up    8:00:00      1   idle landonia22
Short          up    4:00:00      1    mix landonia02
Short          up    4:00:00      1   idle landonia18
LongJobs       up 3-08:00:00      5    mix landonia[03-04,10-11,19]
LongJobs       up 3-08:00:00      1   idle landonia21
MSC            up 3-08:00:00      4  down* letha[01-02,04,06]
MSC            up 3-08:00:00      1  drain letha03
MSC            up 3-08:00:00      2   down letha05,marax
[802nas]iainr: 
This shows that there are a number of partitions (Interactive, Standard (the * indicates this is the default), Short, LongJobs and MSC). AVAIL shows whether the partition is up and accepting jobs. TIMELIMIT is the maximum time a job can run for on a node. NODES is the number of nodes in that partition in a given state. STATE is the state those nodes are in: we want idle (not in use), alloc (fully in use) or mix (partly used); we don't want down (slurmd has a problem and has stopped accepting jobs). Any state with a * against it is the last known state for the node, as connectivity to the node has been lost. Finally, NODELIST is the list of nodes in that partition in that state.
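
When nodes are down or draining, sinfo can also show the reason that was recorded when their state was changed:
sinfo -R    # lists down/drained nodes with the reason, the user who set it and a timestamp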

squeue

This is used to list jobs currently running in the queue and those pending.
[802nas]iainr: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            233877       MSC script.s sYYYYYYY PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:letha[01-06])
            235775  Standard cal11.sh sZZZZZZZ PD       0:00      1 (AssocMaxJobsLimit)
            235776  Standard cal12.sh sZZZZZZZ PD       0:00      1 (AssocMaxJobsLimit)
            235777  Standard cal13.sh sZZZZZZZ PD       0:00      1 (AssocMaxJobsLimit)
            235778  Standard cal14.sh sZZZZZZZ PD       0:00      1 (AssocMaxJobsLimit)
            232871  LongJobs train_sh sXXXXXXX  R    4:31:37      1 landonia10
            235868  Standard compute_ sAAAAAAA  R       0:13      1 landonia15
            235867  Standard train.6. sAAAAAAA  R       0:19      1 landonia09
            235866  Standard train.6. sAAAAAAA  R       0:30      1 landonia07
            235865  Standard train.6. sAAAAAAA  R       0:35      1 landonia23
            235859  Standard train.6. sAAAAAAA  R       1:48      1 landonia12
            235858  Standard train.6. sAAAAAAA  R       1:59      1 landonia14
            235856  Standard train.6. sAAAAAAA  R       2:04      1 landonia17
            235852  Standard train.29 sBBBBBBB  R       2:25      1 landonia05
            235844  Standard train.4. sAAAAAAA  R       5:32      1 landonia20
            235843  Standard train.4. sAAAAAAA  R       5:54      1 landonia13
            235773  Standard  cal9.sh sZZZZZZZ  R      22:36      1 landonia24
            235774  Standard cal10.sh sZZZZZZZ  R      22:36      1 landonia24
            235771  Standard  cal7.sh sZZZZZZZ  R      23:36      1 landonia16
            235772  Standard  cal8.sh sZZZZZZZ  R      23:36      1 landonia16
            235769  Standard  cal5.sh sZZZZZZZ  R      23:41      1 landonia06
            235770  Standard  cal6.sh sZZZZZZZ  R      23:41      1 landonia06
            235768  Standard  cal4.sh sZZZZZZZ  R      24:04      1 landonia22
            235766  Standard  cal2.sh sZZZZZZZ  R      24:07      1 landonia11
            235767  Standard  cal3.sh sZZZZZZZ  R      24:07      1 landonia22
            235765  Standard  cal1.sh sZZZZZZZ  R      24:10      1 landonia11
            235735  Standard  exec.sh sCCCCCCC  R      29:09      1 landonia08
            235233  LongJobs sub_run_ sDDDDDDD  R    1:27:25      1 landonia04
            235207  Standard rnn_mode sEEEEEEE  R    1:42:00      1 landonia04
            235014     Short emnist_s sFFFFFFF  R    2:34:02      1 landonia02
            235008  LongJobs lre_tr_1 sBBBBBBB  R    3:11:36      1 landonia03
            225665  LongJobs train_ma sGGGGGGG  R 1-13:17:44      1 landonia04
            222632  LongJobs rsgan.sh sHHHHHHH  R 2-02:41:11      1 landonia19
            222631  LongJobs rsgan.sh sHHHHHHH  R 2-02:41:26      1 landonia19
[802nas]iainr: 

This shows the current queue: various users have jobs running on various nodes. Note that since not all jobs ask for all the resources on a node, multiple jobs can run on the same node. The jobs are listed in order of priority; at the head are a number of jobs which have not been allocated because their requests can't currently be met.

  233877       MSC script.s sYYYYYYY PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:letha[01-06])
In this case none of the nodes in the MSC partition are available, either because they are down or because it's not possible to run the job on the available resources.
235775  Standard cal11.sh sZZZZZZZ PD       0:00      1 (AssocMaxJobsLimit)
In this case the user has hit the concurrent jobs limit (10 on the teaching cluster at the time of writing) and cannot have any more jobs running at once. These jobs will get reassessed at the next scheduling pass and will run when resources become available.
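
squeue takes the usual filtering options, which is handy when chasing one user's jobs (the username here is just the one from the listing above):
squeue -u sZZZZZZZ             # only that user's jobs
squeue -u sZZZZZZZ --start     # the scheduler's current estimate of when the pending jobs will start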

sprio

sprio shows the priority of pending jobs

[802nas]iainr: sprio
          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION
         233877 MSC             1046         30          0         17       1000
         235942 Standard        1016          0          0         17       1000
[802nas]iainr: 

Note that jobs are allocated by priority, highest first. The scheduler will then apply any conditions, like the maximum concurrent jobs limit, to exclude jobs, and allocate the remaining jobs to nodes in priority order.
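
To see how the individual factors are weighted on a cluster you can ask sprio for the configured weights:
sprio -w    # prints the weight given to each priority factor (age, fairshare, job size, partition and so on)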

sshare

sshare shows information about fairshare usage. Slurm tracks fairshare usage by individual user and by aggregating users into accounts. In our case the accounts are mapped to course names (module-* roles), so we can, say, split the usage equally over two courses. Individual users may be associated with multiple accounts, but each has a default account which gets used unless they specify otherwise.

[802nas]root: sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          1.000000     6678375      1.000000   0.500000 
 root                      root          1    0.250000           0      0.000000   1.000000 
 cos                                     1    0.250000         290      0.000044   0.999879 
 informatics                             1    0.250000     6678084      0.999956   0.062508 
  degree-phd                             1    0.003968           0      0.022726   0.018880 
  module-mlp                            20    0.079365     6678084      0.999956   0.000161 
  module-rl                             20    0.079365           0      0.454526   0.018880 
  module-slp                            20    0.079365           0      0.454526   0.018880 
  project_student                        1    0.003968           0      0.022726   0.018880 
  staff                                  1    0.003968           0      0.022726   0.018880 
 none                                    1    0.250000           0      0.000000   1.000000 
[802nas]root: 
This shows the high-level accounts, the allocation of shares, the fairshare priorities worked out from the share allocation, and the usage on each account. For a more detailed breakdown you can use sshare -a:


             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare  
-------------------- ---------- ---------- ----------- ----------- ------------- ----------  
root                                          1.000000     6679657      1.000000   0.500000  
 root                      root          1    0.250000           0      0.000000   1.000000  
 cos                                     1    0.250000         290      0.000044   0.999879  
  cos                     iainr          1    0.250000         290      0.000044   0.999879  
 informatics                             1    0.250000     6679366      0.999956   0.062508  
  degree-phd                             1    0.003968           0      0.022726   0.018880  
   degree-phd          sXXXXXXX          1    0.000011           0      0.000064   0.018880  
   degree-phd          sAAAAAAA          1    0.000011           0      0.000064   0.018880  
   degree-phd          sBBBBBBB          1    0.000011           0      0.000064   0.018880  
   degree-phd          sCCCCCCC          1    0.000011           0      0.000064   0.018880  
   degree-phd          sDDDDDDD          1    0.000011           0      0.000064   0.018880  
...                                                                                       
  module-mlp                            20    0.079365     6679366      0.999956   0.000161  
   module-mlp          sXXXXXXX         20    0.000062           0      0.000777   0.000161  
   module-mlp          sFFFFFFF         20    0.000062           0      0.000777   0.000161  
   module-mlp          sGGGGGGG         20    0.000062       25557      0.004600   0.000000  
... 


Note that in this case user sXXXXXXX has entries under two accounts. Job usage isn't cumulative across accounts; a job is charged either to the user's default account or to a specific account given in the batch script or on the command line.
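
For example, a user who wants a job charged to a different account from their default can say so at submission time (the account name here is just one of ours from the table above, and myjob.sh is a made-up script name):
sbatch --account=module-mlp myjob.sh
or inside the batch script itself:
#SBATCH --account=module-mlp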

Admin commands

Note that some parts of the admin commands can be run by ordinary users.

scontrol

scontrol is a large aggregate command for manipulating and getting information about the nodes, partitions and jobs. Some subcommands are available to everyone, some only to root. We mainly use it to manipulate the state of the nodes via scontrol update. Common functions are:
  • scontrol update nodename= state= [reason=] changes the node state and, where required, records a reason for the change

  • scontrol show jobid -dd <jobid> shows detailed information about a job; see the examples below.
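
Some examples of the read-only queries, which should be safe to run as an ordinary user (the node and partition names are just taken from the configuration above):
scontrol show node landonia04        # hardware, state and any recorded reason for a node
scontrol show partition Standard     # limits and membership of a partition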

Common Issues

A cheatsheet for common issues

node X is down

If a node is marked as down then there has been an issue with slurmd on the node.
  • First, check the node is up; reboot it if the OOM killer has been at work.
  • Secondly, check that munged is running.
  • Thirdly, restart slurmd (and deal with any config issues).
  • Finally, tell the scheduler the node is ok.
Fixing the first three will remove the * from the node in sinfo; for the last step we need to run
scontrol update nodename=<node> state=idle

We need to reboot/work on a node but it's got jobs running on it.

We can use the scontrol command to put the node into a state called draining (drng): this will mark the node as drained once the jobs currently running on it finish, and will immediately stop the scheduler allocating new jobs to it. You will have to give a reason for the nodes going into that state using the reason tag, and you'll have to reset the state to idle when the machine comes back into use. You can use the pattern matching to put multiple machines into the drain state at once. You need to be logged in as root on the admin machine to do this.

[802nas]root: scontrol update nodename=landonia[12-17] state=drain reason="drac issues" 
[802nas]root: sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up    2:00:00      2   idle landonia[01,25]
Standard*      up    8:00:00      6   drng landonia[12-17]
Standard*      up    8:00:00      8    mix landonia[04-05,07-08,11,20,22,24]
Standard*      up    8:00:00      3   idle landonia[06,09,23]
Short          up    4:00:00      1    mix landonia02
Short          up    4:00:00      1   idle landonia18
LongJobs       up 3-08:00:00      6    mix landonia[03-04,10-11,19,21]
MSC            up 3-08:00:00      1  down* letha06
MSC            up 3-08:00:00      1   idle marax
We can remove the nodes from the drain state by running:
[802nas]root: scontrol update nodename=landonia[12-17] state=resume
[802nas]root: sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up    2:00:00      2   idle landonia[01,25]
Standard*      up    8:00:00     16    mix landonia[04-09,11-12,14-17,20,22-24]
Standard*      up    8:00:00      1   idle landonia13
Short          up    4:00:00      1    mix landonia02
Short          up    4:00:00      1   idle landonia18
LongJobs       up 3-08:00:00      6    mix landonia[03-04,10-11,19,21]
MSC            up 3-08:00:00      1  down* letha06
MSC            up 3-08:00:00      1   idle marax
[802nas]root: 

-- Main.IainRae - 12 Feb 2019