CDT Cluster: Care and Feeding
Note that this page is currently being developed; contact iainr@infREMOVE_THIS.ed.ac.uk for more information.
This page covers the design, configuration and maintenance of the CDT cluster(s), AKA the James and Charles machines. If you just want to
use the cluster then look at computing.help instead.
Overview
The cluster is made up of a set of nodes specced for GPU computing (the Charles machines), a set specced for CPU computing (the James machines), and two large memory nodes (anne and mary). The cluster nodes are named after the first kings and queens of the United Kingdom (the Jameses, the Charleses, Anne and Mary). The current head nodes are hood.inf.ed.ac.uk and renown.inf.ed.ac.uk (both virtual).
The head nodes are purely for submitting jobs and copying files from /afs or otherwise inaccessible filesystems; feel free to kill any processes which are hogging the CPU.
Access to the head nodes and to the cluster is via the cdtcluster-visitor, cdtcluster-staff, ds-cdt-member and ppar-cdt-member roles.
Execution nodes are only sshable via cluster infrastructure nodes such as fondant or escience7; this is to stop students logging in and running jobs outside of the scheduler.
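Reaching an execution node is therefore a two-hop ssh; a minimal sketch (the node names here are examples from above, substitute whichever nodes you need):

# hop through an infrastructure node, then on to the execution node
ssh fondant.inf.ed.ac.uk
ssh charles12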
Hardware
As originally purchased the cluster consisted of:
- 10 off Dell Poweredge R730 (the charles machines), 2 x Xeon(R) CPU E5-2640 v3 @ 2.60GHz 16 core, 64G RAM, 4T HDD, 1 or 2 Tesla K40m
- 21 off Dell Poweredge R815 (the james machines), 4 x AMD Opteron(tm) Processor 6376 16 core, 256G RAM, 4T HDD
- 2 off Dell Poweredge R815 (anne and mary), 4 x AMD Opteron(tm) Processor 6376 16 core, 1T RAM, 2T HDD
It has since been expanded, with the empty GPU slots filled with NVIDIA Titan X cards, and with the purchase of:
- 4 off Dell Poweredge T630, 2 x Xeon(R) CPU E5-2620 v3 @ 2.40GHz 12 core, 64G RAM, 1T HDD, with some Titan X (Maxwell) cards and the rest Titan X (Pascal) cards
- 4 off Scan/Asus 3xs_hpc-t6a with Titan X (Pascal) cards
These are all in the CDT racking in the Appleton Tower server room.
There are two head nodes, renown and hood, which are virtual, and a scheduler node, escience7, which is an R320 in the CDT rack.
Information about the loading of the cluster can be obtained from ganglia.
OS
These nodes were installed with an early version of SL7, the charles machines have all been reinstalled with the full DICE release of SL7. The james are currently being reinstalled piecemeal with the full DICE SL7. The main difference is the partitioning scheme and the use of ext4 rather than ext3.
Scheduler
See SlurmCareandFeeding.
Shared filesystem
Slurm needs a shared filesystem to operate; because gridengine is not kerberized, this filesystem can't be on AFS. Also, because the jobs are forked off from sge_execd on the compute nodes, we need to have non-AFS user home directories. To provide this we have used gluster, which is the standard Red Hat distributed/enterprise filesystem. These are mounted as /mnt/mscteach_gridengine_common for the gridengine admin filesystem and /mnt/mscteach_home for the user home directories.
The two filesystems are configured as follows. They are currently mounted via systemd and fstab, e.g.
charles11.inf.ed.ac.uk:/cdt-gridengine-common 402910080 1029248 381400832 1% /mnt/cdt_gridengine_common
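The corresponding /etc/fstab entry looks roughly like this (a sketch; the exact mount options used on DICE may differ):

# gluster volume mounted with the native FUSE client; any server in the
# trusted pool can be named, the client learns the full volume layout from it
charles11.inf.ed.ac.uk:/cdt-gridengine-common /mnt/cdt_gridengine_common glusterfs defaults,_netdev 0 0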
Note that although it appears that the filesystem is mounted off charles11, it's actually distributed across four fileservers, i.e.
[malcolm02]root: gluster volume info
Volume Name: cdtcluster_home
Type: Distributed-Replicate
Volume ID: b4373a23-2f2d-4bcc-8aa5-5db82874c7d9
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: malcolm01storage:/disk/gluster/brick_malcolm01_02/data
Brick2: malcolm02storage:/disk/gluster/brick_malcolm02_02/data
Brick3: malcolm03storage:/disk/gluster/brick_malcolm03_02/data
Brick4: malcolm02storage:/disk/gluster/brick_malcolm02_03/data
Brick5: malcolm03storage:/disk/gluster/brick_malcolm03_03/data
Brick6: malcolm04storage:/disk/gluster/brick_malcolm04_02/data
Brick7: malcolm01storage:/disk/gluster/brick_malcolm01_03/data
Brick8: malcolm04storage:/disk/gluster/brick_malcolm04_03/data
Brick9: malcolm02storage:/disk/gluster/brick_malcolm02_04/data
Brick10: malcolm01storage:/disk/gluster/brick_malcolm01_04/data
Brick11: malcolm03storage:/disk/gluster/brick_malcolm03_04/data
Brick12: malcolm04storage:/disk/gluster/brick_malcolm04_04/data
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[malcolm02]root:
This is a distributed-replicated filesystem with three replicas, distributed over four servers. This is not like a RAID setup; these are three distinct copies of the same data.
For most gluster issues we'd recommend the Red Hat documentation:
https://access.redhat.com/documentation/en/red-hat-gluster-storage/
Common gluster issues
- "too many symboic links" on login. this has started to appear and seems to be a race condition between systemd mounting the filesystem and the links being in place.
umount /home
followed by mount -a
will fix.
then check that it can see the other nodes in the filesystem
[forthea]root: gluster
gluster> peer status
Number of Peers: 6
Hostname: tatties
Uuid: efb6684f-d516-4bc6-92ea-0bf687b8bd47
State: Peer in Cluster (Connected)
Hostname: 129.215.18.94
Uuid: 70066537-d6e9-4d35-8fb8-dd5bb0280a4c
State: Peer in Cluster (Connected)
Other names:
letha04
Hostname: 129.215.18.93
Uuid: 7b240054-5ca0-4d68-ae0a-059b283c71c1
State: Peer in Cluster (Connected)
Other names:
letha03.inf.ed.ac.uk
Hostname: letha02
Uuid: 92757806-6409-4f31-a081-277447c53c0a
State: Peer in Cluster (Connected)
Hostname: letha05
Uuid: a0740a46-92a7-4d10-a4b4-505cc652c9c9
State: Peer in Cluster (Connected)
Hostname: letha01
Uuid: 7324a5f0-8a84-4132-a988-04312cbbf01d
State: Peer in Cluster (Connected)
gluster>
If this shows any peers as lost or shut down then they'll need to be restarted; you can check and restart the gluster daemon via systemctl:
[forthea]root: systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2017-06-29 10:28:56 BST; 6h ago
Main PID: 1039 (glusterd)
CGroup: /system.slice/glusterd.service
├─1039 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INF...
├─1278 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glust...
├─1284 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gride...
├─1285 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -...
└─1329 /sbin/rpc.statd
Jun 29 10:28:57 forthea.inf.ed.ac.uk rpc.statd[1329]: Version 1.3.0 starting
Jun 29 10:28:57 forthea.inf.ed.ac.uk sm-notify[1330]: Version 1.3.0 starting
[forthea]root:
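If glusterd was down on a peer, restart it and then check the self-heal state once the peer has rejoined (a sketch; cdtcluster_home is the volume name from the volume info above):

systemctl restart glusterd
# list any files still waiting to be self-healed on the volume
gluster volume heal cdtcluster_home info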
- "Stale file handles " are basically as per NFS, find the open copy of the file and kill the process or reboot the server
GPUs
Information about the GPUs can be obtained via the nvidia-smi command:
[charles17]iainr: nvidia-smi
Fri Jun 30 14:53:13 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 0000:02:00.0 Off | N/A |
| 24% 35C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) On | 0000:03:00.0 Off | N/A |
| 29% 42C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) On | 0000:81:00.0 Off | N/A |
| 26% 38C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) On | 0000:82:00.0 Off | N/A |
| 25% 37C P8 16W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
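For scripted checks, the same per-GPU information can be pulled in CSV form using the standard nvidia-smi query options:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv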
If there are processes accessing the GPUs then they will be listed:
[charles12]iainr: nvidia-smi
Fri Jun 30 14:57:54 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 0000:02:00.0 Off | N/A |
| 41% 69C P2 141W / 250W | 11705MiB / 12189MiB | 46% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) On | 0000:04:00.0 Off | N/A |
| 23% 31C P8 16W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) On | 0000:83:00.0 Off | N/A |
| 72% 86C P2 254W / 250W | 11705MiB / 12189MiB | 74% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) On | 0000:84:00.0 Off | N/A |
| 34% 58C P2 83W / 250W | 11191MiB / 12189MiB | 91% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 99508 C python 11703MiB |
| 2 99045 C python 11703MiB |
| 3 45132 C python 11189MiB |
+-----------------------------------------------------------------------------+
There should be an associated process with each PID listed above, which can be found with ps on the node:
[charles12]iainr: ps auxww|grep 99508
s1473470 99508 116 18.8 45058704 12437588 ? Rl Jun29 1606:12 python train_celeb_wgan.py --experiment_title relational_wgan --relational_plus_discr True --batch_size 100
iainr 143110 0.0 0.0 112656 964 pts/0 S+ 14:59 0:00 grep --color=auto 99508
Sometimes a user space process will die and zombify, in which case the GPU will fault:
[charles19]root: nvidia-smi
Fri Jun 30 15:01:18 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | N/A |
| 23% 18C P8 14W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | N/A |
| 23% 24C P8 16W / 250W | 11621MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 0000:81:00.0 Off | N/A |
| 23% 21C P8 15W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 0000:82:00.0 Off | N/A |
| 23% 23C P8 16W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 17235 C Unknown Error 11619MiB |
+-----------------------------------------------------------------------------+
[charles19]root: ps auxww|grep 17235
root 14869 0.0 0.0 112652 948 pts/0 S+ 15:01 0:00 grep 17235
s1473470 17235 0.0 0.0 0 0 ? Zl Jun17 0:42 [python] <defunct>
[charles19]root:
The only solution we're aware of is to reboot the node. It's usually best to disable all the queue instances on the node first and let other jobs finish.
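Assuming the node is managed by Slurm (see SlurmCareandFeeding above), a sketch of taking it out of service cleanly before the reboot:

# stop new jobs being scheduled on the node but let running jobs finish
scontrol update NodeName=charles19 State=DRAIN Reason="zombie GPU process"
# after the reboot, return the node to service
scontrol update NodeName=charles19 State=RESUME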
--
IainRae - 18 May 2016