CDT Cluster: Care and Feeding

Note that this page is currently being developed; contact iainr@infREMOVE_THIS.ed.ac.uk for more information.

This page covers the design, configuration and maintenance of the CDT cluster(s), AKA the James and Charles machines. If you just want to use the cluster then look at computing.help instead.

Overview

The cluster is made up of a set of nodes specced for GPU work (the Charles machines), a set specced for CPU work (the James machines) and two large-memory nodes (anne and mary). The cluster nodes are named after the first kings and queens of the United Kingdom (the Jameses, Charleses, Anne and Mary). The current head nodes are hood.inf.ed.ac.uk and renown.inf.ed.ac.uk (both virtual).

The head nodes are purely for submitting jobs and copying files from /afs or other otherwise inaccessible filesystems; feel free to kill any processes which are hogging the CPU.

Access to the head nodes and to the cluster is via the cdtcluster-visitor, cdtcluster-staff, ds-cdt-member and ppar-cdt-member roles.

Execution nodes are only sshable from cluster infrastructure nodes such as fondant or escience7; this is to stop students logging in and running jobs outside of the scheduler.
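
As a quick orientation, a typical interaction with a head node looks something like the sketch below; the AFS path and job script name are illustrative only, and job submission is covered under Scheduler.

 ssh hood.inf.ed.ac.uk                                  # or renown.inf.ed.ac.uk
 cp /afs/inf.ed.ac.uk/user/s12/s1234567/myjob.sh ~/     # stage files from AFS into the cluster home directory
 sbatch myjob.sh                                        # hand the job to the scheduler rather than running it here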

Hardware

As originally purchased the cluster consisted of:

  • 10 off Dell PowerEdge R730 (the charles machines), 2 x Xeon(R) CPU E5-2640 v3 @ 2.60GHz 16 core, 64G RAM, 4T HDD, 1 or 2 Tesla K40m
  • 21 off Dell PowerEdge R815 (the james machines), 4 x AMD Opteron(tm) Processor 6376 16 core, 256G RAM, 4T HDD
  • 2 off Dell PowerEdge R815 (anne and mary), 4 x AMD Opteron(tm) Processor 6376 16 core, 1T RAM, 2T HDD

It has since been expanded, with the spare GPU slots filled with NVIDIA Titan X cards, and with the purchase of:

  • 4 off Dell PowerEdge T630, 2 x Xeon(R) CPU E5-2620 v3 @ 2.40GHz 12 core, 64G RAM, 1T HDD, some with Titan X (Maxwell) cards and the rest with Titan X (Pascal) cards
  • 4 off Scan/Asus 3xs_hpc-t6a with Titan X (Pascal) cards

These are all in the CDT racking in the Appleton Tower server room.

There are two head nodes, renown and hood, which are virtual, and a scheduler node, escience7, which is an R320 in the CDT rack.

Information about the loading of the cluster can be obtained from Ganglia.

OS

These nodes were installed with an early version of SL7. The charles machines have all been reinstalled with the full DICE release of SL7, and the james machines are currently being reinstalled piecemeal with the full DICE SL7. The main differences are the partitioning scheme and the use of ext4 rather than ext3.

Scheduler

See SlurmCareandFeeding.
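
For day-to-day care and feeding, the standard Slurm commands are enough for a quick health check; a sketch (the node name is taken from the examples later on this page):

 sinfo                          # overview of partitions and node states (idle/alloc/drain/down)
 squeue                         # currently queued and running jobs
 scontrol show node charles19   # detailed state and reason string for a single node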

Shared filesystem

Slurm needs a shared filesystem to operate; because gridengine is not kerberized, this filesystem can't be on AFS. Also, because jobs are forked off from sgeexecd on the compute nodes, we need non-AFS user home directories. To provide this we have used Gluster, which is the standard Red Hat distributed/enterprise filesystem. These are mounted as /mnt/mscteach_gridengine_common for the gridengine admin filesystem and /mnt/mscteach_home for the user home directories.

The two filesystems are configured as follows:

The filesystems are currently mounted via systemd and fstab, e.g.:

charles11.inf.ed.ac.uk:/cdt-gridengine-common 402910080 1029248 381400832 1% /mnt/cdt_gridengine_common
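
The corresponding fstab entry looks roughly like the sketch below; this is the general glusterfs fstab form rather than a verbatim copy of the entry on the nodes:

 charles11.inf.ed.ac.uk:/cdt-gridengine-common  /mnt/cdt_gridengine_common  glusterfs  defaults,_netdev  0 0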

Note that although it appears that the filesystem is mounted off charles11, it is actually distributed across four fileservers, i.e.:

[malcolm02]root: gluster volume info
 
Volume Name: cdtcluster_home
Type: Distributed-Replicate
Volume ID: b4373a23-2f2d-4bcc-8aa5-5db82874c7d9
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: malcolm01storage:/disk/gluster/brick_malcolm01_02/data
Brick2: malcolm02storage:/disk/gluster/brick_malcolm02_02/data
Brick3: malcolm03storage:/disk/gluster/brick_malcolm03_02/data
Brick4: malcolm02storage:/disk/gluster/brick_malcolm02_03/data
Brick5: malcolm03storage:/disk/gluster/brick_malcolm03_03/data
Brick6: malcolm04storage:/disk/gluster/brick_malcolm04_02/data
Brick7: malcolm01storage:/disk/gluster/brick_malcolm01_03/data
Brick8: malcolm04storage:/disk/gluster/brick_malcolm04_03/data
Brick9: malcolm02storage:/disk/gluster/brick_malcolm02_04/data
Brick10: malcolm01storage:/disk/gluster/brick_malcolm01_04/data
Brick11: malcolm03storage:/disk/gluster/brick_malcolm03_04/data
Brick12: malcolm04storage:/disk/gluster/brick_malcolm04_04/data
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[malcolm02]root: 
This is a distributed-replicated filesystem with three replicas, distributed over four servers. This is not like a RAID setup: these are three distinct copies of the same data.
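
To check the health of a volume like this, run the usual Gluster status commands on one of the servers; a sketch:

 gluster volume status cdtcluster_home      # are all bricks and their brick processes online?
 gluster volume heal cdtcluster_home info   # any files pending self-heal between the replicas?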

For most Gluster issues we'd recommend the Red Hat documents: https://access.redhat.com/documentation/en/red-hat-gluster-storage/

Common gluster issues

  • "too many symboic links" on login. this has started to appear and seems to be a race condition between systemd mounting the filesystem and the links being in place.
     umount /home  
    followed by
     mount -a 
    will fix.

  • Filesystem failing to mount. This is usually indicative that lots of nodes in the filesystem are no longer talking to each other. Log onto the machine listed in fstab as the mount source and check that glusterd is running:
    [forthea]iainr: ps auxww|grep glusterd
    root      1039  0.0  0.0 605060 24708 ?        Ssl  10:28   0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
    root      1278  0.0  0.1 778204 57384 ?        Ssl  10:28   0:01 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/864101a86f98b91e4af178b78b1602a1.socket --xlator-option *replicate*.node-uuid=5ab018c5-c637-4910-b31a-b0935ab28ee8
    root      1284  2.1  0.1 1627476 46460 ?       Ssl  10:28   7:49 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gridengine-common.forthea.disk-gluster-brick_forthea_01-data -p /var/lib/glusterd/vols/mscteach-gridengine-common/run/forthea-disk-gluster-brick_forthea_01-data.pid -S /var/run/gluster/5d6a98a082c13ef0e162a9026afe6516.socket --brick-name /disk/gluster/brick_forthea_01/data -l /var/log/glusterfs/bricks/disk-gluster-brick_forthea_01-data.log --xlator-option *-posix.glusterd-uuid=5ab018c5-c637-4910-b31a-b0935ab28ee8 --brick-port 49152 --xlator-option mscteach-gridengine-common-server.listen-port=49152
    root      1285  0.0  0.3 695560 113744 ?       Ssl  10:28   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/c689254873f7f171afe7d97355c08e4a.socket
    iainr    24583  0.0  0.0 112656   960 pts/0    S+   16:41   0:00 grep --color=auto glusterd
    
Then check that it can see the other nodes in the filesystem:

[forthea]root: gluster
gluster> peer status
Number of Peers: 6

Hostname: tatties
Uuid: efb6684f-d516-4bc6-92ea-0bf687b8bd47
State: Peer in Cluster (Connected)

Hostname: 129.215.18.94
Uuid: 70066537-d6e9-4d35-8fb8-dd5bb0280a4c
State: Peer in Cluster (Connected)
Other names:
letha04

Hostname: 129.215.18.93
Uuid: 7b240054-5ca0-4d68-ae0a-059b283c71c1
State: Peer in Cluster (Connected)
Other names:
letha03.inf.ed.ac.uk

Hostname: letha02
Uuid: 92757806-6409-4f31-a081-277447c53c0a
State: Peer in Cluster (Connected)

Hostname: letha05
Uuid: a0740a46-92a7-4d10-a4b4-505cc652c9c9
State: Peer in Cluster (Connected)

Hostname: letha01
Uuid: 7324a5f0-8a84-4132-a988-04312cbbf01d
State: Peer in Cluster (Connected)
gluster> 
If this shows any peers as lost or shut down then they'll need to be restarted; you can restart the gluster daemon via systemctl (a restart sketch follows the status output below):
[forthea]root: systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-06-29 10:28:56 BST; 6h ago
 Main PID: 1039 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─1039 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INF...
           ├─1278 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glust...
           ├─1284 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gride...
           ├─1285 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -...
           └─1329 /sbin/rpc.statd

Jun 29 10:28:57 forthea.inf.ed.ac.uk rpc.statd[1329]: Version 1.3.0 starting
Jun 29 10:28:57 forthea.inf.ed.ac.uk sm-notify[1330]: Version 1.3.0 starting
[forthea]root: 
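
To actually restart a stopped or wedged daemon, something like the sketch below should do; forthea is just the example node from above:

 [forthea]root: systemctl restart glusterd
 [forthea]root: systemctl status glusterd    # confirm it comes back as active (running)
 [forthea]root: gluster peer status          # confirm the peers reconnect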

  • "Stale file handles " are basically as per NFS, find the open copy of the file and kill the process or reboot the server

GPUs

Information about the GPUs can be obtained via the nvidia-smi command:

[charles17]iainr: nvidia-smi
Fri Jun 30 14:53:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 0000:02:00.0     Off |                  N/A |
| 24%   35C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 0000:03:00.0     Off |                  N/A |
| 29%   42C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 0000:81:00.0     Off |                  N/A |
| 26%   38C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 0000:82:00.0     Off |                  N/A |
| 25%   37C    P8    16W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If there are processes accessing the GPUs then they will be listed:

[charles12]iainr: nvidia-smi
Fri Jun 30 14:57:54 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 0000:02:00.0     Off |                  N/A |
| 41%   69C    P2   141W / 250W |  11705MiB / 12189MiB |     46%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 0000:04:00.0     Off |                  N/A |
| 23%   31C    P8    16W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 0000:83:00.0     Off |                  N/A |
| 72%   86C    P2   254W / 250W |  11705MiB / 12189MiB |     74%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 0000:84:00.0     Off |                  N/A |
| 34%   58C    P2    83W / 250W |  11191MiB / 12189MiB |     91%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     99508    C   python                                       11703MiB |
|    2     99045    C   python                                       11703MiB |
|    3     45132    C   python                                       11189MiB |
+-----------------------------------------------------------------------------+

There should be an associated process for each PID above (here checked for 99508 with ps auxww | grep 99508 on the node):

s1473470  99508  116 18.8 45058704 12437588 ?   Rl   Jun29 1606:12 python train_celeb_wgan.py --experiment_title relational_wgan --relational_plus_discr True --batch_size 100
iainr    143110  0.0  0.0 112656   964 pts/0    S+   14:59   0:00 grep --color=auto 99508
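
For a machine-readable list of just the compute processes, nvidia-smi's query mode can be used; a sketch, assuming a reasonably recent driver/CLI:

 nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv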

Sometimes a user-space process will die and zombify, in which case the GPU will fault:

[charles19]root: nvidia-smi
Fri Jun 30 15:01:18 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
| 23%   18C    P8    14W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
| 23%   24C    P8    16W / 250W |  11621MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:81:00.0     Off |                  N/A |
| 23%   21C    P8    15W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
| 23%   23C    P8    16W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1     17235    C   Unknown Error                                11619MiB |
+-----------------------------------------------------------------------------+
[charles19]root: ps auxww|grep 17235
root     14869  0.0  0.0 112652   948 pts/0    S+   15:01   0:00 grep 17235
s1473470 17235  0.0  0.0      0     0 ?        Zl   Jun17   0:42 [python] <defunct>
[charles19]root: 
The only solution we're aware of is to reboot the node. It's usually best to disable all the queue instances on the node first and let other jobs finish; a sketch of doing this under Slurm follows.
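
Under Slurm (see SlurmCareandFeeding) that means draining the node, rebooting once it is empty, and resuming it afterwards; a sketch using charles19 from the example above:

 scontrol update nodename=charles19 state=drain reason="zombie GPU process"   # no new jobs; running jobs finish
 # ...once the node is idle, reboot it, then return it to service:
 scontrol update nodename=charles19 state=resume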

-- IainRae - 18 May 2016

