CDT Cluster: Care and Feeding
Note that this page is currently being developed; contact iainr@infREMOVE_THIS.ed.ac.uk for more information.
This page covers the design, configuration and maintenance of the CDT cluster(s), AKA the James and Charles machines. If you just want to
use the cluster then look at computing.help instead.
Overview
The cluster is made up of a set of nodes specced for GPU coding (the charles machines), a set specced for CPU coding (the james machines) and two large memory nodes (anne and mary). The cluster nodes are named after the first kings and queens of the United Kingdom (the jameses, the charleses, anne and mary). The current head nodes are hood.inf.ed.ac.uk and renown.inf.ed.ac.uk (both virtual).
The head nodes are purely for submitting jobs and copying files from /afs or other otherwise inaccessible filesystems; feel free to kill any processes which are hogging the CPU.
Execution nodes are only sshable via cluster infrastructure nodes such as fondant or escience7; this is to stop students logging in and running jobs outside of the scheduler.
Hardware
As originally purchased the cluster consisted of:
- 10 off Dell Poweredge R730 (the charles machines), 2 x Xeon(R) CPU E5-2640 v3 @ 2.60GHz 16 Core, 64G RAM, 4T HDD, 1 or 2 Tesla K40m
- 21 off Dell Poweredge R815 (the james machines), 4 x AMD Opteron(tm) Processor 6376 16 core, 256G RAM, 4T HDD
- 2 off Dell Poweredge R815 (anne and mary), 4 x AMD Opteron(tm) Processor 6376 16 core, 1T RAM, 2T HDD
It has since been expanded, with the spare GPU slots filled with NVIDIA Titan X cards, and the purchase of:
- 4 off Dell Poweredge T630, 2 x Xeon(R) CPU E5-2620 v3 @ 2.40GHz 12 Core, 64G RAM, 1T HDD, with some Titan X (Maxwell) cards and the rest Titan X (Pascal) cards
- 4 off Scan/Asus 3xs_hpc-t6a with Titan X (Pascal) cards
These are all in the CDT racking in the Appleton Tower server room.
There are two head nodes, renown and hood, which are virtual, and a scheduler node, escience7, which is an R320 in the CDT rack.
Information about the loading of the cluster can be obtained from ganglia.
OS
These nodes were installed with an early version of SL7. The charles machines have all been reinstalled with the full DICE release of SL7, and the james machines are currently being reinstalled piecemeal with the full DICE SL7. The main difference is the partitioning scheme and the use of ext4 rather than ext3.
Scheduler
We were initially asked not to install a scheduler but to allow open access, and there was the possibility that students might get either bare metal or VM based access to install different operating systems; as of May 2016 nothing has come of this. In April 2016 we were asked to install gridengine to provide a system comparable to ECDF/Eddie. Installation of hadoop based frameworks has been mooted at various times but has not progressed.
Shared filesystem
Gridengine needs a shared filesystem to operate, and because gridengine is not kerberized this filesystem can't be on AFS. Also, because jobs are forked off from sgeexecd on the compute nodes, we need to have non-AFS user home directories. To provide this we have used gluster, which is the standard Red Hat distributed/enterprise filesystem. These are mounted as /mnt/mscteach_gridengine_common for the gridengine admin filesystem and /mnt/mscteach_home for the user home directories.
The two filesystems are configured as follows. They are currently mounted via systemd and fstab, e.g.
charles11.inf.ed.ac.uk:/cdt-gridengine-common 402910080 1029248 381400832 1% /mnt/cdt_gridengine_common
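The corresponding fstab entry looks something like this (a sketch only; the exact mount options in use on the nodes may differ):
<verbatim>
charles11.inf.ed.ac.uk:/cdt-gridengine-common /mnt/cdt_gridengine_common glusterfs defaults,_netdev 0 0
</verbatim>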
Note that although it appears that the filesystem is mounted off charles11, it's distributed across four fileservers, i.e.
gluster> volume info cdt-gridengine-common
Volume Name: cdt-gridengine-common
Type: Distributed-Replicate
Volume ID: 26e1a62a-ffd9-40f7-a425-f929537bc1ee
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: charles11:/disk/gluster/brick_charles11_01/data
Brick2: charles12:/disk/gluster/brick_charles12_01/data
Brick3: charles13:/disk/gluster/brick_charles13_01/data
Brick4: charles14:/disk/gluster/brick_charles14_01/data
Options Reconfigured:
performance.readdir-ahead: on
gluster>
Distributed means that the files are spread over two (or more) bricks; Replicated means that there are two (or more) copies of any file, held on different bricks.
In this case it's like a disk RAID setup with two replicated two-disk stripes.
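For reference, a volume laid out like this would have been created with something along these lines (a sketch, not necessarily the exact command that was run; gluster pairs up consecutive bricks into replica sets):
<verbatim>
gluster volume create cdt-gridengine-common replica 2 \
    charles11:/disk/gluster/brick_charles11_01/data \
    charles12:/disk/gluster/brick_charles12_01/data \
    charles13:/disk/gluster/brick_charles13_01/data \
    charles14:/disk/gluster/brick_charles14_01/data
gluster volume start cdt-gridengine-common
</verbatim>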
Similarly for the home directories:
[anne]root: gluster
gluster> volume info
Volume Name: cdt-gridengine-home
Type: Distributed-Replicate
Volume ID: 49a16c6a-1d07-4aea-b9a5-c6b0f5ae2b10
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: anne:/disk/gluster/brick_anne_01/data
Brick2: mary:/disk/gluster/brick_mary_01/data
Brick3: charles03:/disk/gluster/brick_charles03_01/data
Brick4: charles04:/disk/gluster/brick_charles04_01/data
Brick5: charles05:/disk/gluster/brick_charles05_01/data
Brick6: charles06:/disk/gluster/brick_charles06_01/data
Brick7: charles09:/disk/gluster/brick_charles09_01/data
Brick8: charles10:/disk/gluster/brick_charles10_01/data
Brick9: charles07:/disk/gluster/brick_charles07_01/data
Brick10: charles08:/disk/gluster/brick_charles08_01/data
Brick11: charles15.inf.ed.ac.uk:/disk/gluster/brick_charles15
Brick12: charles16.inf.ed.ac.uk:/disk/gluster/brick_charles16
Brick13: charles17.inf.ed.ac.uk:/disk/gluster/brick_charles17_01/data
Brick14: charles18.inf.ed.ac.uk:/disk/gluster/brick_charles18_01/data
Options Reconfigured:
network.inode-lru-limit: 9000
performance.md-cache-timeout: 60
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.readdir-ahead: on
gluster>
which is like two replicated seven-disk stripes.
For most gluster issues we'd recommend the Red Hat documents at
https://access.redhat.com/documentation/en/red-hat-gluster-storage/
Common gluster issues
- "too many symboic links" on login. this has started to appear and seems to be a race condition between systemd mounting the filesystem and the links being in place.
umount /home
followed by mount -a
will fix.
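A minimal sketch of that fix, run as root on the affected node:
<verbatim>
# remount the home filesystem from fstab
umount /home
mount -a
# confirm it is back
df -h /home
</verbatim>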
Then check that the node can see the other nodes in the filesystem:
[forthea]root: gluster
gluster> peer status
Number of Peers: 6
Hostname: tatties
Uuid: efb6684f-d516-4bc6-92ea-0bf687b8bd47
State: Peer in Cluster (Connected)
Hostname: 129.215.18.94
Uuid: 70066537-d6e9-4d35-8fb8-dd5bb0280a4c
State: Peer in Cluster (Connected)
Other names:
letha04
Hostname: 129.215.18.93
Uuid: 7b240054-5ca0-4d68-ae0a-059b283c71c1
State: Peer in Cluster (Connected)
Other names:
letha03.inf.ed.ac.uk
Hostname: letha02
Uuid: 92757806-6409-4f31-a081-277447c53c0a
State: Peer in Cluster (Connected)
Hostname: letha05
Uuid: a0740a46-92a7-4d10-a4b4-505cc652c9c9
State: Peer in Cluster (Connected)
Hostname: letha01
Uuid: 7324a5f0-8a84-4132-a988-04312cbbf01d
State: Peer in Cluster (Connected)
gluster>
If this shows any peers as lost or shut down then they'll need to be restarted; you can restart the gluster daemon via systemctl:
[forthea]root: systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2017-06-29 10:28:56 BST; 6h ago
Main PID: 1039 (glusterd)
CGroup: /system.slice/glusterd.service
├─1039 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INF...
├─1278 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glust...
├─1284 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gride...
├─1285 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -...
└─1329 /sbin/rpc.statd
Jun 29 10:28:57 forthea.inf.ed.ac.uk rpc.statd[1329]: Version 1.3.0 starting
Jun 29 10:28:57 forthea.inf.ed.ac.uk sm-notify[1330]: Version 1.3.0 starting
[forthea]root:
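Restarting the daemon on a disconnected peer is the usual systemd dance (a sketch; run as root on the peer that is showing as lost):
<verbatim>
systemctl restart glusterd
systemctl status glusterd
# then re-check from any other peer
gluster peer status
</verbatim>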
- "Stale file handles " are basically as per NFS, find the open copy of the file and kill the process or reboot the server
Gridengine
We've been requested to install gridengine. Unfortunately gridengine has rotted somewhat since Sun was bought up and Oracle sold it off; we are currently using
Son of Grid Engine, which is an open source fork. Note that other than the above website there is no real definitive set of documentation for the version of gridengine we are running.
Sun's N1 Grid Engine 6 docs should give a good overview.
Gridengine is made up of three types of nodes: execution hosts, which run the jobs; head nodes, which provide a working environment for people to schedule jobs from; and a scheduler node, which runs the scheduler and manages the execution nodes. All configuration is in the shared gridengine-common filesystem, which should be mounted as /opt/sge/default. Configuration of the scheduler can be done via the qconf command or the qmon command. Current jobs can be listed via qstat and the current state of hosts can be listed via qhost.
Some useful commands:
- qhost - list hosts
- qsub - submit jobs
- qstat - list info about jobs
- qconf - configure aspects of the cluster
- qmon - GUI configuration utility
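As an illustration, submitting and monitoring a simple batch job from a head node might look like this (the script name myjob.sh is made up, and any resource requests will depend on the local queue setup):
<verbatim>
# submit a job script, running it from the current directory
qsub -cwd myjob.sh
# check its state (qw = queued, r = running)
qstat
# check the state of the execution hosts
qhost
</verbatim>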
Generally we would recommend using qmon; it's really horrible but it's not as steep a learning curve as qconf.
Common issues
- Removing jobs: jobs can be removed with qdel, and the -f option can be used to remove more stubborn jobs. Note that, for reasons that remain bizarre, to list all users' jobs (and not just your own) you have to use -u and escape the * (all users), as in the sketch below.
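A sketch of the usual sequence (the job id 12345 is made up):
<verbatim>
# list all users' jobs, escaping the * so the shell doesn't expand it
qstat -u \*
# delete one of your jobs; add -f if it refuses to go away
qdel 12345
qdel -f 12345
</verbatim>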
- Host information: you can dump the information that the scheduler knows about the hosts using the qhost command.
[renown]iainr: qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO
---------------------------------------------------------------------------------------
global - - - - - - - - -
charles01 lx-amd64 32 2 16 32 0.09 62.7G 5.2G 31.3G
charles02 lx-amd64 32 2 16 32 1.11 62.7G 8.3G 31.3G
charles03 lx-amd64 32 2 16 32 0.06 62.7G 7.4G 31.3G
charles04 lx-amd64 32 2 16 32 0.01 62.7G 7.4G 31.3G
charles05 lx-amd64 32 2 16 32 0.03 62.7G 6.7G 31.3G
charles06 lx-amd64 32 2 16 32 0.03 62.7G 5.4G 31.3G
charles07 lx-amd64 32 2 16 32 0.01 62.7G 5.2G 31.3G
charles08 lx-amd64 32 2 16 32 1.01 62.7G 9.7G 31.3G
charles09 lx-amd64 32 2 16 32 0.02 62.7G 6.2G 31.3G
charles10 lx-amd64 32 2 16 32 0.04 62.7G 3.8G 31.3G
charles11 lx-amd64 24 2 12 24 0.04 62.8G 2.7G 31.4G
charles12 lx-amd64 24 2 12 24 2.18 62.8G 41.6G 31.4G
charles13 lx-amd64 24 2 12 24 0.01 62.8G 1.6G 31.4G
charles14 lx-amd64 24 2 12 24 1.01 62.8G 26.1G 31.4G
charles15 lx-amd64 32 2 16 32 4.49 62.8G 43.6G 31.2G
charles16 lx-amd64 32 2 16 32 0.02 62.8G 5.4G 31.2G
charles17 lx-amd64 32 2 16 32 1.14 62.8G 4.2G 31.2G
charles18 lx-amd64 32 2 16 32 2.04 62.8G 13.3G 31.2G
...
Note that if the load and memuse columns are showing - then the execution daemon (sgeexecd) is not running on that node. If a host is showing all -'s then it has not been installed yet. There is also a "global" host that is used to hold some default configuration for all hosts.
- Restarting sgeexecd: if the execution daemon has died on a node it can be restarted on that node by hand, as in the sketch below.
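A sketch of restarting the execution daemon by hand, assuming the standard /opt/sge paths used elsewhere on this page (charlesNN is a placeholder for the affected node):
<verbatim>
# on the affected node, as root, pick up the SGE environment
. /opt/sge/default/common/settings.sh
/opt/sge/bin/lx-amd64/sge_execd
# from a head node, check the load column is populated again
qhost -h charlesNN
</verbatim>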
- Restarting qmaster: in the last couple of weeks systemd has had issues restarting qmaster if the machine is rebooted; for the moment I've started it by hand:
/opt/sge/bin/lx-amd64/sge_qmaster
If you want debugging information then you need to set an environment variable:
export SGE_ND=1
/opt/sge/bin/lx-amd64/sge_qmaster
- To install an execution node:
<verbatim>
[forthea]iainr: cd /opt/sge
[forthea]iainr: ls
bin doc install_execd inst_sge man pvm util
default examples install_qmaster lib mpi qmon utilbin
</verbatim>
Make sure that default is mounted:
<verbatim>
[forthea]iainr: ls default
common spool
[forthea]iainr:
</verbatim>
Install the exec node:
<verbatim>
[forthea]iainr: ./install_execd
Welcome to the Grid Engine execution host installation
------------------------------------------------------
If you haven't installed the Grid Engine qmaster host yet, you must execute
this step (with >install_qmaster<) prior the execution host installation.
For a successful installation you need a running Grid Engine qmaster. It is
also necessary that this host is an administrative host.
You can verify your current list of administrative hosts with
the command:
# qconf -sh
You can add an administrative host with the command:
# qconf -ah <hostname>
The execution host installation will take approximately 5 minutes.
Hit <RETURN> to continue >>
</verbatim>
The script should take configuration information from the common filesystem. You can install the qmaster node in a similar fashion.
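Note that, as the installer output says, the new node needs to be an administrative host before the install will work; that is done from the qmaster (or any existing admin host), something like this (charlesNN is a placeholder for the new node):
<verbatim>
# on the qmaster / an existing admin host
qconf -ah charlesNN
# check it is now listed
qconf -sh
</verbatim>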
---+++GPUs
Information about the GPUs can be obtained via the nvidia-smi command:
<verbatim>
[charles17]iainr: nvidia-smi
Fri Jun 30 14:53:13 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 0000:02:00.0 Off | N/A |
| 24% 35C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) On | 0000:03:00.0 Off | N/A |
| 29% 42C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) On | 0000:81:00.0 Off | N/A |
| 26% 38C P8 17W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) On | 0000:82:00.0 Off | N/A |
| 25% 37C P8 16W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
</verbatim>
If there are processes accessing the GPU then they will be listed:
<verbatim>
[charles12]iainr: nvidia-smi
Fri Jun 30 14:57:54 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 0000:02:00.0 Off | N/A |
| 41% 69C P2 141W / 250W | 11705MiB / 12189MiB | 46% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) On | 0000:04:00.0 Off | N/A |
| 23% 31C P8 16W / 250W | 0MiB / 12189MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) On | 0000:83:00.0 Off | N/A |
| 72% 86C P2 254W / 250W | 11705MiB / 12189MiB | 74% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) On | 0000:84:00.0 Off | N/A |
| 34% 58C P2 83W / 250W | 11191MiB / 12189MiB | 91% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 99508 C python 11703MiB |
| 2 99045 C python 11703MiB |
| 3 45132 C python 11189MiB |
+-----------------------------------------------------------------------------+
</verbatim>
There should be an associated process for each PID above:
<verbatim>
s1473470 99508 116 18.8 45058704 12437588 ? Rl Jun29 1606:12 python train_celeb_wgan.py --experiment_title relational_wgan --relational_plus_discr True --batch_size 100
iainr 143110 0.0 0.0 112656 964 pts/0 S+ 14:59 0:00 grep --color=auto 99508
</verbatim>
Sometimes a user-space process will die and zombify, in which case the GPU will fault:
<verbatim>
[charles19]root: nvidia-smi
Fri Jun 30 15:01:18 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | N/A |
| 23% 18C P8 14W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | N/A |
| 23% 24C P8 16W / 250W | 11621MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 0000:81:00.0 Off | N/A |
| 23% 21C P8 15W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 0000:82:00.0 Off | N/A |
| 23% 23C P8 16W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 17235 C Unknown Error 11619MiB |
+-----------------------------------------------------------------------------+
[charles19]root: ps auxww|grep 17235
root 14869 0.0 0.0 112652 948 pts/0 S+ 15:01 0:00 grep 17235
s1473470 17235 0.0 0.0 0 0 ? Zl Jun17 0:42 [python] <defunct>
[charles19]root:
</verbatim>
The only solution we're aware of to fix this is to reboot the node. It's usually best to disable all the queue instances on the node first and let other jobs finish, as in the sketch below.
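Disabling and re-enabling the queue instances is done with qmod (a sketch; charles19 stands in for whichever node has the zombie):
<verbatim>
# disable all queue instances on the node so nothing new is scheduled there
qmod -d '*@charles19'
# once the running jobs have drained, reboot the node, then re-enable
qmod -e '*@charles19'
</verbatim>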
-- Main.IainRae - 18 May 2016