
CDT Cluster: Care and Feeding

Note that this page is currently being developed; contact iainr@infREMOVE_THIS.ed.ac.uk for more information.

This page covers the design, configuration and maintenance of the CDT cluster(s), AKA the James and Charles machines. If you just want to use the cluster then look at computing.help instead.

Overview

The cluster is made up of a set of nodes specced for GPU work (the Charles machines), a set specced for CPU work (the James machines) and two large-memory nodes (anne and mary). The cluster nodes are named after the first kings and queens of the United Kingdom (the Jameses, the Charleses, Anne and Mary). The current head nodes are hood.inf.ed.ac.uk and renown.inf.ed.ac.uk (both virtual).

The head nodes are purely for submitting jobs and for copying files from /afs or other otherwise inaccessible filesystems; feel free to kill any processes which are hogging the CPU.

Execution nodes are only sshable from cluster infrastructure nodes such as fondant or escience7; this is to stop students logging in and running jobs outside of the scheduler.

Hardware

As originally purchased the cluster consisted of:

  • 10 off Dell Poweredge R730 (the charles machines), 2 x Xeon(R) CPU E5-2640 v3 @ 2.60GHz 16 core, 64G RAM, 4T HDD, 1 or 2 Tesla K40m
  • 21 off Dell Poweredge R815 (the james machines), 4 x AMD Opteron(tm) Processor 6376 16 core, 256G RAM, 4T HDD
  • 2 off Dell Poweredge R815 (anne and mary), 4 x AMD Opteron(tm) Processor 6376 16 core, 1T RAM, 2T HDD

It has since been expanded, with the spare GPU slots filled with NVIDIA Titan X cards, and with the purchase of:

  • 4 off Dell Poweredge T630, 2 x Xeon(R) CPU E5-2620 v3 @ 2.40GHz 12 core, 64G RAM, 1T HDD, with some Titan X (Maxwell) cards and the rest Titan X (Pascal) cards
  • 4 off Scan/Asus 3xs_hpc-t6a with Titan X (Pascal) cards

These are all in the CDT racking in the Appleton Tower server room.

There are two head nodes, renown and hood, which are virtual machines, and a scheduler node, escience7, which is an R320 in the CDT rack.

Information about the loading of the cluster can be obtained from ganglia.

OS

These nodes were originally installed with an early version of SL7. The charles machines have all been reinstalled with the full DICE release of SL7, and the james machines are currently being reinstalled piecemeal with the full DICE SL7. The main differences are the partitioning scheme and the use of ext4 rather than ext3.

Scheduler

We were initially asked not to install a scheduler but to allow open access, and there was the possibility that students might get either bare-metal or VM-based access to install different operating systems; as of May 2016 nothing has come of this. In April 2016 we were asked to install gridengine to provide a system comparable to ECDF/Eddie. Installation of hadoop-based frameworks has been mooted at various times but has not progressed.

Shared filesystem

Gridengine needs a shared filesystem to operate, and because gridengine is not kerberized this filesystem can't be on AFS. Also, because jobs are forked off from sgeexecd on the compute nodes, we need to have non-AFS user home directories. To provide this we have used gluster, which is the standard Red Hat distributed/enterprise filesystem. These are mounted as /mnt/mscteach_gridengine_common for the gridengine admin filesystem and /mnt/mscteach_home for the user home directories.

The two filesystems are configured as follows.

The filesystems are currently mounted via systemd and fstab, e.g.

charles11.inf.ed.ac.uk:/cdt-gridengine-common 402910080 1029248 381400832 1% /mnt/cdt_gridengine_common
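For reference, a glusterfs mount of this kind would normally be declared in /etc/fstab along these lines (a sketch only; the exact options used on the nodes may differ):

charles11.inf.ed.ac.uk:/cdt-gridengine-common /mnt/cdt_gridengine_common glusterfs defaults,_netdev 0 0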

Note that although it appears that the filesystem is mounted off charles11, it's actually distributed across four fileservers,

i.e.

gluster> volume info cdt-gridengine-common
 
Volume Name: cdt-gridengine-common
Type: Distributed-Replicate
Volume ID: 26e1a62a-ffd9-40f7-a425-f929537bc1ee
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: charles11:/disk/gluster/brick_charles11_01/data
Brick2: charles12:/disk/gluster/brick_charles12_01/data
Brick3: charles13:/disk/gluster/brick_charles13_01/data
Brick4: charles14:/disk/gluster/brick_charles14_01/data
Options Reconfigured:
performance.readdir-ahead: on
gluster> 

Distributed means that the files are spread over two (or more) bricks; Replicated means that there are two (or more) copies of any file on multiple bricks.

In this case it's like a disk RAID setup with two replicated two-disk stripes.
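For the record, a distributed-replicate volume of this shape would have been created with something like the following (a sketch, not the exact commands originally run; with replica 2 the bricks are paired up in the order they are listed):

gluster volume create cdt-gridengine-common replica 2 \
    charles11:/disk/gluster/brick_charles11_01/data charles12:/disk/gluster/brick_charles12_01/data \
    charles13:/disk/gluster/brick_charles13_01/data charles14:/disk/gluster/brick_charles14_01/data
gluster volume start cdt-gridengine-common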

Similarly for the home directories:
[anne]root: gluster
gluster> volume info
 
Volume Name: cdt-gridengine-home
Type: Distributed-Replicate
Volume ID: 49a16c6a-1d07-4aea-b9a5-c6b0f5ae2b10
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: anne:/disk/gluster/brick_anne_01/data
Brick2: mary:/disk/gluster/brick_mary_01/data
Brick3: charles03:/disk/gluster/brick_charles03_01/data
Brick4: charles04:/disk/gluster/brick_charles04_01/data
Brick5: charles05:/disk/gluster/brick_charles05_01/data
Brick6: charles06:/disk/gluster/brick_charles06_01/data
Brick7: charles09:/disk/gluster/brick_charles09_01/data
Brick8: charles10:/disk/gluster/brick_charles10_01/data
Brick9: charles07:/disk/gluster/brick_charles07_01/data
Brick10: charles08:/disk/gluster/brick_charles08_01/data
Brick11: charles15.inf.ed.ac.uk:/disk/gluster/brick_charles15
Brick12: charles16.inf.ed.ac.uk:/disk/gluster/brick_charles16
Brick13: charles17.inf.ed.ac.uk:/disk/gluster/brick_charles17_01/data
Brick14: charles18.inf.ed.ac.uk:/disk/gluster/brick_charles18_01/data
Options Reconfigured:
network.inode-lru-limit: 9000
performance.md-cache-timeout: 60
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.readdir-ahead: on
gluster> 
which is like two replicated seven-disk stripes.

For most gluster issues we'd recommend the Red Hat documents at https://access.redhat.com/documentation/en/red-hat-gluster-storage/.

Common gluster issues

  • "too many symboic links" on login. this has started to appear and seems to be a race condition between systemd mounting the filesystem and the links being in place.
     umount /home  
    followed by
     mount -a 
    will fix.

  • Filesystem failing to mount. This is usually indicative that lots of nodes in the filesystem are no longer talking to each other. Log onto the machine listed in fstab as the mount and check that glusterd is running:
    [forthea]iainr: ps auxww|grep glusterd
    root      1039  0.0  0.0 605060 24708 ?        Ssl  10:28   0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
    root      1278  0.0  0.1 778204 57384 ?        Ssl  10:28   0:01 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/864101a86f98b91e4af178b78b1602a1.socket --xlator-option *replicate*.node-uuid=5ab018c5-c637-4910-b31a-b0935ab28ee8
    root      1284  2.1  0.1 1627476 46460 ?       Ssl  10:28   7:49 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gridengine-common.forthea.disk-gluster-brick_forthea_01-data -p /var/lib/glusterd/vols/mscteach-gridengine-common/run/forthea-disk-gluster-brick_forthea_01-data.pid -S /var/run/gluster/5d6a98a082c13ef0e162a9026afe6516.socket --brick-name /disk/gluster/brick_forthea_01/data -l /var/log/glusterfs/bricks/disk-gluster-brick_forthea_01-data.log --xlator-option *-posix.glusterd-uuid=5ab018c5-c637-4910-b31a-b0935ab28ee8 --brick-port 49152 --xlator-option mscteach-gridengine-common-server.listen-port=49152
    root      1285  0.0  0.3 695560 113744 ?       Ssl  10:28   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/c689254873f7f171afe7d97355c08e4a.socket
    iainr    24583  0.0  0.0 112656   960 pts/0    S+   16:41   0:00 grep --color=auto glusterd
    
Then check that it can see the other nodes in the filesystem:

[forthea]root: gluster
gluster> peer status
Number of Peers: 6

Hostname: tatties
Uuid: efb6684f-d516-4bc6-92ea-0bf687b8bd47
State: Peer in Cluster (Connected)

Hostname: 129.215.18.94
Uuid: 70066537-d6e9-4d35-8fb8-dd5bb0280a4c
State: Peer in Cluster (Connected)
Other names:
letha04

Hostname: 129.215.18.93
Uuid: 7b240054-5ca0-4d68-ae0a-059b283c71c1
State: Peer in Cluster (Connected)
Other names:
letha03.inf.ed.ac.uk

Hostname: letha02
Uuid: 92757806-6409-4f31-a081-277447c53c0a
State: Peer in Cluster (Connected)

Hostname: letha05
Uuid: a0740a46-92a7-4d10-a4b4-505cc652c9c9
State: Peer in Cluster (Connected)

Hostname: letha01
Uuid: 7324a5f0-8a84-4132-a988-04312cbbf01d
State: Peer in Cluster (Connected)
gluster> 
If this is showing any peers as lost or shut down then they'll need to be restarted; you can restart the gluster daemon via systemctl:
[forthea]root: systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-06-29 10:28:56 BST; 6h ago
 Main PID: 1039 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─1039 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INF...
           ├─1278 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glust...
           ├─1284 /usr/sbin/glusterfsd -s forthea --volfile-id mscteach-gride...
           ├─1285 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -...
           └─1329 /sbin/rpc.statd

Jun 29 10:28:57 forthea.inf.ed.ac.uk rpc.statd[1329]: Version 1.3.0 starting
Jun 29 10:28:57 forthea.inf.ed.ac.uk sm-notify[1330]: Version 1.3.0 starting
[forthea]root: 
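If glusterd has stopped on a node, something like the following should bring it back, after which recheck the peers:

systemctl restart glusterd
gluster peer status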

  • "Stale file handles " are basically as per NFS, find the open copy of the file and kill the process or reboot the server

Gridengine

We've been requested to install gridengine. Unfortunately gridengine has rotted somewhat since Sun was bought up and Oracle sold it off, so we are currently using Son of Grid Engine, which is an open source fork. Note that other than the Son of Grid Engine website there is no real definitive set of documentation for the version of gridengine we are running; Sun's n6 docs should give a good overview.

Gridengine is made up of three types of nodes: execution hosts, which run the jobs; head nodes, which provide a working environment for people to schedule jobs from; and a scheduler node, which runs the scheduler and manages the execution nodes. All configuration is in the shared gridengine-common filesystem, which should be mounted as /opt/sge/default. Configuration of the scheduler can be done via the qconf command or the qmon command. Current jobs can be listed via qstat and the current state of hosts can be listed via qhost.

Some useful commands:

  • qhost - list hosts
  • qsub - submit jobs
  • qstat - list info about jobs
  • qconf - configure aspects of the cluster
  • qmon - GUI configuration utility

Generally we would recommend using qmon; it's really horrible, but it's not as steep a learning curve as qconf.
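For quick read-only checks a few qconf queries are handy (standard gridengine options, listed as examples rather than a complete reference):

qconf -sql        # show the list of queues
qconf -sel        # show the list of execution hosts
qconf -sh         # show the list of administrative hosts
qconf -sq gpgpu   # show the configuration of the gpgpu queue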

Common issues

  • Deleting stuck jobs
    [renown]iainr: qstat -u \*
    job-ID  prior   name       user         state submit/start at     queue                
    ---------------------------------------------------------------------------------------
      17717 0.56000 scheduleex s0908078     r     06/28/2017 01:06:12 gpgpu@charles02.inf. 
      17715 0.56000 scheduleex s0908078     r     06/28/2017 01:04:42 gpgpu@charles17.inf. 
      17739 0.55750 TC_feedbac s1126151     r     06/28/2017 16:42:42 gpgpu@charles15.inf.3
      17741 0.55750 fdgfd      s1564225     r     06/28/2017 17:24:42 gpgpu@charles08.inf. 
      17745 0.55750 sdfds      s1564225     r     06/28/2017 18:10:42 gpgpu@charles04.inf. 
      17740 0.55750 fdsfd      s1564225     r     06/28/2017 17:17:12 gpgpu@charles07.inf. 
      17719 0.56000 scheduleex s0908078     r     06/28/2017 01:06:42 gpgpu@charles03.inf. 
      17714 0.55750 celeb_wgan s1473470     r     06/28/2017 00:40:42 gpgpu@charles15.inf. 
      17743 0.55750 fdsfd      s1564225     r     06/28/2017 17:37:42 gpgpu@charles09.inf. 
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles14.inf.1
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles18.inf.4
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles12.inf.2
      17784 0.55750 imagenet_w s1473470     r     06/29/2017 15:05:25 gpgpu@charles15.inf. 
      17787 0.55750 baseline_w s1473470     r     06/29/2017 15:50:10 gpgpu@charles12.inf. 
      17790 0.55750 relational s1473470     r     06/29/2017 15:54:40 gpgpu@charles12.inf. 
      16848 0.55100 test.sh    iainr        Eqw   06/14/2017 16:23:01                      
      17739 0.00000 TC_feedbac s1126151     hqw   06/28/2017 16:42:39                     4
    [renown]iainr: qdel 16848
    iainr has deleted job 16848
    [renown]iainr: qstat -u \*
    job-ID  prior   name       user         state submit/start at     queue                
    ---------------------------------------------------------------------------------------
      17717 0.56000 scheduleex s0908078     r     06/28/2017 01:06:12 gpgpu@charles02.inf. 
      17715 0.56000 scheduleex s0908078     r     06/28/2017 01:04:42 gpgpu@charles17.inf. 
      17739 0.55750 TC_feedbac s1126151     r     06/28/2017 16:42:42 gpgpu@charles15.inf.3
      17741 0.55750 fdgfd      s1564225     r     06/28/2017 17:24:42 gpgpu@charles08.inf. 
      17745 0.55750 sdfds      s1564225     r     06/28/2017 18:10:42 gpgpu@charles04.inf. 
      17740 0.55750 fdsfd      s1564225     r     06/28/2017 17:17:12 gpgpu@charles07.inf. 
      17719 0.56000 scheduleex s0908078     r     06/28/2017 01:06:42 gpgpu@charles03.inf. 
      17714 0.55750 celeb_wgan s1473470     r     06/28/2017 00:40:42 gpgpu@charles15.inf. 
      17743 0.55750 fdsfd      s1564225     r     06/28/2017 17:37:42 gpgpu@charles09.inf. 
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles14.inf.1
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles18.inf.4
      17738 0.55750 TC_feedbac s1126151     r     06/28/2017 16:40:57 gpgpu@charles12.inf.2
      17784 0.55750 imagenet_w s1473470     r     06/29/2017 15:05:25 gpgpu@charles15.inf. 
      17787 0.55750 baseline_w s1473470     r     06/29/2017 15:50:10 gpgpu@charles12.inf. 
      17790 0.55750 relational s1473470     r     06/29/2017 15:54:40 gpgpu@charles12.inf. 
      17739 0.00000 TC_feedbac s1126151     hqw   06/28/2017 16:42:39                     4
    [renown]iainr: 
    
The -f option can be used to remove more stubborn jobs. Note that, for reasons that remain bizarre, to list all users' jobs (and not just your own) you have to use -u and escape the * (all users).
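For example (a sketch, reusing the stuck job ID from the listing above):

qstat -u \*       # list every user's jobs
qdel -f 16848     # force-delete a job that a plain qdel won't remove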

  • Host information: you can dump the information that the scheduler knows about the hosts using the qhost command:
[renown]iainr: qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTOS
---------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       --

charles01               lx-amd64       32    2   16   32  0.09   62.7G    5.2G   31.3G0
charles02               lx-amd64       32    2   16   32  1.11   62.7G    8.3G   31.3GM
charles03               lx-amd64       32    2   16   32  0.06   62.7G    7.4G   31.3GM
charles04               lx-amd64       32    2   16   32  0.01   62.7G    7.4G   31.3GM
charles05               lx-amd64       32    2   16   32  0.03   62.7G    6.7G   31.3GM
charles06               lx-amd64       32    2   16   32  0.03   62.7G    5.4G   31.3GM
charles07               lx-amd64       32    2   16   32  0.01   62.7G    5.2G   31.3GM
charles08               lx-amd64       32    2   16   32  1.01   62.7G    9.7G   31.3GG
charles09               lx-amd64       32    2   16   32  0.02   62.7G    6.2G   31.3GG
charles10               lx-amd64       32    2   16   32  0.04   62.7G    3.8G   31.3GM
charles11               lx-amd64       24    2   12   24  0.04   62.8G    2.7G   31.4G0
charles12               lx-amd64       24    2   12   24  2.18   62.8G   41.6G   31.4GG
charles13               lx-amd64       24    2   12   24  0.01   62.8G    1.6G   31.4GM
charles14               lx-amd64       24    2   12   24  1.01   62.8G   26.1G   31.4GM
charles15               lx-amd64       32    2   16   32  4.49   62.8G   43.6G   31.2GG
charles16               lx-amd64       32    2   16   32  0.02   62.8G    5.4G   31.2G0
charles17               lx-amd64       32    2   16   32  1.14   62.8G    4.2G   31.2GM
charles18               lx-amd64       32    2   16   32  2.04   62.8G   13.3G   31.2GG
...
Note that if the load and memuse columns are showing - then the execution daemon (sgeexecd) is not running on that node. If a host is showing all -'s then it has not been installed yet. There is also a "global" host that is used to hold some default configuration for all hosts.

  • Restarting sgeexecd:
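A rough recipe only, assuming the same /opt/sge layout as the qmaster below (the settings.sh path is an assumption based on the common filesystem being mounted at /opt/sge/default):

# on the affected execution node, as root
. /opt/sge/default/common/settings.sh
pkill -f sge_execd                 # or qconf -ke <hostname> from an admin host
/opt/sge/bin/lx-amd64/sge_execd    # start a fresh execution daemon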

  • Restarting qmaster: in the last couple of weeks systemd has had issues restarting qmaster if the machine is rebooted; for the moment I've been starting it by hand:
/opt/sge/bin/lx-amd64/sge_qmaster

If you want debugging information then you need to set an environment variable:

export SGE_ND=1
/opt/sge/bin/lx-amd64/sge_qmaster

  • To install an execution node


<verbatim>
[forthea]iainr: cd /opt/sge
[forthea]iainr: ls
bin      doc       install_execd    inst_sge  man  pvm   util
default  examples  install_qmaster  lib       mpi  qmon  utilbin
</verbatim>

Make sure that default is mounted:

<verbatim>
[forthea]iainr: ls default
common  spool
[forthea]iainr: 
</verbatim>

Install the exec node:

<verbatim>
[forthea]iainr: ./install_execd


Welcome to the Grid Engine execution host installation
------------------------------------------------------

If you haven't installed the Grid Engine qmaster host yet, you must execute
this step (with >install_qmaster<) prior the execution host installation.

For a successful installation you need a running Grid Engine qmaster. It is
also necessary that this host is an administrative host.

You can verify your current list of administrative hosts with
the command:

   # qconf -sh

You can add an administrative host with the command:

   # qconf -ah <hostname>

The execution host installation will take approximately 5 minutes.

Hit <RETURN> to continue >> 

</verbatim>

The script should take configuration information from the common filesystem. You can install the qmaster node in a similar fashion.
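For completeness, the qmaster install follows the same pattern; a sketch (run from /opt/sge on the intended master node and answer the prompts, which should again pick up their defaults from the common filesystem):

<verbatim>
cd /opt/sge
./install_qmaster
</verbatim>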


GPUs
Information about the GPUs can be obtained via the nvidia-smi command:

<verbatim>
[charles17]iainr: nvidia-smi
Fri Jun 30 14:53:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 0000:02:00.0     Off |                  N/A |
| 24%   35C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 0000:03:00.0     Off |                  N/A |
| 29%   42C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 0000:81:00.0     Off |                  N/A |
| 26%   38C    P8    17W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 0000:82:00.0     Off |                  N/A |
| 25%   37C    P8    16W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</verbatim>

If there are processes accessing the GPU then they will be listed:
<verbatim>
[charles12]iainr: nvidia-smi
Fri Jun 30 14:57:54 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 0000:02:00.0     Off |                  N/A |
| 41%   69C    P2   141W / 250W |  11705MiB / 12189MiB |     46%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 0000:04:00.0     Off |                  N/A |
| 23%   31C    P8    16W / 250W |      0MiB / 12189MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 0000:83:00.0     Off |                  N/A |
| 72%   86C    P2   254W / 250W |  11705MiB / 12189MiB |     74%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 0000:84:00.0     Off |                  N/A |
| 34%   58C    P2    83W / 250W |  11191MiB / 12189MiB |     91%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     99508    C   python                                       11703MiB |
|    2     99045    C   python                                       11703MiB |
|    3     45132    C   python                                       11189MiB |
+-----------------------------------------------------------------------------+
</verbatim>

There should be an associated process with each PID above:

<verbatim>
[charles12]iainr: ps auxww|grep 99508
s1473470  99508  116 18.8 45058704 12437588 ?   Rl   Jun29 1606:12 python train_celeb_wgan.py --experiment_title relational_wgan --relational_plus_discr True --batch_size 100
iainr    143110  0.0  0.0 112656   964 pts/0    S+   14:59   0:00 grep --color=auto 99508
</verbatim>

Sometimes a user-space process will die and zombify, in which case the GPU will fault:
<verbatim>
[charles19]root: nvidia-smi
Fri Jun 30 15:01:18 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
| 23%   18C    P8    14W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
| 23%   24C    P8    16W / 250W |  11621MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:81:00.0     Off |                  N/A |
| 23%   21C    P8    15W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
| 23%   23C    P8    16W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1     17235    C   Unknown Error                                11619MiB |
+-----------------------------------------------------------------------------+
[charles19]root: ps auxww|grep 17235
root     14869  0.0  0.0 112652   948 pts/0    S+   15:01   0:00 grep 17235
s1473470 17235  0.0  0.0      0     0 ?        Zl   Jun17   0:42 [python] <defunct>
[charles19]root: 
</verbatim>
The only solution we're aware of to fix this is to reboot the node. It's usually best to disable all the queue instances on it first and let other jobs finish.
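Disabling and re-enabling queue instances can be done with qmod; a sketch using the gpgpu queue and the charles19 host from the example above (adjust the queue and host names as appropriate):

<verbatim>
qmod -d 'gpgpu@charles19.inf.ed.ac.uk'   # stop new jobs being scheduled on the node
qstat -u \*                              # watch for the remaining jobs on the node to finish
# ...reboot the node, then...
qmod -e 'gpgpu@charles19.inf.ed.ac.uk'   # re-enable the queue instance
</verbatim>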

                                                            





-- Main.IainRae - 18 May 2016