GPU machines: Care and Feeding

This page covers the setup of the DICE managed GPU machines in the School.

Fault Log

  • 2014-04-28: dechmont - network card bouncing
    • seems to have been going on for some days; approx. 50% packet loss
    • dmesg includes eth0 resets; nagios has a long log of 'soft' failures.
  • 2014-04-30: dechmont - cold reboot to reset network card
    • IPMI commands and console input don't work (output seems fine)
    • doesn't come back up after a total power failure - this might be the fault of "shutdown -h"
    • Configured IPMI with help from idurkacz; console remains wired for now.

Machines

The machines are mainly basic compute servers, and all should have GPFS mounted, but the hardware requirements have forced some purchasing compromises. Most of the machines are based around the Viglen Personal Supercomputer/Supermicro X9DRG-QF.

First generation

These are large towers with two of the older Nvidia cards.

| *hostname* | *Model* | *GPU type* | *Location* | *Serial number* | *Console* | *Comments* |
| roswell | Alienware Area51 | 2 x Nvidia GTX 480 | Server room, dexian shelving beside beowulf rack | 4RP6X4J | No serial console capability | Out of warranty. This is getting fairly old now and the cards can only support the older Nvidia drivers. |
| rendlesham | Viglen VIG410P | 2 x Nvidia GTX 590 | Server room, dexian shelving beside beowulf rack | 2211023 | | In warranty. Fairly well behaved. |

Second generation

These are, umm... through-decked servers.

| *hostname* | *Model* | *GPU type* | *Location* | *Serial number* | *Console* | *Comments* |
| bonnybridge | Viglen/Supermicro X8DTG-QF | Originally 2 x NVIDIA GTX 680, now 1 x GTX 680 plus 2 x GTX 690 | | 2240648 | | Additional SSDs were bought and installed. This is the only "older" config of the Viglen PSC; the other 680 has been "lent" to schaffner. |
| schaffner | Viglen/Supermicro X9DRG-Q | Originally 2 x NVIDIA GTX 690. Currently 1 x GTX 690, 1 x GTX 680, 1 x GTX Titan | | 2329899 | | Motherboard replaced under warranty. This should have 2 x 690 and 1 x GTX 680. |
| dechmont | Viglen/Supermicro X9DRG-Q | Originally 2 x NVIDIA GTX 690 | | 2289955 | IPMI + wired console | Fairly well behaved. This should have 2 x GTX 690 and 1 x GTX Titan. |
We should have 3 graphics cards per machine but one is dead and needs to be replaced.

Third Generation

These are more powerful rack mounted workstations.

| *hostname* | *Model* | *GPU type* | *Location* | *Serial number* | *Console* | *Comments* |
| lazar | Viglen/Supermicro X9DRG-Q | 3 x Nvidia GTX Titan | Rack | - | | |
| adamski | Viglen/Supermicro X9DRG-Q | 3 x Nvidia GTX Titan | Rack | | | |
| hynek | Viglen/Supermicro X9DRG-Q | 3 x Nvidia GTX Titan | Rack | - | | |

Configuration

The GPU machines are basically compute servers, but with a cuda header to provide the CUDA software and additional in-profile configuration for hardware-specific settings. From dechmont onwards they have a small root disk and the main disk is configured as /disk/scratch, /disk/scratch_ssd or some other /disk/scratch_* variant.
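
As a rough sketch of what that means in a source profile (the dice/options/ paths here are an assumption; check the real header names in use), it is essentially the compute server header plus the cuda header, followed by any machine-specific resources:

/* minimal sketch of a GPU machine source profile; header paths are assumed, not verified */
#include <dice/options/compute_server.h>
#include <dice/options/cuda.h>

/* machine-specific resources (SSD mount options, I/O scheduler, etc.) go here */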

SSDs

From bonnybridge onwards the Supermicro machines have SSDs installed. These use ext4 with some SSD-specific mount options:

fstab.mntopts_sdb1      defaults,discard,data=writeback,noatime,commit=15

and we tweak the kernel component to change the I/O scheduler:

!kernel.set             mEXTRA(ssdsched)
kernel.tag_ssdsched     block/sda/queue/scheduler
kernel.value_ssdsched   noop

These options were all that was possible with the kernel available in early 2014. Better options are available with later kernels and this should be revisited (say May 2014).
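
To check (or temporarily override) the scheduler on a running machine, something like the following works; substitute whichever block device actually holds the SSD filesystem:

cat /sys/block/sda/queue/scheduler           # the active scheduler is shown in [brackets]
echo noop > /sys/block/sda/queue/scheduler   # one-off manual change; the kernel resources above make it persistent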

LSI Raid cards and Cachecade

From dechmont onwards the Supermicro machines have LSI SAS 9260-8i cards installed. In theory these can be configured using the MegaRAID Linux CLI; in practice the BIOS-level interface is much easier. Dechmont does not have a CacheCade license key so its SSDs are just in a RAID 0. The three third generation machines have CacheCade licenses and are configured with 400G (2 x 200G disk) of SSD cache for the large disk and the remainder of the SSD in a RAID 0 as per dechmont. The LSI card presents the resulting volumes to Linux as ordinary disks:

sd 0:2:0:0: [sda] 3905945600 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:1:0: [sdb] 1998323712 512-byte logical blocks: (1.02 TB/952 GiB)
sd 0:2:2:0: [sdc] 3905945600 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:1:0: [sdb] Write Protect is off
sd 0:2:2:0: [sdc] Write Protect is off
sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
sd 0:2:2:0: [sdc] Mode Sense: 1f 00 00 08
sd 0:2:2:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:2:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

and obviously stuff like SMART doesn't work directly on these devices (it may be possible to query the drives through the LSI card but I've not looked at this).
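
If anyone does want to look at it, smartmontools can usually reach drives behind a MegaRAID controller via its megaraid passthrough; the drive numbers below are placeholders and need to match the controller's own numbering:

smartctl -a -d megaraid,0 /dev/sda    # SMART data for the first physical drive behind the controller
smartctl -a -d megaraid,1 /dev/sda    # ...and the next; bump the number for each drive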

Software

The only software installed in addition to compute_server.h is cuda.h, which includes a number of versions of the CUDA development environment. Initially none were set up as the default environment, but in order to aid debugging and the building of some CUDA utilities (cudamemtest), cuda5 has been set up as the default environment on some servers. This will probably be done across the board now that cuda6 has become available.
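
To see which toolkit a machine is currently defaulting to (this assumes nvcc is on the default PATH, which is only the case where a default environment has been set up):

nvcc --version    # reports the default CUDA toolkit release
which nvcc        # shows which of the installed toolkit trees is being picked up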

cudamemtest

This is a CUDA-based memory test suite along the lines of memtest86. It's been installed on some of the servers for testing various issues and it's planned to roll it out to all the servers in the near future. To install it on an arbitrary machine you need
!profile.packages mEXTRA(+cuda50-5.0.35-2.inf +cudamemtest-1.2.3-1.inf)
in the profile. There's a script to do a quick sanity check
[dechmont]iainr: /usr/bin/sanity_check.sh
[04/25/2014 12:30:32][dechmont][0]:Running cuda memtest, version 1.2.2
[04/25/2014 12:30:32][dechmont][0]:Warning: Getting serial number failed
[04/25/2014 12:30:32][dechmont][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module  331.49  Wed Feb 12 20:42:50 PST 2014
[04/25/2014 12:30:32][dechmont][0]:num_gpus=4
[04/25/2014 12:30:32][dechmont][0]:Device name=GeForce GTX 690, global memory size=2147287040
[04/25/2014 12:30:32][dechmont][0]:major=3, minor=0
[04/25/2014 12:30:32][dechmont][1]:Device name=GeForce GTX 690, global memory size=2146762752
followed by a very quick stress test.

nvidia-smi

Nvidia provide nvidia-smi to interrogate the server for installed cards and to retrieve status information from them:

[dechmont]iainr: nvidia-smi
Fri Apr 25 12:32:20 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 331.49     Driver Version: 331.49         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 690     Off  | 0000:05:00.0     N/A |                  N/A |
| 30%   38C  N/A     N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 690     Off  | 0000:06:00.0     N/A |                  N/A |
| 30%   39C  N/A     N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 690     Off  | 0000:86:00.0     N/A |                  N/A |
| 30%   41C  N/A     N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 690     Off  | 0000:87:00.0     N/A |                  N/A |
| 30%   38C  N/A     N/A /  N/A |      7MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
+-----------------------------------------------------------------------------+
[dechmont]iainr: 

If this can't run then either there are no working GPUs or the NVIDIA driver is improperly installed; it ought to be fairly clear from the output which is the more likely option. There are various command line options; see the manpage for details. Note that some GTX cards (690s, for example) are actually two Nvidia GPUs bolted together, so they will appear twice.
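
A couple of the more useful invocations (see the manpage for the rest):

nvidia-smi -L                  # one line per GPU, including UUIDs
nvidia-smi -q -d TEMPERATURE   # full query output, restricted to temperature information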

GPUs

Drivers for the GPUs are the standard NVIDIA proprietary ones. roswell is not supported by the current drivers, and some of the servers have had their drivers pinned at a specific version in response to requests from users.
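
To check which driver a machine is actually running (handy where versions have been pinned):

cat /proc/driver/nvidia/version   # version of the loaded NVIDIA kernel module
nvidia-smi | head -3              # the banner also reports the driver version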

Monitoring

Most of the monitoring is passive at the moment and the graphs are generated via ganglia. Some of it may feed into other things (like "too hot" alerts) via indirect routes.

Consoles/power

Most of the machines are running serial console redirection, so the console and the BIOS can be reached remotely. Roswell has no BIOS-level redirection and no serial console; a USB dongle is provided but this seems to have broken at some point. schaffner's replacement motherboard has an updated BIOS and there are currently unresolved issues with its config. Hynek is being used to test SOL/IPMI and may not be working at any given time. All of the power outlets should be controllable via FPU.
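
For the machines where SOL/IPMI is configured, something along these lines should get you a console and power control; the BMC hostname and user are placeholders and depend on how each board was set up:

ipmitool -I lanplus -H <bmc-hostname> -U <user> sol activate          # attach to the serial-over-LAN console
ipmitool -I lanplus -H <bmc-hostname> -U <user> chassis power status  # query the power state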

Foibles

schaffner will sometimes hang in grub; usually battering the keyboard at the "press any key to continue" part will clear it. If you decide to use an attached console, note that you can't just attach a monitor and keyboard and off you go: the NVIDIA driver loses track of which connector has a monitor attached and you'll have to reboot the thing. The most reliable way to do this seems to be to connect to the rightmost card (looking at the back of the chassis) and wait... speaking of which:

When booting, be prepared to wait a LLLLOOONNNGG time before anything appears on the screen. DELETE will get you the BIOS config and F12 a boot menu.

Help it's all gone wrong

Machine won't boot

Hang a monitor off it and see what comes out; try more than one graphics card connector as one card may have died. If it's hanging on the "press any key to continue" prompt, try to get in there quickly and/or drop into the boot item menu. Try removing the graphics cards, one might have died... though they tend to be independent of each other.

Wrong number of graphics cards

It is possible that one of the cards has died. It's really unlikely that someone has come along and stuffed another card in; it's more likely that you're getting confused by the fact that some NVIDIA cards are actually two GPUs in one chassis (e.g. GTX 690s) and are being reported twice.
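
A quick cross-check is to compare what the PCI bus sees with what the driver reports; each physical GTX 690 should appear as two GPUs in both lists:

lspci | grep -i nvidia    # everything NVIDIA on the PCI bus (GPUs plus their HDMI audio functions)
nvidia-smi -L             # the GPUs the driver can see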

-- IainRae - 24 Apr 2014
