GPU machines: Care and Feeding
This page covers the setup of the DICE managed GPU machines in the School.
Fault Log
- 2014-04-28: dechmont - network card bouncing
- seems to have been going on for some days; approx. 50% packet loss
- dmesg includes eth0 resets; nagios has a long log of 'soft' failures.
- 2014-04-30: dechmont - cold reboot to reset network card
- IPMI commands and console input don't work (output seems fine)
- doesn't come back up from total power failure - this might be the fault of "shutdown -h"
- Configured IPMI with help from idurkacz; console remains wired for now.
Machines
The machines are mainly basic compute servers and all should have gpfs mounted, but the hardware requirements have forced some purchasing compromises.
Most of the machines are based around Viglen Personal Supercomputer/Supermicro X9DRG-QF.
roswell
First generation
These are large towers with two of the older NVIDIA cards.
Second generation
These are double-decked (two-level chassis) rack servers.
We should have 3 graphics cards per machine but one is dead and needs to be replaced.
Third Generation
These are more powerful rack-mounted workstations.
Configuration
The GPU machines are basically compute servers, but with a cuda header to provide the CUDA software and additional in-profile configuration for hardware-specific settings. From dechmont on they have a small root disk, and the main disk is configured as disk/scratch or disk/scratch_ssd or disk/scratch_something
SSDs
From bonnybridge onwards the Supermicro machines have SSDs installed. These use ext4 with some SSD-specific mount options
fstab.mntopts_sdb1 defaults,discard,data=writeback,noatime,commit=15
and we tweak the kernel to change the scheduler
!kernel.set mEXTRA(ssdsched)
kernel.tag_ssdsched block/sda/queue/scheduler
kernel.value_ssdsched noop
These options were all that was possible with the kernel available in early 2014. Better options are available with later kernels and this should be revisited (say May 2014).
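On a live machine the scheduler tweak can be verified by reading the sysfs file the kernel component writes to; the kernel lists every available scheduler on one line with the active one in square brackets. A minimal sketch (the device name and the sample string are assumptions, stand-ins for the real sysfs read):

```shell
# On the machine itself you would run:
#   cat /sys/block/sda/queue/scheduler
# which prints something like "[noop] deadline cfq" once the ssdsched
# tweak has taken effect. Extracting the active scheduler from that line:
sched_line="[noop] deadline cfq"   # stand-in for $(cat /sys/block/sda/queue/scheduler)
active=$(echo "$sched_line" | sed -n 's/.*\[\([a-z-]*\)\].*/\1/p')
echo "$active"                     # prints "noop" when the tweak is in place
```

The mount options can be checked the same way with `mount | grep scratch_ssd`, looking for `discard` and `noatime` in the option list.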
LSI Raid cards and Cachecade
From dechmont onwards the Supermicro machines have LSI SAS 9260-8i cards installed. In theory these can be configured using the megaraid linux cli; in practice the BIOS-level interface is much easier. Dechmont does not have a CacheCade license key, so its SSDs are just in a RAID 0. The three third-generation machines have CacheCade licenses and are configured with a 400G (2x 200G disk) SSD cache for the large disk, with the remainder of the SSD in a RAID 0 as per dechmont. The LSI card presents these as ordinary physical disks to linux:
sd 0:2:0:0: [sda] 3905945600 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:1:0: [sdb] 1998323712 512-byte logical blocks: (1.02 TB/952 GiB)
sd 0:2:2:0: [sdc] 3905945600 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:1:0: [sdb] Write Protect is off
sd 0:2:2:0: [sdc] Write Protect is off
sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
sd 0:2:2:0: [sdc] Mode Sense: 1f 00 00 08
sd 0:2:2:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:2:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
and obviously things like SMART don't work directly (it may be possible to query through the LSI card but I've not looked at this).
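If the cli route is ever needed, the usual incantations are roughly as below. Treat this as a sketch rather than gospel: the MegaCli64 install path varies, and the megaraid device number N is the controller's own id for the physical disk, which does not map directly onto /dev/sdX letters.

```shell
# Logical drive layout as the LSI card sees it (path is the common
# RPM-install location, not guaranteed here):
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

# SMART data via smartctl's megaraid passthrough -- untested on these
# machines; N is the controller's device id for the disk behind the card:
smartctl -a -d megaraid,N /dev/sda
```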
Software
The only software installed in addition to compute_server.h is cuda.h, which includes a number of versions of the CUDA development environment. Initially none was set up as the default environment, but to aid debugging and building some cuda utilities (cudamemtest), cuda5 has been set up as the default environment on some servers. This will probably be done across the board now that cuda6 has become available.
cudamemtest
This is a CUDA-based memory test suite along the lines of memtest86. It's been installed on some of the servers for testing various issues, and it's planned to roll it out to all the servers in the near future.
To install it on an arbitrary machine you need
!profile.packages mEXTRA(+cuda50-5.0.35-2.inf +cudamemtest-1.2.3-1.inf)
in the profile.
There's a script to do a quick sanity check
[dechmont]iainr: /usr/bin/sanity_check.sh
[04/25/2014 12:30:32][dechmont][0]:Running cuda memtest, version 1.2.2
[04/25/2014 12:30:32][dechmont][0]:Warning: Getting serial number failed
[04/25/2014 12:30:32][dechmont][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 331.49 Wed Feb 12 20:42:50 PST 2014
[04/25/2014 12:30:32][dechmont][0]:num_gpus=4
[04/25/2014 12:30:32][dechmont][0]:Device name=GeForce GTX 690, global memory size=2147287040
[04/25/2014 12:30:32][dechmont][0]:major=3, minor=0
[04/25/2014 12:30:32][dechmont][1]:Device name=GeForce GTX 690, global memory size=2146762752
and do a very quick stress test
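When rolling cudamemtest out everywhere, it may be handy to check the reported GPU count automatically rather than eyeballing the log. A sketch (the EXPECTED value and the sample log line are assumptions; on a real machine you'd capture the relevant line from sanity_check.sh's output):

```shell
#!/bin/sh
# Pull num_gpus out of a sanity_check.sh-style log line and compare it
# with what the machine is supposed to have. EXPECTED=4 suits the
# GTX 690 boxes (two cards, two GPUs each); the sample line is copied
# from the transcript above.
EXPECTED=4
line='[04/25/2014 12:30:32][dechmont][0]:num_gpus=4'  # stand-in for real output
gpus=${line##*num_gpus=}
if [ "$gpus" -ne "$EXPECTED" ]; then
    echo "WARNING: expected $EXPECTED GPUs, found $gpus" >&2
    exit 1
fi
echo "$gpus GPUs present"
```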
nvidia-smi
Nvidia provide nvidia-smi to interrogate the server for installed cards and retrieve information from them.
[dechmont]iainr: nvidia-smi
Fri Apr 25 12:32:20 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.49 Driver Version: 331.49 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 0000:05:00.0 N/A | N/A |
| 30% 38C N/A N/A / N/A | 7MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 0000:06:00.0 N/A | N/A |
| 30% 39C N/A N/A / N/A | 7MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 690 Off | 0000:86:00.0 N/A | N/A |
| 30% 41C N/A N/A / N/A | 7MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 690 Off | 0000:87:00.0 N/A | N/A |
| 30% 38C N/A N/A / N/A | 7MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
| 2 Not Supported |
| 3 Not Supported |
+-----------------------------------------------------------------------------+
[dechmont]iainr:
If this can't run then either there are no working GPUs or the nvidia driver is improperly installed; it ought to be fairly clear from the output which is the more likely option. There are various command line options; see the manpage for details. Note that some GTX cards (690s, for example, which are essentially two 680s in one chassis) are actually two GPUs bolted together, so they will appear twice.
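A quicker way to count GPUs (and to see the doubling-up) is `nvidia-smi -L`, which prints one line per logical GPU. A sketch using sample output matching a two-card GTX 690 machine (the UUIDs are placeholders; on a live machine you'd replace the here-doc with the real command):

```shell
# On a live box: count=$(nvidia-smi -L | grep -c '^GPU')
# Sample output shaped like a two-card GTX 690 machine:
count=$(grep -c '^GPU' <<'EOF'
GPU 0: GeForce GTX 690 (UUID: GPU-xxxxxxxx)
GPU 1: GeForce GTX 690 (UUID: GPU-xxxxxxxx)
GPU 2: GeForce GTX 690 (UUID: GPU-xxxxxxxx)
GPU 3: GeForce GTX 690 (UUID: GPU-xxxxxxxx)
EOF
)
echo "$count"   # 4 logical GPUs from 2 physical cards
```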
GPUs
Drivers for the GPUs are the standard NVIDIA proprietary ones. roswell is not supported by the current drivers, and some of the servers have had their drivers pinned at a specific version in response to requests from users.
Monitoring
Most of the monitoring is passive at the moment and the graphs are generated via ganglia. Some may feed into other things (such as the "too hot" checks) via indirect routes.
Consoles/power
Most of the machines are running serial console redirection, giving access to the console and the BIOS. Roswell has no BIOS-level redirection and no serial console; a USB dongle is provided but this seems to have broken at some point. schaffner's replacement motherboard has an updated BIOS and there are currently unresolved issues with its config. hynek is being used to test SOL/IPMI and may not be working at any given time. All of the power outlets should be controllable via FPU.
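For reference, typical ipmitool invocations for power control and SOL look like the following; the management hostname and user here are examples, not the real BMC details (dechmont's IPMI was set up as per the fault log above):

```shell
# Power state / hard power cycle via the BMC:
ipmitool -I lanplus -H dechmont-mgmt -U ADMIN chassis power status
ipmitool -I lanplus -H dechmont-mgmt -U ADMIN chassis power cycle

# Serial-over-LAN console; deactivate first if a stale session is stuck:
ipmitool -I lanplus -H dechmont-mgmt -U ADMIN sol deactivate
ipmitool -I lanplus -H dechmont-mgmt -U ADMIN sol activate
```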
Foibles
schaffner will sometimes hang in grub; usually battering at the "press any key to continue" part will clear it.
If you decide to use an attached console, note that you can't just attach a monitor and keyboard and off you go. The NVIDIA driver loses track of which connector has a monitor attached and you'll have to reboot the machine. The most reliable way to do this seems to be to connect to the rightmost card (looking at the back of the chassis) and wait... speaking of which:
When booting, be prepared to wait a LLLLOOONNNGG time before anything appears on the screen. DELETE will get you into the BIOS config and F12 a boot menu.
Help it's all gone wrong
Machine won't boot
Hang a monitor on it and see what comes out; try more than one graphics card as one may have died. If it's hanging on the "press any key to continue" prompt, try to get in there quickly and/or drop into the boot item menu.
Try removing the graphics cards, one might have died...though they tend to be independent of each other.
Wrong number of graphics cards
It is possible that one of the cards has died. It's really unlikely that someone has come along and stuffed another card in; it's more likely that you're getting confused by the fact that some NVIDIA cards are actually two GPUs in one chassis (e.g. GTX 690s) and they're being reported twice.
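One way to disambiguate "card died" from "dual-GPU card counted twice" is to compare lspci's view with nvidia-smi's: each half of a GTX 690 appears as its own VGA controller, so two physical cards give four entries, matching nvidia-smi's four GPUs. A sketch over sample lspci output (bus IDs taken from the nvidia-smi listing above; the device-name strings are illustrative):

```shell
# Live: lspci | grep -c 'VGA compatible controller.*NVIDIA'
# If lspci's count is lower than it should be, a card (or half of one)
# has genuinely vanished from the bus.
lspci_sample='05:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 690]
06:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 690]
86:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 690]
87:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 690]'
echo "$lspci_sample" | grep -c 'VGA compatible controller.*NVIDIA'
```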
-- IainRae - 24 Apr 2014