Teaching GPU Cluster Hardware Configuration

This is the final report for CompProj:462

Description

This project follows on from CompProj:455 (Purchasing a 200 GPU cluster). Because of the time constraints involved in getting the cluster into service, and because the nodes ([[https://www.asus.com/uk/Commercial-Servers-Workstations/ESC8000_G3/][ASUS ESC8000 G3]]) were hardware we had not dealt with before, it was not possible to set up the serial consoles. We had also identified a number of hardware issues which needed resolution. This project covers the work required to bring the cluster nodes up to standard.

Issues

With little time before the start of the course it was not possible to resolve several hardware issues with the nodes, which were hardware we had not dealt with before. Specifically:

1 Serial Consoles

We needed to configure the BIOS to support IPMI and serial console access.
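On the OS side, a serial console setup typically pairs a kernel console argument with a getty on the same port. As an illustration only (the actual port and speed used on these nodes are not recorded here; ttyS1 at 115200 baud is an assumption), the GRUB side might look like:

```shell
# /etc/default/grub -- illustrative serial console settings
# (ttyS1 and 115200 baud are assumptions, not the values used on the cluster)
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200 --word=8 --parity=no --stop=1"
```

After regenerating the GRUB configuration, systemd's getty generator starts a `serial-getty@ttyS1` instance automatically once the kernel sees the `console=` argument, so no extra getty configuration is normally needed.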

2 Standardise disk order

The ESC8000 G3 comes with six removable 2.5" disks in unlabelled hot-swap bays on the right-hand side of the front of the chassis. During the OS install of the first batch of nodes we noticed that on some nodes the drives ran sda-sdf top to bottom, whilst on other nodes they ran sdd-sdf, sda-sdc top to bottom. Initially we juggled disks to ensure that the OS was installed on the correct disk, but this left a random distribution of two different disk configurations across the cluster.
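One way to see this wiring difference without opening a chassis is to compare each drive letter against its SATA port number under /dev/disk/by-path. A minimal sketch (the function name, and passing the directory as a parameter, are our invention for illustration):

```shell
# Print "port drive" pairs in SATA-port order, so a node whose backplane
# cabling is swapped shows e.g. sdd rather than sda on the first port.
# Pass /dev/disk/by-path on a real node; any directory of symlinks works.
disk_order() {
    dir="$1"
    for link in "$dir"/*-ata-*; do
        [ -L "$link" ] || continue          # skip if the glob matched nothing
        port="${link##*-ata-}"              # text after the last "-ata-"
        printf '%s %s\n' "$port" "$(basename "$(readlink "$link")")"
    done | sort -n
}
```

On a consistently cabled node the drive letters come out in port order; on a miswired node the two halves appear swapped.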

3 Standardise BIOS/Configuration

In the period between the first batch of servers arriving and the last (they were delivered in roughly three batches) the BIOS was revised, so we had two different BIOS versions across the cluster. In addition, the first batch were configured to use the GPU cards for video out (we had spotted this with the example machine, but the first batch had already shipped by the time this was passed back to Viglen), as were a couple of the other nodes.
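Auditing which nodes carry which BIOS revision reduces to collecting `dmidecode -s bios-version` from every node (via ssh or a parallel shell) and flagging the minority. A hypothetical helper for that last step (the function name and the "hostname version" input format are our invention):

```shell
# Read "hostname bios-version" lines on stdin and print the hosts whose
# version differs from the most common one across the cluster.
bios_outliers() {
    awk '{ count[$2]++; host[NR] = $1; ver[NR] = $2 }
         END {
             max = 0
             for (v in count) if (count[v] > max) { max = count[v]; best = v }
             for (i = 1; i <= NR; i++) if (ver[i] != best) print host[i], ver[i]
         }'
}
```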

Work Completed

1 Serial Consoles

The serial console configuration was not significantly different from that of the ASUS ESC 4000 G4 machines we had previously bought, and modified instructions have been created. Up to this point it had not been possible to get serial access to the BIOS screens, only to the OS.

2 Standardise disk order

On investigation, the disk bays were connected to the motherboard via two high-density SATA connectors (three including the front fascia controls). Swapping these around put the disks into a logical order, but required removal of three of the GPUs for access. The drives were reshuffled and all the nodes are now configured with the drives running sda-sdf downwards.

3 Standardise BIOS/Configuration

Unfortunately neither Viglen nor ASUS could provide us with a configuration tool, so this had to be done by hand. First the latest BIOS version was installed on each node, then the IPMI/serial console configuration was applied, and finally there was a quick visual inspection of the video, CPU and disk configurations to ensure that virtualisation was enabled, VGA output was onboard, and the disks were being seen correctly.

Finally, after all of the above, the nodes were re-installed, partly to verify that the hardware was all configured properly and partly because some nodes still had the OS installed on the wrong disks.

Time taken

Approximately 9 days.

Conclusions

The issue with the VGA output was probably foreseeable, and we should specify this with all GPU server purchases. The disk miscabling probably was not; arguably we could have asked Viglen to rectify it, but working out which machines were involved, arranging visits and supervising engineers would probably have taken as much time as doing it ourselves. We had hoped to work out the initial configuration of the hardware with the first machine to be shipped; delays in shipping, getting the node installed and other work partly negated this, but it did allow us to confirm that we could configure a basic serial console and did not need to purchase any additional kit, so it is something I would do again.

-- IainRae - 04 Feb 2019

Topic revision: r3 - 06 Mar 2019 - 09:31:37 - IainRae