200 GPU Cluster

This is the final report for CompProj:447

Description

This project covers the acquisition of a 200 GPU cluster to support the MLP course in February 2018. Upgrading to a fully DICE supported cluster is covered in a separate project.

Initial request

We were contacted by Steve Renals in September 2017, who was concerned about the resources available for his MLP course. At an initial meeting it became apparent that the numbers would far outstrip what was currently available through the MSc teaching cluster and that we would have to source something of the order of 200 GPUs plus associated infrastructure. Given the number of students, he had funding to expand the number of GPUs available to the course.

The use of cloud based solutions was ruled out because there was no guarantee of concurrent availability for lab sessions and because the price of one year's access would be a significant proportion of the estimated cost of enough hardware to run the course.

Steve was particularly concerned that, if at all possible, each student should get access to their "own" GPU. Given the size of the class (~190), this gave an initial back-of-the-envelope costing of ~$380,000 if we were to use our "standard" Dell T630 with 4 GeForce 1080 Ti cards. It would also mean that we would have to provide networking and power for 48 servers and find 240U of racking (48 servers at 5U each). It was felt that this was impractical. Other issues we had to deal with were power, and somehow restricting access to one GPU per student, as our current scheduling software (gridengine) had issues doing this.

In trying to resolve these problems we investigated a number of options:

1. Cheaper/lower power cards

Given that the cluster was to be used primarily for teaching, it was decided that lower power, cheaper NVIDIA cards could be used instead of the 1080s. We obtained a GTX 1050 (~120) and tested sample code on this before settling on GTX 1060s (~200) to give a little headroom in performance. Both cards were considerably cheaper than the 1080s at ~750 each. Unfortunately none of the cards would fit in existing lab machines.

2. Using workstation PCs rather than rackmount

One initial thought was to replace existing lab PCs with machines capable of taking an appropriately powered GPU, or to add more machines. This would have involved the purchase of a considerable number of PCs but would in theory have been possible to do off contract. It would also have simplified the provision of GPUs since there would be no need for a scheduler, login nodes, filesystems or other services. This was eventually discounted because:

* Users would want to tie PCs up with 8 hour long jobs, preventing others from using the PC.
* It's likely that the load on the PC from tensorflow would be very noticeable to anyone logged into the main console.
* There was unlikely to be enough space for additional machines, and it was also unclear if there would be sufficient spare power capacity in the labs to add the workstations required.
* Only one contract supplier could supply a suitable workstation, and as they were unable to provide a sample machine within 21 days because of CPU supply issues it was felt they couldn't be relied on to produce 100-200 by the start of semester 2.

3. Higher density servers

In parallel to the above we looked into the possibility of reducing the space the GPUs would take up by purchasing servers with a higher density of GPUs than the T630s we have been using up till now. The T630s provide 4 GPUs in 5U; we have a number of rebadged ASUS servers which provide 4 GPUs in 2U, and there is equipment on the market which can provide up to 10 GPUs in 4U. Even with this level of density it was estimated that we would need three 42U racks worth of space to install the cluster, and the only place where this would be possible was the Forum. Unfortunately the load on the power supply to the Forum server room is already fairly high and, given the estimated load, it did not seem possible to install all the cluster nodes in the Forum without major electrical works. Given that the timetable for the project would not allow even minor electrical works, it did not seem possible to install the whole cluster in any one of the School's server rooms: the Forum had the space but not the power, JCMB had the power but not the space, and AT had the power but not the space.

4. Alternative software

If we were to use any of the configurations other than having dedicated desktops we would need to install some kind of scheduling software. If we were to have more than one GPU per node then the software would have to be able to enforce resource restrictions (e.g. one GPU and no more than 4 CPU cores per job). Unfortunately Gridengine does not do this natively and, whilst we are working towards a home-built solution restricting GPUs via cgroups, it wasn't obvious that this could be completed in time. The two most obvious solutions were:

Univa Grid Engine

A closed source, paid-for-support version of grid engine (http://www.univa.com/products/). This would allow us to preserve some of our knowledge investment in gridengine, but it would mean we would either be running two different versions (the CDT and the MSc teaching cluster were using gridengine) or else we would have to add support for these other clusters as well.

Slurm

Another open source scheduler, which also supports GPU resource allocation using [[https://slurm.schedmd.com/gres.html][cgroups]]. The only downside of running Slurm is that it would mean we would be running multiple schedulers, unless we could move the CDT and MSc clusters off gridengine.
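
As a rough sketch of what this enforcement looks like in Slurm (the node names, core counts and GPU counts below are illustrative, not our actual configuration), the GRES and cgroup plugins are wired together along these lines:

    # slurm.conf (fragment) -- hypothetical nodes with 4 GPUs and 16 cores each
    GresTypes=gpu
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    NodeName=gpunode[01-48] CPUs=16 Gres=gpu:4 State=UNKNOWN

    # gres.conf -- one entry per GPU device file on each node
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=gpu File=/dev/nvidia2
    Name=gpu File=/dev/nvidia3

    # cgroup.conf -- confine each job to the cores and GPU devices it was allocated
    ConstrainCores=yes
    ConstrainDevices=yes

A student job then asks for a single GPU and a few cores (e.g. srun --gres=gpu:1 -c 4 ...) and the cgroup device controller stops it from seeing the other GPUs in the node.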


5. Decisions and the Purchasing Process

Having decided on a rackmount solution as the most sensible, we looked at where we could install it. In the end we decided the only practical proposition was to run with one rack at each of KB, JCMB and AT. In doing so we were accepting that this might introduce network constraints for non course related use of the cluster. It also meant that we had one rack of kit which was much less accessible than the other two, so remote console usage would be a major factor in the purchasing decision.

We used a number of online configuration tools to develop a likely configuration/price. What we found was that most companies offering 6+ GPUs in 4U were system integrators rather than manufacturers, and as such none were on the NSSA framework. We contacted the integrators to find out their resellers and, at the suggestion of Purchasing, we also looked at non-NSSA frameworks to see if we could get a purchasing path to our possible suppliers. We identified the Crown Commercial Services RM3733 Products 2 Lot 1 framework as the most likely and then approached Purchasing to generate the actual ITT. The preliminary work done in identifying a solution, the timeframe and possible suppliers allowed us to short circuit some of the standard purchasing "journey" and we were able to generate an ITT fairly quickly. We had also contacted all the suppliers on the lot to advise them that we would be putting out an ITT, and what it was for in very general terms, asking any interested parties to contact us.

The ITT was fairly straightforward, although there were the traditional "not really understanding the requirements" responses from some suppliers. The winners were Viglen (XMA) supplying ASUS kit. We had asked for an initial sample as soon as possible, followed by nine more, followed by the balance. Delivery was somewhat messed up, with the initial server taking a lot longer to arrive than the rest of the first batch, and the rest arrived in fits and starts, probably as it came off the Viglen assembly lines. We were unable to get the batch for KB delivered directly there, but it all worked out in the end.

We also got a number of spare GPUs to allow us to quickly replace any faulty cards outside of the Viglen warranty process.

6. Supporting Infrastructure

In addition to the nodes we purchased three racks and appropriate PDUs and networking through the usual suppliers. Racks were installed by Ian and the Techs.

The head nodes were press-ganged machines destined for the bin; the scheduler software was run on a VM on a RAT development server.

7. Installation

With very limited time available between delivery and the start of the course, much of the cluster was thrown together, often without much thought given to the configuration and without the acceptance tests we'd usually do on a spend like this.

7.1 Infrastructure

Because of the lead time on some of the networking kit we had to use switches from the Infrastructure Unit's "spares pile"; these will be swapped out for the new kit at a later date. There was a deliberate decision not to install redundant networking in order to ensure we got the maximum number of GPUs possible.

7.2 Nodes

The nodes were installed at AT first, then KB and then in the Forum, that being the order in which the racks were ready. There was no real attempt to configure remote consoles beyond an initial attempt, which failed because of the IPMI password length problem that we have since found with all our IPMI consoles. We decided to rely on power cycling, either hands on or remotely via IPMI or the PDUs (see the example below). Another foible was that once the OS was installed the hard drives enumerated as sde, sdf, sda, sdb, sdc, sdd, which was left to be fixed at a later date.
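
For reference, the remote power cycling amounts to something like the following, where the BMC hostname and credentials are placeholders rather than real values:

    # query and cycle a node's power via its BMC (hypothetical host/credentials)
    ipmitool -I lanplus -H gpunode01-ipmi -U admin -P 'xxxxxxxx' chassis power status
    ipmitool -I lanplus -H gpunode01-ipmi -U admin -P 'xxxxxxxx' chassis power cycle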

7.3 GlusterFS

There being no time or budget to put together a filesystem, it was decided to steal the MSc teaching cluster's filesystems and press-gang its GPU servers into service; it had already been decided that the MSc teaching cluster needed dedicated fileservers anyway. The dropping of SGE provided a volume with 24TB of replicated filesystem spread over 12 bricks on 6 servers.
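
For the record, a replicated volume of that shape is created along roughly the following lines; the host names, brick paths and volume name here are illustrative rather than the actual ones:

    # 12 bricks over 6 servers, replica 2 (hypothetical hosts and paths)
    gluster volume create mlpvol replica 2 \
      gfs1:/data/brick1 gfs2:/data/brick1 \
      gfs3:/data/brick1 gfs4:/data/brick1 \
      gfs5:/data/brick1 gfs6:/data/brick1 \
      gfs1:/data/brick2 gfs2:/data/brick2 \
      gfs3:/data/brick2 gfs4:/data/brick2 \
      gfs5:/data/brick2 gfs6:/data/brick2
    gluster volume start mlpvol
    # clients mount it with the native client
    mount -t glusterfs gfs1:/mlpvol /mnt/mlp

Each consecutive pair of bricks forms a replica pair on two different servers, giving six distributed subvolumes.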

7.4 Slurm

Initially a basic Slurm configuration was set up, with the configuration files generated via the file component and the daemons controlled (sort of) via systemd. Slurm uses a credential system called "munge" for job and other authentication, and the credential keys were copied around by hand.
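
Copying the munge credentials by hand amounts to generating a single key on the head node and pushing it to every node before the daemons start; a minimal sketch, assuming hypothetical node names and root ssh access:

    # on the head node: create the shared munge key (as per the munge documentation)
    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key

    # push it to each compute node and start the daemons (hypothetical names)
    for n in gpunode{01..48}; do
      scp -p /etc/munge/munge.key root@$n:/etc/munge/munge.key
      ssh root@$n 'systemctl enable --now munge slurmd'
    done

    # and on the head node itself
    systemctl enable --now munge slurmctld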

Time taken

Approx. 602 hours.

Conclusions

Whilst we can do purchases like this at very short notice, it is not to be advised: it will likely take more time untangling the mess we've been left with than it would have taken to buy the thing properly in the first place.

As ever when specifying an ITT, you need to spell out IN WORDS OF ONE SYLLABLE OR LESS exactly what you want from suppliers. For example, if you want an OS disk and data storage on the node and don't care about the speed or makeup, it's probably better to state something like:

"1. must provide an OS filesystem of > 500GB
2. must provide a data storage filesystem of >= 10TB
3. in 1 and 2, SATA based storage is acceptable"

rather than

"must provide a 1TB home disk, must provide a 10TB data disk"

since the supplier might make assumptions about the type of disk required, and might not think of providing a RAID array in the case of the data storage.

It's worth contacting Purchasing in the early stages to get advice; it's equally worth doing your own homework and contacting possible suppliers in advance. In general you can do this and give them a spec to provide a sample quote, provided you do the same for all suppliers you contact.

Where you've got a lot of suppliers possibly bidding on an ITT, it's worth giving them a heads up of roughly what you're looking for, as it'll give them time to prepare and give you an idea of how many bids you might receive.

-- IainRae - 16 May 2018
