Teaching Cluster Technology

This is the final report for CompProj:381

Description

This project was intended to deliver clusters to support compute-intensive MSc teaching courses, including EXC, MLP, MT and ASR. The provision was a cluster supporting both CPU- and GPU-based teaching and projects. As a secondary role, the cluster would be made available to project students when not otherwise in use.

CPU cluster

The CPU cluster is primarily dedicated to running Hadoop, but nodes can be used for non-Hadoop computationally intensive work via gridengine when the Hadoop cluster is not in use. The hardware consists of:

  • 7 R430s with 40 cores and 6TB of HDD each
  • 4 R430s with 48 cores and 6TB of HDD each
These were bought in two tranches, in 2016 and 2017. This provides a Hadoop cluster with 432 virtual cores, 1TB of RAM and 64TB of replicated disk space. The version of Hadoop is updated from that used in the School's original cluster and from that used in the Hadoop-on-Eddie project. The cluster was rolled out with Hadoop 2.7.3 using the YARN scheduler, which involved a complete rewrite of the original SL6 hadoopconf file component. Since the new cluster is based on SL7 and the integration with systemd was not complete at the time the component was written, and given the lack of experience with YARN, it was decided to hold off on the daemon-control aspect of the component until a later date.
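
Moving to YARN means per-node resources are declared to the NodeManager rather than via the old slot model. As an illustration only, a minimal yarn-site.xml fragment for such a rollout might look like the following; the property names are standard in Hadoop 2.7.x, but the values here are invented examples, not the settings actually deployed on this cluster:

```xml
<!-- Illustrative yarn-site.xml fragment. Property names are standard
     Hadoop 2.7.x; the values are examples, not this cluster's settings. -->
<configuration>
  <!-- Memory (MB) and virtual cores YARN may allocate per NodeManager -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>90112</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>40</value>
  </property>
  <!-- Use the capacity scheduler (the YARN default) -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>
```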

GPU Cluster

The GPU cluster consists of 6 T630s with a mixture of Titan X and Titan X (Pascal) cards, with an R330 running as the gridengine scheduler and the nodes running gluster to provide both a filesystem for gridengine and local home directories. The T630s were easy to procure; the Titans were a complete pain.
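
In a hyperconverged layout like this, each node is both a gluster brick server and a client of the replicated volume. A hypothetical /etc/fstab entry mounting such a volume for home directories might look like the line below; the host names and volume name are invented for illustration, not taken from this cluster:

```
# Hypothetical fstab entry: mount a replicated gluster 'homes' volume.
# 'gpunode1', 'gpunode2' and the volume name are invented for illustration.
gpunode1:/homes  /home  glusterfs  defaults,_netdev,backupvolfile-server=gpunode2  0 0
```

The backupvolfile-server option lets the mount fall back to a second node if the first is down, which matters when the volume servers are the compute nodes themselves.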

Infrastructure

Both clusters come with associated networking, power and racking infrastructure.

Software

The gridengine/gluster combination was chosen to match the CDT cluster, largely for the same reasons.

Time taken

Approximately 8 weeks.

Conclusions/observations

The hyperconverged architecture on the GPU nodes works well with early CUDA code. However, with the introduction of TensorFlow and Miniconda it is much harder for gridengine to manage node resources and prevent job tasks from overwhelming the infrastructure; a move to a more modern scheduler is being considered and will be handled in a later project.
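
One common way to make gridengine account for GPUs and memory is to define consumable complexes that jobs must request at submission time. A minimal, hypothetical job script is sketched below; the 'gpu' complex and the resource values are assumptions for illustration, not this cluster's actual configuration. The '#$' lines are directives read by qsub and ignored by the shell:

```shell
#!/bin/bash
# Hypothetical gridengine job script. The '#$' lines are qsub directives;
# the 'gpu' consumable complex is an assumption, not this cluster's config.
#$ -N tf_train            # job name
#$ -l gpu=1               # request one GPU via a consumable complex
#$ -l h_vmem=16G          # hard per-slot virtual memory limit
#$ -cwd                   # run from the submission directory

MSG="job running on $(hostname)"
echo "$MSG"
```

Because the scheduler decrements the consumable on dispatch, jobs that declare their GPU and memory needs honestly cannot oversubscribe a node, which is exactly the control that becomes hard once frameworks allocate resources behind the scheduler's back.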

-- IainRae - 14 Dec 2017

Topic revision: r3 - 19 Dec 2017 - 09:59:34 - TimColles
 