Distributed Computing

An Informatics Computing Innovation Meeting: BP Conference Suite, Thursday 2nd August 2007, 14:00-17:00

This is the second in an occasional series of meetings within Informatics to look at ways in which we can move forward with new developments in the Informatics Computing Service. The theme of this meeting is Distributed Computing - cluster computing, Gridengine and Condor.

Many researchers within the School use some form of distributed computing already, but there is not a lot of information about the clusters we have and how to make best use of them. There is also now a University cluster provision using Gridengine in the form of ECDF and any future developments in Informatics need to be viewed in that context.

The meeting will be chaired by Steve Renals. Although there will be a few very brief presentations, it will mostly be question-and-answer discussion. Representatives from IS for the ECDF project will also be attending, and there will be an opportunity for everyone to present their research requirements.

The aim of the meeting is twofold. Firstly to disseminate information about distributed computing facilities in the School and University. Secondly to obtain a good view of what research requirements for distributed computing will be in the future and what direction we should be taking to meet them.

Before this meeting takes place we would like to get a rough idea of how people use the existing clusters and what their future requirements might be. This will be used to help focus discussion. To this end there is a web form with a few questions that should only take a few minutes to complete. If everyone who has an interest in distributed computing facilities could complete this form before the end of Wednesday next week, even if you are not able to attend the meeting, it would be very helpful. The URL for the web form is:

http://www.dice.inf.ed.ac.uk/units/research_and_teaching/distcomp/

If you will not be able to make the meeting but would like to have some input, please feel free to mail me with comments, suggestions and requirements if the web form is not suitable.

Agenda

The meeting will be chaired by Steve Renals.

Brief Presentations + Q&A

  1. Overview of Current Cluster Provision - Tim
  2. ECDF Storage & Compute Services - Orlando
  3. SAN Space - Craig
  4. GPFS - Iain
  5. Gridengine Scheduling - Iain

Break & Coffee

Hot Topics Discussion

  1. Survey Response - Tim
  2. Topics Prioritization
  3. Discuss - Underlying Filesystems (GPFS, AFS, ECDF/Desktop shared space)
  4. Discuss - Scientific Linux 5 on Clusters
  5. Discuss - What do people need that ECDF does not currently provide?
  6. Discuss - Other topics as prioritized

Actions

  1. Prioritize active/pending work

Comments, Suggestions, Requirements

Some potential discussion items. Not in any priority order. It is unlikely there will be an opportunity to discuss all of these at the meeting so we will prioritize them beforehand and at the meeting itself.

  • purpose of meeting, what we want to achieve
  • brief summary of existing cluster and condor provision
  • stats - how much the clusters and Condor are used
  • gridengine stats - plus the work Iain is doing on scheduling
  • underlying filesystems
  • how best to use ECDF/eddie - impact on our own clusters
  • sharing our cluster filespace with ECDF filespace
  • clusters for researching clusters (not something suitable for ECDF)
  • user requirements now and in the future
  • use of Scientific Linux on our own clusters (to match ECDF)
  • use of GPFS
  • usage visualisation tools (Ganglia, Condor View?)
  • prioritized list of things to do
  • considering "transfer queues" for shifting jobs between clusters
  • more submit nodes per cluster
  • merging clusters
  • AFS credentials - issues/solutions
  • should we continue allowing direct access (not via gridengine)
  • why have separate home directory on clusters
  • future requirements - large memory (32GB), 64bit?
  • get Ganglia available again (broke on FC5)
  • GPFS has security issues, supported-OS issues and license issues
  • currently no accounting - might be useful to have some
  • will ECDF meet our needs or does Informatics need to purchase a new cluster
  • what are the disk space requirements, and does the data need to be backed up (ECDF uses mirrored disks so costs more)
  • external service providers - reasons for not using

Presentations

  • dcotalk.pdf: Slides for Overview of Current Cluster Provision
  • edikt-june07.pdf: Slides for ECDF Storage & Compute Services
  • Slides for SAN Space
  • Slides for GPFS
  • Slides for Gridengine Scheduling
  • survey.pdf: Slides for Survey Response (updated for late responses and annotated with original survey questions)

Survey Comments

Reasons users gave for not using Condor

Executive summary - the effort to re-engineer code, it still being early days, and desktops not being powerful enough.

  • Currently developed own scheduling tool
  • Because I don't run on the clusters!
  • It slows down my desktop
  • Gridengine has been working fine for me.
  • It would be very time consuming to change the scripts I use for the grid engine.
  • Not using at the moment, may do so in future
  • hassle to recompile libraries, etc
  • we are in early stage of distributed computing.
  • don't know how to use it yet
  • I need copious amounts of RAM and disk
  • not familiar

Other comments made

Executive summary - direct cluster access and ECDF access to Informatics space.

  • Open access to clusters due to research on resource-aware structured parallelism. Further details can be gathered at: http://homepages.inf.ed.ac.uk/s0340602/, in particular, refer to González-Vélez, H. (2006). Self-adaptive skeletal task farm for computational grids. Parallel Comput., 32(7-8):479-490. http://dx.doi.org/10.1016/j.parco.2006.07.002
  • The problem with Condor is that it seems to be quite unstable, and with my 10+ hour jobs things don't get done and are evicted a lot of times. I do however really like the 'interface' to Condor - much more than the gridengine one - it seems to be more flexible. I just moved all my stuff from Condor to townhill and it is way more efficient.
  • I think the meeting needs to address issues relating to compute-intensive work, even if this is not currently on the clusters.
  • I'd particularly like to be able to see the AMI project disk spaces from ECDF: /group/project/ami[1-8]
  • I run parallel programming model experiments, where what matters to me is to compare performance of different programming mechanisms - the results themselves are irrelevant. Thus I need exclusive access to groups of nodes (even better, to all nodes/comms) to make things as repeatable as possible. My runs are very short - a few minutes is typically enough to make the point.
  • Primarily, my need is to carry out network simulation experiments involving several tens of nodes to few thousands of wired/wireless nodes.
  • Proper prioritisation of jobs, so that I can't take up too much processing power.
  • These are just some general comments based on typical patterns of use from IPAB. Specific users might have other patterns.
  • I'm new to UoE. I'm using a genetic algorithm to estimate parameters of a computational model. I have been using townhill so far. It's slow - in 24 h 10 processes generated 300 generations only; I need ca. 2000 generations to have parameters converge. I will try out Condor.
  • run matlab in clusters

Notes from the meeting

Points noted on Orlando's talk

  • InfiniBand/InfiniPath scales almost linearly up to 60 nodes for multi-processor work, with almost no communication overhead.
  • users can use environment "modules" to set up their environment for particular compilers, multi-processor libraries, etc.
  • local scratch space used by a job is automatically wiped when the job completes (see the staging sketch after this list).
  • the cluster has 512 nodes; the plan is to double this with a new procurement around September.
  • the cluster is split so that 25% is spare free resource (what everyone using eddie is currently using) and 75% is paid-for resource (FEC from grants), which is not used yet. We can pay at any time for guaranteed resource.
  • AFSOSD provides an AFS front end onto a parallel file system with full parallel access, such as GPFS, although it is currently based around Lustre. This is a long-term option.
  • the GPFS kernel module is binary-only with an open-source compatibility shim. There is some question over long-term maintainability.
  • looking at a Matlab license for the cluster using the Distributed Computing Edition.
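
The scratch-space point above is essentially a staging pattern: copy input data onto the node-local disk that Gridengine exposes through $TMPDIR, compute there, and copy the results back before the job ends and the scratch area is wiped. Below is a minimal Python sketch of that pattern; the shared input/output paths and the simulate binary are hypothetical, and it assumes the $TMPDIR behaviour described above.

  import os
  import shutil
  import subprocess

  # Gridengine points $TMPDIR at per-job local scratch space, which is
  # wiped automatically when the job finishes (as noted above).
  scratch = os.environ.get("TMPDIR", "/tmp")
  workdir = os.path.join(scratch, "myjob")
  os.mkdir(workdir)

  # Hypothetical locations on shared space (home or group filespace).
  shared_input = os.path.expanduser("~/data/input.dat")
  shared_output = os.path.expanduser("~/results/output.dat")
  simulate = os.path.expanduser("~/bin/simulate")   # hypothetical compute binary

  # Stage the input onto fast local disk, run the compute step there,
  # then copy the results back before the scratch area is cleaned up.
  shutil.copy(shared_input, workdir)
  subprocess.check_call([simulate, "input.dat"], cwd=workdir)
  shutil.copy(os.path.join(workdir, "output.dat"), shared_output)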

Questions raised by Craig on his SAN talk

  • what GPFS blocksize to use, currently 8K.
  • how best to use hawthorn (our machine at the Bush): at present it is used just for rsync and commodity storage, with AFS for performance; could the 6TB on it be re-used for the ECDF-accessible space?

Points noted on Iain's talk

  • resource reservation is something ECDF do but we do not do on our clusters (see the submission sketch after this list).
  • logging will be available on ECDF (via ARCo); access to our own cluster data via NWS or ARCo would be useful (currently we run ARCo but the data is not generally available).
  • Informatics do not have a commercial license for GPFS; ECDF do (for eddie only), which they got at a very reduced cost.
  • GPFS will not work for Condor (due to its incompatibility with our normal desktop OS and the old NFS root-access security issue). There may also be network bandwidth issues, although these could be factored into the network design of the new building. GPFS files could be accessed via NFS instead, but it is not easy to throttle Condor to limit bandwidth. There is a question over whether we should be running Condor at all, given the running costs (desktop power) which will be charged to the School in the new building, as well as issues over the heat/noise generated in labs.
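
On the resource reservation point above: reservation is just a submission option in Gridengine (qsub's "-R y"), which can also be passed through when submitting programmatically. A minimal sketch using the python-drmaa bindings against a DRMAA-enabled Gridengine follows; the job script path, the runtime limit and the availability of the bindings on our clusters are assumptions.

  import drmaa

  # Submit a job with resource reservation enabled ("-R y"), plus a hard
  # runtime limit so the scheduler can plan the reservation.
  s = drmaa.Session()
  s.initialize()
  try:
      jt = s.createJobTemplate()
      jt.remoteCommand = "/path/to/job_script.sh"       # hypothetical job script
      jt.nativeSpecification = "-R y -l h_rt=02:00:00"  # reservation + runtime limit
      jt.joinFiles = True                               # merge stdout and stderr
      job_id = s.runJob(jt)
      print("submitted job %s" % job_id)
      info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
      print("job finished with exit status %s" % info.exitStatus)
      s.deleteJobTemplate(jt)
  finally:
      s.exit()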

Other Points

  • very few people need a tight interconnect; in fact the Myrinet facility on lutzow has barely been used in five years.
  • it was apparent that although there are now ways to run long jobs under AFS, the documentation on how to do so is not very visible (see the credential-renewal sketch after this list).
  • Hadoop was briefly mentioned; it would mean effort to support another filesystem, and effort for IS to support it on ECDF as well if we wanted shared systems.
  • co-location of resources - multiple clusters computing a single job (possibly across multiple organizations) - is not effectively supported, although the MPI libraries for doing so are available on ECDF.
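
On the long-jobs-under-AFS point above, one common approach (not necessarily the School's documented method) is to re-acquire Kerberos credentials and AFS tokens periodically from a keytab while the job runs. A rough Python sketch follows; the keytab path, principal, realm and job command are all hypothetical.

  import subprocess
  import threading
  import time

  def renew_credentials(keytab, principal, interval=6 * 3600):
      # "kinit -k -t" obtains a fresh Kerberos ticket from the keytab and
      # "aklog" converts it into an AFS token; repeat well before expiry.
      while True:
          subprocess.call(["kinit", "-k", "-t", keytab, principal])
          subprocess.call(["aklog"])
          time.sleep(interval)

  # Run the renewal loop in the background for the lifetime of the job.
  renewer = threading.Thread(target=renew_credentials,
                             args=("/path/to/user.keytab", "user@EXAMPLE.REALM"))
  renewer.daemon = True
  renewer.start()

  subprocess.check_call(["/path/to/long_running_job"])  # hypothetical compute job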

Outcomes

  • There was consensus that GPFS should be introduced on our clusters so that ECDF space would be accessible. Conversely, making our space available via GPFS to ECDF was also desirable. There was a strong desire to see cluster space visible on desktops, so that shuffling data from place to place was not required.
  • There was consensus that, in general, cross-mounting (seeing the same data whether on ECDF, desktop or cluster) was more important than data integrity. In general the data can be re-generated, and normal backup covers loss. The concern that data could be manipulated without our knowledge did not seem to outweigh the inconvenience of having to shuffle data from home directories to cluster filespace.
  • People want a fat pipe between ECDF and Inf clusters (terabytes), but a thin pipe between desktops and clusters (including ECDF) would probably be OK.
  • There was consensus that our clusters should be upgraded to Scientific Linux 5 so as to support GPFS and more closely match ECDF (currently Scientific Linux 4, but moving to 5 as soon as it is available). SL5 is a rebuild of RHEL5 with some closed-source packages stripped and some additional packages added - essentially identical. There was some concern that paths should be the same on SL5 machines as on DICE machines (and/or on ECDF), so that code/data could be run anywhere without effort. Putting SL5 on desktops was not practical in a general commodity sense but could be done on demand for those users that want it.
  • There was consensus on creating a single queue for our townhill and hermes clusters so that spare capacity on hermes could be used when townhill was very loaded. The townhill nodes would still only be available to HCRC people though.
  • There was consensus that, despite aging hardware on two of the School clusters, we should not aim to replace them at this stage and instead see how use of ECDF covers our research requirements.
  • One user requested NWS be available on our clusters for accessing node load statistics, or general access to ARCo collected data.

And Finally ...

Steve thanked all those giving talks and everyone else for attending and providing their contribution.

-- TimColles - 23 Jul 2007

Topic attachments

  • dcotalk.pdf (45.3 K, 03 Aug 2007, TimColles)
  • survey.pdf (291.5 K, 03 Aug 2007, TimColles)
  • edikt-june07.pdf (695.8 K, 03 Aug 2007, TimColles)