Hadoop Cluster

This is the final report for Devproj.inf project number 175.

What Hadoop Is

Hadoop consists of two services:

This is a distributed filesystem which stores its data on the local disks of the Hadoop nodes. All data is spread automatically over several nodes, so that each item of data exists in several places at once, and if a node disappears the filesystem redistributes the data to maintain the number of redundant copies.
This is a way of doing distributed processing of data over a number of nodes at once. The "Map" step is where the problem is divided into a number of smaller problems, and the "Reduce" step is where the solutions to those smaller problems are combined.

Time Taken

The project took 10 weeks in total, spread between Q3 2010, Q4 2010 and Q1 2011. I ended up doing a lot of work two or three times. Some of this could probably have been avoided - for instance if I'd known - if anybody had known - that over sixty students were going to register for the 2010-11 Extreme Computing course. If I'd been less quick to jump to implementation and taken more time with experimentation I could also have saved quite a bit of time later (see the Cloudera explanation below).

What Was Implemented

The final product is a Hadoop cluster. At the time of writing it had a total of 56 nodes, with 20 or so spare machines soon to be added. The HadoopCluster wiki page documents it for users and potential users. HadoopCareAndFeeding documents it for computing staff. The cluster runs on all of the Beowulf nodes and on spare desktop machines. The spare desktop machines can be added to the cluster and removed from it as needed; HadoopCareAndFeeding describes how to do this.

Recommendations for Future Work

The configuration might be improved in a couple of ways:

Use a Spanning Map
Some of the configuration files on the cluster's two central nodes (its namenode and its jobtracker) need to contain names and addresses of the cluster's other nodes. This sort of information could be spread automatically from node to node using a spanning map. At the moment it's added to header files instead.
Automated Start & Stop
Currently Hadoop's two main services (the HDFS filesystem and the Map/Reduce processing) are started and stopped by hand. This isn't difficult as each can be done for the whole cluster with one command issued on a central node. It's also advisable to carefully monitor the progress of the cluster start, for instance to ensure that the HDFS filesystem comes up cleanly and fully before the Map/Reduce daemons are started. I have been wary about automating the starting and stopping of HDFS and Map/Reduce more than it already is automated, so as it stands the Hadoop cluster does not have the usual LCFG full automation. However it may, given very careful thought and experimentation, be possible to automate Hadoop starts and stops more fully. Whether or not this might be worth the effort for our Hadoop use is another question.

Lessons I learned from this project

When working in an area where responsibility is shared, talk to everyone
... who shares responsibility, and do it early in the project. Don't just talk to one person and assume that their views are representative of everybody's. In the case of this project, the lecturers sharing the course turned out to have quite different ideas on the size of the cluster that would be needed and on its configuration, so for instance I ended up having to drastically increase the number of hosts. (The eventual number of hosts needed turned out to be a lot bigger than the biggest estimate, however!)
It does help to talk to people at some point rather than just mailing them
You can establish a better working relationship (which can of course later be carried on over email if you like) and it seems like the quickest and most efficient way for you to get an accurate understanding of what they need, and for them to get an accurate understanding of what you can deliver.
Don't be afraid to spend a good amount of time trying different possible technical solutions to the project, before deciding on the right one
I chose a solution too quickly. The one I first opted for (using Hadoop RPMs packaged by Cloudera) looked at first simpler than the alternative (building my own Hadoop RPMs) but later proved to be far more complex and less suitable; I eventually had to dump the Cloudera distribution, and the LCFG component that I'd written for it, and entirely re-implement the cluster based on my own Hadoop RPMs and a new LCFG virtual component. (It turns out that the Cloudera RPMs suffered from two problems which made them unsuitable for me: they are massively more complex than just Hadoop itself - they bundle in a lot of extras and add-ons, and implement their own configuration system which makes extensive use of /etc/alternatives; and perhaps because of this complexity they are a major version or two behind Hadoop itself, so missing some features and bug-fixes which turned out to be required.)
The number of students taking a course can be quite unpredictable
I started off catering for half a dozen students, expecting to have to expand the cluster to cope with perhaps two dozen students at the start of the next session. In the event more than sixty students registered for the Extreme Computing course so we had to rethink.
Buy more books
I tend to be shy about buying books, especially expensive ones. In this case I should have bought a copy of the first edition of O'Reilly's "Hadoop: The Definitive Guide" but because of the expense I held back so that I could get best value for money by buying the apparently imminent second edition. The second edition came out at about the time when I was finishing work on Hadoop so this was a pointless economy.
Remember the University Library
In my case it turned out to have not just paper copies of the book I needed, but also online copies. These proved to be very handy indeed. In this case I can't really say that these were an adequate substitute for having my own copy of the book: I could have saved time and effort by having my own copy permanently to hand. However if you just want to use a book briefly, the online copies in the University Library can be very handy. Certainly they're very easy to use - just search the catalogue at www.lib.ed.ac.uk, find a book which is available as an "electronic resource", then authenticate via EASE when prompted to get the book's pages appearing on your screen.
Sometimes a virtual component (a clone of the file component) is perfectly good enough
You don't always have to write your own component. My first gen Hadoop cluster - the one using the Cloudera RPMs - was configured with a "proper" component (lcfg-hadoop) but for the second and better incarnation - using my own simple Hadoop RPMs - the virtual lcfg-hadoopconf component did the job well.

-- ChrisCooke - 28 Jan 2011

Topic revision: r1 - 28 Jan 2011 - 16:32:07 - ChrisCooke
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
This Wiki uses Cookies