How to run Hadoop when the world has gone mad

Perhaps triffids have invaded, or perhaps your colleagues have won the lottery and emigrated. Whatever the reason, you suddenly need to know about Hadoop.

There are three Hadoop clusters.

They're all managed the same way, using the same LCFG headers.
  1. The exc cluster is the main Hadoop cluster. It's the one the students use. It's important to keep this running healthily, at least while it's in use by the students and staff of the Extreme Computing module. It's based on physical servers in the AT server room.
  2. The exctest cluster is used for testing things before deploying them on the exc cluster. It's used from time to time by computing staff to test new configurations, and by teaching staff to check cluster capabilities or to run through coursework before handing it out to students. Its nodes are virtual machines.
  3. The devel cluster is used when developing new configurations. It is only ever used by computing staff and it can be trashed with impunity. Its nodes are virtual machines.

The LCFG hadoop headers.

Each node in a Hadoop cluster includes two special LCFG headers:
  1. A header to say which cluster this node is in:
    • live/hadoop-exc-cluster.h
    • live/hadoop-exctest-cluster.h
    • live/hadoop-devel-cluster.h
  2. A header to say what role this node plays in the cluster:
    • dice/options/hadoop-cluster-slave-node.h
    • dice/options/hadoop-cluster-master-hdfs-node.h
    • dice/options/hadoop-cluster-master-yarn-node.h
So one way to find out which nodes are in a cluster is to find out which machines use its LCFG file:

  rfe -xf lcfg/hadoop-exc-cluster

Using this technique you can find out which nodes are in each cluster and what role each one plays in it. There are quite a few LCFG Hadoop headers. Some are included by the headers above. Others are left over from previous configurations and are now out of use.
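For illustration only, a worker node in the exc cluster might therefore have something like the following in its LCFG source profile (the exact include lines here are a sketch; check an existing profile for the precise form used on your machines):

  #include <live/hadoop-exc-cluster.h>
  #include <dice/options/hadoop-cluster-slave-node.h>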

The two essential parts of Hadoop.

The software which collectively runs a Hadoop cluster comes in many parts. Most of them are optional, but two of them are basic and fairly essential: HDFS and YARN.

HDFS.

HDFS is the filesystem. It automatically distributes files across the cluster's nodes. It keeps several copies of every file, making sure that the copies are on different nodes, if possible on different racks.

At its simplest, one machine in a Hadoop cluster is an HDFS namenode and most of the others are datanodes. The datanodes hold copies of data. The namenode keeps track of every file's metadata, and coordinates HDFS matters between the datanodes.
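If you want to check what HDFS thinks of itself, the standard Hadoop command-line tools can be run from a cluster node (a sketch, assuming the hadoop client commands are on your PATH and you have a suitable identity for talking to HDFS):

  hdfs dfsadmin -report                    # live and dead datanodes, capacity and usage
  hdfs fsck /                              # checks that blocks are properly replicated
  hdfs getconf -confKey dfs.replication    # the configured replication factor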

YARN.

YARN looks after the cluster's computation. It schedules jobs and shares the cluster's memory and processor resources out between them. At its simplest, one machine runs the YARN resourcemanager (the master role, the one given by the hadoop-cluster-master-yarn-node.h header) and most of the others run nodemanagers, which do the actual work of each job on the slave nodes.
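To get a quick picture of the YARN side of a cluster, these standard commands should work from any node with the Hadoop tools installed (output details vary between Hadoop versions):

  yarn node -list           # the nodemanagers known to the resourcemanager
  yarn application -list    # applications currently running on the cluster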
