Pandemic info for Hadoop

(a work in progress)

Basic info

  • We have Hadoop clusters.
  • They're small - only a few nodes each.
  • They use DICE.
  • LCFG does most of the config.
  • Hadoop is needed by the Extreme Computing (EXC) module.

The Clusters

There are three clusters. They all use the same Hadoop configuration headers.

  1. The exc cluster is our one real Hadoop service, for the use of users, running on proper server machines. To list its nodes:
    $ profmatch exc-cluster
  2. The exctest cluster is for testing out new config before deploying it to the live service. You can let staff on this to test things out, but the nodes are tiny VMs so it can only run tiny test jobs. To list its nodes:
    $ profmatch exctest-cluster
  3. The devel cluster is for computing staff to trash and rebuild as necessary. Never let users near it. Its nodes are more tiny VMs. To list them:
    $ profmatch devel-cluster
    (profmatch is in /afs/

Types of node

Each cluster has an HDFS master, a YARN master and some slaves. To find out which is which, use profmatch again:
$ profmatch hdfs exc-cluster
$ profmatch yarn exctest-cluster
$ profmatch slave devel-cluster

Configuration and control

  • The LCFG hadoop component makes the configuration files.
  • The LCFG file component makes some directories and symlinks.
  • systemd controls the Hadoop processes.
  • Most of the nitty-gritty is in dice/options/hadoop-cluster-node.h .
  • You can override config using live/hadoop-cluster-node.h .
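Since systemd controls the daemons, each one has a unit file. This page doesn't show the unit files or the Hadoop install locations, so the following is only a sketch of what a datanode unit might look like; every path and option in it is an assumption - only the unit name (hadoop-datanode.service) and the account name (hdfs) appear elsewhere on this page.

```ini
# /etc/systemd/system/hadoop-datanode.service - illustrative sketch only;
# the real unit is generated by LCFG and will differ in detail.
[Unit]
Description=Hadoop HDFS DataNode
After=network.target

[Service]
User=hdfs
# Install path is an assumption, not taken from this page:
ExecStart=/opt/hadoop/bin/hdfs datanode
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Running under a dedicated account (hdfs, yarn or mapred, per the table below) keeps the daemons from trampling each other's files.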

Daemon trouble

If something's not working:
  • Are the nodes running the correct processes? (See below.)
  • To restart a daemon use systemctl, for example:
    # systemctl restart hadoop-datanode.service
  • Check the log files (see below). If something has gone wrong, the log file will generally end with a spectacular java crash message.
  • Users can log in to the HDFS master, and that's fine, but sometimes they run computing jobs there, and that's not OK! Computing jobs on the HDFS master can grind the whole cluster to a halt, so kill any you find.
  • If you need to reboot the whole cluster in an emergency, just go ahead - systemd should ensure that all daemons are stopped and started cleanly. After rebooting, log in to the HDFS master and monitor the namenode's log file. The namenode will have put HDFS into "safe mode", which is read-only. Safe mode lasts until the namenode has received a satisfactory report from each datanode in the cluster; HDFS then transitions to normal operation. This can take a few minutes.
  Node             Daemon                  Account  systemd service                  Logs to                              ps -ef shows
  The HDFS master  name node               hdfs     hadoop-namenode.service          /disk/scratch/hdfsdata/hadoop/logs   java -Dproc_namenode ...
  Each slave       data node               hdfs     hadoop-datanode.service          /disk/scratch/hdfsdata/hadoop/logs   java -Dproc_datanode ...
  The YARN master  MapReduce job history   mapred   hadoop-mapred.service            /disk/scratch/mapred/logs            java -Dproc_historyserver ...
  The YARN master  resource manager        yarn     hadoop-resourcemanager.service   /disk/scratch/yarn/logs              java -Dproc_resourcemanager ...
  Each slave       node manager            yarn     hadoop-nodemanager.service       /disk/scratch/yarn/logs              java -Dproc_nodemanager ...
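The "are the nodes running the correct processes?" check can be scripted using the -Dproc_ markers from the table above. A small sketch - the helper name is mine, not a standard tool:

```shell
# hadoop_daemons: read ps-style output on stdin and print the names of
# any Hadoop daemons found, one per line, by their java -Dproc_ markers.
hadoop_daemons() {
  grep -o 'Dproc_[a-z]*' | sed 's/Dproc_//' | sort -u
}

# On a real node you would run:  ps -ef | hadoop_daemons
# Canned example of what a slave node's output might look like:
printf '%s\n' \
  'hdfs  1234 1 0 ? java -Dproc_datanode ...' \
  'yarn  5678 1 0 ? java -Dproc_nodemanager ...' | hadoop_daemons
```

On a healthy slave this should print datanode and nodemanager; on the HDFS master, namenode.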

Further reading

Topic revision: r6 - 14 May 2019 - 10:00:47 - ChrisCooke