Pandemic info for Hadoop

This is a quick and dirty pandemic guide, presenting a greatly abbreviated version of the information in HadoopClusters. If this page doesn't tell you what you need to know, read HadoopClusters.

Basic info

  • We have Hadoop clusters.
  • They're small - only a few nodes each.
  • They use DICE.
  • LCFG does most of the config.
  • Hadoop is needed by the Extreme Computing (EXC) module.
  • Hadoop is a distributed framework for processing big data.
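
For a quick sanity check that a cluster is alive - a sketch, assuming you're logged in to one of its nodes and the Hadoop client commands are on the PATH - list the top of HDFS and the YARN compute nodes:
hdfs dfs -ls /
yarn node -list
If HDFS or YARN is down, these will hang or print connection errors.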

The Clusters

There are three clusters. They all use the same Hadoop configuration headers.

1. The exc cluster

This is our one real Hadoop service - the one that users actually use. It's on physical servers in Appleton Tower. To list its nodes:
profmatch hadoop-exc-cluster
To find the power and network connections:
for i in `profmatch hadoop-exc-cluster`; do rfe -xf apdu/$i; done
for i in `profmatch hadoop-exc-cluster`; do rfe -xf atnet/$i; done
To shut it down:
for i in `profmatch hadoop-exc-cluster`; do echo Shutting down $i; ssh $i nsu -c poweroff ; done
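
To check that every node is up and reachable, a rough sketch (the 5-second ssh timeout is just a suggestion):
for i in `profmatch hadoop-exc-cluster`; do echo -n "$i: "; ssh -o ConnectTimeout=5 $i uptime; done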

2. The exctest cluster

This is for testing new config before deploying it to the live service. You can let staff on it to try things out, but bear in mind that the nodes are tiny VMs, so it can only run tiny test jobs. To list its nodes:
profmatch hadoop-exctest-cluster
To find each VM's KVM host (2 ways):
for i in `profmatch hadoop-exctest-cluster`; do kvmtool --name $i locate; done
for i in `profmatch hadoop-exctest-cluster`; do ii query --host $i --detail | grep host; done
To find each VM's physical site:
for i in `profmatch hadoop-exctest-cluster`; do ii query --host $i; done
To shut it down:
for i in `profmatch hadoop-exctest-cluster`; do echo Shutting down $i; kvmtool --name $i shutdown ; done

3. The devel cluster

This is for computing staff to trash and rebuild as necessary. Never let users near it. Its nodes are more tiny VMs. To list them:
profmatch hadoop-devel-cluster
To find each VM's KVM host (2 ways):
for i in `profmatch hadoop-devel-cluster`; do kvmtool --name $i locate; done
for i in `profmatch hadoop-devel-cluster`; do ii query --host $i --detail | grep host; done
To find each VM's physical site:
for i in `profmatch hadoop-devel-cluster`; do ii query --host $i; done
To shut it down:
for i in `profmatch hadoop-devel-cluster`; do echo Shutting down $i; kvmtool --name $i shutdown ; done

(profmatch is in /afs/inf.ed.ac.uk/group/cos/utils)

Types of node

Each cluster has an HDFS master, a YARN master and some slaves. To find out which is which, use profmatch again:
      profmatch hdfs hadoop-exc-cluster
      profmatch yarn hadoop-exctest-cluster
      profmatch slave hadoop-devel-cluster
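
To cross-check what a node actually runs, you can also ask systemd on the node itself for its hadoop-* units (a sketch - the unit names are the ones in the table in the Trouble section below, substitute the cluster name as needed):
for i in `profmatch hadoop-exc-cluster`; do echo == $i; ssh $i "systemctl list-units 'hadoop-*' --no-pager"; done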

Configuration and control

  • The LCFG hadoop component makes the configuration files.
  • The LCFG file component makes some directories and symlinks.
  • systemd controls the Hadoop processes.
  • Most of the nitty-gritty is in dice/options/hadoop-cluster-node.h .
  • You can override config using live/hadoop-cluster-node.h .
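
To see the resources the hadoop component currently holds on a node, and to re-apply the configuration after a header change, something like this should do it (a sketch assuming the usual LCFG client tools, qxprof and om, are present as on any DICE machine):
qxprof hadoop
nsu -c 'om hadoop configure'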

Trouble?

If something's not working:
  • Are the nodes running the correct processes? (See below.)
  • To restart a daemon use systemctl, for example:
    # systemctl restart hadoop-datanode.service
  • Check the log files (see below). If something has gone wrong, the log file will generally end with a spectacular java crash message.
  • Users can log in to the HDFS master, and that's fine, but sometimes they run computing jobs there, and that's not OK! Computing jobs on the HDFS master can make the whole cluster grind to a halt, so kill any you find.
  • If you need to reboot the whole cluster in an emergency, just go ahead - systemd should ensure that all daemons are stopped and started cleanly. After rebooting the cluster, log in to the HDFS master and monitor the namenode's log file. It will have put HDFS into "safe mode", which is read-only. Safe mode lasts until the namenode has received a satisfactory report from each datanode in the cluster - then HDFS transitions to normal operating mode. This can take a few minutes. For more about safe mode see hdfs dfsadmin in the HDFS Commands Guide.
| This node       | runs this              | using this account | via this systemd service       | It logs to here                    | ps -ef shows                    |
| The HDFS master | name node              | hdfs               | hadoop-namenode.service        | /disk/scratch/hdfsdata/hadoop/logs | java -Dproc_namenode ...        |
| The YARN master | resource manager       | yarn               | hadoop-resourcemanager.service | /disk/scratch/yarn/logs            | java -Dproc_resourcemanager ... |
|                 | Map Reduce job history | mapred             | hadoop-mapred.service          | /disk/scratch/mapred/logs          | java -Dproc_historyserver ...   |
| Each slave      | data node              | hdfs               | hadoop-datanode.service        | /disk/scratch/hdfsdata/hadoop/logs | java -Dproc_datanode ...        |
|                 | node manager           | yarn               | hadoop-nodemanager.service     | /disk/scratch/yarn/logs            | java -Dproc_nodemanager ...     |
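
To check which Hadoop daemon, if any, each node is actually running, a rough sketch (substitute the cluster name as needed):
for i in `profmatch hadoop-exc-cluster`; do echo == $i; ssh $i 'ps -ef | grep java | grep -o "Dproc_[a-z]*" | sort -u'; done
After a reboot, to ask whether HDFS is still in safe mode, run this on the HDFS master (probably as the hdfs user):
hdfs dfsadmin -safemode get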

Further reading

  • HadoopClusters - the full documentation for these clusters.
  • The HDFS Commands Guide, in particular hdfs dfsadmin, for more on safe mode.
