Hadoop Cluster: Care and Feeding

If you just want to use Hadoop, see computing.help instead.
This page covers maintenance and configuration of the Hadoop EXC cluster.
Note that this page is out of date and is currently being revised; contact cc@infREMOVE_THIS.ed.ac.uk for more information.

Nodes

The machines are LCFG-maintained DICE servers running the current desktop version of DICE.

Machine                  Role                                                      Account   Abbreviation
scutter01                The namenode (the master node for the HDFS filesystem).  hdfs      nn
scutter02                The resource manager (the master node for the YARN       yarn      rm
                         resource allocation system).
                         The job history server.                                  mapred    jhs
scutter03 to scutter12   The compute nodes. These run:
                         a datanode (stores HDFS data) and                        hdfs      dn
                         a node manager (manages YARN and jobs on this node).     yarn      nm

The nodes are in the AT server room.

Kerberos and privilege

The cluster uses Kerberos for authentication. To get privileged access to the cluster, you'll need to authenticate. For this you'll need to know the right machine, account and abbreviation to use. Find them in the table above, then do this:

  • ssh machine
  • nsu account
  • newgrp hadoop
  • export KRB5CCNAME=/tmp/account.cc
  • kinit -k -t /etc/hadoop.abbreviation.keytab abbreviation/${HOSTNAME}

So here's how to get privileged access to the HDFS filesystem:

  • ssh scutter01
  • nsu hdfs
  • newgrp hadoop
  • export KRB5CCNAME=/tmp/hdfs.cc
  • kinit -k -t /etc/hadoop.nn.keytab nn/${HOSTNAME}
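
If you want to confirm that the ticket was obtained (an optional check, not part of the original recipe), klist honours the KRB5CCNAME you just exported, so it should show a ticket for the nn/${HOSTNAME} principal:

 klist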

Running a Test Job

... to check that the cluster is working.

First, create user filespace on HDFS. You only need to do this once per user per cluster. Start by logging in to the namenode with ssh and acquiring privileged access to HDFS as per the instructions above. Then make yourself an HDFS home directory (the two exits and the logout below back you out of the newgrp shell, the nsu shell and the ssh session in turn):

 hdfs dfs -mkdir /user/${USER}
 hdfs dfs -chown ${USER} /user/${USER}
 exit
 exit
 logout
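
To double-check the new directory (an optional step, not in the original recipe), you can list /user from the same privileged session before exiting; your new directory should appear with your username as its owner:

 hdfs dfs -ls /user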
Now ssh to the YARN master node:
 ssh scutter02

Put some files into your HDFS dir. These will act as input for the test job:

 hdfs dfs -put $HADOOP_PREFIX/etc/hadoop input
List your files to check that they got there:
 hdfs dfs -ls input
Only run this next command if you have already run the job and want to rerun it; it removes the output directory which the job creates.
 hdfs dfs -rm -r output

Now submit the test job:

 hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
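
While the job runs you can keep an eye on it from the same node (this uses the standard YARN command-line client, nothing specific to this test job):

 yarn application -list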

Once it's finished, transfer the job's output from HDFS:

 hdfs dfs -get output
... and take a look at what the job did:
 cd output
 ls
You should see two files: an empty file called _SUCCESS and a file with a few word counts in it called part-r-00000. If you don't see _SUCCESS then the job didn't work.
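
To see what the job actually found (optional), print the non-empty file; for this grep example it should contain the matched strings and their counts:

 cat part-r-00000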

Checking the log files

Hadoop keeps comprehensive and informative log files. They're worth checking when you're doing anything with Hadoop, or when something seems to be wrong, or just to check that things are OK. Here's where to find them:

Component              Log directory                        Host
HDFS namenode          /disk/scratch/hdfsdata/hadoop/logs   The namenode (the master HDFS host)
HDFS datanode          /disk/scratch/hdfsdata/hadoop/logs   All the compute nodes
YARN resource manager  /disk/scratch/yarn/logs              The resource manager (the master YARN host)
YARN node manager      /disk/scratch/yarn/logs              All the compute nodes
Job History Server     /disk/scratch/mapred/logs            The job history server host

For hostnames see #Nodes.
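
A quick way to watch a log as it's being written is tail -f on the relevant machine. The exact file names include the account and hostname so they vary from node to node; the glob below is only a sketch. For example, on a compute node this should follow the node manager log:

 tail -f /disk/scratch/yarn/logs/*nodemanager*.log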

Removing a node from the cluster

Here's how to remove a node from the cluster. You might need to do this if it has hardware trouble, or if you want to upgrade its firmware or its software, for example. You can only do this with a compute node - not one of the two master nodes.
  1. Add this to the bottom of the node's LCFG file:
    !hadoop.excluded   mSET(true)
  2. HDFS has to decommission the node (i.e. move its share of the HDFS data to other nodes):
    • Login to the namenode.
    • Acquire hdfs privilege.
    • Tell the namenode to take a fresh look at its config files:
      hdfs dfsadmin -refreshNodes
    • Look at the namenode log. Wait for it to announce Decommissioning complete for node and the node's IP address.
    • You can also check that HDFS reports that the node is "Decommissioned" (see the example after this list):
      hdfs dfsadmin -report
  3. YARN has to decommission the node.
    • (work in progress!)
    • This part is not working. You should be able to run:
      yarn rmadmin -refreshNodes
      after which the node manager on the excluded host should stop, and the node should show as decommissioned in the output of:
      yarn node -list
      Instead it still shows as "running", and the resource manager log's note about the exclude procedure lists no hostnames. Broken, it seems.
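
For step 2, a quick way to check a node's decommission state without reading the whole report (just a convenience; the exact report layout can vary between Hadoop releases) is to filter the dfsadmin output:

 hdfs dfsadmin -report | grep -B 3 'Decommission Status'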

Documentation

The manuals for this release are at hadoop.apache.org/docs/r2.9.2/. They're listed down the left hand side of the page. There are a lot of them, but quite a few just document one concept or one optional extension (there are a lot of these too).

systemd services

Hadoop is started and stopped using systemd services (which are configured by LCFG).
Hadoop component       systemd service                  Host
HDFS namenode          hadoop-namenode.service          The namenode (the master HDFS host)
HDFS datanode          hadoop-datanode.service          All the compute nodes
YARN resource manager  hadoop-resourcemanager.service   The resource manager (the master YARN host)
YARN node manager      hadoop-nodemanager.service       All the compute nodes
Job History Server     hadoop-mapred.service            The job history server host
These can be queried, started and stopped using systemctl in the usual way. For example:
# systemctl status hadoop-nodemanager
● hadoop-nodemanager.service - The hadoop nodemanager daemon
   Loaded: loaded (/etc/systemd/system/hadoop-nodemanager.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-09-19 10:16:33 BST; 1 weeks 4 days ago
 Main PID: 4573 (java)
   CGroup: /system.slice/hadoop-nodemanager.service
           └─4573 /usr/lib/jvm/java-1.8.0-sun/bin/java -Dproc_nodemanager -Xmx4000m -Dhadoop.log.dir=/disk/scratch/yarn/logs -Dya...
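
For example, to restart the two Hadoop daemons on a compute node (using the unit names from the table above):
# systemctl restart hadoop-datanode hadoop-nodemanager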
