---+ !!Hadoop Cluster: Care and Feeding

*If you want to use Hadoop, see [[http://computing.help.inf.ed.ac.uk/hadoop-cluster][this page on computing.help]].*

This page covers maintenance and configuration of the Hadoop EXC cluster.

%TOC%

---++ Nodes

The machines are LCFG-maintained DICE servers running the current desktop version of DICE.

| *Machine* | *Role* | *Account* | *Abbreviation* |
| scutter01 | The namenode (the master node for the HDFS filesystem). | hdfs | nn |
| scutter02 | The resource manager (the master node for the YARN resource allocation system).<br>The job history server. | yarn<br>mapred | rm<br>jhs |
| scutter03 <br> <em>to</em> <br> scutter12 | The compute nodes. These run: <br>a datanode (stores HDFS data) and <br>a node manager (manages YARN and jobs on this node). | <br>hdfs<br>yarn | <br>dn<br>nm |

The nodes are in the AT server room.

---++ Kerberos and privilege

The cluster uses Kerberos for authentication. To get privileged access to the cluster, you'll need to authenticate. For this you'll need to know the right *machine* and *account* and *abbreviation* to use. Find them in the table above, then do this:
   * <code>ssh <strong>machine</strong></code>
   * <code>nsu <strong>account</strong></code>
   * <code>newgrp hadoop</code>
   * <code>export !KRB5CCNAME=/tmp/<strong>account</strong>.cc</code>
   * <code>kinit -k -t /etc/hadoop.<strong>abbreviation</strong>.keytab <strong>abbreviation</strong>/${HOSTNAME}</code>

So here's how to get privileged access to the HDFS filesystem:
   * <code>ssh <strong>scutter01</strong></code>
   * <code>nsu <strong>hdfs</strong></code>
   * <code>newgrp hadoop</code>
   * <code>export !KRB5CCNAME=/tmp/<strong>hdfs</strong>.cc</code>
   * <code>kinit -k -t /etc/hadoop.<strong>nn</strong>.keytab <strong>nn</strong>/${HOSTNAME}</code>

There's a handy way to check that the mapping between the service abbreviation (e.g. ==rm==) and its account (e.g. ==yarn==) has been configured correctly:
   * <code>ssh</code> to any Hadoop node
   * <code>hadoop org.apache.hadoop.security.HadoopKerberosName <strong>abbreviation</strong>/${HOSTNAME}@INF.ED.AC.UK</code>

For example,
<pre>
[scutter04]: <strong>hadoop org.apache.hadoop.security.HadoopKerberosName rm/${HOSTNAME}@INF.ED.AC.UK</strong>
Name: rm/scutter04.inf.ed.ac.uk@INF.ED.AC.UK to yarn
[scutter04]:
</pre>
So *rm* maps to the *yarn* account.
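A quick way to confirm that the ==kinit== worked is to inspect the ticket cache you just created. This is only a sanity check with the standard Kerberos tools (the exact output layout isn't reproduced here); with ==KRB5CCNAME== still exported as above, the *Default principal* reported should be the service principal you requested, e.g. ==nn/scutter01.inf.ed.ac.uk@INF.ED.AC.UK== for the HDFS example:
<pre>
klist
</pre>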
---++ Users

On the *exc* cluster this is driven by roles and capabilities, so it's automated. A prospective user of the <strong>exc</strong> cluster needs to gain the ==hadoop/exc/user== capability. Several roles grant that, and you can discover them with e.g.
<pre>rfe -xf roles/hadoop</pre>
Most student users of the cluster will probably gain a suitable role automatically thanks to the Informatics database and Prometheus.

---+++ Making HDFS directories

On the *exc* cluster this is done by a script called ==mkhdfs== which runs nightly. It ensures that each user with ==hadoop/exc/user== has an HDFS directory. It runs on the namenode of the cluster, and it's installed by the =hadoop-cluster-master-hdfs-node.h= header.

There's a companion script called ==rmhdfs==. It runs weekly, and looks for and lists those HDFS directories which don't have the capability associated with them. You can then consider deleting those directories at your leisure.

For other clusters, you could either adapt ==mkhdfs== or make an HDFS directory manually. Here's how to do that:
   1. Log in to the namenode with ssh and [[#Kerberos_and_privilege][acquire privileged access to HDFS]].
   1. Then make the HDFS home directory:<pre>
 hdfs dfs -mkdir /user/${USER}
 hdfs dfs -chown ${USER} /user/${USER}
 exit
 exit
 logout
</pre>

---++ Jobs

---+++ How to run a test job

This is how to check that the cluster is working.
   1. If you don't yet have an HDFS directory, [[#Making_HDFS_directories][here's how to make one]].
   1. Now ssh to the YARN master node:<pre> ssh scutter02</pre>
   1. Put some files into your HDFS dir. These will act as input for the test job:<pre> hdfs dfs -put $HADOOP_PREFIX/etc/hadoop input</pre>
   1. List your files to check that they got there:<pre> hdfs dfs -ls input</pre>
   1. If you have already run the job and you want to rerun it, remove the output dir which the job makes:<pre> hdfs dfs -rm -r output</pre>
   1. Now submit the test job:<pre> hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'</pre>You should see lots of messages about the job's progress. The job should finish within a minute or two.
   1. Once it's finished, transfer the job's output from HDFS:<pre> hdfs dfs -get output</pre>
   1. ... and take a look at what the job did:<pre> cd output
 ls</pre>You should see two files - an empty file called _SUCCESS and a file called part-r-00000 containing counts of the matching strings. If you don't see _SUCCESS then the job didn't work.
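If you want to see what the example ==grep== job actually found, print the results file (this assumes the job ran with its default single reducer, so everything is in =part-r-00000=):
<pre>
 cat output/part-r-00000
</pre>
Each line should show a count followed by a string that matched the =dfs[a-z.]+= pattern.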
---+++ What jobs are running?

This command needs privilege, see [[#Kerberos_and_privilege][above]]. It lists the jobs which are currently running on the cluster.
<pre>mapred job -list</pre>

---++ Checking the log files

Hadoop keeps comprehensive and informative log files. They're worth checking when you're doing anything with Hadoop, or when something seems to be wrong, or just to check that things are OK. Here's where to find them:

| *Component* | *Log directory* | *Host* |
| HDFS namenode | =/disk/scratch/hdfsdata/hadoop/logs= | The namenode (the master HDFS host) |
| HDFS datanode | =/disk/scratch/hdfsdata/hadoop/logs= | All the compute nodes |
| YARN resource manager | =/disk/scratch/yarn/logs= | The resource manager (the master YARN host) |
| YARN node manager | =/disk/scratch/yarn/logs= | All the compute nodes |
| Job History Server | =/disk/scratch/mapred/logs= | The job history server host |

For hostnames see [[#Nodes]].

---++ Managing nodes

---+++ How to list the nodes

There are several configuration files which list the cluster nodes. To find them first ==ssh== to any cluster node, then go to the Hadoop configuration directory:<pre>cd $HADOOP_CONF_DIR</pre>
The nodes are named in these files:

| *File* | *Contains* |
| =masters= | The cluster's master servers. For a simple cluster this would just be the HDFS namenode and the YARN resource master. |
| =slaves= | The slave nodes of the cluster. |
| =exclude= | Those slaves which are currently excluded from the cluster. |
| =hosts= | All the nodes (=masters= + =slaves=). |

Which host does HDFS think is the namenode?<pre>hdfs getconf -namenodes</pre>
Does YARN know the state of the nodes?<pre>yarn node -list -all</pre>

---+++ Removing a node from the cluster

Here's how to remove a node from the cluster. You might need to do this if a machine has hardware trouble, or if you want to upgrade its firmware or its software, for example. You can only do this with a slave node - not one of the two master nodes.
   1. Add this to the bottom of the node's LCFG file:<pre>!hadoop.excluded mSET(true)</pre>
   2. HDFS has to decommission the node (i.e. move its share of the HDFS data to other nodes):
      * Log in to the namenode.
      * Acquire hdfs privilege (see [[#Kerberos_and_privilege][above]]).
      * Tell the namenode to reconsider which nodes it should be using:<pre>hdfs dfsadmin -refreshNodes</pre>
      * Look at the namenode log. Wait for it to announce *Decommissioning complete for node* and the node's IP address.
      * You can also check that HDFS reports that the node's *Decommission Status* has changed from *Normal* to *Decommissioned*:<pre>hdfs dfsadmin -report</pre>This means that the node's share of the HDFS data has been copied off onto other nodes.
   3. YARN has to decommission the node.
      * <em>This should work from 28 November 2019.</em>
      * Log in to the resource manager (the YARN master) node.
      * Acquire privilege over the yarn resource manager.
      * <pre>yarn rmadmin -refreshNodes</pre>
      * Having done this, a list of the nodes should show your decommissioned host as DECOMMISSIONED:<pre>yarn node -list -all</pre>

---+++ Re-adding an excluded node to the cluster

Remove the =hadoop.excluded= resource that you added in [[#Removing_a_node_from_the_cluster][Removing a node from the cluster]] then run through the same procedure as you did there.

---++ Documentation

The manuals for this release are at [[https://hadoop.apache.org/docs/r2.9.2/][hadoop.apache.org/docs/r2.9.2/]]. They're listed down the left hand side of the page. There are a lot of them, but quite a few just document one concept or one optional extension (there are a lot of these too).

---++ systemd services

Hadoop is started and stopped using systemd services (which are configured by LCFG).

| *Hadoop component* | *systemd service* | *Host* |
| HDFS namenode | =hadoop-namenode.service= | The namenode (the master HDFS host) |
| HDFS datanode | =hadoop-datanode.service= | All the compute nodes |
| YARN resource manager | =hadoop-resourcemanager.service= | The resource manager (the master YARN host) |
| YARN node manager | =hadoop-nodemanager.service= | All the compute nodes |
| Job History Server | =hadoop-mapred.service= | The job history server host |

These can be queried, started and stopped using ==systemctl== in the usual way. For example:
<pre>
# <strong>systemctl status hadoop-nodemanager</strong>
%GREEN%●%ENDCOLOR% hadoop-nodemanager.service - The hadoop nodemanager daemon
   Loaded: loaded (/etc/systemd/system/hadoop-nodemanager.service; enabled; vendor preset: enabled)
   Active: %GREEN%active (running)%ENDCOLOR% since Thu 2019-09-19 10:16:33 BST; 1 weeks 4 days ago
 Main PID: 4573 (java)
   CGroup: /system.slice/hadoop-nodemanager.service
           └─4573 /usr/lib/jvm/java-1.8.0-sun/bin/java -Dproc_nodemanager -Xmx4000m -Dhadoop.log.dir=/disk/scratch/yarn/logs -Dya...
</pre>
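For a quick health sweep of the whole cluster you can ask every compute node for the state of its two Hadoop services in one go. This is just a sketch: it assumes the scutter03 <em>to</em> scutter12 naming from the [[#Nodes][Nodes]] table and that you can ssh to each node non-interactively.
<pre>
for n in $(seq -w 3 12); do
  echo "== scutter${n} =="
  ssh scutter${n} systemctl is-active hadoop-datanode hadoop-nodemanager
done
</pre>
Anything other than =active= on a node is worth a closer look in [[#Checking_the_log_files][the log files]].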
---++ How to make a new cluster

Mostly, you'll just need to copy the LCFG config which sets up the ==exc== cluster; but there are a few manual steps too. You'll need to make a header, a namenode, a resource manager and a bunch of slave nodes.

Once you've made your new cluster, and you've checked that [[#Checking_the_log_files][the log files]] and [[#systemd_services][systemd services]] look OK, don't forget to [[#How_to_run_a_test_job][run a test job]] to check that your cluster works.

---+++ Hardware resources

Note that the YARN node manager - which runs on each slave, and matches up jobs with hardware resources - automatically determines what hardware resources are available. If you don't give the slaves enough hardware, the node managers will decide that jobs can't run! Even if you're making a little play cluster on VMs, you'll need to give each slave several CPUs. If you don't, not even [[#How_to_run_a_test_job][a wee diddy test job]] will run. In tests, slaves with 3 VCPUs and 4GB memory were sufficient for the test job.

---+++ Make the header

Pick a one-word name for your cluster. These instructions will use the name *dana*. Make a new header in subversion for your cluster. In our example we'll make ==live/hadoop-dana-cluster.h==.
   1. Check out the =live= SubversionRepository and ==cd== to the ==include/live== directory.
   1. ==svn copy hadoop-exc-cluster.h hadoop-dana-cluster.h==
   1. Edit it appropriately - perhaps like this:<pre>#ifndef LIVE_HADOOP_DANA_CLUSTER
#define LIVE_HADOOP_DANA_CLUSTER
#define HADOOP_CLUSTER_NAME dana
#define HADOOP_CLUSTER_HDFS_MASTER dana01.inf.ed.ac.uk
#define HADOOP_CLUSTER_YARN_MASTER dana02.inf.ed.ac.uk
#define HADOOP_CLUSTER_KERBEROS
#endif /* LIVE_HADOOP_DANA_CLUSTER */</pre>
   1. Commit it with ==svn ci -m "Header to configure the dana Hadoop cluster" hadoop-dana-cluster.h==

---+++ Make a namenode
   1. Add your cluster's header to the profile of the machine that'll be the new cluster's namenode. Our example header is ==live/hadoop-dana-cluster.h==.
   1. Below it, add the node type header, in this case probably ==dice/options/hadoop-cluster-master-hdfs-node.h==
   1. Let LCFG make the machine's new profile and wait for it to reach the machine.
   1. ==ssh== onto the machine.
   1. Acquire hdfs namenode privilege [[#Kerberos_and_privilege][as described above]].
   1. ==hdfs namenode -format==
   1. In a separate session, make the namenode's data directory and restart the namenode service:<pre>$ <strong>nsu</strong>
# <strong>mkdir /disk/scratch/hdfsdata/hadoop/namenode</strong>
# <strong>chown hdfs:hadoop /disk/scratch/hdfsdata/hadoop/namenode</strong>
# <strong>systemctl restart hadoop-namenode</strong></pre>
   1. Back in the session with hdfs namenode privilege, build the base filesystem in HDFS.<pre>
 hdfs dfs -mkdir /user
 hdfs dfs -mkdir /tmp
 hdfs dfs -chmod 1777 /tmp
 hdfs dfs -mkdir /tmp/hadoop-yarn
 hdfs dfs -mkdir /tmp/hadoop-yarn/staging
 hdfs dfs -chmod 1777 /tmp/hadoop-yarn/staging
 hdfs dfs -mkdir /tmp/hadoop-yarn/staging/history
 hdfs dfs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
 hdfs dfs -chmod 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
 hdfs dfs -chown -R mapred:hadoop /tmp/hadoop-yarn/staging
 hdfs dfs -ls /
 exit
 exit
</pre>
   1. And to support ==distcp==:<pre>
 nsu
 cd /disk/scratch/hdfsdata
 mkdir cache
 chown hdfs:hdfs cache
 chmod go+wt cache
 exit
</pre>
Here are a couple of ways to test that it's successfully up and running:
   * Check that the relevant [[#systemd_services][systemd services]] have started and are running.
   * Check [[#Checking_the_log_files][the log files]].

---+++ Make a resource manager
   1. As for [[#Make_a_namenode][the namenode]], first add the cluster's LCFG header to the profile of the machine that'll be the resource manager. Our example header is ==live/hadoop-dana-cluster.h==.
   1. Below it, add the node type header, in this case probably ==dice/options/hadoop-cluster-master-yarn-node.h==.
   1. Let LCFG make the machine's new profile and wait for it to reach the machine.
   1. Reboot the machine once or twice.
Here are a couple of ways to test that it's successfully up and running:
   * Check that the relevant [[#systemd_services][systemd services]] have started and are running.
   * Check [[#Checking_the_log_files][the log files]].

---+++ Make a slave node

The rest of the hosts in the cluster will all be slave nodes. Here's how to make one:
   1. As for [[#Make_a_namenode][the namenode]], first add the cluster's LCFG header to the profile of the machine that'll be a slave node. Our example header is ==live/hadoop-dana-cluster.h==.
   1. Below it, add the node type header, in this case probably ==dice/options/hadoop-cluster-slave-node.h==.
   1. Let LCFG make the machine's new profile and wait for it to reach the machine.
   1. Reboot the machine once or twice.
Here are a couple of ways to test that it's successfully up and running:
   * Check that the relevant [[#systemd_services][systemd services]] have started and are running.
   * Check [[#Checking_the_log_files][the log files]].
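Once a new slave node is up, one more way to check that it has joined the cluster is to ask HDFS and YARN directly. This is a sketch: run it from a cluster node, with [[#Kerberos_and_privilege][hdfs privilege]] for the ==dfsadmin== report, and substitute the new node's hostname.
<pre>
 hdfs dfsadmin -report | grep -A 2 <strong>hostname</strong>
 yarn node -list -all
</pre>
The new host should show a *Decommission Status* of *Normal* in the HDFS report, and should appear as RUNNING in the YARN node list.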
<!-- old content starts here

---++ Configuration

Most of the extra LCFG configuration these machines need is in
   * =dice/options/hadoop.h= (generic DICE-level hadoop config)
   * =live/hadoop.h= (volatile DICE-level hadoop config)
   * =live/mscteach_hadoop.h= (cluster-specific hadoop)
(If you wish to create another cluster you'll have to replicate the mscteach-level headers)

The Hadoop installation is owned and run by the *hadoop* account. It has a local homedir ==/opt/hadoop== on each machine. The header installs a (locally-built) Hadoop RPM into this directory. It contains the 2.7.x Hadoop distribution.

The configuration files are made by the =lcfg-hadoop= component. The following things need to be configured. Most of them are documented in chapters 9 and 10 of the O'Reilly book.

   $ SSH : This has to be configured to allow *hadoop* on each machine to ==ssh== freely to any other machine in the cluster. This needs some files in ==/opt/hadoop/.ssh==. All of them are created automatically by =lcfg-file= except for the private key file ==id_dsa==. Copy this by hand to any machine which doesn't already have it.
   $ conf : Configuration files go in ==/opt/hadoop/conf==. Hadoop expects them to go in ==/opt/hadoop/hadoop-2.7.x/conf== so =lcfg-file= automatically makes a symlink there.
   $ ==hadoop-env.sh== : Define ==JAVA_HOME==. Add =-Xmx2000m= to ==HADOOP_NAMENODE_OPTS== and ==HADOOP_SECONDARY_NAMENODE_OPTS== to make more memory available to those processes. Add =-o StrictHostKeyChecking=no= to ==HADOOP_SSH_OPTS==. Change ==HADOOP_LOG_DIR==. Change ==HADOOP_PID_DIR==.
   $ ==core-site.xml== : Define the namenode (==fs.default.name==) and the temporary directory (==hadoop.tmp.dir==). Point Hadoop at the rack awareness script (see below) using ==topology.script.file.name==.
   $ ==hdfs-site.xml== : Specify the location of HDFS data (==dfs.name.dir==, ==dfs.data.dir==, ==fs.checkpoint.dir==). Turn off file permissions (==dfs.permissions==). Point to the web interfaces to stop Hadoop doing the wrong thing (==dfs.http.address==, ==dfs.secondary.http.address==). Point Hadoop at the permitted hosts file (see below) (==dfs.hosts==).
   $ ==mapred-site.xml== : Define the jobtracker (==mapred.job.tracker==). Increase the tasktracker memory (==mapred.child.java.opts==). Define the maximum number of tasks per machine (==mapred.tasktracker.map.tasks.maximum==, ==mapred.tasktracker.reduce.tasks.maximum==) - the book advises that you set this to one less than the number of cores on the machine. Turn on speculative execution (==mapred.map.tasks.speculative.execution==, ==mapred.reduce.tasks.speculative.execution==). When Hadoop starts up make it attempt to recover any jobs that were running when it shut down (==mapred.jobtracker.restart.recover==). Define the maximum memory of Map/Reduce child processes (==mapred.child.ulimit==). Point to the permitted hosts file (==mapred.hosts==).
Enable the Fair Scheduler (==mapred.jobtracker.taskScheduler==). Set the number of reduce tasks per job (==mapred.reduce.tasks==).
   $ ==slaves== : Contains the FQDN of each data node.
   $ ==masters== : Contains the FQDN of the secondary name node.
   $ ==hosts== : Contains the full name of every permitted Hadoop node. If a machine isn't in this list it can't connect to the supervising nodes.
   $ ==rack-awareness.py== : A script which takes the name or address of a node and returns the name of the switch it's connected to - ==/bw00==, ==/bw01== or ==/bw02==. The script could be improved :-)

All of this is done by LCFG. See the files mentioned above for details.

Spanning maps noting the roles of each node should be updated using the following resources:
   $ =hadoop.slave=: does a thing?
   $ =hadoop.hosts=: does a thing?
   $ =hadoop.master=: does a thing?

---++ Adding a New User to Hadoop

Hadoop users need the secondary role =hadoopuser=. The primary role =module-exc= gives you =hadoopuser= automatically, but people not on the Extreme Computing course need to be given =hadoopuser=.

Given the capability, Hadoop users (and their HDFS directories) should be created automatically by a crontab script in each node's =bin= directory. You can establish a current list with: =hdfs dfs -ls /user=

If you would like to hasten this process, you can create the user and directory manually:
   * ==ssh namenode.inf.ed.ac.uk==
   * ==nsu hadoop==
   * ==~/bin/addhadoopuser <em>username</em>==

---++ Removing a User from Hadoop

Just undo what you did when the user was added to Hadoop.
   1. Delete the user's HDFS directory:
      * ==ssh namenode.inf.ed.ac.uk==
      * ==nsu hadoop==
      * ==hadoop dfs -rmr /user/<em>UUN</em>==
   1. Remove the secondary role =hadoopuser=, if the account still has it (and didn't get it automatically by means of a primary role like =module-exc=).

---++ Starting and Stopping Hadoop

   $ To Start Hadoop :
   1. On the namenode ==nsu hadoop== then ==/opt/hadoop/hadoop-2.7.x/sbin/start-dfs.sh==. Wait a little while then check the HDFS health at [[http://namenode.inf.ed.ac.uk:50070]].
   1. Then on the jobtracker ==nsu hadoop== then ==/opt/hadoop/hadoop-2.7.x/sbin/start-yarn.sh==. Wait a little while then check the yarn tracker status at [[http://jobtracker.inf.ed.ac.uk:8088]].
   $ To Stop Hadoop :
   1. On the jobtracker ==nsu hadoop== then ==/opt/hadoop/hadoop-2.7.x/sbin/stop-yarn.sh==.
   1. On the namenode ==nsu hadoop== then ==/opt/hadoop/hadoop-2.7.x/sbin/stop-dfs.sh==.

---++ Shutting down all the nodes

In an emergency the whole cluster can be safely shut down by logging into the namenode and running ==~hadoop/bin/shutdownhadoop==. This should log onto the jobtracker node, shut down yarn, then generate a list of active nodes, shut down dfs and finally log into each node and run poweroff. %RED%It's currently untested.%ENDCOLOR%

%RED%Sections beyond this have not been revised.%ENDCOLOR%

---++ Spare Machines

%RED%Not really relevant -- but due to be revised to cover emergency replacement%ENDCOLOR%
<s>
Spare desktop machines can be added to the cluster.

*To add a machine*:
   1. Install DICE on the machine.
   1. Add ==#include <live/bwhadoop_2core.h>== to its profile. Reboot the machine once it has its new LCFG profile.
   1. Edit an up to date copy of ==live/list-of-spares-on-hadoop.h== :
      i. Add the machine's full hostname to the SPARE_MACHINES_HOSTNAMES list
      i. Add the machine's IP address to the SPARE_MACHINES_ADDRESSES list
   1. ==svn commit== your changes.
   1. Wait for the new LCFG profiles to be compiled and downloaded to each Hadoop machine.
   1. If there are no jobs running on the jobtracker node then restart the jobtracker:
      i. Check [[http://.32:50030/jobtracker.jsp][the jobtracker web page]] to make sure that there are no jobs running.
      i. Stop Hadoop as described above.
      i. Start Hadoop as described above.
   1. If there are jobs running on the jobtracker:
      i. On the node to be added:
         a. ==nsu hadoop==
         a. ==cd /opt/hadoop/hadoop-0.20.2/bin==
         a. ==./hadoop-daemon.sh start datanode==
         a. ==./hadoop-daemon.sh start tasktracker==
      i. Check that your machine is in the namenode's [[http://namenode.inf.ed.ac.uk:50070/dfsnodelist.jsp?whatNodes=LIVE][list of live HDFS nodes]] and in the jobtracker's [[http://jobtracker.inf.ed.ac.uk:50030/machines.jsp?type=active][list of active Map/Reduce nodes]].

*To _remove_ a machine from the cluster*:
   1. Edit ==live/list-of-spares-on-hadoop.h== :
      i. _Add_ the machine's full hostname to the REMOVE_MACHINES_HOSTNAMES list
      i. _Leave_ it in the other lists for now.
   1. ==svn commit== your changes.
   1. Wait for the new LCFG profiles to be compiled and downloaded to each Hadoop machine.
   1. Refresh the namenode:
      i. ==ssh namenode.inf.ed.ac.uk==
      i. ==nsu hadoop==
      i. ==/opt/hadoop/hadoop-0.20.2/bin/hadoop dfsadmin -refreshNodes==
   1. The [[http://namenode.inf.ed.ac.uk:50070/dfshealth.jsp][Namenode HDFS status pages]] should now show your machines with the status _Decommission In Progress_ then the status _Decommissioned_. Once they have the status _Decommissioned_ they are removed from HDFS.
   1. To shut down a tasktracker node:
      i. ssh to the node
      i. ==nsu hadoop==
      i. ==cd /opt/hadoop/hadoop-0.20.2/bin==
      i. ==./hadoop-daemon.sh stop tasktracker==
   1. When there are no jobs running, stop and start Map/Reduce:
      i. ==ssh jobtracker==
      i. ==nsu hadoop==
      i. ==cd /opt/hadoop/hadoop-0.20.2/bin==
      i. ==./stop-mapred.sh==
      i. ==./start-mapred.sh==
   1. Once this has been done, edit ==live/list-of-spares-on-hadoop.h== once again:
      i. Remove the machine's hostname from REMOVE_MACHINES_HOSTNAMES and SPARE_MACHINES_HOSTNAMES.
      i. Remove the machine's IP address from SPARE_MACHINES_ADDRESSES.
   1. ==svn commit==.
</s>

---++ When the Cluster Breaks

(This section is an aide memoire for the maintainer of the cluster.)

When the cluster gets into difficulties - for instance, nodes dropping out of HDFS or Map/Reduce; errors appearing from various nodes when other nodes are processing the same job perfectly well; or other odd or unusual behaviour - here are some general approaches to try:
   * Check the namenode and map/reduce status pages on the web.
   * Log in to the namenode and look for relevant error messages in the logs in ==/disk/scratch/hdfsdata/hadoop/logs==. Today's namenode log will always be called ==/disk/scratch/hdfsdata/hadoop/logs/hadoop-hadoop-namenode-<i>NODENAME</i>.inf.ed.ac.uk.log==.
   * Similarly check on the jobtracker host for the jobtracker logs.
   * Look for full disk partitions on any nodes. <s>On Beowulf machines, check that the relationship between the ==/disk/scratch1== and ==/disk/scratch2== partitions and the ==/disk/scratch== link is as it should be and make sure that Hadoop is using the intended partition for its ==/disk/scratch== files (e.g. log files).</s> On most nodes, look for big files staying around in Hadoop cache directories. These cache directories shouldn't have any long term residents.
   * To check for HDFS problems, run ==hadoop fsck== for a short report or ==hadoop fsck -files -blocks -locations== for a longer one.
   * When the cluster is shut down, look for stray Hadoop daemons which are still running.
Most nodes should run a !DataNode and a !TaskTracker process while the cluster is up; these should be shut down when the cluster is shut down, but occasionally they stay in existence. This can cause problems later when they attempt to be part of the cluster when it's next started up. Look for and kill any stray Hadoop processes you find while the cluster is down.
   * Check the job tracker ( http://jobtracker.inf.ed.ac.uk:8088 ) and clear out any old jobs (in practice simply restarting the cluster should achieve this and not lose active, valid jobs).

*If you've lost a disk:*
   * %RED% just a summary, iain will provide more!%ENDCOLOR%
   * Admin scripts live in =/opt/hadoop/hadoop-2.7.x/sbin/=.
   * Once a node has become inoperable it will be declared dead by the namenode after an unknown period of time; you should be able to see this in the "datanodes" page. It's updated every few minutes. HDFS will try to struggle along with reduced storage. Ideally you should be able to restart the individual node with: <br> ==$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode== <br> ==$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode==
   * If the filesystem is down it may come back up in safe mode (e.g. if there are disk failures):
      * Turn safe mode off: ==hdfs dfsadmin -safemode leave==
      * Run fsck on the filesystem: ==hdfs fsck /== and hope that there's not too much data loss, as with a normal fsck.

*If you've lost a node:*

If it's just one node then ssh to the relevant node, nsu to hadoop and stop and restart the node manager with: <br> ==$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager== <br> ==$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager==

If it's all of the cluster then ssh onto the jobtracker and run <br> ==$HADOOP_PREFIX/sbin/stop-yarn.sh== <br> ==$HADOOP_PREFIX/sbin/start-yarn.sh==

It's preferable to restart the individual nodes because if you restart the scheduler you will lose jobs. There are commands for starting/stopping all the various processes at http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html#Operating_the_Hadoop_Cluster

%RED%WATCH OUT, ANY COMMAND WITH "-daemons.sh" WILL AFFECT ALL THE NODES ON THE CLUSTER......%ENDCOLOR% Drop the "s" for the command to apply just to that node.

---++ DON'TS

DON'T run start-yarn on anything other than jobtracker.inf.ed.ac.uk.

DON'T shut down HDFS without first shutting down yarn.

---++ Further Reading
   * _Hadoop: The Definitive Guide_ ( [[http://blob.inf.ed.ac.uk/chris/2010/09/08/online-books/][available at the library]] )
   * [[https://hadoop.apache.org/common/docs/2.7.3/][The Hadoop 2.7.3 documentation]].
   * [[https://hadoop.apache.org/common/docs/stable/cluster_setup.html][Cluster Setup]] : contains lots of configuration help, including links to:
      * [[https://hadoop.apache.org/common/docs/stable/mapred_tutorial.html][Map/Reduce Tutorial]] : despite the name this contains a lot of configuration wisdom.
      * [[https://hadoop.apache.org/common/docs/current/core-default.html][Core configuration resource defaults]]
      * [[https://hadoop.apache.org/common/docs/current/hdfs-default.html][HDFS configuration resource defaults]]
      * [[https://hadoop.apache.org/common/docs/current/mapred-default.html][Map/Reduce configuration resource defaults]]
   * [[http://www.inf.ed.ac.uk/teaching/courses/exc/labs/lab1.html][The exc lab exercises]].
-->