Tuesday, March 29, 2016

Hadoop Cluster setup: 2 nodes - master & slave

First off, this is just a description of my experience setting up a Hadoop two-node cluster, primarily for my own future reference, and secondarily for the edification of the curious blog reader. I do not claim to be a Hadoop expert, and there are probably errors in the setup steps below that someone who knows Hadoop well could refine.

Starting point: 
To start off, I have two VMware virtual machines, hadoop-master and hadoop-slave1. Each runs Ubuntu 15.10 x64 desktop, and each has its own IP address on the network.
Both VMs have a single user named "hadoop".

Step 0: Packages Required
  1. sudo apt-get install ssh rsync default-jdk
  2. Download Hadoop from http://hadoop.apache.org/releases.html and extract it somewhere on each VM

Step 1: Make the hosts know each other
Here we update /etc/hosts on both VMs so that
(1) the hostname of this host maps to its external IP address and not to the loopback address 127.0.1.1 (Ubuntu's default), and
(2) the hostname of the other VM is mapped to that VM's IP address.

So, /etc/hosts for hadoop-master looks like:
and /etc/hosts for hadoop-slave1 looks like: 
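For illustration, assuming the two VMs got the addresses 192.168.1.10 and 192.168.1.11 (placeholders; substitute your VMs' actual IPs), the two files would look something like this. Note that the default line mapping the machine's own hostname to 127.0.1.1 has been removed on each:

```
# /etc/hosts on hadoop-master
127.0.0.1       localhost
192.168.1.10    hadoop-master
192.168.1.11    hadoop-slave1

# /etc/hosts on hadoop-slave1
127.0.0.1       localhost
192.168.1.10    hadoop-master
192.168.1.11    hadoop-slave1
```

Apart from comments, the two files end up identical, which is convenient.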

Step 2: Setup ssh without password
Do these three commands on each VM
  1. ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  2. ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@hadoop-master
  3. ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@hadoop-slave1
Update 04/30/2016: Ubuntu Xenial (16.04) disabled support for DSA keys in OpenSSH. So update all references from dsa to rsa in the commands above.

The first command creates a key pair for ssh without password. The public key is in id_dsa.pub (since we specified the target file as id_dsa). The other two commands install the public key on this machine as well as the other machine. Once this is done, from each machine you should be able to ssh to localhost, hadoop-master, and hadoop-slave1 without providing a password.

Step 3: JAVA_HOME environment variable
The command "java -version" will tell you the java version on your VMs. You don't need to know this, in our case, but just FYI and also if you need to look up java version vs. hadoop version compatibility.

On my ubuntu instance, JAVA_HOME should point to a version specific subfolder under /usr/lib/jvm/. Conveniently, there is a symbolic link 'default-java' which will automatically point to the correct version folder.

So my JAVA_HOME needs to be set to "/usr/lib/jvm/default-java". Now, where do I need to set this? There are two places (need to do this on both VMs):
  1. In your .bashrc file: export JAVA_HOME at the very end of the file.
  2. In hadoop's etc/hadoop/hadoop-env.sh
PS: you need to restart your shell/console for the .bashrc change to take effect.
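Concretely, the line is the same in both places (adjust the path if your distro lays out JVMs differently):

```shell
# Add at the very end of ~/.bashrc, and also set in etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/default-java
```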

Step 4: Basic hadoop / HDFS
From this step on, we are going to follow the Single Node Setup tutorial, adapting it in parallel as a cluster setup. The Cluster Setup tutorial itself is a rather complicated list of every available option, most of which we do not care about.

Configure etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml. They are identical on both VMs.

core-site.xml specifies the NameNode as defined in the Cluster tutorial. We make hadoop-master:9000 the NameNode. hdfs-site.xml defines the replication count.
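Here is a sketch of the two files, using the standard Hadoop 2.x property names: fs.defaultFS points at the NameNode, and dfs.replication is the replication count (2, since we have two nodes):

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```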

Next we update etc/hadoop/slaves to list all the machines in the pool. I listed both machines, and I made this file identical on both machines - not sure if that is the right thing to do.
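The slaves file is just a plain list of hostnames, one per line. With both machines listed, it contains:

```
hadoop-master
hadoop-slave1
```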

At this point, we have a basic hadoop setup. Next, format the HDFS filesystem, since it is brand new. This only needs to be done on the master, and it is a one-time operation - you don't need to do it every time you bring up HDFS.
  • bin/hdfs namenode -format
Now we can start HDFS from the master using this command:
  • sbin/start-dfs.sh
At this point, we have a blank HDFS filesystem up and running, let's play around a little bit with it. First let's make a folder for our user (do this on the master)
  • bin/hdfs dfs -mkdir /user
  • bin/hdfs dfs -mkdir /user/hadoop
So now our hadoop user has a home folder on the HDFS filesystem. If you specify a relative path, it will be resolved relative to this home folder.
Let's add some files to the HDFS filesystem from the master. Specifically, we copy the folder etc/hadoop on our local filesystem to a folder named "input" on the HDFS filesystem, so we can do the map-reduce example in the tutorial. (If you did the single node setup, you will see that this put operation takes longer. This is because we specified a replication level of 2, so each file also needs to be moved over the network to the slave.)
  • bin/hdfs dfs -put etc/hadoop input
Now, on the slave, we can list the contents of the input folder and see the files we uploaded from the master.
  • bin/hdfs dfs -ls input
 At this point, we can run the example hadoop map-reduce program from the tutorial on the master.
  • bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'had[a-z.]+'
Make note of how quickly this operation completes. We can examine the output from the slave.
  •  bin/hdfs dfs -ls output
  • bin/hdfs dfs -cat output/*
  • bin/hdfs dfs -get output output
  • cat output/*
At this point, to recap, we used sbin/start-dfs.sh to start the HDFS filesystem/namenode. To bring it down, we can similarly use sbin/stop-dfs.sh.

Also, we can examine the status of HDFS by going to the URL
  • http://hadoop-master:50070
 The "Live nodes" listing and link on that page will show you that the master as well as the slave are registered as part of the HDFS filesystem.

Step 5: Actual distribution / YARN
The above step really only set up the HDFS filesystem. The map-reduce job we ran still ran locally and was not executed as a distributed map-reduce job (sorry!). To actually distribute processing, we need to set up the ResourceManager node specified in the Cluster tutorial. For that we need to set up the map-reduce framework, which in our case is YARN.

First, configure mapred-site.xml and yarn-site.xml as follows (same on both VMs):
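Here is a sketch of the two files, using the standard Hadoop 2.x property names: mapreduce.framework.name selects YARN, yarn.resourcemanager.hostname names the resource manager, and the aux-services entry enables the shuffle service:

```xml
<!-- etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```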

The former tells hadoop to use YARN as the map-reduce framework. The latter tells hadoop that "hadoop-master" is the resource manager. You could set up a multi-node cluster where the resource manager is not the same host as the node managers. (I have no idea what mapreduce_shuffle does.)

At this point, we are ready to start YARN. Remember that YARN is practically useless unless you run it on top of HDFS. So we first start HDFS and then start YARN. (from the master)
  • sbin/start-dfs.sh
  • sbin/start-yarn.sh
As mentioned before, http://hadoop-master:50070 shows HDFS status page. YARN status page will now be available at
  • http://hadoop-master:8088
Now, let us run our example map-reduce job in actual distributed mode. Remember that we ran it once already on this HDFS filesystem, so an "output" folder exists. We will first delete that output folder before running map-reduce. Also, just for fun, we'll run this map-reduce job from the slave rather than from the master. So on the slave:
  • bin/hdfs dfs -rm -r output
  • bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'had[a-z.]+'
Notice that this time around, the map-reduce job takes a lot longer. This is because it is actually being done in distributed mode, and network latency far overwhelms the super-fast processing of modern CPUs. Also, the framework shows us updates on the progress of the job as it runs. Once the job completes, you can again look at the output using the same commands we used in Step 4.

Finally, don't forget to shutdown correctly (from master):
  • sbin/stop-yarn.sh
  • sbin/stop-dfs.sh

Step 6: Reformatting HDFS
[Reference: http://stackoverflow.com/questions/26545524/there-are-0-datanodes-running-and-no-nodes-are-excluded-in-this-operation]
This is not really a step in itself, and you don't normally need to do it. However, if you do need to, first stop both YARN and DFS, then reformat.
  • sbin/stop-yarn.sh
  • sbin/stop-dfs.sh
  • bin/hdfs namenode -format
At this point, if you start dfs, things will strangely go wrong. This is because there are hadoop temp files in the /tmp/ folder that contain the old Cluster ID. To fix this, clear your /tmp/ folder as the hadoop user:
  • rm -rf /tmp/* 
(Not sure if the above is the safest way to do this; are there other apps that have files open in the /tmp/ folder?)
Once you do that, you can bring up HDFS and create a home folder for the hadoop user
  • sbin/start-dfs.sh
  • bin/hdfs dfs -mkdir -p /user/hadoop
Then, you can bring up YARN
  • sbin/start-yarn.sh 
Optimization: Better storage persistence
Obviously, as we found out in Step 6, hadoop stores data in the /tmp/ folder, which is susceptible to being deleted. However, we do not want the data we store on the HDFS filesystem to be deleted. To correct this, first make two folders named "hadoop-name" and "hadoop-data" somewhere the hadoop user can write to. I made them in /home/hadoop as below. The hadoop-name folder will store hadoop namenode information and logs. The hadoop-data folder will store the actual data on the filesystem.
  • /home/hadoop/hadoop-name
  • /home/hadoop/hadoop-data
Next, edit etc/hadoop/hadoop-env.sh (on all machines). In there, set the two variables "HADOOP_PID_DIR" and "HADOOP_LOG_DIR" to both point to /home/hadoop/hadoop-name.
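Concretely, that means adding these two lines:

```shell
# In etc/hadoop/hadoop-env.sh: keep PID files and logs out of /tmp
export HADOOP_PID_DIR=/home/hadoop/hadoop-name
export HADOOP_LOG_DIR=/home/hadoop/hadoop-name
```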

Finally, edit etc/hadoop/hdfs-site.xml (on all machines) and add these two properties:
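A sketch of the two properties, using the standard Hadoop 2.x names (these go inside the existing <configuration> element; the file:// URI form matches how hdfs-default.xml specifies local paths):

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/hadoop-name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/hadoop-data</value>
</property>
```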
dfs.namenode.name.dir should point to the hadoop-name folder we created, and dfs.datanode.data.dir should point to the hadoop-data folder we created. At this point, you can delete any hadoop files in /tmp, reformat the namenode and start HDFS and YARN
  • rm -rf /tmp/*
  • bin/hdfs namenode -format
  • sbin/start-dfs.sh
  • bin/hdfs dfs -mkdir -p /user/hadoop
  • sbin/start-yarn.sh
You will see that hadoop no longer writes data to /tmp, but instead uses hadoop-name and hadoop-data folders.
