Recently, I learnt how to set up a Hadoop cluster from my lab friend and colleague Gautham. We set up a Hadoop cluster using 6 systems in our lab and together developed a Java project to implement indexing (which I had already implemented in a course mini-project) in a distributed set-up. In this post, I will only explain the Hadoop cluster set-up that we created in our lab.
We used 6 systems from our lab, owned by six different people (Prasad, Gautham, Khyathi, Arpita, Hari and Sai). We created a separate user account on each of the systems for our set-up and named the account cluster.
- We created the following folders on each of the systems (nodes):
cd ~
mkdir /home/cluster/work
mkdir /home/cluster/work/dfs
mkdir /home/cluster/work/dfs/data
mkdir /home/cluster/work/dfs/name
mkdir /home/cluster/work/mapred
mkdir /home/cluster/work/mapred/local
mkdir /home/cluster/work/mapred/system
mkdir /home/cluster/work/logs
mkdir /home/cluster/work/tmp
chmod -R 777 /home/cluster/work
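The same tree can also be created in one shot with mkdir -p and bash brace expansion (equivalent to the commands above, just shorter):

mkdir -p /home/cluster/work/dfs/{data,name} \
         /home/cluster/work/mapred/{local,system} \
         /home/cluster/work/{logs,tmp}
chmod -R 777 /home/cluster/work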
- Execute the following command to download Hadoop version 1.2.1:
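wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz

(The Apache archive URL above is one source for the 1.2.1 tarball; any Hadoop mirror that still carries the release works just as well.)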
- Extract the Hadoop files and place them in the /home/cluster/work directory on every cluster node:
tar -zxvf hadoop-1.2.1.tar.gz
mv hadoop-1.2.1 /home/cluster/work/hadoop-1.2.1
- Edit the hosts file: Since we will need the IP address of each node later in our set-up, we save the IP-to-hostname mapping for better readability and easier identification of the nodes. We put the IP address and hostname of each node, separated by a tab, in a temporary file temp.txt.
We find the IP address by typing the following (ifconfig is one standard way; the address is the inet addr field of the active interface):
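ifconfig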
Copy the system's hostname from this file (on Ubuntu the hostname is stored in /etc/hostname):
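cat /etc/hostname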
The temp.txt would look like this:
10.2.8.100	prasadmsd
10.2.8.103	kushwanth
10.2.8.101	arpita
10.2.8.102	hari-ThinkCentre-E73
10.2.8.104	khyathi-ThinkCentre-M72e
10.2.8.105	sai
Next, copy these contents from temp.txt and append them at the end of the hosts file (opened with the command below) on each node in the cluster.
sudo gedit /etc/hosts
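To confirm the mapping works, each node should now be reachable from every other node by hostname; for example:

ping -c 1 kushwanth   # should resolve to 10.2.8.103 and get a reply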
- Set up the configuration files: This is the most important step, where all the cluster configuration happens. We made Prasad's node the namenode and jobtracker. We open and make the following changes in the core-site.xml of each node; core-site.xml contains information about the namenode:
We replace all the text in that file with the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://prasadmsd:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/cluster/work/tmp</value>
  </property>
</configuration>
Next we open and make the following changes in the mapred-site.xml, which contains information about the jobtracker:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>prasadmsd:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/cluster/work/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/cluster/work/mapred/system</value>
  </property>
</configuration>
Next we open and make the following changes in the hdfs-site.xml. hdfs-site.xml contains the replication factor: the number of copies of each data block that HDFS keeps, so that a job can still finish even if a datanode fails while it is running. With six datanodes and a replication factor of 2, every block lives on two different machines, so no single node failure loses data. The following text replaces the contents of hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/cluster/work/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/cluster/work/dfs/data</value>
  </property>
</configuration>
Next we open and make the following changes in the slaves file. slaves lists the nodes that run the datanode and tasktracker daemons; we copy the hostnames of all the nodes into the file:
prasadmsd
kushwanth
arpita
hari-ThinkCentre-E73
khyathi-ThinkCentre-M72e
sai
Next we open and make the following changes in the masters file. Despite its name, masters stores the secondary namenode information; we put in it the hostname of the node that should run the secondary namenode:
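kushwanth

(Gautham's system ran the secondary namenode, as the jps output further below confirms.)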
Next we open the hadoop-env.sh. We set the JAVA_HOME path by appending a line like the following to the file (the exact path depends on where the JDK is installed):
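export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # example path; point this at your own JDK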
Now we are done with the configuration. Note that the above changes have to be made in the hadoop folder of every node.
- SSH keygen on the namenode: We type the following commands on the namenode (Prasad's system in our case).
sudo rm -R ~/.ssh                                 # start from a clean slate
ssh localhost                                     # recreates ~/.ssh with the right permissions
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa          # generate a passwordless DSA key pair
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys   # authorize the key locally
Copy the authorized_keys to the other nodes:
scp ~/.ssh/id_dsa.pub cluster@kushwanth:~/.ssh/authorized_keys
scp ~/.ssh/id_dsa.pub cluster@hari-ThinkCentre-E73:~/.ssh/authorized_keys
scp ~/.ssh/id_dsa.pub cluster@arpita:~/.ssh/authorized_keys
scp ~/.ssh/id_dsa.pub cluster@khyathi-ThinkCentre-M72e:~/.ssh/authorized_keys
scp ~/.ssh/id_dsa.pub cluster@sai:~/.ssh/authorized_keys
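To confirm passwordless login works, ssh from the namenode into each node; no password prompt should appear (accept the host-key fingerprint on the first connection):

ssh cluster@kushwanth
exit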
- Namenode format:
Type the following commands:
cd /home/cluster/work/hadoop-1.2.1/bin/
./hadoop namenode -format
chmod -R 755 ~/work/dfs
- Start the Hadoop server using the command:
cd /home/cluster/work/hadoop-1.2.1/bin/
./start-all.sh
Run the jps command to see all the services up and running:
On Prasad's system it showed:
cluster@prasadmsd:~$ jps
22110 NameNode
22624 TaskTracker
22290 DataNode
22710 Jps
22462 JobTracker
On Gautham's system it showed:
cluster@kushwanth:~$ jps
28457 Jps
27705 TaskTracker
27451 DataNode
27586 SecondaryNameNode
On the other nodes it showed:
cluster@hari-ThinkCentre-E73:~$ jps
14182 DataNode
14330 TaskTracker
14447 Jps
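With every daemon up, a quick end-to-end check is to run one of the example jobs bundled with the release (hadoop-examples-1.2.1.jar sits in the distribution root); the pi estimator exercises both HDFS and MapReduce:

cd /home/cluster/work/hadoop-1.2.1
bin/hadoop jar hadoop-examples-1.2.1.jar pi 10 100   # 10 map tasks, 100 samples each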
After everything is up and running, we can monitor HDFS and the jobtracker through the web interfaces on the following ports:
50070 : Namenode
50030 : Jobtracker
Open a browser and type <namenode hostname>:50070. It should look something like the screenshot below:

From the above image, we can see that we have 6 live nodes. In another browser tab, type <jobtracker hostname>:50030. It should look something like this:
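The live-node count can also be checked from the command line with the standard dfsadmin tool:

cd /home/cluster/work/hadoop-1.2.1/bin
./hadoop dfsadmin -report   # prints HDFS capacity and the list of live datanodes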
Voilà… we have our multi-node Hadoop cluster successfully up and running!