How to set up a Hadoop multi-node cluster on Ubuntu

Recently, I learnt how to set up a Hadoop cluster from my lab friend and colleague Gautham. We set up a Hadoop cluster using 6 systems in our lab and together developed a Java project to implement indexing (which I had already implemented in a course mini-project) in a distributed setup. In this post, I will only explain the Hadoop cluster setup which we created in our lab.

We used 6 systems from our lab, owned by our labmates (Prasad, Gautham, Khyathi, Arpita and Sai). We created a separate user account on each of the systems for our setup and named the account cluster.


      1. We created the following folders on each of the systems (nodes):
        cd ~
        mkdir /home/cluster/work
        mkdir /home/cluster/work/dfs
        mkdir /home/cluster/work/dfs/data
        mkdir /home/cluster/work/dfs/name
        mkdir /home/cluster/work/mapred
        mkdir /home/cluster/work/mapred/local
        mkdir /home/cluster/work/mapred/system
        mkdir /home/cluster/work/logs
        mkdir /home/cluster/work/tmp
        chmod -R 777 /home/cluster/work
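
        Equivalently, the whole tree can be created in one shot (a convenience, assuming a bash shell with brace expansion):

        
        mkdir -p /home/cluster/work/{dfs/{data,name},mapred/{local,system},logs,tmp}
        chmod -R 777 /home/cluster/work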
        
      2. Execute the following command to download Hadoop version 1.2.1:

        
        wget http://mirrors.gigenet.com/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
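
        Mirrors rotate over time, so the link above may go stale; the Apache archive keeps old releases as a fallback:

        
        wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz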
        
      3. Extract the Hadoop files and place them in the /home/cluster/work directory on every cluster node:

        
        tar -zxvf hadoop-1.2.1.tar.gz
        mv hadoop-1.2.1 /home/cluster/work/hadoop-1.2.1
        
      4. Edit the hosts file: Since we will need the IP address of each node later in the setup, we save the IP-to-hostname mapping for better readability and easier identification of the nodes. We copy the IP address and hostname, separated by a tab, for each node into one temporary file, temp.txt.
        We find the IP address by typing:

        
        ifconfig
        

        Copy the system’s hostname from this file:

        
        vi /etc/hostname
        

        temp.txt would look like this:

        
        10.2.8.100    prasadmsd
        10.2.8.103    kushwanth
        10.2.8.101    arpita
        10.2.8.102    hari-ThinkCentre-E73
        10.2.8.104    khyathi-ThinkCentre-M72e
        10.2.8.105    sai
        

        Next, copy these contents from temp.txt and append them to the end of the hosts file below, on each node in the cluster:

        
        sudo gedit /etc/hosts
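
        To sanity-check the mapping, a quick ping loop from any node should reach every hostname (a small sketch using our hostnames; substitute your own):

        
        for h in prasadmsd kushwanth arpita hari-ThinkCentre-E73 khyathi-ThinkCentre-M72e sai; do
            ping -c 1 "$h" > /dev/null && echo "$h reachable" || echo "$h NOT reachable"
        done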
        
      5. Set up the configuration files: This is the most important step, where all the cluster configuration takes place. We made Prasad’s node the namenode and jobtracker. We open and make the following changes in the core-site.xml of each node; core-site.xml contains information about the namenode:
        
        gedit /home/cluster/work/hadoop-1.2.1/conf/core-site.xml 
        

        We replace all the text in that file with the following:

        
        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <!-- Put site-specific property overrides in this file. -->
        <configuration>
        <property>
         <name>fs.default.name</name>
         <value>hdfs://prasadmsd:9000</value>
        </property>
        <property>
         <name>hadoop.tmp.dir</name>
         <value>/home/cluster/work/tmp</value>
        </property>
        </configuration>
        

        Next, we open and make the following changes in mapred-site.xml:

        
        gedit /home/cluster/work/hadoop-1.2.1/conf/mapred-site.xml
        

        mapred-site.xml contains information about the jobtracker:

        
        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <!-- Put site-specific property overrides in this file. -->
        <configuration>
         <property>
         <name>mapred.job.tracker</name>
         <value>prasadmsd:9001</value>
        </property>
        <property>
         <name>mapred.local.dir</name>
         <value>/home/cluster/work/mapred/local</value>
        </property>
        <property>
         <name>mapred.system.dir</name>
         <value>/home/cluster/work/mapred/system</value>
        </property>
        </configuration>
        
        

        Next, we open and make the following changes in hdfs-site.xml:

        
        gedit /home/cluster/work/hadoop-1.2.1/conf/hdfs-site.xml
        

        hdfs-site.xml contains the replication factor, which is the number of copies of each data block that HDFS keeps, so that jobs can continue even if a datanode fails in the middle of a run. With a replication factor of 2 and six datanodes, each block is stored on two of the six nodes. Replace the text in hdfs-site.xml with the following:

        
        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <!-- Put site-specific property overrides in this file. -->
        
        <configuration>
        <property>
         <name>dfs.replication</name>
         <value>2</value>
        </property>
        <property>
         <name>dfs.name.dir</name>
         <value>/home/cluster/work/dfs/name</value>
        </property>
        <property>
         <name>dfs.data.dir</name>
         <value>/home/cluster/work/dfs/data</value>
        </property>
        </configuration>
        
        

        Next, we open and make the following changes in the slaves file:

        
        gedit /home/cluster/work/hadoop-1.2.1/conf/slaves
        

        The slaves file lists the worker nodes, i.e. the machines that run the datanode and tasktracker daemons. We copy all the hostnames of the nodes into the file:

        
        prasadmsd
        kushwanth
        arpita
        hari-ThinkCentre-E73
        khyathi-ThinkCentre-M72e
        sai
        

        Next, we open and make the following changes in the masters file:

        
        gedit /home/cluster/work/hadoop-1.2.1/conf/masters
        

        The masters file specifies which node runs the secondary namenode; we store its hostname here:

        
        kushwanth
        

        Next, we open hadoop-env.sh:

        
        gedit /home/cluster/work/hadoop-1.2.1/conf/hadoop-env.sh
        

        We set the JAVA_HOME path in the file by appending the following line (ours pointed to Oracle Java 7; adjust the path to your Java installation):

        
        export JAVA_HOME=/usr/lib/jvm/java-7-oracle
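
        If you are unsure where your JVM lives, these standard Ubuntu commands can help locate it:

        
        readlink -f "$(which java)"        # resolves the java binary through its symlinks
        update-alternatives --list java    # lists the registered Java installations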
        

        Now we are done with the configurations. Note that the above changes must be made in every node’s Hadoop folder.
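
        Rather than repeating the edits by hand on every machine, one shortcut (a sketch, usable once the passwordless SSH of step 6 is in place) is to finish the conf directory on the namenode and push it to the other nodes:

        
        for h in kushwanth arpita hari-ThinkCentre-E73 khyathi-ThinkCentre-M72e sai; do
            scp -r /home/cluster/work/hadoop-1.2.1/conf cluster@"$h":/home/cluster/work/hadoop-1.2.1/
        done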

      6. SSH keygen on the namenode: We type the following commands on the namenode (Prasad’s system in our case). Despite the common name “RSA keygen” for this step, the key generated below is a DSA key; the initial ssh localhost simply recreates the ~/.ssh directory after it is removed.
        
        sudo rm -R ~/.ssh
        ssh localhost
        ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
        cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
        

        Copy the authorized_keys to the other nodes (this assumes ~/.ssh already exists on each target node; running ssh localhost there once creates it):

        
        scp ~/.ssh/id_dsa.pub cluster@kushwanth:~/.ssh/authorized_keys
        scp ~/.ssh/id_dsa.pub cluster@hari-ThinkCentre-E73:~/.ssh/authorized_keys
        scp ~/.ssh/id_dsa.pub cluster@arpita:~/.ssh/authorized_keys
        scp ~/.ssh/id_dsa.pub cluster@khyathi-ThinkCentre-M72e:~/.ssh/authorized_keys
        scp ~/.ssh/id_dsa.pub cluster@sai:~/.ssh/authorized_keys
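
        Each node should now accept the login without a password prompt; a quick check (again with our hostnames) is:

        
        for h in kushwanth arpita hari-ThinkCentre-E73 khyathi-ThinkCentre-M72e sai; do
            ssh cluster@"$h" hostname
        done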
        
      7. Format the namenode:
        Type the following commands (note the -format flag; the output should end by reporting that the name directory has been successfully formatted):

        
        cd /home/cluster/work/hadoop-1.2.1/bin/
        ./hadoop namenode -format
        chmod -R 755 ~/work/dfs
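
        If the format succeeded, the name directory is now populated (a quick sanity check; the exact file names can vary by version):

        
        ls ~/work/dfs/name/current
        # expect entries such as VERSION, fsimage and edits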
        
      8. Start the Hadoop daemons using the commands:
        
        cd /home/cluster/work/hadoop-1.2.1/bin/
        ./start-all.sh
        

        Run the jps command to see all the services up and running.
        On Prasad’s system it showed:

        
        cluster@prasadmsd:~$ jps
        22110 NameNode
        22624 TaskTracker
        22290 DataNode
        22710 Jps
        22462 JobTracker
        

        On Gautham’s system it showed:

        
        cluster@kushwanth:~$ jps
        28457 Jps
        27705 TaskTracker
        27451 DataNode
        27586 SecondaryNameNode
        

        For the others it showed:

        
        cluster@hari-ThinkCentre-E73:~$ jps
        14182 DataNode
        14330 TaskTracker
        14447 Jps


After everything is up and running, we can browse the HDFS and MapReduce web interfaces on the following ports:

50070: Namenode
50030: Jobtracker
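
The same liveness information is available from the command line; with the daemons running, dfsadmin reports the live datanodes and their capacities:

cd /home/cluster/work/hadoop-1.2.1/bin/
./hadoop dfsadmin -report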

Open a browser and go to <namenode hostname>:50070. It should look something like this:

[Image: namenode web UI showing the live nodes]

From the above image, we can see that we have 6 live nodes. In another browser tab, open <jobtracker hostname>:50030. It should look something like this:

[Image: jobtracker web UI]

Voilà! We have successfully set up our multi-node Hadoop cluster, and it is up and running.
