Setting up hadoop-0.20.2 single node on Ubuntu

In this article I explain how to set up Hadoop 0.20.2, configure it, and transfer data onto the Hadoop Distributed File System (HDFS). I am using 0.20.2 because it is the most widely used release and it ships with a pre-built Eclipse plugin, which is very convenient for writing MapReduce programs. I will not be talking about running a sample program on Hadoop in this article.

 Download and untar

From your HOME folder, run the following to download Hadoop 0.20.2 from the Apache archive.


[~]$ wget -c http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Untar the downloaded file.


[~]$ tar -zxvf hadoop-0.20.2.tar.gz

Configure

Assuming you have Java installed and JAVA_HOME and PATH set, add the following to the end of your .bash_profile.


export HADOOP_HOME=$HOME/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin

and run


source ~/.bash_profile
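
If you are not sure whether Java is actually set up, a quick sanity check before going any further (the exact version string will differ on your machine):

```shell
# confirm a JDK is on the PATH and JAVA_HOME points somewhere sensible
java -version
echo $JAVA_HOME
```

Both should print something; an empty JAVA_HOME or a "command not found" here means the Hadoop daemons will fail to start later.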

Open HADOOP_HOME/conf/hadoop-env.sh in vi

 [~/hadoop-0.20.2/conf]$ vi hadoop-env.sh 

and add this

 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386/ 
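
The JDK path above is just the one on my machine; it varies with distribution, architecture and JDK package. To find the right value for your box, list the installed JVMs:

```shell
# each directory here is a candidate JAVA_HOME
# (e.g. java-6-openjdk-i386 on 32-bit Ubuntu, java-6-openjdk-amd64 on 64-bit)
ls /usr/lib/jvm/
```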

Copy the following into HADOOP_HOME/conf/core-site.xml, then save and exit.


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Copy the following into HADOOP_HOME/conf/hdfs-site.xml, then save and exit.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
 <!-- set to 1 to reduce warnings when running on a single node -->
 </property>
</configuration>

Copy the following into HADOOP_HOME/conf/mapred-site.xml, then save and exit.


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
 <name>mapred.job.tracker</name>
 <value>localhost:9001</value>
 </property>
</configuration>

Running hadoop

If your HADOOP_HOME and PATH are properly set, running hadoop from any directory should print its usage.


[~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
 namenode -format format the DFS filesystem
 secondarynamenode run the DFS secondary namenode

If you do not see the above output, go to your HADOOP_HOME directory and run.


[~/hadoop-0.20.2]$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
 namenode -format format the DFS filesystem
 secondarynamenode run the DFS secondary namenode
 namenode run the DFS namenode

Format the namenode and start the daemons

Only the namenode needs to be formatted; the datanode, secondarynamenode and tasktracker have no -format option. After formatting, start all the daemons (namenode, datanode, secondarynamenode, jobtracker and tasktracker) with start-all.sh. Note that the start scripts ssh into localhost to launch the daemons, so passwordless ssh to localhost needs to be set up.


[~/hadoop-0.20.2]$ bin/hadoop namenode -format

[~/hadoop-0.20.2]$ bin/start-all.sh
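
Before loading anything into HDFS, it is worth checking that the daemons are actually up. jps, which ships with the JDK, lists the running Java processes (the pids on the left will differ on your machine):

```shell
# all five hadoop daemons should appear, plus Jps itself
jps
```

You should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The namenode web UI at http://localhost:50070 and the jobtracker UI at http://localhost:50030 are another quick check.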

Loading data into HDFS

Say you have a .txt file that you want to analyse, and you want to load it onto HDFS.


[~/hadoop-0.20.2]$ bin/hadoop dfs -mkdir /user/venu/input_data/

[~/hadoop-0.20.2]$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - venu supergroup 0 2012-11-14 23:06 /user/venu/input_data

[~/hadoop-0.20.2]$ bin/hadoop dfs -put ~/clusteranalyze.txt /user/venu/input_data/

[~/hadoop-0.20.2]$ bin/hadoop dfs -cat /user/venu/input_data/clusteranalyze.txt
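
To copy a file back out of HDFS onto the local filesystem, the inverse of -put is -get (the local destination path here is just an example):

```shell
# copy the file from HDFS to the local home directory
bin/hadoop dfs -get /user/venu/input_data/clusteranalyze.txt ~/clusteranalyze-copy.txt
```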

[~/hadoop-0.20.2]$ bin/hadoop dfs -rmr /user/venu/input_data/clusteranalyze.txt   # to delete data from HDFS
Deleted hdfs://localhost:9000/user/venu/input_data/clusteranalyze.txt

FYI, the /user/venu/ directory in this case is the user's home directory on HDFS.

That was an intro to Hadoop; let's look at MapReduce with Hadoop after the break.