In this article I explain how to set up Hadoop 0.20.2, configure it, and transfer data onto the Hadoop Distributed File System (HDFS). I am using 0.20.2 because it is the most widely used release and it ships with a pre-built Eclipse plugin, which is very convenient for writing MapReduce programs. I will not cover running a sample program on Hadoop in this article.
Download and untar
From your HOME folder, run the following to download Hadoop 0.20.2 from the Apache archive.
[~]$ wget -c http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
Untar the downloaded file.
[~]$ tar -zxvf hadoop-0.20.2.tar.gz
Configure
Assuming you have Java installed and JAVA_HOME and PATH set, add the following to the end of your .bash_profile.
export HADOOP_HOME=$HOME/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin
and run
source ~/.bash_profile
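To confirm the variables took effect in your shell, a quick sanity check (paths assume the layout above):

```shell
# Both variables should now be visible in the current shell
echo "$HADOOP_HOME"                                        # e.g. /home/venu/hadoop-0.20.2
echo "$PATH" | grep "hadoop-0.20.2/bin" && echo "PATH looks good"
```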
Open HADOOP_HOME/conf/hadoop-env.sh in vi
[~/hadoop-0.20.2/conf]$ vi hadoop-env.sh
and add this
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386/
Copy the following into HADOOP_HOME/conf/core-site.xml, then save and exit.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Copy the following into HADOOP_HOME/conf/hdfs-site.xml, then save and exit.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <!-- set to 1 to reduce warnings when running on a single node -->
  </property>
</configuration>
Copy the following into HADOOP_HOME/conf/mapred-site.xml, then save and exit.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
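If you prefer to script the setup instead of editing files by hand, the same files can be written from the shell with heredocs. A sketch for core-site.xml (the conf path below assumes the install location from this article; the same pattern works for hdfs-site.xml and mapred-site.xml):

```shell
# Write core-site.xml non-interactively (same content as shown above)
CONF_DIR="$HOME/hadoop-0.20.2/conf"   # adjust if your install lives elsewhere
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
grep fs.default.name "$CONF_DIR/core-site.xml"   # quick check that the file landed
```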
Running hadoop
If your HADOOP_HOME and PATH are properly set, you can run hadoop from anywhere and should see the usage output below:

[~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode

If the command is not found, go to your HADOOP_HOME directory and run it directly:

[~/hadoop-0.20.2]$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
Format the namenode and start the daemons

Only the namenode needs to be formatted; the datanode, secondarynamenode, jobtracker and tasktracker are simply started afterwards. Note that start-all.sh launches the daemons over ssh, so passphraseless ssh to localhost needs to be set up.

[~/hadoop-0.20.2]$ bin/hadoop namenode -format
[~/hadoop-0.20.2]$ bin/start-all.sh
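Once the daemons are up (bin/start-all.sh starts all of them), you can check that each one is actually running with jps, which ships with the JDK and lists running Java processes:

```shell
# List running JVMs; the daemon names below are what Hadoop 0.20.2 uses
jps
# expected to include: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
```

If one of them is missing, look at the corresponding log file under HADOOP_HOME/logs.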
Loading data into HDFS
Say you have a *.txt file that you want to analyse, and you want to load it onto HDFS.
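If you don't have a file handy, you can create a small throwaway one to follow along (the contents here are made up; the filename matches the session below):

```shell
# Create a small sample text file to upload to HDFS
printf 'alpha 1\nbeta 2\ngamma 3\n' > ~/clusteranalyze.txt
wc -l ~/clusteranalyze.txt    # prints 3 followed by the path
```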
[~/hadoop-0.20.2]$ bin/hadoop dfs -mkdir /user/venu/input_data/
[~/hadoop-0.20.2]$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - venu supergroup          0 2012-11-14 23:06 /user/venu/input_data
[~/hadoop-0.20.2]$ bin/hadoop dfs -put ~/clusteranalyze.txt /user/venu/input_data/
[~/hadoop-0.20.2]$ bin/hadoop dfs -cat /user/venu/input_data/clusteranalyze.txt
[~/hadoop-0.20.2]$ bin/hadoop dfs -rmr /user/venu/input_data/clusteranalyze.txt    # to delete data from HDFS
Deleted hdfs://localhost:9000/user/venu/input_data/clusteranalyze.txt
FYI: in this case /user/venu/ is the user's home directory on HDFS, so relative paths in dfs commands (like the bare dfs -ls above) resolve against it.
That was an intro to Hadoop; we will look at MapReduce with Hadoop after the break.