Running jobs on Elastic MapReduce (EMR) with data on S3

This article gives step-by-step instructions on creating a job flow on Amazon EMR with data on Amazon S3. I am using the same wordcount.jar file and the same data file from the previous post.

Upload data and source to S3

Go to your S3 console and create a bucket; my bucket name is buckwell. Inside it, create a folder called data for the data files and another folder called source for the jar files. Now upload the data file into the folder s3n://buckwell/data/ as shown in Fig 1 below.

Fig 1: Upload haha.txt to data folder

Also upload the code into the folder s3n://buckwell/source/ as shown in Fig 2.

Fig 2: Upload wordcount.jar into s3n://buckwell/source/
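
If you prefer scripting the uploads instead of clicking through the console, a minimal sketch with boto3 (the current Python SDK, which postdates this post) could look like the following. It assumes haha.txt and wordcount.jar sit in your working directory and that your AWS credentials are already configured:

import boto3

s3 = boto3.client("s3")

# Create the bucket (skip if it already exists; regions other than
# us-east-1 also need a CreateBucketConfiguration argument).
s3.create_bucket(Bucket="buckwell")

# Upload the data file and the jar into their respective "folders"
# (S3 folders are just key prefixes).
s3.upload_file("haha.txt", "buckwell", "data/haha.txt")
s3.upload_file("wordcount.jar", "buckwell", "source/wordcount.jar")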

Creating a job flow on EMR

Log in to your EMR console. Click the “Create a new job flow” button, type in a job flow name, select the Hadoop version Hadoop 0.20.205 (MapR M3 Edition v1.2.8), then choose to run your own job flow and select Custom JAR from the drop-down, as shown in Fig 3.

Fig 3: EMR job flow

Click continue and set the input parameters and arguments as in Fig 4.

Fig 4: Job Flow arguments and input parameters

JAR location:

s3n://buckwell/source/wordcount.jar

JAR arguments:

s3n://buckwell/data/haha.txt s3n://buckwell/output

If you have used different folders, change the input parameters accordingly. Also, you don't need to create the output/ folder in your bucket; it will be created automatically in the course of the job flow, and EMR throws an error if the folder already exists. Click continue and in the Advanced Options tab set the Log Path to:

s3n://buckwell/logs/
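
For reference, the same job flow can be created through the API. Here is a minimal boto3 sketch, not the exact setup above: the Hadoop 0.20.205 / MapR AMI shown in the console is long retired, so the ReleaseLabel, instance type, and IAM role names below are assumptions; note also that current EMR releases read s3:// paths rather than the older s3n:// scheme.

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="wordcount-job-flow",
    LogUri="s3://buckwell/logs/",       # the Log Path set above
    ReleaseLabel="emr-5.36.0",          # assumption: any current release works
    Instances={
        "MasterInstanceType": "m5.xlarge",     # assumption
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://buckwell/source/wordcount.jar",  # JAR location
            "Args": ["s3://buckwell/data/haha.txt",       # JAR arguments:
                     "s3://buckwell/output"],             # input file, output folder
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles; create them once
    ServiceRole="EMR_DefaultRole",      # with `aws emr create-default-roles`
)
print(response["JobFlowId"])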

Click continue and finally create your job flow and close. You should now be back at the Job Flows window, which looks like Fig 5.

Fig 5: Job flow running

It's going to go through its different phases: STARTING, BOOTSTRAPPING, RUNNING, and COMPLETED. It should look somewhat like Fig 6:

Fig 6: Completed screen with controller, stderr, stdout, syslog
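
If you launched the flow from code, you can poll for those same phases via the API. A small sketch, reusing the emr client and the response from the run_job_flow call above (note the API reports a finished flow as TERMINATED rather than COMPLETED):

import time

cluster_id = response["JobFlowId"]  # from the run_job_flow call above

while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print(state)  # STARTING -> BOOTSTRAPPING -> RUNNING -> TERMINATED
    if state in ("WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)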

Click on controller, stderr, stdout, and syslog to look at the logs and error messages. If the job flow completed, stderr should be empty; otherwise, use it to debug. To look at the output, go back to S3 and open or download the file s3n://buckwell/output/part-r-00000 in a text editor.
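
The output can also be fetched programmatically. A minimal boto3 sketch that prints the word counts (part-r-00000 is the file Hadoop writes for a single-reducer job):

import boto3

s3 = boto3.client("s3")

# Fetch the reducer output from S3 and print it as text.
obj = s3.get_object(Bucket="buckwell", Key="output/part-r-00000")
print(obj["Body"].read().decode("utf-8"))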