This article gives step-by-step instructions on creating a job flow on Amazon EMR with data stored on Amazon S3. I am using the same wordcount.jar file and the same data file from the previous post.
Upload data and source to S3
Go to your S3 account here and create a bucket; my bucket name is buckwell. Inside it, create a folder called data for the input file and another folder called source for the jar files. Now upload the data file into the folder s3n://buckwell/data/ as shown in Fig 1 below.
Also upload the jar file into the folder s3n://buckwell/source/ as shown in Fig 2.
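The article does these uploads through the S3 web console. As a rough alternative sketch, the same steps could be done with the AWS command-line interface, assuming it is installed and configured with your credentials (the CLI is a separate tool from the console flow shown in the figures; bucket and file names below are the ones used in this article):

```shell
# Create the bucket, then upload the data file and the jar
# into the data/ and source/ folders used in this article.
aws s3 mb s3://buckwell
aws s3 cp haha.txt s3://buckwell/data/haha.txt
aws s3 cp wordcount.jar s3://buckwell/source/wordcount.jar
```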
Creating job flow on EMR
Log in to your EMR account here. Click the “create a new job flow” button, type in a job flow name, and select the Hadoop version Hadoop 0.20.205 (MapR M3 Edition v1.2.8). Then choose to run your own job flow and, in the drop-down, select Custom JAR as shown in Fig 3.
Click Continue and set the JAR location and arguments as in Fig 4:

JAR location:
s3n://buckwell/source/wordcount.jar
JAR arguments:
s3n://buckwell/data/haha.txt s3n://buckwell/output
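The first argument is the input file the jar reads; the second is the output directory it writes word counts into. To get an intuition for what the job computes, here is a rough local equivalent using standard Unix tools on a small sample file (the sample text is made up for illustration):

```shell
# Create a tiny sample input file.
printf 'hello world\nhello emr\n' > sample.txt

# Split on spaces, then count occurrences of each word,
# printing "word<TAB>count" lines like the MapReduce job does.
tr -s ' ' '\n' < sample.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
```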
If you used different folders, change these parameters accordingly. Also, you don’t need to create the output/ folder in your bucket: it is created automatically in the course of the job flow, and EMR throws an error if the folder already exists. Click Continue and, in the Advanced Options tab, set the Log Path to:
s3n://buckwell/logs/
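Taken together, the settings above could also be expressed through Amazon’s elastic-mapreduce command-line client of that era instead of the console. This is only a sketch: the flag names below are from memory of that Ruby CLI, so verify them against your installed version before relying on them.

```shell
# Hedged sketch: create the same job flow from the command line.
# Name, jar, arguments, and log path match the console settings above.
elastic-mapreduce --create --name "wordcount" \
  --jar s3n://buckwell/source/wordcount.jar \
  --arg s3n://buckwell/data/haha.txt \
  --arg s3n://buckwell/output \
  --log-uri s3n://buckwell/logs/
```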
Click Continue, then finally create your job flow and close the dialog. You should now be back at the Job Flows window, which looks like Fig 5.
It’s going to go through its different phases of STARTING, BOOTSTRAPPING, RUNNING, and COMPLETED. It should look somewhat like Fig 6:
Click on controller, stderr, stdout, or syslog to look at the logs and error messages. If the job flow completed successfully, stderr should be empty; otherwise, use it to debug. To look at the output file, go back to your S3 account and open or download s3n://buckwell/output/part-r-00000 with a text editor.
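The part-r-00000 file that the reducer writes is plain text, one tab-separated "word, count" pair per line. Once downloaded, a quick way to see the most frequent words locally (the file contents below are a made-up example, not the actual output of haha.txt):

```shell
# Fake a small reducer output file for illustration.
printf 'emr\t1\nhello\t2\nworld\t1\n' > part-r-00000

# Sort by the count column, largest first, and show the top words.
sort -k2,2nr part-r-00000 | head -n 5
```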