For this I assume you have Hadoop configured and have Maven and SVN installed. I am using Ubuntu 12.04.
Setting up mahout
Execute the following commands in your eclipse workspace.
[~/workspace]$ svn co http://svn.apache.org/repos/asf/mahout/trunk
[~/workspace]$ mv trunk/ mahout/
[~/workspace]$ cd mahout/
[~/workspace/mahout]$ mvn install
[~/workspace/mahout]$ cd core/
[~/workspace/mahout/core]$ mvn compile
[~/workspace/mahout/core]$ mvn install
[~/workspace/mahout/core]$ cd ../examples
[~/workspace/mahout/examples]$ mvn compile
If you want to further configure Mahout, please refer to these blogs here and here.
About the data and upload data into HDFS
The recommender, a part of Mahout which runs on top of Hadoop, takes input in the form of <key,value> pairs. It needs two files: an input file and a users file. The input file contains the data, essentially converted into <key,value> pairs. The users file has the keys of the users that you want recommendations for. You can download the input file from my Google Drive. For users.txt, create a file with one key value in the first line, which looks like this.
[~/input]$ cat users.txt
8
The number 8 above is the key; say you are trying to get recommendations for user #8. You could have more keys too, but it is going to take much longer.
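For reference, the input file follows Mahout's standard recommender input format: one userID,itemID pair per line (a third comma-separated preference value would follow if you were not using --booleanData). The IDs below are made up just to illustrate the shape of the file:

```shell
# Create a tiny made-up sample in the same format as links-converted.txt:
# one userID,itemID pair per line (no preference column, since we pass --booleanData)
cat > links-converted-sample.txt <<'EOF'
8,3303009
8,4393611
42,3303009
EOF
cat links-converted-sample.txt
```

If your raw data is not in this shape yet, you would convert it to this comma-separated form before uploading to HDFS.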
Make sure your hadoop is running and upload your files into HDFS.
[~/input]$ hadoop dfs -mkdir input/
[~/input]$ hadoop dfs -put links-converted.txt input/
[~/input]$ hadoop dfs -put users.txt input/
Run recommender on the local machine
This is the command to run the recommender job. I know it's kind of loaded; I will try and explain it below.
[~/workspace/mahout]$ hadoop jar ~/workspace/mahout/core/target/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/links-converted.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData -s SIMILARITY_LOGLIKELIHOOD
hadoop jar ~/workspace/mahout/core/target/mahout-core-0.7-job.jar tells Hadoop that we are giving it a jar file, and where that jar file is.
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob is the main class where execution starts from.
-Dmapred.input.dir=input/links-converted.txt -Dmapred.output.dir=output set the HDFS input path and the output directory.
--usersFile input/users.txt points to the file with the keys of the users that you want recommendations for.
-s SIMILARITY_LOGLIKELIHOOD selects log-likelihood as the similarity measure used to score how strongly items are related, which in turn drives the recommendations for user 8 (as given in the users.txt file). You could also use other measures like SIMILARITY_COOCCURRENCE, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, or SIMILARITY_EUCLIDEAN_DISTANCE.
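For intuition, the log-likelihood similarity is based on Dunning's log-likelihood ratio over a 2x2 co-occurrence table for a pair of items. You can compute it locally as a quick sketch; the counts k11..k22 below are made up for illustration, not taken from the dataset:

```shell
# Dunning's log-likelihood ratio (G^2) for a 2x2 co-occurrence table.
# k11 = users who interacted with both items, k12/k21 = one but not the
# other, k22 = neither. These counts are hypothetical.
awk -v k11=10 -v k12=5 -v k21=4 -v k22=1000 '
function xlogx(x) { return x > 0 ? x * log(x) : 0 }
BEGIN {
  n = k11 + k12 + k21 + k22
  # 2 * (entropy of cells - entropy of row sums - entropy of column sums)
  llr = 2 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22) \
           - xlogx(k11 + k12) - xlogx(k21 + k22) \
           - xlogx(k11 + k21) - xlogx(k12 + k22) \
           + xlogx(n))
  printf "%.2f\n", llr
}'
```

A larger ratio means the co-occurrence of the two items is less likely to be chance, so they are treated as more similar.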
This is the output I got
8[3303009:1.0,4393611:1.0,5292042:1.0,2583882:1.0,1850305:1.0,275656:1.0,1254637:1.0,1720928:1.0,5575496:1.0,3956845:1.0]
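Each output line is the user key followed by a bracketed list of itemID:score pairs, which are the recommended items. If you pull the output file down from HDFS (hadoop dfs -get output/part-r-00000), a quick one-liner can split the pairs out; the line below is hard-coded from the sample output above, and I am assuming the usual tab between key and value that Hadoop's text output writes:

```shell
# One line of recommender output: userID<TAB>[itemID:score,...]
line=$(printf '8\t[3303009:1.0,4393611:1.0,5292042:1.0]')
# Keep the second (tab-separated) field, drop the brackets,
# and put each itemID:score pair on its own line
echo "$line" | cut -f2 | tr -d '[]' | tr ',' '\n'
# prints:
# 3303009:1.0
# 4393611:1.0
# 5292042:1.0
```

From there you could cut on ':' again to get just the item IDs.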
Running on EMR with data on S3
Follow the instructions from my previous post on running jobs on EMR, with the following changes.
In the jar location field,
s3n://buckwell/input/mahout-core-0.7-job.jar
Note: the jar file is in the bucket buckwell, in the folder input/; change the S3 directories accordingly.
In the arguments field,
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=s3n://buckwell/data/links-converted.txt -Dmapred.output.dir=s3n://buckwell/output --usersFile s3n://buckwell/data/users.txt --booleanData -s SIMILARITY_LOGLIKELIHOOD
Note: the links-converted.txt and users.txt files are in the bucket buckwell, in the folder data/; change the S3 directories accordingly.