Setting up Mahout and running a recommender job – Association rules

For this I assume you have Hadoop configured and Maven and SVN installed. I am using Ubuntu 12.04.

Setting up Mahout

Execute the following commands in your Eclipse workspace.

[~/workspace]$ svn co http://svn.apache.org/repos/asf/mahout/trunk
[~/workspace]$ mv trunk/ mahout/
[~/workspace]$ cd mahout/
[~/workspace/mahout]$ mvn install
[~/workspace/mahout]$ cd core/
[~/workspace/mahout/core]$ mvn compile
[~/workspace/mahout/core]$ mvn install
[~/workspace/mahout/core]$ cd ../examples
[~/workspace/mahout/examples]$ mvn compile
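
If the builds succeed, the job jar used in the commands later in this post should now be sitting in core/target (the version number may differ depending on the trunk snapshot you checked out):

[~/workspace/mahout/examples]$ ls ../core/target/*-job.jar
../core/target/mahout-core-0.7-job.jar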

If you want to configure Mahout further, please refer to these blogs here and here.

About the data and uploading it into HDFS

The recommender, a part of Mahout that runs on top of Hadoop, takes input in the form of <key,value> pairs. It needs two files: an input file and a users file. The input file contains the data, essentially converted into <key,value> pairs. The users file holds the keys of the users you want recommendations for. You can download the input file from my Google Drive. For the users.txt file, create a file with one key value on the first line, which looks like this.
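
I am not reproducing links-converted.txt here, but RecommenderJob expects one comma-separated userID,itemID pair per line (with --booleanData there is no preference value). A hypothetical excerpt, purely to illustrate the format:

[~/input]$ head -3 links-converted.txt
8,1254
8,77091
12,1254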

[~/input]$ cat users.txt
8

The number 8 above is the key; here we are trying to get recommendations for user #8. You could have more keys too, as shown below, but the job will take far longer.
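
For instance, a users.txt asking for recommendations for three users would simply have one key per line (the two extra IDs here are made up for illustration):

[~/input]$ cat users.txt
8
15
23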

Make sure Hadoop is running, then upload your files into HDFS.

[~/input]$ hadoop dfs -mkdir input/
[~/input]$ hadoop dfs -put links-converted.txt input/
[~/input]$ hadoop dfs -put users.txt input/
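
As an optional sanity check, list the directory to confirm both files made it into HDFS:

[~/input]$ hadoop dfs -ls input/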

Running the recommender on the local machine

This is the command to run the recommender job. I know it's kind of loaded; I will try to explain it below.

[~/workspace/mahout]$ hadoop jar ~/workspace/mahout/core/target/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/links-converted.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData -s SIMILARITY_LOGLIKELIHOOD

hadoop jar ~/workspace/mahout/core/target/mahout-core-0.7-job.jar tells Hadoop that we are giving it a jar file, along with where that jar file is located.

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob is the main class where execution starts.

-Dmapred.input.dir=input/links-converted.txt -Dmapred.output.dir=output specify the HDFS input file and output directory.

--usersFile input/users.txt points to the file holding the keys of the users you want recommendations for.

-s SIMILARITY_LOGLIKELIHOOD selects the similarity measure, here log-likelihood, that the job uses to score how closely keys co-occur when producing recommendations for user 8 (as given in the users.txt file). You could also use other measures such as SIMILARITY_COOCCURRENCE, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, and SIMILARITY_EUCLIDEAN_DISTANCE.
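
Swapping the measure is just a matter of changing the -s flag. For example, to rerun the same job with the Tanimoto coefficient, first clear the output directory and the intermediate temp/ directory (Hadoop refuses to overwrite existing output), then:

[~/input]$ hadoop dfs -rmr temp/ output/
[~/workspace/mahout]$ hadoop jar ~/workspace/mahout/core/target/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input/links-converted.txt -Dmapred.output.dir=output --usersFile input/users.txt --booleanData -s SIMILARITY_TANIMOTO_COEFFICIENT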

This is the output I got:

 8[3303009:1.0,4393611:1.0,5292042:1.0,2583882:1.0,1850305:1.0,275656:1.0,1254637:1.0,1720928:1.0,5575496:1.0,3956845:1.0]
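
The recommender writes its results as part files in the output directory on HDFS; you can view them with something like this (the exact part file name can vary):

[~/input]$ hadoop dfs -cat output/part-r-00000

Each line is a user key followed by the list of recommended item keys with their estimated scores.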

Running on EMR with data on S3

Follow the instructions from my previous post on running jobs on EMR, with the following changes.

In the jar location field,

s3n://buckwell/input/mahout-core-0.7-job.jar

Note: the jar file is in the bucket buckwell, in the folder input/; change the S3 paths accordingly.

In the arguments field,

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=s3n://buckwell/data/links-converted.txt -Dmapred.output.dir=s3n://buckwell/output --usersFile s3n://buckwell/data/users.txt --booleanData -s SIMILARITY_LOGLIKELIHOOD

Note: links-converted.txt and users.txt are in the bucket buckwell, in the folder data/; change the S3 paths accordingly.
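
If you still need to get those files into S3 in the first place, one way (a sketch, assuming you have s3cmd installed and configured for your AWS account) is:

[~/workspace/mahout]$ s3cmd put core/target/mahout-core-0.7-job.jar s3://buckwell/input/
[~/input]$ s3cmd put links-converted.txt users.txt s3://buckwell/data/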