I have implemented the Apriori algorithm on a dataset using the MapReduce framework in Hadoop.
Can anyone guide me on how to optimize the Apriori algorithm (in Hadoop MapReduce)?
Thanks in advance!
EDITED CODE:
//MAPPER: emits every subset (power set) of the transaction with a count of 1
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Utils.count++;                              // global record counter
    String line = value.toString();             // one transaction per line
    String[] items = line.split(" ");
    Arrays.sort(items);                         // sort so subsets are emitted in canonical order
    LinkedHashSet myPowerSet = powerset(items); // materializes the full power set of the transaction
    for (Object itemset : myPowerSet) {
        // strip the "[", "]" and spaces from the toString() representation
        String _key = itemset.toString().replaceAll("\\[|\\]| +", "");
        context.write(new Text(_key), new IntWritable(1));
    }
}
//COMBINER: pre-aggregates the 1-counts per itemset on the map side
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int localSum = 0;
    for (IntWritable value : values) {
        localSum += value.get();
    }
    context.write(key, new IntWritable(localSum));
}
//REDUCER: sums the counts per itemset and keeps only itemsets meeting minimum support
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int minSupportCount = 3;  // hard-coded minimum support threshold
    int supportCount = 0;
    for (IntWritable value : values) {
        supportCount += value.get();
    }
    if (supportCount >= minSupportCount) {
        context.write(key, new IntWritable(supportCount));
    }
}
First of all:
The code you posted is not Apriori.
It lacks all the important ideas of Apriori: rather than its clever candidate generation and pruning, you do an incredibly expensive power-set materialization that will blow up your data volume exponentially. Don't do this.
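For contrast, here is a minimal sketch of the candidate generation and pruning step that defines Apriori (the class and method names are mine, not from your code; itemsets are assumed to be sorted lists, as your mapper already sorts items): each round joins frequent k-itemsets into (k+1)-candidates and discards any candidate that has an infrequent k-subset, so the full power set of a transaction is never materialized.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AprioriCandidates {

    // Generate (k+1)-candidates from the set of frequent k-itemsets.
    // Each itemset is a sorted List<String>.
    public static Set<List<String>> generate(Set<List<String>> frequentK) {
        Set<List<String>> candidates = new HashSet<>();
        for (List<String> a : frequentK) {
            for (List<String> b : frequentK) {
                // Join step: merge two itemsets that share their first k-1 items
                // and differ only in the last item (enforce an order to avoid duplicates).
                if (a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1))
                        && a.get(a.size() - 1).compareTo(b.get(b.size() - 1)) < 0) {
                    List<String> candidate = new ArrayList<>(a);
                    candidate.add(b.get(b.size() - 1));
                    // Prune step (anti-monotonicity): every k-subset must already be frequent.
                    if (allSubsetsFrequent(candidate, frequentK)) {
                        candidates.add(candidate);
                    }
                }
            }
        }
        return candidates;
    }

    private static boolean allSubsetsFrequent(List<String> candidate,
                                              Set<List<String>> frequentK) {
        for (int i = 0; i < candidate.size(); i++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(i);
            if (!frequentK.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}

Only the surviving candidates would then be counted against the transactions in the next MapReduce pass, instead of every possible subset.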
Avoid:

- LinkedHashSet (very slow)
- IntWritable (garbage collection cost; see the reuse sketch below)

For a start, try profiling your application. Also compare it to the known good implementations in ELKI and SPMF. What is the largest data set you can process in these tools (on a single core; try FPGrowth, too) compared to your code (on a cluster)? I wouldn't be surprised if these tools can do 10000x larger data than your code on a single CPU.