How to optimize the Apriori algorithm?


I have implemented the Apriori algorithm on a dataset using the MapReduce framework in Hadoop.

Can anyone please guide me on how I can optimize the Apriori algorithm in Hadoop MapReduce?

Thanks!

EDITED CODE:

//MAPPER 
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Utils.count++;
    // Each input line is one transaction of space-separated items.
    String line = value.toString();
    String[] items = line.split(" ");

    // Sort the items so subsets are emitted in a canonical order.
    Arrays.sort(items);
    // Generate every subset of the transaction (2^n subsets for n items).
    LinkedHashSet myPowerSet = powerset(items);
    for (Iterator iterator = myPowerSet.iterator(); iterator.hasNext();) {
        Object i = iterator.next();
        // Strip the brackets and spaces from the set's toString() form.
        String _key = i.toString().replaceAll("\\[|\\]| +", "");
        context.write(new Text(_key), new IntWritable(1));
    }
}
//COMBINER
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Pre-aggregate counts on the map side to cut shuffle volume.
    int localSum = 0;

    for (IntWritable value : values) {
        localSum += value.get();
    }
    context.write(key, new IntWritable(localSum));
}
//REDUCER
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum the global support of each candidate itemset and emit
    // only those that meet the minimum support threshold.
    int minSupportCount = 3;
    int supportCount = 0;

    for (IntWritable value : values) {
        supportCount += value.get();
    }
    if (supportCount >= minSupportCount) {
        context.write(key, new IntWritable(supportCount));
    }
}

There are 2 answers

Has QUIT--Anony-Mousse On

First of all:

The code you posted is not Apriori

It lacks all the important ideas of Apriori. Rather than applying Apriori's clever candidate pruning, you do an incredibly expensive materialization that blows up your data volume exponentially: a transaction with n items emits all 2^n of its subsets, so a single 30-item transaction already produces about a billion key-value pairs. Don't do this.

Avoid:

  • LinkedHashSet (very slow)
  • powerset (use the real Apriori algorithm, which avoids powerset!)
  • untyped iterators (use Generics)
  • regular expressions (slow, in particular when not precompiled)
  • unnecessary materialization (massive shuffle cost)
  • recreating IntWritable (garbage collection cost); see the sketch after this list
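
As a minimal sketch (assuming the same space-separated input format as your mapper), a first-pass mapper that avoids these pitfalls could look like the following; it counts single items only, which is the first Apriori pass, and the class name is just illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ItemCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reuse output objects instead of allocating new ones per record.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count single items only (the first Apriori pass); no powerset, no regex.
        for (String item : value.toString().split(" ")) {
            outKey.set(item);
            context.write(outKey, ONE);
        }
    }
}

Subsequent passes would then generate candidate k-itemsets only from the frequent (k-1)-itemsets found in the previous pass; that pruning is the actual Apriori idea.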

As a starting point, try profiling your application. Also compare it to the known-good implementations in ELKI and SPMF. What is the largest data set you can process in these tools on a single core (try FPgrowth too), compared to your code on a cluster? I wouldn't be surprised if these tools could handle 10000x more data than your code, on a single CPU.
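
If you want to profile inside the cluster rather than on a single core, Hadoop ships a built-in per-task JVM profiler that can be switched on from the driver. A minimal sketch, assuming the standard mapreduce.task.profile.* property names (verify them against your Hadoop version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ProfiledDriver {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // Turn on JVM profiling for a small sample of tasks.
        conf.setBoolean("mapreduce.task.profile", true);
        conf.set("mapreduce.task.profile.maps", "0-1");   // profile the first two map tasks
        conf.set("mapreduce.task.profile.reduces", "0");  // and the first reduce task
        return Job.getInstance(conf, "apriori-pass-1");
    }
}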

Phil On

Actually, Apriori is one of the slowest algorithms for frequent itemset mining. Many algorithms have been proposed since, such as Eclat, FPGrowth, H-Mine, and LCM. Some of them can be more than 1000 times faster than Apriori. Thus, optimizing Apriori is not really useful, because it has fundamental problems. It is better to simply switch from Apriori to a faster algorithm such as LCM or FPGrowth.

But in your code, it seems that it is not the real Apriori anyway. If you want to see an optimized version of Apriori implemented in Java, as well as faster algorithms like H-Mine and FPGrowth, you can check the SPMF software, implemented in Java (of which I am the founder). It offers the largest number of itemset and pattern mining algorithm implementations (more than 100) and is open source.
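
For illustration, running FP-Growth through SPMF from Java looks roughly like the following sketch; the class and method names (AlgoFPGrowth, runAlgorithm, printStats) and the file names are assumptions to verify against the SPMF documentation:

import ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth;

public class RunFPGrowth {
    public static void main(String[] args) throws Exception {
        String input = "transactions.txt";        // one space-separated transaction per line
        String output = "frequent_itemsets.txt";  // frequent itemsets with their support
        double minsup = 0.4;                      // relative minimum support (40%)

        // Assumed SPMF entry point; check against the SPMF docs.
        AlgoFPGrowth algo = new AlgoFPGrowth();
        algo.runAlgorithm(input, output, minsup);
        algo.printStats();                        // prints runtime and itemset count
    }
}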