Apriori Algorithm


I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.


There are 4 answers

Spencer Ruport (Best Answer)

Well, I would assume you've read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that, so I'll assume you haven't read it and suggest that you do.

Read the Wikipedia article.

Phil

The best introduction to Apriori can be downloaded from this book's website:

http://www-users.cs.umn.edu/~kumar/dmbook/index.php

You can download Chapter 6 for free; it explains Apriori very clearly.

Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:

http://www.philippe-fournier-viger.com/spmf/

lmsasu

See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.

Thilina Samiddhi

Apriori Algorithm

It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

Apriori Pruning Principle - If any itemset is infrequent, then its supersets should not be generated or tested, since they cannot be frequent either.

Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if every one of its k-itemset subsets is frequent.

Now, here is the Apriori algorithm in 4 steps; a minimal code sketch follows the list.

  1. Initially, scan the database/dataset once to get the frequent 1-itemsets.
  2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  3. Test the candidates against the database/dataset.
  4. Terminate when no frequent or candidate itemsets can be generated.
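
As a concrete illustration, here are those four steps in Python (a minimal sketch of my own, not from the original answer; the function and variable names are chosen just for this example). Note that it applies the pruning principle while generating candidates, so pruned candidates are never tested against the database:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {itemset (frozenset): support count} for every frequent itemset."""
        # Step 1: scan once to count 1-itemsets and keep the frequent ones (L_1).
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result = dict(frequent)
        k = 1
        while frequent:
            # Step 2: join frequent k-itemsets into (k+1)-candidates, pruning any
            # candidate that has an infrequent k-subset (the Apriori property).
            prev = set(frequent)
            candidates = {
                a | b
                for a in prev for b in prev
                if len(a | b) == k + 1
                and all(frozenset(s) in prev for s in combinations(a | b, k))
            }
            # Step 3: test the surviving candidates against the database.
            counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
            # Step 4: keep the frequent ones; terminate when none remain.
            frequent = {s: c for s, c in counts.items() if c >= min_sup}
            result.update(frequent)
            k += 1
        return result

    db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
    print(apriori(db, 2))  # includes frozenset({'B', 'C', 'E'}): 2

Running it on the example database below reproduces every L_1, L_2 and L_3 itemset worked out by hand in the rest of this answer.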

Solved Example

Suppose there is a transaction database with the following 4 transactions, listed with their transaction IDs and the items bought in them. Assume the minimum support, min_sup, is 2. The support of an itemset is the number of transactions in which it appears.

Transaction DB

tid  | items
-------------
10   | A,C,D
20   | B,C,E
30   | A,B,C,E
40   | B,E

Now, let's create the candidate 1-itemsets with the 1st scan of the DB. This set is simply called C_1:

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {D}   |  1
  {E}   |  3
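
The first scan is just an item count; as a quick check, a minimal sketch that reproduces the column above (db is the example database written out as Python lists):

    from collections import Counter

    db = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]
    C1 = Counter(item for t in db for item in t)
    print(C1)  # Counter({'C': 3, 'B': 3, 'E': 3, 'A': 2, 'D': 1})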

If we test this against min_sup, we can see that {D} does not satisfy the min_sup of 2, so it will not be included in the frequent 1-itemsets, which we simply call L_1:

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {E}   |  3

Now, let's scan the DB a 2nd time and generate the candidate 2-itemsets, the set we simply call C_2:

itemset | sup
-------------
  {A,B} |  1
  {A,C} |  2
  {A,E} |  1
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2
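
These counts can be reproduced by pairing up the items of L_1 and scanning the DB once more; a minimal sketch (db and L1 are names for this example only):

    from itertools import combinations

    db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
    L1 = ['A', 'B', 'C', 'E']  # the frequent 1-itemsets found above
    for pair in combinations(L1, 2):
        sup = sum(set(pair) <= t for t in db)  # transactions containing both items
        print(set(pair), sup)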

As you can see, the itemsets {A,B} and {A,E} do not satisfy the min_sup of 2, so they will not be included in the frequent 2-itemsets, L_2:

itemset | sup
-------------
  {A,C} |  2
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2

Now, let's do a 3rd scan of the DB and get the candidate 3-itemsets, C_3:

itemset  | sup
-------------
 {A,B,C} |  1
 {A,B,E} |  1
 {A,C,E} |  1
 {B,C,E} |  2

You can see that {A,B,C}, {A,B,E} and {A,C,E} do not satisfy the min_sup of 2, so they will not be included in the frequent 3-itemsets, L_3. (In fact, by the Apriori pruning principle they need not have been counted at all, since each contains an infrequent 2-subset such as {A,B} or {A,E}; see the sketch after the L_3 table.)

itemset  | sup
-------------
 {B,C,E} |  2
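
Here is the join-and-prune step that builds C_3 from L_2, as a minimal sketch (L2 and C3 are names for this example); it shows that {B,C,E} is the only candidate that actually needs a database scan:

    from itertools import combinations

    L2 = [frozenset(s) for s in ({'A', 'C'}, {'B', 'C'}, {'B', 'E'}, {'C', 'E'})]
    C3 = set()
    for a in L2:
        for b in L2:
            u = a | b
            # Apriori prune: keep a 3-itemset only if all its 2-subsets are in L_2.
            if len(u) == 3 and all(frozenset(s) in L2 for s in combinations(u, 2)):
                C3.add(u)
    print([set(c) for c in C3])  # prints [{'B', 'C', 'E'}]: the only candidate to test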

Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness) values of the Association/Correlation Rules that can be generated from the itemset {B,C,E}. For a rule X -> Y, supp = supp(X ∪ Y), conf = supp(X ∪ Y) / supp(X), and lift = conf / supp(Y), with supports taken as fractions of the 4 transactions.

         Rule       | supp  |  conf   |  lift
    -------------------------------------------
     B ->  C & E    |  50%  |  66.67% |  1.33
     E ->  B & C    |  50%  |  66.67% |  1.33
     C ->  E & B    |  50%  |  66.67% |  0.89
     B & C ->  E    |  50%  |  100%   |  1.33
     E & B ->  C    |  50%  |  66.67% |  0.89
     C & E ->  B    |  50%  |  100%   |  1.33
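
These figures can be checked mechanically with the formulas above; a minimal sketch (support, full and db are names for this example only):

    from itertools import combinations

    db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]

    def support(itemset):
        """Fraction of the transactions that contain the itemset."""
        return sum(itemset <= t for t in db) / len(db)

    full = frozenset('BCE')
    for r in (1, 2):
        for lhs in combinations(sorted(full), r):
            X, Y = frozenset(lhs), full - frozenset(lhs)
            supp = support(full)      # supp(X ∪ Y)
            conf = supp / support(X)  # supp(X ∪ Y) / supp(X)
            lift = conf / support(Y)  # conf / supp(Y)
            print(f"{set(X)} -> {set(Y)}: supp={supp:.0%} conf={conf:.2%} lift={lift:.2f}")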