Apriori is not Apriori. It is missing all the nice optimzations #644

kno10 · 2019-12-13T22:17:15Z

The current apriori implementation is really really slow.

It does not correctly implement the apriori-gen and pruning parts of Apriori. So it is not really apriori as it was published, but a naive approximation to it. The current codes uses (previous itemsets) x (items) to generate tuples, which will generate many false positives that the original Apriori would avoid to generate in the first place (and hence, not need to count later).
For good performance, it is crucial to generate only very few candidates.

As such, this implementation benchmarks particularly poor on a data set with 33841 transactions and 1587 items, and minsupp=0.005 (Note that apyrori also does not correctly implement Apriori).
Note that I wanted to use minsupp=0.001, but eventually interrupted mlxtend because it was too slow. So these results are now with 5x higher minsupp. Experiments were run on Google Colab, so no guarantees on the reliability of these measurements; but they repeatedly gave similar results with a similar data set before.

Implementation	run time
mlxtend	1min 55s
apyrori	4min 2s
efficient-apriori	2min 56s
ELKI (Java)	7 s
pyFIM Apriori	149 ms

This is not just because pyFIM is a C library, but also because it really implements the Apriori algorithm with some of its clever optimizations (c.f. apriori-gen in the original paper; join+prune; as well as optimizations to counting each iteration). The ELKI implementation does not implement the hash tree of the original Apriori, so the 7 seconds could also be a bit better...
And yes, I know that FPgrowth is much better. pyFIM has a very nice implementation of it, too; it took just 90 ms for this data.

For example at length 3 there are 2802 frequent itemsets, containing 719 different items. The mlxtend code will generate 2802*719=2014638 candidates from this. The real apriori algorithm joins 40095 cadidates, which are pruned to just 5737 candidates before going to the data again. So the mlxtend code has to check about 350x as many candidates as necessary compared to the original Apriori algorithm in this step. This will, of course, have a major impact on runtime.

rasbt · 2019-12-13T23:20:15Z

Yes, it is currently a naive implementation and could be optimized.

dbarbier · 2019-12-15T09:44:15Z

@kno10 Are your benchmark scripts and data sets available somewhere?

kno10 · 2019-12-15T11:51:37Z

A collection of standard benchmark data sets for FIM is here:
http://fimi.uantwerpen.be/data/
An effort to benchmark these implementations should likely begin with this widely used collection.

rasbt · 2019-12-15T17:14:25Z

That's a nice collection, thanks for sharing! Not sure what the sharing policies are exactly, but it would be interesting to incorporate one or two of the smaller ones in the unit tests as well. Also, for aiding the potential improvement of existing implementations, there could be a developer-script added to mlxtend.frequent_patterns for running a set of benchmarks based on these datasets. This could even be a function in mlxtend.frequent_patterns that users can call with different datasets that can be then fetched from the website above to assess the speed of these implementations on their local machine.

The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), there is no point to check itemset p+(q.k-1,) because it cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.

The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), itemset p+(q.k-1,) cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.

dbarbier mentioned this issue Dec 18, 2019

[WIP] Implement apriori-gen as in original paper #646

Open

5 tasks

kno10 mentioned this issue May 26, 2020

Apriori - I'm only interested in one product / consequent. Can I speed up the apriori algorithm? #658

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apriori is not Apriori. It is missing all the nice optimzations #644

Apriori is not Apriori. It is missing all the nice optimzations #644

kno10 commented Dec 13, 2019

rasbt commented Dec 13, 2019

dbarbier commented Dec 15, 2019

kno10 commented Dec 15, 2019

rasbt commented Dec 15, 2019 •

edited

Loading

Apriori is not Apriori. It is missing all the nice optimzations #644

Apriori is not Apriori. It is missing all the nice optimzations #644

Comments

kno10 commented Dec 13, 2019

rasbt commented Dec 13, 2019

dbarbier commented Dec 15, 2019

kno10 commented Dec 15, 2019

rasbt commented Dec 15, 2019 • edited Loading

rasbt commented Dec 15, 2019 •

edited

Loading