
Apriori is not Apriori. It is missing all the nice optimizations #644

Open
kno10 opened this issue Dec 13, 2019 · 4 comments

Comments

kno10 commented Dec 13, 2019

The current apriori implementation is really, really slow.

It does not correctly implement the apriori-gen and pruning steps of Apriori, so it is not really Apriori as it was published, but a naive approximation of it. The current code uses (previous itemsets) x (items) to generate tuples, which produces many false candidates that the original Apriori would avoid generating in the first place (and hence would not need to count later).
For good performance, it is crucial to generate only very few candidates.

As a result, this implementation benchmarks particularly poorly on a data set with 33841 transactions and 1587 items at minsupp=0.005 (note that apyori also does not correctly implement Apriori).
I originally wanted to use minsupp=0.001, but eventually interrupted mlxtend because it was too slow, so these results use a 5x higher minsupp. The experiments were run on Google Colab, so there are no guarantees on the reliability of these measurements, but they repeatedly gave similar results on a similar data set before.

Implementation     Run time
-----------------  ----------
mlxtend            1 min 55 s
apyori             4 min 2 s
efficient-apriori  2 min 56 s
ELKI (Java)        7 s
pyFIM Apriori      149 ms

This is not just because pyFIM is a C library, but also because it really implements the Apriori algorithm with some of its clever optimizations (cf. apriori-gen in the original paper: join + prune, as well as optimizations to the counting in each iteration). The ELKI implementation does not implement the hash tree of the original Apriori, so the 7 seconds could also be a bit better...
And yes, I know that FP-growth is much better. pyFIM has a very nice implementation of it, too; it took just 90 ms on this data.

For example, at length 3 there are 2802 frequent itemsets, containing 719 different items. The mlxtend code will generate 2802*719 = 2014638 candidates from this. The real Apriori algorithm joins 40095 candidates, which are pruned down to just 5737 candidates before going back to the data. So in this step the mlxtend code has to check about 350x as many candidates as the original Apriori algorithm. This will, of course, have a major impact on runtime.
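For reference, the join + prune candidate generation from the original paper can be sketched in a few lines of plain Python (a minimal illustration, not mlxtend's API; itemsets are assumed to be lexicographically sorted tuples):

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Sketch of apriori-gen (Agrawal & Srikant, 1994): join L(k-1)
    with itself on a shared (k-1)-prefix, then prune candidates that
    contain an infrequent (k-1)-subset."""
    prev_set = set(frequent_prev)
    # Join step: p and q share all but the last item, and p[-1] < q[-1].
    candidates = [
        p + (q[-1],)
        for p in frequent_prev
        for q in frequent_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Prune step: every (k-1)-subset of a frequent k-itemset must
    # itself be frequent, so drop candidates violating that.
    return [
        c for c in candidates
        if all(sub in prev_set for sub in combinations(c, len(c) - 1))
    ]
```

For instance, given the frequent 2-itemsets (1,2), (1,3), (1,4), (2,3), the join produces (1,2,3), (1,2,4) and (1,3,4), and the prune step keeps only (1,2,3), since (2,4) and (3,4) are not frequent. The naive (previous itemsets) x (items) scheme would instead have to count far more candidates against the data.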

rasbt (Owner) commented Dec 13, 2019

Yes, it is currently a naive implementation and could be optimized.

dbarbier (Contributor) commented

@kno10 Are your benchmark scripts and data sets available somewhere?

kno10 (Author) commented Dec 15, 2019

A collection of standard benchmark data sets for FIM is here:
http://fimi.uantwerpen.be/data/
An effort to benchmark these implementations should likely begin with this widely used collection.

rasbt (Owner) commented Dec 15, 2019

That's a nice collection, thanks for sharing! I am not sure what the sharing policies are exactly, but it would be interesting to incorporate one or two of the smaller ones into the unit tests as well. Also, to aid the improvement of the existing implementations, a developer script could be added to mlxtend.frequent_patterns for running a set of benchmarks based on these datasets. This could even be a function in mlxtend.frequent_patterns that users call with different datasets, fetched from the website above, to assess the speed of these implementations on their local machine.
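As a rough sketch of what such a benchmark helper might look like (the function names and the `miner` callable are placeholders, not an existing mlxtend API; FIMI files are plain text with one transaction of space-separated integer item ids per line):

```python
import time

def read_fimi(lines):
    """Parse the FIMI transaction format: one transaction per line,
    items given as space-separated integer ids."""
    return [sorted(map(int, line.split())) for line in lines if line.split()]

def benchmark(miner, transactions, min_support=0.005):
    """Time a frequent-itemset miner (any callable taking the
    transactions and a minimum support) and return (result, seconds)."""
    start = time.perf_counter()
    result = miner(transactions, min_support)
    return result, time.perf_counter() - start
```

A dataset from http://fimi.uantwerpen.be/data/ could then be downloaded, opened, and passed through `read_fimi` before calling `benchmark` with each implementation under test.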

dbarbier added a commit to dbarbier/mlxtend that referenced this issue Dec 18, 2019
The apriori-gen function described in section 2.1.1 of the Apriori paper
has two steps. First, the join step looks for itemsets with the same
prefix and creates new candidates by appending all pair combinations
to this prefix. Here is pseudocode copied from the paper:

  select p.1, p.2, ..., p.k-1, q.k-1
  from p in L(k-1), q in L(k-1)
  where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1

The reason is that if a sequence q with the same prefix as p does not
belong to L(k-1), there is no point in checking the itemset p+(q.k-1,),
because it cannot be frequent.
Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1.

The second step of the apriori-gen function is called the prune step;
it will be implemented in a distinct commit.

See discussion in rasbt#644.
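A direct translation of this pseudocode into Python might group L(k-1) by the common prefix (a hedged sketch under the same assumptions, not the actual patch; itemsets are tuples sorted lexicographically):

```python
from itertools import combinations, groupby

def join_step(L_prev):
    """Join step of apriori-gen: for itemsets in L(k-1) sharing their
    first k-2 items, emit prefix + (a, b) for each tail pair a < b."""
    out = []
    for prefix, group in groupby(sorted(L_prev), key=lambda s: s[:-1]):
        tails = [s[-1] for s in group]  # sorted, so a < b in each pair below
        for a, b in combinations(tails, 2):
            out.append(prefix + (a, b))
    return out
```

Grouping by prefix makes the "p.1 = q.1, ..., p.k-2 = q.k-2" condition implicit, so only itemsets that can actually join are ever paired.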
dbarbier added a commit to dbarbier/mlxtend that referenced this issue Dec 19, 2019
dbarbier added a commit to dbarbier/mlxtend that referenced this issue Dec 28, 2019
dbarbier added a commit to dbarbier/mlxtend that referenced this issue Jan 1, 2020
dbarbier added a commit to dbarbier/mlxtend that referenced this issue Jan 2, 2020