-
Notifications
You must be signed in to change notification settings - Fork 877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apriori is not Apriori. It is missing all the nice optimzations #644
Comments
Yes, it is currently a naive implementation and could be optimized. |
@kno10 Are your benchmark scripts and data sets available somewhere? |
A collection of standard benchmark data sets for FIM is here: |
That's a nice collection, thanks for sharing! Not sure what the sharing policies are exactly, but it would be interesting to incorporate one or two of the smaller ones in the unit tests as well. Also, for aiding the potential improvement of existing implementations, there could be a developer-script added to |
The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), there is no point to check itemset p+(q.k-1,) because it cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.
The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), there is no point to check itemset p+(q.k-1,) because it cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.
The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), there is no point to check itemset p+(q.k-1,) because it cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.
The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), itemset p+(q.k-1,) cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.
The apriori-gen function described in section 2.1.1 of Apriori paper has two steps; first, the join step looks for itemsets with the same prefix, and creates new candidates by appending all pairs combinations to this prefix. Here is pseudocode copied from paper: select p.1, p.2, ..., p.k-1, q.k-1 from p in L(k-1), q in L(k-1) where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1 The reason is that if a sequence q with the same prefix as p does not belong to L(k-1), itemset p+(q.k-1,) cannot be frequent. Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1. The second step of apriori-gen function is called prune step, it will be implemented in a distinct commit. See discussion in rasbt#644.
The current apriori implementation is really really slow.
It does not correctly implement the apriori-gen and pruning parts of Apriori. So it is not really apriori as it was published, but a naive approximation to it. The current codes uses (previous itemsets) x (items) to generate tuples, which will generate many false positives that the original Apriori would avoid to generate in the first place (and hence, not need to count later).
For good performance, it is crucial to generate only very few candidates.
As such, this implementation benchmarks particularly poor on a data set with 33841 transactions and 1587 items, and minsupp=0.005 (Note that apyrori also does not correctly implement Apriori).
Note that I wanted to use minsupp=0.001, but eventually interrupted mlxtend because it was too slow. So these results are now with 5x higher minsupp. Experiments were run on Google Colab, so no guarantees on the reliability of these measurements; but they repeatedly gave similar results with a similar data set before.
This is not just because pyFIM is a C library, but also because it really implements the Apriori algorithm with some of its clever optimizations (c.f. apriori-gen in the original paper; join+prune; as well as optimizations to counting each iteration). The ELKI implementation does not implement the hash tree of the original Apriori, so the 7 seconds could also be a bit better...
And yes, I know that FPgrowth is much better. pyFIM has a very nice implementation of it, too; it took just 90 ms for this data.
For example at length 3 there are 2802 frequent itemsets, containing 719 different items. The mlxtend code will generate 2802*719=2014638 candidates from this. The real apriori algorithm joins 40095 cadidates, which are pruned to just 5737 candidates before going to the data again. So the mlxtend code has to check about 350x as many candidates as necessary compared to the original Apriori algorithm in this step. This will, of course, have a major impact on runtime.
The text was updated successfully, but these errors were encountered: