[SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib #2847
Can one of the admins verify this patch? |
Had an offline discussion with @jackylk . We plan to implement a more scalable version of Apriori, as described in PFP: Parallel FP-Growth for Query Recommendation (http://dl.acm.org/citation.cfm?id=1454027) |
As mentioned in one of the comments on SPARK-2432, I was wondering how the PFP version compares with YAFIM (http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf). I will probably do a bit more reading on this. |
Maybe it is better to use RDD[BitSet] as the transactions RDD? Then you can add a preprocessor trait that transforms any source RDD into an RDD of BitSets, for example RDD[Array[String]] to RDD[BitSet]. An even better idea is to create a Transaction entity that contains its BitSet representation and all necessary convenience methods. Then anyone could write a preprocessor from RDD[...Any Type...] to RDD[Transaction]. |
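A minimal sketch of the BitSet encoding suggested above, using plain Scala collections in place of RDDs so that it runs without Spark. The object and method names (`BitSetTransactions`, `buildIndex`, `encode`, `support`) are hypothetical and not part of this PR; on an RDD the same `encode` function could be applied with `map` once the index is available on the workers.

```scala
import scala.collection.immutable.BitSet

object BitSetTransactions {
  // Assign each distinct item a dense integer id, in order of first appearance.
  def buildIndex(transactions: Seq[Array[String]]): Map[String, Int] =
    transactions.flatMap(_.toSeq).distinct.zipWithIndex.toMap

  // Encode one transaction as a BitSet of item ids; unknown items are skipped.
  def encode(transaction: Array[String], index: Map[String, Int]): BitSet =
    transaction.foldLeft(BitSet.empty) { (bits, item) =>
      index.get(item).fold(bits)(bits + _)
    }

  // Support of a candidate itemset: number of transactions that contain it.
  def support(candidate: BitSet, encoded: Seq[BitSet]): Int =
    encoded.count(t => candidate.subsetOf(t))

  // Hypothetical sample data for illustration only.
  val raw: Seq[Array[String]] = Seq(
    Array("bread", "milk"),
    Array("bread", "butter"),
    Array("milk", "bread", "butter")
  )
  val index: Map[String, Int] = buildIndex(raw)
  val encoded: Seq[BitSet] = raw.map(t => encode(t, index))
}
```

With this representation the containment check for a candidate itemset is a single `subsetOf` call on bit masks rather than a comparison of string collections.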
As long as itemset mining is under consideration, has anybody tried a Spark implementation of "Logical Itemset Mining": |
Do you use the SON algorithm for the parallel Apriori implementation? |
Had an offline discussion with @jackylk and here is the summary:
|
add to whitelist |
Test build #25596 has started for PR 2847 at commit
|
Test build #25596 has finished for PR 2847 at commit
|
Test FAILed. |
Test build #25742 has started for PR 2847 at commit
|
Yes, I have tested the parallel FP-Growth algorithm using an open data set from http://fimi.ua.ac.be/data/. The performance test results can be found at https://issues.apache.org/jira/browse/SPARK-4001. All modifications are done except for the 7th (generic type); please review the code for now. |
Test build #25742 has finished for PR 2847 at commit
|
Test FAILed. |
Please test again |
Test build #25752 has started for PR 2847 at commit
|
Test build #25752 has finished for PR 2847 at commit
|
}

// Sort it and create the item combinations
val sortedItems = items.sortWith(_._1 > _._1).sortWith(_._2 > _._2).toArray
Why sort twice? The second sort will override the first. Besides, using sortBy(-_._2) would be better.
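A small sketch of the single-sort alternative, assuming `items` is a sequence of (item, count) pairs; the element types and sample values here are hypothetical, since the diff does not show them. Scala's `sortWith` and `sortBy` are stable, so in the two-pass version the first sort only survives as a tie-breaker among equal counts. If that tie-break matters, it can be stated explicitly in one comparison.

```scala
object SortOnceSketch {
  // Hypothetical (item, count) pairs standing in for `items` in the diff.
  val items: Seq[(String, Int)] = Seq(("a", 3), ("b", 5), ("c", 3), ("d", 1))

  // Single pass: descending by count; ties keep their input order (stable sort).
  val byCount: Array[(String, Int)] = items.sortBy(-_._2).toArray

  // Single pass with an explicit tie-break: descending count, then descending item.
  val byCountThenItem: Array[(String, Int)] =
    items.sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 > y._1)).toArray
}
```

The second form produces the same order as the original two chained `sortWith` calls, but makes the intended ordering visible in one place.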
@jackylk I made a brief scan of the implementation. Besides inline comments, I have some high-level suggestions:
|
@mengxr . I am working with Jacky together to develop and test this algorithm. I answered this question: |
By "reduce", did you mean skipping the process of growing trees? The FP-Growth algorithm reduces memory requirement using the tree representation of candidate sets. If we skip this step, it is hard to call it
It is important to grow the tree on the mapper side to save communication cost. |
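To illustrate why the tree representation saves memory and communication, here is a minimal prefix-tree sketch in plain Scala. It is not the FPTree class from this PR; `Node`, `Tree`, `add`, and `nodeCount` are hypothetical names, and transactions are assumed to be already filtered to frequent items and sorted by descending global frequency.

```scala
import scala.collection.mutable

object FPTreeSketch {
  // One tree node: an item, how many transactions pass through it, and its children.
  final class Node(val item: String) {
    var count: Int = 0
    val children: mutable.Map[String, Node] = mutable.Map.empty
  }

  final class Tree {
    val root = new Node("") // the root carries no item

    // Insert one transaction; shared prefixes reuse existing nodes.
    def add(transaction: Seq[String]): Unit = {
      var node = root
      for (item <- transaction) {
        node = node.children.getOrElseUpdate(item, new Node(item))
        node.count += 1
      }
    }

    // Number of item nodes in the tree, excluding the root.
    def nodeCount: Int = {
      def size(n: Node): Int = 1 + n.children.values.map(size).sum
      size(root) - 1
    }
  }

  // Hypothetical sample: 7 item occurrences across 3 transactions.
  val tree = new Tree
  Seq(Seq("a", "b", "c"), Seq("a", "b"), Seq("a", "c")).foreach(tree.add)
}
```

In this sample the three transactions contain seven item occurrences but collapse into four tree nodes, because the common prefix is stored once with a count.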
@mengxr |
The advantage of FP-Growth over Apriori is the tree structure used to represent the candidate set. Both algorithms take advantage of the fact that the candidate set is small. I'm asking whether the current implementation uses the tree structure to save communication.
I'm not surprised by the 10x speed-up, but that is not the same as saying the current implementation is correct and high-performance. I believe that we can be much faster.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions |
Had an offline discussion with @jackylk and @zhangyouhua2014 . We plan to add a utility class named
and the use |
Test build #26406 has started for PR 2847 at commit
|
Test build #26406 has finished for PR 2847 at commit
|
Test FAILed. |
@mengxr |
Test build #26407 has started for PR 2847 at commit
|
Test build #26407 has finished for PR 2847 at commit
|
Test PASSed. |
@jackylk Thanks for the update! Did you see any performance improvement on your dataset with |
I have not tested performance yet. I will test it over the weekend. |
simplify FPTree and update FPGrowth
Test build #26486 has started for PR 2847 at commit
|
Test build #26486 has finished for PR 2847 at commit
|
Test FAILed. |
LGTM. Merged into master. Thanks!! (The failed test is a known flakey test. All relevant tests passed.) |
Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful to add the Apriori algorithm to MLlib in Spark. This PR adds an implementation of it.
There is one point where I am not sure whether the approach is the most efficient. To filter out the eligible frequent itemsets, I currently use a cartesian operation on two RDDs to calculate the support of each itemset; I am not sure whether it would be better to use a broadcast variable to achieve the same.
I will add an example of using this algorithm if required.
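On the cartesian versus broadcast question, the sketch below shows the shape of the broadcast-style alternative using local Scala collections so it runs without Spark. All names (`SupportCountSketch`, `countSupport`, `frequent`) are hypothetical. In Spark, the assumption is that the candidate collection is small enough to ship to every worker as a broadcast variable, with the per-transaction step run as a `flatMap` over the transactions RDD and the grouping step done with `reduceByKey`.

```scala
object SupportCountSketch {
  type Itemset = Set[String]

  // For each transaction, emit (candidate, 1) for every candidate it contains,
  // then sum per candidate. Candidates with zero support do not appear.
  def countSupport(transactions: Seq[Itemset], candidates: Seq[Itemset]): Map[Itemset, Int] =
    transactions
      .flatMap(t => candidates.filter(_.subsetOf(t)).map(c => c -> 1))
      .groupBy(_._1)
      .map { case (candidate, ones) => candidate -> ones.size }

  // Keep only candidates that reach the minimum support count.
  def frequent(transactions: Seq[Itemset], candidates: Seq[Itemset], minCount: Int): Map[Itemset, Int] =
    countSupport(transactions, candidates).filter(_._2 >= minCount)

  // Hypothetical sample data for illustration only.
  val transactions: Seq[Itemset] = Seq(Set("a", "b"), Set("a", "c"), Set("a", "b", "c"))
  val candidates: Seq[Itemset] = Seq(Set("a", "b"), Set("b", "c"), Set("c", "d"))
}
```

Compared with a cartesian product, this form never materializes every (transaction, candidate) pair; only the pairs where the candidate is actually contained are emitted.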