[SPARK-8998][MLlib] Distribute PrefixSpan computation for large projected databases #7783
Use PrefixSpan.scala instead of Prefixspan.scala. Delete Prefixspan.scala
Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala, Delete PrefixspanSuite.scala.
Initialize local master branch.
[Spark-8998]Collect Enough Prefixes Improvements
@mengxr I've made the lineage changes as requested, but I have concerns about their scalability. See the in-line comments. I would rather keep the longer lineage chain than accept the potential loss of scalability these changes introduce.
pairsForDistributed = largerPairsPart
pairsForDistributed.persist(StorageLevel.MEMORY_AND_DISK)
pairsForLocal ++= smallerPairsPart
resultsAccumulator ++= nextPatternAndCounts.collect()
This will cause all results, except those generated from pairsForLocal, to be collected to the driver, since we continue processing until pairsForDistributed is empty.
The collected results could be many times the size of the dataset, since a length-k sequence has up to 2^k subsequences.
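To make the 2^k bound concrete, here is a minimal plain-Scala sketch (no Spark involved). The object and method names are hypothetical and not part of this PR. It counts the empty subsequence and assumes a simple sequence of distinct items, which is the worst case the bound describes.

```scala
object SubsequenceBound {
  // All order-preserving subsequences of `items`, including the empty one.
  // Each item either joins every subsequence built so far or is skipped,
  // so the result doubles in size at every step: 2^k for k items.
  def subsequences[A](items: Seq[A]): Seq[Seq[A]] =
    items.foldLeft(Seq(Seq.empty[A])) { (acc, item) =>
      acc ++ acc.map(_ :+ item)
    }

  def main(args: Array[String]): Unit = {
    val seq = Seq("a", "b", "c", "d")
    val subs = subsequences(seq)
    // prints "length 4 -> 16 subsequences"
    println(s"length ${seq.length} -> ${subs.size} subsequences")
  }
}
```

If most of these turn out to be frequent, the array returned by collect() grows with the number of patterns rather than with the number of input sequences.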
That is the worst case. We should assume that the number of frequent patterns is small. Having 1 billion frequent patterns doesn't provide any useful insight, so users should start with a high minSupport and collect just enough frequent patterns.
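A rough illustration of why a high minSupport keeps the result small, as a brute-force plain-Scala sketch rather than the PrefixSpan implementation in this PR. The names are hypothetical, sequences are simplified to flat lists of integers, and minSupport is assumed to be a fraction of the number of sequences.

```scala
object MinSupportSketch {
  // Distinct non-empty subsequences of one sequence (brute force, exponential).
  def nonEmptySubsequences(s: Seq[Int]): Set[Seq[Int]] =
    s.foldLeft(Set(Seq.empty[Int])) { (acc, x) => acc ++ acc.map(_ :+ x) } - Seq.empty[Int]

  // Patterns contained in at least a `minSupport` fraction of the sequences,
  // mapped to the number of sequences that contain them.
  def frequentPatterns(db: Seq[Seq[Int]], minSupport: Double): Map[Seq[Int], Int] = {
    val minCount = math.ceil(minSupport * db.size).toInt
    db.flatMap(nonEmptySubsequences) // each sequence contributes a pattern at most once
      .groupBy(identity)
      .map { case (pattern, occurrences) => pattern -> occurrences.size }
      .filter { case (_, count) => count >= minCount }
  }

  def main(args: Array[String]): Unit = {
    val db = Seq(Seq(1, 2, 3), Seq(1, 2), Seq(1, 3), Seq(2, 3))
    for (minSupport <- Seq(0.25, 0.5, 0.75)) {
      val n = frequentPatterns(db, minSupport).size
      println(s"minSupport=$minSupport -> $n frequent patterns")
    }
  }
}
```

On this four-sequence toy database the pattern count drops from 7 to 6 to 3 as minSupport rises from 0.25 to 0.5 to 0.75, which is the behaviour the suggestion above relies on.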
Test build #39000 has finished for PR 7783 at commit
Jenkins test this please
Test build #39004 has finished for PR 7783 at commit
Test build #39010 has finished for PR 7783 at commit
Test build #158 has finished for PR 7783 at commit
LGTM. Merged into master. Thanks!
Continuation of work by @zhangjiajin
Closes #7412