[SPARK-8998][MLlib] Distribute PrefixSpan computation for large projected databases #7783

feynmanliang · 2015-07-30T05:47:14Z

Continuation of work by @zhangjiajin

Closes #7412

Use PrefixSpan.scala instead of Prefixspan.scala. Delete Prefixspan.scala

Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala, Delete PrefixspanSuite.scala.


Initialize local master branch.


[Spark-8998]Collect Enough Prefixes Improvements

feynmanliang · 2015-07-30T05:53:24Z

@mengxr I've made the lineage changes as requested, but have concerns about the scalability of these changes. See in-line comments.

I would rather keep the longer lineage chain than accept the potential loss of scalability that these changes introduce.

feynmanliang · 2015-07-30T06:01:16Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

+      pairsForDistributed = largerPairsPart
+      pairsForDistributed.persist(StorageLevel.MEMORY_AND_DISK)
+      pairsForLocal ++= smallerPairsPart
+      resultsAccumulator ++= nextPatternAndCounts.collect()


This will cause all results, except those generated from pairsForLocal, to be collected to the driver, since we continue processing until pairsForDistributed is empty.

The collected results could be many times the size of the dataset, since a length-k sequence has up to 2^k subsequences.

mengxr · 2015-07-30

That is the worst case. We should assume that the number of frequent patterns is small. Having 1 billion frequent patterns doesn't provide any useful insight, so users should start with a high minSupport and collect just enough frequent patterns.
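To make the trade-off in this thread concrete, here is a minimal brute-force sketch in plain Scala (no Spark). It is not the PrefixSpan algorithm and not code from this PR: the object `FrequentSubsequenceCount` and its methods are hypothetical names introduced only for illustration, and sequences are assumed to be simple `Seq[Int]` values. It shows the 2^k growth in the worst case and how raising the support threshold shrinks the set of patterns that would end up on the driver.

```scala
object FrequentSubsequenceCount {

  /** All distinct non-empty subsequences of `seq` (order-preserving, not necessarily contiguous). */
  def subsequences(seq: Seq[Int]): Set[Seq[Int]] =
    seq.foldLeft(Set(Seq.empty[Int])) { (acc, item) =>
      // Every existing subsequence is kept as-is and also extended by `item`.
      acc ++ acc.map(_ :+ item)
    } - Seq.empty[Int]

  /** Patterns contained in at least `minCount` sequences of `db`, with their counts. */
  def frequentPatterns(db: Seq[Seq[Int]], minCount: Int): Map[Seq[Int], Int] =
    db.flatMap(subsequences)        // each sequence contributes each pattern at most once
      .groupBy(identity)
      .map { case (pattern, occurrences) => pattern -> occurrences.size }
      .filter { case (_, count) => count >= minCount }
}
```

With a single sequence of 10 distinct items, `subsequences` returns 2^10 - 1 = 1023 non-empty patterns. For the toy database `Seq(Seq(1, 2, 3), Seq(1, 3), Seq(2, 3))`, `frequentPatterns` returns 7 patterns at `minCount = 1`, 5 at `minCount = 2`, and only `Seq(3)` at `minCount = 3`, which is the effect of starting with a high minSupport.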

SparkQA · 2015-07-30T06:14:15Z

Test build #39000 has finished for PR 7783 at commit 4ddf479.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

feynmanliang · 2015-07-30T06:18:52Z

Jenkins test this please

SparkQA · 2015-07-30T06:37:29Z

Test build #39004 has finished for PR 7783 at commit a61943d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T06:49:56Z

Test build #39010 has finished for PR 7783 at commit a61943d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T06:58:07Z

Test build #158 has finished for PR 7783 at commit a61943d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-07-30T15:14:30Z

LGTM. Merged into master. Thanks!

zhangjiajin and others added 30 commits July 7, 2015 15:30

Add new algorithm PrefixSpan and test file.

91fd7e6

Modified the code according to the review comments.

575995f

Delete Prefixspan.scala

951fd42

Use PrefixSpan.scala instead of Prefixspan.scala. Delete Prefixspan.scala

Delete PrefixspanSuite.scala

a2eb14c

Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala, Delete PrefixspanSuite.scala.

Fixed a Scala style error.

89bc368

Modified the code according to the review comments.

1dd33ad

Fix some Scala style errors.

4c60fb3

Fix a Scala style error.

ba5df34

Add new object LocalPrefixSpan, and do some optimization.

574e56c

Modified the code according to the review comments.

ca9c4c8

Add feature: Collect enough frequent prefixes before projection in PrefixSpan.

22b0ef4

fix a scala style error.

078d410

initialize file before rebase.

4dd1c8a

Merge branch 'master' of https://github.com/apache/spark

a8fde87

Initialize local master branch.

Add feature: Collect enough frequent prefixes before projection in PrefixSpan

6560c69

Modified the code according to the review comments.

baa2885

Modified the code according to the review comments.

095aa3a

Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes

b07e20c

remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.

d2250b7

Modified codes according to comments.

64271b3

Fix splitPrefixSuffixPairs

6e149fa

Add getters

01c9ae9

Inline code for readability

cb2a4fc

Use lists for prefixes to reuse data

da0091b

Use Iterable[Array[_]] over Array[Array[_]] for database

1235cfc

Readability improvements and comments

c2caa5c

Improve extend prefix readability

87fa021

Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal

ad23aa9

[Spark-8998]Collect Enough Prefixes Improvements

Parallelize freqItemCounts

4ddf479

Collect small patterns to local

a61943d

feynmanliang reviewed Jul 30, 2015

asfgit closed this in d212a31 Jul 30, 2015

feynmanliang deleted the SPARK-8998-improve-distributed branch August 17, 2015 19:13
