[SPARK-8997][MLlib]Performance improvements in LocalPrefixSpan #7360
Test build #37103 has finished for PR 7360 at commit
@@ -42,22 +44,20 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable {
  def run(
      minCount: Long,
      maxPatternLength: Int,
      prefix: Array[Int],
      projectedDatabase: Array[Array[Int]]): Array[(Array[Int], Long)] = {
      prefix: ArrayBuffer[Int],
ArrayBuilder is better than ArrayBuffer for Int. The latter is not specialized for Int and hence has boxing/unboxing overhead. But here, we may want to consider List to avoid re-allocating buffers. The cost is that we have to reverse the list (maybe not), e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPTree.scala#L72.
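To make the trade-off in this comment concrete, here is a minimal sketch, not code from the PR. The object name PrefixAccumulation and its method names are hypothetical. It assumes the prefix is kept as a List in reverse order, so extending it is a constant-time prepend that shares the existing tail, and it is only reversed when a pattern is emitted.

```scala
import scala.collection.mutable

// Hypothetical helper, for illustration only; not part of LocalPrefixSpan.
object PrefixAccumulation {

  // ArrayBuilder.ofInt is the Int-specific builder: it appends into a
  // primitive int array, so no boxing happens while the items are added.
  def withArrayBuilder(items: Seq[Int]): Array[Int] = {
    val builder = new mutable.ArrayBuilder.ofInt
    items.foreach(item => builder += item)
    builder.result()
  }

  // Prepending to an immutable List allocates one cell and reuses the old
  // list as its tail, so sibling prefixes never copy a shared buffer.
  // The newest item sits at the head, i.e. the prefix is stored reversed.
  def extend(reversedPrefix: List[Int], item: Int): List[Int] =
    item :: reversedPrefix

  // Restore the original item order only when the pattern is emitted.
  def materialize(reversedPrefix: List[Int]): Array[Int] =
    reversedPrefix.reverse.toArray
}
```

For example, extending an empty prefix with 1 and then 2 gives List(2, 1), and materialize turns it back into Array(1, 2). Note that List[Int] still stores boxed elements; what it avoids here is re-allocating and copying a buffer on every extension.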
OK
@mengxr Updated; I wonder if
Test build #37153 has finished for PR 7360 at commit
Test build #37163 has finished for PR 7360 at commit
    val prefixProjectedDatabases = getPatternAndProjectedDatabase(
      prefix, frequentPrefixAndCounts.map(_._1), projectedDatabase)
      prefix: List[Int],
      database: Iterable[Array[Int]]): Iterator[(Array[Int], Long)] = {
database should be an Array[Array[Int]], because we need to access it multiple times. The return type should be Iterator[(List[Int], Long)].
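The following is a simplified sketch of why both suggestions fit together; it is not the implementation merged in this PR. The object LocalPrefixSpanSketch, the project helper, and the parameter name reversedPrefix are hypothetical, and sequences are assumed to be flat arrays of items. The database is read once to count items and then once more per frequent item to project it, which is why an Array is convenient, while patterns are returned lazily as (List[Int], Long) pairs.

```scala
import scala.collection.mutable

// Hypothetical, simplified sketch; not Spark's LocalPrefixSpan.
object LocalPrefixSpanSketch {

  // Suffix of `sequence` after the first occurrence of `item`,
  // or an empty array when `item` does not occur.
  def project(sequence: Array[Int], item: Int): Array[Int] = {
    val idx = sequence.indexOf(item)
    if (idx == -1) Array.empty[Int] else sequence.drop(idx + 1)
  }

  // Patterns are emitted with the prefix in reverse order (newest item first).
  def run(
      minCount: Long,
      maxPatternLength: Int,
      reversedPrefix: List[Int],
      database: Array[Array[Int]]): Iterator[(List[Int], Long)] = {
    if (reversedPrefix.length >= maxPatternLength || database.isEmpty) {
      Iterator.empty
    } else {
      // First pass over the database: count the sequences containing each item.
      val counts = mutable.Map.empty[Int, Long].withDefaultValue(0L)
      database.foreach { sequence =>
        sequence.distinct.foreach { item => counts(item) += 1L }
      }
      val frequent = counts.filter(_._2 >= minCount).toSeq.sortBy(_._1)

      // One further pass per frequent item to build its projected database.
      frequent.iterator.flatMap { case (item, count) =>
        val newPrefix = item :: reversedPrefix
        val projected = database.map(project(_, item)).filter(_.nonEmpty)
        // `++` on Iterator takes its argument by name, so the recursive
        // call only runs once the caller consumes past this pattern.
        Iterator.single((newPrefix, count)) ++
          run(minCount, maxPatternLength, newPrefix, projected)
      }
    }
  }
}
```

Calling run(2L, 2, Nil, Array(Array(1, 2, 3), Array(1, 3), Array(2, 3))) and reversing each prefix yields the patterns [1], [1, 3], [2], [2, 3] and [3]. A single-pass Iterable could not safely be traversed once for counting and again for each projection.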
@feynmanliang Please also remove the
Test build #37285 has finished for PR 7360 at commit
@feynmanliang I sent you some updates at feynmanliang#1. Please review and merge it if it looks good to you. Thanks!
Btw, I have some ideas about how to improve it. Basically, instead of getting suffixes in projection, we can actually scan from right to left for each sequence and get prefixes. Then in cc @zhangjiajin
update LocalPrefixSpan impl
Test build #37315 has finished for PR 7360 at commit
LGTM. Merged into master. Thanks!
Improves the performance of LocalPrefixSpan by implementing optimizations proposed in SPARK-8997