
[SPARK-21050][ML] Word2vec persistence overflow bug fix #18265

Closed
wants to merge 3 commits

Conversation

jkbradley
Member

What changes were proposed in this pull request?

The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to overflow when calculating the number of partitions for ML persistence.

This modifies the calculations to use Long.
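
For illustration, a minimal sketch of the failure mode (the variable names mirror `calculateNumberOfPartitions`, but the model dimensions below are invented for the example): with 32-bit arithmetic the intermediate byte count silently wraps, while the same formula in `Long` stays exact.

```scala
// Hypothetical model dimensions, chosen only to trigger the wrap-around.
val floatSize = 4
val averageWordSize = 15
val vectorSize = 3000
val numWords = 2000000

// Int arithmetic: (4 * 3000 + 15) * 2,000,000 = 24,030,000,000 does not fit in
// 32 bits and wraps to -1,739,803,776, so any partition count derived from it is garbage.
val sizeAsInt: Int = (floatSize * vectorSize + averageWordSize) * numWords

// Long arithmetic (what this patch switches to): the same formula stays exact.
val sizeAsLong: Long = (floatSize.toLong * vectorSize + averageWordSize) * numWords
```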

How was this patch tested?

New unit test. I verified that the test fails before this patch.

@jkbradley
Member Author

CC @Krimit and @srowen who had worked on the previous related patch

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77886 has finished for PR 18265 at commit 2048c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
+      val approximateSizeInBytes = (floatSize * vectorSize + averageWordSize) * numWords
+      val numPartitions = (approximateSizeInBytes / bufferSizeInBytes) + 1
+      require(numPartitions < 10e8, s"Word2VecModel calculated that it needs $numPartitions " +
```
Contributor

Is this failure truly necessary? Can we make this a WARN and use a best-attempt partition count (Int.MAX?) instead? The models that would fail here would be so huge that they likely took days to train.

Member Author

I'm pretty sure it is necessary. If we cap it at Int.MAX and the user hits that cap, then it means that we'll fail when trying to write the partitions.
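
For reference, a rough sketch of the WARN-and-cap alternative being discussed (not what this patch does; it assumes a `Long` `numPartitions` and the `logWarning` helper from Spark's `Logging` trait):

```scala
// Rejected alternative: cap at Int.MaxValue and warn instead of failing fast.
// As noted above, a model that actually needed this many partitions would
// still fail later, when the writer tries to produce them.
val cappedPartitions: Int =
  if (numPartitions > Int.MaxValue) {
    logWarning(s"Word2VecModel wants $numPartitions partitions; capping at Int.MaxValue")
    Int.MaxValue
  } else {
    numPartitions.toInt
  }
```

The patch keeps the `require` instead, so the failure happens up front rather than partway through the write.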

```diff
@@ -188,6 +188,15 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
     assert(math.abs(similarity(5) - similarityLarger(5) / similarity(5)) > 1E-5)
   }
 
+  test("Word2Vec read/write numPartitions calculation") {
```
Contributor

Should this test hardcode a specific spark.kryoserializer.buffer.max? That could allow us to be more explicit in the assertions

Member Author

Good point; I'll do that.
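
As a rough sketch, the hard-coded version of the test might look something like this (the word count and vector size here are illustrative, and it assumes `calculateNumberOfPartitions` reads the buffer size from `sc.conf`); the actual diff context follows below:

```scala
test("Word2Vec read/write numPartitions calculation") {
  // Pin the buffer size so the expected partition counts are deterministic.
  sc.conf.set("spark.kryoserializer.buffer.max", "64m")
  val tinyModelNumPartitions = Word2VecModel.Word2VecModelWriter.calculateNumberOfPartitions(
    sc, numWords = 10, vectorSize = 5)
  assert(tinyModelNumPartitions === 1)
}
```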

```scala
    assert(tinyModelNumPartitions === 1)
    val mediumModelNumPartitions = Word2VecModel.Word2VecModelWriter.calculateNumberOfPartitions(
      sc, numWords = 1000000, vectorSize = 5000)
    assert(mediumModelNumPartitions > 1)
```
Contributor

What about a test for a truly large model that would have otherwise caused an overflow?

Member Author

The "medium" one did cause an overflow.

@Krimit
Contributor

Krimit commented Jun 11, 2017

Thanks @jkbradley! I'm really curious about how this came to your attention. Did somebody actually encounter this bug? For this bug to come up, the model being trained would have to be truly monstrous, in combination with a very low spark.kryoserializer.buffer.max value. With the default buffer size of 64m, the model would have to have roughly numWords = 100,000,000 and vectorSize = 340, which would mean the model takes up ~137GB.
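
The quoted arithmetic checks out: in the formula, each word costs 4 * 340 + 15 = 1,375 bytes, so 100,000,000 words come to roughly 137.5 GB. A quick sketch (the 64m buffer is the default mentioned above; everything else is taken from that comment):

```scala
val perWordBytes = 4L * 340 + 15            // 1,375 bytes per word in the formula
val modelBytes = perWordBytes * 100000000L  // 137,500,000,000 bytes (~137.5 GB)
val bufferSizeInBytes = 64L * 1024 * 1024   // default spark.kryoserializer.buffer.max = 64m

// With Long math, roughly 2,049 partitions are needed.
val partitionsNeeded = (modelBytes / bufferSizeInBytes) + 1

// With the old Int math, the size wraps to 61,046,528 (~58 MB), so the writer
// would try to save the whole ~137 GB model into a single partition.
val wrappedBytes: Int = (4 * 340 + 15) * 100000000
val brokenPartitions = (wrappedBytes / bufferSizeInBytes) + 1  // 1
```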

@jkbradley
Member Author

Yep, someone hit the bug!

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77916 has started for PR 18265 at commit 6bcf66f.

@jkbradley
Member Author

Looks like a spurious failure; retesting.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #3793 has finished for PR 18265 at commit 6bcf66f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@Krimit Thanks for taking a look! Does it look ready to merge now?

@jkbradley
Member Author

I'm going to call this ready...but please say if you see other fixes I should make. Thanks!

Merging with master and branch-2.2

asfgit pushed a commit that referenced this pull request Jun 12, 2017
Author: Joseph K. Bradley <[email protected]>

Closes #18265 from jkbradley/word2vec-save-fix.

(cherry picked from commit ff318c0)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit closed this in ff318c0 Jun 12, 2017
@Krimit
Contributor

Krimit commented Jun 13, 2017

Cool, LGTM @jkbradley

dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017