
[SPARK-21050][ML] Word2vec persistence overflow bug fix #18265

Closed
wants to merge 3 commits

Conversation

jkbradley
Member

What changes were proposed in this pull request?

The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to overflow when calculating the number of partitions for ML persistence.

This modifies the calculations to use Long.
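
For illustration, a minimal sketch of the failure mode (the variable names mirror `calculateNumberOfPartitions`, but the model dimensions below are invented for the example): with 32-bit arithmetic the intermediate byte count silently wraps, while the same formula in `Long` stays exact.

```scala
// Hypothetical model dimensions, chosen only to trigger the wrap-around.
val floatSize = 4
val averageWordSize = 15
val vectorSize = 3000
val numWords = 2000000

// Int arithmetic: (4 * 3000 + 15) * 2,000,000 = 24,030,000,000 does not fit in
// 32 bits and wraps to -1,739,803,776, so any partition count derived from it is garbage.
val sizeAsInt: Int = (floatSize * vectorSize + averageWordSize) * numWords

// Long arithmetic (what this patch switches to): the same formula stays exact.
val sizeAsLong: Long = (floatSize.toLong * vectorSize + averageWordSize) * numWords
```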

How was this patch tested?

New unit test. I verified that the test fails before this patch.

@jkbradley
Member Author

CC @Krimit and @srowen who had worked on the previous related patch

@SparkQA

SparkQA commented Jun 11, 2017

Test build #77886 has finished for PR 18265 at commit 2048c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
+      val approximateSizeInBytes = (floatSize * vectorSize + averageWordSize) * numWords
+      val numPartitions = (approximateSizeInBytes / bufferSizeInBytes) + 1
+      require(numPartitions < 10e8, s"Word2VecModel calculated that it needs $numPartitions " +
```
Contributor

Is this failure truly necessary? Can we make this a WARN and use a best-attempt partition count (Int.MAX?) instead? The models that would fail here would be so huge that they likely took days to train.

Member Author

I'm pretty sure it is necessary. If we cap it at Int.MAX and the user hits that cap, then it means that we'll fail when trying to write the partitions.
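
For reference, a rough sketch of the WARN-and-cap alternative being discussed (not what this patch does; it assumes a `Long` `numPartitions` and the `logWarning` helper from Spark's `Logging` trait):

```scala
// Rejected alternative: cap at Int.MaxValue and warn instead of failing fast.
// As noted above, a model that actually needed this many partitions would
// still fail later, when the writer tries to produce them.
val cappedPartitions: Int =
  if (numPartitions > Int.MaxValue) {
    logWarning(s"Word2VecModel wants $numPartitions partitions; capping at Int.MaxValue")
    Int.MaxValue
  } else {
    numPartitions.toInt
  }
```

The patch keeps the `require` instead, so the failure happens up front rather than partway through the write.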

```diff
@@ -188,6 +188,15 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
     assert(math.abs(similarity(5) - similarityLarger(5) / similarity(5)) > 1E-5)
   }
 
+  test("Word2Vec read/write numPartitions calculation") {
```
Contributor

Should this test hardcode a specific spark.kryoserializer.buffer.max? That could allow us to be more explicit in the assertions

Member Author

Good point; I'll do that.
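
As a rough sketch, the hard-coded version of the test might look something like this (the word count and vector size here are illustrative, and it assumes `calculateNumberOfPartitions` reads the buffer size from `sc.conf`); the actual diff context follows below:

```scala
test("Word2Vec read/write numPartitions calculation") {
  // Pin the buffer size so the expected partition counts are deterministic.
  sc.conf.set("spark.kryoserializer.buffer.max", "64m")
  val tinyModelNumPartitions = Word2VecModel.Word2VecModelWriter.calculateNumberOfPartitions(
    sc, numWords = 10, vectorSize = 5)
  assert(tinyModelNumPartitions === 1)
}
```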

```scala
    assert(tinyModelNumPartitions === 1)
    val mediumModelNumPartitions = Word2VecModel.Word2VecModelWriter.calculateNumberOfPartitions(
      sc, numWords = 1000000, vectorSize = 5000)
    assert(mediumModelNumPartitions > 1)
```
Contributor

What about a test for a truly large model that would have otherwise caused an overflow?

Member Author

The "medium" one did cause an overflow.

@Krimit
Contributor

Krimit commented Jun 11, 2017

Thanks @jkbradley! I'm really curious about how this came to your attention. Did somebody actually encounter this bug? For this bug to come up, the model being trained would have to be truly monstrous, in combination with a very low spark.kryoserializer.buffer.max value. With the default buffer size of 64m, the model would have to have roughly numWords = 100,000,000 and vectorSize = 340, which would mean the model takes up ~137GB.
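
The quoted arithmetic checks out: in the formula, each word costs 4 * 340 + 15 = 1,375 bytes, so 100,000,000 words come to roughly 137.5 GB. A quick sketch (the 64m buffer is the default mentioned above; everything else is taken from that comment):

```scala
val perWordBytes = 4L * 340 + 15            // 1,375 bytes per word in the formula
val modelBytes = perWordBytes * 100000000L  // 137,500,000,000 bytes (~137.5 GB)
val bufferSizeInBytes = 64L * 1024 * 1024   // default spark.kryoserializer.buffer.max = 64m

// With Long math, roughly 2,049 partitions are needed.
val partitionsNeeded = (modelBytes / bufferSizeInBytes) + 1

// With the old Int math, the size wraps to 61,046,528 (~58 MB), so the writer
// would try to save the whole ~137 GB model into a single partition.
val wrappedBytes: Int = (4 * 340 + 15) * 100000000
val brokenPartitions = (wrappedBytes / bufferSizeInBytes) + 1  // 1
```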

@jkbradley
Member Author

Yep, someone hit the bug!

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77916 has started for PR 18265 at commit 6bcf66f.

@jkbradley
Member Author

Looks like a spurious failure; retesting.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #3793 has finished for PR 18265 at commit 6bcf66f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@Krimit Thanks for taking a look! Does it look ready to merge now?

@jkbradley
Member Author

I'm going to call this ready...but please say if you see other fixes I should make. Thanks!

Merging with master and branch-2.2

asfgit pushed a commit that referenced this pull request Jun 12, 2017
Author: Joseph K. Bradley <[email protected]>

Closes #18265 from jkbradley/word2vec-save-fix.

(cherry picked from commit ff318c0)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit closed this in ff318c0 Jun 12, 2017
@Krimit
Contributor

Krimit commented Jun 13, 2017

Cool, LGTM @jkbradley

dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017