-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21050][ML] Word2vec persistence overflow bug fix #18265
Conversation
Test build #77886 has finished for PR 18265 at commit
|
((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt | ||
val approximateSizeInBytes = (floatSize * vectorSize + averageWordSize) * numWords | ||
val numPartitions = (approximateSizeInBytes / bufferSizeInBytes) + 1 | ||
require(numPartitions < 10e8, s"Word2VecModel calculated that it needs $numPartitions " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this failure truly necessary? Can we make this a WARN and use a best-attempt (Int.MAX
?) partitions instead? The models that would fail here would be so huge that they likely took days to train
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure it is necessary. If we cap it at Int.MAX and the user hits that cap, then it means that we'll fail when trying to write the partitions.
@@ -188,6 +188,15 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul | |||
assert(math.abs(similarity(5) - similarityLarger(5) / similarity(5)) > 1E-5) | |||
} | |||
|
|||
test("Word2Vec read/write numPartitions calculation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this test hardcode a specific spark.kryoserializer.buffer.max
? That could allow us to be more explicit in the assertions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; I'll do that.
assert(tinyModelNumPartitions === 1) | ||
val mediumModelNumPartitions = Word2VecModel.Word2VecModelWriter.calculateNumberOfPartitions( | ||
sc, numWords = 1000000, vectorSize = 5000) | ||
assert(mediumModelNumPartitions > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a test for a truly large model that would have otherwise caused an overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "medium" one did cause an overflow.
Thanks @jkbradley! I'm really curious about how this came to your attention. Did somebody actually encounter this bug? For this bug to come up, the model being trained would have to be truly monstrous, in combination with a very low |
Yep, someone hit the bug! |
Test build #77916 has started for PR 18265 at commit |
looks like a spurious failure, retesting |
Test build #3793 has finished for PR 18265 at commit
|
@Krimit Thanks for taking a look! Does it look ready to merge now? |
I'm going to call this ready...but please say if you see other fixes I should make. Thanks! Merging with master and branch-2.2 |
## What changes were proposed in this pull request? The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easily to have an overflow in calculating the number of partitions for ML persistence. This modifies the calculations to use Long. ## How was this patch tested? New unit test. I verified that the test fails before this patch. Author: Joseph K. Bradley <[email protected]> Closes #18265 from jkbradley/word2vec-save-fix. (cherry picked from commit ff318c0) Signed-off-by: Joseph K. Bradley <[email protected]>
Cool, LGTM @jkbradley |
## What changes were proposed in this pull request? The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easily to have an overflow in calculating the number of partitions for ML persistence. This modifies the calculations to use Long. ## How was this patch tested? New unit test. I verified that the test fails before this patch. Author: Joseph K. Bradley <[email protected]> Closes apache#18265 from jkbradley/word2vec-save-fix.
What changes were proposed in this pull request?
The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easily to have an overflow in calculating the number of partitions for ML persistence.
This modifies the calculations to use Long.
How was this patch tested?
New unit test. I verified that the test fails before this patch.