[SPARK-23524] Big local shuffle blocks should not be checked for corruption. #20685
Jenkins, test this please
Test build #87714 has finished for PR 20685 at commit
535916c to 110c851
Test build #87741 has finished for PR 20685 at commit
Jenkins, retest this please
Test build #87755 has finished for PR 20685 at commit
Jenkins, retest this please.
Test build #87770 has finished for PR 20685 at commit
The failed test
Jenkins, retest this please.
Test build #87807 has finished for PR 20685 at commit
cc @cloud-fan @zsxwing @squito @jiangxb1987
minor cleanup in the test, otherwise lgtm. thanks for catching this and suggesting a fix.
@@ -352,6 +352,63 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT
intercept[FetchFailedException] { iterator.next() }
}

test("big corrupt blocks will not be retiried") { |
typo: retried (or maybe "retired", not sure)
though I think a better name would be "big blocks are not checked for corruption"
I will refine this.
)

val transfer = mock(classOf[BlockTransferService])
when(transfer.fetchBlocks(any(), any(), any(), any(), any(), any()))
you can reuse createMockTransfer to simplify this a little.
(actually, a bunch of this test code looks like it could be refactored across these tests -- but we can leave that out of this change.)
Thanks a lot, Imran. I can file another PR for the refactoring :)
sorry my comment was vague -- I do think you can use createMockTransfer here, since that helper method already exists.
I was just thinking that there may be more we could clean up -- setting up the local & remote BlockManagerId, creating the ShuffleIterator, etc. seems to involve a lot of boilerplate in all the tests. But let's not do a pure refactoring of the other tests in this change.
We should update the doc of
@squito @cloud-fan
Test build #88003 has finished for PR 20685 at commit
it'll also help with disk corruption ... from the stack traces in SPARK-4105 you can't really tell what the source of the problem is. it'll be pretty hard to determine what the source of corruption is if we start seeing it again. anyway, I don't feel that strongly about it either way.
* @param size estimated size of the block, used to calculate bytesInFlight.
* Note that this is NOT the exact bytes.
* @param size estimated size of the block. Note that this is NOT the exact bytes.
* Size of remote block is used to calculate bytesInFlight.
nit: documentation style
sounds reasonable. The purpose of this corruption check is to fail fast to retry the stage (re-shuffle), so disk corruption should also be counted.
LGTM
@cloud-fan @squito @Ngone51
Test build #88033 has finished for PR 20685 at commit
I think #20179 probably already fixed the data corruption issue.
yea very likely, but I'm not 100% sure. How about we merge this one first to fix the mistake for local shuffle blocks, and then think about whether or not we should remove this corruption check?
I agree with @cloud-fan.
lgtm
thanks, merging to master/2.3!
…uption. ## What changes were proposed in this pull request? In current code, all local blocks will be checked for corruption no matter it's big or not. The reasons are as below: Size in FetchResult for local block is set to be 0 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327) SPARK-4105 meant to only check the small blocks(size<maxBytesInFlight/3), but for reason 1, below check will be invalid. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420 We can fix this and avoid the OOM. ## How was this patch tested? UT added Author: jx158167 <[email protected]> Closes #20685 from jinxing64/SPARK-23524. (cherry picked from commit 77c91cc) Signed-off-by: Wenchen Fan <[email protected]>
Thanks for merging!
What changes were proposed in this pull request?
In the current code, all local blocks are checked for corruption regardless of their size. The reasons are as follows:
1. The size in FetchResult for a local block is set to 0 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L327)
2. SPARK-4105 meant to check only the small blocks (size < maxBytesInFlight/3), but because of reason 1 the check below always passes for local blocks, so the size limit never excludes them. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L420
We can fix this and avoid the OOM.
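To make the two reasons above concrete, here is a minimal sketch of how a reported size of 0 defeats the size limit. It is a simplified model, not the code in ShuffleBlockFetcherIterator: the object CorruptionCheckSketch, the case class FetchedBlock, the helper names, and the 48 MiB value for maxBytesInFlight are all assumptions made for illustration. The "after" helper assumes the fix reports the local block's real size, which matches the updated doc comment in this PR but is not copied from the patch.

```scala
// Simplified, hypothetical model of the size check; these are not Spark's actual classes.
object CorruptionCheckSketch {
  // A fetched shuffle block: whether it was read locally, and its real size in bytes.
  final case class FetchedBlock(isLocal: Boolean, actualSize: Long)

  // Assumed value for illustration only (48 MiB).
  val maxBytesInFlight: Long = 48L * 1024 * 1024

  // Before the fix: local blocks are reported with size 0.
  def reportedSizeBefore(block: FetchedBlock): Long =
    if (block.isLocal) 0L else block.actualSize

  // After the fix (as assumed here): the block's own size is reported.
  def reportedSizeAfter(block: FetchedBlock): Long = block.actualSize

  // The SPARK-4105 rule: only blocks smaller than maxBytesInFlight / 3 are checked.
  def isCheckedForCorruption(reportedSize: Long): Boolean =
    reportedSize < maxBytesInFlight / 3

  def main(args: Array[String]): Unit = {
    val bigLocal = FetchedBlock(isLocal = true, actualSize = 1L << 30) // 1 GiB
    // 0 < 16 MiB, so the big local block is still checked.
    println(isCheckedForCorruption(reportedSizeBefore(bigLocal))) // prints "true"
    // 1 GiB is not below 16 MiB, so the big local block is skipped.
    println(isCheckedForCorruption(reportedSizeAfter(bigLocal)))  // prints "false"
  }
}
```

In this model, remote blocks behave the same before and after; only the size reported for local blocks changes the outcome of the check.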
How was this patch tested?
A unit test was added.