-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6985][streaming] Receiver maxRate over 1000 causes a StackOverflowError #5559
Conversation
@@ -176,7 +176,7 @@ class ReceiverSuite extends TestSuiteBase with Timeouts with Serializable { | |||
blockGenerator.addData(count) | |||
generatedData += count | |||
count += 1 | |||
Thread.sleep(1) | |||
Thread.sleep(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why sleep(0) -- it's a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear that this is a no-op (because of system time granularity - see discussion here); however, this can't sleep for 1ms because the rate specified is greater than 1000 messages / second (1 message / ms), so the test would not actually trigger the throttling. It seems intuitive to replace this with Thread.yield(), since it's presumably necessary for a background thread to get a chance to process receiving messages, but performing that refactoring does not succeed, nor does simply removing the statement.
Any suggestions? Am I understanding the threading model correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah at best it is equivalent to Thread.yield()
but that wouldn't guarantee any particular behavior in this case. The thread may or may not yield, and other threads may or may not progress at all.
+1. This looks good. Apart from the couple of things which Sean already pointed out (the Thread.sleep and the if condition), this looks good to be merged! |
@srowen, @harishreedharan: Can one of you take a second pass? |
Looks good. Thread.yield is also not a guarantee, though I think this is the best we can do for now |
Thanks for the feedback! |
@@ -176,7 +176,7 @@ class ReceiverSuite extends TestSuiteBase with Timeouts with Serializable { | |||
blockGenerator.addData(count) | |||
generatedData += count | |||
count += 1 | |||
Thread.sleep(1) | |||
Thread.`yield`() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm about to learn some crazy new syntax, this won't compile. Thread.yield()
? but if the point is that this loop needs to happen "really fast" then just remove it? I get that this will flood the block generator much much more rapidly, but that might have already been happening with yield()
. In any event, if you just want to wait less than a millisecond, you can call Thread.sleep(0, [# nanoseconds])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion; I'll experiment with a sleep time that's a scaled version of the original specification, since I'm not trying to meaningfully change the test, just tighten it up a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS: It most certainly will compile! From the relevant section of the Scala language specification, 2.9 (p. 5):
"
Example 1.1.2 Backquote-enclosed strings are a solution when one needs to access Java identifiers that are reserved words in Scala. For instance, the statement Thread.yield() is illegal, since yield is a reserved word in Scala. However, here’s a work-around:
Thread.‘yield‘()
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, didn't know that. That's certainly right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried all number of micro-waits, and nothing seems to do the trick like Thread.sleep(0). My hunch is that, given a longer test, the preemptive multithreading would even out and each thread would see sufficient CPU time, but, for such a concise run, the main test thread needs to be voluntarily cooperative.
At any rate, there doesn't seem to be a good compromise between platform independence, intuitive appeal, and speed and repeatability of the tests, so I'm at a bit of a loss. Currently this defect is hamstringing our entire stream-processing operation (rate limiting doesn't work reliably at any rate, but just crashes obviously and violently for the documented rates), so I'd appreciate any guidance or shepherding you can provide.
Cheers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about no wait? I suppose I'm not sure why this depends deterministically on something like yield
or sleep(0)
since it doesn't necessarily do anything. If the problem is not going fast enough then I'd expect removing this line goes the fastest of all.
I don't think anyone's arguing against the fix of course; this is about the test implementation only. In the worst case, Thread.sleep(0)
with a comment isn't so bad but I want to fully explore this first.
Jenkins, retest this please. |
Thread.sleep(sleepTimeInMillis) | ||
} | ||
waitToPush() | ||
if( desiredRate > 0 ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't change this unless we have to push more changes, and this wasn't your change, but the spacing is off here -- should be no space inside parens but space outside. But don't bother with it now.
Test build #30556 has finished for PR 5559 at commit
|
Yeah, I thought some of those lines looked long. Rewrap smartly to keep under 100 per line. (And maybe you can fix the if statement spacing while you're at it. |
Jenkins, retest this please. |
Test build #30594 has finished for PR 5559 at commit
|
@srowen, @harishreedharan: I think I loosened the constraints sufficiently to account for the thread timing differences on the build server, if one of you could kick off another a build, please. |
Jenkins, add to whitelist |
Jenkins, retest this please |
Test build #30606 has finished for PR 5559 at commit
|
…flowError A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch. However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass. Author: David McGuire <[email protected]> Closes #5559 from dmcguire81/master and squashes the following commits: d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing 8be6934 [David McGuire] Fix spacing per code review 90e98b9 [David McGuire] Address scalastyle errors 29011bd [David McGuire] Further ratchet down the error margins b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator 8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code 70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent 82ee46d [David McGuire] Replace guard clause with nested conditional 2794717 [David McGuire] Replace the RateLimiter with the Guava implementation 38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail 24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s (cherry picked from commit 5fea3e5) Signed-off-by: Sean Owen <[email protected]>
…flowError A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch. However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass. Author: David McGuire <[email protected]> Closes apache#5559 from dmcguire81/master and squashes the following commits: d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing 8be6934 [David McGuire] Fix spacing per code review 90e98b9 [David McGuire] Address scalastyle errors 29011bd [David McGuire] Further ratchet down the error margins b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator 8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code 70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent 82ee46d [David McGuire] Replace guard clause with nested conditional 2794717 [David McGuire] Replace the RateLimiter with the Guava implementation 38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail 24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s (cherry picked from commit 5fea3e5) Signed-off-by: Sean Owen <[email protected]>
…flowError A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch. However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass. Author: David McGuire <[email protected]> Closes apache#5559 from dmcguire81/master and squashes the following commits: d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing 8be6934 [David McGuire] Fix spacing per code review 90e98b9 [David McGuire] Address scalastyle errors 29011bd [David McGuire] Further ratchet down the error margins b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator 8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code 70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent 82ee46d [David McGuire] Replace guard clause with nested conditional 2794717 [David McGuire] Replace the RateLimiter with the Guava implementation 38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail 24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s
A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.
However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.