Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6985][streaming] Receiver maxRate over 1000 causes a StackOverflowError #5559

Closed
wants to merge 12 commits into from
Closed

Conversation

dmcguire81
Copy link

A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.

However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.

@@ -176,7 +176,7 @@ class ReceiverSuite extends TestSuiteBase with Timeouts with Serializable {
blockGenerator.addData(count)
generatedData += count
count += 1
Thread.sleep(1)
Thread.sleep(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why sleep(0) -- it's a no-op?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear that this is a no-op (because of system time granularity - see discussion here); however, this can't sleep for 1ms because the rate specified is greater than 1000 messages / second (1 message / ms), so the test would not actually trigger the throttling. It seems intuitive to replace this with Thread.yield(), since it's presumably necessary for a background thread to get a chance to process receiving messages, but performing that refactoring does not succeed, nor does simply removing the statement.

Any suggestions? Am I understanding the threading model correctly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah at best it is equivalent to Thread.yield() but that wouldn't guarantee any particular behavior in this case. The thread may or may not yield, and other threads may or may not progress at all.

@harishreedharan
Copy link
Contributor

+1. This looks good.

Apart from the couple of things which Sean already pointed out (the Thread.sleep and the if condition), this looks good to be merged!

@dmcguire81
Copy link
Author

@srowen, @harishreedharan: Can one of you take a second pass?

@harishreedharan
Copy link
Contributor

Looks good. Thread.yield is also not a guarantee, though I think this is the best we can do for now

@dmcguire81
Copy link
Author

Thanks for the feedback!

@@ -176,7 +176,7 @@ class ReceiverSuite extends TestSuiteBase with Timeouts with Serializable {
blockGenerator.addData(count)
generatedData += count
count += 1
Thread.sleep(1)
Thread.`yield`()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm about to learn some crazy new syntax, this won't compile. Thread.yield()? but if the point is that this loop needs to happen "really fast" then just remove it? I get that this will flood the block generator much much more rapidly, but that might have already been happening with yield(). In any event, if you just want to wait less than a millisecond, you can call Thread.sleep(0, [# nanoseconds]).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion; I'll experiment with a sleep time that's a scaled version of the original specification, since I'm not trying to meaningfully change the test, just tighten it up a bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: It most certainly will compile! From the relevant section of the Scala language specification, 2.9 (p. 5):
"
Example 1.1.2 Backquote-enclosed strings are a solution when one needs to access Java identifiers that are reserved words in Scala. For instance, the statement Thread.yield() is illegal, since yield is a reserved word in Scala. However, here’s a work-around:
Thread.‘yield‘()
"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, didn't know that. That's certainly right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried all number of micro-waits, and nothing seems to do the trick like Thread.sleep(0). My hunch is that, given a longer test, the preemptive multithreading would even out and each thread would see sufficient CPU time, but, for such a concise run, the main test thread needs to be voluntarily cooperative.

At any rate, there doesn't seem to be a good compromise between platform independence, intuitive appeal, and speed and repeatability of the tests, so I'm at a bit of a loss. Currently this defect is hamstringing our entire stream-processing operation (rate limiting doesn't work reliably at any rate, but just crashes obviously and violently for the documented rates), so I'd appreciate any guidance or shepherding you can provide.

Cheers!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about no wait? I suppose I'm not sure why this depends deterministically on something like yield or sleep(0) since it doesn't necessarily do anything. If the problem is not going fast enough then I'd expect removing this line goes the fastest of all.

I don't think anyone's arguing against the fix of course; this is about the test implementation only. In the worst case, Thread.sleep(0) with a comment isn't so bad but I want to fully explore this first.

@srowen
Copy link
Member

srowen commented Apr 19, 2015

Jenkins, retest this please.

Thread.sleep(sleepTimeInMillis)
}
waitToPush()
if( desiredRate > 0 ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change this unless we have to push more changes, and this wasn't your change, but the spacing is off here -- should be no space inside parens but space outside. But don't bother with it now.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30556 has finished for PR 5559 at commit 29011bd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@srowen
Copy link
Member

srowen commented Apr 19, 2015

Yeah, I thought some of those lines looked long. Rewrap smartly to keep under 100 per line. (And maybe you can fix the if statement spacing while you're at it.

@srowen
Copy link
Member

srowen commented Apr 20, 2015

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 20, 2015

Test build #30594 has finished for PR 5559 at commit 8be6934.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@dmcguire81
Copy link
Author

@srowen, @harishreedharan: I think I loosened the constraints sufficiently to account for the thread timing differences on the build server, if one of you could kick off another a build, please.

@srowen
Copy link
Member

srowen commented Apr 20, 2015

Jenkins, add to whitelist

@srowen
Copy link
Member

srowen commented Apr 20, 2015

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Apr 20, 2015

Test build #30606 has finished for PR 5559 at commit d29d2e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@asfgit asfgit closed this in 5fea3e5 Apr 21, 2015
asfgit pushed a commit that referenced this pull request Apr 21, 2015
…flowError

A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.

However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.

Author: David McGuire <[email protected]>

Closes #5559 from dmcguire81/master and squashes the following commits:

d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing
8be6934 [David McGuire] Fix spacing per code review
90e98b9 [David McGuire] Address scalastyle errors
29011bd [David McGuire] Further ratchet down the error margins
b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator
8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code
70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent
82ee46d [David McGuire] Replace guard clause with nested conditional
2794717 [David McGuire] Replace the RateLimiter with the Guava implementation
38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail
24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion
d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s

(cherry picked from commit 5fea3e5)
Signed-off-by: Sean Owen <[email protected]>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 23, 2015
…flowError

A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.

However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.

Author: David McGuire <[email protected]>

Closes apache#5559 from dmcguire81/master and squashes the following commits:

d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing
8be6934 [David McGuire] Fix spacing per code review
90e98b9 [David McGuire] Address scalastyle errors
29011bd [David McGuire] Further ratchet down the error margins
b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator
8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code
70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent
82ee46d [David McGuire] Replace guard clause with nested conditional
2794717 [David McGuire] Replace the RateLimiter with the Guava implementation
38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail
24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion
d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s

(cherry picked from commit 5fea3e5)
Signed-off-by: Sean Owen <[email protected]>
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…flowError

A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.

However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.

Author: David McGuire <[email protected]>

Closes apache#5559 from dmcguire81/master and squashes the following commits:

d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing
8be6934 [David McGuire] Fix spacing per code review
90e98b9 [David McGuire] Address scalastyle errors
29011bd [David McGuire] Further ratchet down the error margins
b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator
8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code
70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent
82ee46d [David McGuire] Replace guard clause with nested conditional
2794717 [David McGuire] Replace the RateLimiter with the Guava implementation
38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail
24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion
d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants