ZOOKEEPER-3072: Throttle race condition fix #563

Closed

Conversation

bothejjms

Making the throttle check before passing the request over to the next thread prevents the possibility of the throttling code running after unthrottle.

Added an additional async hammer thread which pretty reliably reproduces the race condition. The globalOutstandingLimit is decreased so the throttling code is executed.
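
For readers skimming the thread, below is a minimal, self-contained model of the ordering problem. It is illustrative only, not the actual ZooKeeperServer/ServerCnxn code; the counters, LIMIT, and the recvEnabled flag are hypothetical stand-ins for the outstanding-request accounting and the per-connection throttling flag.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of the ordering problem, not the real ZooKeeper code.
// "inProcess" stands in for the server-wide in-flight count compared against
// globalOutstandingLimit, and "recvEnabled" for the per-connection flag that
// throttling toggles.
public class ThrottleRaceSketch {
    static final int LIMIT = 1;                                  // stand-in for globalOutstandingLimit
    static final AtomicInteger inProcess = new AtomicInteger();  // kept high by other hammering clients
    static final ExecutorService pipeline = Executors.newSingleThreadExecutor();

    static class Connection {
        final AtomicInteger outstanding = new AtomicInteger();   // this connection's outstanding requests
        volatile boolean recvEnabled = true;
    }

    static void throttleCheck(Connection c) {                    // run when a request is counted
        if (inProcess.get() > LIMIT) c.recvEnabled = false;
    }

    static void unthrottleCheck(Connection c) {                  // run when a response is sent
        if (inProcess.get() < LIMIT || c.outstanding.get() < 1) c.recvEnabled = true;
    }

    static void process(Connection c, Runnable work) {           // what the pipeline thread does
        work.run();
        c.outstanding.decrementAndGet();
        inProcess.decrementAndGet();
        unthrottleCheck(c);
    }

    // Buggy ordering (pre-fix): hand the request to the pipeline first, count and
    // throttle afterwards. If the pipeline finishes first, unthrottleCheck sees
    // outstanding == -1 and enables recv, then the late throttleCheck disables it
    // again while the global count is still high. The connection now has nothing
    // outstanding, so nothing will ever re-enable it.
    static void submitBuggy(Connection c, Runnable work) {
        inProcess.incrementAndGet();
        pipeline.execute(() -> process(c, work));
        c.outstanding.incrementAndGet();
        throttleCheck(c);
    }

    // Fixed ordering: count and run the throttle check before handing the request
    // over, so the unthrottle check for this request always runs after it.
    static void submitFixed(Connection c, Runnable work) {
        inProcess.incrementAndGet();
        c.outstanding.incrementAndGet();
        throttleCheck(c);
        pipeline.execute(() -> process(c, work));
    }

    public static void main(String[] args) {
        Connection c = new Connection();
        submitFixed(c, () -> System.out.println("request processed"));
        pipeline.shutdown();
    }
}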

@bothejjms bothejjms force-pushed the ZOOKEEPER-3072 branch 2 times, most recently from e27f95a to 0561feb on July 9, 2018 09:41
Contributor

@anmolnar anmolnar left a comment


Nice catch @bothejjms , thanks for your first contribution. ;)

@@ -1124,6 +1124,7 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
}
return;
} else {
cnxn.incrOutstandingRequests(h);
Contributor


I have two observations here which probably don't make a big difference but might be worth considering.

  • First, the return statements in the if branches are not required anymore, because there's no longer any statement at the end of the method.
  • Second, moving cnxn.incrOutstandingRequests(h) here means that from now on you'll trigger throttling for sasl requests too, which was not the case previously. The same goes for auth packets, which I believe was intentional.

Author


Hmm, right. That return was not there in 3.5.3, where I spotted the issue; I missed it when I ported my change to master.
I see ZOOKEEPER-2785 introduced it. I will update my PR and move the increment to the else branch to avoid sasl throttling.

@anmolnar
Contributor

anmolnar commented Jul 9, 2018

@bothejjms What do you mean by "pretty reliably" exactly?
I see that the test has failed on Jenkins, and I've also tested it myself: it basically killed my machine for 45 mins, which seems like a bit of overkill to me.

  • Could we run the 100 threads in a ThreadPoolExecutor, for example, and see how it behaves? (A rough sketch follows this list.)
  • Do we need that many threads while setting globalOutstandingLimit to 1?
  • Also, I don't want to insist on covering this issue with a test if we can only do it "pretty reliably" and at the cost of introducing a new flaky test. :(
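
For reference, driving the hammer clients from a bounded pool could look roughly like the sketch below. This is hypothetical test scaffolding, not code from the PR, and (as noted further down in the thread) it would not by itself remove the flakiness; the request-sending body is just a placeholder.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run 100 hammer tasks on a bounded pool instead of 100 raw threads.
public class HammerPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);   // bounded worker count
        for (int i = 0; i < 100; i++) {
            final int clientId = i;
            pool.submit(() -> {
                // ... connect a client and hammer the server with requests ...
                System.out.println("hammer client " + clientId + " done");
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}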

@bothejjms
Author

On "pretty reliably" I mean the test has failed for me like 90% of the time with the original code but the result can differ on different machines since it is a race condition.
Reproducing race condition in a test is not simple. I am open to suggestions how to do it reliably. Do you recall any other tests for race conditions in the test suite?

After the fix the test has always passed on my machine. I am not sure yet why it fails on Jenkins.

For me the test takes 40 seconds on my VM, which is not particularly strong. I am also not satisfied with this test; I just wanted to prove that the race condition is there. Instead of the test I could add a description of how to reproduce the issue and skip permanent testing for it.

Contributor

@nkalmar nkalmar left a comment


I don't think we should include this test with the unit tests. As you mentioned, @bothejjms, maybe just write a description of it instead? The test seems unreliable in terms of flakiness and runtime, especially on the Apache Jenkins servers, which are often overloaded.

hammers[i].start();
}
LOG.info("Started hammers");
Thread.sleep(30000); // allow the clients to run for max 5sec
Contributor

@nkalmar nkalmar Jul 10, 2018


nit: This is 30 seconds, not 5 seconds as the comment says.

@anmolnar
Contributor

@bothejjms Sorry, there's a typo in my previous comment: it was 45 seconds on my machine, and it literally killed the entire machine, which I think isn't acceptable on Jenkins slaves.

I'd give ThreadPoolExecutor a try in the first place and dig into why it's not 100% reliable currently. If there's no success after a few days' work, just skip adding a test here.

@bothejjms
Author

I have tweaked the test to use significantly fewer threads and to be faster. Unfortunately it still fails on Jenkins. :(

I am not sure how ThreadPoolExecutor would help with this. It would spin up the same number of threads in the background, wouldn't it?

@anmolnar
Contributor

anmolnar commented Jul 10, 2018

@bothejjms It would spin up only a limited number of threads, but that wouldn't help either, as you said. You literally want 100 clients simultaneously sending requests until the test stops. AsyncHammerTest does pretty much the same; it looks like you copied most of the logic implemented there.

I don't know how to do this properly. I suspect AsyncHammerTest is also flaky, which is another reason not to create a similar test.

@bothejjms bothejjms changed the title from "Fix for ZOOKEEPER-3072" to "ZOOKEEPER-3072: Throttle race condition fix" on Jul 18, 2018
@bothejjms
Author

I have removed the test for now, as I don't have a good way to test this race condition. It can be reproduced easily by starting a server where globalOutstandingLimit is 1 and sending a lot of exists requests. There is a good chance that one session will get stuck in a throttled state even though it has no active requests.
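
A rough reproduction sketch along those lines (hypothetical client-side code; it assumes a server is already running on localhost:2181 and was started with globalOutstandingLimit set to 1, e.g. via -Dzookeeper.globalOutstandingLimit=1):

import org.apache.zookeeper.AsyncCallback.StatCallback;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical reproduction sketch: flood a low-limit server with async exists()
// calls and then watch whether a session stays throttled with nothing outstanding.
public class ThrottleReproSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
        StatCallback cb = (rc, path, ctx, stat) -> { /* reply received */ };
        for (int i = 0; i < 100_000; i++) {
            zk.exists("/", false, cb, null);   // many in-flight requests against a limit of 1
        }
        Thread.sleep(10_000);                  // a session stuck in the throttled state stops responding here
        zk.close();
    }
}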

@anmolnar
Contributor

Thanks @bothejjms . I think the patch can be accepted now without the test.
We need at least one more committer to approve. @hanm @phunt ?

Contributor

@breed breed left a comment


+1 i have a minor refactoring suggestion, but i'm fine if we want to commit as is.

@@ -1128,9 +1128,9 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
Record rsp = processSasl(incomingBuffer,cnxn);
ReplyHeader rh = new ReplyHeader(h.getXid(), 0, KeeperException.Code.OK.intValue());
cnxn.sendResponse(rh,rsp, "response"); // not sure about 3rd arg..what is it?
return;
Contributor


it would be nice to keep this return since it matches the handling of the other auth logic above.

it would also be nice if this was an

} else if (h.getType() == OpCode.sasl) {

clause and the
} else {

was done outside of the if since all the other blocks will have returned. i think it makes the logic easier to follow.
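
In other words, something like the shape below. This is a simplified, self-contained outline with stand-in types to show the suggested structure; it is not the actual ZooKeeperServer.processPacket code.

import java.nio.ByteBuffer;

// Simplified outline of the suggested branch structure; the types here are
// stand-ins, not the real org.apache.zookeeper.server classes.
public class ProcessPacketShapeSketch {
    enum OpCode { auth, sasl, exists /* ... other request types ... */ }

    interface Connection { void incrOutstandingRequests(OpCode type); }

    static void processPacket(OpCode type, Connection cnxn, ByteBuffer incomingBuffer) {
        if (type == OpCode.auth) {
            // ... validate the auth packet and send the reply ...
            return;
        } else if (type == OpCode.sasl) {
            // ... process the SASL token and send the reply ...
            return;
        }
        // Every special case above has returned, so the normal-request path can sit
        // outside the if/else chain: throttle first, then hand off to the pipeline.
        cnxn.incrOutstandingRequests(type);
        submitRequest(type, incomingBuffer);
    }

    static void submitRequest(OpCode type, ByteBuffer request) {
        // ... enqueue for the request-processor pipeline ...
    }
}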

Author


I have refactored it like that.
The returns are actually unnecessary, but I have added them consistently now.

Making the throttle check before passing over the request to the next thread will prevent the possibility of throttling code running after unthrottle
@bothejjms
Author

I have refactored the branches as suggested.

@breed
Contributor

breed commented Jul 27, 2018

thank you @bothejjms !

@asfgit asfgit closed this in 2a372fc Jul 28, 2018
asfgit pushed a commit that referenced this pull request Jul 28, 2018
Making the throttle check before passing over the request to the next thread will prevent the possibility of throttling code running after unthrottle

Added an additional async hammer thread which is pretty reliably reproduces the race condition. The globalOutstandingLimit is decreased so throttling code is executed.

Author: Botond Hejj <[email protected]>

Reviewers: Andor Molnár <[email protected]>, Norbert Kalmar <[email protected]>, Benjamin Reed <[email protected]>

Closes #563 from bothejjms/ZOOKEEPER-3072

(cherry picked from commit 2a372fc)
Signed-off-by: Benjamin Reed <[email protected]>
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022