Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky test-http2-session-timeout #15338

Conversation

apapirovski
Copy link
Member

@apapirovski apapirovski commented Sep 11, 2017

This is an alternative solution to #15328. Increases server timeout, reduces frequency of calls and unbinds timeout after runs are done in order to avoid race conditions.

cc @Trott @mcollina @tniessen

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Sep 11, 2017
@apapirovski
Copy link
Member Author

apapirovski commented Sep 11, 2017

I was able to run tools/test.py -j 96 --repeat 192 test/parallel/test-http2-session-timeout.js with no failures (repeatedly) so this should be safe.

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from a6b0f34 to c8e3767 Compare September 11, 2017 11:21
@mscdex mscdex added the http2 Issues or PRs related to the http2 subsystem. label Sep 11, 2017
@mcollina
Copy link
Member

Copy link
Member

@tniessen tniessen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM if CI passes.

@apapirovski
Copy link
Member Author

We should be able to run a CI stress test now that @TimothyGu's patch landed. Could someone pull the trigger, please?

@mcollina
Copy link
Member

Both of them failed all times, so maybe I got something wrong in kicking off the job. Can someone else have a look and start it again?

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from c8e3767 to 3bd0870 Compare September 13, 2017 11:24
@apapirovski
Copy link
Member Author

It behaved as if the patch to the test runner didn't land. Not sure why that would be.

(I've rebased this just in case but I can't imagine it's running my version of the test runner... is it?)

@Trott
Copy link
Member

Trott commented Sep 13, 2017

(I've rebased this just in case but I can't imagine it's running my version of the test runner... is it?)

I think it is (or was). It was using ref refs/pull/15338/head and doesn't appear to do a rebase.

@Trott
Copy link
Member

Trott commented Sep 13, 2017

@apapirovski
Copy link
Member Author

Well, that explains that. Thanks @Trott

@Trott
Copy link
Member

Trott commented Sep 14, 2017

Whoops, those stress tests aren't really valid because they aren't running things concurrently. (Glad their green, though.)

I'll kick off some more in a minute.

@Trott
Copy link
Member

Trott commented Sep 14, 2017

Oh, this should include Fixes: #15326 as metadata. This is a note to whoever lands this. :-D

@Trott
Copy link
Member

Trott commented Sep 14, 2017

Stress test on freebsd10-64 where it has been observed to fail:

This PR (should be green): https://ci.nodejs.org/job/node-stress-single-test/1417/

master (should be red): https://ci.nodejs.org/job/node-stress-single-test/1418/

@Trott
Copy link
Member

Trott commented Sep 14, 2017

The stress test still fails with these changes. (EDIT: I stopped the one for this PR after 76 failures in 76 runs.)

For comparison, here's a stress test on #15328: https://ci.nodejs.org/job/node-stress-single-test/1419/nodes=freebsd10-64/console EDIT: Argh, looks like I need to rebase to run a single test as well. Did that now, so let's try again: https://ci.nodejs.org/job/node-stress-single-test/1420/nodes=freebsd10-64/console

If that stress test passes but you still want the overall changes in this PR, hopefully the solution is to simply change the timeout values from 500 and 50 to 1200 and 10. Not a great solution as there's still a race condition, but one that (probably) won't be triggered on our CI or in most developer test runs.

If that stress test fails too, then it's time to look into things some more...

@apapirovski
Copy link
Member Author

apapirovski commented Sep 14, 2017

The problem with comparing this to the other PR is that it doesn't actually test the timeout condition, as per my comment there. It only runs 40 requests 10ms (400ms total) apart and the timeout is 1200ms.

I'll see what can be done to fix this PR.

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from 3bd0870 to c47516c Compare September 14, 2017 10:53
@apapirovski
Copy link
Member Author

Ok, changed up some more stuff. Should be able to handle concurrency better now. Can we run this again? Thanks so much!

@mcollina
Copy link
Member

@apapirovski
Copy link
Member Author

apapirovski commented Sep 14, 2017

Is there a margin of failure we consider acceptable for stress tests? Looks like it's failing about 15-20% of the time now.

I'm guessing if I bump up the timeout to something like 5000 and the calls to, say, 800 then it'll probably be fine on that system but it makes the test unnecessarily long on all platforms. Not sure if that's a big deal or not.

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from c47516c to 65e5a0b Compare September 14, 2017 12:26
@apapirovski
Copy link
Member Author

apapirovski commented Sep 14, 2017

New version is up, 5000 & 800 now. If this doesn't do it then I don't know...

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from 65e5a0b to 3103f89 Compare September 14, 2017 12:29
@apapirovski
Copy link
Member Author

apapirovski commented Sep 14, 2017

@mcollina I just checked the CI and the commit it's testing is the one from the other PR. I reverted my changes since I don't think they got tested.

@Trott
Copy link
Member

Trott commented Sep 14, 2017

So here are the important parameters to use to kick of a CI stress test for this PR:

  • GIT_REMOTE_REF should be refs/pull/15338/head
  • RUN_TESTS should be -j 96 --repeat 192 parallel/test-http2-session-timeout. If you don't include the -j and --repeat parameters, then the stress test runs one test at a time over and over, as if it was in sequential. This is fine for tests that are flaky for reasons other than concurrency, but this one is (like most timer-based race conditions) greatly impacted by how many other tests are running. 96 is way more than CI will typically run, but this is a way to make sure the problem is really solved. :-D
  • RUN_TIMES leave at 100 because each one of those will run the test 192 times, so that means you're running the test 19,200 rimwa, which is plenty.
  • Choose freebsd10-64 from the RUN_LABEL drop-down as that's where we've seen it fail in CI.

@Trott
Copy link
Member

Trott commented Sep 14, 2017

@Trott
Copy link
Member

Trott commented Sep 14, 2017

Still failing. I think moving the test from test/parallel to test/sequential may be the way to go. Given the reliance on timers, I don't think there's any way to make this test not prone to race conditions.

@apapirovski
Copy link
Member Author

A lot closer this time but yeah, still failing. Will think about it a bit and see if I can revise it in any way, if not we'll just have to move it to sequential. Sorry for the hassle, everyone.

@jasnell
Copy link
Member

jasnell commented Sep 15, 2017

No worries at all @apapirovski !

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from 3103f89 to 0c22399 Compare September 16, 2017 12:25
@apapirovski
Copy link
Member Author

Updated to move to sequential for now. Will revisit later potentially but all the http timeout tests are there so it might not be possible anyway.

@Trott
Copy link
Member

Trott commented Sep 16, 2017

Question that can be ignored if you wish in the interest of getting the fix landed quickly: Since it's in sequential, does it make sense to go back to a 200ms timeout or at least reduce it somewhat from 2000 to speed up the test?

@Trott
Copy link
Member

Trott commented Sep 16, 2017

@apapirovski
Copy link
Member Author

apapirovski commented Sep 16, 2017

@Trott Yeah, we can definitely reduce it. I'll put it at 200 for the server & 20 for the call. Did you want to run the CI on the updated version once I put it up or is that run above enough?

Increase server timeout, reduce frequency of calls and
unbind timeout after runs are done in order to avoid
race conditions. Temporarily moved to sequential.

Fixes: nodejs#15326
@Trott
Copy link
Member

Trott commented Sep 16, 2017

Did you want to run the CI on the updated version once I put it up

Yeah, let's run CI again after any changes.

@apapirovski apapirovski force-pushed the patch-fix-test-http2-session-timeout branch from 0c22399 to 0113f2e Compare September 16, 2017 14:36
@apapirovski
Copy link
Member Author

apapirovski commented Sep 16, 2017

Ok, timing updates are available to test now. Thanks @Trott!

@Trott
Copy link
Member

Trott commented Sep 17, 2017

@mcollina
Copy link
Member

CI (again): https://ci.nodejs.org/job/node-test-pull-request/10118/

The failure seemed unrelated, but it would be better to check.

@apapirovski
Copy link
Member Author

Failure looks unrelated, this is the test that failed parallel/test-listen-fd-ebadf.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mcollina
Copy link
Member

Landed as de51717.

@mcollina mcollina closed this Sep 18, 2017
mcollina pushed a commit that referenced this pull request Sep 18, 2017
Increase server timeout, reduce frequency of calls and
unbind timeout after runs are done in order to avoid
race conditions. Temporarily moved to sequential.

Fixes: #15326
PR-URL: #15338
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Yuta Hiroto <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
@apapirovski apapirovski deleted the patch-fix-test-http2-session-timeout branch September 19, 2017 14:00
jasnell pushed a commit that referenced this pull request Sep 20, 2017
Increase server timeout, reduce frequency of calls and
unbind timeout after runs are done in order to avoid
race conditions. Temporarily moved to sequential.

Fixes: #15326
PR-URL: #15338
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Yuta Hiroto <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Qard pushed a commit to Qard/ayo that referenced this pull request Sep 21, 2017
Increase server timeout, reduce frequency of calls and
unbind timeout after runs are done in order to avoid
race conditions. Temporarily moved to sequential.

Fixes: nodejs/node#15326
PR-URL: nodejs/node#15338
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Yuta Hiroto <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Qard pushed a commit to Qard/ayo that referenced this pull request Sep 21, 2017
Increase server timeout, reduce frequency of calls and
unbind timeout after runs are done in order to avoid
race conditions. Temporarily moved to sequential.

Fixes: nodejs/node#15326
PR-URL: nodejs/node#15338
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Yuta Hiroto <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http2 Issues or PRs related to the http2 subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants