Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky test-tls-wrap-timeout #7857

Closed
wants to merge 4 commits into from
Closed

Conversation

Trott
Copy link
Member

@Trott Trott commented Jul 23, 2016

Checklist
  • make -j4 test (UNIX), or vcbuild test nosign (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

test tls

Description of change

Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: #7650

/cc @indutny @nodejs/testing @SomeKittens

@Trott Trott added tls Issues and PRs related to the tls subsystem. test Issues and PRs related to the tests. labels Jul 23, 2016
@Trott
Copy link
Member Author

Trott commented Jul 23, 2016

The fix for this test is in io.js 1.3 and absent in io.js 1.2. So I modified the test a bit so that it would work on those versions to confirm that the test still passes in 1.3 and fails in 1.2. That form of the test can be found at https://gist.github.com/Trott/e4110c64314b9666185c3018dbb33b5c

@Trott
Copy link
Member Author

Trott commented Jul 25, 2016

@indutny
Copy link
Member

indutny commented Jul 25, 2016

Hm... doesn't really LGTM. The test was about server timeout, not client's.

@indutny
Copy link
Member

indutny commented Jul 25, 2016

Sorry!

@Trott
Copy link
Member Author

Trott commented Jul 25, 2016

@indutny I don't understand. Both this version of the test and the current one check the same object (socket in the code).

The old version made sure that socket.setTimeout() handler never fired. That makes sense in theory but is subject to race conditions in practice.

It was introduced in 9b6b0555. Prior to that, the timeout was firing (incorrectly) because socket._idleTimeout and related properties were not being updated correctly. (They were being updated on TLSSocket but not the parent, which in this case is socket. Or at least, that's my understanding.)

This new proposed version removes the race condition by checking that socket._idleTimeout gets set to -1 after TLSSocket is closed. Prior to 9b6b0555, this wasn't happening. After 9b6b0555, it's the expected state upon exiting.

Maybe it would be better to also check that socket._idleStart is being updated after the write/read on the socket. That would (further) guarantee that the socket timeout is working as expected and won't fire erroneously.

@Trott
Copy link
Member Author

Trott commented Jul 25, 2016

(Failure that this PR is supposed to address happened again on FreeBSD in CI: https://ci.nodejs.org/job/node-test-commit-freebsd/3427/nodes=freebsd10-64/console )

@Trott
Copy link
Member Author

Trott commented Jul 26, 2016

@indutny I put back the timer (now setImmediate()) to do a write and an assert to check that _idleStart is being updated. PTAL

@Trott
Copy link
Member Author

Trott commented Jul 27, 2016

New version of test is failing on some platforms. More work to be done...

@Trott
Copy link
Member Author

Trott commented Jul 27, 2016

OK, let's try again: https://ci.nodejs.org/job/node-test-pull-request/3437/

EDIT: OK, that's looking much better so far.

@Trott
Copy link
Member Author

Trott commented Jul 29, 2016

Previous CI failures look CI infra related and not an issue with this test, but uh, let's run it again anyway: https://ci.nodejs.org/job/node-test-pull-request/3450/

@Trott
Copy link
Member Author

Trott commented Jul 29, 2016

@jasnell
Copy link
Member

jasnell commented Jul 29, 2016

LGTM ... I take that back... looks like there are CI failures

@Trott
Copy link
Member Author

Trott commented Jul 29, 2016

@jasnell All the CI failures are build related. None are related to this test. But I do love green, so let's do CI again again again: https://ci.nodejs.org/job/node-test-pull-request/3455/

@Trott
Copy link
Member Author

Trott commented Jul 29, 2016

I am eager to get this fix in for this test because the version currently on master is failing a fair amount on FreeBSD. Here's another one from this morning:

not ok 1032 parallel/test-tls-wrap-timeout
# /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd10-64/test/parallel/test-tls-wrap-timeout.js:32
#       throw new Error('timeout');
#       ^
# 
# Error: timeout
#     at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd10-64/test/parallel/test-tls-wrap-timeout.js:32:13)
#     at Socket.g (events.js:286:16)
#     at emitNone (events.js:86:13)
#     at Socket.emit (events.js:185:7)
#     at Socket._onTimeout (net.js:333:8)
#     at tryOnTimeout (timers.js:228:11)
#     at Timer.listOnTimeout (timers.js:202:5)
  ---
  duration_ms: 0.632

@jasnell
Copy link
Member

jasnell commented Jul 29, 2016

LGTM with green-ish CI then ;-)

@Trott
Copy link
Member Author

Trott commented Jul 29, 2016

Last CI is green! ✅

@Trott
Copy link
Member Author

Trott commented Jul 30, 2016

Another failure on FreeBSD with current master. Definitely eager to land this fix.

https://ci.nodejs.org/job/node-test-commit-freebsd/3506/nodes=freebsd10-64/console

not ok 1032 parallel/test-tls-wrap-timeout
# /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd10-64/test/parallel/test-tls-wrap-timeout.js:32
#       throw new Error('timeout');
#       ^
# 
# Error: timeout
#     at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd10-64/test/parallel/test-tls-wrap-timeout.js:32:13)
#     at Socket.g (events.js:286:16)
#     at emitNone (events.js:86:13)
#     at Socket.emit (events.js:185:7)
#     at Socket._onTimeout (net.js:333:8)
#     at tryOnTimeout (timers.js:228:11)
#     at Timer.listOnTimeout (timers.js:202:5)
  ---
  duration_ms: 1.37

var tsocket = tls.connect({
lastIdleStart = socket._idleStart;

const tsocket = tls.connect({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe adding here

assert.notStrictEqual(socket._idleTimeout, -1);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@santigimeno Good idea. Added!

@santigimeno
Copy link
Member

I like that with this change we don't rely on timers. I'm not too sold on relying on implementation details (i.e:_idleTimeout and _idleStart) though.

Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: nodejs#7650
@Trott
Copy link
Member Author

Trott commented Jul 31, 2016

@santigimeno Yeah, I'd welcome a better way. For what it's worth, there is at least one other test where we do something similar (test-timers-active.js) and AFAIK it has never been flaky, so it does have at least that going for it.

@santigimeno
Copy link
Member

@Trott FWIW I gave this a try and found a way that, at least, seems less flaky than the current one: santigimeno@3df3f47.

I run a couple of stress tests and didn't fail:

whereas current master did fail:

The problem is that still relies on timers and the number of tests run doesn't seem representative enough to consider it non-flaky.

@santigimeno
Copy link
Member

Also, I looked to the original issue: nodejs/node-v0.x-archive#9242, and by the looks of the sample code pasted there, it seems that the timer associated with the socket is unref-ed as it doesn't keep the loop running even though the timer is already running. This test would have been so much easier to implement if the timer were ref-ed...

@Trott
Copy link
Member Author

Trott commented Jul 31, 2016

@santigimeno I'm inclined to think that using _idleTimeout/_idleStart is an indication of a missing API. It should be possible to query a socket to find out if it has any timeouts set or not. Our use case is for testing, but I'm sure there are other use cases. On the surface, at least, it seems like this would be easily done by adding a getter that returns _idleTimeout !== -1.

@Trott
Copy link
Member Author

Trott commented Jul 31, 2016

@indutny @santigimeno @jasnell Maybe something like d76bb07a19d890d95e71329da504187f3e4462f2 is sufficiently minimal a change to have everyone feel better about it? (Haven't actually stress tested or run it on the old versions and so on, but you get the idea of the approach.)

@Trott Trott mentioned this pull request Jul 31, 2016
2 tasks
@indutny
Copy link
Member

indutny commented Jul 31, 2016

Gosh, looks like I was thinking about different test all the time. LGTM. Sorry!

@santigimeno
Copy link
Member

@Trott, after your explanation, I'm fine with this PR, and I think I prefer this to the new one. LGTM

Trott added a commit to Trott/io.js that referenced this pull request Jul 31, 2016
Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: nodejs#7650
PR-URL: nodejs#7857
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Reviewed-By: jasnell - James M Snell <[email protected]>
@Trott
Copy link
Member Author

Trott commented Jul 31, 2016

Landed in bc464a8. Thanks, everyone!

@Trott Trott closed this Jul 31, 2016
@cjihrig cjihrig mentioned this pull request Aug 8, 2016
cjihrig pushed a commit that referenced this pull request Aug 10, 2016
Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: #7650
PR-URL: #7857
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Reviewed-By: jasnell - James M Snell <[email protected]>
@cjihrig cjihrig mentioned this pull request Aug 11, 2016
BethGriggs pushed a commit to BethGriggs/node that referenced this pull request Apr 21, 2017
Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: nodejs#7650
PR-URL: nodejs#7857
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Reviewed-By: jasnell - James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this pull request May 2, 2017
Competing timers were causing a race condition and thus the test was
flaky. Instead, we check an object property on process exit.

Fixes: #7650
Backport-PR-URL: #12567
PR-URL: #7857
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Reviewed-By: jasnell - James M Snell <[email protected]>
@MylesBorins MylesBorins mentioned this pull request May 2, 2017
@Trott Trott deleted the wrap branch October 14, 2021 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Issues and PRs related to the tests. tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants