-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
investigate flaky test-timers-promisified #37226
Comments
https://ci.nodejs.org/job/node-test-binary-arm-12+/9172/RUN_SUBSET=0,label=pi2-docker/console
|
@nodejs/timers |
Relevant failing test case: node/test/parallel/test-timers-promisified.js Lines 352 to 373 in fe43bd8
|
cc @Linkgoron |
It's also trivial to make it fail also by changing |
@Trott I went on a quick call with @Linkgoron (we know each other in real life) and we agreed he'll open a PR removing this test for now in the meantime. Then we should investigate why the error happens - I thought that eventhough timers didn't have a guarantee regarding running exactly after N milliseconds they were at least ordered. Apparently I was wrong. |
I think the problem might be that timers and intervals are in separate queues and ordering there is not guaranteed? I think? Or something close enough to that such that the solution may be to replace the timer on line 361 with an interval. Still investigating.... |
@Trott There is another test that is also dependant on such ordering (the last one in the |
Hmm, the intervals are firing out of order, and I do think that's a bug? Here's my modified version of the test with some really crude debugging going on. Using {
// Check that if we abort when we have some callbacks left,
// we actually call them.
const controller = new AbortController();
const { signal } = controller;
const delay = 1;
let totalIterations = 0;
let foo = '';
const timeoutLoop = runInterval(async (iterationNumber) => {
foo += `start ${iterationNumber}\n`;
if (iterationNumber === 2) {
foo += 'awaiting\n';
await setTimeout(delay * 2);
foo += 'aborting\n';
controller.abort();
foo += 'aborted\n';
}
if (iterationNumber > totalIterations) {
totalIterations = iterationNumber;
}
}, delay, signal);
timeoutLoop.catch(common.mustCall(() => {
assert.ok(totalIterations >= 3, `iterations was ${totalIterations} < 3, ${foo}`);
console.log(foo);
}));
}
} Most of the time, the test succeeds and the output looks like this: $ ./node --no-warnings --expose-internals test/parallel/test-timers-promisified.js
start 1
start 2
awaiting
aborting
aborted
start 3
$ But fairly often, the test fails and the output instead looks like this: $ ./node --no-warnings --expose-internals test/parallel/test-timers-promisified.js
node:internal/process/promises:227
triggerUncaughtException(err, true /* fromPromise */);
^
AssertionError [ERR_ASSERTION]: iterations was 2 < 3, start 1
start 2
awaiting
aborting
aborted
at /Users/trott/io.js/test/parallel/test-timers-promisified.js:375:14
at /Users/trott/io.js/test/common/index.js:376:15
at runNextTicks (node:internal/process/task_queues:59:5)
at processTimers (node:internal/timers:496:9) {
generatedMessage: false,
code: 'ERR_ASSERTION',
actual: false,
expected: true,
operator: '=='
}
$
|
Maybe the ordering problem is caused by this bit in the test? for await (const value of interval) {
assert.strictEqual(value, input);
iteration++;
await fn(iteration);
} The first |
@Linkgoron Yeah, I can't make the other one fail either. If it is also not 100% reliable, it is much closer to 100% than the one we're talking about. |
Yes, I tend to agree this is a bug in our timers. I don't actually think the queues are different (I checked the code and that's my understanding). @Linkgoron also mentioned the test fails even if you use setInterval in all cases (without mixing setInterval/setTimeout) |
Whoops, there is a |
Yeah, it's either a bug in our timer ordering or else a bug in the test |
OK, so sorry for all the bad information. Intervals are not firing out of order as far as I can tell. It does appear that the issue is simply that |
I don't think it's the runInterval logic. At least, the following "rewrite" is also flaky (works when run with the regular command, but crashes with your command): async function testRun() {
const controller = new AbortController();
const { signal } = controller;
const interval = setInterval(10, undefined, { signal });
const iterator = interval[Symbol.asyncIterator]();
return iterator.next()
.then(() => setTimeout(20))
.then(() => controller.abort())
.then(() => iterator.next())
.catch(common.mustNotCall())
.then(() => iterator.return());
}
testRun().then(common.mustCall()); |
Forgot the details.... Code: {
let foo = '\n';
async function runInterval(fn, intervalTime, signal) {
const input = 'foobar';
const interval = setInterval(intervalTime, input, { signal });
let iteration = 0;
for await (const value of interval) {
foo += `got ${value} from for await\n`;
assert.strictEqual(value, input);
iteration++;
await fn(iteration);
foo += `ran function for ${iteration}\n`;
}
}
{
// Check that if we abort when we have some callbacks left,
// we actually call them.
const controller = new AbortController();
const { signal } = controller;
const delay = 1;
let totalIterations = 0;
const timeoutLoop = runInterval(async (iterationNumber) => {
foo += `start ${iterationNumber}\n`;
if (iterationNumber === 2) {
foo += 'awaiting\n';
await setTimeout(delay * 2);
foo += 'aborting\n';
controller.abort();
foo += 'aborted\n';
}
if (iterationNumber > totalIterations) {
totalIterations = iterationNumber;
}
}, delay, signal);
timeoutLoop.catch(common.mustCall(() => {
assert.ok(totalIterations >= 3, `iterations was ${totalIterations} < 3, ${foo}`);
console.log(foo);
}));
}
} Output on success:
Output on failure:
The conclusion I draw from that is that the |
@Trott {
// Check that if we abort when we have some callbacks left,
// we actually call them.
const controller = new AbortController();
const { signal } = controller;
const delay = 10;
let totalIterations = 0;
const timeoutLoop = runInterval(async (iterationNumber) => {
if (iterationNumber === 2) {
await setTimeout(delay);
await setTimeout(delay);
controller.abort();
}
if (iterationNumber > totalIterations) {
totalIterations = iterationNumber;
}
}, delay, signal);
timeoutLoop.catch(common.mustCall(() => {
assert.ok(totalIterations >= 3, `iterations was ${totalIterations} < 3`);
}));
}
} |
This is interesting. I'm finding that if I use just one |
Which puts us back into "bug in timers" territory rather than "bug in test"? I guess? I'm a mess. This is what debugging does to me. |
Yeah, this is really strange. Changing For that matter, changing it to Is that what you're seeing too? Is this some heuristic in the timers that groups things together based on...ratios or something? That would seem to be the obvious explanation for this: 9 = ERROR Other ideas? |
Forgot to also mention that
|
According to Regarding using the different values, I assume that there's some issue with the priority queue where different values create different heaps (the priority-queue is built using a heap) and somehow they're prioritised differently. |
If timers are no longer ordered that means someone really messed up merging something in the last year as several tests would be rendered unstable and likely difficult to fix, and it would probably also mess with user code too. We probably should have just switched to making the ordering guarantee officially I suggest someone look into that. Ideally my write up should still be true: Lines 3 to 73 in fe43bd8
If it doesn't operate that way someone needs to either change it back or re-document it. |
@Fishrock123 I'm pretty sure the issue (if it is around timing) involves only the That said, I think the issue may be in the test after all. Next comment coming up.... |
@Linkgoron @benjamingr Does this correctly describes what the test is doing/what the expectations are? If If so, I think the fix is to get rid of |
@Trott The issue appears to be that there's an interval in a previous test that has the same timeout as the setTimeout (20). That's why it appears as though 19 etc. work (2*delay-1), it's just that there's another timer with the same value. The following failed for me (once) when I executed it with high parallel: setInterval(() => {
}, 20);
let i = 0;
setInterval(() => {
if (++i === 2) {
setTimeout(common.mustCall(function wat() {
if (i === 2) {
throw new Error('asdasd');
}
}), 20);
}
if (i === 10) {
process.exit();
}
}, 10); I found this out when I accidentally changed line 339 in |
Yep, this test fails for me consistently when using the j96 --repeat=192 command. const common = require('../common');
setInterval(() => {}, 20);
let i = 0;
setInterval(() => {
if (++i === 2) {
setTimeout(common.mustCall(function wat() {
if (i === 2) {
throw new Error('asdasd');
}
}), 20);
}
if (i === 10) {
process.exit();
}
}, 10); Removing the first (empty) setInterval makes the test pass consistently with that command. |
By using events, we can make sure the call to `controller.abort()` doesn't happen before the next iteration starts. Fixes: nodejs#37226
By using events, we can make sure the call to `controller.abort()` doesn't happen before the next iteration starts. Fixes: nodejs#37226
@Linkgoron Nice find. I'm glad there's a rational explanation for why I think #37230 fixes this issue by using events to make sure the next interval has started before running |
@Trott OK, I'm not sure if I was the only one who didn't get it, but after thinking about this for a while, I think I know what the issue is. Suppose that this is how the world "looks" like: The correct order should be 1->2->3. However, my PC is so slow that it executes on time |
By using events, we can make sure the call to `controller.abort()` doesn't happen before the next iteration starts. Fixes: nodejs#37226
I think that's a (pretty severe TBH) real bug in timers and we should fix the underlying issue. @Fishrock123 do you remember why the timer lists are batched by the timeout (the parameter passed) and not the deadline (when they should execute)? |
@benjamingr I think it's a known limitation. I imagine it was done this way for efficiency. As the docs say (emphasis added):
|
Please re-read the comment.
This is untrue. There are guarentees about their ordering, as I stated, if the ordering changes from how it was several test cases will fail. That also doesn't mean timers are ordered in the way one might expect, though. I don't know why people keep using that comment as the source of truth when it is anything but that. It is simply a warning to unsuspecting users. We are not really those unsuspecting users.
This sounds familiar, I wonder if there was code at some point before the binary heap was moved into JS (when each list was backed by a separate uv handle) that dealt with this. I suggest searching through the old timers issues and PRs and seeing if anything similar exists. |
Fixes: #37226 PR-URL: #37230 Reviewed-By: Benjamin Gruenbaum <[email protected]>
The issue has existed since the ring was first conceived of. There's no perfect solution here because we either need to sort all timers to get this level of accuracy, which is expensive, or we sort on some granularity level, which introduces some amount of supposedly acceptable accuracy-loss. The example given is very simplified but the reality is any n-th duration could have the actual next expiring timer. Binary heap here just replaced there being a handle per timers list and requiring constant hand-off back to libuv. |
Unfortunately, this test is unreliable and is failing in CI on Raspberry Pi devices. I also can make it fail trivially on my macOS laptop by running
tools/test.py -j96 --repeat=192 test/parallel/test-timers-promisified.js
. I haven't looked closely (yet) but in my experience, having a magic numberdelay
variable like this in a timers test indicates an assumption that the host isn't so slow that (say) a 10ms second timer won't fire 50ms later. This assumption is, of course, incorrect if the machine has low CPU/memory (like a Raspberry Pi device) or if there are a lot of other things competing for the machine's resources (-j96 --repeat=192
).Originally posted by @Trott in #37153 (comment)
The text was updated successfully, but these errors were encountered: