Mocha's hanging CI browser tests #2890

ScottFreeCode · 2017-06-18T05:20:09Z

Of late, there has been a failure in CI that looks like a browser flake... except that it's far too consistent to be a flake. Doesn't always happen, but usually does, and always the same browsers (and usually the same point in the test). It seems like the browser hangs or something -- Sauce disconnects with a timeout of 20,000ms partway through the testsuite.

It's blocking #2886, notably, but some testing determined that it affected the commits there even before coverage was actually added, so it seems to be something that coverage makes more likely to happen but isn't the sole cause.

I have had a... cryptic breakthrough, inasmuch as while I was experimenting with something else I accidentally wound up reproducing a hang in PhantomJS with Node 8.0.0 at the same points where Sauce is disconnecting. I do not know what about my local experiment triggerred this, and am going to move further experimentation to a new checkout of Mocha to ensure this one stays where it can reproduce the hangy issue.

Here's what I have been able to discover:

In Travis, it occurs for Edge, IE7 and Safari.
It seems like it may happen more consistently with my PR of adding coverage to Karma runs, but it has happened to builds that lack those changes as well (e.g. master, other PRs).
Both in travis and locally, it happens after printing the success result for should format functions saved in windows style - spaces and for the last unit test.
Adding console.logs to later tests indicates that, when it hangs after the "- spaces" test, not only is the next test started, but many more tests are actually run even though no more get their results printed; however, at some point there is a test that is never console.logged, thus indicating that it also stops running tests somewhere along the line.
The point at which it actually stops running tests does not seem to be consistent.
Adding console.logs may affect the point where it stops running tests and/or whether the issue occurs at all. It's hard to tell for sure. Sometimes it changes after I add a console.log and then changes back when I remove that console.log; other times it changes after I add a console.log and does not change back when I remove it.
On the theory that the continued running of tests past the point where the reporter ceases output suggest that output is somehow gummed up and perhaps the system hangs for real when buffers waiting for the output to ungum finally fill up, I tried upgrading the two things that I presume affect output: karma and karma-spec-reporter. Results were much like adding console.logs: it seems like these upgrades may affect whether and when the issue occurs, but niether one (nor even both together) consistently eliminates the issue.
By saving the Browserify bundle of the test files, I see that the tests when this issue occurs (where I've been adding console.logs) is not minified or obfuscated, so it is unlikely to be due to a bug in bundling that the exact code affects nondeterministically. At least, I think so anyway.

I feel just as stuck as before, but maybe this information will ring a bell to somebody.

The text was updated successfully, but these errors were encountered:

ScottFreeCode · 2017-06-19T04:21:11Z

I have reason to suspect that this might have something to do with console.error and/or util.inspect, or some such formatting behavior under the hood, in Node 8: I get some similar results trying to util.inspect Mocha suite/test objects in a Karma run, and trying to console.error Mocha suite/test objects in a Node (!) run. (Was using .error purely to be able to differentiate it from other output on stdout, but that's a long story; I should double-check whether I get these behaviors with .log too.)

Looked at the Node issue tracker and didn't find any issues about util.inspect that sound both similar and still unfixed/recent; may need to scour for console.<level> issues too.

ScottFreeCode · 2017-06-26T16:55:03Z

#2903 would help debug this (no pun intended).

I believe I have found the root cause of the issue.

mocha/lib/context.js

Line 109 in 2bb2b9f

return JSON.stringify(this, function (key, val) {

sometimes throws (I believe we have at least one other use of JSON.stringify that could do the same).

mocha/lib/runner.js

Line 713 in 2bc9c4d

if (runnable.state || runnable.isPending()) {

suppresses some of these. (We should read it closely and double-check that there are no other cases in this function where insufficient/no info is printed.) @mochajs/core Does anyone remember why these errors are being suppressed? It strikes me as dangerous in general (this issue is a perfect example) and, without knowing the details yet, as probably not the right solution for whatever it was meant to solve: we shouldn't be emitting spurious errors in the first place that would need to be suppressed, or emitting the same error twice, or anything like that (and even if we were doing so and really could not fix it, there's got to be a better way to determine that that's the case for an individual error so as to avoid suppressing legit errors).

I think fixing the latter (or at least allowing us to shut it off during CI runs) may give us the info we're missing from these browser failures, allowing us to confirm or deny that they're the former issue and if they're not then to get some idea of what they are instead.

boneskull · 2017-07-26T06:21:58Z

Does anyone remember why these errors are being suppressed? It strikes me as dangerous in general (this issue is a perfect example) and, without knowing the details yet, as probably not the right solution for whatever it was meant to solve: we shouldn't be emitting spurious errors in the first place that would need to be suppressed, or emitting the same error twice, or anything like that (and even if we were doing so and really could not fix it, there's got to be a better way to determine that that's the case for an individual error so as to avoid suppressing legit errors).

So, this is two different things. I'll explain the intent:

If `runnable.state` is truthy, the `Runnable` has already completed its test.

I think it's because of this:

it('should malinger', function (done) {
  setTimeout(() => {
    done();
    throw new Error('maybe I should have just returned?');
  });
});

So done() has marked the test OK, but we're still running code, and that code happens to throw an exception.

IMO, that should throw an exception (and probably exit) instead of eat it, but I haven't confirmed if that's what's happening here.

If `runnable.pending` is truthy, the `Runnable` has already been skipped.

This is necessary because of how an async test can skip itself:

it('should skip, eventually', function (done) {
  setTimeout(() => {
    this.skip();
    done(); // never called!
  }, 500);
});

If a test is async, it must call done with or without an Error as its first parameter. this.skip() is then shorthand for done(new Pending()). A Pending instance is of course an Error.

However, just like with calling this.skip() in a synchronous test:

it('should skip', function () {
  this.skip();
  throw new Error('never thrown');
});

...we must immediately abort test execution. The Error above is never thrown, because this.skip() in fact throws a Pending. In the synchronous case, this.skip() is shorthand for throw new Pending().

Likewise, in the async case, we call done() with the Pending, but we also must abort synchronous execution, since we are not guaranteed to return immediately after the call to done()! So we call done(new Pending()), but we also must throw to interrupt.

Happy to discuss better implementations or strategies for this sort of thing. We could abort execution upon done()--so anything called thereafter would never get called, like in the "pending" case--but this is likely to break tests, and adds more black magic where there's already too much.

boneskull · 2017-07-26T06:34:48Z

I also realize "exceptions are not flow control" are wise words. We can abuse exceptions to halt a test immediately if the user decides it should be skipped, but outside of Zones or async_wrap there's really no way to be sure we've aborted everything:

it('life is pain', function (done) {
  let i = 0;
  setInterval(() => {
    console.log('ha ha ha');
    if (++i > 5) {
      throw new Error('^_^');
    }
  }, 1000);
  setTimeout(() => {
    this.skip();
  }, 500);
});

Throwing a Pending is just a "best effort" for a more common situation. And it can theoretically be problematic! If we find out that one or both of these are causing problems, I'd probably want to soft-deprecate the behavior then eventually hard-deprecate--unless we can leverage async_wrap/zones to ensure there are no remaining tasks in a Runnable.

ScottFreeCode · 2017-09-14T01:32:16Z

So, I just now had a chance to sit down and review this one. If I understood this accurately:

it("should fail, but is suppressed", function(done) {
  setTimeout(function() {
    done()
    throw new Error("Gets thrown but ignored")
  }, 5)
})

...will be marked as passed even though it throws. Assuming that's accurate:

I'd consider this a bug and advocate for stopping suppression of that error (i.e. instead it should be failed with the thrown error -- even if after already printing that the test passed -- rather than considered just plain passed and nothing more). Furthermore, doing so by adding yet another special case would be inferior to doing so by removing a special case.

The main problem with fixing it is that this will break any tests that were incorrectly passing due to that pattern; it's technically a bugfix but we could save it for version 4 to minimize complaints. If we were to save it for verson 4, I would want to start working on a version 4 branch so that we can get these suppressions corrected there, use version 4 (in development) to diagnose the flake/hang, and then fix the flake/hang in master and add the browser coverage etc. so that before we release 4 we can pull in the browser coverage stuff and improve how sure we are of its correctness (especially if we're dropping a bunch of shims, although correspondingly dropping browsers will make it a little easier on the other hand). (Either way, eventually having more browser coverage will help us determine if there are any remaining shims or environment special-cases we can drop.)

it("synchronous pending", function() {
  this.skip()
  throw new Error("Should never be called")
})

it("asynchronous pending", function(done) {
  var test = this
  setTimeout(function() {
    test.skip()
    throw new Error("Should never be called")
  }, 5)
})

Both of these should behave the same.
They should be marked pending.
The throw should not be executed; not just ignored, but not executed, i.e. if there are side-effects they will not occur. (This behavior may be "magic" considering further asynchronous action cannot be stopped unless Zones turn out to not have the problems Domains have* and we integrate them. Allowing tests to continue executing after this.skip for consistency without Zones would be a breaking change we may consider for a future major bump.)
All of that is achieved in the current implementation; however...
The synchronous one is implemented as throw new Pending()
The asynchronous one is implemented as done(new Pending()); throw <ignored error>

Question: assuming the above is accurate, should we not be using throw new Pending() in both implementations and having both the synchronous failure handling and the asynchronous failure handling go through the same "treat Pending as skip rather than failure" branching code?

NB: I have not tried running those tests and confirming their behavior then changing the code and seeing how it changes their behavior. Just wrapping my head around the ideas in the comment.

* Last I checked, Domains made it into Node only to end up deprecated because people found problems with them, but Node won't remove them till there's a suitable replacement, and Zones have been proposed as that replacement but nobody can find info on how they avoid the problems Domains have. I don't have a citation or anything, that's just all the info I was able to bring up on Google last time I tried.

boneskull · 2017-09-27T04:57:35Z

honestly I think this has to do with a sauce connect problem and nothing in our codebase. basically I think we gotta go back to running all of the browsers at once.

boneskull · 2017-09-27T05:49:35Z

@ScottFreeCode see these commits

I think what's happening is this:

make test-browser-whatever runs
karma starts
karma opens a sauce connect proxy with the job ID as tunnel identifier
karma runs tests & exits, proxy with it
make test-browser-whatevers-next runs
karma opens a sauce connect proxy with the same job ID as tunnel identifier
due to timing issues, the tunnel id may now be considered invalid by saucelabs, because the previous karma process killed the sauce connect process.
fails to launch browser

so that's my theory. the fix would then be:

allow travis (not karma) to open the sauce connect proxy
whenever karma starts, use the job ID as the tunnel identifier
meanwhile, travis should leave the sauce connect process running after karma completes a set of tests

a bonus is that awhile back I talked with SauceLabs and had them increase our concurrency. when we split up the browser targets in the past, it was because we were running into problems. now, we should not have problems running all browsers we want in one fell swoop.

granted, that fell swoop contains multitudes of test-browser-* targets... but either way, if this works, CI will run quicker.

boneskull · 2017-09-27T05:54:38Z

fwiw, this build

boneskull · 2017-09-27T17:59:48Z

I understand your concerns about this behavior, but I'm not convinced it's causing any real problems in our build.

This example:

it("should fail, but is suppressed", function(done) {
  setTimeout(function() {
    done()
    throw new Error("Gets thrown but ignored")
  }, 5)
})

Isn't a case that Mocha should be handling, IMO. If code calling done() then doing more stuff in any callback-based JS, then the code is either a) wrong, or b) too clever.

So, as you said, Mocha is eating the error, and to let it bubble, we'd have to add more special cases. I don't think this is necessary. Once you've called done(), that should be the end of the test, and whatever happens after that--which is hopefully nothing--is beyond the scope of Mocha.

Question: assuming the above is accurate, should we not be using throw new Pending() in both implementations and having both the synchronous failure handling and the asynchronous failure handling go through the same "treat Pending as skip rather than failure" branching code?

We call done() with the Pending error, which effectively marks the Runnable as pending, because the only way to mark a Runnable as anything in an async test is to call done(). Yet, the only way to abort execution of the current stack is to throw, so to abort, we throw as well (I could see an argument for throwing Pending in the body of done, but this is just another way of doing the same thing).

If we were to change this behavior, I wouldn't suggest to not attempt to abort, even though this only covers the general case. See:

it('should be a problem', function (done) {
  process.nextTick(() => {
    throw new Error('gotcha'); // thrown, but where does it go?
  });
  this.skip(); // calls done() and finishes up
  throw new Error('nope nope nope nope nope'); // not thrown
});

So, if we aren't going to abort, what should we do instead? It would involve changing the API dramatically. Open to suggestions about what this should really look like, but it should require the user to either return from an synchronous Runnable or otherwise notify an async Runnable that it's pending. In a perfect world?

import {describe, it} from 'mocha';

it('is sync pending', test => test.pending); // or `test.pending();` or `test.pending = true; return;`, etc.

it('is async pending', (test, done) => {
  done(test.pending);
});

it('is promise pending', test => {
  return myPromise.then(() => test.pending);  // or `await myPromise; return test.pending;`
});

At any rate, I'm not really convinced this is the source of browser flake, but this could definitely be improved.

ScottFreeCode mentioned this issue Jun 26, 2017

Print debug logs in CI #2903

Closed

ScottFreeCode added the qa label Jun 26, 2017

ScottFreeCode mentioned this issue Jun 30, 2017

Duplicated failing test summary #2906

Closed

ScottFreeCode mentioned this issue Jul 8, 2017

Hostile Custom Runner API - Silent crash #2777

Closed

ScottFreeCode added the area: repository tooling concerning ease of contribution label Aug 25, 2017

ScottFreeCode mentioned this issue Aug 25, 2017

Cover Browser Tests (and all Node versions) #2886

Closed

boneskull assigned ScottFreeCode Sep 3, 2017

boneskull mentioned this issue Sep 27, 2017

consolidate saucelabs tests to address flake #3019

Merged

boneskull removed the area: repository tooling concerning ease of contribution label Sep 27, 2017

boneskull self-assigned this Sep 28, 2017

boneskull closed this as completed in #3019 Sep 28, 2017

TimvdLippe mentioned this issue Oct 22, 2017

🐛 Bug: HTML report pass/fail filters do not work correctly while testsuite runs #2947

Open

ScottFreeCode mentioned this issue Nov 22, 2017

Error thrown between it clauses disappears #3113

Closed

dependencies bot mentioned this issue Dec 4, 2017

Update mocha in / from 3.4.2 to 4.0.0 JetBrains/ring-ui#20

Merged

dependencies bot mentioned this issue Jan 24, 2018

Update mocha in /packages/generator/app/templates from 3.1.2 to 5.0.0 JetBrains/ring-ui#85

Merged

renovate bot mentioned this issue Feb 14, 2018

Update dependency mocha to v5 Urigo/tortilla#64

Merged

renovate bot mentioned this issue Nov 1, 2018

Update dependency mocha to v5 justinbarry/augur-copy#93

Merged

This was referenced Nov 14, 2018

Update dependency mocha to v5 hairyhenderson/nursery_alert#17

Merged

fix(deps): update dependency @types/mocha to v5 - autoclosed stefanwalther/boilerplate-typescript-rest-docker#18

Closed

renovate bot mentioned this issue Nov 23, 2018

chore(deps): update dependency mocha to v5 JLHwung/tianzige#37

Merged

This was referenced Dec 1, 2018

chore(deps): update dependency mocha to v5 - autoclosed stefanwalther/qix-graphql#66

Closed

Update dependency mocha to v5 softbrix/stureby-index#4

Merged

Update dependency mocha to v5 Automattic/jetpack#10877

Merged

This was referenced Dec 8, 2018

Update dependency @types/mocha to v5 - autoclosed tomyitav/apollo-typed-lambda#12

Closed

Update dependency mocha to v5 softbrix/dibba-tree#5

Merged

This was referenced Dec 23, 2018

Update dependency mocha to v5 - autoclosed mateodelnorte/microsvc-slave#7

Closed

chore(deps): update dependency mocha to v5 status-board/status-board-cli#175

Closed

snyk-bot mentioned this issue Jul 10, 2020

[Snyk] Security upgrade karma from 1.7.1 to 2.0.0 Deckhandfirststar01/mocha#20

Open

snyk-bot mentioned this issue Aug 17, 2020

[Snyk] Security upgrade karma from 1.7.1 to 2.0.0 Deckhandfirststar01/mocha#22

Open

snyk-bot mentioned this issue Aug 28, 2020

[Snyk] Security upgrade karma from 1.7.1 to 2.0.0 Deckhandfirststar01/mocha#23

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mocha's hanging CI browser tests #2890

Mocha's hanging CI browser tests #2890

ScottFreeCode commented Jun 18, 2017 •

edited

Loading

ScottFreeCode commented Jun 19, 2017

ScottFreeCode commented Jun 26, 2017

boneskull commented Jul 26, 2017 •

edited

Loading

boneskull commented Jul 26, 2017 •

edited

Loading

ScottFreeCode commented Sep 14, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017 •

edited

Loading

Mocha's hanging CI browser tests #2890

Mocha's hanging CI browser tests #2890

Comments

ScottFreeCode commented Jun 18, 2017 • edited Loading

ScottFreeCode commented Jun 19, 2017

ScottFreeCode commented Jun 26, 2017

boneskull commented Jul 26, 2017 • edited Loading

If runnable.state is truthy, the Runnable has already completed its test.

If runnable.pending is truthy, the Runnable has already been skipped.

boneskull commented Jul 26, 2017 • edited Loading

ScottFreeCode commented Sep 14, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017

boneskull commented Sep 27, 2017 • edited Loading

ScottFreeCode commented Jun 18, 2017 •

edited

Loading

boneskull commented Jul 26, 2017 •

edited

Loading

If `runnable.state` is truthy, the `Runnable` has already completed its test.

If `runnable.pending` is truthy, the `Runnable` has already been skipped.

boneskull commented Jul 26, 2017 •

edited

Loading

boneskull commented Sep 27, 2017 •

edited

Loading