
Research: Why do non-Sauce Labs browsers report fewer than 100% of results? #478

Open
mariestaver opened this issue Feb 20, 2018 · 15 comments
Labels
next scheduled for next sprint project:runner

Comments

@mariestaver
Collaborator

No description provided.

@jugglinmike
Collaborator

As discussed previously, the term "completeness" can be interpreted in a few ways. If we define "completeness" in terms of "tests executed" (as opposed to "subtests executed"), we can calculate a score for a given result set procedurally. By that coarse-grained metric, we are now consistently collecting results for 100% of tests in both Chrome and Firefox. For Safari, we are collecting results for 99% of the available tests.

Most of the missing results are for "wdspec" (WebDriver specification) tests. This is expected because WPT does not run wdspec tests for Sauce Labs-mediated browsers. The relevant code is here, and you can verify experimentally by running ./wpt run --list-tests sauce:foo:bar. Although we could technically enable these tests, I don't think it would be in the interests of wpt.fyi to do so. That's because we cannot control the version of the WebDriver binary used in that context, so we can't authoritatively state what the results are actually describing. (A bit of conversation along these lines is available in the IRC logs of the #testing channel.)
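
For a rough sense of how many tests that exclusion covers, something like the following could tally manifest entries by test type. This is only a sketch of mine, assuming a MANIFEST.json generated by ./wpt manifest whose "items" object is keyed by test type:

# Sketch: count manifest entries per test type, e.g. to see how many wdspec
# files the Sauce Labs exclusion skips. Assumes `./wpt manifest` has been run
# and that MANIFEST.json keeps an "items" object keyed by test type, mapping
# file paths to the tests they contain.
import json

with open('MANIFEST.json') as handle:
    manifest = json.load(handle)

for test_type, entries in sorted(manifest['items'].items()):
    print('%-12s %d files' % (test_type, len(entries)))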

For all trials under consideration here, only a single "chunk" omits results. When we ran Safari with chunk 81 of 100 of WPT at revision 709111adbc, we failed to collect results for 6 tests. According to the logs currently available at http://builds.wpt.fyi, the Sauce Connect tunnel closed abruptly near the end of that trial, which caused the WPT CLI to finish results collection without recording results for the six tests that remained. I can't explain why this occurred; the test being executed at the time was a nondescript reference test.

@foolip I think this puts to rest our concerns about high-level completeness. As we've been discussing, understanding subtest-level completeness is much trickier. Not only is the expected number impossible to determine procedurally, it can also be influenced by implementation status (meaning disparities do not always reflect an error in results collection). Possible explanations include:

...and of course, there is the possibility of still other classes of errors that we have not yet identified.

Due to the uncertainty here, I'm still interested in researching completeness at the subtest level. Following the introduction of experimental Firefox and Chrome, I'd like to spend 3-5 days investigating the results in those terms. I'd organize my findings into a report so that we could understand the scope of the problem and what fixing it would look like.

"Completeness" script
#!/bin/bash

shas='162e9dda47
2b3f901276
c218fe33f4
1331d68a18
6e2b4a77cb
7164dbb89f
939b30a8b2
24f7e6d2f6
af485dcf5f
3589b85af3
fc33be9acf
21be95d426
067f918f9a
d01ce1055e
709111adbc
149116dc79
a87d0554fa
55846d56d8
bc7c640b39
42f43956d4
6fca0b2cd6
383fd735a5
45d92ebc55
e96e038f14'

platforms='firefox-59.0-linux chrome-64.0-linux safari-11.0-macos-10.12-sauce'

echo sha,expected,platform,actual,completeness
for sha in $shas; do
	git checkout $sha -- > /dev/null 2>&1
	# Number of tests WPT would run at this revision; Firefox is used here
	# only as a representative product for listing purposes
	tests_expected=$(./wpt run --list-tests firefox | grep -E '^/' | wc -l)

	for platform in $platforms; do
		summary_url=https://storage.googleapis.com/wptd/$sha/$platform-summary.json.gz

		results=$(curl --fail $summary_url 2> /dev/null)

		if [ $? != '0' ]; then
			continue
		fi

		# Number of tests with recorded results (the summary object is keyed by test path)
		count=$(echo "$results" | python -c 'import json; import sys; print len(json.load(sys.stdin))')
		pct=$(python -c "print 100 * float($count) / $tests_expected")

		printf '%s,%s,%s,%s,%s\n' $sha $tests_expected $platform $count $pct
	done
done
"Completeness" script results
sha expected platform actual completeness
162e9dda47 25764 firefox-59.0-linux 25764 100.0
2b3f901276 25764 firefox-59.0-linux 25764 100.0
2b3f901276 25764 chrome-64.0-linux 25764 100.0
c218fe33f4 25764 firefox-59.0-linux 25764 100.0
c218fe33f4 25764 chrome-64.0-linux 25764 100.0
c218fe33f4 25764 safari-11.0-macos-10.12-sauce 25709 99.7865238317
1331d68a18 25744 firefox-59.0-linux 25744 100.0
1331d68a18 25744 chrome-64.0-linux 25744 100.0
6e2b4a77cb 25748 firefox-59.0-linux 25748 100.0
6e2b4a77cb 25748 chrome-64.0-linux 25748 100.0
7164dbb89f 25748 firefox-59.0-linux 25748 100.0
7164dbb89f 25748 chrome-64.0-linux 25748 100.0
939b30a8b2 25741 firefox-59.0-linux 25741 100.0
939b30a8b2 25741 chrome-64.0-linux 25741 100.0
24f7e6d2f6 25740 firefox-59.0-linux 25740 100.0
24f7e6d2f6 25740 chrome-64.0-linux 25740 100.0
af485dcf5f 25763 firefox-59.0-linux 25763 100.0
af485dcf5f 25763 chrome-64.0-linux 25763 100.0
3589b85af3 25750 firefox-59.0-linux 25750 100.0
3589b85af3 25750 chrome-64.0-linux 25750 100.0
3589b85af3 25750 safari-11.0-macos-10.12-sauce 25695 99.786407767
fc33be9acf 25744 firefox-59.0-linux 25744 100.0
fc33be9acf 25744 chrome-64.0-linux 25744 100.0
21be95d426 25745 firefox-59.0-linux 25745 100.0
21be95d426 25745 chrome-64.0-linux 25745 100.0
067f918f9a 25745 firefox-59.0-linux 25745 100.0
067f918f9a 25745 chrome-64.0-linux 25745 100.0
d01ce1055e 25745 firefox-59.0-linux 25745 100.0
d01ce1055e 25745 chrome-64.0-linux 25745 100.0
709111adbc 25742 firefox-59.0-linux 25742 100.0
709111adbc 25742 chrome-64.0-linux 25742 100.0
709111adbc 25742 safari-11.0-macos-10.12-sauce 25681 99.7630331754
149116dc79 25742 firefox-59.0-linux 25742 100.0
149116dc79 25742 chrome-64.0-linux 25742 100.0
a87d0554fa 25739 firefox-59.0-linux 25739 100.0
a87d0554fa 25739 chrome-64.0-linux 25739 100.0
55846d56d8 25739 firefox-59.0-linux 25739 100.0
55846d56d8 25739 chrome-64.0-linux 25739 100.0
bc7c640b39 25738 firefox-59.0-linux 25738 100.0
bc7c640b39 25738 chrome-64.0-linux 25738 100.0
42f43956d4 25723 firefox-59.0-linux 25723 100.0
42f43956d4 25723 chrome-64.0-linux 25723 100.0
6fca0b2cd6 25720 firefox-59.0-linux 25720 100.0
6fca0b2cd6 25720 chrome-64.0-linux 25720 100.0
383fd735a5 25717 firefox-59.0-linux 25717 100.0
383fd735a5 25717 chrome-64.0-linux 25717 100.0
45d92ebc55 25712 firefox-59.0-linux 25712 100.0
45d92ebc55 25712 chrome-64.0-linux 25712 100.0
45d92ebc55 25712 safari-11.0-macos-10.12-sauce 25657 99.7860920971
e96e038f14 25710 firefox-59.0-linux 25710 100.0
e96e038f14 25710 chrome-64.0-linux 25710 100.0
Missing results identification script
#!/bin/bash

shas='c218fe33f4
3589b85af3
709111adbc
45d92ebc55'

platform='safari-11.0-macos-10.12-sauce'

for sha in $shas; do
  git checkout $sha -- > /dev/null 2>&1 || exit 1
  expected_tests=$(./wpt run --list-tests firefox | grep -E '^/')
  summary_url=https://storage.googleapis.com/wptd/$sha/$platform-summary.json.gz
  results=$(curl --fail $summary_url 2> /dev/null)
  actual_tests=$(echo "$results" | python -c 'import json; import sys; print "\n".join(json.load(sys.stdin).keys())')

  echo $sha
  comm -23 <(echo "$expected_tests" | sort) <(echo "$actual_tests" | sort)
  echo
done
Missing results from recent Safari trials

"Missing" wdspec tests have been removed.

  • c218fe33f4
    • (only missing wdspec test results)
  • 3589b85af3
    • (only missing wdspec test results)
  • 709111adbc
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas-fallback.html
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas_scale.html
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas_without_context_a.html
    • /shadow-dom/untriaged/shadow-trees/shadow-root-001.html
    • /shadow-dom/untriaged/shadow-trees/shadow-root-002.html
    • /shadow-dom/untriaged/shadow-trees/text-decoration-001.html
  • 45d92ebc55
    • (only missing wdspec test results)
Excerpt of logs at the time of Safari crash
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004e.html 2d6d86d606d483e788e5835321b636c816e4cb20
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004-ref.html 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
Testing 2d6d86d606d483e788e5835321b636c816e4cb20 == 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
10:07.97 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004i.html
10:07.97 INFO Test requires OS-level window focus
10:09.70 TEST_END: PASS
10:09.70 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004o.html
10:09.70 INFO Test requires OS-level window focus
10:12.43 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004o.html 2d6d86d606d483e788e5835321b636c816e4cb20
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004-ref.html 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
Testing 2d6d86d606d483e788e5835321b636c816e4cb20 == 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
10:12.43 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004p.html
10:12.43 INFO Test requires OS-level window focus
10:13.66 TEST_END: PASS
10:13.66 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005e.html
10:13.66 INFO Test requires OS-level window focus
10:16.19 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005e.html b5bfc85c20bb8859bdcb704ca166d146e2eda577
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005-ref.html 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
Testing b5bfc85c20bb8859bdcb704ca166d146e2eda577 == 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
10:16.19 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005i.html
10:16.19 INFO Test requires OS-level window focus
30 Mar 04:41:41 - Cleaning up.
30 Mar 04:41:41 - Removing tunnel 0f9289a63a83408ab77c0916a5868405.
30 Mar 04:41:41 - Waiting for any active jobs using this tunnel to finish.
30 Mar 04:41:41 - Press CTRL-C again to shut down immediately.
30 Mar 04:41:41 - Note: if you do this, tests that are still running will fail.
30 Mar 04:41:41 - Number of jobs using tunnel: 1.
10:20.98 TEST_END: PASS
10:20.98 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005o.html
10:20.98 INFO Test requires OS-level window focus
10:24.91 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005o.html b5bfc85c20bb8859bdcb704ca166d146e2eda577
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005-ref.html 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
Testing b5bfc85c20bb8859bdcb704ca166d146e2eda577 == 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
10:24.91 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005p.html
10:24.91 INFO Test requires OS-level window focus
30 Mar 04:41:47 - Number of jobs using tunnel: 1.
10:26.06 TEST_END: PASS
10:26.06 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-006e.html
10:26.07 INFO Test requires OS-level window focus
10:29.47 TEST_END: FAIL, expected PASS

@foolip
Member

foolip commented Apr 11, 2018

Excellent writeup hiding in my inbox, thanks for bringing it to my attention!

On wdspec, I agree that trying to run them over Sauce doesn't make sense, and that we can just not run them using --test-types=reftest,testharness if they aren't skipped by other means.

For Sauce, is each chunk not being retried 3 times? Does this mean that the Sauce tunnel was closed all three times for those incomplete Safari runs?

"Undetected browser/WebDriver crashes" seems like it would have to result in missing results and not partial subtest, results, right? Addressing that and anything else that results in less than 100% manifest-test completion (modulo wdspec) seems important to me. Also bugs in wpt run like web-platform-tests/wpt#9007, but perhaps as part of investigating the possibility of wpt run safari to replace Sauce?

On testharness.js subtest completion, I think I would rank the issues roughly in order of tests fixed per time spent fixing, so:

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • Mismatches in subtest count/name where no browser has any failing test or harness error
  • Harness errors and timeouts affecting only some browsers

At the end of this list is roughly where I would put investigating subtest mismatches that also come with test failures/timeouts/etc. or harness errors. If we could count them (we can!) then it'd be easier to say how big the pile is, but my hunch is that it's big and only ~50-80% of the cases are ones we'd really want to fix, i.e. there's no way to lock in the wins.
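
Counting them could be fairly mechanical; as a rough sketch (assuming each wptd summary file maps a test path to a [passing, total] pair, which is only my reading of the format):

# Rough sketch: count tests whose total sub-test counts disagree between two
# summary files. Format assumed: test path -> [passing, total].
import json
import sys

def load(path):
    with open(path) as handle:
        return json.load(handle)

first, second = load(sys.argv[1]), load(sys.argv[2])
shared = set(first) & set(second)
mismatched = [test for test in shared if first[test][1] != second[test][1]]
print('%d of %d shared tests have differing sub-test totals'
      % (len(mismatched), len(shared)))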

Related to locking in the wins, fixing all harness errors and making Travis block on harness errors seems like a decent opportunity.
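
Such a gate could be small; a hypothetical sketch, assuming a wptreport-style JSON file whose top-level "results" list carries "test" and "status" fields:

# Hypothetical CI gate (not an existing wpt/Travis feature): exit non-zero if
# any test in a wptreport-style JSON file finished with a harness ERROR.
import json
import sys

with open(sys.argv[1]) as handle:
    report = json.load(handle)

errors = sorted(entry['test'] for entry in report['results']
                if entry['status'] == 'ERROR')

for test in errors:
    print(test)

sys.exit(1 if errors else 0)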

@jugglinmike
Collaborator

On wdspec, I agree that trying to run them over Sauce doesn't make sense, and
that we can just not run them using --test-types=reftest,testharness if
they aren't skipped by other means.

The source code I linked to implements this for Sauce Labs-powered browsers. To
be clear: continuing to rely on that means that we're okay with non-uniform
results for WebDriver. Given what we've said about the relevance of Sauce Labs
results, our only option for uniformity would be to skip WebDriver tests on
platforms where we can run them. I think having non-uniform data is
preferable to having uniform non-data, in this case anyway :)

For Sauce, is each chunk not being retried 3 times? Does this mean that the
Sauce tunnel was closed all three times for those incomplete Safari runs?

That recovery mechanism went by the wayside when I introduced Buildbot. Today,
I perform that process manually via the web UI to recover from failures.

"Undetected browser/WebDriver crashes" seems like it would have to result in
missing results and not partial subtest results, right?

Yup. This is the source of the consistent manifest-level incompleteness when
running Edge on Sauce Labs. The output below shows how a test that interacts
with window.opener destroys a window which WPT relies on. The WPT CLI regards
this as an "ERROR", and even though it invalidates all further testing, the
--no-restart-on-unexpected flag disables the recovery mechanism (that's what I
proposed we change in web-platform-tests/wpt#10186).

Addressing that and anything else that results in less than 100%
manifest-test completion (modulo wdspec) seems important to me.

I have not been able to reproduce this locally in a Windows VM, so it may be
tied to details of the Sauce Labs communication. That would mean the fix may
not benefit other browsers, and it may not even be relevant for our purposes
for very long. I'm happy to dig into that problem, but I want to make sure
we're on the same page about that before doing so.

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • Mismatches in subtest count/name where no browser has any failing test or
    harness error
  • Harness errors and timeouts affecting only some browsers

@rwaldron tells me that the referenced report was created at the end of last
year. Lacking access to the storage bucket, he was working from the limited
dataset he was able to build from rate-limited web requests. I'm not knocking
Rick's work, but instead explaining why that report will need to be refreshed.
And that's really all I'm proposing for this initial effort. I'd like to have
a clearer understanding of the distribution of these problems, extended with
some surface-level information about possible explanations/fixes for any
clusters we find.

Related to locking in the wins, fixing all harness errors and making Travis
block on harness errors seems like a decent opportunity.

The "ERROR" status is applied to tests which produce an uncaught exception.
This was a surprise to me when I first learned it because I always assumed that
"ERROR" described a condition where the test interfered with the harness's
functioning. We touched on this during our call today, but since @lukebjerring
mentioned being surprised by this, I thought I should call it out again. It may
be that there are many more "ERROR" results than we would otherwise expect.
That's not to say they aren't worth fixing, though!

Output from one of two consistently-incomplete Edge trials
--------------------------------------------------------
Expected PASS, got FAIL
assert_equals: The document should be an HTML document. expected (string) "text/html" but got (undefined) undefined
Check the document properties corresponding to the creator browsing context
---------------------------------------------------------------------------
Expected PASS, got FAIL
assert_equals: The document's referrer should be its creator document's address. expected "http://web-platform.test:8000/html/browsers/windows/browsing-context.html" but got ""
 4:03.03 TEST_START: /html/browsers/windows/noreferrer-null-opener.html
 4:08.48 TEST_END: ERROR, expected OK
Message: No such window (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 47 milliseconds
Build info: version: '2.52.0', revision: '4c2593c', time: '2016-02-11 19:06:42'
System info: host: 'sauce-win10', ip: '172.20.116.19', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: org.openqa.selenium.edge.EdgeDriver
Capabilities [{applicationCacheEnabled=true, acceptSslCerts=true, browserVersion=40.15063.0.0, platformVersion=10, locationContextEnabled=true, webStorageEnabled=true, browserName=MicrosoftEdge, takesScreenshot=true, pageLoadStrategy=normal, takesElementScreenshot=true, platformName=windows, platform=ANY}]
Session ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXX09A32B9E
Stacktrace:
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (None:-2)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance (None:-1)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance (None:-1)
    at java.lang.reflect.Constructor.newInstance (None:-1)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable (ErrorHandler.java:206)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed (ErrorHandler.java:158)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:678)
    at org.openqa.selenium.remote.RemoteWebDriver.executeAsyncScript (RemoteWebDriver.java:598)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (None:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke (None:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (None:-1)
    at java.lang.reflect.Method.invoke (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke (EventFiringWebDriver.java:103)
    at com.sun.proxy.$Proxy1.executeAsyncScript (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver.executeAsyncScript (EventFiringWebDriver.java:229)
    at org.openqa.selenium.remote.server.handler.ExecuteAsyncScript.call (ExecuteAsyncScript.java:58)
    at java.util.concurrent.FutureTask.run (None:-1)
    at org.openqa.selenium.remote.server.DefaultSession$1.run (DefaultSession.java:176)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
    at java.lang.Thread.run (None:-1)
 4:08.48 TEST_START: /html/browsers/windows/noreferrer-window-name.html
 4:09.21 TEST_END: ERROR, expected OK
Message: No such window (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 15 milliseconds
Build info: version: '2.52.0', revision: '4c2593c', time: '2016-02-11 19:06:42'
System info: host: 'sauce-win10', ip: '172.20.116.19', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: org.openqa.selenium.edge.EdgeDriver
Capabilities [{applicationCacheEnabled=true, acceptSslCerts=true, browserVersion=40.15063.0.0, platformVersion=10, locationContextEnabled=true, webStorageEnabled=true, browserName=MicrosoftEdge, takesScreenshot=true, pageLoadStrategy=normal, takesElementScreenshot=true, platformName=windows, platform=ANY}]
Session ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXX09A32B9E
Stacktrace:
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (None:-2)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance (None:-1)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance (None:-1)
    at java.lang.reflect.Constructor.newInstance (None:-1)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable (ErrorHandler.java:206)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed (ErrorHandler.java:158)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:678)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:701)
    at org.openqa.selenium.remote.RemoteWebDriver.getWindowHandle (RemoteWebDriver.java:555)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (None:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke (None:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (None:-1)
    at java.lang.reflect.Method.invoke (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke (EventFiringWebDriver.java:103)
    at com.sun.proxy.$Proxy1.getWindowHandle (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver.getWindowHandle (EventFiringWebDriver.java:210)
    at org.openqa.selenium.remote.server.handler.GetCurrentWindowHandle.call (GetCurrentWindowHandle.java:30)
    at org.openqa.selenium.remote.server.handler.GetCurrentWindowHandle.call (GetCurrentWindowHandle.java:1)
    at java.util.concurrent.FutureTask.run (None:-1)
    at org.openqa.selenium.remote.server.DefaultSession$1.run (DefaultSession.java:176)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
    at java.lang.Thread.run (None:-1)
 4:09.21 TEST_START: /html/browsers/windows/targeting-cross-origin-nested-browsing-contexts.html
 4:09.73 TEST_END: ERROR, expected OK

@foolip
Member

foolip commented Apr 11, 2018

I think having non-uniform data is
preferable to having uniform non-data, in this case anyway :)

Yes :)

web-platform-tests/wpt#10186

Commenting there.

And that's really all I'm proposing for this initial effort. I'd like to have
a clearer understanding of the distribution of these problems, extended with
some surface-level information about possible explanations/fixes for any
clusters we find.

Yeah, doing the work to figure out how many failures are in each of the interesting buckets, or other interesting ones you come up with, sounds very useful. Maybe it's easier starting from single JSON files per run, so starting to save those somewhere ASAP might be a good idea.

The "ERROR" status is applied to tests which produce an uncaught exception.
This was a surprise to me when I first learned it because I always assumed that
"ERROR" described a condition where the test interfered with the harness's
functioning. We touched on this during our call today, but since @lukebjerring
mentioned being surprised by this, I thought I should call it out again. It may
be that there are many more "ERROR" results than we would otherwise expect.
That's not to say they aren't worth fixing, though!

I think I may have become too acclimated to terrible things to see the problem here :) I don't actually know when "ERROR" is the test-level status, but that uncaught exceptions result in a harness error does make good sense to me. Without it, an async_test with code that throws outside of t.step_func would go nowhere. Now, for promise_test things are different because they run serially, so it is still confusing, but can it be made less bad?

Would a serious effort to get rid of harness errors help, or is the "ERROR" status something different?

@jugglinmike
Collaborator

Without it async_test with code not wrapped in t.step_func that throws
exceptions would go nowhere. Now, for promise_test things are different
because they run in serial, so it is still confusing, but can it be made less
bad?

Actually, I somehow forgot that async_test runs in parallel. In light of
that, I agree that a harness-level "ERROR" is the only way to communicate the
fact that something went wrong. (I just consider that another reason why
parallel testing is not such a good idea.)

Would a serious effort to get rid of harness errors help, or is the "ERROR"
status something different?

We won't be able to answer this by looking at the report data, and that's why
I mentioned getting the surface-level information on clusters of errors.

@foolip
Member

foolip commented Apr 11, 2018

I just consider that another reason why parallel testing is not such a good idea

I don't disagree, and making async_test run serially by default is something we could at least consider how to pull off, even though I don't have any ideas that I think would fly.

@foolip
Member

foolip commented Apr 11, 2018

We won't be able to answer this by looking at the report data, and that's why
I mentioned getting the surface-level information on clusters of errors.

Do you mean just looking to see in which directories most of the problems are? What answers couldn't we get by looking at the same information that is available to wpt.fyi?

@jugglinmike
Collaborator

Do you mean just looking to see in which directories most of the problems
are? What answers couldn't we get by looking at the same information that is
available to wpt.fyi?

I mean that because the "ERROR" status can be applied for a number of reasons, we'll need to review the specific conditions that produced it. Focusing on the groups of similar tests that report an ERROR is a way to avoid getting lost in the weeds.

I have to admit that I'm getting a little confused, though. I just discovered that uncaught exceptions are reported as "FAIL" for "single-page" tests. I can see why this would be the case, but it complicates the way I think about the different statuses.

Further complicating the classification is the fact that all tests are assumed to be single-page tests until they use a function like test or async_test. This means that even tests which are not intended to be single-page tests will be interpreted as such if they produce an uncaught error prior to defining sub-tests.

An example is geolocation-sensor/GeolocationSensor-enabled-by-feature-policy-attribute-redirect-on-load.https.html, which is reported as a "Failure" on wpt.fyi. Using only the data available there, we might say, "That is not a problem; it is simply a test failure which accurately describes implementation status." But if we agree that tests should not produce uncaught errors, then the truth is that the test needs to be re-written so that it runs to completion, defining (and failing) a number of sub-tests.

At this point, our conversation about the investigation is evolving into the investigation itself... Except I'm not being nearly as disciplined about my approach as the problem deserves. Maybe it's enough to say this justifies dedicating time for creating that report.

@jgraham
Collaborator

jgraham commented Apr 12, 2018

So, re: ERROR, I think there are two cases:

  • An error resulting from an in-browser condition (e.g. an uncaught exception)
  • An error resulting from a harness problem.

Currently wptrunner does not distinguish these cases internally, but it could (we do something similar for timeouts, distinguishing between the case where the js timeout fires and the case where we are still running a few seconds after that timeout ought to have fired). I don't think it would be hard to fix things so that harness problems always result in an INTERNAL-ERROR status inside the harness (which would still be reported as ERROR because changing the reporting is harder) and INTERNAL-ERROR would result in an unconditional restart.

@gsnedders
Member

gsnedders commented Apr 12, 2018

In the logs currently available at http://builds.wpt.fyi, the Sauce Connect Tunnel closed abruptly near the end of that trial, which caused the WPT CLI to finish results collection without recording results for the six tests that remained.

To me, this sounds like the bug is in wptrunner insofar as we don't handle the Sauce Connect Tunnel closing abruptly?

@jugglinmike
Collaborator

Here are some stats. This data is based on web-platform-tests at revision 6736e3f46b, where it had 26043 tests and 12308 testharness.js tests (those counts are for "manifest level" tests, as distinct from "sub-tests"). We're considering 6 sets of results in total: Chrome 65, Chrome 67 ("dev" channel), Firefox 59, Firefox 61 ("nightly" channel), Edge 15, and Safari 11.

I've included links to the relevant test results page on https://wpt.fyi, but bear in mind that we (currently) do not render the results for the experimental browsers. This means some of the inconsistencies described below won't be visible from those pages.

Working from @foolip's "most wanted" list above:

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • 5 tests report ERROR in all tested browsers
  • 42 tests report TIMEOUT in all tested browsers
  • (bonus) 82 tests report a mixture of ERROR and TIMEOUT in all tested browsers

That means 129 of the reported test results are suspect by these criteria. That's just 0.5% of all tests and 1% of testharness.js tests.

Many of these are symptoms of a recent regression in the WPT infrastructure. My patch to fix this in WPT directly was superseded by a similar patch that originated in Gecko. Since the former also included tests, I've re-submitted those here. With a fix in place, the details aren't particularly relevant anymore (but I've included them below in case anyone is curious). We can expect the next result set to be much improved.

Many of these issues are due to unguarded references to the API under test. There's some low-hanging fruit here (e.g. this small patch to make seven tests fail immediately), but there are also plenty of one-off problems.

We've always known that many tests are authored in such a way that the implementation status influences the number of sub-tests. To date, we've only discussed how this produces inconsistent denominators in results for the same test in different browsers. We haven't talked about how this same pattern can produce timeouts. Specifically, when zero sub-tests are produced in this way, the harness times out waiting for sub-tests. See, e.g. web-animations/animation-model/animation-types/interpolation-per-property.html

We could address these via:

 var features = cleverFunction();
+
+test(function() {
+  assert_true(features.length > 0);
+});

...but my preference would be to:

-var features = cleverFunction();
+var features = ['long', 'and', 'boring', 'list', 'of', 'features'];
  • Mismatches in subtest count/name where no browser has any failing test or
    harness error

435 tests satisfy this criterion. In an earlier comment, I mentioned how some tests are misclassified as "single-page tests". We can begin to filter those out: if we discard the cases where the inconsistency is due to a ReferenceError, the number drops to 374. Even then, there are some cases where property access on unrelated APIs is causing trouble. There are also some questionable patterns in those tests, where the expectations are defined in terms of the implementation status. In any case, I still think the tests need to be consistent in this regard.

  • Harness errors and timeouts affecting only some browsers

This describes 4643 tests. We know that some of these are due to a flaky connection to Sauce Labs, but it's hard to say how many. If we exclude Safari and Edge, then the number drops to 724. That's more tractable, though it almost certainly hides issues that are not related to the known network interference.

5 tests reporting ERROR in all browsers
42 tests reporting TIMEOUT in all browsers
82 tests inconsistently reporting either ERROR or TIMEOUT in all browsers
Analysis of failures/timeouts due to WebSocket regression

This issue affected tests outside of the websockets/ directory. Other web
platform features interact with WebSockets in non-trivial ways, so tests for
other features were invalidated by the removal of the server.

This configuration hiccup also effected some tests that had no relation to
WebSockets. That's due to the value substitution feature of the WPT server. For
example, the test
service-workers/service-worker/state.https.html
includes a JavaScript file named
resources/test-helpers.sub.js

which includes the following code:

function get_websocket_url() {
  return 'wss://{{host}}:{{ports[wss][0]}}/echo';
}

In the absence of the secure WebSocket server configuration, interpolating this
templated file causes a runtime Python error. That causes the WPT Server to
return an HTTP 500 response for this file. In other words, this bug effectively
removed that shared file.

Fortunately, that particular substitution does not appear to be very common:

$ find . -type f -name '*.sub.*' -print0 | xargs -0 grep -E '{.*wss'
./websockets/websocket.sub.js:var __SECURE__PORT = {{ports[wss][0]}};
./cookies/secure/set-from-wss.https.sub.html:    var ws = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/set-cookie-secure?secure_from_secure");
./cookies/secure/set-from-wss.https.sub.html:      var ws2 = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/echo-cookie");
./cookies/secure/set-from-ws.sub.html:      var ws2 = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/echo-cookie");
./upgrade-insecure-requests/support/testharness-helper.sub.js:    url.port = {{ports[wss][0]}};
./service-workers/service-worker/resources/test-helpers.sub.js:  return 'wss://{{host}}:{{ports[wss][0]}}/echo';

Although all of those references were invalidated, the one in
upgrade-insecure-requests appears to be the only other one that might affect
tests unrelated to WebSockets.

435 tests where no browser has a timeout or harness error, yet the total number of sub-tests still varies across browsers

@jugglinmike
Collaborator

Here's the script I wrote to analyze the results data:

analyze.py
import json
import os
import re
import sys

# reduce is a builtin in Python 2 but must be imported on Python 3
from functools import reduce


class Result(object):
    def __init__(self, outer_dir, test_file):
        self.name = test_file
        self.outer_dir = outer_dir

        with open(os.path.join(outer_dir, test_file)) as handle:
            self._data = json.load(handle)

    @property
    def url(self):
        return 'https://wpt.fyi/%s' % self.name

    @property
    def is_error(self):
        return self._data['status'] == 'ERROR'

    @property
    def is_timeout(self):
        return self._data['status'] == 'TIMEOUT'

    @property
    def is_okay(self):
        return self._data['status'] == 'OK'

    @property
    def total_subtests(self):
        return len(self._data['subtests'])

    @property
    def is_reference_failure(self):
        if self.total_subtests != 1:
            return False

        return bool(re.search('ReferenceError:| is undefined',
                              str(self._data['subtests'][0]['message'])))

    @property
    def potential_error(self):
        '''Identify single-page tests that are reported as failing. If a test
        file containing multiple sub-tests produces an uncaught error before
        any sub-tests are defined, it will be classified in this way (despite
        being technically a harness error).'''
        if not self.is_okay:
            return False

        # A single-page test reports exactly one subtest
        if len(self._data['subtests']) != 1:
            return False

        return self._data['subtests'][0]['status'] == 'FAIL'


def walk_mirrored(dirs):
    ref = dirs[0]

    for root, _, file_names in os.walk(ref):
        wpt_dir = root[len(ref)+1:]

        for file_name in file_names:
            results = []

            for outer_dir in dirs:
                full_name = os.path.join(outer_dir, wpt_dir, file_name)

                try:
                    result = Result(outer_dir, os.path.join(wpt_dir, file_name))
                except (IOError, OSError):
                    continue

                results.append(result)

            yield results


def compare(results):
    all_error = reduce(lambda acc, result: acc and result.is_error, results, True)
    all_timeout = reduce(lambda acc, result: acc and result.is_timeout, results, True)
    all_bad = reduce(lambda acc, result: acc and (result.is_error or result.is_timeout), results, True)
    any_bad = reduce(lambda acc, result: acc or (result.is_error or result.is_timeout), results, False)
    all_single_fail = reduce(lambda acc, result: acc and result.potential_error, results, True)

    subtest_counts_differ = False
    subtest_counts_differ_meaningfully = False
    total_subtests = results[0].total_subtests
    for result in results[1:]:
        if result.total_subtests == total_subtests:
            continue

        subtest_counts_differ = True

        if not result.is_reference_failure:
            subtest_counts_differ_meaningfully = True

    if all_error:
    #if all_timeout:
    #if not all_error and not all_timeout and all_bad:
    #if subtest_counts_differ and not any_bad:
    #if subtest_counts_differ_meaningfully and not any_bad:
    #if any_bad and not all_bad:
        print(results[0].url)

if __name__ == '__main__':
    directories = sys.argv[1:]

    if len(directories) < 1:
        raise ValueError('At least one results directory must be provided')

    for results in walk_mirrored(directories):
        compare(results)
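
A small variation on the same directory walk can also show where the ERROR results cluster, which is the kind of surface-level grouping mentioned above (a sketch reusing the per-test JSON layout that analyze.py expects; the results directory argument is a placeholder):

# Sketch: tally ERROR results by top-level WPT directory, using the same
# layout as analyze.py (one JSON file per test, mirroring the WPT tree).
import collections
import json
import os
import sys

counts = collections.Counter()
results_dir = sys.argv[1]

for root, _, file_names in os.walk(results_dir):
    for file_name in file_names:
        try:
            with open(os.path.join(root, file_name)) as handle:
                data = json.load(handle)
        except ValueError:
            continue

        if data.get('status') == 'ERROR':
            relative = os.path.relpath(os.path.join(root, file_name), results_dir)
            counts[relative.split(os.sep)[0]] += 1

for directory, count in counts.most_common():
    print('%5d  %s' % (count, directory))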

@jugglinmike
Collaborator

Another interesting way to look at the data is to compare results between stable and experimental releases of the same browser. We can't assign much significance to the pass/fail rate itself (because as discussed above, more advanced implementations may "unlock" more sub-tests and report an overall lower percentage). However, it seems fair to expect that the number of passing sub-tests (that is, ignoring the total number of sub-tests) should never decrease. Even when an experimental release is subject to more sub-tests than a stable release, it shouldn't fail where the stable release passes.

It turns out that this is not always true! In 84 tests, the experimental release of Chrome passes fewer sub-tests than the stable release. There are 51 tests where the stable/experimental Firefox results demonstrate this pattern.
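
(The comparison itself is simple to express; here's a sketch, again assuming summary files that map each test path to a [passing, total] pair:)

# Sketch: list tests where the experimental release passes fewer sub-tests
# than the stable release. Summary format assumed: test path -> [passing, total].
import json
import sys

def load(path):
    with open(path) as handle:
        return json.load(handle)

stable, experimental = load(sys.argv[1]), load(sys.argv[2])

for test in sorted(set(stable) & set(experimental)):
    if experimental[test][0] < stable[test][0]:
        print('%4d -> %-4d %s' % (stable[test][0], experimental[test][0], test))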

Three explanations for this come to mind:

  • The experimental release of the browser has regressed in some way (which seems highly unlikely given the rigor of each browser's release process)
  • The test or the browser is flaky, and we're observing the flakiness between the two test executions (i.e. the difference in release channel is only a coincidence)
  • The test has been authored with some questionable practice

I've only just begun to look into these tests, but I have found an explanation for 75% of the Chrome cases: the recent change in media "autoplay" policy. This doesn't quite fit into any of the explanations listed above.

Although we've identified this issue by comparing "stable" and "experimental" results, the dataset is showing its age. Chrome 66 has recently been promoted to "stable", so on https://wpt.fyi today, we can see the errors introduced by the policy change. For example:

https://wpt.fyi/media-source/mediasource-redundant-seek.html?sha=43dd25c888

That makes the problem much more visible, so I've submitted a patch at gh-552.

I'm curious to see if there are any other patterns in the tests that remain.

Suspect tests from Chrome stable/experimental comparison
  • "Ps" - sub-tests passing in stable release
  • "Pe" - sub-tests passing in experimental release
Ps Pe Test
10 8 /webauthn/createcredential-badargs-user.https.html
3 1 /webauthn/createcredential-extensions.https.html
5 0 /webauthn/createcredential-badargs-authnrselection.https.html
8 0 /webauthn/createcredential-badargs-rp.https.html
1 0 /media-source/mediasource-endofstream.html
250 248 /WebCryptoAPI/wrapKey_unwrapKey/wrapKey_unwrapKey.https.worker.html
316 310 /css/cssom-view/interfaces.html
5 4 /html/browsers/the-window-object/window-open-noopener.html
1 0 https://wpt.fyi/html/semantics/embedded-content/the-iframe-element/sandbox_002.htm
2 0 /html/semantics/embedded-content/media-elements/autoplay-with-broken-track.html
4 2 /html/semantics/embedded-content/media-elements/event_order_canplay_playing.html
4 2 /html/semantics/embedded-content/media-elements/event_pause.html
4 2 /html/semantics/embedded-content/media-elements/event_playing.html
2 0 /html/semantics/embedded-content/media-elements/event_timeupdate.html
1 0 https://wpt.fyi/html/semantics/embedded-content/media-elements/video_008.htm
4 2 /html/semantics/embedded-content/media-elements/paused_false_during_play.html
4 2 /html/semantics/embedded-content/media-elements/readyState_during_playing.html
4 2 /html/semantics/embedded-content/media-elements/event_play.html
1 0 /html/semantics/embedded-content/media-elements/ready-states/autoplay-with-slow-text-tracks.html
4 0 /html/semantics/embedded-content/media-elements/loading-the-media-resource/autoplay-overrides-preload.html
1 0 /html/semantics/embedded-content/media-elements/loading-the-media-resource/resource-selection-invoke-set-src-networkState.html
1 0 /html/semantics/embedded-content/media-elements/track/track-element/track-cues-pause-on-exit.html
11 10 /webrtc/RTCPeerConnection-setRemoteDescription-tracks.https.html
1 0 /mediacapture-streams/MediaStreamTrack-MediaElement-disabled-audio-is-silence.https.html
3 2 /content-security-policy/reporting/report-same-origin-with-cookies.html
Suspect tests from Firefox stable/experimental comparison
  • "Ps" - sub-tests passing in stable release
  • "Pe" - sub-tests passing in experimental release
Ps Pe Test
1 0 /preload/dynamic-adding-preload.html
1 0 /preload/preload-with-type.html
1 0 /preload/delaying-onload-link-preload-after-discovery.html
4 2 /intersection-observer/same-document-zero-size-target.html
10 5 /intersection-observer/multiple-thresholds.html
5 3 /intersection-observer/multiple-targets.html
4 3 /intersection-observer/same-document-no-root.html
10 5 /webaudio/the-audio-api/the-convolvernode-interface/convolver-response-1-chan.html
8 6 https://wpt.fyi/webdriver/tests/element_send_keys/form_controls.py
79 62 https://wpt.fyi/webdriver/tests/interaction/element_clear.py
11 9 /css/css-tables/table-model-fixup.html
10 7 /css/css-tables/html5-table-formatting-2.html
22 19 /css/css-tables/table-model-fixup-2.html
5 2 /css/css-tables/html5-table-formatting-1.html
12 4 /css/selectors/focus-within-009.html
156 118 /css/css-values/viewport-units-css2-001.html
8 7 /css/cssom/shorthand-values.html
17 15 /css/css-grid/layout-algorithm/grid-find-fr-size-gutters-001.html
2 1 /css/cssom-view/elementsFromPoint-iframes.html
5 1 /css/cssom-view/elementsFromPoint-simple.html
4 0 /css/cssom-view/elementsFromPoint-table.html
4 0 /css/cssom-view/elementsFromPoint-svg.html
40 36 /css/cssom-view/scrollintoview.html
3 1 /css/cssom-view/elementsFromPoint-invalid-cases.html
2 1 /css/cssom-view/CaretPosition-001.html
5 4 /mathml/presentation-markup/spaces/space-1.html
5 4 /mathml/presentation-markup/scripts/underover-1.html
5 4 /mathml/presentation-markup/scripts/subsup-3.html
2 0 /uievents/order-of-events/focus-events/focus-automated-blink-webkit.html
2 0 /webvr/webvr-disabled-by-feature-policy.https.sub.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_005.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_001.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_004.html
14 0 /html/browsers/the-window-object/apis-for-creating-and-navigating-browsing-contexts-by-name/open-features-tokenization-width-height.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-8.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html
112 110 /html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html
1 0 /html/editing/focus/composed.window.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-positive.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-zero.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-negative.html
2 0 /html/editing/focus/focus-management/focus-events.html
60 0 /html/webappapis/system-state-and-capabilities/the-navigator-object/protocol.html
1 0 /html/semantics/scripting-1/the-script-element/execution-timing/031.html
30 0 /html/semantics/forms/textfieldselection/select-event.html
1 0 /html/semantics/selectors/pseudo-classes/focus-autofocus.html
5 0 /html/semantics/selectors/pseudo-classes/focus.html
1 0 /html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html
7 1 /html/semantics/embedded-content/the-img-element/usemap-casing.html
14 13 /html/dom/usvstring-reflection.html

@lukebjerring
Collaborator

lukebjerring commented May 3, 2018 via email

@foolip
Member

foolip commented May 7, 2018

The experimental release of the browser has regressed in some way (which seems highly unlikely given the rigor of each browser's release process)

Even if it doesn't explain all of the failures, I would expect this to happen regularly, actually, and it's one of the first things I'd like to look for. From https://experimental-diff-dot-wptdashboard.appspot.com/results/?label=chrome&diff, I found http://w3c-test.org/longtask-timing/longtask-in-sibling-iframe-crossorigin.html which really does seem to time out more easily on Chrome Dev, but it's disabled for being flaky so we didn't notice. Even if it's mostly flaky tests we find this way, that's still good :)

Comparing one stable version to the next would also be very interesting, although by then it's too late to catch regressions.
