
Research: Why do non-Sauce Labs browsers report fewer than 100% of results? #478

Open
mariestaver opened this issue Feb 20, 2018 · 15 comments
Labels
next scheduled for next sprint project:runner

Comments

@mariestaver
Collaborator

No description provided.

@jugglinmike
Collaborator

As discussed previously, the term "completeness" can be interpreted in a few ways. If we define "completeness" in terms of "tests executed" (as opposed to "subtests executed"), we can calculate a score for a given result set procedurally. By that coarse-grained metric, we are now consistently collecting results for 100% of tests in both Chrome and Firefox. For Safari, we are collecting results for 99% of the available tests.

Most of the missing results are for "wdspec" (WebDriver specification) tests. This is expected because WPT does not run wdspec tests for Sauce Labs-mediated browsers. The relevant code is here, and you can verify experimentally by running ./wpt run --list-tests sauce:foo:bar. Although we could technically enable these tests, I don't think it would be in the interests of wpt.fyi to do so. That's because we cannot control the version of the WebDriver binary used in that context, so we can't authoritatively state what the results are actually describing. (A bit of conversation along these lines is available in the IRC logs of the #testing channel.)
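
For a rough sense of how many tests that exclusion covers, something like the following could tally manifest entries by test type. This is only a sketch of mine, assuming a MANIFEST.json generated by ./wpt manifest whose "items" object is keyed by test type:

# Sketch: count manifest entries per test type, e.g. to see how many wdspec
# files the Sauce Labs exclusion skips. Assumes `./wpt manifest` has been run
# and that MANIFEST.json keeps an "items" object keyed by test type, mapping
# file paths to the tests they contain.
import json

with open('MANIFEST.json') as handle:
    manifest = json.load(handle)

for test_type, entries in sorted(manifest['items'].items()):
    print('%-12s %d files' % (test_type, len(entries)))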

For all trials under consideration here, only a single "chunk" omits results. When we ran Safari with chunk 81 of 100 of WPT at revision 709111adbc, we failed to collect results for 6 tests. According to the logs currently available at http://builds.wpt.fyi, the Sauce Connect tunnel closed abruptly near the end of that trial, which caused the WPT CLI to finish results collection without recording results for the six tests that remained. I can't explain why this occurred; the test being executed at the time was a nondescript reference test.

@foolip I think this puts to rest our concerns about high-level completeness. As we've been discussing, understanding subtest-level completeness is much trickier. Not only is the expected number impossible to determine procedurally, it can also be influenced by implementation status (meaning disparities do not always reflect an error in results collection). Possible explanations include:

...and of course, there is the possibility of still other classes of errors that we have not yet identified.

Due to the uncertainty here, I'm still interested in researching completeness at the subtest level. Following the introduction of experimental Firefox and Chrome, I'd like to spend 3-5 days investigating the results in those terms. I'd organize my findings into a report so that we could understand the scope of the problem and what fixing it would look like.

"Completeness" script
#!/bin/bash

shas='162e9dda47
2b3f901276
c218fe33f4
1331d68a18
6e2b4a77cb
7164dbb89f
939b30a8b2
24f7e6d2f6
af485dcf5f
3589b85af3
fc33be9acf
21be95d426
067f918f9a
d01ce1055e
709111adbc
149116dc79
a87d0554fa
55846d56d8
bc7c640b39
42f43956d4
6fca0b2cd6
383fd735a5
45d92ebc55
e96e038f14'

platforms='firefox-59.0-linux chrome-64.0-linux safari-11.0-macos-10.12-sauce'

echo sha,expected,platform,actual,completeness
for sha in $shas; do
	git checkout $sha -- > /dev/null 2>&1
	# Number of tests WPT would run at this revision; Firefox is used here
	# only as a representative product for listing purposes
	tests_expected=$(./wpt run --list-tests firefox | grep -E '^/' | wc -l)

	for platform in $platforms; do
		summary_url=https://storage.googleapis.com/wptd/$sha/$platform-summary.json.gz

		results=$(curl --fail $summary_url 2> /dev/null)

		if [ $? != '0' ]; then
			continue
		fi

		# Number of tests with recorded results (the summary object is keyed by test path)
		count=$(echo "$results" | python -c 'import json; import sys; print len(json.load(sys.stdin))')
		pct=$(python -c "print 100 * float($count) / $tests_expected")

		printf '%s,%s,%s,%s,%s\n' $sha $tests_expected $platform $count $pct
	done
done
"Completeness" script results
sha expected platform actual completeness
162e9dda47 25764 firefox-59.0-linux 25764 100.0
2b3f901276 25764 firefox-59.0-linux 25764 100.0
2b3f901276 25764 chrome-64.0-linux 25764 100.0
c218fe33f4 25764 firefox-59.0-linux 25764 100.0
c218fe33f4 25764 chrome-64.0-linux 25764 100.0
c218fe33f4 25764 safari-11.0-macos-10.12-sauce 25709 99.7865238317
1331d68a18 25744 firefox-59.0-linux 25744 100.0
1331d68a18 25744 chrome-64.0-linux 25744 100.0
6e2b4a77cb 25748 firefox-59.0-linux 25748 100.0
6e2b4a77cb 25748 chrome-64.0-linux 25748 100.0
7164dbb89f 25748 firefox-59.0-linux 25748 100.0
7164dbb89f 25748 chrome-64.0-linux 25748 100.0
939b30a8b2 25741 firefox-59.0-linux 25741 100.0
939b30a8b2 25741 chrome-64.0-linux 25741 100.0
24f7e6d2f6 25740 firefox-59.0-linux 25740 100.0
24f7e6d2f6 25740 chrome-64.0-linux 25740 100.0
af485dcf5f 25763 firefox-59.0-linux 25763 100.0
af485dcf5f 25763 chrome-64.0-linux 25763 100.0
3589b85af3 25750 firefox-59.0-linux 25750 100.0
3589b85af3 25750 chrome-64.0-linux 25750 100.0
3589b85af3 25750 safari-11.0-macos-10.12-sauce 25695 99.786407767
fc33be9acf 25744 firefox-59.0-linux 25744 100.0
fc33be9acf 25744 chrome-64.0-linux 25744 100.0
21be95d426 25745 firefox-59.0-linux 25745 100.0
21be95d426 25745 chrome-64.0-linux 25745 100.0
067f918f9a 25745 firefox-59.0-linux 25745 100.0
067f918f9a 25745 chrome-64.0-linux 25745 100.0
d01ce1055e 25745 firefox-59.0-linux 25745 100.0
d01ce1055e 25745 chrome-64.0-linux 25745 100.0
709111adbc 25742 firefox-59.0-linux 25742 100.0
709111adbc 25742 chrome-64.0-linux 25742 100.0
709111adbc 25742 safari-11.0-macos-10.12-sauce 25681 99.7630331754
149116dc79 25742 firefox-59.0-linux 25742 100.0
149116dc79 25742 chrome-64.0-linux 25742 100.0
a87d0554fa 25739 firefox-59.0-linux 25739 100.0
a87d0554fa 25739 chrome-64.0-linux 25739 100.0
55846d56d8 25739 firefox-59.0-linux 25739 100.0
55846d56d8 25739 chrome-64.0-linux 25739 100.0
bc7c640b39 25738 firefox-59.0-linux 25738 100.0
bc7c640b39 25738 chrome-64.0-linux 25738 100.0
42f43956d4 25723 firefox-59.0-linux 25723 100.0
42f43956d4 25723 chrome-64.0-linux 25723 100.0
6fca0b2cd6 25720 firefox-59.0-linux 25720 100.0
6fca0b2cd6 25720 chrome-64.0-linux 25720 100.0
383fd735a5 25717 firefox-59.0-linux 25717 100.0
383fd735a5 25717 chrome-64.0-linux 25717 100.0
45d92ebc55 25712 firefox-59.0-linux 25712 100.0
45d92ebc55 25712 chrome-64.0-linux 25712 100.0
45d92ebc55 25712 safari-11.0-macos-10.12-sauce 25657 99.7860920971
e96e038f14 25710 firefox-59.0-linux 25710 100.0
e96e038f14 25710 chrome-64.0-linux 25710 100.0
Missing results identification script
#!/bin/bash

shas='c218fe33f4
3589b85af3
709111adbc
45d92ebc55'

platform='safari-11.0-macos-10.12-sauce'

for sha in $shas; do
  git checkout $sha -- > /dev/null 2>&1 || exit 1
  expected_tests=$(./wpt run --list-tests firefox | grep -E '^/')
  summary_url=https://storage.googleapis.com/wptd/$sha/$platform-summary.json.gz
  results=$(curl --fail $summary_url 2> /dev/null)
  actual_tests=$(echo "$results" | python -c 'import json; import sys; print "\n".join(json.load(sys.stdin).keys())')

  echo $sha
  comm -23 <(echo "$expected_tests" | sort) <(echo "$actual_tests" | sort)
  echo
done
Missing results from recent Safari trials

"Missing" wdspec tests have been removed.

  • c218fe33f4
    • (only missing wdspec test results)
  • 3589b85af3
    • (only missing wdspec test results)
  • 709111adbc
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas-fallback.html
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas_scale.html
    • /html/rendering/replaced-elements/embedded-content-rendering-rules/canvas_without_context_a.html
    • /shadow-dom/untriaged/shadow-trees/shadow-root-001.html
    • /shadow-dom/untriaged/shadow-trees/shadow-root-002.html
    • /shadow-dom/untriaged/shadow-trees/text-decoration-001.html
  • 45d92ebc55
    • (only missing wdspec test results)
Excerpt of logs at the time of Safari crash
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004e.html 2d6d86d606d483e788e5835321b636c816e4cb20
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004-ref.html 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
Testing 2d6d86d606d483e788e5835321b636c816e4cb20 == 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
10:07.97 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004i.html
10:07.97 INFO Test requires OS-level window focus
10:09.70 TEST_END: PASS
10:09.70 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004o.html
10:09.70 INFO Test requires OS-level window focus
10:12.43 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004o.html 2d6d86d606d483e788e5835321b636c816e4cb20
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004-ref.html 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
Testing 2d6d86d606d483e788e5835321b636c816e4cb20 == 3cdd009fe7d15c7d93067df4c13f8612ab5c0e12
10:12.43 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-004p.html
10:12.43 INFO Test requires OS-level window focus
10:13.66 TEST_END: PASS
10:13.66 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005e.html
10:13.66 INFO Test requires OS-level window focus
10:16.19 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005e.html b5bfc85c20bb8859bdcb704ca166d146e2eda577
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005-ref.html 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
Testing b5bfc85c20bb8859bdcb704ca166d146e2eda577 == 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
10:16.19 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005i.html
10:16.19 INFO Test requires OS-level window focus
30 Mar 04:41:41 - Cleaning up.
30 Mar 04:41:41 - Removing tunnel 0f9289a63a83408ab77c0916a5868405.
30 Mar 04:41:41 - Waiting for any active jobs using this tunnel to finish.
30 Mar 04:41:41 - Press CTRL-C again to shut down immediately.
30 Mar 04:41:41 - Note: if you do this, tests that are still running will fail.
30 Mar 04:41:41 - Number of jobs using tunnel: 1.
10:20.98 TEST_END: PASS
10:20.98 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005o.html
10:20.98 INFO Test requires OS-level window focus
10:24.91 TEST_END: FAIL, expected PASS
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005o.html b5bfc85c20bb8859bdcb704ca166d146e2eda577
/css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005-ref.html 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
Testing b5bfc85c20bb8859bdcb704ca166d146e2eda577 == 03a3fc89a7c143a2e4ddb36564e98fd08519cecd
10:24.91 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-005p.html
10:24.91 INFO Test requires OS-level window focus
30 Mar 04:41:47 - Number of jobs using tunnel: 1.
10:26.06 TEST_END: PASS
10:26.06 TEST_START: /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-006e.html
10:26.07 INFO Test requires OS-level window focus
10:29.47 TEST_END: FAIL, expected PASS

@foolip
Member

foolip commented Apr 11, 2018

Excellent writeup hiding in my inbox, thanks for bringing it to my attention!

On wdspec, I agree that trying to run them over Sauce doesn't make sense, and that we can just not run them using --test-types=reftest,testharness if they aren't skipped by other means.

For Sauce, is each chunk not being retried 3 times? Does this mean that the Sauce tunnel was closed all three times for those incomplete Safari runs?

"Undetected browser/WebDriver crashes" seems like it would have to result in missing results and not partial subtest, results, right? Addressing that and anything else that results in less than 100% manifest-test completion (modulo wdspec) seems important to me. Also bugs in wpt run like web-platform-tests/wpt#9007, but perhaps as part of investigating the possibility of wpt run safari to replace Sauce?

On testharness.js subtest completion, I think I would rank the issues roughly in order of tests fixed per time spent fixing, so:

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • Mismatches in subtest count/name where no browser has any failing test or harness error
  • Harness errors and timeouts affecting only some browsers

At the end of this list is roughly where I would put investigating subtest mismatches that also come with test failures/timeouts/etc. or harness errors. If we could count them (we can!) then it'd be easier to say how big the pile is, but my hunch is that it's big and only ~50-80% of the cases are ones we'd really want to fix, i.e. there's no way to lock in the wins.
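
Counting them could be fairly mechanical; as a rough sketch (assuming each wptd summary file maps a test path to a [passing, total] pair, which is only my reading of the format):

# Rough sketch: count tests whose total sub-test counts disagree between two
# summary files. Format assumed: test path -> [passing, total].
import json
import sys

def load(path):
    with open(path) as handle:
        return json.load(handle)

first, second = load(sys.argv[1]), load(sys.argv[2])
shared = set(first) & set(second)
mismatched = [test for test in shared if first[test][1] != second[test][1]]
print('%d of %d shared tests have differing sub-test totals'
      % (len(mismatched), len(shared)))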

Related to locking in the wins, fixing all harness errors and making Travis block on harness errors seems like a decent opportunity.
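
Such a gate could be small; a hypothetical sketch, assuming a wptreport-style JSON file whose top-level "results" list carries "test" and "status" fields:

# Hypothetical CI gate (not an existing wpt/Travis feature): exit non-zero if
# any test in a wptreport-style JSON file finished with a harness ERROR.
import json
import sys

with open(sys.argv[1]) as handle:
    report = json.load(handle)

errors = sorted(entry['test'] for entry in report['results']
                if entry['status'] == 'ERROR')

for test in errors:
    print(test)

sys.exit(1 if errors else 0)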

@jugglinmike
Collaborator

On wdspec, I agree that trying to run them over Sauce doesn't make sense, and
that we can just not run them using --test-types=reftest,testharness if
they aren't skipped by other means.

The source code I linked to implements this for Sauce Labs-powered browsers. To
be clear: continuing to rely on that means that we're okay with non-uniform
results for WebDriver. Given what we've said about the relevance of Sauce Labs
results, our only option for uniformity would be to skip WebDriver tests on
platforms where we can run them. I think having non-uniform data is
preferable to having uniform non-data, in this case anyway :)

For Sauce, is each chunk not being retried 3 times? Does this mean that the
Sauce tunnel was closed all three times for those incomplete Safari runs?

That recovery mechanism went by the wayside when I introduced Buildbot. Today,
I perform that process manually via the web UI to recover from failures.

"Undetected browser/WebDriver crashes" seems like it would have to result in
missing results and not partial subtest results, right?

Yup. This is the source of the consistent manifest-level incompleteness when
running Edge on Sauce Labs. The output below shows how a test that interacts
with window.opener destroys a window which WPT relies on. The WPT CLI regards
this as an "ERROR", and even though it invalidates all further testing, the
--no-restart-on-unexpected flag disables the recovery mechanism (that's what I
proposed we change in web-platform-tests/wpt#10186).

Addressing that and anything else that results in less than 100%
manifest-test completion (modulo wdspec) seems important to me.

I have not been able to reproduce this locally in a Windows VM, so it may be
tied to details of the Sauce Labs communication. That would mean the fix may
not benefit other browsers, and it may not even be relevant for our purposes
for very long. I'm happy to dig into that problem, but I want to make sure
we're on the same page about that before doing so.

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • Mismatches in subtest count/name where no browser has any failing test or
    harness error
  • Harness errors and timeouts affecting only some browsers

@rwaldron tells me that the referenced report was created at the end of last
year. Lacking access to the storage bucket, he was working from the limited
dataset he was able to build from rate-limited web requests. I'm not knocking
Rick's work, but instead explaining why that report will need to be refreshed.
And that's really all I'm proposing for this initial effort. I'd like to have
a clearer understanding of the distribution of these problems, extended with
some surface-level information about possible explanations/fixes for any
clusters we find.

Related to locking in the wins, fixing all harness errors and making Travis
block on harness errors seems like a decent opportunity.

The "ERROR" status is applied to tests which produce an uncaught exception.
This was a surprise to me when I first learned it because I always assumed that
"ERROR" described a condition where the test interfered with the harness's
functioning. We touched on this during our call today, but since @lukebjerring
mentioned being surprised by this, I thought I should call it out again. It may
be that there are many more "ERROR" results than we would otherwise expect.
That's not to say they aren't worth fixing, though!

Output from one of two consistently-incomplete Edge trials
--------------------------------------------------------
Expected PASS, got FAIL
assert_equals: The document should be an HTML document. expected (string) "text/html" but got (undefined) undefined
Check the document properties corresponding to the creator browsing context
---------------------------------------------------------------------------
Expected PASS, got FAIL
assert_equals: The document's referrer should be its creator document's address. expected "http://web-platform.test:8000/html/browsers/windows/browsing-context.html" but got ""
 4:03.03 TEST_START: /html/browsers/windows/noreferrer-null-opener.html
 4:08.48 TEST_END: ERROR, expected OK
Message: No such window (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 47 milliseconds
Build info: version: '2.52.0', revision: '4c2593c', time: '2016-02-11 19:06:42'
System info: host: 'sauce-win10', ip: '172.20.116.19', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: org.openqa.selenium.edge.EdgeDriver
Capabilities [{applicationCacheEnabled=true, acceptSslCerts=true, browserVersion=40.15063.0.0, platformVersion=10, locationContextEnabled=true, webStorageEnabled=true, browserName=MicrosoftEdge, takesScreenshot=true, pageLoadStrategy=normal, takesElementScreenshot=true, platformName=windows, platform=ANY}]
Session ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXX09A32B9E
Stacktrace:
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (None:-2)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance (None:-1)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance (None:-1)
    at java.lang.reflect.Constructor.newInstance (None:-1)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable (ErrorHandler.java:206)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed (ErrorHandler.java:158)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:678)
    at org.openqa.selenium.remote.RemoteWebDriver.executeAsyncScript (RemoteWebDriver.java:598)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (None:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke (None:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (None:-1)
    at java.lang.reflect.Method.invoke (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke (EventFiringWebDriver.java:103)
    at com.sun.proxy.$Proxy1.executeAsyncScript (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver.executeAsyncScript (EventFiringWebDriver.java:229)
    at org.openqa.selenium.remote.server.handler.ExecuteAsyncScript.call (ExecuteAsyncScript.java:58)
    at java.util.concurrent.FutureTask.run (None:-1)
    at org.openqa.selenium.remote.server.DefaultSession$1.run (DefaultSession.java:176)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
    at java.lang.Thread.run (None:-1)
 4:08.48 TEST_START: /html/browsers/windows/noreferrer-window-name.html
 4:09.21 TEST_END: ERROR, expected OK
Message: No such window (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 15 milliseconds
Build info: version: '2.52.0', revision: '4c2593c', time: '2016-02-11 19:06:42'
System info: host: 'sauce-win10', ip: '172.20.116.19', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: org.openqa.selenium.edge.EdgeDriver
Capabilities [{applicationCacheEnabled=true, acceptSslCerts=true, browserVersion=40.15063.0.0, platformVersion=10, locationContextEnabled=true, webStorageEnabled=true, browserName=MicrosoftEdge, takesScreenshot=true, pageLoadStrategy=normal, takesElementScreenshot=true, platformName=windows, platform=ANY}]
Session ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXX09A32B9E
Stacktrace:
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (None:-2)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance (None:-1)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance (None:-1)
    at java.lang.reflect.Constructor.newInstance (None:-1)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable (ErrorHandler.java:206)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed (ErrorHandler.java:158)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:678)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:701)
    at org.openqa.selenium.remote.RemoteWebDriver.getWindowHandle (RemoteWebDriver.java:555)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (None:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke (None:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (None:-1)
    at java.lang.reflect.Method.invoke (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke (EventFiringWebDriver.java:103)
    at com.sun.proxy.$Proxy1.getWindowHandle (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver.getWindowHandle (EventFiringWebDriver.java:210)
    at org.openqa.selenium.remote.server.handler.GetCurrentWindowHandle.call (GetCurrentWindowHandle.java:30)
    at org.openqa.selenium.remote.server.handler.GetCurrentWindowHandle.call (GetCurrentWindowHandle.java:1)
    at java.util.concurrent.FutureTask.run (None:-1)
    at org.openqa.selenium.remote.server.DefaultSession$1.run (DefaultSession.java:176)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
    at java.lang.Thread.run (None:-1)
 4:09.21 TEST_START: /html/browsers/windows/targeting-cross-origin-nested-browsing-contexts.html
 4:09.73 TEST_END: ERROR, expected OK

@foolip
Member

foolip commented Apr 11, 2018

I think having non-uniform data is
preferable to having uniform non-data, in this case anyway :)

Yes :)

web-platform-tests/wpt#10186

Commenting there.

And that's really all I'm proposing for this initial effort. I'd like to have
a clearer understanding of the distribution of these problems, extended with
some surface-level information about possible explanations/fixes for any
clusters we find.

Yeah, doing the work to figure out how many failures are in each of the interesting buckets, or other interesting ones you come up with, sounds very useful. Maybe it's easier starting from single JSON files per run, so starting to save those somewhere ASAP might be a good idea.

The "ERROR" status is applied to tests which produce an uncaught exception.
This was a surprise to me when I first learned it because I always assumed that
"ERROR" described a condition where the test interfered with the harness's
functioning. We touched on this during our call today, but since @lukebjerring
mentioned being surprised by this, I thought I should call it out again. It may
be that there are many more "ERROR" results than we would otherwise expect.
That's not to say they aren't worth fixing, though!

I think I may have become too acclimated to terrible things to see the problem here :) I don't actually know when "ERROR" is the test-level status, but that uncaught exceptions result in a harness error does make good sense to me. Without it, an async_test with code that throws outside of t.step_func would go nowhere. Now, for promise_test things are different because they run serially, so it is still confusing, but can it be made less bad?

Would a serious effort to get rid of harness errors help, or is the "ERROR" status something different?

@jugglinmike
Collaborator

Without it async_test with code not wrapped in t.step_func that throws
exceptions would go nowhere. Now, for promise_test things are different
because they run in serial, so it is still confusing, but can it be made less
bad?

Actually, I somehow forgot that async_test runs in parallel. In light of
that, I agree that a harness-level "ERROR" is the only way to communicate the
fact that something went wrong. (I just consider that another reason why
parallel testing is not such a good idea.)

Would a serious effort to get rid of harness errors help, or is the "ERROR"
status something different?

We won't be able to answer this by looking at the report data, and that's why
I mentioned getting the surface-level information on clusters of errors.

@foolip
Member

foolip commented Apr 11, 2018

I just consider that another reason why parallel testing is not such a good idea

I don't disagree, and making async_test run serially by default is something we could at least consider how to pull off, even though I don't have any ideas that I think would fly.

@foolip
Member

foolip commented Apr 11, 2018

We won't be able to answer this by looking at the report data, and that's why
I mentioned getting the surface-level information on clusters of errors.

Do you mean just looking to see in which directories most of the problems are? What answers couldn't we get by looking at the same information that is available to wpt.fyi?

@jugglinmike
Collaborator

Do you mean just looking to see in which directories most of the problems
are? What answers couldn't we get by looking at the same information that is
available to wpt.fyi?

I mean that because the "ERROR" status can be applied for a number of reasons, we'll need to review the specific conditions that produced it. Focusing on the groups of similar tests that report an ERROR is a way to avoid getting lost in the weeds.

I have to admit that I'm getting a little confused, though. I just discovered that uncaught exceptions are reported as "FAIL" for "single-page" tests. I can see why this would be the case, but it complicates the way I think about the different statuses.

Further complicating the classification is the fact that all tests are assumed to be single-page tests until they use a function like test or async_test. This means that even tests which are not intended to be single-page tests will be interpreted as such if they produce an uncaught error prior to defining sub-tests.

An example is geolocation-sensor/GeolocationSensor-enabled-by-feature-policy-attribute-redirect-on-load.https.html, which is reported as a "Failure" on wpt.fyi. Using only the data available there, we might say, "That is not a problem; it is simply a test failure which accurately describes implementation status." But if we agree that tests should not produce uncaught errors, then the truth is that the test needs to be re-written so that it runs to completion, defining (and failing) a number of sub-tests.

At this point, our conversation about the investigation is evolving into the investigation itself... Except I'm not being nearly as disciplined about my approach as the problem deserves. Maybe it's enough to say this justifies dedicating time for creating that report.

@jgraham
Collaborator

jgraham commented Apr 12, 2018

So, re: ERROR, I think there are two cases:

  • An error resulting from an in-browser condition (e.g. an uncaught exception)
  • An error resulting from a harness problem.

Currently wptrunner does not distinguish these cases internally, but it could (we do something similar for timeouts, distinguishing between the case where the js timeout fires and the case where we are still running a few seconds after that timeout ought to have fired). I don't think it would be hard to fix things so that harness problems always result in an INTERNAL-ERROR status inside the harness (which would still be reported as ERROR because changing the reporting is harder) and INTERNAL-ERROR would result in an unconditional restart.

@gsnedders
Member

gsnedders commented Apr 12, 2018

In the logs currently available at http://builds.wpt.fyi, the Sauce Connect Tunnel closed abruptly near the end of that trial, which caused the WPT CLI to finish results collection without recording results for the six tests that remained.

To me, this sounds like the bug is in wptrunner insofar as we don't handle the Sauce Connect Tunnel closing abruptly?

@jugglinmike
Collaborator

Here are some stats. This data is based on web-platform-tests at revision 6736e3f46b, where it had 26043 tests and 12308 testharness.js tests (those counts are for "manifest level" tests, as distinct from "sub-tests"). We're considering 6 sets of results in total: Chrome 65, Chrome 67 ("dev" channel), Firefox 59, Firefox 61 ("nightly" channel), Edge 15, and Safari 11.

I've included links to the relevant test results page on https://wpt.fyi, but bear in mind that we (currently) do not render the results for the experimental browsers. This means some of the inconsistencies described below won't be visible from those pages.

Working from @foolip's "most wanted" list above:

  • Harness errors affecting all browsers (Bocoup report exists)
  • Timeouts affecting all browsers
  • 5 tests report ERROR in all tested browsers
  • 42 tests report TIMEOUT in all tested browsers
  • (bonus) 82 tests report a mixture of ERROR and TIMEOUT in all tested browsers

That means 129 of the reported test results are suspect by these criteria. That's just 0.5% of all tests and 1% of testharness.js tests.

Many of these are symptoms of a recent regression in the WPT infrastructure. My patch to fix this in WPT directly was superseded by a similar patch that originated in Gecko. Since the former also included tests, I've re-submitted those here. With a fix in place, the details aren't particularly relevant anymore (but I've included them below in case anyone is curious). We can expect the next result set to be much improved.

Many of these issues are due to unguarded references to the API under test. There's some low-hanging fruit here (e.g. this small patch to make seven tests fail immediately), but there are also plenty of one-off problems.

We've always known that many tests are authored in such a way that the implementation status influences the number of sub-tests. To date, we've only discussed how this produces inconsistent denominators in results for the same test in different browsers. We haven't talked about how this same pattern can produce timeouts. Specifically, when zero sub-tests are produced in this way, the harness times out waiting for sub-tests. See, e.g. web-animations/animation-model/animation-types/interpolation-per-property.html

We could address these via:

 var features = cleverFunction();
+
+test(function() {
+  assert_true(features.length > 0);
+});

...but my preference would be to:

-var features = cleverFunction();
+var features = ['long', 'and', 'boring', 'list', 'of', 'features'];
  • Mismatches in subtest count/name where no browser has any failing test or
    harness error

435 tests satisfy this criterion. In an earlier comment, I mentioned how some tests are misclassified as "single-page tests". We can begin to filter those out: if we discard the cases where the inconsistency is due to a ReferenceError, the number drops to 374. Even then, there are some cases where property access on unrelated APIs is causing trouble. There are also some questionable patterns in those tests, where the expectations are defined in terms of the implementation status. In any case, I still think the tests need to be consistent in this regard.

  • Harness errors and timeouts affecting only some browsers

This describes 4643 tests. We know that some of these are due to a flaky connection to Sauce Labs, but it's hard to say how many. If we exclude Safari and Edge, then the number drops to 724. That's more tractable, though it almost certainly hides issues that are not related to the known network interference.

5 tests reporting ERROR in all browsers
42 tests reporting TIMEOUT in all browsers
82 tests inconsistently reporting either ERROR or TIMEOUT in all browsers
Analysis of failures/timeouts due to WebSocket regression

This issue affected tests outside of the websockets/ directory. Other web
platform features interact with WebSockets in non-trivial ways, so tests for
other features were invalidated by the removal of the server.

This configuration hiccup also effected some tests that had no relation to
WebSockets. That's due to the value substitution feature of the WPT server. For
example, the test
service-workers/service-worker/state.https.html
includes a JavaScript file named
resources/test-helpers.sub.js

which includes the following code:

function get_websocket_url() {
  return 'wss://{{host}}:{{ports[wss][0]}}/echo';
}

In the absence of the secure WebSocket server configuration, interpolating this
templated file causes a runtime Python error. That causes the WPT Server to
return an HTTP 500 response for this file. In other words, this bug effectively
removed that shared file.

Fortunately, that particular substitution does not appear to be very common:

$ find . -type f -name '*.sub.*' -print0 | xargs -0 grep -E '{.*wss'
./websockets/websocket.sub.js:var __SECURE__PORT = {{ports[wss][0]}};
./cookies/secure/set-from-wss.https.sub.html:    var ws = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/set-cookie-secure?secure_from_secure");
./cookies/secure/set-from-wss.https.sub.html:      var ws2 = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/echo-cookie");
./cookies/secure/set-from-ws.sub.html:      var ws2 = new WebSocket("wss://{{host}}:{{ports[wss][0]}}/echo-cookie");
./upgrade-insecure-requests/support/testharness-helper.sub.js:    url.port = {{ports[wss][0]}};
./service-workers/service-worker/resources/test-helpers.sub.js:  return 'wss://{{host}}:{{ports[wss][0]}}/echo';

Although all of those references were invalidated, the one in
upgrade-insecure-requests appears to be the only other one that might affect
tests unrelated to WebSockets.

435 tests where no browser has a timeout or harness error, yet the total number of sub-tests still varies across browsers

@jugglinmike
Collaborator

Here's the script I wrote to analyze the results data:

analyze.py
import json
import os
import re
import sys

# reduce is a builtin in Python 2 but must be imported on Python 3
from functools import reduce


class Result(object):
    def __init__(self, outer_dir, test_file):
        self.name = test_file
        self.outer_dir = outer_dir

        with open(os.path.join(outer_dir, test_file)) as handle:
            self._data = json.load(handle)

    @property
    def url(self):
        return 'https://wpt.fyi/%s' % self.name

    @property
    def is_error(self):
        return self._data['status'] == 'ERROR'

    @property
    def is_timeout(self):
        return self._data['status'] == 'TIMEOUT'

    @property
    def is_okay(self):
        return self._data['status'] == 'OK'

    @property
    def total_subtests(self):
        return len(self._data['subtests'])

    @property
    def is_reference_failure(self):
        if self.total_subtests != 1:
            return False

        return bool(re.search('ReferenceError:| is undefined',
                              str(self._data['subtests'][0]['message'])))

    @property
    def potential_error(self):
        '''Identify single-page tests that are reported as failing. If a test
        file containing multiple sub-tests produces an uncaught error before
        any sub-tests are defined, it will be classified in this way (despite
        being technically a harness error).'''
        if not self.is_okay:
            return False

        # A single-page test reports exactly one subtest
        if len(self._data['subtests']) != 1:
            return False

        return self._data['subtests'][0]['status'] == 'FAIL'


def walk_mirrored(dirs):
    ref = dirs[0]

    for root, _, file_names in os.walk(ref):
        wpt_dir = root[len(ref)+1:]

        for file_name in file_names:
            results = []

            for outer_dir in dirs:
                full_name = os.path.join(outer_dir, wpt_dir, file_name)

                try:
                    result = Result(outer_dir, os.path.join(wpt_dir, file_name))
                except (IOError, OSError):
                    continue

                results.append(result)

            yield results


def compare(results):
    all_error = reduce(lambda acc, result: acc and result.is_error, results, True)
    all_timeout = reduce(lambda acc, result: acc and result.is_timeout, results, True)
    all_bad = reduce(lambda acc, result: acc and (result.is_error or result.is_timeout), results, True)
    any_bad = reduce(lambda acc, result: acc or (result.is_error or result.is_timeout), results, False)
    all_single_fail = reduce(lambda acc, result: acc and result.potential_error, results, True)

    subtest_counts_differ = False
    subtest_counts_differ_meaningfully = False
    total_subtests = results[0].total_subtests
    for result in results[1:]:
        if result.total_subtests == total_subtests:
            continue

        subtest_counts_differ = True

        if not result.is_reference_failure:
            subtest_counts_differ_meaningfully = True

    if all_error:
    #if all_timeout:
    #if not all_error and not all_timeout and all_bad:
    #if subtest_counts_differ and not any_bad:
    #if subtest_counts_differ_meaningfully and not any_bad:
    #if any_bad and not all_bad:
        print(results[0].url)

if __name__ == '__main__':
    directories = sys.argv[1:]

    if len(directories) < 1:
        raise ValueError('At least one results directory must be provided')

    for results in walk_mirrored(directories):
        compare(results)
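
A small variation on the same directory walk can also show where the ERROR results cluster, which is the kind of surface-level grouping mentioned above (a sketch reusing the per-test JSON layout that analyze.py expects; the results directory argument is a placeholder):

# Sketch: tally ERROR results by top-level WPT directory, using the same
# layout as analyze.py (one JSON file per test, mirroring the WPT tree).
import collections
import json
import os
import sys

counts = collections.Counter()
results_dir = sys.argv[1]

for root, _, file_names in os.walk(results_dir):
    for file_name in file_names:
        try:
            with open(os.path.join(root, file_name)) as handle:
                data = json.load(handle)
        except ValueError:
            continue

        if data.get('status') == 'ERROR':
            relative = os.path.relpath(os.path.join(root, file_name), results_dir)
            counts[relative.split(os.sep)[0]] += 1

for directory, count in counts.most_common():
    print('%5d  %s' % (count, directory))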

@jugglinmike
Collaborator

Another interesting way to look at the data is to compare results between stable and experimental releases of the same browser. We can't assign much significance to the pass/fail rate itself (because as discussed above, more advanced implementations may "unlock" more sub-tests and report an overall lower percentage). However, it seems fair to expect that the number of passing sub-tests (that is, ignoring the total number of sub-tests) should never decrease. Even when an experimental release is subject to more sub-tests than a stable release, it shouldn't fail where the stable release passes.

It turns out that this is not always true! In 84 tests, the experimental release of Chrome passes fewer sub-tests than the stable release. There are 51 tests where the stable/experimental Firefox results demonstrate this pattern.
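
(The comparison itself is simple to express; here's a sketch, again assuming summary files that map each test path to a [passing, total] pair:)

# Sketch: list tests where the experimental release passes fewer sub-tests
# than the stable release. Summary format assumed: test path -> [passing, total].
import json
import sys

def load(path):
    with open(path) as handle:
        return json.load(handle)

stable, experimental = load(sys.argv[1]), load(sys.argv[2])

for test in sorted(set(stable) & set(experimental)):
    if experimental[test][0] < stable[test][0]:
        print('%4d -> %-4d %s' % (stable[test][0], experimental[test][0], test))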

Three explanations for this come to mind:

  • The experimental release of the browser has regressed in some way (which seems highly unlikely given the rigor of each browser's release process)
  • The test or the browser is flaky, and we're observing the flakiness between the two test executions (i.e. the difference in release channel is only a coincidence)
  • The test has been authored with some questionable practice

I've only just begun to look into these tests, but I have found an explanation for 75% of the Chrome cases: the recent change in media "autoplay" policy. This doesn't quite fit into any of the explanations listed above.

Although we've identified this issue by comparing "stable" and "experimental" results, the dataset is showing its age. Chrome 66 has recently been promoted to "stable", so on https://wpt.fyi today, we can see the errors introduced by the policy change. For example:

https://wpt.fyi/media-source/mediasource-redundant-seek.html?sha=43dd25c888

That makes the problem much more visible, so I've submitted a patch at gh-552.

I'm curious to see if there are any other patterns in the tests that remain.

Suspect tests from Chrome stable/experimental comparison
  • "Ps" - sub-tests passing in stable release
  • "Pe" - sub-tests passing in experimental release
Ps Pe Test
10 8 /webauthn/createcredential-badargs-user.https.html
3 1 /webauthn/createcredential-extensions.https.html
5 0 /webauthn/createcredential-badargs-authnrselection.https.html
8 0 /webauthn/createcredential-badargs-rp.https.html
1 0 /media-source/mediasource-endofstream.html
250 248 /WebCryptoAPI/wrapKey_unwrapKey/wrapKey_unwrapKey.https.worker.html
316 310 /css/cssom-view/interfaces.html
5 4 /html/browsers/the-window-object/window-open-noopener.html
1 0 https://wpt.fyi/html/semantics/embedded-content/the-iframe-element/sandbox_002.htm
2 0 /html/semantics/embedded-content/media-elements/autoplay-with-broken-track.html
4 2 /html/semantics/embedded-content/media-elements/event_order_canplay_playing.html
4 2 /html/semantics/embedded-content/media-elements/event_pause.html
4 2 /html/semantics/embedded-content/media-elements/event_playing.html
2 0 /html/semantics/embedded-content/media-elements/event_timeupdate.html
1 0 https://wpt.fyi/html/semantics/embedded-content/media-elements/video_008.htm
4 2 /html/semantics/embedded-content/media-elements/paused_false_during_play.html
4 2 /html/semantics/embedded-content/media-elements/readyState_during_playing.html
4 2 /html/semantics/embedded-content/media-elements/event_play.html
1 0 /html/semantics/embedded-content/media-elements/ready-states/autoplay-with-slow-text-tracks.html
4 0 /html/semantics/embedded-content/media-elements/loading-the-media-resource/autoplay-overrides-preload.html
1 0 /html/semantics/embedded-content/media-elements/loading-the-media-resource/resource-selection-invoke-set-src-networkState.html
1 0 /html/semantics/embedded-content/media-elements/track/track-element/track-cues-pause-on-exit.html
11 10 /webrtc/RTCPeerConnection-setRemoteDescription-tracks.https.html
1 0 /mediacapture-streams/MediaStreamTrack-MediaElement-disabled-audio-is-silence.https.html
3 2 /content-security-policy/reporting/report-same-origin-with-cookies.html
Suspect tests from Firefox stable/experimental comparison
  • "Ps" - sub-tests passing in stable release
  • "Pe" - sub-tests passing in experimental release
Ps Pe Test
1 0 /preload/dynamic-adding-preload.html
1 0 /preload/preload-with-type.html
1 0 /preload/delaying-onload-link-preload-after-discovery.html
4 2 /intersection-observer/same-document-zero-size-target.html
10 5 /intersection-observer/multiple-thresholds.html
5 3 /intersection-observer/multiple-targets.html
4 3 /intersection-observer/same-document-no-root.html
10 5 /webaudio/the-audio-api/the-convolvernode-interface/convolver-response-1-chan.html
8 6 https://wpt.fyi/webdriver/tests/element_send_keys/form_controls.py
79 62 https://wpt.fyi/webdriver/tests/interaction/element_clear.py
11 9 /css/css-tables/table-model-fixup.html
10 7 /css/css-tables/html5-table-formatting-2.html
22 19 /css/css-tables/table-model-fixup-2.html
5 2 /css/css-tables/html5-table-formatting-1.html
12 4 /css/selectors/focus-within-009.html
156 118 /css/css-values/viewport-units-css2-001.html
8 7 /css/cssom/shorthand-values.html
17 15 /css/css-grid/layout-algorithm/grid-find-fr-size-gutters-001.html
2 1 /css/cssom-view/elementsFromPoint-iframes.html
5 1 /css/cssom-view/elementsFromPoint-simple.html
4 0 /css/cssom-view/elementsFromPoint-table.html
4 0 /css/cssom-view/elementsFromPoint-svg.html
40 36 /css/cssom-view/scrollintoview.html
3 1 /css/cssom-view/elementsFromPoint-invalid-cases.html
2 1 /css/cssom-view/CaretPosition-001.html
5 4 /mathml/presentation-markup/spaces/space-1.html
5 4 /mathml/presentation-markup/scripts/underover-1.html
5 4 /mathml/presentation-markup/scripts/subsup-3.html
2 0 /uievents/order-of-events/focus-events/focus-automated-blink-webkit.html
2 0 /webvr/webvr-disabled-by-feature-policy.https.sub.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_005.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_001.html
1 0 /2dcontext/drawing-paths-to-the-canvas/drawFocusIfNeeded_004.html
14 0 /html/browsers/the-window-object/apis-for-creating-and-navigating-browsing-contexts-by-name/open-features-tokenization-width-height.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-8.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html
112 110 /html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html
128 126 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html
1 0 /html/editing/focus/composed.window.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-positive.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-zero.html
1 0 /html/editing/focus/sequential-focus-navigation-and-the-tabindex-attribute/focus-tabindex-negative.html
2 0 /html/editing/focus/focus-management/focus-events.html
60 0 /html/webappapis/system-state-and-capabilities/the-navigator-object/protocol.html
1 0 /html/semantics/scripting-1/the-script-element/execution-timing/031.html
30 0 /html/semantics/forms/textfieldselection/select-event.html
1 0 /html/semantics/selectors/pseudo-classes/focus-autofocus.html
5 0 /html/semantics/selectors/pseudo-classes/focus.html
1 0 /html/semantics/embedded-content/media-elements/track/track-element/track-cues-missed.html
7 1 /html/semantics/embedded-content/the-img-element/usemap-casing.html
14 13 /html/dom/usvstring-reflection.html

@lukebjerring
Collaborator

lukebjerring commented May 3, 2018 via email

@foolip
Member

foolip commented May 7, 2018

The experimental release of the browser has regressed in some way (which seems highly unlikely given the rigor of each browser's release process)

Even if it doesn't explain all of the failures, I would expect this to happen regularly, actually, and it's one of the first things I'd like to look for. From https://experimental-diff-dot-wptdashboard.appspot.com/results/?label=chrome&diff, I found http://w3c-test.org/longtask-timing/longtask-in-sibling-iframe-crossorigin.html which really does seem to time out more easily on Chrome Dev, but it's disabled for being flaky so we didn't notice. Even if it's mostly flaky tests we find this way, that's still good :)

Comparing one stable version to the next would also be very interesting, although by then it's too late to catch regressions.
