Change UI incentives to focus on interoperability, not test pass rate #83
Comments
Amazing - LOVE this idea. Even just briefly scanning through the demo, the sigmas were highlighting good areas of focus.
One option is to show something derived from per-directory (presumably pre-computed) interop data, and not do anything based on pure pass/fail data.
So I love the approach here, but it feels like the implementation isn't perfect yet; some things that are easy to see in the red/green colour scheme are obscured, and not all of them are harmful. Maybe it's worth spending some time thinking about all the use cases. Some use cases I can think of:
I feel like this presentation is pretty good for the last use case, but once you have decided to improve interop in a specific area it's less good for actually doing the work (case 2 above), because it's harder to tell which tests are actually failing. It's also hard for test authors to use it to identify tests that don't pass in any implementation but should.

In theory it seems like it could be good for use case 1, but green is used for all of "this has good test coverage and works well everywhere", "this has poor test coverage so we don't know how well it works" and "this fails everywhere". I think some work is needed on the first column to disambiguate these cases, and possibly to make it more than a 5-point score (all the values I saw were 0.0, 0.1, 0.2, 0.3 or 0.4; multiplying by 100 and rounding would feel like a more useful metric without changing the actual computation at all).

As a browser developer I would particularly like it to be easy to tell where my implementation fails tests that pass in other implementations. Maybe that doesn't require different colours here, but it would require some way to filter down by result.
I didn't see this until today, pretty exciting! Just seeing the σ without reading this issue I didn't know what to make of it, but clearly you're on to something here. What is the most useful aggregated metric, and what incentives do we want to create? Given 4 engines, I think that:
The final point makes it tricky to define a metric that doesn't at some point decrease even though all the right things are happening. However, the metric has to decrease in order to be able to later reward the steps toward full interop. So, my best idea is to use the total number of tests (see #98) as the denominator, and sum up scores based on the "goodness" of each test:
(Could be generalized to >4 implementers.) Then, implementers who want to improve the aggregate score should focus on the cases where they are the last failing implementation, or where they can move it from 2/4 or 3/4. Other than a disincentive to increasing the denominator, what else would be wrong with this?
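A minimal sketch of what such an aggregate could look like, assuming hypothetical "goodness" weights keyed by how many of the 4 engines pass each test (the actual weights proposed above aren't reproduced here) and using the total test count as the denominator:

```python
# Hypothetical sketch, not the dashboard's actual metric: score each test
# by a "goodness" weight keyed by how many of the 4 engines pass it, then
# divide by the total number of tests (see #98). Weights are illustrative.
GOODNESS = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.4, 4: 1.0}

def aggregate_score(per_test_pass_counts, total_tests):
    """per_test_pass_counts: how many engines pass each test (0..4)."""
    if total_tests == 0:
        return 0.0
    return sum(GOODNESS[n] for n in per_test_pass_counts) / total_tests

# With weights shaped like this, fixing a test that already passes in 3/4
# engines gains the most, matching the "last failing implementation" focus.
```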
So I like the idea of increasing the weight attached to fixing tests that pass in multiple other implementations. I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

A problem we have here is that I don't think we know what interoperability looks like. As a thought experiment, let's say that instead of developing a metric on theoretical considerations, we decided to train an ML model to produce such a metric based on past data. In that case I don't think we would know how to create a meaningful training set, which implies we don't really know what "interoperability" looks like in this data yet. Therefore I'm wary of attaching too much weight to any specific metric.
Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?
I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured. @drufball and @RByers have had ideas about experiments along these lines, and I think we should seriously consider it, but I think having a simpler base metric would still be useful.
I was literally thinking a boolean yes/no, because like you I don't know how to get data on coverage. Mozilla can possibly provide code coverage data for Gecko, but at the moment it's for all of wpt (although I could probably generate per-directory data on demand), and it requires an expert to interpret it, so I don't know how helpful it is.
I'm not sure I entirely follow, but telemetry at that level seems like a lot of effort (is anyone really going to go through HTML line by line and turn every assertion into a telemetry probe?), and it's probably privacy-sensitive since it might be possible to reconstruct browsing history from such detailed telemetry.
Seems simpler; how would it feed into the aggregate score, if at all?
Yes, I don't think line-by-line telemetry is doable; I was just making the argument that we have some conceptual idea about what interoperability looks like and how to measure it. The challenge isn't so much discovering what it is, but coming up with useful approximations that can be measured. Going back to this issue, what are the options for an aggregate metric that are worth pursuing?
I don't have a good feeling for how the details should work out; I think we would need to look at various examples with different possible approaches to see which metric ended up matching our intuition. But I would expect that a complete testsuite would be a requirement to categorise something as having good interoperability, and would increase the impact metric for bugs (i.e. browser developers would be encouraged to preferentially work on features with good interop in other implementations and a "complete" testsuite). This could perhaps just be applied as a multiplier on some underlying metric, e.g. increase all the scores by a factor of 2 when the testsuite is judged complete, and set some thresholds so that a spec with an incomplete testsuite could never be marked as having good interop. Of course it's not entirely clear how this works with living standards, where the testsuite could get worse over time, although living standard + commitment to add tests with every spec change might be good enough.
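Sketching the multiplier idea under stated assumptions (the ×2 factor comes from the comment above; the threshold value and the boolean completeness judgment are made up for illustration):

```python
# Illustrative only: a human "testsuite judged complete?" boolean acts as a
# multiplier on some underlying per-directory score, and the "good interop"
# label is gated on completeness. The factor 2 comes from the comment above;
# the threshold is invented.
COMPLETE_MULTIPLIER = 2.0
GOOD_INTEROP_THRESHOLD = 0.8

def adjusted_score(base_score, suite_judged_complete):
    return base_score * (COMPLETE_MULTIPLIER if suite_judged_complete else 1.0)

def has_good_interop(base_score, suite_judged_complete):
    # A spec with an incomplete testsuite can never be marked as good interop.
    return (suite_judged_complete and
            adjusted_score(base_score, suite_judged_complete) >= GOOD_INTEROP_THRESHOLD)
```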
I agree it's a hard problem. I sort of like the idea that 0/4 is okay and 4/4 is okay, but 2/4 is bad - developing a metric to show deviation from cross-platform consistency. Except that it may incentivize an early-adopter vendor to drop support for a relatively new and highly-desired-by-developers feature rather than wait for interop.

I agree that we think we kind of know what interop looks like, but we don't really know at a data level. Compounding this is that we can't be entirely sure at this point whether a failing test is due to a failing implementation, a bug in the test, a bug in the test runner, or a bug in the way the dashboard invokes the runner, without going test-by-test to figure it out. That's why the work @rwaldron and @boazsender did on https://bocoup.github.io/wpt-error-report/ is valuable - it exposes areas of the dashboard tests that broadly fail in the same way and are good candidates for further investigation.

But getting back to it, I think we need to agree on what interop looks like away from the data (all browsers implementing? all browsers not implementing? with or without feature flags? how do we measure interop of new features when we know they'll be incompletely implemented for a period? do we set a time limit on that period?)
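For illustration only, one possible shape for a per-test "deviation from cross-platform consistency" score, where 0/4 and 4/4 count as full agreement and 2/4 as maximum disagreement; this is a sketch of the idea, not something the thread has settled on:

```python
# Purely illustrative: 0/4 and 4/4 passing score 0.0 (full agreement),
# 2/4 scores 1.0 (maximum disagreement), 1/4 and 3/4 land at 0.5.
def disagreement(passing, total_browsers=4):
    agree = max(passing, total_browsers - passing)
    return (total_browsers - agree) / (total_browsers / 2)
```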
I'm preparing a presentation for https://webengineshackfest.org/ and as part of that I fiddled with devtools a bit to make a mockup of what a simple 4/3/2/1 browser-neutral view might look like: Colors need a lot of tweaking of course, and we might want a 0/4 column, but I think the above wouldn't be too bad.
Maybe percentages would make this nicer still, but they'd mean very different things depending on the completeness of the test suites. |
Demo of proposed pass rate metrics is temporarily available at https://metrics5-dot-wptdashboard.appspot.com/metrics/ Feedback welcome! @foolip has already mentioned that maybe the order should be 4 / 4 down to 0 / 4. I would also like to add links to the equivalent results-based (rather than metrics-based) view somewhere. ATM, search in this view works a bit differently than in the results-based view. We should discuss what approach makes the most sense here. (Perhaps create a separate issue for that?)
Some quick thoughts:

- Color intensity should probably be proportionate to browser count, not the number of tests?
- Total test count could be its own column (instead of [Passes] / [Total]) everywhere
- In the filtered path views, instead of browser-count by test-path, the aggregated metrics could be broken down in a different two-dimensional grid: browser-count × browser (see the sketch below)
  - e.g. "Chrome is failing 4 of the 7 tests which pass in 3/4 browsers"
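A rough sketch of how that grid could be computed, assuming per-test results arrive as a mapping of test path to per-browser pass booleans (a hypothetical data shape, not the dashboard's real model):

```python
from collections import defaultdict

# Hypothetical sketch of the suggested browser-count x browser grid. The
# input shape (test path -> {browser: passed}) is assumed for illustration.
def failure_grid(results):
    grid = defaultdict(lambda: defaultdict(int))
    for test, by_browser in results.items():
        passing = sum(1 for ok in by_browser.values() if ok)
        for browser, ok in by_browser.items():
            if not ok:
                grid[passing][browser] += 1
    return grid

# failure_grid(results)[3]['chrome'] would then answer "how many tests that
# pass in 3/4 browsers is Chrome failing?"
```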
@lukebjerring I'm having trouble parsing some aspects of your recommendations, but we can chat offline.
I believe that any browser-specific information was an explicit non-goal for this view. The idea is to assess general interop health independent of "who is passing, who is failing". Another view is coming soon that shows per-browser failing tests, ordered (ascending) by number of other browsers failing (i.e., start with tests where "this is the only browser failing this test").
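A minimal sketch of that ordering, reusing the same hypothetical test → {browser: passed} shape as the grid sketch above:

```python
# Sketch of the ordering for the upcoming per-browser view: list the tests a
# given browser fails, with tests failed by the fewest *other* browsers first.
def failing_tests_for(results, browser):
    failing = [
        (sum(1 for b, ok in by_browser.items() if b != browser and not ok), test)
        for test, by_browser in results.items()
        if not by_browser.get(browser, False)
    ]
    return [test for _count, test in sorted(failing)]
```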
Just met with @foolip to discuss these comments and other thoughts. The following changes will be applied to mdittmer#3 (or earlier PR, in the case of back end changes) to improve this UI:
Still to sort out for mdittmer#3:
Future work on metrics (and results) web components:
Or always wrap, whichever you think looks better.
Yep, and this probably needs to be a bit prominent.
This issue was moved to web-platform-tests/wpt.fyi#39
According to http://wpt.fyi/about, the stated purpose of the WPT Dashboard is:
However, the way the UI works today explicitly rewards passing tests over failing tests by displaying green for 100% passing results and shades of red for anything else.[1] If a browser came along and magically made all their tests 100% green, that wouldn't entirely satisfy the goal of platform predictability.
Ideally, as I understand the goals, the "opinion" of the dashboard UI should be:
GOOD
OK
BAD
My concrete suggestions are:
I have a demo of this up here: http://sigma-dot-wptdashboard.appspot.com/
[1] The code that determines the color based on pass rate lives at components/wpt-results.html#L320
[2] The green=good, red=bad connotation applies only in Western cultures; however, I can't think of a better alternative
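To make the incentive difference concrete, here is a rough Python sketch (not the actual component logic in components/wpt-results.html, and with invented thresholds): the current rule colours by a single browser's pass rate, while a divergence-based rule, e.g. the σ shown in the demo, highlights disagreement between browsers instead:

```python
import statistics

# Today (roughly): colour is driven by one browser's pass fraction, so only
# 100% passing reads as green. Thresholds here are made up.
def colour_by_pass_rate(passed, total):
    rate = passed / total if total else 0.0
    return 'green' if rate == 1.0 else ('yellow' if rate >= 0.75 else 'red')

# Proposed direction: colour by disagreement across browsers, e.g. the
# standard deviation (sigma) of per-browser pass fractions, so divergence
# between implementations is what gets highlighted.
def colour_by_divergence(per_browser_pass_fractions):
    sigma = statistics.pstdev(per_browser_pass_fractions)
    return 'green' if sigma < 0.05 else ('yellow' if sigma < 0.2 else 'red')
```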