Add /api/badge #674
Conversation
Staging instance deployed by Travis CI!
Yeah! I don't see the label ;( Moreover, why no color?
Staging instance deployed by Travis CI!
Eh, my bad, a few bugs (needs content type too)
Should the colors align with what we arrive at in #423, or is there a de-facto color scheme for this type of badge that we should follow? One thing I'd be more concerned about here than on wpt.fyi proper is the discontinuity in color between 99% (yellow) and 100% (green): once an engine gets 100%, they might push back on changes that would make them yellow again, and argue technically correct but pedantic points about the spec to prevent actually useful tests from being shared.
It's just what was in the prototype I mimicked. That sounds like a specific example of the concerns raised by @jgraham about producing problematic incentives. Is this concern based on empirical evidence, or are we paranoid programming here? (We already go green on wpt.fyi for 100%.) Happy to change the color scheme - did you have a specific suggestion?
So, when forced to suggest a specific color scheme that would lessen the potential for gamification, it seems implausible that color is an effective tool, and being consistent with wpt.fyi proper is good. Perhaps a better approach would be for us to recommend a little disclaimer for people to use after these badges in READMEs. So, I suggest the same color scheme as in #423, and that we have a single definition of the shades of red, yellow and green, so that we keep them in sync.
OK; so I'll bundle the color changes for this into work done (at a later date) resolving #423. |
Can you add API docs for how people should embed this? Right beneath that might be a good place to offer advice on how to use it responsibly, and maybe even suggest disclaimer wording.
api/README.md (outdated):

> Latest chrome experimental run:
>
> ![webdriver badge](https://staging.wpt.fyi/api/badge?product=chrome&path=/webdriver)
As a preview of what this will render like, I added a commit that uses this PR's staging environment here: 1ed6008?short_path=d4b47e8#diff-d4b47e85ce206fdbe458cb599edf2a90 (view it in rich mode).
I'm looking forward to getting this PR merged! Thanks for working on this.
FTR I still think that providing and promoting this API is a bad idea, and I don't think the changes meaningfully address those concerns. Turning results into marketing buttons for people who don't even read the spec seems to have few benefits and significant drawbacks in terms of the incentives it promotes. We have had many conversations about the difficulty of understanding the meaning of test results, and allowing them to be presented in a context-free manner in READMEs and promotional material ignores the concern that interpreting the results is hard without careful reading of a specification, or knowledge of an implementation, to get some idea of how meaningful and exhaustive the tests are. For that reason I think there's a clear difference between presenting the results alongside the spec text, or on a dashboard aimed at browser vendors with clear disclaimers, compared to putting them in a README or other introductory material that will be the first port of call for people with a shallow understanding of the topic. Hoping that people add this context into their documentation seems optimistic at best.
There are some assumptions being made about the reason people desire the badges, which I'd defer to @beaufortfrancois to clarify/justify. My understanding is that those authoring a specific feature like to track implementation progress in each major vendor. As discussed elsewhere, the motivation to provide this API here is the "lesser of two evils", under the assumption that it's being done anyway; wpt.fyi is the suitable home for more control of the cacheability, computation formulas, color consistency, context/disclaimer, etc. That is to say, your argument's merits cascade past wpt.fyi and onto the endpoints (READMEs), and perhaps we should instead be arguing to prevent the badges being used at all (whether provided by us or not).
@jgraham, if you could totally control the presentation on https://github.com/WICG/picture-in-picture, how would you present the tests? Would replicating exactly the presentation of wpt.fyi as follows be unproblematic? If we think that having badges or similar is going to be a net negative no matter how they're done, then we should just dissuade people from making/using them. That seems very likely to succeed given how small a community we are. In other words, I don't think the main reason to provide this API should be that we'll be in control and can react to bad usage, even if that's true. Rather, I think that bringing attention to the test suite in an eye-catching way in READMEs and at the top of specs is likely to be a net positive: it signals that the web standards community values interop testing, and that doing it well is a point of pride. There is some non-zero amount of attention on the status of tests that is helpful to interop. @beaufortfrancois perhaps you can elaborate a bit on what you'd like to convey in a compact form? Have you had any reactions from other vendors on the README yet, and is there anything they'd like to see changed?
For posterity, here's how the Picture-in-Picture README.md file looks for now: I've added these custom badges for two reasons. The first was for me (spec author): I wanted to make sure all web platform tests related to Picture-in-Picture would be as complete as possible. I didn't want some of my tests to not run, or to fail unexpectedly, in other browser vendors' test suites. I used this to keep track of my progress. And since I visit this page quite often, it also gives me a hint of where things are moving. The second reason was for web developers interested in Picture-in-Picture support. At a quick glance, they can tell which browsers support it or are about to. And I do believe (I may be too naive) that it would empower web developers to raise bugs or even fix tests where possible. For info, other browser vendors haven't raised this concern. If they do, I'll be happy to discuss, and to remove the badges if needed.
Thanks for the details, @beaufortfrancois. All, I'd like to point out that "didn't want some of my tests to not run or fail unexpectedly in other browser" is not just an aspiration, but that François put in a lot of effort to understand the issues:
So I agree that these numbers can be useful if used responsibly. But one of the challenges of designing public-facing systems is to also consider how those systems can be used in ways that have unintended consequences. When we have previously discussed test results, presenting context-free per-browser numbers has been identified as leading to poor incentives around submitting new tests that make the numbers look worse, or submitting tests that intentionally make competing browsers look worse (e.g. by having large numbers of tests that all fail due to the same edge-case bug). These are both failure modes that we have seen in the past when people have tried to use test results in a marketing context. So whilst I'm sure that everyone in the issue has good intentions, I'm still wary of the proposed changes; in my mind they look a lot like we would be adding an API specifically to encourage the kind of presentation that has previously proven problematic. Concretely I have several possible suggestions:
Thanks @jgraham! Getting the incentives right on wpt.fyi itself and when results are exposed in some form elsewhere is obviously very important, and a discussion we had at length at TPAC 2017, starting at "WPT Dashboard Interop Stats PRD". https://wpt.fyi/interop/ is what we built based on that discussion. So the risks of bad incentives are pretty clear, and the fewer pixels we have the harder it is to convey nuance. But I think we should also take seriously the value of promoting testing discipline via wpt(.fyi) to a web platform engineer audience, and strike some sort of balance. On the concrete suggestions:
@beaufortfrancois, have you tried out https://tabatkins.github.io/bikeshed/#testing? @tabatkins is planning some presentation changes there, maybe you'd like to try it and see if you have feedback? However, I think this doesn't address "At a quick glance, they can tell which browsers support it or are about to." How to extract that answer from wpt is something we've discussed many, many times, but don't have a ready answer for. @beaufortfrancois, is there a single test for picture-in-picture whose results would be the answer to that question?
@beaufortfrancois, WDYT, would a summary of https://wpt.fyi/interop/picture-in-picture be useful?
(Suggested by @gsnedders on IRC.) I'm not sure I understand how this changes incentives. With a fine gradation of 0/N–N/N, or 0–100%, small improvements make for small changes. To divide up the space, if people actually care about which broad category each browser is in, won't that just create weird incentives around the boundaries between categories? (This concern is why, ultimately, we went with small color differences between 99% and 100% in the new https://staging.wpt.fyi/ color scheme, instead of the originally planned jump from yellow to green.)
I strongly agree with the goal here. There is clear benefit to having engineers and their managers understand the state of implementations and have the information they need to prioritise issues likely to cause interop problems. However I'm unconvinced that this is the right way to achieve that.
That seems like the kind of thing that's handled by human-curated lists like caniuse and the various browser intent-to-implement dashboards that are floating around. As you say, extracting that specific information from test results is surprisingly tricky, since there are many confounding factors. It's also questionable whether it's a useful goal; sites like html5test.com were mainly concerned with answering this kind of broad question, and it's certainly questionable whether the overall impact of such sites was positive.
It may not. But I think there's less chance of people getting hung up on the difference between "most tests passing" and "some tests passing" than on the difference between 100% and not-100%, and it makes it harder to press-gang the test results into some (likely misleading) metric of browser quality once there is more than one serious attempt at an implementation. But I could certainly be persuaded that doing nothing is better than doing broad categories.
Ah, so I agree with that: too much of a difference between 99% and 100% would be bad. I thought using lime green for 99% should mostly address that, but the number (or fraction) itself is still there. @beaufortfrancois would you still find the badges useful if they elided the details and just said "good support" or something in the 80-100% range? (I'm finding it a bit tricky to find a phrasing that's not misleading for the 100% case, and "mostly passing" seems to imply <100%.)
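To make the trade-off concrete, here is a minimal sketch of a pass-fraction-to-color mapping along the lines discussed above. The function name and the exact thresholds are hypothetical, loosely inspired by the staging wpt.fyi scheme; the point is that the 99% → 100% step becomes a small color change (lime green → green) rather than a yellow → green jump:

```python
def badge_color(passing, total):
    """Map a pass fraction to a badge color name.

    Thresholds are illustrative only: red for low pass rates,
    yellow/orange mid-range, lime green for almost-complete, and
    a slightly different green reserved for a full pass, so that
    crossing from 99% to 100% is a subtle change.
    """
    if total == 0:
        return "lightgrey"  # no results to report
    fraction = passing / total
    if fraction == 1.0:
        return "green"
    if fraction >= 0.99:
        return "limegreen"  # intentionally close to green
    if fraction >= 0.80:
        return "yellow"
    if fraction >= 0.50:
        return "orange"
    return "red"
```

A banded variant of the same idea (e.g. mapping the 80-100% range to a single "good support" label) would just coarsen these buckets, which is where the boundary-incentive concern above comes in.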
@foolip Number of tests (passing and total) is my main preoccupation to be honest. |
What is the current status? |
We're waiting on a consensus. Hoping to either determine an alternate manifestation of this feature, which would alleviate the concerns highlighted by @jgraham, or have agreement that the concerns are noted and the feature should move ahead. I believe the conversation has simply fallen by the wayside. @jgraham - where do you sit on this issue? @foolip? (Specifically, are we deciding to block this feature on the concerns, and should I close this PR, or do we think the feature's merits outweigh the concerns, and should I merge it?) Personally, I still feel the feature adds more value than risk of misinterpretation.
I think the status is that there's a fundamental incompatibility between the functionality requested and the concerns raised. @beaufortfrancois doesn't believe that any of the proposed alternatives would meet their requirements, and I don't believe that the unmodified feature is something we should go ahead with. |
Ok, closing this PR. FTR, in my opinion this feels like throwing the baby out with the bath water. We're avoiding supporting the feature because of potential use in misleading/inappropriate contexts, when the demand for the feature and a clear value-add are presented (and, the badges are already created and used anyway). |
Sorry for not getting back to this despite the mention. I think walking away from this would be a real shame, even if we're not confident it's perfect. As I argued in #674 (comment), we could show exactly the views we have on wpt.fyi and link to wpt.fyi for further exploration. That would show the same numbers that @beaufortfrancois is getting with his current badges. If there is a fundamental incompatibility of goals here, we must already have that problem on wpt.fyi to some degree. (Being presented together with a long list of directories for other features can't be what makes the difference.) (I'm not arguing that this would be the perfect presentation for inlining in READMEs, just that an acceptable presentation does exist.) @lukebjerring, what would you think about seeking wider feedback, perhaps on public-test-infra, blink-dev, dev-platform and webkit-dev?
FWIW, I discovered another bug in my picture-in-picture web platform tests this morning by simply looking at my badges in the Picture-in-Picture spec repo. It is useful (at least for me).
I'm glad to hear that, and also not surprised that looking at results brings problems to your attention. Ping @lukebjerring on #674 (comment). |
I finally stumbled upon the spec where I'd seen something like this in the header: https://w3c.github.io/hr-time/ That's using caniuse.com, but is the sort of presentation that would be great to enable based on WPT results. |
As another point in favor of increasing visibility of results, @beaufortfrancois tells me that he discovered the picture-in-picture regression that led to web-platform-tests/wpt#14495 using his own badges. |
Sorry for slow response. Happy to seek wider feedback. |
Description
Copying @beaufortfrancois's badge code, but with a few server-side caching benefits; it also handles a few bad use cases, e.g. a path of `/`, and a few other query params (e.g. `product=chrome[experimental]`).

Review Information

Visit /api/badge for a `product` spec and a `path`, e.g. /api/badge?product=chrome&path=/webdriver
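To illustrate the shape of the endpoint's response, here is a rough sketch of what a badge handler might return. The function name and SVG markup are hypothetical (the real implementation is in this PR's code, not shown here); the two details grounded in the review are the `image/svg+xml` content type (the missing-content-type bug noted above) and a cache header, since badges embedded in READMEs are fetched on every page view:

```python
def badge_svg(product, passing, total):
    """Render a minimal SVG badge body plus response headers.

    A sketch only: the real endpoint's markup, sizing, and cache
    TTL may differ.
    """
    label = f"{product}: {passing}/{total}"
    body = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="20">'
        f'<text x="4" y="14">{label}</text>'
        "</svg>"
    )
    headers = {
        # Without this, browsers won't render the image (the bug noted above).
        "Content-Type": "image/svg+xml",
        # Illustrative TTL; server-side caching keeps README embeds cheap.
        "Cache-Control": "public, max-age=3600",
    }
    return body, headers
```

A README would then embed the badge exactly as in the example above: `![webdriver badge](https://staging.wpt.fyi/api/badge?product=chrome&path=/webdriver)`.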