
Add /api/badge #674

Closed

wants to merge 10 commits into from

Conversation

lukebjerring
Contributor

Description

Copies @beaufortfrancois's badge code, but with some server-side caching benefits. It also handles a few bad-use cases, e.g. a path of /, and a few other query params (e.g. product=chrome[experimental]).

Review Information

Visit /api/badge with a product spec and a path, e.g. /api/badge?product=chrome&path=/webdriver
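
For anyone reading along, here is a minimal sketch of what a handler like this might look like. This is an illustration only, not the code in this PR; the shields.io redirect, the handler name, and the placeholder pass counts are assumptions, and the real endpoint computes results from (cached) wpt.fyi run data.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// badgeHandler is a hypothetical sketch of an /api/badge handler: it validates
// the product and path query params (rejecting the bad-use cases mentioned in
// the description, like a path of "/") and redirects to a badge image.
func badgeHandler(w http.ResponseWriter, r *http.Request) {
	product := r.URL.Query().Get("product") // e.g. "chrome" or "chrome[experimental]"
	path := r.URL.Query().Get("path")       // e.g. "/webdriver"
	if product == "" || path == "" || path == "/" {
		http.Error(w, "product and a non-root path are required", http.StatusBadRequest)
		return
	}
	// Placeholder numbers; the real handler would look up cached run results.
	passing, total := 10, 12
	badge := fmt.Sprintf("https://img.shields.io/badge/%s-%d%%2F%d-blue.svg",
		product, passing, total)
	http.Redirect(w, r, badge, http.StatusFound)
}

func main() {
	http.HandleFunc("/api/badge", badgeHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```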

@lukebjerring lukebjerring requested a review from foolip October 18, 2018 13:44
@wpt-pr-bot

Staging instance deployed by Travis CI!
Running at https://badge-dot-wptdashboard-staging.appspot.com

@beaufortfrancois

Yeah!

I don't see the label ;(
https://badge-dot-wptdashboard-staging.appspot.com/api/badge?product=chrome&path=/picture-in-picture

[Screenshot: the rendered badge, missing its label]

Moreover, why no color?
I liked the color to get a glimpse of implementation status.

@wpt-pr-bot

Staging instance deployed by Travis CI!
Running at https://badge-dot-announcer-dot-wptdashboard-staging.appspot.com

@lukebjerring
Contributor Author

Eh, my bad, a few bugs (needs content type too)
https://zenith-case.glitch.me/public/examples.html
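
For reference, a minimal sketch of the content-type fix mentioned above, assuming the badge is served as inline SVG (the handler name and the placeholder SVG are made up; whether the real endpoint serves SVG directly or redirects elsewhere is an implementation detail not settled here):

```go
package main

import (
	"log"
	"net/http"
)

// badgeSVG serves a (placeholder) badge as inline SVG. Declaring an explicit
// image content type is the kind of fix referred to above ("needs content type
// too"); without it the embedded badge image does not render.
func badgeSVG(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "image/svg+xml")
	w.Write([]byte(`<svg xmlns="http://www.w3.org/2000/svg" width="90" height="20"></svg>`))
}

func main() {
	http.HandleFunc("/api/badge", badgeSVG)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```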

@foolip
Member

foolip commented Oct 18, 2018

Should the colors align with what we arrive at in #423, or is there a de-facto color scheme for this type of badge that we should follow?

One thing that I'd be more concerned with here than on wpt.fyi proper is the discontinuity in color between 99% (yellow) and 100% (green), since once an engine gets 100% they might push back on changes that will make them yellow again, and argue technically correct but pedantic points about the spec to prevent actually useful tests from being shared.

@lukebjerring
Contributor Author

It's just what was in the prototype I mimicked. That sounds like a specific example of the concerns raised by @jgraham about producing problematic incentives. Is this concern based on empirical evidence, or are we paranoid programming here? (We already go green on wpt.fyi for 100%.)

Happy to change the color scheme - did you have a specific suggestion?

@foolip
Member

foolip commented Oct 19, 2018

So, when forced to suggest a specific color scheme that would lessen the potential for gamification, it seems implausible that color is an effective tool, and being consistent with wpt.fyi proper seems good.

Perhaps a better approach would be for us to recommend a little disclaimer for people to put after these badges in READMEs.

So, I suggest the same color scheme as in #423 and that we have a single definition of the shades of red, yellow and green, so that we keep them in sync.

@lukebjerring
Contributor Author

OK; so I'll bundle the color changes for this into work done (at a later date) resolving #423.

@foolip
Member

foolip commented Oct 26, 2018

Can you add API docs for this, explaining how people should embed it? Right beneath that might be a good place to offer advice on how to use it responsibly and maybe even suggest disclaimer wording.

api/README.md Outdated

Latest chrome experimental run:

![webdriver badge](https://staging.wpt.fyi/api/badge?product=chrome&path=/webdriver)
Contributor Author


As a preview for what this will render like, I added a commit that uses this PR's staging environment here:
1ed6008?short_path=d4b47e8#diff-d4b47e85ce206fdbe458cb599edf2a90

(View it in rich mode)

@beaufortfrancois

I'm looking forward to getting this PR merged! Thanks for working on this.

@jgraham
Contributor

jgraham commented Oct 29, 2018

FTR I still think that providing and promoting this API is a bad idea, and I don't think the changes meaningfully address those concerns.

Turning results into marketing buttons for people who don't even read the spec seems to have few benefits and significant drawbacks in terms of the incentives it promotes. We have had many conversations about the difficulty of understanding the meaning of test results, and allowing them to be presented in a context-free manner in READMEs and promotional material ignores the concern that interpreting the results is hard without careful reading of a specification or knowledge of an implementation to get some idea of how meaningful and exhaustive the tests are. For that reason I think there's a clear difference between presenting the results alongside the spec text, or on a dashboard aimed at browser vendors with clear disclaimers, compared to putting them in a README or other introductory material that will be the first port of call for people with a shallow understanding of the topic. Hoping that people add this context into their documentation seems optimistic at best.

@lukebjerring
Contributor Author

There are some assumptions being made about the reason people desire the badges, which I'd defer to @beaufortfrancois to clarify/justify. My understanding is that those authoring a specific feature like to track implementation progress in each major vendor.

As discussed elsewhere, the motivation to provide this API here is the "lesser of two evils", under the assumption that it's being done anyway; wpt.fyi is the suitable home for more control over cacheability, computation formulas, color consistency, context/disclaimers, etc.

That is to say, your argument's merits cascade past wpt.fyi and onto the endpoints (READMEs), and perhaps we should instead be arguing to prevent the badges from being used at all (whether provided by us or not).

@foolip
Member

foolip commented Oct 29, 2018

@jgraham, if you could totally control the presentation on https://github.com/WICG/picture-in-picture, how would you present the tests? Would replicating exactly the presentation of wpt.fyi, as follows, be unproblematic?
[Screenshot: wpt.fyi results table for picture-in-picture]

If we think that having badges or similar is going to be a net negative no matter how they're done, then we should just dissuade people from making/using them. That seems very likely to succeed given how small a community we are. In other words, I don't think the main reason to provide this API should be that we'll be in control and can react to bad usage, even if that's true.

Rather, I think that bringing attention to the test suite in an eye-catching way in READMEs and at the top of specs is likely to be a net positive: it signals that the web standards community values interop testing, and that doing it well is a point of pride. There is some non-zero amount of attention on the status of tests that is helpful to interop.

@beaufortfrancois perhaps you can elaborate a bit on what you'd like to convey in a compact form? Have you had any reactions from other vendors on the README yet and is there anything they'd like to see changed?

@beaufortfrancois

For posterity, here's how the Picture-in-Picture README.md file looks for now:
[Screenshot: the Picture-in-Picture README.md with per-browser badges]

I've added these custom badges for two reasons. The first was for me (as spec author): I wanted to make sure all web platform tests related to Picture-in-Picture would be as complete as possible. I didn't want some of my tests to not run, or to fail unexpectedly, in other browser vendors' test suites. I used this to keep track of my progress. And since I visit this page quite often, it also gives me a hint of where things are moving.

The second reason was for web developers interested in Picture-in-Picture support. At a quick glance, they can tell which browsers support it or are about to. And I do believe (I may be too naive) that it would empower web developers to raise bugs or even fix tests where possible.

For info, other browser vendors haven't raised any concerns so far. If they do, I'll be happy to discuss and remove the badges if needed.

@foolip
Member

foolip commented Oct 31, 2018

Thanks for the details, @beaufortfrancois. All, I'd like to point out that "didn't want some of my tests to not run or fail unexpectedly in other browser" is not just an aspiration, but that François put in a lot of effort to understand the issues:

@jgraham
Contributor

jgraham commented Oct 31, 2018

So I agree that these numbers can be useful if used responsibly. But one of the challenges of designing public-facing systems is to also consider how those systems can be used in ways that have unintended consequences. When we have previously discussed test results, presenting context-free per-browser numbers has been identified as leading to poor incentives around submitting new tests that make the numbers look worse, or submitting tests that intentionally make competing browsers look worse (e.g. by having large numbers of tests that all fail due to the same edge-case bug). These are both failure modes that we have seen in the past when people have tried to use test results in a marketing context.

So whilst I'm sure that everyone in the issue has good intentions, I'm still wary of the proposed changes; in my mind they look a lot like we would be adding an API specifically to encourage the kind of presentation that has previously proven problematic.

Concretely I have several possible suggestions:

  • Don't link the test results from the README; link them in the spec itself, preferably close to the text defining the feature under test. Presumably spec authors, and developers of the kind who are likely to work on improving test coverage, are looking at the spec itself, so this seems like it would satisfy the requirement to make the results available to those audiences.
  • Add badges based on the numbers in the interop view i.e. for the fraction of tests passed by 4/4 major engines, 3/4 etc. This is harder to understand, and maybe doesn't work for the use case of "I want to understand how different browsers are doing". But linking through to the wpt.fyi dashboard where there's more ability to set the context (and plans in the future to add e.g. annotations linking to per-engine bugs where available) makes it more likely that the results will be understood in a reasonable manner.
  • Have badges with super-broad categories like "No support", "Some passing tests", "Most tests pass". Again that wouldn't necessarily help with your use case of wanting to investigate the results in detail, but especially once we have annotations on wpt.fyi doing that in the context of the dashboard (so you can see which test failures are already triaged, etc.) is going to be more helpful anyway.
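
For illustration only, here is a sketch of what the third suggestion might look like as code; the 80% cut-off and the exact label strings are assumptions, not something proposed in this thread.

```go
package main

import "fmt"

// category maps a pass fraction onto one of the broad labels suggested above.
// The 0 and 0.8 cut-offs are illustrative only.
func category(passing, total int) string {
	if total == 0 || passing == 0 {
		return "No support"
	}
	frac := float64(passing) / float64(total)
	if frac >= 0.8 {
		return "Most tests pass"
	}
	return "Some passing tests"
}

func main() {
	fmt.Println(category(0, 12))  // No support
	fmt.Println(category(5, 12))  // Some passing tests
	fmt.Println(category(12, 12)) // Most tests pass
}
```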

@foolip
Member

foolip commented Oct 31, 2018

Thanks @jgraham! Getting the incentives right on wpt.fyi itself and when results are exposed in some form elsewhere is obviously very important, and a discussion we had at length at TPAC 2017, starting at "WPT Dashboard Interop Stats PRD". https://wpt.fyi/interop/ is what we built based on that discussion.

So the risks of bad incentives are pretty clear, and the fewer pixels we have the harder it is to convey nuance. But I think we should also take seriously the value of promoting testing discipline via wpt(.fyi) to a web platform engineer audience, and strike some sort of balance.

On the concrete suggestions:

Don't link the test results from the README but in the spec itself

@beaufortfrancois, have you tried out https://tabatkins.github.io/bikeshed/#testing? @tabatkins is planning some presentation changes there, maybe you'd like to try it and see if you have feedback?

However, I think this doesn't address "At a quick glance, they can tell which browsers support it or are about to." How to extract that answer from wpt is something we've discussed many, many times, but don't have a ready answer for. @beaufortfrancois, is there a single test for picture-in-picture whose results would be the answer to that question?

Add badges based on the numbers in the interop view

@beaufortfrancois, WDYT, would a summary of https://wpt.fyi/interop/picture-in-picture be useful?

Have badges with super-broad categories

(Suggested by @gsnedders on IRC.)

I'm not sure I understand how this changes incentives. With a fine gradation of 0/N–N/N, or 0–100%, small improvements make for small changes. If we divide the space into broad categories and people actually care about which category each browser is in, won't that just create weird incentives around the boundaries between categories? (This concern is why, ultimately, we went with small color differences between 99% and 100% in the new https://staging.wpt.fyi/ color scheme, instead of the originally planned jump from yellow to green.)
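
Purely as an illustration of that last point (the actual staging.wpt.fyi palette is not in this thread; the hex values and the simple red-to-green interpolation below are assumptions): a continuous ramp keeps 99% and 100% visually close, unlike a hard yellow-to-green jump at exactly 100%.

```go
package main

import "fmt"

// shade interpolates red -> green by pass fraction, so 99% and 100% differ
// only slightly. The colors are made up, not the real wpt.fyi scheme.
func shade(frac float64) string {
	r := int(255 * (1 - frac))
	g := int(255 * frac)
	return fmt.Sprintf("#%02x%02x00", r, g)
}

func main() {
	fmt.Println(shade(0.99)) // near-green
	fmt.Println(shade(1.00)) // full green, only slightly different
}
```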

@jgraham
Contributor

jgraham commented Oct 31, 2018

But I think we should also take seriously the value of promoting testing discipline via wpt(.fyi) to a web platform engineer audience

I strongly agree with the goal here. There is clear benefit to having engineers and their managers understand the state of implementations and have the information they need to prioritise issues likely to cause interop problems. However I'm unconvinced that this is the right way to achieve that.

[in-spec results] doesn't address "At a quick glance, they can tell which browsers support it or are about to."

That seems like the kind of thing that's handled by human-curated lists like caniuse and the various browser intent-to-implement dashboards that are floating around. As you say, extracting that specific information from test results is surprisingly tricky since there are many confounding factors. It's also questionable whether it's a useful goal; sites like html5test.com were mainly concerned with answering this kind of broad question, and it's certainly questionable whether the overall impact of such sites was positive.

I'm not sure I understand how [broad categories] changes incentives

It may not. But I think there's less chance of people getting hung up on the difference between "most tests passing" and "some tests passing" than on the difference between 100% and not-100%, and it makes it harder to press-gang the test results into some (likely misleading) metric of browser quality once there is more than one serious attempt at an implementation.

But I could certainly be persuaded that doing nothing is better than doing broad categories.

@foolip
Member

foolip commented Oct 31, 2018

there's less chance of people getting hung up on the difference between "most tests passing" and "some tests passing" than on the difference between 100% and not-100%

Ah, I agree with that: too much of a difference between 99% and 100% would be bad. I thought using lime green for 99% would mostly address that, but the number (or fraction) itself is still there.

@beaufortfrancois would you still find the badges useful if they elided the details and just said "good support" or something in the 80-100% range? (I'm finding it a bit tricky to find a phrasing that's not misleading for the 100%, and "mostly passing" seems to imply <100%.)

@beaufortfrancois

@foolip The number of tests (passing and total) is my main concern, to be honest.

@beaufortfrancois

What is the current status?

@lukebjerring
Contributor Author

We're waiting on a consensus: hoping either to find an alternative form of this feature that alleviates the concerns highlighted by @jgraham, or to get agreement that the concerns are noted and the feature should move ahead.

I believe the conversation has simply fallen by the wayside. @jgraham, where do you sit on this issue? @foolip? (Specifically, are we deciding to block this feature on the concerns, in which case I should close this PR, or do we think the feature's merits outweigh the concerns, in which case I should merge it?)

Personally, I still feel the feature adds more value than risk of misinterpretation.

@jgraham
Contributor

jgraham commented Nov 8, 2018

I think the status is that there's a fundamental incompatibility between the functionality requested and the concerns raised. @beaufortfrancois doesn't believe that any of the proposed alternatives would meet their requirements, and I don't believe that the unmodified feature is something we should go ahead with.

@lukebjerring
Contributor Author

Ok, closing this PR.

FTR, in my opinion this feels like throwing the baby out with the bath water. We're declining to support the feature because of its potential use in misleading/inappropriate contexts, even though there is demand for it and a clear value-add (and the badges are already being created and used anyway).

@foolip
Member

foolip commented Nov 15, 2018

Sorry for not getting back to this despite the mention.

I think walking away from this would be a real shame, even if we're not confident it's perfect. As I argued in #674 (comment), we could show exactly the views we have on wpt.fyi and link to wpt.fyi for further exploration. That would show the same numbers that @beaufortfrancois is getting with his current badges. If there is a fundamental incompatibility of goals here, we must already have that problem on wpt.fyi to some degree. (Being presented together with a long list of directories for other features can't be what makes the difference.)

(I'm not arguing that this would be the perfect presentation for inlining in READMEs, just that an acceptable presentation does exist.)

@lukebjerring, what would you think about seeking wider feedback, perhaps on public-test-infra, blink-dev, dev-platform and webkit-dev?

@beaufortfrancois

FWIW, this morning I discovered another bug in my picture-in-picture web platform tests simply by looking at my badges in the Picture-in-Picture spec repo. It is useful (at least for me).

@foolip
Member

foolip commented Nov 21, 2018

I'm glad to hear that, and also not surprised that looking at results brings problems to your attention. Ping @lukebjerring on #674 (comment).

@foolip
Member

foolip commented Dec 5, 2018

I finally stumbled upon the spec where I'd seen something like this in the header: https://w3c.github.io/hr-time/

That's using caniuse.com, but is the sort of presentation that would be great to enable based on WPT results.

@foolip
Member

foolip commented Dec 13, 2018

As another point in favor of increasing visibility of results, @beaufortfrancois tells me that he discovered the picture-in-picture regression that led to web-platform-tests/wpt#14495 using his own badges.

@lukebjerring
Contributor Author

Sorry for slow response. Happy to seek wider feedback.

@Hexcles Hexcles deleted the badge branch February 16, 2019 21:41
@Hexcles Hexcles restored the badge branch February 16, 2019 21:42
@Hexcles Hexcles deleted the badge branch October 23, 2019 03:12