
Handling test renames or splits (test suite evolution over time) #284

Closed
foolip opened this issue Feb 10, 2023 · 3 comments

Comments

@foolip (Member) commented Feb 10, 2023

In #281 @jensimmons reported that the scores for Cascade Layers have regressed from 100 to something less than 100.

The basic problem is that we are maintaining (in wpt-metadata) a list of tests for each focus area, which is kept in sync with WPT as it evolves. We can't get a list of tests for a specific commit of WPT, but only for the latest. If one test is replaced by another (by renaming, splitting, variant-ifying, etc.) and we unlabel the old test and label the new test(s), the score for all past runs will go down as the new test(s) aren't found. This is described in a comment:

We always normalize against the number of tests we are looking for,
rather than the total number of tests we found. The trade-off is all
about new tests being added to the set.

If a large chunk of tests are introduced at date X, and they fail in
some browser, then runs after date X look worse if you're only
counting total tests found - even though the tests would have failed
before date X as well.

Conversely, if a large chunk of tests are introduced at date X, and
they pass in some browser, then runs after date X would get an
artificial boost in pass-rate due to this - even if the tests would
have passed before date X as well.

We consider the former case worse than the latter, so optimize for it
by always comparing against the full test list. This does mean that
when tests are added to the set, previously generated data is no
longer valid and this script should be re-run for all dates.

This has worked well enough so far, but if we simply label the new tests now, it will change the scores of the Interop 2022 dashboard.

So what to do?

One option is to accept the tradeoff (from the above comment) in the current year but avoid it for past years by stopping metrics updates. This has been proposed previously by at least @jensimmons and @jgraham. As a consequence, it would not be possible to reproduce the same results by running the script again. This probably won't matter, but could be a problem if we discover some anomaly/bug.

Another option is to tie labels to WPT version, so that we can score results based on the test list as it was in the past. An inexact but probably good-enough solution would be to just snapshot the test lists every day. This would lead to some anomalies in graphs as label changes aren't exactly synchronized with the test changes. To avoid any such issues, we might need to put the labels in-tree and produce a "fat manifest" with extra information. Significant work, stuff to maintain, but it would work.

A third option is to just maintain a list of all the test names we might ever observe. The main challenge there is knowing how many tests to expect in total for a given test run: without that information, any missing test results would inflate the apparent pass rate rather than count against it.

@jgraham (Contributor) commented Feb 10, 2023

Moving the wpt metadata in-tree seems like an option worth exploring.

That has a few advantages outside of the direct context e.g.:

  • It's much easier to make CI jobs that apply different rules for interop-labelled tests. For example it would be pretty easy to have a CI job that fails if a labelled test gets a non-error status.
  • It's easier for vendors to update the metadata (notably bug links) without having to interact with a different repo.

Even if we snapshot the end-of-year scores, the whole concept of "inactive focus areas", where the scores keep updating, means that we need a good long-term solution for enforcing interop-specific rules for the tests that are included. Although separate repos don't strictly prevent that, they do make it harder, and in particular make it harder to co-evolve the data.

Technically we don't necessarily need "fat manifests" in the sense of "a single JSON file that contains both the data required to run the tests and also the additional metadata"; that's one implementation option, but another would be to make the metadata an entirely separate artifact designed to be easy to index by test id.
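As a purely illustrative sketch of that second option, the metadata could live in a standalone JSON artifact keyed by test id. The file layout, field names, labels, and test ids below are assumptions, not an agreed format.

```python
# Illustrative sketch of a "separate metadata artifact indexed by test id".
# The layout, field names, and test ids are hypothetical, not an agreed format.
import json

# Hypothetical in-tree artifact mapping test ids to interop metadata.
METADATA = json.loads("""
{
  "/css/css-cascade/layer-basic.html": {
    "labels": ["interop-2022-cascade"],
    "bugs": ["https://example.org/bugs/123"]
  },
  "/css/css-cascade/layer-import.html": {
    "labels": ["interop-2022-cascade"],
    "bugs": []
  }
}
""")

def labels_for(test_id: str) -> list[str]:
    """Look up interop labels for a single test id."""
    return METADATA.get(test_id, {}).get("labels", [])

print(labels_for("/css/css-cascade/layer-basic.html"))  # ['interop-2022-cascade']
```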

@foolip (Member, Author) commented Mar 15, 2023

Dropping from the agenda, I think we should finish the work for #276 first.

@foolip (Member, Author) commented Feb 15, 2024

I've closed #276 and will close this too. De facto, the solution we've arrived at is simply to accept that renames or splits appear as new tests for the purposes of scoring Interop 202X. It hasn't been a big problem in practice, and because we're freezing the dashboard for each year, there's no long-term problem.

foolip closed this as completed Feb 15, 2024