Handling test renames or splits (test suite evolution over time) #284
Moving the wpt metadata in-tree seems like an option worth exploring. That has a few advantages outside of the direct context e.g.:
Even if we snapshot the end-of-year scores, the whole concept of "inactive focus areas" where the scores keep updating means that we need a good long-term solution for enforcing interop-specific rules for the tests that are included. Although separate repos don't quite prevent that, they do make it harder, and in particular make it harder to co-evolve the data. Technically we don't necessarily need "fat manifests" in the sense of "a single json file that contains both the data required to run the tests and also the additional metadata"; that's one implementation option, but another would be to make the metadata an entirely separate artifact which is designed to be easy to index by test id.
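As a rough sketch of that "separate artifact indexed by test id" idea (the layout, field names and test paths below are invented for illustration, not an agreed format), the metadata could be a flat mapping from test id to its interop labels:

```python
# Hypothetical separate metadata artifact, keyed by test id so it can be
# looked up without parsing the full wpt manifest. All names are illustrative.
INTEROP_METADATA = {
    "/css/css-cascade/layer-basic.html": {"labels": ["interop-2022-cascade"]},
    "/css/css-cascade/layer-import.html": {"labels": ["interop-2022-cascade"]},
}

def labels_for(test_id: str) -> list[str]:
    """Return the interop labels for a test id, or an empty list if unlabeled."""
    return INTEROP_METADATA.get(test_id, {}).get("labels", [])
```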
Dropping from the agenda, I think we should finish the work for #276 first.
I've closed #276 and will close this too. De facto the solution we've arrived at is to simply accept that renames or splits appear as new tests for the purposes of scoring Interop 202X. It hasn't been a big problem in practice, and because we're freezing the dashboard for each year, there's no long-term problem.
In #281 @jensimmons reported that the scores for Cascade Layers have regressed from 100 to something less than 100.
The basic problem is that we are maintaining (in wpt-metadata) a list of tests for each focus area, which is kept in sync with WPT as it evolves. We can't get a list of tests for a specific commit of WPT, but only for the latest. If one test is replaced by another (by renaming, splitting, variant-ifying, etc.) and we unlabel the old test and label the new test(s), the score for all past runs will go down as the new test(s) aren't found. This is described in a comment:
This has worked well enough so far, but if we simply label the new tests now, it will change the scores of the Interop 2022 dashboard.
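To make the failure mode concrete, here is a minimal scoring sketch (the function, test paths and label set are hypothetical, not the actual scoring code): results from an old run are evaluated against the current set of labeled tests, so after a rename the old run has no result for the new id and it counts as not passing.

```python
# Hypothetical sketch: score a run against the *current* list of labeled tests.
def score(labeled_tests: set[str], run_results: dict[str, bool]) -> float:
    passed = sum(1 for t in labeled_tests if run_results.get(t, False))
    return passed / len(labeled_tests)

old_run = {"/css/css-cascade/layer-old-name.html": True}

# Before the rename, the old id is labeled: 1/1 pass.
print(score({"/css/css-cascade/layer-old-name.html"}, old_run))  # 1.0

# After relabeling to the new id, the old run has no result for it: 0/1 pass.
print(score({"/css/css-cascade/layer-new-name.html"}, old_run))  # 0.0
```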
So what to do?
One option is to accept the tradeoff (from the above comment) in the current year but avoid it for past years by stopping metrics updates. This has been proposed previously by at least @jensimmons and @jgraham. As a consequence, it would not be possible to reproduce the same results by running the script again. This probably won't matter, but could be a problem if we discover some anomaly/bug.
Another option is to tie labels to WPT version, so that we can score results based on the test list as it was in the past. An inexact but probably good-enough solution would be to just snapshot the test lists every day. This would lead to some anomalies in graphs, as label changes aren't exactly synchronized with the test changes. To avoid any such issues, we might need to put the labels in-tree and produce a "fat manifest" with extra information. That would be significant work and more to maintain, but it would work.
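As a sketch of the daily-snapshot variant (dates, label names and test paths below are invented), scoring would pick the snapshot of the labeled test list that was in effect when the run happened:

```python
# Illustrative only: dated snapshots of the labeled test lists.
SNAPSHOTS = {
    "2022-03-01": {"interop-2022-cascade": {"/css/css-cascade/layer-old-name.html"}},
    "2022-09-01": {"interop-2022-cascade": {"/css/css-cascade/layer-new-name.html"}},
}

def labels_at(run_date: str) -> dict[str, set[str]]:
    """Return the most recent snapshot taken on or before the run date."""
    candidates = [d for d in SNAPSHOTS if d <= run_date]
    return SNAPSHOTS[max(candidates)] if candidates else {}
```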
A third option is to just maintain a list of all test names we might observe. The main challenge here is knowing how many tests to expect in total for a given test run. Without that information, any missing test results would actually inflate the scores.
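A small worked example of that inflation, assuming the score is simply the fraction of passing tests over whatever denominator is available (the numbers are made up):

```python
# A run reports results for only 3 of 5 expected tests.
reported = {"test-a": True, "test-b": True, "test-c": False}

naive_score = sum(reported.values()) / len(reported)  # 2/3 ≈ 0.67 (inflated)
honest_score = sum(reported.values()) / 5             # 2/5 = 0.40 (expected total known)
```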