Try reporting flaky tests to issues #34432
Conversation
Size Change: 0 B. Total Size: 1.04 MB.
Thanks for working on this! I mostly left questions below 😄
Is there any way we can test this, or shall we just merge it and test live?
I think the only way we can test it for now without merging this is to create your own fork and merge this PR onto your fork's default branch (trunk).
This is freaking cool! 🎉
Should we only retry tests and report flaky tests when CI runs on trunk? Two reasons:
- It aligns more closely with the discussion in #31682 (Retry flaky e2e tests at most 2 times), where the consensus seems to be that we should build a report of flaky tests before we automatically retry tests on PRs.
- It lets us incrementally ramp up this new bit of infrastructure. A staged rollout, if you will. This would reduce the likelihood that e.g. the job accidentally spams the repo with GitHub issues after it is deployed.
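For illustration, a minimal sketch of what such a trunk-only gate could look like inside the reporting action, assuming it runs on the workflow_run event and uses @actions/github; the check itself is an assumption, not code from this PR.

```js
// Hypothetical staged-rollout gate: only report flaky tests for runs on trunk.
const { context } = require( '@actions/github' );

const headBranch = context.payload.workflow_run?.head_branch;
if ( headBranch !== 'trunk' ) {
	// Skip reporting for runs triggered by PR branches during the rollout.
	process.exit( 0 );
}
```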
body =
	renderIssueDescription( { meta, testTitle, testPath } ) +
	body.slice(
If you wanted to really lean into the "HTML comments are obviously the best place to store data, why are you looking at me like that" way of life, then you could use wp.blocks.serialize() and wp.blocks.parse() instead of string concatenation 😂
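To make the joke concrete: the idea is that structured data can round-trip through an HTML comment in the issue body, which is what block serialization does with its delimiters. Here is a hand-rolled sketch of that pattern; the FLAKY_TEST marker and helper names are made up for illustration and are not part of the PR or of wp.blocks.

```js
// Embed metadata for one flaky test as JSON inside an HTML comment, and read
// it back later. This mimics the block-delimiter idea without wp.blocks.
const embedMeta = ( meta ) =>
	`<!-- FLAKY_TEST ${ JSON.stringify( meta ) } -->`;

const extractMeta = ( issueBody ) => {
	const match = issueBody.match( /<!-- FLAKY_TEST (.*?) -->/ );
	return match ? JSON.parse( match[ 1 ] ) : null;
};

// Usage:
// const body = embedMeta( { testTitle, testPath } ) + '\n' + errorLogs;
// const meta = extractMeta( body );
```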
Force-pushed from 27a38d8 to f11f308.
This is such a good idea, thank you for working on that! ❤️
Amazing work here 👏
I did also have a concern that because PRs are a work in progress, a failing test could be in a PR, but the test could be fixed in the PR and never merged to trunk. An issue wouldn't make sense for that. That said, it would probably be a rare occurrence with the way this uses retries to detect flakiness. On a different note, there's also an alternative model to this proposal: instead, any test failure on trunk could be reported as an issue. The retries could still happen in PRs, but only failures on trunk would be reported.
This is also a cool idea, but I'm afraid of what happens if there's actually a bug in trunk. For instance, if some commit in WordPress core causes all e2e tests in gutenberg to fail all of a sudden, the action will then create an issue for each of them. I think auto-retrying is a safer way to determine if a test is flaky or not, at least until we can make sure our e2e tests are stable enough.
Force-pushed from f11f308 to fb2ecb2.
 * - If a test fails, it will automatically re-run at most 2 times.
 * - If it passes after retrying (within the 2 retries), then it's marked as **flaky**
 *   but displayed as **passed** in the original test suite.
 * - If it fails all 3 times, then it's a **failed** test.
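For context, here is a minimal sketch of how this retry-then-flag behavior can be wired up in Jest. jest.retryTimes() requires the jest-circus runner; the reporter below is illustrative, not the PR's actual implementation.

```js
// jest.setup.js – retry each failing e2e test up to 2 more times, so a test
// runs at most 3 times in total before it is reported as failed.
jest.retryTimes( 2 );
```

```js
// flaky-tests-reporter.js – a custom Jest reporter that flags tests which only
// passed after being retried, i.e. the "flaky but passed" case described above.
class FlakyTestsReporter {
	onTestResult( test, testResult ) {
		for ( const result of testResult.testResults ) {
			// `invocations` is greater than 1 when a test was retried; a final
			// status of 'passed' then means it failed at least once: flaky.
			if ( result.status === 'passed' && ( result.invocations ?? 1 ) > 1 ) {
				console.log( `Flaky test detected: ${ result.fullName }` );
			}
		}
	}
}

module.exports = FlakyTestsReporter;
```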
The interesting part of this proposal is that when the test fails but then passes with retries we open an issue, but when the test fails all the time we do nothing. I would worry more about the latter – clearly a regression 😄
If the test fails all the time it's not considered flaky, it's just doing its job 😄
Tests that fail consistently on CI usually get noticed and fixed pretty quickly, but tests that fail 1 out of 10 times less so.
@tellthemachines, all correct and we are on the same page here. I'm just saying that it could also be beneficial to automatically open a bug report when we detect a regression, to make it easier to coordinate the flow for fixing it. In the case of a regression, it could also leave a message on WordPress Slack, like it happens in the #core channel when the CI job on trunk fails.
There's already the "check" itself on each commit reporting the failure of the tests. Yep, we could add some notifications or a webhook if we desire, but I don't think we need to open an "issue" for them. If the tests suddenly start to fail on trunk then I think we can assume that something went wrong in our pipeline. Excluding flaky tests, the failures should have something to do with third-party libraries (a core or plugin update, for example). In that case, we should immediately start to fix them; opening an issue doesn't seem to help much other than creating additional noise 🤔.
And after all, it's not in the scope of this PR 😅. Adding a webhook to notify about failing tests on Slack is also pretty trivial with the help of the official webhook APIs of both GitHub and Slack.
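For the record, a Slack notification along those lines could be as small as the sketch below, assuming a Slack incoming-webhook URL stored as a repository secret and Node 18+ (global fetch); none of this is part of the PR.

```js
// Post a message to a Slack channel via an incoming webhook.
// SLACK_WEBHOOK_URL is a hypothetical repository secret, not something the PR adds.
async function notifySlack( text ) {
	await fetch( process.env.SLACK_WEBHOOK_URL, {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify( { text } ),
	} );
}

// e.g. notifySlack( '❌ e2e tests are failing on trunk – see the latest workflow run.' );
```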
Generally I really love this idea and think we should ship it and see how it works in practice on the repo. We can revert if it introduces too much noise initially. One thing that could be tried is posting comments on a single issue as part of the initial iteration to help with fine-tuning frequency and content, but I appreciate that might result in a lot of extra code.
We discussed this in our weekly core-js meeting. I believe we're in agreement with giving this a try. There are still some minor issues we need to figure out before merging this though.
And here are my thoughts:
If you go the route of creating issues, one tweak that might make this nicer to follow up on is to have the bot leave a reaction on an issue that's already created. That way we can sort based on reactions (using the GitHub issue UI) to determine high-flakiness tests (and prioritize which get addressed first). I do think there needs to be some process for regularly addressing flaky tests surfaced by this new logic (either by removing or rewriting the test), because by its very nature a flaky test is unreliable and thus can give a false impression of code coverage that doesn't actually exist. So while I still love the idea, the only concern I have is that issues created by this process will pile up and not be addressed.
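For reference, the suggested bookkeeping could in principle use the reactions endpoint of the GitHub API, roughly as sketched below (octokit usage is illustrative); note that a reply further down points out a single bot user can only add each reaction type once per issue.

```js
// Leave an 👀 reaction on an already-created flaky-test issue so issues can be
// sorted by reaction count in the GitHub UI. Names here are illustrative.
const { Octokit } = require( '@octokit/rest' );

async function bumpFlakyIssue( token, owner, repo, issueNumber ) {
	const octokit = new Octokit( { auth: token } );
	await octokit.rest.reactions.createForIssue( {
		owner,
		repo,
		issue_number: issueNumber,
		content: 'eyes',
	} );
}
```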
Having this run only on trunk first seems to remove one of the main benefits of this work, which is to improve the experience for contributors, where flaky tests contribute to unexpected failures in PRs whose changes are unrelated to the test.
I've thought about this, but immediately dropped it. I don't think there's an API to add multiple identical reactions to an issue from the same (bot) user. I initially wanted to append new errors as comments to the issue and lock the issue, but I guess that makes the issue even less visible.
We already have this problem now, so I don't think it'd be a new thing. It's just that flaky tests are more visible now, but are also blocking contributors at the same time. The way we usually temporarily fix this, if we haven't come up with a proper solution for the flaky test, is to simply skip it. It's the last resort and is very invisible to us (a less invisible way to overcome this is to... create an issue, which is exactly what this PR does).
Yep, agree. The end goal is always to run on all PRs. Running it on trunk first is just a trial to see if it works before rolling out to all the contributors. We could also argue that it's not necessary and we can always just revert this PR, though. Hence I'm open to all suggestions.
I concur with all your thoughts here. Let's do it! 🙂
Force-pushed from fb2ecb2 to e5499d3.
Let's try it! 👍
Description
Why
As an exploration of #33809, this PR is an experiment in using GitHub issues as our flaky-tests dashboard. It has been discussed before in #31682. The goal is to auto-retry flaky e2e tests so that they don't confuse contributors or block their progress. But just retrying is not enough: the flaky failures would be hidden by the retry mechanism and could potentially mask real bugs. Instead, we report them to a GitHub issue with the details of each failure, so that we can monitor them in the long term.
What does it look like
Here's an example issue I created in my fork.
It includes the title of the test, the path of the test, the full error logs, and the estimated flaky rate.
The issue is labeled with flaky-test so that we can filter it in the issues tab. Each log has a link pointing to the original failing action run.
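To make the reporting step concrete, here is a rough sketch of how a flaky-test-labeled issue could be created or updated through the GitHub API. It uses @octokit/rest, and the title format and update strategy are illustrative assumptions rather than the PR's literal code.

```js
const { Octokit } = require( '@octokit/rest' );

async function reportFlakyTest( token, owner, repo, testTitle, body ) {
	const octokit = new Octokit( { auth: token } );

	// Look for an open issue already tracking this flaky test.
	const { data: issues } = await octokit.rest.issues.listForRepo( {
		owner,
		repo,
		state: 'open',
		labels: 'flaky-test',
	} );
	const existing = issues.find( ( issue ) => issue.title.includes( testTitle ) );

	if ( existing ) {
		// Append the newly observed failure by rewriting the issue body.
		await octokit.rest.issues.update( {
			owner,
			repo,
			issue_number: existing.number,
			body,
		} );
		return existing;
	}

	// No existing issue: create one and label it so it shows up in the filter.
	const { data: created } = await octokit.rest.issues.create( {
		owner,
		repo,
		title: `[Flaky Test] ${ testTitle }`,
		body,
		labels: [ 'flaky-test' ],
	} );
	return created;
}
```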
How does it work
A brief overview of the flow is as follows:
1. Use jest.retryTimes( 2 ) to retry a failing test at most 2 times (so, counting the original run, it runs at most 3 times) before it passes.
2. Store the results of any detected flaky tests as artifacts in the e2e test workflow.
3. Trigger the flaky-tests workflow and run the report-flaky-tests action.
4. In report-flaky-tests, use the GitHub API to download the stored artifact, run some pre-formatting, and post the aggregated result to an issue.
Step 4 needs the repo token to call some of the GitHub API endpoints. For security reasons, we can't expose the token in our e2e workflow in step 2, hence the separate workflow and the artifacts. This is documented in a blog post by GitHub.
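As an illustration of step 4, downloading the artifact that the e2e workflow uploaded could look roughly like the sketch below; the artifact name and function names are assumptions for illustration, not the PR's actual implementation.

```js
const { Octokit } = require( '@octokit/rest' );

async function downloadFlakyTestArtifact( token, owner, repo, runId ) {
	const octokit = new Octokit( { auth: token } );

	// List the artifacts uploaded by the e2e workflow run that triggered us.
	const { data } = await octokit.rest.actions.listWorkflowRunArtifacts( {
		owner,
		repo,
		run_id: runId,
	} );
	const artifact = data.artifacts.find(
		( { name } ) => name === 'flaky-tests-report' // hypothetical artifact name
	);
	if ( ! artifact ) {
		return null;
	}

	// Download the artifact as a zip; the caller is responsible for unzipping
	// and parsing the JSON report inside it.
	const { data: zip } = await octokit.rest.actions.downloadArtifact( {
		owner,
		repo,
		artifact_id: artifact.id,
		archive_format: 'zip',
	} );
	return Buffer.from( zip );
}
```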
The new workflow depends on the workflow_run event, which only works if the workflow file is in the default branch (trunk), so this PR won't take effect until it's merged.
The flaky rate is calculated by counting the number of failures against the total number of commits since the first recorded commit. It's only recorded on trunk, and it doesn't count manual reruns, so it's just an estimate.
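In other words, the estimate boils down to something like the following (variable names are illustrative):

```js
// Estimated flaky rate: recorded failures divided by the number of trunk
// commits since the failure was first recorded. Manual reruns are not counted.
function estimateFlakyRate( failureCount, commitsSinceFirstRecord ) {
	// e.g. 3 recorded failures over 50 trunk commits ≈ 0.06 (6%)
	return failureCount / commitsSinceFirstRecord;
}
```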
Types of changes
New feature
Checklist:
I've updated all React Native files affected by any refactorings/renamings in this PR (please manually search all *.native.js files for terms that need renaming or removal).