Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chore] Re-run failed unit tests automatically #31253

Merged
merged 2 commits into from
Feb 14, 2024

Conversation

mx-psi
Copy link
Member

@mx-psi mx-psi commented Feb 14, 2024

Description:

Re-runs failed unit tests automatically. Follow up to #31163
This re-runs the tests once if there are less than 10 total test failures.

This should speed up development, but it comes with the risk of missing real issues.
I think given the current situation our CI is in this is acceptable, but I assume this PR is going to be controversial :)

One improvement would be to keep this but auto-generate Github issues when a test fails and then passes on main's CI.

Link to tracking Issue: Relates to #30880 (does not speed up individual tests but reduces the number of attempts to be made)

@mx-psi mx-psi marked this pull request as ready for review February 14, 2024 10:14
@mx-psi mx-psi requested review from a team and songy23 February 14, 2024 10:14
@mx-psi
Copy link
Member Author

mx-psi commented Feb 14, 2024

cc @open-telemetry/collector-contrib-approvers

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looking at the documentation, it looks like it will attempt two more times by default, which is reasonable.

Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good to rerun automatically given the state of CI. Tolerating failures is always a tradeoff but we currently have so many failures that it's difficult to separate the worst offenders from the 1/million flukes. Retrying is a great way to separate these so we can get the worst offenders under control. The question in my mind is whether we should retry twice or only once.

We should keep in mind that retrying twice means a test which fails 1% of the time has only a 1/million chance of failing a given test run. We run CI ~100 times per day, so a 1% failure rate test would show up maybe once per quarter. On the other hand, retrying only once means that we would see the failure a couple times per month, which (in a less noisy CI environment) seems often enough to notice and fix/skip/remove.

@mx-psi
Copy link
Member Author

mx-psi commented Feb 14, 2024

@djaglowski Your argument makes sense to me and it's possible that re-running once could be enough. I changed the option to --rerun-fails=1 on my last commit, we can start with this and revisit if needed in the future

@mx-psi mx-psi requested a review from djaglowski February 14, 2024 15:58
@arminru arminru added this pull request to the merge queue Feb 14, 2024
@arminru arminru removed this pull request from the merge queue due to a manual request Feb 14, 2024
@mx-psi mx-psi added this pull request to the merge queue Feb 14, 2024
@TylerHelmuth TylerHelmuth added the ready to merge Code review completed; ready to merge by maintainers label Feb 14, 2024
@TylerHelmuth TylerHelmuth removed this pull request from the merge queue due to a manual request Feb 14, 2024
@TylerHelmuth TylerHelmuth merged commit c1a65e9 into open-telemetry:main Feb 14, 2024
147 checks passed
@github-actions github-actions bot added this to the next release milestone Feb 14, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024
**Description:** 

Re-runs failed unit tests automatically. Follow up to open-telemetry#31163
This re-runs the tests once if there are less than 10 total test
failures.

This should speed up development, but it comes with the risk of missing
real issues.
I think given the current situation our CI is in this is acceptable, but
I assume this PR is going to be controversial :)

One improvement would be to keep this but auto-generate Github issues
when a test fails and then passes on main's CI.

**Link to tracking Issue:** Relates to open-telemetry#30880 (does not speed up
individual tests but reduces the number of attempts to be made)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to merge Code review completed; ready to merge by maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants