-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better visibility into test failures over time #11217
Comments
The request here is very similar to this older issue: #3713 |
@andrross I created a repo [1] that collects project health information and published reports to its repo ever day - see the latest reports at https://github.com/peternied/contribution-rate?tab=readme-ov-file#reports One such report is a last 30 days top failing test - here is the March 8 report. It will keep updating every day. Feel free to contribute any kind of reports you'd like to see. |
Hey @andrross @peternied we now have the gradle metrics published to OpenSearch Gradle Check Metrics dashboard, this is part of surfacing opensearch-metrics to community. Please check the current supported metrics. How about we expand this and add more metrics as required? and also we use this data for creating triggers like GitHub issues, comments etc. |
@prudhvigodithi Do you think it would make sense to add some details about using the gradle check metrics dashboard to help investigate and fix flaky failures in either TESTING.md or DEVELOPER_GUIDE.md? @dreamer-89 created a great list in #3713:
I think if we document somewhere how to use the new dashboard to solve those problems then we can close both of these issues as completed. |
Thanks @andrross and @dreamer-89, based on the list you have I have modified the gradle check workflow with new fields and created some new visualizations based on the indexed data (Thanks to @rishabh6788 for setting up the initial flow), please check the link OpenSearch Gradle Check Metrics. Identify top hitter for prioritizationFor this I have created a pie chart with the top Identify commit introduced a flaky test or increase freq of existing test failureThe following data tables should have the git commit, the associated PR and the PR owner with all the failing test details, we can filter per PR or commit to get the details of failed tests. The new visualization Build failure trend to identify health of softwareFor this the dashboard has a TSVB and line chart with the trend for the failure tests, this can be again further filtered with test name, test class, commitID, PR and with executions with Post merge. Developers impacted due to flaky testsThe entire visualizations can be filtered with PR owner, PR number or commitID. The results has the hyperlinks for the GitHub PR or commit where one can see the comments and other users. The dashboards also has the PR owner attached to see impacted user. The visualizations also has the hyperlinks with the jenkins build data where one can see all the stack trace details for the failed tests (example 39487). Test historyAll the visualizations in dashboard can be filtered by date range, using OpenSearch we get this out of the box :) |
@prudhvigodithi @rishabh6788 it looks great, thank you so much folks for putting it all together |
Agreed, this is awesome! |
Next step moving forward for surfacing the test failures as GitHub Issues instead of creating a very generic issue like #13893 (coming from https://github.com/opensearch-project/OpenSearch/blob/main/.github/workflows/gradle-check.yml#L161-L168) which sometimes fails to execute https://github.com/opensearch-project/OpenSearch/actions/runs/9320653340/job/25657907035, how about we use the following data table information to create a GitHub issue. Here is the example: After finding the failed tests from Post Merge ActionsWe should start by creating an issue at a test class level 1st to the issue created for 2nd on the same issue for 3rd on the same issue, we can add other PR's information where this has or has been failing. I'm open for ideas on whom to assign this created issue? Should we just keep it open without any assignee as each issue will have multiple PR and commits information. later during triaging the maintainer should be able to identify the right team/user and add as assignee. Moving forward we can have a logic to auto close the created issue if in last 30 days there is no failure for the test class ( @andrross @reta @dblock @getsaurabh02 @peternied let me know your thoughts on this. Thank you |
Now that we have the metrics and the updated developer guide, I'm going to close this and issue #3713. If anyone thinks there is more to do here please reopen or open a new issue. Thanks! |
This is great!
I can't wait for this. It's something developers spend a lot of time on.
My 0.02c:
|
@prudhvigodithi I agree with @dblock, no need to auto assign the created issues. |
Thanks @dblock,
The idea is to open an issue (update if already exists) for each test class failure which failed on Post Merge Actions. The Post Merge Action failures are for sure the flaky ones. Dont want to keep creating the issues (at least initially) on every failed test on PR creation as the failures can be legit for a PR.
Since the Post Merge Action Gradle check is executed after the PR is merged, the suggestion here is to comment on the closed PR and link that back to the Issue created, with this we can have datapoints of the PR's added to the issue, is my understanding correct here @dblock @andrross ?
Make sense. Thank you |
Makes sense. What would be massively useful is to link existing flaky test issues when they fail in PRs. So maybe this could be the action run for every PR gradle check failure:
I think this is unnecessary because the PR most definitely didn't cause the flaky test and once merged nobody is going to be looking at it. I recommend doing (1) and (2) above. |
Since this issue is closed I have created a new issue #13950 (comment) for this topic of surfacing the flaky tests as github issues and we continue our discussion there. |
@prudhvigodithi, I just checked out the dashboard for the first time. It is amazing! There is a little noise from cases where the open PR introduced failures. For example, looking at the last 7 days, it looks like I added a new test that was failing due to a type mismatch, so I tried modifying the test framework. I fixed my test, but broke 1000+ other tests. (That fix obviously didn't get merged, but I accidentally skewed the statistics.) |
I've created a script that crawls the OpenSearch Jenkins builds to find test failures, but only for the Gradle checks that run on code after it is pushed to the main branch. This filters out failures that are due to unmerged code in work-in-progress PRs.
I've included below the output after crawling 2000 recent builds (approx. Oct 16 - Nov 14). This data is very hard to follow, but one thing in particular stands out:
SearchQueryIT.testCommonTermsQuery
is a frequently failing test, but only since build 29184 (Oct 28). There are no failures before that, which strongly suggests something was changed around Oct 28 that introduced the flakiness.I haven't started to look but I suspect we'll be able to find the cause pretty quickly given that there is a point in time to start looking at.Update Nov 16: the root cause was an unrelated change for concurrent search randomly increased the number of deleted documents and exposed some underlying brittleness in this test: #11233 Diagnosing the root cause was a bit tricky and required diving into the specifics of how the common terms query works, but it was indeed much simpler once the flakiness was correlated to a small date range and then a specific commit.Surely there are better tools for visualizing test reports over time, perhaps already built into Jenkins? Also, we don't push that many commits so the sample size on builds after pushes to main isn't that large. Something like a nightly job to run the test suite 10 or 50 or 100 times and create a report on failures would help to quickly surface newly introduced flakiness.
The text was updated successfully, but these errors were encountered: