[Automation Enhancement] Mechanism to close the created Gradle Check AUTOCUT flaky test issues. #14475

Closed
prudhvigodithi opened this issue Jun 20, 2024 · 5 comments
Assignees
Labels
Build Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement. enhancement Enhancement or improvement to existing feature or request

Comments

@prudhvigodithi
Member

prudhvigodithi commented Jun 20, 2024

Is your feature request related to a problem? Please describe

Background

Following the initial implementation in #13950, the automation described in the DEVELOPER_GUIDE identifies flaky tests from failures in the post-merge actions and creates flaky test report issues for them. The data used to create these issues comes from the OpenSearch Metrics Project (for more details, refer to the Gradle Check Metrics Dashboard). The initial goal of finding the flaky tests and creating a detailed issue report has been met.

Problem Statement

The auto-created issues can currently be closed only once the failures have been absent from the post-merge actions for 30 days, because the query executed on the metrics cluster selects tests that failed in the past 30 days. For example, an AUTOCUT issue was created for RemoteStoreClusterStateRestoreIT. Even though the failure was identified and fixed promptly, a user has no way to close the issue: the automation flags RemoteStoreClusterStateRestoreIT again and re-opens the issue, because the test was seen failing within the past 30 days. As a result, the issue remains open for the next 30 days (provided the test does not fail again in post-merge builds) even though the user has fixed the flaky test.

Describe the solution you'd like

Proposed Solution

Solution 1

As proposed here

If the issue is closed (on the assumption that the user has fixed the flaky test), the automation should not re-open it unless the data differs from what is shown in the issue body; if anything in the issue body would be different after the issue was closed, the automation should re-open it. The data to compare is the markdown table, not the linked PRs, because failures during PR creation can sometimes be genuine. In other words, re-open when a new failure (with a different post-merge commit) is seen after the issue is closed. This also covers the case where a flaky test is believed to be fixed but recurs: the new occurrence re-opens the issue with the new data.

This solution is a simple comparison against the test names and git references in the existing issue body to decide whether to re-open the issue (once the user has closed it) or keep it closed.
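The comparison described for Solution 1 could be sketched roughly as follows. This is only an illustration, not the code that runs in the automation: the function name `should_reopen`, the representation of each table row as a `(test_name, git_reference)` pair, and the `"open"`/`"closed"` state strings are all assumptions made for the example.

```python
from typing import Iterable, Tuple

# Hypothetical representation of one row of the issue's markdown table.
Failure = Tuple[str, str]  # (test_name, git_reference)


def should_reopen(issue_state: str,
                  recorded: Iterable[Failure],
                  latest: Iterable[Failure]) -> bool:
    """Return True when a closed issue should be re-opened.

    recorded: failures already listed in the issue body.
    latest:   failures returned by the most recent metrics query.
    """
    if issue_state != "closed":
        # Open issues are simply updated; there is nothing to re-open.
        return False
    # Assumption: only a failure that is not already in the issue body
    # (for example, the same test on a different post-merge commit)
    # counts as new data.
    new_failures = set(latest) - set(recorded)
    return bool(new_failures)
```

Under this assumption, a closed issue stays closed while the 30-day query keeps returning only the failures already recorded in it, and re-opens as soon as a row with a new test name or commit appears.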

Solution 2

This solution maintains a database of events and uses those events to decide whether to open a new issue or keep the issue closed.

Create a new index, gradle-check-flaky-tests, from the flaky test names identified in OpenSearch Gradle Check Metrics, a step that is already part of the FetchPostMergeFailedTestClass automation. Then create a new document for each test name with its test_class and git_reference association. For example:

{
  "_index": "gradle-check-flaky-tests",
  "_id": "yrZzNpAB0YKBsy3HQg9I",
  "_version": 1,
  "_score": null,
  "_source": {
    "test_class": "RemoteStoreClusterStateRestoreIT",
    "test_name": "org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata",
    "git_reference": "a06afef1fc63cab9ab9fc1b84215a575a91a12d8",
    "flaky": true,
    "flaky_identified_at": null,
    "updated_at": 1718898508731,
    "fixed_at": null,
    "issue_number": null,
    "time_open_in_days": null,
    "time_closed_in_days": null
  },
  "fields": {
    "updated_at": [
      "2024-06-20T15:48:28.731Z"
    ]
  },
  "sort": [
    1718898508731
  ]
}

The flaky_identified_at field is the date when the document was first created.
The updated_at field is the time when the daily automation was last triggered.
(Optional) The time_open_in_days field is the difference (updated_at - flaky_identified_at).
(Optional) The time_closed_in_days field is the difference (updated_at - flaky_identified_at) once flaky is set to false.
The flaky field is set to false once the user closes the issue.
The fixed_at field is set to the current updated_at after flaky is set to false (it is approximately the time when the issue was closed).
The issue_number field is the GitHub issue number created for the test_class (for example, #14326).
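As a rough sketch of how these fields might be maintained on each daily run, consider the following. The `refresh` helper, its `issue_closed` parameter, and the choice of whole days from epoch-millisecond timestamps are assumptions for illustration; only the field names and the two differences come from the description above.

```python
from typing import Any, Dict

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def refresh(doc: Dict[str, Any], now_ms: int, issue_closed: bool) -> Dict[str, Any]:
    """Return a copy of doc updated for one daily automation run.

    Hypothetical helper: timestamps are epoch milliseconds and the
    day counters are rounded down to whole days.
    """
    updated = dict(doc)
    updated["updated_at"] = now_ms
    elapsed_days = (now_ms - updated["flaky_identified_at"]) // MILLIS_PER_DAY
    if issue_closed:
        if updated.get("flaky", True):
            # First run after the user closed the issue: record the fix time.
            updated["flaky"] = False
            updated["fixed_at"] = now_ms
        updated["time_closed_in_days"] = elapsed_days
    else:
        updated["time_open_in_days"] = elapsed_days
    return updated
```

Because fixed_at is taken from the run that first sees the closed issue, it can lag the actual close time by up to one run interval, which matches the "approximately" in the field description.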

In subsequent automation runs, if the automation finds the test_name for the same git_reference with "flaky": false, it should not re-open the issue. If it finds the test_name for a different git_reference, the test has failed on another post-merge commit (git_reference) even though it was marked as fixed, so the automation should create a new document and a new issue flagging the test as flaky for that commit. For open issues, the automation continues to update the issue body and the document fields above, keeping "flaky": true.

The assumption here is that the user will close the issue only when all the test names listed in it (for example, #14381) are fixed. The framework maintains one GitHub issue per test class, covering all of that class's test failures, and separate documents in the cluster, one per test name.
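The per-run decision in Solution 2 could look something like the sketch below. The function `decide_action` and its return values `"skip"`, `"update"`, and `"create"` are hypothetical names for this example, and the documents are plain dictionaries standing in for whatever a query against gradle-check-flaky-tests would return.

```python
from typing import Any, Dict, Iterable


def decide_action(docs: Iterable[Dict[str, Any]],
                  test_name: str,
                  git_reference: str) -> str:
    """Decide what to do with one failing test seen in a post-merge run.

    Returns one of three hypothetical action names:
      "skip"   - already recorded for this commit and marked fixed
      "update" - already recorded for this commit and still flaky
      "create" - not recorded for this commit: new document and issue
    """
    for doc in docs:
        same_test = doc["test_name"] == test_name
        same_commit = doc["git_reference"] == git_reference
        if same_test and same_commit:
            return "update" if doc["flaky"] else "skip"
    # No document for this (test_name, git_reference) pair, so the test
    # failed on a commit that has not been recorded yet.
    return "create"
```

Keying the lookup on the pair rather than on the test name alone is what lets a fixed test stay closed for old commits while still being flagged when it fails on a new one.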

With this solution we can even build trends on these flaky test documents using the OpenSearch Metrics Dashboard.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@prudhvigodithi prudhvigodithi added enhancement Enhancement or improvement to existing feature or request untriaged and removed untriaged labels Jun 20, 2024
@prudhvigodithi
Member Author

[Triage]
Adding @andrross @reta @dblock @msfroh @shiv0408 @getsaurabh02 to review the proposed solutions.

@prudhvigodithi prudhvigodithi added Build Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement. and removed untriaged Other labels Jun 20, 2024
@reta
Collaborator

reta commented Jun 20, 2024

I think the 1st option is pretty simple and straightforward, thanks @prudhvigodithi !

@andrross
Member

Agree that the 1st option is the simpler one and probably worth trying first.

@prudhvigodithi
Member Author

Thanks, Solution 1 is in place now; here is an example: #14499 (comment).
Related Library change PR: opensearch-project/opensearch-build-libraries#448
Related Jenkins change PR: opensearch-project/opensearch-build#4805.

Thank you

@prudhvigodithi
Member Author

Closing this issue, as we now have a mechanism to close the created Gradle Check AUTOCUT flaky test issues.

Projects
Status: ✅ Done
Development

No branches or pull requests

3 participants