[Automation Enhancement] Mechanism to close the created Gradle Check AUTOCUT flaky test issues. #14475

prudhvigodithi · 2024-06-20T17:43:06Z

Is your feature request related to a problem? Please describe

Background

Following the initial implementation in #13950, the automation described in the DEVELOPER_GUIDE identifies flaky tests and creates flaky test report issues based on test failures in the post-merge actions. The data used to create these issues comes from the OpenSearch Metrics Project (for more details, refer to the Gradle Check Metrics Dashboard). The initial goal of finding flaky tests and creating a detailed issue report has been met.

Problem Statement

Currently, an auto-created issue can only be closed once its failures have not appeared in the post-merge actions for 30 days, because the query executed on the metrics cluster filters for tests that failed in the past 30 days. For example, here is an AUTOCUT issue created for RemoteStoreClusterStateRestoreIT: even though the problem was identified and fixed promptly, there is no way for a user to close the issue, because the automation flags RemoteStoreClusterStateRestoreIT again and re-opens the issue as long as the test was seen failing in the past 30 days. As a result, the issue remains open for the next 30 days (assuming the test does not fail again in post-merge action builds) even though the user has fixed the flaky test.

Describe the solution you'd like

Proposed Solution

Solution 1

As proposed here

If the issue is closed (on the assumption that the user has fixed the flaky test), the automation should not re-open it unless the data differs from what is shown in the issue body. If anything in the issue body would be different after the issue was closed, the automation should re-open it. The data to compare is the markdown table and not the linked PRs, because failures during PR creation can sometimes be genuine. In other words, re-open when a new failure (with a different post-merge commit) is seen after the issue is closed. This should also cover the case where a flaky test is believed to be fixed but recurs: with the new occurrence, the issue should re-open with the new data.

This solution is a simple comparison against the test names and git references in the existing issue body to decide whether to re-open the issue (once it has been closed by the user) or keep it closed.
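The comparison described in Solution 1 could be sketched roughly as follows. This is only an illustration in Python, not the code that runs in the Jenkins automation: the function name `should_reopen`, the tuple representation of a table row, and the second commit hash are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# A failure is identified by its test name and the post-merge commit it failed on.
Failure = Tuple[str, str]


def should_reopen(issue_failures: Iterable[Failure],
                  current_failures: Iterable[Failure]) -> bool:
    """Return True when the latest run reports a failure not already in the closed issue."""
    known: Set[Failure] = set(issue_failures)
    return any(failure not in known for failure in current_failures)


# Hypothetical rows, as if parsed from the closed issue's markdown table.
closed_issue_rows = [
    ("RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata",
     "a06afef1fc63cab9ab9fc1b84215a575a91a12d8"),
]

# Same test on the same commit: still inside the 30-day window, but nothing new.
same_data = list(closed_issue_rows)

# Same test failing on a different (made-up) post-merge commit.
new_data = closed_issue_rows + [
    ("RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata",
     "0000000000000000000000000000000000000000"),
]

print(should_reopen(closed_issue_rows, same_data))  # False
print(should_reopen(closed_issue_rows, new_data))   # True
```

Comparing (test name, commit) pairs rather than test names alone is what lets a closed issue stay closed while the old failure is still within the 30-day query window.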

Solution 2

This solution maintains a database of events and uses those events to decide whether to open a new issue or keep the issue closed.

Create a new index, gradle-check-flaky-tests, from the flaky test names identified in the OpenSearch Gradle Check Metrics (this identification is part of the automation, in FetchPostMergeFailedTestClass). Then create a new document for each test name with its test_class and git_reference association. For example:

{
  "_index": "gradle-check-flaky-tests",
  "_id": "yrZzNpAB0YKBsy3HQg9I",
  "_version": 1,
  "_score": null,
  "_source": {
    "test_class": "RemoteStoreClusterStateRestoreIT",
    "test_name": "org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata",
    "git_reference": "a06afef1fc63cab9ab9fc1b84215a575a91a12d8",
    "flaky": true,
    "flaky_identified_at":
    "updated_at": 1718898508731,
    "fixed_at":
    "issue_number":
    "time_open_in_days":
    "time_closed_in_days":
  },
  "fields": {
    "updated_at": [
      "2024-06-20T15:48:28.731Z"
    ]
  },
  "sort": [
    1718898508731
  ]
}

The flaky_identified_at is the date when the document was first created.
The updated_at is when the daily automation was triggered.
(Optional) The time_open_in_days is the difference (updated_at - flaky_identified_at).
(Optional) The time_closed_in_days is the difference (updated_at - flaky_identified_at) once flaky is set to false.
The flaky field will be set to false once the issue is closed by the user.
The fixed_at will be the updated_at value at the time flaky is set to false (approximately the time when the issue was closed).
The issue_number is the GitHub issue number created for the test_class (for example, #14326).
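To make the field definitions above concrete, here is a minimal Python sketch of how one daily run might refresh a document. It follows the formulas exactly as written in the list; the helper names `days_between` and `refresh_document`, the use of epoch milliseconds for every timestamp, and the sample timestamps are assumptions for illustration only.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_between(start_ms: int, end_ms: int) -> int:
    """Whole days between two epoch-millisecond timestamps."""
    return (end_ms - start_ms) // MS_PER_DAY


def refresh_document(doc: dict, now_ms: int, issue_closed: bool) -> dict:
    """Return a copy of doc updated for one daily automation run."""
    updated = dict(doc)
    updated["updated_at"] = now_ms
    if issue_closed and updated["flaky"]:
        # The user closed the issue: mark the test as fixed at roughly this run's time.
        updated["flaky"] = False
        updated["fixed_at"] = now_ms
    elapsed = days_between(updated["flaky_identified_at"], now_ms)
    if updated["flaky"]:
        updated["time_open_in_days"] = elapsed
    else:
        updated["time_closed_in_days"] = elapsed
    return updated


# Hypothetical document, first created at the timestamp used in the example above.
start = 1718898508731
doc = {"flaky": True, "flaky_identified_at": start, "updated_at": start}

open_run = refresh_document(doc, start + 5 * MS_PER_DAY, issue_closed=False)
closed_run = refresh_document(open_run, start + 7 * MS_PER_DAY, issue_closed=True)

print(open_run["time_open_in_days"])      # 5
print(closed_run["flaky"])                # False
print(closed_run["time_closed_in_days"])  # 7
```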

In subsequent automation runs, if the automation finds the test_name for the same git_reference with "flaky": false, it should not re-open the issue. If it finds the test_name for a different git_reference, then the test failed for another post-merge commit (git_reference) even though the flaky test had been fixed, and the automation should create a new document and a new issue flagging the test as flaky for that commit. For open issues, the automation will keep updating the issue body and the document fields above, still keeping "flaky": true.
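The decision rule for Solution 2 could look something like the Python sketch below. The function name `decide_action`, the action strings, and the commit hash `B` are hypothetical. The branch for a test that already has an open document but fails on a new commit is an assumption of this sketch, since the proposal only says that open issues keep being updated.

```python
from typing import List


def decide_action(documents: List[dict], test_name: str, git_reference: str) -> str:
    """Pick what a run should do for one failing (test_name, git_reference) pair."""
    matches = [d for d in documents if d["test_name"] == test_name]
    exact = next((d for d in matches if d["git_reference"] == git_reference), None)
    if exact is not None:
        # Known failure: keep updating while open, leave alone once marked fixed.
        return "update_open_issue" if exact["flaky"] else "keep_closed"
    if any(d["flaky"] for d in matches):
        # Assumption: a new commit for a test whose issue is still open only updates it.
        return "update_open_issue"
    # New test, or a fixed test failing again on a different post-merge commit.
    return "create_document_and_issue"


T = "org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata"
A = "a06afef1fc63cab9ab9fc1b84215a575a91a12d8"
B = "0000000000000000000000000000000000000000"  # made-up commit for illustration

fixed_docs = [{"test_name": T, "git_reference": A, "flaky": False}]
open_docs = [{"test_name": T, "git_reference": A, "flaky": True}]

print(decide_action(fixed_docs, T, A))  # keep_closed
print(decide_action(fixed_docs, T, B))  # create_document_and_issue
print(decide_action(open_docs, T, A))   # update_open_issue
```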

The assumption here is that the user will only close the issue when all the test names that are part of the issue (for example, #14381) are fixed. The framework maintains one GitHub issue for all test failures grouped by test class, and separate documents in the cluster, one for each test name.

With this solution we can even build trends on these flaky test documents using the OpenSearch Metrics Dashboard.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response


prudhvigodithi · 2024-06-20T17:43:51Z

[Triage]
Adding @andrross @reta @dblock @msfroh @shiv0408 @getsaurabh02 to please check the proposed solutions.

reta · 2024-06-20T19:49:40Z

I think the 1st option is pretty simple and straightforward, thanks @prudhvigodithi !

andrross · 2024-06-20T22:46:05Z

Agree that the 1st option is the simpler one and probably worth trying first.

prudhvigodithi · 2024-06-27T18:36:59Z

Thanks, Solution 1 is in place now. Here is an example: #14499 (comment).
Related Library change PR: opensearch-project/opensearch-build-libraries#448
Related Jenkins change PR: opensearch-project/opensearch-build#4805.

Thank you

prudhvigodithi · 2024-07-03T17:35:34Z

Closing this issue, as we now have the mechanism to close the created Gradle Check AUTOCUT flaky test issues.

prudhvigodithi added enhancement Enhancement or improvement to existing feature or request untriaged and removed untriaged labels Jun 20, 2024

github-actions bot added Other untriaged labels Jun 20, 2024

prudhvigodithi added Build Build Tasks/Gradle Plugin, groovy scripts, build tools, Javadoc enforcement. and removed untriaged Other labels Jun 20, 2024

This was referenced Jun 21, 2024

[AUTOCUT] Gradle Check Flaky Test Report for MixedClusterClientYamlTestSuiteIT #14294

Open

[AUTOCUT] Gradle Check Flaky Test Report for AzureStorageServiceTests #14499

Closed

prudhvigodithi added this to OpenSearch Engineering Effectiveness Jun 24, 2024

github-project-automation bot moved this to Backlog in OpenSearch Engineering Effectiveness Jun 24, 2024

prudhvigodithi self-assigned this Jun 24, 2024

prudhvigodithi moved this from Backlog to In Progress in OpenSearch Engineering Effectiveness Jun 24, 2024

This was referenced Jun 25, 2024

Mechanism to close the created Gradle Check AUTOCUT flaky test issues opensearch-project/opensearch-build-libraries#448

Merged

Update the gradle-check-flaky-test-issue-creation.jenkinsfile with new library version opensearch-project/opensearch-build#4805

Merged

peterzhuamazon added this to Engineering Effectiveness Board Jul 1, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Jul 1, 2024

peterzhuamazon moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Jul 1, 2024

prudhvigodithi closed this as completed Jul 3, 2024

github-project-automation bot moved this from 🏗 In progress to ✅ Done in Engineering Effectiveness Board Jul 3, 2024

github-project-automation bot moved this from In Progress to Done in OpenSearch Engineering Effectiveness Jul 3, 2024

reta mentioned this issue Jul 3, 2024

[FEATURE] Introduce commit queue on Jenkins (main branch only) to proactively spot flaky tests opensearch-project/opensearch-build#4810

Open
