Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Fixing flaky tests #111366

Merged
merged 14 commits into from
Sep 9, 2021
Merged

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Sep 7, 2021

Resolves #106492
Resolves #111022
Resolves #111001
Resolves #110827
Resolves #110801
Resolves #110789
Resolves #111496

Summary

Started out fixing #106492, which was a failure in superuser at space1 should handle custom retry logic when appropriate but when unskipping the test suite, got a lot of flaky tests related to "expect 204, got 409", so I included changes to fix that in this PR as well.

As @mikecote suggested, the switch to multiple-isolated SO type for alert caused the behavior for deleting rules to change. Now that a rule can have an array of namespaces, deleting that rule could actually just be updating the SO document to remove the rule's namespace from the namespaces array. This leads to potential 409. Updated the rule client delete to use the same retry if conflict logic as the `update function does.

Flaky test runs:

@ymao1 ymao1 changed the title Flaky test custom retry [Alerting] Fixing flaky test, expect 204 got 409 Sep 7, 2021
@ymao1 ymao1 changed the title [Alerting] Fixing flaky test, expect 204 got 409 [Alerting] Fixing flaky tests Sep 7, 2021
@@ -502,19 +501,6 @@ instanceStateValue: true
})
);

// Enqueue non ephemerically so we the latter code can query properly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this is the fix for the superuser at space1 should handle custom retry logic when appropriate flaky test. Discussed with @chrisronline and this was something that was added to update the tests when ephemeral actions were enabled. Since the tests are running with ephemeral disabled, this should have been removed but was overlooked. After removing, I no longer see flakiness in the flaky test runner.

@ymao1 ymao1 self-assigned this Sep 7, 2021
@ymao1 ymao1 added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0 v8.0.0 labels Sep 7, 2021
@@ -672,6 +672,14 @@ export class RulesClient {
}

public async delete({ id }: { id: string }) {
return await retryIfConflicts(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I will not be backporting this specific change to 7.x since 7.x should still have the SO type as single.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this change and the removal code below? I thought the removal of the below code solved the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the "expected 204, got 409" flaky tests. When I unskipped the test suite that was fixed by removing the dead code, I got a bunch of flaky test failures related to the 204/409 issue so I handled both in the same PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any concern with this code change for normal use (outside of tests)? Is it possible/a good idea to move this retry logic into the test suites themselves to avoid exposing any new issue by changing the non test code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This retry logic is already used in the rules client update and updateApiKey functions, which both have the possibility of returning a conflict if multiple Kibanas are updating a rule at the same time. We have to add this now to the delete function for multiple-isolated alert SOs because delete is not just deleting the SO doc with id ${spaceId}:alert:${alertId}. Instead it could be updating an alert SO that is shared between spaces by removing one of the spaces from the namespaces field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the "expected 204, got 409" errors are our functional tests correctly telling us that we've introduced a race condition with the change from single to multiple-isolated for the alert SO type. Yay for tests!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, makes sense to me!

@ymao1 ymao1 marked this pull request as ready for review September 7, 2021 22:54
@ymao1 ymao1 requested a review from a team as a code owner September 7, 2021 22:54
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2021

@elasticmachine merge upstream

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'd suggest rerunning CI on this a few times (once it starts passing) to make sure!

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2021

LGTM! I'd suggest rerunning CI on this a few times (once it starts passing) to make sure!

I linked to the flaky test runner runs I ran in the PR description. Total of 126 runs on this CI group with no failures 🎉

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2021

@elasticmachine merge upstream

Copy link
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 9, 2021

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 9, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 334f129 into elastic:master Sep 9, 2021
ymao1 added a commit to ymao1/kibana that referenced this pull request Sep 9, 2021
* Unskipping test

* Retrying deletes

* Unskipping test

* Changing fn signature

* hmm

* Removing unnecessary code

* Unskipping test

Co-authored-by: Kibana Machine <[email protected]>
@ymao1 ymao1 deleted the flaky-test-custom-retry branch September 9, 2021 13:49
ymao1 added a commit that referenced this pull request Sep 9, 2021
* [Alerting] Fixing flaky tests (#111366)

* Unskipping test

* Retrying deletes

* Unskipping test

* Changing fn signature

* hmm

* Removing unnecessary code

* Unskipping test

Co-authored-by: Kibana Machine <[email protected]>

* Reverting change to delete function in rules client for 7.x

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/security_and_spaces/tests/alerting/delete·ts - alerting api integration security and spaces enabled Alerts alerts delete superuser at space1 should handle delete alert request appropriately Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/spaces_only/tests/alerting/delete·ts - alerting api integration spaces only Alerting delete should handle delete alert request appropriately Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/security_and_spaces/tests/alerting/update·ts - alerting api integration security and spaces enabled Alerts alerts update "after all" hook in "update" Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/security_and_spaces/tests/alerting/delete·ts - alerting api integration security and spaces enabled Alerts alerts delete space_1_all_alerts_none_actions at space1 should handle delete alert request appropriately Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/security_and_spaces/tests/alerting/alerts·ts - alerting api integration security and spaces enabled Alerts alerts alerts superuser at space1 should handle custom retry logic when appropriate
5 participants