[Alerting] Fixing flaky tests #111366
@@ -502,19 +501,6 @@ instanceStateValue: true
})
);

// Enqueue non ephemerically so we the latter code can query properly
Removing this is the fix for the `superuser at space1 should handle custom retry logic when appropriate` flaky test. Discussed with @chrisronline and this was something that was added to update the tests when ephemeral actions were enabled. Since the tests are running with ephemeral disabled, this should have been removed but was overlooked. After removing, I no longer see flakiness in the flaky test runner.
@@ -672,6 +672,14 @@ export class RulesClient {
  }

  public async delete({ id }: { id: string }) {
    return await retryIfConflicts(
Note that I will not be backporting this specific change to 7.x since 7.x should still have the SO type as `single`.
Do we need this change and the removal code below? I thought the removal of the below code solved the issue?
This fixes the "expected 204, got 409" flaky tests. When I unskipped the test suite that was fixed by removing the dead code, I got a bunch of flaky test failures related to the 204/409 issue, so I handled both in the same PR.
Is there any concern with this code change for normal use (outside of tests)? Is it possible, or a good idea, to move this retry logic into the test suites themselves to avoid exposing any new issue by changing the non-test code?
This retry logic is already used in the rules client `update` and `updateApiKey` functions, which both have the possibility of returning a conflict if multiple Kibanas are updating a rule at the same time. We have to add this now to the `delete` function for `multiple-isolated` alert SOs because delete is not just deleting the SO doc with id `${spaceId}:alert:${alertId}`. Instead it could be updating an alert SO that is shared between spaces by removing one of the spaces from the `namespaces` field.
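A minimal sketch of the retry-on-conflict idea described above, in TypeScript. The names `retryOnConflict`, `makeConflict`, `isConflict`, and `flakyDelete` are hypothetical, and this is not Kibana's actual `retryIfConflicts` implementation or signature. The sketch is synchronous only to stay short; the diff above awaits the real helper.

```typescript
// Hypothetical error shape: an Error carrying an HTTP-style status code.
interface StatusError extends Error {
  statusCode?: number;
}

// Build an error that represents a 409 conflict (illustrative helper).
function makeConflict(message: string): StatusError {
  const err: StatusError = new Error(message);
  err.statusCode = 409;
  return err;
}

// True only for errors tagged with status code 409.
function isConflict(err: unknown): boolean {
  return err instanceof Error && (err as StatusError).statusCode === 409;
}

// Run `operation`; on a 409 conflict, try again up to `retries` more times.
// Non-conflict errors, and the final conflict, are rethrown unchanged.
function retryOnConflict<T>(operation: () => T, retries: number = 3): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return operation();
    } catch (err) {
      if (!isConflict(err) || attempt >= retries) {
        throw err;
      }
      // Conflict with retries remaining: loop and run the operation again.
    }
  }
}

// Usage: a fake delete that conflicts twice, then succeeds on the third call.
let deleteAttempts = 0;
function flakyDelete(): string {
  deleteAttempts++;
  if (deleteAttempts <= 2) {
    throw makeConflict("version conflict");
  }
  return "deleted";
}

const deleteResult = retryOnConflict(flakyDelete);
// deleteResult === "deleted", deleteAttempts === 3
```

Only conflicts are retried, so any other failure still surfaces on the first attempt, and the retry count is bounded so a persistent conflict is eventually reported to the caller.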
To be clear, the "expected 204, got 409" errors are our functional tests correctly telling us that we've introduced a race condition with the change from `single` to `multiple-isolated` for the alert SO type. Yay for tests!
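The race can be illustrated with a toy in-memory model, assuming optimistic concurrency keyed on a version number. `ToyStore`, `SharedDoc`, and `removeFromSpace` are hypothetical names, and the status codes are simulated here, not produced by Kibana or Elasticsearch.

```typescript
// Toy document shared between spaces; field names are illustrative assumptions.
interface SharedDoc {
  namespaces: string[];
  version: number;
}

// Toy store holding one shared document with optimistic concurrency:
// a write succeeds only if the caller read the current version.
class ToyStore {
  private doc: SharedDoc | undefined = {
    namespaces: ["space1", "space2"],
    version: 1,
  };

  // Return a copy of the current document, or undefined if it was removed.
  read(): SharedDoc | undefined {
    if (!this.doc) {
      return undefined;
    }
    return { namespaces: this.doc.namespaces.slice(), version: this.doc.version };
  }

  // "Delete" from one space. Returns 204 on success, 409 when the caller's
  // version is stale, and 404 when the document no longer exists.
  removeFromSpace(space: string, readVersion: number): number {
    if (!this.doc) {
      return 404;
    }
    if (readVersion !== this.doc.version) {
      return 409;
    }
    const remaining = this.doc.namespaces.filter((ns) => ns !== space);
    // With spaces left, the delete is really an update; otherwise drop the doc.
    this.doc =
      remaining.length > 0
        ? { namespaces: remaining, version: this.doc.version + 1 }
        : undefined;
    return 204;
  }
}

const toyStore = new ToyStore();

// Both callers read the document before either one writes.
const readA = toyStore.read()!;
const readB = toyStore.read()!;

const statusA = toyStore.removeFromSpace("space1", readA.version); // 204
const statusB = toyStore.removeFromSpace("space2", readB.version); // 409, stale version

// Re-reading and retrying resolves the conflict.
const statusBRetry = toyStore.removeFromSpace("space2", toyStore.read()!.version); // 204
const docGone = toyStore.read() === undefined; // true
```

In this model the second caller's failure is transient: nothing is wrong with its request, it simply acted on an outdated read, which is why retrying against a fresh read is enough.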
Thanks for the explanation, makes sense to me!
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
@elasticmachine merge upstream
LGTM! I'd suggest rerunning CI on this a few times (once it starts passing) to make sure!
I linked to the flaky test runner runs I ran in the PR description. Total of 126 runs on this CI group with no failures 🎉
@elasticmachine merge upstream
LGTM
@elasticmachine merge upstream
@elasticmachine merge upstream
💚 Build Succeeded

cc @ymao1
* Unskipping test
* Retrying deletes
* Unskipping test
* Changing fn signature
* hmm
* Removing unnecessary code
* Unskipping test

Co-authored-by: Kibana Machine <[email protected]>

* [Alerting] Fixing flaky tests (#111366)
* Unskipping test
* Retrying deletes
* Unskipping test
* Changing fn signature
* hmm
* Removing unnecessary code
* Unskipping test

Co-authored-by: Kibana Machine <[email protected]>

* Reverting change to delete function in rules client for 7.x

Co-authored-by: Kibana Machine <[email protected]>
Resolves #106492
Resolves #111022
Resolves #111001
Resolves #110827
Resolves #110801
Resolves #110789
Resolves #111496
Summary
Started out fixing #106492, which was a failure in `superuser at space1 should handle custom retry logic when appropriate`, but when unskipping the test suite, I got a lot of flaky tests related to "expected 204, got 409", so I included changes to fix that in this PR as well.

As @mikecote suggested, the switch to the `multiple-isolated` SO type for `alert` caused the behavior for deleting rules to change. Now that a rule can have an array of `namespaces`, deleting that rule could actually just be updating the SO document to remove the rule's namespace from the `namespaces` array. This leads to a potential `409`. Updated the rules client `delete` to use the same retry-if-conflict logic as the `update` function does.

Flaky test runs: