[Response Ops][Alerting] Discuss how to handle possible conflicts when updating alerts-as-data documents #158403
Comments
Pinging @elastic/response-ops (Team:ResponseOps)
I actually just ran into this issue as part of working on converting the alerts indices to data streams (PR #160572). In that PR, updates to existing documents use OCC, because they have to when you do updates in data streams. The scenario for me was a metric threshold rule running every 3 seconds with an active alert. I then added that alert to a case, and hit a 409 conflict on the next rule run. The case update raced in during the rule run, so the OCC info the rule run had read was stale. I'd have to look a bit more to see if we could delay reading the alert doc OCC info until later, to hopefully avoid the race condition, or some similar re-arrangement. But it feels to me like we need to come up with a safer way to do this. I think Ying's suggestions above will work:
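For context, here's a rough sketch of the kind of OCC-guarded write involved. This is not the framework's actual write path; the index name, ids, seq_no/primary_term values, and the single-field doc are placeholders for illustration:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Placeholders: in the real write path these come from the alert docs read
// earlier in the rule run.
const backingIndex = '.internal.alerts-observability.metrics.alerts-default-000001';
const alertId = 'some-alert-id';
const seqNo = 42;
const primaryTerm = 1;

const response = await client.bulk({
  operations: [
    // updates into a data stream must target the concrete backing index and
    // carry the OCC info (_seq_no/_primary_term) read earlier in the rule run
    { index: { _index: backingIndex, _id: alertId, if_seq_no: seqNo, if_primary_term: primaryTerm } },
    { 'kibana.alert.status': 'active' },
  ],
});

for (const item of response.items) {
  // if a case update landed between the read and this write, the OCC info is
  // stale and the item comes back as a 409 version conflict
  if (item.index?.status === 409) {
    console.log(`conflict updating alert ${item.index._id}: ${item.index.error?.reason}`);
  }
}
```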
We still run the risk of race conditions, but only for fields updated by multiple independent update calls. If every update caller is updating fields no other caller is updating, we shouldn't see any overwritten data. Right?
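One way to read that suggestion: each caller issues a partial update that only touches the fields it owns, e.g. the cases feature writes only the case ids field. A minimal sketch, assuming the update targets a concrete backing index of the data stream; the helper itself is hypothetical, not an existing Kibana API:

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper: attach a case by writing only the case-related field,
// leaving framework-owned fields for the rule executor to manage.
async function attachAlertToCase(
  client: Client,
  backingIndex: string, // data stream doc updates must target a concrete backing index
  alertId: string,
  caseId: string
) {
  await client.update({
    index: backingIndex,
    id: alertId,
    // partial update: only this field is written, so a concurrent rule-run
    // update to other fields isn't clobbered by this call
    doc: { 'kibana.alert.case_ids': [caseId] },
    // still retry on version conflicts, since the document version changes either way
    retry_on_conflict: 3,
  });
}
```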
Now that we're finishing up with #160572, I want to start thinking about this issue a bit more. A couple of things:
resolves #154266

Changes the way the alerts-as-data (AAD) indices are created and written to, so they can be built as they have been in the past (alias and backing indices created manually) OR as an ES data stream. Serverless will use data streams; other environments will use the existing alias and backing indices. The determination is made by optionally including the `serverless` plugin and checking whether it's available.

The implementation is organized around a `DataStreamAdapter` object, which is instantiated with a "data stream" or "alias" flavor and then handles the specialized behavior. Currently, many of the smaller implementation bits, like setting property values in ES calls, are done via in-line boolean checks against that object as to whether data streams or aliases are being used. This could probably be cleaned up some.

Existing non-serverless functional tests are largely unchanged, as they can't exercise the new data stream path. Some tests have been added to the serverless functional tests, to test basic reading / writing via updated alert documents.

## DEFER

- more serverless AAD tests
- #158403 - this issue is more noticeable now that we HAVE to do OCC with data streams, so we get errors instead of simply overwriting documents (which is also bad)

Co-authored-by: Patryk Kopycinski <[email protected]>
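A rough sketch of the shape of that adapter idea. The `DataStreamAdapter` name comes from the PR description above, but the interface, method names, and flavor handling below are illustrative assumptions, not the actual implementation:

```ts
// Illustrative sketch only: the real DataStreamAdapter in the alerting
// framework has a different surface. This just shows the two "flavors".
type Flavor = 'datastream' | 'alias';

interface DataStreamAdapter {
  readonly flavor: Flavor;
  isUsingDataStreams(): boolean;
  // bulk action metadata differs: data streams require `create` for new docs,
  // while the alias flavor can use plain `index` ops
  bulkMetaForNewDoc(target: string, id: string): Record<string, unknown>;
}

function createDataStreamAdapter(flavor: Flavor): DataStreamAdapter {
  return {
    flavor,
    isUsingDataStreams: () => flavor === 'datastream',
    bulkMetaForNewDoc: (target, id) =>
      flavor === 'datastream'
        ? { create: { _index: target, _id: id } }
        : { index: { _index: target, _id: id } },
  };
}

// Serverless gets data streams; everything else keeps the alias + backing indices.
const serverlessPluginAvailable = false; // placeholder for the optional plugin check
const adapter = createDataStreamAdapter(serverlessPluginAvailable ? 'datastream' : 'alias');
```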
I've got an experiment open in PR #165275 where I changed all the alert update docs from occ-based
…xecution

resolves: elastic#158403

When conflicts are detected while updating alert docs after a rule runs, we'll try to resolve the conflict by `mget()`'ing the alert documents again, to get the updated OCC info `_seq_no` and `_primary_term`. We'll also get the current versions of "ad-hoc" updated fields (which caused the conflict), like workflow status, case assignments, etc. And then attempt to update the alert doc again with that info, which should get it back up-to-date.

- hard-code the fields to refresh
- add skeletal version of a function test
- add some debugging for CI-only/not-local test failure
- change new rule type to wait for signal to finish
- a little clean up, no clue why this passes locally and fails in CI though
- dump rule type registry to see if rule type there
- remove diagnostic code from FT
- a real test that passes locally, for alerting framework
- add test for RR, but it's failing as it doesn't resolve conflicts yet
- fix some stuff, add support for upcoming untracked alert status
- change recover algorithm to retry subsequent times correctly
- remove RR support (not needed), so simplify other things
- remove more RR bits (TransportStuff) and add jest tests
- add multi-alert bulk update function test
- rebase main
…xecution (#166283)

resolves: #158403

When conflicts are detected while updating alert docs after a rule runs, we'll try to resolve the conflict by `mget()`'ing the alert documents again, to get the updated OCC info `_seq_no` and `_primary_term`. We'll also get the current versions of "ad-hoc" updated fields (which caused the conflict), like workflow status, case assignments, etc. And then attempt to update the alert doc again with that info, which should get it back up-to-date.

Note that the rule registry was not touched here. During this PR's development, I added the retry support to it, but then my functional tests were failing because there were never any conflicts happening. Turns out the rule registry mget's the alerts before it updates them, to get the latest values, so it won't need this fix.

It's also not clear to me if this can be exercised in serverless, since it requires the use of an alerting-framework-based AAD implementation AND the ability to ad-hoc update alerts. I think this can only be done with Elasticsearch Query and Index Threshold, and only when used in metrics scope, so it will show up in the metrics UX, which is where you can add the alerts to the case.

## manual testing

It's hard! I've seen the conflict messages before, but it's quite difficult to get them to go off whenever you want. The basic idea is to get a rule that uses alerting framework AAD (not rule registry, which is not affected the same way by conflicts, since it mget's alerts right before updating them), set it to run on a `1s` interval, and probably also configure TM to run on a `1s` interval, via the following configs:

```
xpack.alerting.rules.minimumScheduleInterval.value: "1s"
xpack.task_manager.poll_interval: 1000
```

You want to get the rule to execute often and generate a lot of alerts, and run for as long as possible. Then, while it's running, add the generated alerts to cases. Here's the EQ rule definition I used:

![image](https://github.com/elastic/kibana/assets/25117/56c69d50-a76c-48d4-9a45-665a0008b248)

I selected the alerts from the o11y alerts page, since you can't add alerts to cases from the stack page. Hmm. :-). Sort the alert list by low-high duration, so the newest alerts will be at the top. Refresh, select all the alerts (set the page to show 100), then add them to a case from the `...` menu.

If you force a conflict, you should see something like this in the Kibana logs:

```
[ERROR] [plugins.alerting] Error writing alerts: 168 successful, 100 conflicts, 0 errors:
[INFO ] [plugins.alerting] Retrying bulk update of 100 conflicted alerts
[INFO ] [plugins.alerting] Retried bulk update of 100 conflicted alerts succeeded
```

---------

Co-authored-by: kibanamachine <[email protected]>
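The retry flow described above, sketched very roughly. This is not the framework's actual code; the ad-hoc field list, function name, and merge strategy are assumptions for illustration only:

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative: fields a user or the cases feature may have changed "ad hoc"
// while the rule was running; the real list is hard-coded in the framework.
const ADHOC_FIELDS = ['kibana.alert.workflow_status', 'kibana.alert.case_ids'];

// Issue OCC-guarded updates; on 409s, re-read the conflicted docs with mget to
// pick up fresh _seq_no/_primary_term and current ad-hoc field values, then retry.
async function bulkUpdateAlertsWithRetry(
  client: Client,
  index: string,
  updates: Array<{ id: string; seqNo: number; primaryTerm: number; doc: Record<string, unknown> }>
) {
  const result = await client.bulk({
    operations: updates.flatMap((u) => [
      { index: { _index: index, _id: u.id, if_seq_no: u.seqNo, if_primary_term: u.primaryTerm } },
      u.doc,
    ]),
  });

  const conflictedIds = result.items
    .filter((item) => item.index?.status === 409)
    .map((item) => item.index!._id as string);
  if (conflictedIds.length === 0) return;

  // re-read the conflicted docs to get fresh OCC info and current ad-hoc fields
  const fresh = await client.mget<Record<string, unknown>>({ index, ids: conflictedIds });

  const retryOps = fresh.docs.flatMap((doc) => {
    if (!('found' in doc) || !doc.found || !doc._source) return [];
    const original = updates.find((u) => u.id === doc._id)!;
    const merged = { ...original.doc };
    // keep the ad-hoc values that won the race instead of overwriting them
    for (const field of ADHOC_FIELDS) {
      if (field in doc._source) merged[field] = (doc._source as Record<string, unknown>)[field];
    }
    return [
      { index: { _index: index, _id: doc._id, if_seq_no: doc._seq_no, if_primary_term: doc._primary_term } },
      merged,
    ];
  });

  if (retryOps.length > 0) {
    await client.bulk({ operations: retryOps });
  }
}
```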
From #156946 (comment):
For alerts as data, alert documents are created or updated on each rule execution. Users can also perform certain updates on these documents (attaching cases, closing an alert, etc.). It is possible for these two kinds of updates to occur at the same time and overwrite each other.
As an example:

1. A rule execution writes an alert document with workflow status `active`.
2. A user closes the alert, setting its workflow status to `closed`.
3. The next rule execution updates the alert document and overwrites the workflow status back to `active`, losing the user's change.

A few options: