
[Response Ops][Alerting] Discuss how to handle possible conflicts when updating alerts-as-data documents #158403

Closed
ymao1 opened this issue May 24, 2023 · 4 comments · Fixed by #166283
Assignees: pmuellr
Labels: Feature:Alerting/Alerts-as-Data, Feature:Alerting, Team:ResponseOps

Comments

ymao1 commented May 24, 2023

From #156946 (comment):

What happens if the bulk call happens at the same time a user updates the same alert document? Will this overwrite the user's action and/or vice versa?

For alerts-as-data, alert documents are created or updated on each rule execution. Users can also perform certain updates on these documents (attaching cases, closing an alert, etc.). It is possible for these two kinds of updates to occur at the same time and overwrite each other.

As an example:

  1. Rule execution begins. Alert "A" is queried from the alert index and has status active
  2. User updates alert "A" to status closed
  3. Rule reports alert "A" as active again
  4. Rule execution ends. Alert "A" is updated by the executor using the data from (1) and status is reset to active.

A few options:

  • Since we only allow certain fields to be updated by the user, ensure that rule executions only perform partial updates and never include those user-updatable fields in the update doc
  • We add OCC to the bulk index call and retry on version conflicts (see the sketch after this list)
  • Others?
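
For the second option, here is a minimal sketch of what adding OCC to the bulk index call could look like with the Elasticsearch JS client. This is not the framework's actual code; names like `AlertWrite` and `bulkIndexWithOcc` are made up. It returns the writes that hit a version conflict so the caller can decide how to retry them:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

interface AlertWrite {
  id: string;
  seqNo: number;       // _seq_no captured when the alert was read at rule-run start
  primaryTerm: number; // _primary_term captured at the same time
  doc: Record<string, unknown>;
}

// Bulk-index alert docs with OCC; return the writes that came back as 409 version conflicts.
async function bulkIndexWithOcc(index: string, writes: AlertWrite[]): Promise<AlertWrite[]> {
  const operations = writes.flatMap((w) => [
    { index: { _index: index, _id: w.id, if_seq_no: w.seqNo, if_primary_term: w.primaryTerm } },
    w.doc,
  ]);
  const resp = await es.bulk({ refresh: 'wait_for', operations });
  // One response item per action, in the same order as `writes`.
  return writes.filter((_, i) => resp.items[i].index?.status === 409);
}
```

The caller would then re-read the conflicted docs to refresh `_seq_no`/`_primary_term` and retry a bounded number of times, which is roughly the shape the eventual fix in #166283 takes.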
ymao1 added the Feature:Alerting, Team:ResponseOps, and Feature:Alerting/Alerts-as-Data labels on May 24, 2023
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)


pmuellr commented Jul 14, 2023

I actually just ran into this issue as part of working on converting the alerts indices to data streams (PR #160572). In that PR, updates to existing documents use OCC, because they have to when you do updates in data streams.

The scenario for me was a metric threshold rule running, with an active alert, every 3 seconds. I then added that alert to a case. And hit a 409 conflict on the next rule run. Somehow, the case update raced in during the rule run, so the OCC info in the rule run was stale.

I'd have to look a bit more to see if we could delay reading the alert doc OCC info to later, to hopefully avoid a race condition, or similar re-arrangement scenarios.

But it feels to me that we need to come up with a safer way to do this. I think Ying's suggestions above will work:

  • every update to an alert doc is a partial update
  • every update is done with OCC, and retried a fixed number of times (like rule docs)

We still run the risk of race conditions, but only for fields updated by multiple independent update calls. If every update caller is updating fields no other caller is updating, we shouldn't see any overwritten data. Right?
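
As a rough illustration of those two points combined for a single alert doc (a sketch only; `MAX_RETRIES` and the parameter names are assumptions, and real callers would batch this rather than work one doc at a time):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });
const MAX_RETRIES = 3; // assumed bound, in the spirit of the retries used for rule docs

// Partially update only the fields this caller owns, with OCC and a bounded retry on 409s.
async function partialUpdateAlert(index: string, id: string, fields: Record<string, unknown>) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    // Re-read on each attempt so the OCC info is as fresh as possible.
    const current = await es.get({ index, id });
    try {
      await es.update({
        index,
        id,
        doc: fields, // merged into the existing doc; other callers' fields are untouched
        if_seq_no: current._seq_no,
        if_primary_term: current._primary_term,
      });
      return;
    } catch (err: any) {
      // 409 means someone else wrote in between; loop to re-read and try again.
      if (err?.statusCode !== 409 || attempt === MAX_RETRIES) throw err;
    }
  }
}
```

Because each update is partial, the worst case on a lost race is two callers fighting over the same field; a field owned by a single caller can't be clobbered by anyone else.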


pmuellr commented Aug 29, 2023

Now that we're finishing up with #160572, I want to start thinking about this issue a bit more. A couple of things:

  • I think the chance of conflicts is going to be worse now, as I believe the only code doing overwriting of alerts is the alerting rule runner. It fetches the alerts at the beginning of a run, along with seq_no and primary_term, and then uses those to update existing alert docs when the rule completes. This means the window of opportunity for a conflict is the length of a rule run.

  • Any other update to an alert doc during a rule run - closing, adding a case, etc. - will cause the rule run's update to the alert doc to fail. Previously, this race condition would instead cause the ad-hoc update (closing, adding a case, etc.) to be lost.

  • I think the only way around this is for all writes to alert documents, beyond the initial create, to use partial updates. It appears that simple updates like closing an alert already use partial updates, so it's the rule runner's updates to alert docs that will need to be converted to partial updates. The danger is that these updates will no longer "delete" fields that aren't passed in on the update, instead leaving the previous value (see the sketch after this list).
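
To make that last caveat concrete, a hedged example (the index and field names are illustrative, not the real alert mappings): a full `index` call replaces the stored document, so omitted fields disappear, while a partial `update` merges into it, so a field the rule runner stops sending silently keeps its old value unless it is explicitly cleared:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Full index: the new body replaces the whole document; any field not listed here is gone.
await es.index({
  index: '.alerts-example', // hypothetical index name
  id: 'alert-1',
  document: { 'kibana.alert.status': 'active', 'kibana.alert.duration.us': 1000 },
});

// Partial update: only the listed fields change; everything else keeps its previous value.
// To actually clear a field you must send it explicitly (e.g. set it to null).
await es.update({
  index: '.alerts-example',
  id: 'alert-1',
  doc: { 'kibana.alert.status': 'active', 'kibana.alert.flapping_history': null },
});
```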

pmuellr added a commit that referenced this issue Aug 30, 2023
resolves #154266

Changes the way the alerts-as-data (AAD) indices are created and written
to, to allow them to be built as they have been in the past (alias and
backing indices created manually) OR as an ES Data Stream.

Serverless will use Data Streams, other environments will use the
existing alias and backing indices. The determination is made by
optionally including the `serverless` plugin, and determining if it's
available.

The implementation is organized around a `DataStreamAdapter` object,
which is instantiated with a "data stream" or "alias" flavor, and then
it handles the specialized behavior. Currently, a lot of the smaller
implementation bits, like setting property values in ES calls, are done
via in-line boolean checks of that object as to whether data streams or
aliases are being used. This could probably be cleaned up some.

Existing non-serverless functional tests are largely unchanged, as they
can't test the new data stream path. Some tests have been added to the
serverless functional tests, to cover basic reading / writing of updated
alert documents.

## DEFER

- more serverless AaD tests

- #158403 - this issue is more
noticeable now that we HAVE to do OCC with data streams, so we get
errors instead of simply overwriting documents (which is also bad)

Co-authored-by: Patryk Kopycinski <[email protected]>
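
For readers skimming the thread, a rough, hypothetical sketch of the kind of flavor-switching adapter the commit describes (the real `DataStreamAdapter` interface in #160572 is richer and differs in its details):

```ts
// Hypothetical shape only; not the actual interface from the PR.
interface DataStreamAdapter {
  isUsingDataStreams(): boolean;
  // Flavor-specific properties to spread into bulk action metadata:
  // data streams require create-only appends, aliases can require the write alias.
  getBulkMetaProps(): { op_type?: 'create'; require_alias?: boolean };
}

function createDataStreamAdapter(useDataStreams: boolean): DataStreamAdapter {
  return {
    isUsingDataStreams: () => useDataStreams,
    getBulkMetaProps: () => (useDataStreams ? { op_type: 'create' } : { require_alias: true }),
  };
}
```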
eokoneyo pushed a commit to eokoneyo/kibana that referenced this issue Aug 31, 2023
(#160572; same commit message as above)

pmuellr commented Aug 31, 2023

I've got an experiment open in PR #165275 where I changed all the alert doc updates from OCC-based index operations to partial updates, to see what that breaks.

@pmuellr pmuellr self-assigned this Sep 6, 2023
@mikecote mikecote moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Sep 8, 2023
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 22, 2023
…xecution

resolves: elastic#158403

When conflicts are detected while updating alert docs after a rule
runs, we'll try to resolve the conflict by `mget()`'ing the alert
documents again, to get the updated OCC info `_seq_no` and
`_primary_term`.  We'll also get the current versions of "ad-hoc"
updated fields (which caused the conflict), like workflow status,
case assignments, etc.  And then attempt to update the alert doc again,
with that info, which should get it back up-to-date.

- hard-code the fields to refresh
- add skeletal version of a function test
- add some debugging for CI-only/not-local test failure
- change new rule type to wait for signal to finish
- a little clean up, no clue why this passes locally and fails in CI though
- dump rule type registry to see if rule type there
- remove diagnostic code from FT
- a real test that passes locally, for alerting framework
- add test for RR, but it's failing as it doesn't resolve conflicts yet
- fix some stuff, add support for upcoming untracked alert status
- change recovery algorithm to retry subsequent times correctly
- remove RR support (not needed), so simplify other things
- remove more RR bits (TransportStuff) and add jest tests
- add multi-alert bulk update function test
- rebase main
@mikecote mikecote moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Sep 25, 2023
ymao1 pushed a commit that referenced this issue Sep 28, 2023
…xecution (#166283)

resolves: #158403

When conflicts are detected while updating alert docs after a rule runs,
we'll try to resolve the conflict by `mget()`'ing the alert documents
again, to get the updated OCC info `_seq_no` and `_primary_term`. We'll
also get the current versions of "ad-hoc" updated fields (which caused
the conflict), like workflow status, case assignments, etc. And then
attempt to update the alert doc again, with that info, which should get
it back up-to-date.
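
A hedged sketch of that recovery path (the helper and field names are assumptions; the real code lives in the alerting framework's alerts client and differs):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Fields users can change "ad-hoc" between rule runs; their latest values win on retry.
const ADHOC_FIELDS = ['kibana.alert.workflow_status', 'kibana.alert.case_ids'];

async function retryConflictedAlerts(
  index: string,
  conflicted: Array<{ id: string; doc: Record<string, any> }>
) {
  // Re-read the conflicted docs to refresh _seq_no/_primary_term and the ad-hoc fields.
  const { docs } = await es.mget({
    docs: conflicted.map((c) => ({ _index: index, _id: c.id })),
  });

  const operations = conflicted.flatMap((c, i) => {
    const fresh = docs[i] as any;
    if (!fresh.found) return [];
    // Carry the user's ad-hoc updates forward so the retry doesn't clobber them.
    for (const field of ADHOC_FIELDS) {
      if (fresh._source[field] !== undefined) c.doc[field] = fresh._source[field];
    }
    return [
      {
        index: {
          _index: fresh._index, // concrete backing index returned by mget
          _id: c.id,
          if_seq_no: fresh._seq_no,
          if_primary_term: fresh._primary_term,
        },
      },
      c.doc,
    ];
  });

  // Attempt the bulk update again with the refreshed OCC info and merged fields.
  return es.bulk({ refresh: 'wait_for', operations });
}
```

The log lines in the manual-testing section below roughly correspond to the detect / retry / succeed steps of this loop.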

Note that the rule registry was not touched here. During this PR's
development, I added the retry support to it, but then my functional tests
were failing because there were never any conflicts happening. It turns out
the rule registry mget's the alerts right before it updates them, to get the
latest values, so it won't need this fix.

It's also not clear to me if this can be exercised in serverless, since
it requires the use of an alerting framework based AaD implementation
AND the ability to ad-hoc update alerts. I think this can only be done
with Elasticsearch Query and Index Threshold, and only when used in
metrics scope, so it will show up in the metrics UX, which is where you
can add the alerts to the case.

## manual testing

It's hard! I've seen the conflict messages before, but it's quite
difficult to make them go off on demand. The basic idea is to pick a rule
that uses the alerting framework AAD path (not the rule registry, which
isn't affected the same way by conflicts, since it mget's alerts right
before updating them), set it to run on a `1s` interval, and probably also
configure Task Manager to poll on a `1s` interval, via the following configs:

```
xpack.alerting.rules.minimumScheduleInterval.value: "1s"
xpack.task_manager.poll_interval: 1000
```

You want to get the rule to execute often and generate a lot of alerts,
and run for as long as possible. Then while it's running, add the
generated alerts to cases. Here's the EQ rule definition I used:


![image](https://github.com/elastic/kibana/assets/25117/56c69d50-a76c-48d4-9a45-665a0008b248)

I selected the alerts from the o11y alerts page, since you can't add
alerts to cases from the stack page. Hmm. :-). Sort the alert list by
low-high duration, so the newest alerts will be at the top. Refresh,
select all the alerts (set the page size to 100), then add them to a case from the
`...` menu. If you force a conflict, you should see something like this
in the Kibana logs:

```
[ERROR] [plugins.alerting] Error writing alerts: 168 successful, 100 conflicts, 0 errors:
[INFO ] [plugins.alerting] Retrying bulk update of 100 conflicted alerts
[INFO ] [plugins.alerting] Retried bulk update of 100 conflicted alerts succeeded
```

---------

Co-authored-by: kibanamachine <[email protected]>