
[Response Ops][Alerting] Discuss how to handle possible conflicts when updating alerts-as-data documents #158403

Closed
ymao1 opened this issue May 24, 2023 · 4 comments · Fixed by #166283
Assignees: pmuellr
Labels: Feature:Alerting/Alerts-as-Data, Feature:Alerting, Team:ResponseOps

Comments

ymao1 commented May 24, 2023

From #156946 (comment):

What happens if the bulk call happens at the same time a user updates the same alert document? Will this overwrite the user's action and/or vice versa?

For alerts-as-data, alert documents are created or updated on each rule execution. Users can also perform certain updates on these documents (attaching cases, closing an alert, etc.). It is possible for these two kinds of updates to occur at the same time and overwrite each other.

As an example:

  1. Rule execution begins. Alert "A" is queried from the alert index and has status active
  2. User updates alert "A" to status closed
  3. Rule reports alert "A" as active again
  4. Rule execution ends. Alert "A" is updated by the executor using the data from (1) and status is reset to active.

A few options:

  • Since we only allow certain fields to be updated by the user, ensure that rule executions only perform partial updates and never include those user-updatable fields in the update doc
  • We add OCC to the bulk index call and retry on version conflicts (see the sketch after this list)
  • Others?
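
For the second option, here is a minimal sketch of what adding OCC to the bulk index call could look like with the Elasticsearch JS client. This is not the framework's actual code; names like `AlertWrite` and `bulkIndexWithOcc` are made up. It returns the writes that hit a version conflict so the caller can decide how to retry them:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

interface AlertWrite {
  id: string;
  seqNo: number;       // _seq_no captured when the alert was read at rule-run start
  primaryTerm: number; // _primary_term captured at the same time
  doc: Record<string, unknown>;
}

// Bulk-index alert docs with OCC; return the writes that came back as 409 version conflicts.
async function bulkIndexWithOcc(index: string, writes: AlertWrite[]): Promise<AlertWrite[]> {
  const operations = writes.flatMap((w) => [
    { index: { _index: index, _id: w.id, if_seq_no: w.seqNo, if_primary_term: w.primaryTerm } },
    w.doc,
  ]);
  const resp = await es.bulk({ refresh: 'wait_for', operations });
  // One response item per action, in the same order as `writes`.
  return writes.filter((_, i) => resp.items[i].index?.status === 409);
}
```

The caller would then re-read the conflicted docs to refresh `_seq_no`/`_primary_term` and retry a bounded number of times, which is roughly the shape the eventual fix in #166283 takes.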
ymao1 added the Feature:Alerting, Team:ResponseOps, and Feature:Alerting/Alerts-as-Data labels on May 24, 2023
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)


pmuellr commented Jul 14, 2023

I actually just ran into this issue as part of working on converting the alerts indices to data streams (PR #160572). In that PR, updates to existing documents use OCC, because they have to when you do updates in data streams.

The scenario for me was a metric threshold rule running, with an active alert, every 3 seconds. I then added that alert to a case. And hit a 409 conflict on the next rule run. Somehow, the case update raced in during the rule run, so the OCC info in the rule run was stale.

I'd have to look a bit more to see if we could delay reading the alert doc OCC info to later, to hopefully avoid a race condition, or similar re-arrangement scenarios.

But it feels to me that we need to come up with a safer way to do this. I think Ying's suggestions above will work:

  • every update to an alert doc is a partial update
  • every update is done with OCC, and retried a fixed number of times (like rule docs)

We still run the risk of race conditions, but only for fields updated by multiple independent update calls. If every update caller is updating fields no other caller is updating, we shouldn't see any overwritten data. Right?
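
As a rough illustration of those two points combined for a single alert doc (a sketch only; `MAX_RETRIES` and the parameter names are assumptions, and real callers would batch this rather than work one doc at a time):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });
const MAX_RETRIES = 3; // assumed bound, in the spirit of the retries used for rule docs

// Partially update only the fields this caller owns, with OCC and a bounded retry on 409s.
async function partialUpdateAlert(index: string, id: string, fields: Record<string, unknown>) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    // Re-read on each attempt so the OCC info is as fresh as possible.
    const current = await es.get({ index, id });
    try {
      await es.update({
        index,
        id,
        doc: fields, // merged into the existing doc; other callers' fields are untouched
        if_seq_no: current._seq_no,
        if_primary_term: current._primary_term,
      });
      return;
    } catch (err: any) {
      // 409 means someone else wrote in between; loop to re-read and try again.
      if (err?.statusCode !== 409 || attempt === MAX_RETRIES) throw err;
    }
  }
}
```

Because each update is partial, the worst case on a lost race is two callers fighting over the same field; a field owned by a single caller can't be clobbered by anyone else.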


pmuellr commented Aug 29, 2023

Now that we're finishing up with #160572, I want to start thinking about this issue a bit more. A couple of things:

  • I think the chance of conflicts is going to be worse now, as I believe the only code doing overwriting of alerts is the alerting rule runner. It fetches the alerts at the beginning of a run, along with seq_no and primary_term, and then uses those to update existing alert docs when the rule completes. This means the window of opportunity for a conflict is the length of a rule run.

  • Any other update to an alert doc during a rule run - closing, adding a case, etc. - will cause the rule run's update to the alert doc to fail. Previously, this race condition would instead cause the ad-hoc update (closing, adding a case, etc.) to be lost.

  • I think the only way around this is for all writes to alert documents, beyond the initial create, to use partial updates. It appears that simple updates like closing an alert already use partial updates, so it's the rule runner's updates to alert docs that will need to be converted to partial updates. The danger is that these updates will no longer "delete" fields that aren't passed in on the update, instead leaving the previous value (see the sketch after this list).
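
To make that last caveat concrete, a hedged example (the index and field names are illustrative, not the real alert mappings): a full `index` call replaces the stored document, so omitted fields disappear, while a partial `update` merges into it, so a field the rule runner stops sending silently keeps its old value unless it is explicitly cleared:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Full index: the new body replaces the whole document; any field not listed here is gone.
await es.index({
  index: '.alerts-example', // hypothetical index name
  id: 'alert-1',
  document: { 'kibana.alert.status': 'active', 'kibana.alert.duration.us': 1000 },
});

// Partial update: only the listed fields change; everything else keeps its previous value.
// To actually clear a field you must send it explicitly (e.g. set it to null).
await es.update({
  index: '.alerts-example',
  id: 'alert-1',
  doc: { 'kibana.alert.status': 'active', 'kibana.alert.flapping_history': null },
});
```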

pmuellr added a commit that referenced this issue Aug 30, 2023
resolves #154266

Changes the way the alerts-as-data (AAD) indices are created and written
to, to allow them to be built as they have been in the past (alias and
backing indices created manually) OR as an ES Data Stream.

Serverless will use Data Streams, other environments will use the
existing alias and backing indices. The determination is made by
optionally including the `serverless` plugin, and determining if it's
available.

The implementation is organized around a `DataStreamAdapter` object,
which is instantiated with a "data stream" or "alias" flavor, and then
it handles the specialized behavior. Currently, a lot of the smaller
implementation bits, like setting property values in ES calls, are done
via in-line boolean checks of that object as to whether data streams or
aliases are being used. This could probably be cleaned up some.

Existing non-serverless functional tests are largely unchanged, as they
can't test the new data stream path. Some tests have been added to the
serverless functional tests, to cover basic reading / writing of updated
alert documents.

## DEFER

- more serverless AaD tests

- #158403 - this issue is more
noticeable now that we HAVE to do OCC with data streams, so we get
errors instead of simply overwriting documents (which is also bad)

Co-authored-by: Patryk Kopycinski <[email protected]>
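
For readers skimming the thread, a rough, hypothetical sketch of the kind of flavor-switching adapter the commit describes (the real `DataStreamAdapter` interface in #160572 is richer and differs in its details):

```ts
// Hypothetical shape only; not the actual interface from the PR.
interface DataStreamAdapter {
  isUsingDataStreams(): boolean;
  // Flavor-specific properties to spread into bulk action metadata:
  // data streams require create-only appends, aliases can require the write alias.
  getBulkMetaProps(): { op_type?: 'create'; require_alias?: boolean };
}

function createDataStreamAdapter(useDataStreams: boolean): DataStreamAdapter {
  return {
    isUsingDataStreams: () => useDataStreams,
    getBulkMetaProps: () => (useDataStreams ? { op_type: 'create' } : { require_alias: true }),
  };
}
```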
eokoneyo pushed a commit to eokoneyo/kibana that referenced this issue Aug 31, 2023
(#160572; same commit message as above)

pmuellr commented Aug 31, 2023

I've got an experiment open in PR #165275 where I changed all the alert doc updates from OCC-based index operations to partial updates, to see what that breaks.

@pmuellr pmuellr self-assigned this Sep 6, 2023
@mikecote mikecote moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Sep 8, 2023
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 22, 2023
…xecution

resolves: elastic#158403

When conflicts are detected while updating alert docs after a rule
runs, we'll try to resolve the conflict by `mget()`'ing the alert
documents again, to get the updated OCC info `_seq_no` and
`_primary_term`.  We'll also get the current versions of "ad-hoc"
updated fields (which caused the conflict), like workflow status,
case assignments, etc.  And then attempt to update the alert doc again,
with that info, which should get it back up-to-date.

- hard-code the fields to refresh
- add skeletal version of a function test
- add some debugging for CI-only/not-local test failure
- change new rule type to wait for signal to finish
- a little clean up, no clue why this passes locally and fails in CI though
- dump rule type registry to see if rule type there
- remove diagnostic code from FT
- a real test that passes locally, for alerting framework
- add test for RR, but it's failing as it doesn't resolve conflicts yet
- fix some stuff, add support for upcoming untracked alert status
- change recovery algorithm to retry subsequent times correctly
- remove RR support (not needed), so simplify other things
- remove more RR bits (TransportStuff) and add jest tests
- add multi-alert bulk update function test
- rebase main
@mikecote mikecote moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Sep 25, 2023
ymao1 pushed a commit that referenced this issue Sep 28, 2023
…xecution (#166283)

resolves: #158403

When conflicts are detected while updating alert docs after a rule runs,
we'll try to resolve the conflict by `mget()`'ing the alert documents
again, to get the updated OCC info `_seq_no` and `_primary_term`. We'll
also get the current versions of "ad-hoc" updated fields (which caused
the conflict), like workflow status, case assignments, etc. And then
attempt to update the alert doc again, with that info, which should get
it back up-to-date.
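
A hedged sketch of that recovery path (the helper and field names are assumptions; the real code lives in the alerting framework's alerts client and differs):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Fields users can change "ad-hoc" between rule runs; their latest values win on retry.
const ADHOC_FIELDS = ['kibana.alert.workflow_status', 'kibana.alert.case_ids'];

async function retryConflictedAlerts(
  index: string,
  conflicted: Array<{ id: string; doc: Record<string, any> }>
) {
  // Re-read the conflicted docs to refresh _seq_no/_primary_term and the ad-hoc fields.
  const { docs } = await es.mget({
    docs: conflicted.map((c) => ({ _index: index, _id: c.id })),
  });

  const operations = conflicted.flatMap((c, i) => {
    const fresh = docs[i] as any;
    if (!fresh.found) return [];
    // Carry the user's ad-hoc updates forward so the retry doesn't clobber them.
    for (const field of ADHOC_FIELDS) {
      if (fresh._source[field] !== undefined) c.doc[field] = fresh._source[field];
    }
    return [
      {
        index: {
          _index: fresh._index, // concrete backing index returned by mget
          _id: c.id,
          if_seq_no: fresh._seq_no,
          if_primary_term: fresh._primary_term,
        },
      },
      c.doc,
    ];
  });

  // Attempt the bulk update again with the refreshed OCC info and merged fields.
  return es.bulk({ refresh: 'wait_for', operations });
}
```

The log lines in the manual-testing section below roughly correspond to the detect / retry / succeed steps of this loop.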

Note that the rule registry was not touched here. During this PR's
development, I added the retry support to it, but then my functional tests
were failing because there were never any conflicts happening. It turns out
the rule registry mget's the alerts right before it updates them, to get the
latest values, so it won't need this fix.

It's also not clear to me if this can be exercised in serverless, since
it requires the use of an alerting framework based AaD implementation
AND the ability to ad-hoc update alerts. I think this can only be done
with Elasticsearch Query and Index Threshold, and only when used in
metrics scope, so it will show up in the metrics UX, which is where you
can add the alerts to the case.

## manual testing

It's hard! I've seen the conflict messages before, but it's quite
difficult to make them go off on demand. The basic idea is to pick a rule
that uses the alerting framework AAD path (not the rule registry, which
isn't affected the same way by conflicts, since it mget's alerts right
before updating them), set it to run on a `1s` interval, and probably also
configure Task Manager to poll on a `1s` interval, via the following configs:

```
xpack.alerting.rules.minimumScheduleInterval.value: "1s"
xpack.task_manager.poll_interval: 1000
```

You want to get the rule to execute often and generate a lot of alerts,
and run for as long as possible. Then while it's running, add the
generated alerts to cases. Here's the EQ rule definition I used:


![image](https://github.com/elastic/kibana/assets/25117/56c69d50-a76c-48d4-9a45-665a0008b248)

I selected the alerts from the o11y alerts page, since you can't add
alerts to cases from the stack page. Hmm. :-). Sort the alert list by
low-high duration, so the newest alerts will be at the top. Refresh,
select all the alerts (set the page size to 100), then add them to a case from the
`...` menu. If you force a conflict, you should see something like this
in the Kibana logs:

```
[ERROR] [plugins.alerting] Error writing alerts: 168 successful, 100 conflicts, 0 errors:
[INFO ] [plugins.alerting] Retrying bulk update of 100 conflicted alerts
[INFO ] [plugins.alerting] Retried bulk update of 100 conflicted alerts succeeded
```

---------

Co-authored-by: kibanamachine <[email protected]>