[8.6] [Fleet] refactored bulk update tags retry (#147594) #147839

Merged 1 commit on Dec 20, 2022

kibanamachine
Contributor

Backport

This will backport the following commits from main to 8.6:

Questions ?

Please refer to the Backport tool documentation

## Summary

Fixes elastic#144161

As discussed
[here](elastic#144161 (comment)),
the existing implementation of update tags doesn't work well with real
agents: there are many version conflicts with agent check-ins, even when
adding or removing a single tag.
Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the
conflict strategy to 'proceed'. If an action on 50k agents hits 1k
conflicts, only those 1k agents are retried rather than all 50k, which
makes a conflict on retry less likely.
- Because of this, a retry has to know which agents don't yet have the
tag added/removed. For this, added an additional filter to the
`updateByQuery` request. The filter is only added if there is exactly one
`tagsToAdd` or one `tagsToRemove`. This is the main use case from the
UI, and handling other cases would complicate the logic further (each
additional tag to add/remove would result in another OR query, which
would match more agents, making conflicts more likely).
- Added this additional filter on the initial request as well (not only
on retries) to avoid unnecessary work, e.g. if the user tries to add a tag
to 50k agents but 48k already have it, it is enough to update the
remaining 2k agents.
- A side effect of this improvement is that 'Agent activity' shows the
number of agents actually updated, not the total selected. I think this is
not really a problem for update tags.
- Cleaned up some of the UI logic, because conflicts are now fully
handled on the backend.
- Locally I couldn't reproduce the conflict with agent check-ins, even
with 1k horde agents. I'll try to test in cloud with more real agents.
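
To illustrate the first three points, here is a minimal sketch of how such a request could be assembled. It is not the Fleet plugin's actual code: the names `TagUpdate`, `buildTagFilter` and `buildUpdateByQueryBody`, the `active` field in the usage example, and the use of raw query DSL clauses are assumptions made for illustration. It also assumes the filter applies only when a single tag is added with none removed, or the reverse. The update script itself is left out.

```typescript
// Hypothetical types and helpers for illustration; not the Fleet plugin's code.
interface TagUpdate {
  tagsToAdd: string[];
  tagsToRemove: string[];
}

type QueryClause = Record<string, unknown>;

// Returns an extra filter only in the single-tag case, so that both the
// initial request and any retry match just the agents that still need the change.
function buildTagFilter({ tagsToAdd, tagsToRemove }: TagUpdate): QueryClause | undefined {
  if (tagsToAdd.length === 1 && tagsToRemove.length === 0) {
    // Adding one tag: skip agents that already have it.
    return { bool: { must_not: [{ term: { tags: tagsToAdd[0] } }] } };
  }
  if (tagsToRemove.length === 1 && tagsToAdd.length === 0) {
    // Removing one tag: only match agents that still have it.
    return { term: { tags: tagsToRemove[0] } };
  }
  // Other combinations: no extra filter (assumed, to keep the query simple).
  return undefined;
}

// Combines the caller's agent selection with the optional tag filter.
// `conflicts: 'proceed'` asks update-by-query to continue past version
// conflicts and report them instead of aborting.
function buildUpdateByQueryBody(baseQuery: QueryClause, update: TagUpdate) {
  const tagFilter = buildTagFilter(update);
  return {
    conflicts: 'proceed' as const,
    query: { bool: { filter: tagFilter ? [baseQuery, tagFilter] : [baseQuery] } },
  };
}

// Usage: add the tag `prod` to all agents matched by a base selection.
const addOneTag = buildUpdateByQueryBody(
  { term: { active: true } },
  { tagsToAdd: ['prod'], tagsToRemove: [] }
);
console.log(JSON.stringify(addOneTag, null, 2));
```

With this shape, a retry can reuse the same request unchanged: agents updated on an earlier attempt no longer match the tag filter, so only the conflicting ones are picked up again.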

To verify:
- Enroll 50k agents (I used 50k with the create_agents script and 1k with
horde). Enroll 50k with horde if possible.
- Select all in the UI and try to add/remove one or more tags.
- Expect the changes to propagate quickly (within 1 minute). It might take a
few refreshes to see the result in the agent list and tags list, because the
UI polls the agents every 30s. The tags list is expected to temporarily
show incorrect data because the action is async.

E.g. removed the `test3` tag and added the `add` tag in quick succession:
<img width="1776" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image"
src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show how many `version_conflicts` occurred on each attempt, and
that the count decreased with each retry.

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```
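
The shrinking conflict counts for action `90acd541` (52000 → 10857 → 1000 → 0) can be replayed with a small sketch. This is a simplified in-process loop, not the task-based retry that Fleet schedules; `updateWithRetries`, `replayRunner` and `RetrySummary` are hypothetical names, and the recorded responses are copied from the log lines above.

```typescript
// Subset of the update-by-query response fields that appear in the logs above.
interface UpdateByQueryResult {
  total: number;
  updated: number;
  version_conflicts: number;
}

interface RetrySummary {
  attempts: number;
  updated: number;
  remainingConflicts: number;
}

// Hypothetical helper: runs the update once, then re-runs it while conflicts
// remain, for at most `maxRetries` additional attempts.
async function updateWithRetries(
  runUpdate: () => Promise<UpdateByQueryResult>,
  maxRetries: number
): Promise<RetrySummary> {
  let attempts = 0;
  let updated = 0;
  let remainingConflicts = 0;
  do {
    const res = await runUpdate();
    attempts += 1;
    updated += res.updated;
    remainingConflicts = res.version_conflicts;
  } while (remainingConflicts > 0 && attempts <= maxRetries);
  return { attempts, updated, remainingConflicts };
}

// Stand-in for the real request: returns pre-recorded responses in order.
function replayRunner(responses: UpdateByQueryResult[]): () => Promise<UpdateByQueryResult> {
  let call = 0;
  return async () => {
    const next = responses[call];
    call += 1;
    if (next === undefined) throw new Error('no more recorded responses');
    return next;
  };
}

// Responses logged for action 90acd541-19ac-4738-b3d3-db32789233de.
const recorded: UpdateByQueryResult[] = [
  { total: 52000, updated: 41143, version_conflicts: 10857 },
  { total: 10857, updated: 9857, version_conflicts: 1000 },
  { total: 1000, updated: 1000, version_conflicts: 0 },
];

updateWithRetries(replayRunner(recorded), 3).then((summary) => {
  console.log(summary);
});
```

With these recorded responses the loop finishes after three attempts with all 52000 agents updated. Each retry's `total` equals the previous attempt's `version_conflicts`, which is the effect of combining the 'proceed' strategy with the extra tag filter.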

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 687987a)
@kibanamachine enabled auto-merge (squash) December 20, 2022 09:40
@botelastic (bot) added the Team:Fleet label (Team label for Observability Data Collection Fleet team) Dec 20, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@kibanamachine merged commit 335b86a into elastic:8.6 Dec 20, 2022
@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| fleet | 875.9KB | 875.7KB | -218.0B |

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 108 | 113 | +5 |
| securitySolution | 441 | 447 | +6 |
| total | | | +19 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 68 | 74 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 518 | 524 | +6 |
| total | | | +20 |

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

Labels: backport, Team:Fleet