[Fleet] refactored bulk update tags retry #147594
Pinging @elastic/fleet (Team:Fleet)
x-pack/plugins/fleet/server/services/agents/update_agent_tags_action_runner.ts
Tested on ECE with 20k horde agents. The new logic works fine, and the conflicted agents are updated in a few retries. EDIT: found the reason: the logic of generating ids for action results was not giving unique ids for retries (always assigned 0,1,2...). Changed to generate
Thanks @juliaElastic, glad to hear that the retry mechanism solves almost all our problems at scale.
I noticed some discrepancy in Agent activity around 40k agents.
@elasticmachine merge upstream
      { pitId: '' }
    ).runActionAsyncWithRetry();
  }
  return await new UpdateAgentTagsActionRunner(
Simplified the logic to use retry for all update tags kuery actions.
As reported here, the version conflict happened even with fewer than 10k agents, a case that was not retried before.
Could reproduce in an ECE instance by adding a tag on 5k horde agents and getting this response from the bulk API:
{"statusCode":500,"error":"Internal Server Error","message":"version conflict of 1865 agents"}
Looks good to me!
Tested with "fake" agents locally and works, a few nits but nothing blocking 👍
x-pack/plugins/fleet/server/services/agents/update_agent_tags_action_runner.ts
: Math.min(
    docCount,
    // only using cardinality count when count lower than precision threshold
    docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount,
Aside: why is `cardinalityCount` used? Can't we always use `docCount` here?
Cardinality was introduced for actions that can potentially be acked multiple times by the same agent, e.g. upgrade, so that acks from one agent are counted only once.
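The counting rule discussed in this thread can be sketched as follows. This is only an illustration of the `Math.min` expression quoted above: the function name `countAckedAgents` is hypothetical, and the value of `PRECISION_THRESHOLD` is an assumption chosen for the example, not the constant from the PR.

```typescript
// Assumed value for illustration only; the actual constant is not shown in this thread.
const PRECISION_THRESHOLD = 40000;

// docCount: number of ack documents found for the action.
// cardinalityCount: approximate number of distinct agents that acked.
function countAckedAgents(docCount: number, cardinalityCount: number): number {
  return Math.min(
    docCount,
    // Only trust the cardinality count below the precision threshold,
    // where the approximation is expected to be close to exact.
    docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount
  );
}

// 10 agents, 5 of which acked twice: 15 ack documents, 10 distinct agents.
console.log(countAckedAgents(15, 10)); // 10

// Above the threshold the doc count is used as-is.
console.log(countAckedAgents(50000, 49700)); // 50000
```

The trade-off in this sketch is that above the threshold duplicate acks are no longer filtered out, in exchange for not under-reporting because of the approximate cardinality.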
💚 Build Succeeded
## Summary

Fixes https://github.com/elastic/kibana/issues/144161

As discussed [here](https://github.com/elastic/kibana/issues/144161#issuecomment-1348668610), the existing implementation of update tags doesn't work well with real agents, as there are many conflicts with checkin, even when trying to add/remove one tag.

Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the conflict strategy to 'proceed'. This means, if an action of 50k agents has 1k conflicts, not all 50k is retried, but only the 1k conflicts, this makes it less likely to conflict on retry.
- Because of this, on retry we have to know which agents don't yet have the tag added/removed. For this, added an additional filter to the `updateByQuery` request. Only adding the filter if there is exactly one `tagsToAdd` or one `tagsToRemove`. This is the main use case from the UI, and handling other cases would complicate the logic more (each additional tag to add/remove would result in another OR query, which would match more agents, making conflicts more likely).
- Added this additional query on the initial request as well (not only retries) to save on unnecessary work e.g. if the user tries to add a tag on 50k agents, but 48k already have it, it is enough to update the remaining 2k agents.
- This improvement has the effect that 'Agent activity' shows the real updated agent count, not the total selected. I think this is not really a problem for update tags.
- Cleaned up some of the UI logic, because the conflicts are fully handled now on the backend.
- Locally I couldn't reproduce the conflict with agent checkins, even with 1k horde agents. I'll try to test in cloud with more real agents.

To verify:
- Enroll 50k agents (I used 50k with create_agents script, and 1k with horde). Enroll 50k with horde if possible.
- Select all on UI and try to add/remove one or more tags
- Expect the changes to propagate quickly (up to 1m). It might take a few refreshes to see the result on agent list and tags list, because the UI polls the agents every 30s. It is expected that the tags list temporarily shows incorrect data because the action is async.

E.g. removed `test3` tag and added `add` tag quickly:
<img width="1776" alt="image" src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image" src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show the details of how many `version_conflicts` were there, and it decreased with retries.

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>

(cherry picked from commit 687987a)
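The request shaping described in the summary above (conflict strategy 'proceed' plus a filter that skips agents already in the desired state) might look roughly like the sketch below. It is not the PR's implementation: `buildUpdateTagsRequest` is a hypothetical helper, the `tags` field name and the `bool` query layout are assumptions, and the sketch reads "exactly one `tagsToAdd` or one `tagsToRemove`" as "one tag on one side and none on the other". Only `conflicts: 'proceed'` and `query` are taken from the Elasticsearch update-by-query request format.

```typescript
type Query = Record<string, unknown>;

interface UpdateTagsRequest {
  conflicts: 'proceed'; // keep going past version conflicts instead of aborting
  query: Query;
}

// Hypothetical helper: narrows the selection to agents that still need the change.
function buildUpdateTagsRequest(
  baseQuery: Query,
  tagsToAdd: string[],
  tagsToRemove: string[]
): UpdateTagsRequest {
  const filter: Query[] = [baseQuery];
  const mustNot: Query[] = [];

  if (tagsToAdd.length === 1 && tagsToRemove.length === 0) {
    // Adding one tag: skip agents that already have it.
    mustNot.push({ term: { tags: tagsToAdd[0] } });
  } else if (tagsToRemove.length === 1 && tagsToAdd.length === 0) {
    // Removing one tag: only touch agents that currently have it.
    filter.push({ term: { tags: tagsToRemove[0] } });
  }
  // Any other combination: no extra filter, as the summary describes.

  return { conflicts: 'proceed', query: { bool: { filter, must_not: mustNot } } };
}

const request = buildUpdateTagsRequest({ match_all: {} }, ['prod'], []);
console.log(JSON.stringify(request));
```

Because the same narrowed query is used on the first attempt and on every retry, a retry in this sketch naturally selects only the agents whose update was skipped by a conflict.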
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI.
Questions? Please refer to the Backport tool documentation.
# Backport This will backport the following commits from `main` to `8.6`: - [[Fleet] refactored bulk update tags retry (#147594)](#147594) <!--- Backport version: 8.9.7 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Julia Bardi","email":"[email protected]"},"sourceCommit":{"committedDate":"2022-12-20T09:36:36Z","message":"[Fleet] refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes https://github.com/elastic/kibana/issues/144161\r\n\r\nAs discussed\r\n[here](https://github.com/elastic/kibana/issues/144161#issuecomment-1348668610),\r\nthe existing implementation of update tags doesn't work well with real\r\nagents, as there are many conflicts with checkin, even when trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries more efficient:\r\n- Instead of aborting the whole bulk action on conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but only the 1k conflicts,\r\nthis makes it less likely to conflict on retry.\r\n- Because of this, on retry we have to know which agents don't yet have\r\nthe tag added/removed. For this, added an additional filter to the\r\n`updateByQuery` request. Only adding the filter if there is exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use case from the\r\nUI, and handling other cases would complicate the logic more (each\r\nadditional tag to add/remove would result in another OR query, which\r\nwould match more agents, making conflicts more likely).\r\n- Added this additional query on the initial request as well (not only\r\nretries) to save on unnecessary work e.g. 
if the user tries to add a tag\r\non 50k agents, but 48k already have it, it is enough to update the\r\nremaining 2k agents.\r\n- This improvement has the effect that 'Agent activity' shows the real\r\nupdated agent count, not the total selected. I think this is not really\r\na problem for update tags.\r\n- Cleaned up some of the UI logic, because the conflicts are fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll 50k agents (I used 50k with create_agents script, and 1k with\r\nhorde). Enroll 50k with horde if possible.\r\n- Select all on UI and try to add/remove one or more tags\r\n- Expect the changes to propagate quickly (up to 1m). It might take a\r\nfew refreshes to see the result on agent list and tags list, because the\r\nUI polls the agents every 30s. It is expected that the tags list\r\ntemporarily shows incorrect data because the action is async.\r\n\r\nE.g. 
removed `test3` tag and added `add` tag quickly:\r\n<img width=\"1776\" alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img width=\"422\" alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe logs show the details of how many `version_conflicts` were there,\r\nand it decreased with retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task 
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed 
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] 
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere updated or added to match the most common scenarios\r\n\r\nCo-authored-by: Kibana Machine <[email protected]>","sha":"687987aa9ce56ce359f722485330179a4807d79a","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147594,"url":"https://github.com/elastic/kibana/pull/147594","mergeCommit":{"message":"[Fleet] refactored bulk update tags retry (#147594)\n\n## Summary\r\n\r\nFixes https://github.com/elastic/kibana/issues/144161\r\n\r\nAs discussed\r\n[here](https://github.com/elastic/kibana/issues/144161#issuecomment-1348668610),\r\nthe existing implementation of update tags doesn't work well with real\r\nagents, as there are many conflicts with checkin, even when trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries more efficient:\r\n- Instead of aborting the whole bulk action on conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but only the 1k conflicts,\r\nthis makes it less likely to conflict on retry.\r\n- Because of this, on retry we have to know which agents don't yet have\r\nthe tag added/removed. For this, added an additional filter to the\r\n`updateByQuery` request. 
Only adding the filter if there is exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use case from the\r\nUI, and handling other cases would complicate the logic more (each\r\nadditional tag to add/remove would result in another OR query, which\r\nwould match more agents, making conflicts more likely).\r\n- Added this additional query on the initial request as well (not only\r\nretries) to save on unnecessary work e.g. if the user tries to add a tag\r\non 50k agents, but 48k already have it, it is enough to update the\r\nremaining 2k agents.\r\n- This improvement has the effect that 'Agent activity' shows the real\r\nupdated agent count, not the total selected. I think this is not really\r\na problem for update tags.\r\n- Cleaned up some of the UI logic, because the conflicts are fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll 50k agents (I used 50k with create_agents script, and 1k with\r\nhorde). Enroll 50k with horde if possible.\r\n- Select all on UI and try to add/remove one or more tags\r\n- Expect the changes to propagate quickly (up to 1m). It might take a\r\nfew refreshes to see the result on agent list and tags list, because the\r\nUI polls the agents every 30s. It is expected that the tags list\r\ntemporarily shows incorrect data because the action is async.\r\n\r\nE.g. 
removed `test3` tag and added `add` tag quickly:\r\n<img width=\"1776\" alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img width=\"422\" alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe logs show the details of how many `version_conflicts` were there,\r\nand it decreased with retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task 
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed 
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] 
Co-authored-by: Julia Bardi <[email protected]>
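Summary
Fixes #144161
As discussed in https://github.com/elastic/kibana/issues/144161#issuecomment-1348668610, the existing implementation of update tags doesn't work well with real agents: there are many conflicts with agent check-ins, even when adding or removing a single tag.
Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the conflict strategy to 'proceed'. If an action on 50k agents hits 1k conflicts, only those 1k agents are retried rather than all 50k, which makes a conflict on retry less likely.
- Because of this, a retry has to know which agents don't yet have the tag added/removed. For this, added an additional filter to the `updateByQuery` request. The filter is only added if there is exactly one `tagsToAdd` or one `tagsToRemove`. This is the main use case from the UI, and handling other cases would complicate the logic (each additional tag to add/remove would result in another OR query, which would match more agents, making conflicts more likely).
- Added this additional query on the initial request as well (not only on retries) to save unnecessary work, e.g. if the user tries to add a tag on 50k agents but 48k already have it, it is enough to update the remaining 2k agents.
- This improvement has the effect that 'Agent activity' shows the real updated agent count, not the total selected. I think this is not really a problem for update tags.
- Cleaned up some of the UI logic, because the conflicts are now fully handled on the backend.
- Locally I couldn't reproduce the conflict with agent check-ins, even with 1k horde agents. I'll try to test in cloud with more real agents.
To verify:
- Enroll 50k agents (I used 50k with the create_agents script, and 1k with horde). Enroll 50k with horde if possible.
- Select all in the UI and try to add/remove one or more tags.
- Expect the changes to propagate quickly (up to 1m). It might take a few refreshes to see the result in the agent list and tags list, because the UI polls the agents every 30s. It is expected that the tags list temporarily shows incorrect data because the action is async.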
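The extra filter described above might look roughly like the sketch below. It is an illustration only: `buildUpdateTagsRequest`, the `EsQuery` type and the use of Elasticsearch query DSL `term` clauses are assumptions, and the actual Fleet code may build the filter differently. Only the `conflicts: 'proceed'` setting and the "exactly one tag" rule come from the description. The sketch reads that rule as "one tag to add and none to remove, or the reverse", which is also an assumption.

```typescript
// Hypothetical sketch, not the Fleet implementation.
type EsQuery = Record<string, unknown>;

interface UpdateTagsRequest {
  // 'proceed' lets the update-by-query continue past version conflicts
  // instead of aborting the whole bulk action.
  conflicts: 'proceed';
  query: { bool: { filter: EsQuery[]; must_not?: EsQuery[] } };
}

function buildUpdateTagsRequest(
  baseFilter: EsQuery,
  tagsToAdd: string[],
  tagsToRemove: string[]
): UpdateTagsRequest {
  const filter: EsQuery[] = [baseFilter];
  const mustNot: EsQuery[] = [];

  if (tagsToAdd.length === 1 && tagsToRemove.length === 0) {
    // Adding one tag: skip agents that already have it.
    mustNot.push({ term: { tags: tagsToAdd[0] } });
  } else if (tagsToRemove.length === 1 && tagsToAdd.length === 0) {
    // Removing one tag: only touch agents that still have it.
    filter.push({ term: { tags: tagsToRemove[0] } });
  }
  // Any other combination gets no extra filter, as in the description.

  const bool: { filter: EsQuery[]; must_not?: EsQuery[] } = { filter };
  if (mustNot.length > 0) {
    bool.must_not = mustNot;
  }
  return { conflicts: 'proceed', query: { bool } };
}
```

Because the same filter is applied on the first request and on every retry, agents that were already updated simply stop matching, so each retry only sees the agents that conflicted earlier.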
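Example: removed the `test3` tag and added the `add` tag in quick succession on 52000 agents.
The logs show how many `version_conflicts` each attempt hit, and the count decreased with retries:
- Action 90acd541-19ac-4738-b3d3-db32789233de: 10857 conflicts out of 52000 agents on the first attempt, 1000 out of 10857 on retry #1, and 0 out of 1000 on retry #2.
- Action 29e9da70-7194-4e52-8004-2c1b19f6dfd5: 26245 conflicts out of 52000 agents on the first attempt, and 0 out of 26245 on retry #1.
Checklist
- [x] Unit or functional tests were updated or added to match the most common scenarios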
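The shrinking retries reported for this PR can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Fleet code: `updateWithRetry`, `makeFakeUpdate` and `UpdateResult` are hypothetical names, and the loop is kept synchronous for brevity, whereas the real update calls are asynchronous and the retries are scheduled as tasks. The only details taken from the PR are that each attempt sees just the agents that conflicted before, and the `version conflict of N agents` error message.

```typescript
// Hypothetical simulation of "retry only the conflicted agents".
interface UpdateResult {
  total: number; // agents matched by this attempt
  updated: number; // agents successfully updated
  versionConflicts: number; // agents skipped because of a version conflict
}

// Stand-in for one update-by-query call; `attempt` is 0 for the initial run.
type RunUpdate = (attempt: number) => UpdateResult;

function updateWithRetry(runUpdate: RunUpdate, maxRetries: number): UpdateResult[] {
  const attempts: UpdateResult[] = [];
  let lastConflicts = 0;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = runUpdate(attempt);
    attempts.push(result);
    lastConflicts = result.versionConflicts;
    if (lastConflicts === 0) {
      return attempts; // nothing left to retry
    }
  }
  throw new Error(`version conflict of ${lastConflicts} agents`);
}

// Fake backend: every attempt matches only the agents still pending,
// mimicking the extra tag filter on the request.
function makeFakeUpdate(totalAgents: number, conflictsPerAttempt: number[]): RunUpdate {
  let pending = totalAgents;
  return (attempt: number): UpdateResult => {
    const total = pending;
    const planned = attempt < conflictsPerAttempt.length ? conflictsPerAttempt[attempt] : 0;
    const versionConflicts = Math.min(planned, total);
    pending = versionConflicts;
    return { total, updated: total - versionConflicts, versionConflicts };
  };
}
```

Calling `updateWithRetry(makeFakeUpdate(52000, [10857, 1000, 0]), 3)` replays the conflict counts logged for one of the actions: the three attempts match 52000, 10857 and 1000 agents respectively.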