Make Fleet Kibana bulk action execution async #141567
cc @juliaElastic, I would like to get your input on this since you worked on the recent Kibana changes.
@joshdover Here are my thoughts on this:
It may not be strictly necessary, but without creating the PIT in Kibana, there will be some edge cases where a different number of agents is selected for the action than the user saw in the UI. I'd be OK with doing this all in FS at first and seeing how much that edge case actually confuses users.
Good question, it's probably simpler to move everything to FS. The query that Kibana includes in the action document could filter on a list of IDs?
Yep, agreed.
Discussed this with @juliaElastic today. Julia shared that she saw the current behavior take about 10s per 10k agents, meaning we're likely to start hitting proxy timeouts in Kibana somewhere around 50k-60k agents (60s). We will need to prioritize this to be able to execute tests against 100k agents.
I think this issue could potentially be worked on by engineers on Control Plane or Fleet UI. @jen-huang and @pierrehilbert should discuss ownership, depending on team capacity.
I think this feature is really similar to our current implementation of batching. I think it could be implemented as a really thin layer over our existing action model, so the system would behave like this.
The drawback is that you could get pretty big documents in memory (36 bytes per UUID * 60,000 = ~2.1 MB), but we could optimize the fetching loop if this becomes a problem. There is already a limit in our system from ES, since they have a 100 MB limit per document. Fleet Server would receive an initial document like this:

```json
{
  "agents": [],
  "agent_query": "Query to execute on Elasticsearch"
}
```

After the query is executed, Fleet Server can close the point-in-time query right away, and the dispatch loop would receive this document:

```json
{
  "agents": ["16be82be-c19c-4e44-b497-ae9ac8ccb053", "b35c6429-cf4b-4a13-98e5-331272a54742", ...]
}
```

Note: I only kept the useful fields in the documents above.

There are a few things we would need to consider now: we probably need to guard what can be queried by Fleet Server. Also, I think we would need to expand our bulk logic to more than just the upgrade action.
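A minimal Go sketch of how Fleet Server might resolve the query form above into the explicit form. The `ActionDoc` shape mirrors the documents above; `AgentSearch` is a hypothetical interface standing in for the real Elasticsearch point-in-time search and close calls:

```go
package action

import (
	"context"
	"fmt"
)

// ActionDoc mirrors the documents above: Kibana writes the query form,
// Fleet Server rewrites it into the resolved form with explicit agent IDs.
type ActionDoc struct {
	Agents     []string `json:"agents"`
	AgentQuery string   `json:"agent_query,omitempty"`
}

// AgentSearch is a hypothetical interface standing in for the real
// Elasticsearch point-in-time search and close calls.
type AgentSearch interface {
	SearchAgentIDs(ctx context.Context, query string) (ids []string, pitID string, err error)
	ClosePIT(ctx context.Context, pitID string) error
}

// resolveAction turns a query-based action into an explicit agent list and
// closes the point-in-time right after the query runs, as described above.
func resolveAction(ctx context.Context, es AgentSearch, doc ActionDoc) (ActionDoc, error) {
	if len(doc.Agents) > 0 || doc.AgentQuery == "" {
		return doc, nil // already resolved, nothing to do
	}
	ids, pitID, err := es.SearchAgentIDs(ctx, doc.AgentQuery)
	if err != nil {
		return doc, fmt.Errorf("resolving agent query: %w", err)
	}
	// The dispatch loop only needs the IDs, so the PIT can be closed now.
	if err := es.ClosePIT(ctx, pitID); err != nil {
		return doc, fmt.Errorf("closing point in time: %w", err)
	}
	return ActionDoc{Agents: ids}, nil
}
```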
I think expanding the bulk logic is something we need to solve and verify outside of this work, because upgrade would work the same. @michel-laterman What do you think about this?
I don't think there are any issues with that plan, @ph.
Are these UUIDs the agent IDs?
Yes, those are agent IDs. I think it would be best to break up the execution into multiple batches, so we don't have a limit on how many agents can be actioned. We did it similarly in the Fleet API, so we could create multiple action documents, each with up to 10k agents.
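A sketch of that batching idea, in Go for consistency with the other sketches (the real Fleet API batching lives in Kibana's TypeScript; `maxAgentsPerAction` is an assumed name for the 10k limit):

```go
package action

const maxAgentsPerAction = 10_000 // assumed constant for the 10k batch size

// splitIntoActionBatches breaks a resolved agent list into chunks of at most
// maxAgentsPerAction IDs, one action document per chunk, so no single
// document grows without bound.
func splitIntoActionBatches(agentIDs []string) [][]string {
	var batches [][]string
	for start := 0; start < len(agentIDs); start += maxAgentsPerAction {
		end := start + maxAgentsPerAction
		if end > len(agentIDs) {
			end = len(agentIDs)
		}
		batches = append(batches, agentIDs[start:end])
	}
	return batches
}
```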
@ph What do you mean by dispatching the action? Does it mean writing the changes to
Is concurrency the main issue here? Could we store state on the action document with PIT, to keep track of whether a Fleet Server has started executing the PIT query? This would prevent multiple Fleet Servers from picking up the same action.
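One way to implement such a guard is optimistic concurrency on the action document: each Fleet Server attempts a conditional update using Elasticsearch's `if_seq_no`/`if_primary_term` request parameters, and only the first attempt succeeds. A minimal Go sketch, where `ActionStore` is a hypothetical wrapper around the real update call:

```go
package action

import "context"

// PendingAction carries the metadata needed for a conditional claim.
type PendingAction struct {
	ID          string
	SeqNo       int64 // _seq_no of the action document
	PrimaryTerm int64 // _primary_term of the action document
}

// ActionStore is a hypothetical wrapper around the Elasticsearch update API.
// ClaimAction is assumed to update the action document guarded by the real
// if_seq_no/if_primary_term request parameters, reporting a version conflict
// as claimed == false rather than an error.
type ActionStore interface {
	ClaimAction(ctx context.Context, a PendingAction) (claimed bool, err error)
}

// dispatchPending executes only the actions this Fleet Server instance
// manages to claim; a version conflict means another instance got there first.
func dispatchPending(ctx context.Context, store ActionStore, pending []PendingAction, run func(context.Context, PendingAction)) {
	for _, a := range pending {
		claimed, err := store.ClaimAction(ctx, a)
		if err != nil || !claimed {
			continue // claimed elsewhere, or a transient error to retry later
		}
		run(ctx, a)
	}
}
```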
Linking test results with 15k agents for reference: #134565 (comment)
Hi! Update:
Update 2:
@aleksmaus We are discussing the approach in the RFC; it is an option to keep the agent ID resolution logic in Kibana and move it out of the API handler.
@scunningham had to optimize and add some configuration knobs for that back in the day. As far as I remember, he had some measurements and Fleet Server configuration recommendations depending on the number of agents the Fleet Server is going to serve. Sean, do you have these numbers anywhere?
@juliaElastic Shall we close this as #138870 is merged?
@jen-huang there is still an improvement I wanted to make, and I am also planning to add more tests. I was also thinking of the use cases around action validation errors, e.g. an agent already assigned to the new policy, a hosted agent that can't be unenrolled, or a host that might not be upgradeable.
WDYT @joshdover @kpollich?
+1 on this. We should have a paper trail of what's happening in the system to be able to show this to the user and for our own debugging purposes. |
Closing as @juliaElastic fixed both issues. |
Currently, agent actions performed from Fleet UI go through Fleet API and are executed synchronously.
This worked well for a small number of agents, but it is not scalable.
In 8.4, actions were optimized by introducing batching on the Fleet API side, so that actions are executed in batches of 10k agents. This unblocked actions on up to 50-60k selected agents, but it still runs into the network timeout limit of 1 or 2 minutes, depending on configuration.
In order to support larger scales, we make action execution asynchronous, so that the execution is decoupled from the Fleet API call.
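As a rough illustration of that decoupling, in Go for consistency with the sketches above (Kibana itself is TypeScript, and all names here are hypothetical): the API handler only persists the action, while a background worker resolves and executes it.

```go
package action

import (
	"context"
	"time"
)

// Queue is a hypothetical persistence layer for action documents: the API
// handler enqueues, and a background worker drains.
type Queue interface {
	Enqueue(ctx context.Context, agentQuery, actionType string) (actionID string, err error)
	NextPending(ctx context.Context) (actionID string, ok bool, err error)
	Execute(ctx context.Context, actionID string) error
}

// HandleBulkAction is the synchronous half: persist the action and return
// immediately, so the HTTP response time no longer grows with agent count.
func HandleBulkAction(ctx context.Context, q Queue, agentQuery, actionType string) (string, error) {
	return q.Enqueue(ctx, agentQuery, actionType)
}

// RunWorker is the asynchronous half: poll for pending actions and execute
// them (resolve the query, fan out batches) outside any HTTP request.
func RunWorker(ctx context.Context, q Queue, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if id, ok, err := q.NextPending(ctx); err == nil && ok {
				_ = q.Execute(ctx, id) // errors would be recorded on the action document
			}
		}
	}
}
```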
Changes:
Previous description:
In #133388 we updated all of Kibana's logic for creating bulk actions to use a batched approach, which creates a new action document for each batch of 10k agents in the full list of search results. This lets us handle bulk actions on larger numbers of agents (>10k), but UI responsiveness still degrades as the number of agents a user is taking action on increases.
Instead, we could move to a model where Kibana creates a single document for the bulk action which includes the query parameters for the matching agents and an Elasticsearch point-in-time finder. Fleet Server could then consume this document and run the query with the PIT finder to identify the same agents that the user selected in the UI for the bulk action.
This would allow the UI to stay snappy at any scale and prevent problems like proxy timeouts from blocking users who take bulk actions on very large numbers (100k+) of agents.
Challenges with this approach:
Related: #138870