Provide a bulk API for creating ingest assets #77505
Pinging @elastic/es-distributed (Team:Distributed)
There's a question about why cluster state updates take ~150ms in these tests, and that falls under the Distributed team's area.
Pinging @elastic/es-data-management (Team:Data Management)
@joshdover out of curiosity, do you have an idea of the number of items Fleet would usually want to include in a single request? 10s? 100s? 1000s?
Also, rather than batching cluster state updates, perhaps it would be better to teach Elasticsearch the concept of a package, with a set of templates, policies, pipelines, and metadata, and then make adding or removing a package an atomic operation from a cluster-state perspective.
That is an interesting idea, and it is tangentially related to #63798. However, that request could be interpreted as a single package that would contribute to something shared among all installed packages. I am not sure that request is actually related, but if we pursue a package-based approach we should probably consider things that a package may want to install that are not unique to that package (for example, multiple packages might re-use ingest pipelines, and we would need a way to keep them from stomping on each other).
I think we'd probably need both, because we will probably want the concept of "global" settings/items, and we will also want to be able to have two packages that initially share a "thing", but where a change made to that thing applies to only a single package (i.e. namespacing of customizations).
For initial setup, we're looking at 100s of objects, but in the future, as more packages are upgraded, we could potentially want to upgrade 1000s of objects at once. That said, we'll likely want to do these in one request per package, so that if any of these operations fail we can isolate the failure to a single ingest integration.
This is the long-term plan and I agree it's something we'll need at some point. My thinking was that maybe it'd be simpler to start with a bulk API that could later be used under the hood by a more complete package API abstraction. That way we can more immediately solve the problems we're seeing now while we work out the many details of what a package API would need to support. I'll defer to you folks on what makes sense here.
Package install and upgrade performance continues to be a challenge for both Kibana reliability and user onboarding. I want to highlight the recent changes in these areas and how this challenge affects the user and operator experience.

User onboarding

One user experience change that is targeted to ship in 8.1 is the removal of installing default packages (elastic/kibana#108456). This will move the package installation step for key required packages (fleet_server, elastic_agent, system) into the onboarding process, when a user sets up their first Agent. By moving this installation step to the onboarding flow, we're adding a very significant ~30s+ delay to a key step in the flow, which must happen before the user is instructed to actually install their first agent. We have concerns that this delay may have a negative impact on the success rate of users getting started with the Stack. This is primarily or entirely bottlenecked by the performance of creating ingest assets in Elasticsearch.

Kibana reliability

As of elastic/kibana#111858, which is shipping in 8.0, Kibana will be installing and upgrading 1st party packages (Endpoint, APM, Synthetics, etc.) on boot. In this initial version, the process does not block Kibana startup and instead runs as an asynchronous upgrade process. This is not ideal, as it may give operators a false sense that the Kibana upgrade has completed and that it's safe to start upgrading Fleet Server, Elastic Agent, or standalone APM Server. If these components are upgraded before packages have finished upgrading, ingest could break, resulting in dropped data, or data could be ingested in a format that is unusable by application UIs or dashboards. One of the reasons we are hesitant to block Kibana startup (elastic/kibana#120616) is the slow installation process, which is primarily bottlenecked by this issue.

It's worth noting that package upgrades are currently implemented as a full removal followed by a fresh installation. This means that we'd need to be able to both delete and create ingest assets quickly in Elasticsearch. If increasing the scope of this bulk create API to also support deletes is a major challenge but supporting updates is not, it's possible we could revisit the upgrade logic in Fleet to minimize the deletes we do and leverage the bulk create/update logic.
@jakelandis You asked for some additional metrics and reproduction steps here. Could you clarify what would be helpful to provide aside from what I provided here: elastic/kibana#110500 (comment)? This can easily be reproduced by:
If this isn't enough to go on, I think this could easily be emulated by trying to do many PUT calls on index templates and ingest pipelines in parallel.
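As a concrete starting point, here is a minimal sketch of that kind of emulation, assuming Node 18+ (for built-in fetch) and an unsecured local cluster; the asset names, counts, and bodies are placeholders rather than real Fleet assets:

```typescript
// Minimal parallel-PUT emulation (sketch only). Asset names and bodies are
// placeholders; a secured cluster would also need auth headers.
const ES_URL = 'http://localhost:9200';

async function putJson(path: string, body: unknown): Promise<void> {
  const res = await fetch(`${ES_URL}${path}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`PUT ${path} failed: ${res.status} ${await res.text()}`);
  }
}

async function main(): Promise<void> {
  const start = Date.now();
  // Fire all PUTs at once so each one queues its own cluster state update,
  // similar to what happens during a package install.
  await Promise.all([
    ...Array.from({ length: 50 }, (_, i) =>
      putJson(`/_component_template/repro-ct-${i}`, {
        template: { settings: { index: { number_of_shards: 1 } } },
      })
    ),
    ...Array.from({ length: 50 }, (_, i) =>
      putJson(`/_ingest/pipeline/repro-pipeline-${i}`, {
        processors: [{ set: { field: 'repro_index', value: `${i}` } }],
      })
    ),
  ]);
  console.log(`Created 100 assets in ${Date.now() - start}ms`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```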
Yes, this would help us isolate the issue. Do you have any example index templates and ingest pipelines we can use to test? How much concurrency do you have? i.e. a dozen concurrent requests for a mix of templates and pipelines, or just 2 concurrent requests with separate lanes for pipelines and templates? What is the cluster setup (a single node hosted locally)? Any information you can provide that allows us to reproduce this without Fleet, but based closely on Fleet's usage, would greatly help us identify the slowdown.
I can think of at least one (hopefully quick) thing that may help this without any additional API overhead: we could change the cluster state updates for these to be batched (currently neither the templates nor the ingest pipelines are batched). That would only really help if multiple things of the same type were being installed in parallel, however. Judging by the issue Josh linked, where they were experimenting with both, I think it could help the parallel case. (Although none of this is backed up by numbers, and we'd want a reliably reproducible way to test it, as Jake mentioned above.)
I'll create an easy repro example later this week. In the meantime, I can discuss how the parallelism works.
I've experimented with creating each asset type in parallel where possible, and this did not improve performance at all; it just resulted in more async requests waiting at once. The bulk of the code for this lives here: https://github.com/elastic/kibana/blob/66b3f01a17dbcbb35fdf47ea439b8dd8666ae249/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts/#L117
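To make the "more async requests waiting at once" observation concrete, here is a self-contained sketch (not Fleet or Elasticsearch code; the single serialized master queue, asset counts, and timings are simplifying assumptions) showing why parallelizing the client side barely changes wall-clock time when every PUT becomes its own serialized cluster state update:

```typescript
// Sketch only: models the master node applying cluster state updates strictly
// one at a time, regardless of how many requests are in flight.
let masterQueue: Promise<void> = Promise.resolve();

function applyClusterStateUpdate(ms: number): Promise<void> {
  // Chain each simulated update onto the previous one, like an unbatched
  // cluster state update queue.
  const next = masterQueue.then(
    () => new Promise<void>((resolve) => setTimeout(resolve, ms))
  );
  masterQueue = next;
  return next;
}

function makeInstaller(label: string, count: number): () => Promise<void> {
  return async () => {
    // Each asset PUT becomes its own cluster state update.
    await Promise.all(
      Array.from({ length: count }, () => applyClusterStateUpdate(20))
    );
    console.log(`${label}: ${count} assets installed`);
  };
}

const installIngestPipelines = makeInstaller('ingest pipelines', 20);
const installComponentTemplates = makeInstaller('component templates', 20);
const installIndexTemplates = makeInstaller('index templates', 20);

async function run(mode: 'sequential' | 'parallel'): Promise<void> {
  const start = Date.now();
  if (mode === 'sequential') {
    await installIngestPipelines();
    await installComponentTemplates();
    await installIndexTemplates();
  } else {
    await Promise.all([
      installIngestPipelines(),
      installComponentTemplates(),
      installIndexTemplates(),
    ]);
  }
  console.log(`${mode}: ${Date.now() - start}ms`);
}

// Both modes take roughly 60 x 20ms, because the simulated master serializes
// every update; parallelizing the callers just means more requests waiting.
run('sequential').then(() => run('parallel'));
```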
I've created a script that emulates the parallelism we use in package installation when installing the system package. This package contains only Saved Objects (not included in my script), ingest pipelines, component templates, and index templates. You'll see that the ingest part of the package installation (everything except the SOs) takes ~6.5s when running raw like this. With @jakelandis's ES build that includes timing logs on these endpoints, I noticed component templates taking >2.5 seconds to be created. To use my script:
Branch with the (hacky) code I used to generate this script while running a package install: https://github.com/joshdover/kibana/tree/fleet/install-repro. Here's a related APM trace from the real package install code in Kibana that shows similar behavior: |
Thanks @joshdover, this was useful! I ran some local tests with your code. Installing all of the assets took ~6s on my laptop and ~4s on my desktop machine. Interestingly, even with a single ES node, you can see the nanosecond timings for PUT-ing a component template keep increasing as more and more are done in parallel:
I did a quick-and-dirty batching implementation for all of the template operations as well as ingest pipelines. That brought the real time for the reproduction script down to ~1.2s on my desktop (so ~4s => ~1.2s), which seems like a pretty good improvement considering that there isn't even any network overhead for the cluster state updates here. The nanosecond timings for the component templates also even out over time, if you compare the timings:
You can see how the timing settles at a steady ~0.3 seconds to install each component template. I've opened a WIP PR with my changes at #86017, and here are some custom 8.3.0-SNAPSHOT builds that include the changes from that PR as well as some timing output from Jake:
Could you try your tests with these builds and see if this is enough to alleviate the problem in the short term? (Feel free to generate your own ES build from my PR; I wasn't sure whether that was something you wanted to do, hence the custom builds. I'd try the Kibana reproduction myself, but I have never been able to figure out how to build Kibana locally.) If this seems promising to you, I can work on getting that PR polished up and merged.
@dakrone Awesome improvements. QQ, I see your second snippet has the note
All of these timings use the reproduction script, which sends requests to Elasticsearch in parallel. The batching I mentioned on the ES side is batching of cluster state update tasks (which occur when the pipelines and templates are created).
Thanks @dakrone. With your changes alone, I see an improvement when installing the system package (the same package from the repro case I used) from 22.9s to 12.0s, a 48% drop in total package installation time. With a few additional improvements on the Fleet side (elastic/kibana#130906) I was able to optimize this further, down to 8.0s on my local machine, for a total improvement of 65%, nearly 3x as fast. I believe there is likely another win on the Fleet side that would shave off another 1-2s. I think we should definitely move forward with your PR. 🎉
Cool, thanks @joshdover, I'll work on getting the PR in. |
This commit changes the cluster state operations for templates (legacy, component, and composable) as well as ingest pipelines to be bulk executed. This means that they can be processed much faster when creating/updating many simultaneously. Relates to #77505
With the batching changes in and the improvements I’ve been able to make on the Kibana side, I think we can close this issue for now. I am still seeing some related slowness around creating ES transforms, but that is a separate problem that can be evaluated independently. Thanks all, @dakrone and @jakelandis |
When Fleet installs Elasticsearch ingest assets (index and component templates, ingest pipelines, ILM policies, etc.) for a package, we're currently bottlenecked by queueing behavior on cluster state updates as observed in this issue: elastic/kibana#110500 (comment)
This is causing some package installs to take upwards of 30s. This is a problem for Fleet, Kibana, and Elastic Agent for two primary reasons:
For both of these use cases, if this process is slow, Kibana upgrades and rollbacks will be too slow and possibly time out depending on the configuration of the orchestration layer.
When executing Fleet's setup process, which installs the system package, we're seeing cluster state updates take ~150ms each on a single-node cluster running on the same machine as Kibana. See the node stats results taken here before and after the setup process: node_stats.zip, es_logs.zip

@DaveCTurner mentioned that one way we could optimize this is by providing a bulk API to batch these cluster state updates into a single write.
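To make the proposal more concrete, here is a purely hypothetical sketch of what such a bulk request could look like; the endpoint path and request shape are invented for illustration (only the generic client transport call is real):

```typescript
// Hypothetical only: Elasticsearch does not expose a bulk ingest-asset API;
// the endpoint path and request body below are invented for illustration.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function bulkInstallPackageAssets(): Promise<unknown> {
  return client.transport.request({
    method: 'PUT',
    path: '/_ingest_assets/_bulk', // invented endpoint
    body: {
      ingest_pipelines: {
        'logs-system.syslog-1.0.0': {
          processors: [{ set: { field: 'event.dataset', value: 'system.syslog' } }],
        },
      },
      component_templates: {
        'logs-system.syslog@package': {
          template: { settings: { index: { number_of_shards: 1 } } },
        },
      },
      index_templates: {
        'logs-system.syslog': {
          index_patterns: ['logs-system.syslog-*'],
          composed_of: ['logs-system.syslog@package'],
          data_stream: {},
        },
      },
    },
  });
}
```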