Provide a bulk API for creating ingest assets #77505

Closed
joshdover opened this issue Sep 9, 2021 · 21 comments

@joshdover
Contributor

joshdover commented Sep 9, 2021

When Fleet installs Elasticsearch ingest assets (index and component templates, ingest pipelines, ILM policies, etc.) for a package, we're currently bottlenecked by queueing behavior on cluster state updates as observed in this issue: elastic/kibana#110500 (comment)

This is causing some package installs to take upwards of 30s. This is a problem for Fleet, Kibana, and Elastic Agent for two primary reasons:

  1. We need the ability to upgrade packages during Kibana upgrades to keep some ingest assets in sync with the rest of the Stack (e.g. assets used by APM Server or by Elastic Agents themselves for monitoring).
  2. We will also likely want the ability to automatically downgrade packages and reinstall older versions of assets when an issue with a Kibana upgrade requires a rollback to the previous Kibana version. This would require re-writing all ingest assets in Elasticsearch to be sure they're compatible with the older Kibana version.

For both of these use cases, if this process is slow, Kibana upgrades and rollbacks will be too slow and possibly time out depending on the configuration of the orchestration layer.

When executing Fleet's setup process, which installs the system package, we're seeing cluster state updates take ~150ms each on a single-node cluster running on the same machine as Kibana. See the node stats results taken before and after the setup process: node_stats.zip, es_logs.zip

@DaveCTurner mentioned that one way we could optimize this is by providing a bulk API to batch these cluster state updates in a single write.
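
To make the ask concrete, a request to such a bulk API might look roughly like the sketch below. This is purely hypothetical: the _ingest_assets/_bulk endpoint and the request shape are invented here for illustration and do not exist in Elasticsearch today.

# Hypothetical endpoint and body, for illustration only (no such API exists).
curl -XPOST -u elastic:changeme -H 'content-type: application/json' \
  'http://localhost:9200/_ingest_assets/_bulk' -d '{
    "component_templates": {
      "logs-example@package": { "template": { "settings": {} } }
    },
    "index_templates": {
      "logs-example": { "index_patterns": ["logs-example-*"], "composed_of": ["logs-example@package"] }
    },
    "ingest_pipelines": {
      "logs-example-1.0.0": { "processors": [] }
    }
  }'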

@joshdover added the >enhancement, :Distributed Indexing/Distributed, and needs:triage labels on Sep 9, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Sep 9, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

There's a question about why cluster state updates take ~150ms in these tests, and that falls under the :Distributed/Cluster coordination label, but the question about a bulk API for installing templates/pipelines/ILM policies etc. is the domain of the data management team, so I'm moving this over there.

@elasticmachine added the Team:Data Management label on Sep 9, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@DaveCTurner removed the :Distributed Indexing/Distributed, Team:Data Management, Team:Distributed (Obsolete), and needs:triage labels on Sep 9, 2021
@dakrone
Member

dakrone commented Sep 9, 2021

@joshdover out of curiosity, do you have an idea of the number of items Fleet usually would want to do in a single request? 10s? 100s? 1000s?

@dakrone
Member

dakrone commented Sep 9, 2021

Also, rather than batching cluster state updates, perhaps it would be better to teach Elasticsearch the concept of a package, with a set of templates, policies, pipelines, and metadata, and then make adding or removing a package an atomic operation from a cluster-state perspective.

@jakelandis
Contributor

perhaps it would be better to teach Elasticsearch the concept of a package

That is an interesting idea, and it is tangentially related to #63798. However, that request could be interpreted as a single package contributing to something shared among all installed packages. I am not sure that request is actually related, but if we pursue a package-based approach, we should probably consider things that a package may want to install that are not unique to that package (for example, multiple packages may re-use ingest pipelines; we would need a way to keep them from stomping on each other).

@dakrone
Member

dakrone commented Sep 9, 2021

we should probably consider things that a package may want to install that are not unique to that package (for example, multiple packages may re-use ingest pipelines; we would need a way to keep them from stomping on each other)

I think we'd need both: we will probably want the concept of "global" settings/items, and we will also want two packages to be able to initially share a "thing" but, when a change is made to that thing, have the change apply to only a single package (i.e. namespacing of customization).

@joshdover
Contributor Author

out of curiosity, do you have an idea of the number of items Fleet usually would want to do in a single request? 10s? 100s? 1000s?

For initial setup, we're looking at 100s of objects but in the future as more packages are being upgraded we could potentially want to upgrade 1000s of objects at once. That said, we'll likely want to do these in one request per package, so that in case any of these operations fail we can isolate the failure to a single ingest integration.

@joshdover
Contributor Author

perhaps it would be better to teach Elasticsearch the concept of a package

This is the long term plan and I agree it's something we'll need at some point. My thinking was that maybe it'd be simpler to start with a bulk API that could later be used under the hood for a more complete package API abstraction in the future. That way we can more immediately solve the problems we're seeing now as we work out the many details on what a package API would need to support. I'll defer to you folks on what makes sense here.

@joshdover
Contributor Author

joshdover commented Jan 6, 2022

Package install and upgrade performance continues to be a challenge for both Kibana reliability and user onboarding. I want to highlight the recent changes in these areas and how this challenge affects the user and operator experience.

User onboarding

One user experience change targeted to ship in 8.1 is the removal of default package installation (elastic/kibana#108456). This will move the package installation step for key required packages (fleet_server, elastic_agent, system) into the onboarding process, when a user sets up their first Agent.

By moving this installation step into the onboarding flow, we're adding a significant ~30s+ delay to a key step that must complete before the user is instructed to actually install their first agent. We have concerns that this delay may have a negative impact on the success rate of users getting started with the Stack.

This is primarily or entirely bottlenecked by the performance of creating ingest assets in Elasticsearch.

Kibana reliability

As of elastic/kibana#111858, which is shipping in 8.0, Kibana will install and upgrade 1st-party packages (Endpoint, APM, Synthetics, etc.) on boot. In this initial version, this process does not block Kibana startup and instead runs as an asynchronous upgrade process. This is not ideal, as it may give operators a false sense that the Kibana upgrade has completed and that it's safe to start upgrading Fleet Server, Elastic Agent, or standalone APM Server. If these components are upgraded before packages have finished upgrading, ingest could break, resulting in dropped data, or data could be ingested in a format that is unusable by application UIs or dashboards.

One of the reasons we are hesitant to block Kibana startup (elastic/kibana#120616) is the slow installation process which is primarily bottlenecked by this issue.

It's worth noting that package upgrades are currently implemented as full removal and then subsequent installation. This means that we'd need to be able to both delete and create ingest assets quickly in Elasticsearch. If increasing the scope of this bulk create API to also support deletes is a major challenge but supporting updates is not, it's possible we could revisit the upgrade logic in Fleet to minimize the deletes we do and leverage the bulk create/update logic.
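
At the Elasticsearch API level, an upgrade today therefore amounts to deleting the old assets and re-creating them, roughly like the sketch below (asset names and bodies are placeholders, not the real package contents):

# Placeholder names for illustration; real names come from the package being upgraded.
curl -s -XDELETE -u elastic:changeme http://localhost:9200/_ingest/pipeline/logs-example-1.0.0
curl -s -XDELETE -u elastic:changeme http://localhost:9200/_index_template/logs-example
# ...then re-create the assets from the new package version:
curl -s -XPUT -u elastic:changeme -H 'content-type: application/json' \
  http://localhost:9200/_ingest/pipeline/logs-example-1.1.0 -d '{"processors":[]}'
curl -s -XPUT -u elastic:changeme -H 'content-type: application/json' \
  http://localhost:9200/_index_template/logs-example -d '{"index_patterns":["logs-example-*"]}'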

cc @jakelandis @dakrone

@joshdover
Contributor Author

joshdover commented Apr 7, 2022

@jakelandis You asked for some additional metrics and reproduction steps here. Could you clarify what would be helpful to provide aside from what I provided here: elastic/kibana#110500 (comment)?

This can easily be reproduced by:

  1. Configure Kibana to send APM data to a cluster of your choosing by setting these env vars:
     ELASTIC_APM_ACTIVE=true
     ELASTIC_APM_SERVER_URL=https://myapmendpoint.com/
     ELASTIC_APM_SECRET_TOKEN=foo
  2. Start ES and Kibana
  3. Run this API call against Kibana:
     curl -XPOST -H 'content-type: application/json' -H 'kbn-xsrf: foo' -u elastic:changeme http://localhost:5601/api/fleet/epm/packages/system/1.6.4
  4. View the trace in APM on the /api/fleet/epm/packages endpoint

If this isn't enough to go on, I think this could easily be emulated by making many concurrent PUT calls to index templates and ingest pipelines from the shell.
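
A minimal shell sketch of that kind of emulation, assuming placeholder template and pipeline bodies rather than the ones Fleet actually installs:

#!/usr/bin/env bash
# Fire many concurrent PUTs for component templates and ingest pipelines, then wait for all of them.
ES=http://localhost:9200
AUTH=elastic:changeme

for i in $(seq 1 30); do
  curl -s -o /dev/null -XPUT -u "$AUTH" -H 'content-type: application/json' \
    "$ES/_component_template/repro-ct-$i" \
    -d '{"template":{"settings":{"number_of_shards":1}}}' &
  curl -s -o /dev/null -XPUT -u "$AUTH" -H 'content-type: application/json' \
    "$ES/_ingest/pipeline/repro-pipeline-$i" \
    -d '{"processors":[{"set":{"field":"repro","value":true}}]}' &
done
wait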

@jakelandis
Contributor

I think this could easily be emulated by making many concurrent PUT calls to index templates and ingest pipelines from the shell.

Yes, this would help us isolate the issue. Do you have any example index templates and ingest pipelines we can use to test? How much concurrency do you have? I.e. a dozen concurrent requests for a mix of templates and pipelines, or just 2 concurrent requests with separate lanes for pipelines and templates? What is the cluster setup? (A single node hosted locally?)

Any information you can provide that allows us to reproduce this without Fleet, but modeled closely on Fleet's usage, would greatly help us identify the slowdown.

@dakrone
Member

dakrone commented Apr 8, 2022

I can think of at least one (hopefully quick) thing that may help this without any additional API overhead—we could change the cluster state updates for these to be batched (currently neither the templates nor ingest pipelines are batched). That would only really help if multiple things of the same type were being installed in parallel, however. Judging by the issue Josh linked where they were experimenting with both, I think it could help the parallel case.

(although none of this is backed up by numbers, and we'd want a reliably reproducible way to test this, as Jake mentioned above)

@joshdover
Contributor Author

joshdover commented Apr 11, 2022

I'll create an easy repro example later this week. In the meantime, I can discuss how the parallelism works.

  • Multiple packages can be installed at once, in parallel. This is Node.js, so there's no limit on the number of async requests we execute at once. I've also confirmed we're not hitting any connection cap in the client, but that was months ago and should be revalidated.
  • For each package, each asset type is currently installed in serial, in this order:
    • Kibana Saved Objects
    • ILM policies
    • ML models
    • Ingest pipelines (new ones created)
    • Index and component templates
    • Data streams are rolled over
    • Transforms
    • Ingest pipelines (old ones deleted)
  • Within each asset type, each individual asset is done in parallel, except in cases where there's a dependency (component templates are created before the index template that references them).

I've experimented with creating each asset type in parallel where possible, and this did not improve performance at all; it just resulted in more async requests waiting at once.

The bulk of the code for this lives here: https://github.com/elastic/kibana/blob/66b3f01a17dbcbb35fdf47ea439b8dd8666ae249/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts/#L117
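
As a rough shell illustration of that ordering (asset types in series, assets within a type in parallel, component templates created before the index template that references them), with placeholder names rather than the real package contents:

#!/usr/bin/env bash
# Sketch only: asset names and bodies are placeholders.
ES=http://localhost:9200
AUTH=elastic:changeme
put() { curl -s -o /dev/null -XPUT -u "$AUTH" -H 'content-type: application/json' "$ES/$1" -d "$2"; }

# ILM policies (parallel within the type)
put "_ilm/policy/pkg-logs"    '{"policy":{"phases":{"hot":{"actions":{"rollover":{"max_primary_shard_size":"50gb"}}}}}}' &
put "_ilm/policy/pkg-metrics" '{"policy":{"phases":{"hot":{"actions":{"rollover":{"max_primary_shard_size":"50gb"}}}}}}' &
wait

# Ingest pipelines
put "_ingest/pipeline/pkg-pipeline-1" '{"processors":[]}' &
put "_ingest/pipeline/pkg-pipeline-2" '{"processors":[]}' &
wait

# Component templates first, then the index template that composes them
put "_component_template/pkg-ct-1" '{"template":{"settings":{}}}' &
put "_component_template/pkg-ct-2" '{"template":{"mappings":{}}}' &
wait
put "_index_template/pkg-template" '{"index_patterns":["pkg-*"],"composed_of":["pkg-ct-1","pkg-ct-2"]}'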

@joshdover
Contributor Author

joshdover commented Apr 14, 2022

I've created a script that attempts to emulate the parallelism we use during package installation when installing the system package. This package contains only Saved Objects (not included in my script), ingest pipelines, component templates, and index templates. You'll see that the ingest part of the package installation (everything except the SOs) takes ~6.5s when run raw like this. With @jakelandis's ES build that includes timing logs on these endpoints, I noticed component templates taking >2.5 seconds to be created.

To use my script:

  1. Start ES with changeme as the elastic password
  2. Unzip the archive
  3. Run ./run.sh, observe timings
  4. Run ./teardown.sh (note: this doesn't emulate our parallelism during uninstalls; it's just a convenience script for re-running the test without having to start a clean ES)

package-install-repro.zip

Branch with the (hacky) code I used to generate this script while running a package install: https://github.com/joshdover/kibana/tree/fleet/install-repro.

Here's a related APM trace from the real package install code in Kibana that shows similar behavior:
[image: APM trace of the package install]

@dakrone
Member

dakrone commented Apr 19, 2022

Thanks @joshdover, this was useful! I ran some local tests with your code: installing all the operations took ~6s on my laptop and ~4s on my desktop machine. Interestingly, even with a single ES node, you can see the nanosecond timings for PUT-ing a component template keep increasing as more and more are done in parallel:

component template time to install NON-batching (in nanoseconds):
193570000
103017000
97592875
179529792
292973166
392919291
469226625
551573167
683457458
761124667
840760875
980525000
1057703666
1137113667
1316617209
1401967667
1513343958
1714177875
1887339000
1976050292
2059373333
2185087958
2303293375
2486336042
2571670625
2652833625
2773985833
2856030209
2936203417
3076151084
3161950541
3243575167
3382663917
3484927917
3569993167
3644332375

I did a quick-and-dirty batching implementation for all the template stuff as well as ingest pipelines. That brought the real time for the reproduction script to ~1.2s on my desktop (so ~4s => ~1.2s), which seems like a pretty good improvement considering that there isn't even any network overhead for the cluster state updates. The nanosecond timings for the component templates even out over time also, if you compare the timings:

component template time to install WITH-batching (in nanoseconds):
37369903
39976718
69583654
70282912
300642897
300601008
300733378
301792195
302690829
302738018
303127592
301654274
301658623
303753222
300833978
299771164
299143169
301040587
297771062
297805007
298434795
296777609
300011938
296849525
296877517
296880223
297054001
297145763
299669202
301940534
297097322
301204276
297407938
297501865
303229554
297697925

You can see how the timing reaches a steady equilibrium of ~0.3 seconds to install each component template.

I've opened up a WIP PR with my changes at #86017, and here are some custom 8.3.0-SNAPSHOT builds that include the changes from that PR as well as some timing output from Jake:

Could you try your tests with these builds and see if this is enough to alleviate the problem for the short term? (Feel free to generate your own ES build from my PR; I wasn't sure whether that was something you wanted to do, hence the custom builds. I'd try the Kibana reproduction myself, but I've never been able to figure out how to build Kibana locally.) If this seems promising to you, I can work on getting that PR polished up and merged.

@jen-huang

@dakrone Awesome improvements. QQ, I see your second snippet has the note install WITH-batching - does this refer to batching on the ES side, or are you recommending for Fleet to batch our requests to ES?

@dakrone
Member

dakrone commented Apr 20, 2022

does this refer to batching on the ES side, or are you recommending for Fleet to batch our requests to ES?

All of these timings use the reproduction script, which sends requests in parallel to Elasticsearch. The batching I mentioned on the ES side is batching cluster state update tasks (which occur when the pipeline and templates are created).

@joshdover
Contributor Author

Thanks @dakrone. With your changes alone, installing the system package (the same package from my repro case) drops from 22.9s to 12.0s, a 48% reduction in total package installation time.

With a few additional improvements on the Fleet side (elastic/kibana#130906) I was able to optimize this further, down to 8.0s on my local machine, for a total improvement of 65%, or nearly 3x as fast. I believe there is likely another win to be had on the Fleet side that could shave off another 1-2s.

I think we should definitely move forward on your PR. 🎉

@dakrone
Member

dakrone commented Apr 25, 2022

Cool, thanks @joshdover, I'll work on getting the PR in.

dakrone added a commit that referenced this issue May 4, 2022
This commit changes the cluster state operations for templates (legacy, component, and composable) as well as ingest pipelines to be bulk executed. This means that they can be processed much faster when creating/updating many simultaneously.

Relates to #77505
@joshdover
Contributor Author

With the batching changes in and the improvements I’ve been able to make on the Kibana side, I think we can close this issue for now. I am still seeing some related slowness around creating ES transforms, but that is a separate problem that can be evaluated independently.

Thanks all, @dakrone and @jakelandis

@joshdover closed this as not planned on May 4, 2022