
[Fleet] Optimize package installation performance, phase 1 #130906

Merged: 7 commits on May 5, 2022

Conversation

@joshdover (Contributor) commented Apr 25, 2022

Summary

Phase 1 of #110500

Depends on the following to be merged first:

In conjunction with elastic/elasticsearch#86017, this PR improves package installation time significantly. I tested this with the package install script that installs all packages, without any failures.

  • Sets refresh: false for several of the Saved Object operations that we don't need to wait for
  • Removes a fetch of each index template that seemed unnecessary (see notes in code comments)
  • Removes an unnecessary fetch of the package from the registry
  • Switches the logic for creating the @custom templates to use the create=true query parameter rather than a check and set (removes an API call)
  • Parallelizes creation of Saved Objects alongside Elasticsearch assets
  • Avoids a 301 redirect on a registry API call

Another 1-2 seconds will be shaved off by #115032

The largest change in this PR is related to removing the refresh on each of the Saved Object update calls for updating the installed_es array on the epm-packages Saved Object. Prior to this PR, each update of this array involved querying the current value and then writing an update back. In order to be able to remove the refresh, we have to avoid needing to query the current value and instead keep the current value in memory. This requires keeping track of the current array and passing it into each asset type's install function.
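
To illustrate the pattern, here's a minimal sketch (the real signatures live in Fleet's EPM install code; `EsAssetReference` and `SoClientLike` below are simplified stand-ins, not the actual types):

```ts
// Simplified sketch – not the actual Fleet implementation.
interface EsAssetReference {
  id: string;
  type: string; // e.g. 'ilm_policy', 'ingest_pipeline', 'index_template'
}

interface SoClientLike {
  update(
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { refresh?: boolean }
  ): Promise<unknown>;
}

// Append the new references to the in-memory array and persist without waiting for a
// refresh. Because the caller keeps using the returned array, we never need to query
// the saved object back between steps, which is what lets us drop the refresh.
async function updateEsAssetReferences(
  soClient: SoClientLike,
  pkgName: string,
  currentReferences: EsAssetReference[],
  { assetsToAdd }: { assetsToAdd: EsAssetReference[] }
): Promise<EsAssetReference[]> {
  const updated = [...currentReferences, ...assetsToAdd];
  await soClient.update('epm-packages', pkgName, { installed_es: updated }, { refresh: false });
  return updated;
}
```

Each asset type's install function receives the current array and returns the updated one, e.g. `esReferences = await installIlmPolicies(..., esReferences)` (function name illustrative).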

This change also exposes a known issue with Saved Object update calls that could result in conflict errors even when not using optimistic concurrency control (see #126240). In order to work around this problem, I added simple retry logic which seems to work well.
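
The retry itself is just a thin wrapper around the saved object update; roughly along these lines (the retry count and backoff shown here are assumptions, not necessarily the merged values):

```ts
import pRetry from 'p-retry';

// Retry the epm-packages saved object update until it succeeds. The conflict from
// kibana#126240 is transient, so blindly retrying is safe here.
async function updateWithRetry(update: () => Promise<void>): Promise<void> {
  await pRetry(update, { retries: 20, minTimeout: 100 });
}
```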

I've deferred a couple more optimizations to a separate PR that will shave off another few seconds:

  • Use refresh: false on the bulk Saved Object import, blocked on Expose refresh option to SavedObjectsImporter #131339
  • Create ingest pipelines and index templates in parallel, rather than in serial
    • This is now the most time-consuming step, but thanks to the ES change it's possible to parallelize these calls. I want to do some additional testing to ensure that this change does not break upgrades before merging it (see the rough sketch after this list).
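
As a rough sketch of that deferred parallelization (the function names are placeholders, and it assumes the ES change linked above lets templates and pipelines be created in either order):

```ts
// Placeholder sketch of the deferred optimization: create ingest pipelines and index
// templates concurrently instead of one after the other.
async function installPipelinesAndTemplatesInParallel<T>(
  installIngestPipelines: () => Promise<T[]>,
  installIndexTemplates: () => Promise<T[]>
): Promise<T[]> {
  const [pipelines, templates] = await Promise.all([
    installIngestPipelines(),
    installIndexTemplates(),
  ]);
  return [...pipelines, ...templates];
}
```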

Results

To install the system package, version 1.13.0:

When installing all packages, the p95 install time changes significantly:

[screenshot: p95 install time when installing all packages, 2022-04-29]

View APM traces

main

[APM trace screenshot]

main + elastic/elasticsearch#86017

[APM trace screenshot]

This branch + elastic/elasticsearch#86017

[APM trace screenshot]
View install script results

This branch + elastic/elasticsearch#86017

 info INSTALLING packages
 info ✅ 1password-1.3.0  took 4.95s : 200
 info ✅ aws-1.14.4  took 7.238s : 200
 info ✅ awsfargate-0.1.1  took 4.075s : 200
 info ✅ ti_abusech-1.3.0  took 5.068s : 200
 info ✅ activemq-0.3.1  took 4.11s : 200
 info ✅ akamai-0.2.0  took 5.092s : 200
 info ✅ ti_otx-1.3.0  took 4.105s : 200
 info ✅ ti_anomali-1.3.0  took 4.078s : 200
 info ✅ apache-1.3.6  took 6.459s : 200
 info ✅ apache_spark-0.1.0  took 4.057s : 200
 info ✅ tomcat-1.4.0  took 4.081s : 200
 info ✅ netscout-0.8.0  took 4.088s : 200
 info ✅ atlassian_bitbucket-1.2.1  took 4.092s : 200
 info ✅ atlassian_confluence-1.2.0  took 4.125s : 200
 info ✅ atlassian_jira-1.2.0  took 4.082s : 200
 info ✅ auditd-2.2.0  took 4.072s : 200
 info ✅ auth0-0.1.4  took 5.129s : 200
 info ✅ azure_application_insights-1.0.2  took 4.086s : 200
 info ✅ azure_billing-1.0.1  took 4.072s : 200
 info ✅ azure-1.1.6  took 5.091s : 200
 info ✅ azure_metrics-1.0.3  took 6.102s : 200
 info ✅ barracuda-0.9.0  took 4.06s : 200
 info ✅ bluecoat-0.8.0  took 4.066s : 200
 info ✅ cef-1.5.0  took 4.074s : 200
 info ✅ cloud_security_posture-0.0.3  took 4.073s : 200
 info ❌ cis_kubernetes_benchmark-0.0.2  took 6.112s : {"body":{"statusCode":500,"error":"Internal Server Error","message":"Encountered 1 errors creating saved objects: [{\"type\":\"csp-rule-template\",\"id\":\"csp_rule_template-41308bcdaaf665761478bb6f0d745a5c\",\"error\":{\"type\":\"unsupported_type\"}}]"},"status":500,"took":6.112}
 info ✅ cassandra-1.2.2  took 5.075s : 200
 info ✅ checkpoint-1.4.0  took 4.058s : 200
 info ✅ cisco_asa-2.3.0  took 4.142s : 200
 info ✅ cisco_duo-1.2.1  took 4.091s : 200
 info ✅ cisco_ftd-2.1.0  took 4.073s : 200
 info ✅ cisco_ios-1.5.0  took 4.065s : 200
 info ✅ cisco_ise-0.1.0  took 4.09s : 200
 info ✅ cisco_meraki-0.5.0  took 5.491s : 200
 info ✅ cisco_nexus-0.5.1  took 5.061s : 200
 info ✅ cisco_secure_endpoint-2.4.0  took 4.086s : 200
 info ✅ cisco_umbrella-0.6.0  took 4.07s : 200
 info ✅ cloudflare-1.4.1  took 6.179s : 200
 info ✅ cockroachdb-0.2.2  took 4.071s : 200
 info ✅ containerd-0.2.1  took 4.098s : 200
 info ✅ crowdstrike-1.3.0  took 5.138s : 200
 info ✅ aws_logs-0.2.1  took 3.042s : 200
 info ✅ gcp_pubsub-1.0.0  took 5.058s : 200
 info ✅ http_endpoint-1.1.0  took 2.071s : 200
 info ✅ httpjson-1.2.0  took 2.048s : 200
 info ✅ log-1.0.0  took 2.049s : 200
 info ✅ tcp-1.1.0  took 2.048s : 200
 info ✅ udp-1.1.0  took 2.041s : 200
 info ✅ winlog-1.5.0  took 4.072s : 200
 info ✅ cyberarkpas-2.4.0  took 4.103s : 200
 info ✅ ti_cybersixgill-1.4.0  took 4.07s : 200
 info ✅ cylance-0.8.0  took 5.063s : 200
 info ✅ dga-0.0.2  took 13.919s : 200
 info ✅ docker-2.1.0  took 4.561s : 200
 info ✅ apm-8.3.0-dev1  took 8.237s : 200
 info ✅ elastic_agent-1.3.1  took 3.124s : 200
 info ✅ synthetics-0.9.2  took 3.645s : 200
 info ✅ endpoint-8.2.0  took 13.208s : 200
 info ✅ f5-0.9.0  took 5.176s : 200
 info ✅ fim-0.1.0  took 4.12s : 200
 info ✅ fireeye-1.3.0  took 4.073s : 200
 info ✅ fleet_server-1.1.1  took 2.038s : 200
 info ✅ fortinet-1.5.0  took 6.212s : 200
 info ✅ github-0.4.0  took 6.574s : 200
 info ✅ gcp-1.6.0  took 5.11s : 200
 info ✅ santa-2.1.0  took 4.066s : 200
 info ✅ google_workspace-1.4.0  took 5.114s : 200
 info ✅ haproxy-1.1.1  took 5.098s : 200
 info ✅ hadoop-0.1.0  took 4.085s : 200
 info ✅ hashicorp_vault-1.4.0  took 6.156s : 200
 info ✅ hid_bravura_monitor-1.0.2  took 6.145s : 200
 info ✅ iis-0.8.4  took 6.113s : 200
 info ✅ imperva-0.8.0  took 4.109s : 200
 info ✅ infoblox-0.8.0  took 5.182s : 200
 info ✅ iptables-0.9.0  took 4.078s : 200
 info ✅ juniper_junos-0.2.0  took 4.077s : 200
 info ✅ juniper-1.1.1  took 5.534s : 200
 info ✅ juniper_netscreen-0.2.0  took 5.114s : 200
 info ✅ juniper_srx-1.2.0  took 5.075s : 200
 info ✅ kafka-1.2.3  took 5.089s : 200
 info ✅ keycloak-1.3.0  took 4.069s : 200
 info ✅ kubernetes-1.19.1  took 5.103s : 200
 info ✅ linux-0.6.4  took 5.274s : 200
 info ✅ logstash-1.1.0  took 5.074s : 200
 info ✅ problemchild-0.0.2  took 6.836s : 200
 info ✅ m365_defender-1.0.3  took 6.134s : 200
 info ✅ ti_misp-1.3.0  took 5.07s : 200
 info ✅ mattermost-1.2.0  took 6.304s : 200
 info ✅ microsoft_dhcp-1.4.0  took 5.073s : 200
 info ✅ microsoft_defender_endpoint-2.2.0  took 5.197s : 200
 info ✅ microsoft_sqlserver-0.5.0  took 5.098s : 200
 info ✅ mimecast-0.0.11  took 6.08s : 200
 info ✅ modsecurity-0.1.5  took 5.094s : 200
 info ✅ mongodb-1.3.2  took 5.078s : 200
 info ✅ mysql-1.3.1  took 7.128s : 200
 info ✅ mysql_enterprise-1.0.1  took 5.206s : 200
 info ✅ nats-1.3.0  took 5.145s : 200
 info ✅ nagios_xi-0.1.1  took 6.095s : 200
 info ✅ netflow-1.5.0  took 5.078s : 200
 info ✅ netskope-0.1.2  took 5.505s : 200
 info ✅ network_traffic-0.9.0  took 6.087s : 200
 info ✅ nginx-1.3.2  took 5.089s : 200
 info ✅ nginx_ingress_controller-1.3.1  took 6.096s : 200
 info ✅ o365-1.5.0  took 6.492s : 200
 info ✅ okta-1.6.0  took 5.098s : 200
 info ✅ oracle-1.0.1  took 6.384s : 200
 info ✅ osquery-1.3.0  took 5.094s : 200
 info ✅ osquery_manager-1.2.1  took 4.352s : 200
 info ✅ panw_cortex_xdr-1.2.0  took 5.082s : 200
 info ✅ panw-1.6.0  took 6.179s : 200
 info ✅ postgresql-1.3.1  took 6.107s : 200
 info ✅ security_detection_engine-1.0.1  took 5.087s : 200
 info ✅ prometheus-0.9.1  took 4.132s : 200
 info ✅ proofpoint-0.7.0  took 5.116s : 200
 info ✅ pulse_connect_secure-0.3.0  took 5.726s : 200
 info ✅ qnap_nas-1.2.0  took 5.089s : 200
 info ✅ rabbitmq-1.3.1  took 6.099s : 200
 info ✅ radware-0.7.0  took 6.084s : 200
 info ✅ ti_recordedfuture-0.1.2  took 5.079s : 200
 info ✅ redis-1.3.1  took 5.092s : 200
 info ✅ stan-1.3.0  took 6.079s : 200
 info ✅ snapshot-0.0.1  took 2.05s : 200
 info ✅ snort-0.3.0  took 5.114s : 200
 info ✅ snyk-1.2.0  took 5.072s : 200
 info ✅ sonicwall-0.8.0  took 5.104s : 200
 info ✅ sophos-2.1.0  took 7.109s : 200
 info ✅ spring_boot-0.5.0  took 5.128s : 200
 info ✅ squid-0.8.0  took 5.099s : 200
 info ✅ staging-0.0.1  took 3.124s : 200
 info ✅ suricata-1.7.0  took 6.096s : 200
 info ✅ symantec_endpoint-0.0.2  took 5.074s : 200
 info ✅ system-1.13.0  took 7.208s : 200
 info ✅ tenable_sc-1.2.0  took 6.141s : 200
 info ✅ ti_threatq-1.3.0  took 5.077s : 200
 info ✅ traefik-1.3.1  took 6.155s : 200
 info ✅ carbon_black_cloud-0.1.2  took 6.139s : 200
 info ✅ carbonblack_edr-1.2.0  took 6.083s : 200
 info ✅ vsphere-0.1.1  took 5.089s : 200
 info ✅ windows-1.11.0  took 7.131s : 200
 info ✅ zeek-1.7.0  took 8.136s : 200
 info ✅ zerofox-1.3.0  took 5.062s : 200
 info ✅ zookeeper-1.3.1  took 5.099s : 200
 info ✅ zoom-1.3.1  took 6.101s : 200
 info ✅ zscaler_zia-0.2.0  took 9.163s : 200
 info ✅ zscaler-0.1.2  took 5.131s : 200
 info ✅ zscaler_zpa-0.2.0  took 6.9s : 200
 info ✅ etcd-0.1.1  took 6.133s : 200
 info ✅ pfsense-0.4.0  took 8.171s : 200

Comment on lines -413 to -450
// Datastream now throw an error if the aliases field is present so ensure that we remove that field.
const getTemplateRes = await retryTransientEsErrors(
  () =>
    esClient.indices.getIndexTemplate(
      {
        name: templateName,
      },
      {
        ignore: [404],
      }
    ),
  { logger }
);

const existingIndexTemplate = getTemplateRes?.index_templates?.[0];
if (
  existingIndexTemplate &&
  existingIndexTemplate.name === templateName &&
  existingIndexTemplate?.index_template?.template?.aliases
) {
  const updateIndexTemplateParams = {
    name: templateName,
    body: {
      ...existingIndexTemplate.index_template,
      template: {
        ...existingIndexTemplate.index_template.template,
        // Remove the aliases field
        aliases: undefined,
      },
    },
  };

  await retryTransientEsErrors(
    () => esClient.indices.putIndexTemplate(updateIndexTemplateParams, { ignore: [404] }),
    { logger }
  );
}

@joshdover (Contributor, Author):

AFAICT this isn't necessary. Creating an index template without the aliases field at all seems to have the same behavior in my manual testing and this is already what the main create index template logic below does.

@nchaulet It looks like you wrote this originally, do you know if there's anything I'm missing here?

@joshdover (Contributor, Author):

Here's the original issue: #90984

It appears it was from an upgrade of the endpoint package that I can no longer reproduce, even using the old versions mentioned in the issue (0.16.2 -> 0.17.0).

@joshdover (Contributor, Author):

I believe it could be reproduced if you upgraded from 7.10.0 to main, but this is not strictly supported since Fleet was not GA until 7.14.0. I think this is a safe thing to remove.

@joshdover joshdover force-pushed the fleet/install-perf branch from 35c863f to ca44aeb Compare April 28, 2022 18:22
@joshdover joshdover changed the title Optimize package installation performance [Fleet] Optimize package installation performance Apr 28, 2022
@joshdover joshdover added release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v8.3.0 ci:deploy-cloud labels Apr 28, 2022
@joshdover joshdover marked this pull request as ready for review April 28, 2022 18:52
@joshdover joshdover requested a review from a team as a code owner April 28, 2022 18:52
@elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

@jen-huang jen-huang added release_note:enhancement and removed release_note:skip Skip the PR/issue when compiling release notes labels Apr 28, 2022
@juliaElastic (Contributor):

Great work! Should we run the script that installs all packages to verify nothing breaks?

@joshdover (Contributor, Author):

> Great work! Should we run the script that installs all packages to verify nothing breaks?

Good idea. I'm also going to move the APM instrumentation into a separate PR so that, if this needs to be reverted for any reason, we still keep the instrumentation. This also makes it easier to create comparison traces before/after the optimizations.

@joshdover joshdover marked this pull request as draft April 29, 2022 10:43
@joshdover joshdover force-pushed the fleet/install-perf branch 3 times, most recently from dc11c73 to ea84f45 Compare April 29, 2022 13:03
@joshdover (Contributor, Author):

There's some work that needs to be done to refactor how ES asset references are passed around before we can remove these refreshes on the SO index. A lot of logic depends on updating this object and then querying it back; that needs to be refactored to simply keep a running list in memory that is updated as we go, without querying it back, to avoid the need for refreshes.

@joshdover joshdover force-pushed the fleet/install-perf branch from 10c4fa6 to 151acb6 Compare May 2, 2022 14:33
Comment on lines +40 to +45
esReferences = await updateEsAssetReferences(savedObjectsClient, packageInfo.name, esReferences, {
  assetsToAdd: ilmPolicies.map((policy) => ({
    type: ElasticsearchAssetType.ilmPolicy,
    id: policy.name,
  })),
});
@joshdover (Contributor, Author):

We weren't previously keeping track of these references so these weren't getting deleted on uninstall. Fixed in this PR.

-    return clusterPromise;
+    return await clusterPromise;
   } catch (e) {
     if (e?.statusCode === 400 && e.body?.error?.reason.includes('already exists')) {
@joshdover (Contributor, Author):

This is the error message that will be returned by ES if the custom component template already exists. Unfortunately there's not a better machine readable field name.
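
For context, a simplified sketch of the create-if-missing flow this relies on (not the exact Fleet code; the function name and template body below are placeholders, while the `create` parameter and the error check come from this PR):

```ts
import type { Client } from '@elastic/elasticsearch';

// Send the request with create=true and treat "already exists" as success, instead of
// doing a separate existence check first.
async function ensureCustomComponentTemplate(esClient: Client, name: string): Promise<void> {
  try {
    await esClient.cluster.putComponentTemplate({
      name,
      create: true, // makes ES return a 400 if the template already exists
      template: { settings: {} },
    });
  } catch (e: any) {
    // There is no dedicated machine-readable field for this case, so match on the reason string.
    if (e?.statusCode === 400 && e.body?.error?.reason?.includes('already exists')) {
      return; // the @custom template already exists (e.g. user-created) – leave it alone
    }
    throw e;
  }
}
```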

Comment on lines +280 to +285
        id: 'endpoint.metadata-default-0.16.0-dev.0',
        type: 'transform',
      },
      {
-       id: 'endpoint.metadata-default-0.16.0-dev.0',
+       id: 'endpoint.metadata_current-default-0.16.0-dev.0',
@joshdover (Contributor, Author):

The order changed here, but it doesn't really matter.

@@ -121,6 +123,41 @@ export async function installKibanaAssets(options: {

return installedAssets;
}

export async function installKibanaAssetsAndReferences({
@joshdover (Contributor, Author):

Extracted this from _install_package to make mocking simpler

// Because Kibana assets are installed in parallel with ES assets with refresh: false, we almost always run into an
// issue that causes a conflict error due to this issue: https://github.com/elastic/kibana/issues/126240. This is safe
// to retry constantly until it succeeds to optimize this critical user journey path as much as possible.
pRetry(
@joshdover (Contributor, Author):

Note this retry logic will retry failures other than conflict errors. I think that's probably ok, but happy to fix if we think it's a bad idea.

Contributor:

Is there a chance that 20 retries take a very long time when retrying a persistent error?

@joshdover joshdover force-pushed the fleet/install-perf branch from 151acb6 to d22c8dd Compare May 2, 2022 14:44
@joshdover joshdover marked this pull request as ready for review May 2, 2022 14:45
@joshdover (Contributor, Author):

I've been able to shave off another ~3-4 seconds with two additional optimizations but I'm going to move those to a separate PR so we can focus on getting this one in first.

@joshdover joshdover changed the title [Fleet] Optimize package installation performance [Fleet] Optimize package installation performance, phase 1 May 2, 2022
@kibana-ci (Collaborator):

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @joshdover

@juliaElastic (Contributor) left a comment:

LGTM, I haven't tested locally

Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment release_note:enhancement Team:Fleet Team label for Observability Data Collection Fleet team v8.3.0
7 participants