[APM] Guard against OOM in scripted metric agg for service maps #101920

Closed

dgieselaar opened this issue Jun 10, 2021 · 7 comments · Fixed by #159883
Assignees
Labels
8.9 candidate apm:performance APM UI - Performance Work apm:release-feature APM UI - Release Feature Goal apm:service-maps Service Map feature in APM apm:test-plan-done Pull request that was successfully tested during the test plan Team:APM All issues that need APM UI Team support

Comments

@dgieselaar
Member

dgieselaar commented Jun 10, 2021

We've seen reports that the scripted metric agg we use for service maps can cause Elasticsearch to OOM if the number of events is high. We can prevent this by setting a maximum number of events to be processed and throwing an error if this limit is exceeded. It's quite hard to return anything meaningful if we can't guarantee that we've seen the entire data set, and given that we'll be moving away from the scripted metric agg sooner rather than later, I'd suggest we simply fail when the limit is hit and make the limit configurable. One document takes up about 1kb of memory.
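For illustration, a minimal sketch of that kind of guard, assuming a hypothetical `max_events` script parameter and illustrative field names (not an actual implementation); at ~1kb per document, a 100k limit would bound the agg at roughly 100mb per request:

```ts
// Sketch only: a scripted metric agg whose map phase counts the documents it
// has seen and fails fast once a configurable limit is exceeded.
// `max_events` is a hypothetical, configurable limit, not an existing setting.
const serviceMapAgg = {
  scripted_metric: {
    params: { max_events: 100_000 },
    init_script: 'state.eventCount = 0; state.events = new ArrayList();',
    map_script: `
      if (state.eventCount >= params.max_events) {
        throw new IllegalArgumentException('Exceeded service map event limit');
      }
      state.eventCount += 1;
      // ... collect the fields needed to build service connections ...
    `,
    combine_script: 'return state;',
    reduce_script: 'return states;',
  },
};
```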


@dgieselaar dgieselaar added the Team:APM All issues that need APM UI Team support label Jun 10, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@botelastic

botelastic bot commented Feb 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the stale Used to mark issues that were closed for being stale label Feb 27, 2022
@iamjosh007

Any update on this? This is literally crashing the cluster.

@botelastic botelastic bot removed the stale Used to mark issues that were closed for being stale label Jan 12, 2023
@dgieselaar
Member Author

@iamjosh007 not yet, but we're discussing re-prioritizing this work based on incoming feedback from customers. Are you also in touch with Elastic Support?

@iamjosh007

Yes, they referred me to this ticket. OOM issues have been hitting us hard for the last few months.

@dgieselaar
Member Author

@iamjosh007 Thanks, that helps us figure out how many customers are running into this. Given you're already talking to Support, I would suggest keeping those conversations in the appropriate venue - they pull us in when needed. In this specific case we are already involved.

@gbamparop gbamparop added apm:performance APM UI - Performance Work apm:service-maps Service Map feature in APM and removed [zube]: Backlog labels Jan 16, 2023
@gbamparop gbamparop added apm:release-feature APM UI - Release Feature Goal 8.9 candidate and removed 8.8 candidate labels May 12, 2023
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Jun 18, 2023
dgieselaar added a commit that referenced this issue Jun 20, 2023
Closes #101920

This PR does three things:

- add a `terminate_after` parameter to the search request for the scripted metric agg. This is a configurable setting (`xpack.apm.serviceMapTerminateAfter`) and defaults to 100k. This is a shard-level parameter, so there's still the possibility of lots of shards individually returning 100k documents and the coordinating node running out of memory because it is collecting all these docs from individual shards. However, I suspect that there is already some protection in the reduce phase that will terminate the request with a stack_overflow_error without OOMing; I've reached out to the ES team to confirm whether this is the case.
- add `xpack.apm.serviceMapMaxTraces`: this sets the maximum number of traces to inspect in total, not just per search request. I.e., if `xpack.apm.serviceMapMaxTracesPerRequest` is 1, we simply split the traces into n chunks, so it doesn't really help with memory management. `serviceMapMaxTraces` refers to the total number of traces to inspect (a sketch of how these settings fit together follows the repro steps below).
- rewrite `getConnections` to use local mutation instead of immutability. I saw huge CPU usage (admittedly in a pathological scenario with 100s of services) in the `getConnections` function, because it uses a deduplication mechanism that is O(n²), so I rewrote it to O(n) (see the sketch after the before/after screenshots below). Here's a before:


![image](https://github.com/elastic/kibana/assets/352732/6c24a7a2-3b48-4c95-9db2-563160a57aef)

and after:

![image](https://github.com/elastic/kibana/assets/352732/c00b8428-3026-4610-aa8b-c0046e8f0e08)
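Roughly, the O(n) version amounts to keying connections by a composite id in a `Map` and mutating it locally, instead of scanning the accumulated array for duplicates on every insert; a simplified sketch with illustrative types, not the actual code:

```ts
interface ConnectionNode {
  id: string;
}

interface Connection {
  source: ConnectionNode;
  destination: ConnectionNode;
}

// Sketch: deduplicate connections in O(n) with a Map keyed on the
// source/destination pair, instead of an O(n²) scan per connection.
function getUniqueConnections(connections: Connection[]): Connection[] {
  const uniqueByKey = new Map<string, Connection>();
  for (const connection of connections) {
    const key = `${connection.source.id} -> ${connection.destination.id}`;
    if (!uniqueByKey.has(key)) {
      uniqueByKey.set(key, connection);
    }
  }
  return Array.from(uniqueByKey.values());
}
```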

To reproduce an OOM, start ES with a much smaller amount of memory:
`$ ES_JAVA_OPTS='-Xms236m -Xmx236m' yarn es snapshot`

Then run the synthtrace Service Map OOM scenario:
`$ node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`

Finally, navigate to `service-100` in the UI, and click on Service Map.
This should trigger an OOM.
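For completeness, a rough sketch of how the two settings described above might fit together; `fetchConnections` and the config shape are illustrative, not the actual APM plugin code:

```ts
import { chunk } from 'lodash';

interface ServiceMapConfig {
  serviceMapTerminateAfter: number; // per-shard terminate_after for each search
  serviceMapMaxTraces: number; // total number of traces to inspect
  serviceMapMaxTracesPerRequest: number; // traces per search request
}

// Sketch: cap the total number of traces up front, then chunk the remainder
// into individual search requests that each carry the per-shard limit.
async function getAllConnections(
  traceIds: string[],
  config: ServiceMapConfig,
  fetchConnections: (ids: string[], terminateAfter: number) => Promise<unknown[]>
): Promise<unknown[]> {
  const cappedTraceIds = traceIds.slice(0, config.serviceMapMaxTraces);

  const connections: unknown[] = [];
  for (const ids of chunk(cappedTraceIds, config.serviceMapMaxTracesPerRequest)) {
    connections.push(...(await fetchConnections(ids, config.serviceMapTerminateAfter)));
  }
  return connections;
}
```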
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jun 20, 2023
…159883)


(cherry picked from commit 1a9b241)
kibanamachine added a commit that referenced this issue Jun 20, 2023
…59883) (#160060)

# Backport

This will backport the following commits from `main` to `8.8`:
- [[APM] Circuit breaker and perf improvements for service map (#159883)](#159883)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)


Co-authored-by: Dario Gieselaar <[email protected]>
@dgieselaar dgieselaar added the apm:test-plan-done Pull request that was successfully tested during the test plan label Jul 3, 2023
@dgieselaar
Member Author

Verified that I can't take a cluster down under the same conditions as before the circuit breaker.
