[APM] Guard against OOM in scripted metric agg for service maps #101920

Closed

dgieselaar opened this issue Jun 10, 2021 · 7 comments · Fixed by #159883
Assignees
Labels
8.9 candidate apm:performance APM UI - Performance Work apm:release-feature APM UI - Release Feature Goal apm:service-maps Service Map feature in APM apm:test-plan-done Pull request that was successfully tested during the test plan Team:APM All issues that need APM UI Team support

Comments

@dgieselaar
Member

dgieselaar commented Jun 10, 2021

We've seen reports that the scripted metric agg we use for service maps can cause Elasticsearch to OOM if the number of events is high. We can prevent this by setting a maximum number of events to be processed and throwing an error if this limit is exceeded. It's quite hard to return anything meaningful if we can't guarantee that we've seen the entire data set, and given that we'll be moving away from the scripted metric agg sooner rather than later, I'd suggest we simply fail when the limit is hit and make the limit configurable. One document takes up about 1kb of memory.
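For illustration, a minimal sketch of that kind of guard, assuming a hypothetical `max_events` script parameter and illustrative field names (not an actual implementation); at ~1kb per document, a 100k limit would bound the agg at roughly 100mb per request:

```ts
// Sketch only: a scripted metric agg whose map phase counts the documents it
// has seen and fails fast once a configurable limit is exceeded.
// `max_events` is a hypothetical, configurable limit, not an existing setting.
const serviceMapAgg = {
  scripted_metric: {
    params: { max_events: 100_000 },
    init_script: 'state.eventCount = 0; state.events = new ArrayList();',
    map_script: `
      if (state.eventCount >= params.max_events) {
        throw new IllegalArgumentException('Exceeded service map event limit');
      }
      state.eventCount += 1;
      // ... collect the fields needed to build service connections ...
    `,
    combine_script: 'return state;',
    reduce_script: 'return states;',
  },
};
```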


@dgieselaar dgieselaar added the Team:APM All issues that need APM UI Team support label Jun 10, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@botelastic

botelastic bot commented Feb 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the stale Used to mark issues that were closed for being stale label Feb 27, 2022
@iamjosh007

Any update on this? This is literally crashing the cluster.

@botelastic botelastic bot removed the stale Used to mark issues that were closed for being stale label Jan 12, 2023
@dgieselaar
Member Author

@iamjosh007 not yet, but we're discussing re-prioritizing this work based on incoming feedback from customers. Are you also in touch with Elastic Support?

@iamjosh007

Yes, they referred me to this ticket. OOM issues have been hitting us hard for the last few months.

@dgieselaar
Member Author

@iamjosh007 Thanks, that helps us figure out how many customers are running into this. Given you're already talking to Support, I would suggest keeping those conversations in the appropriate venue - they pull us in when needed. In this specific case we are already involved.

@gbamparop gbamparop added apm:performance APM UI - Performance Work apm:service-maps Service Map feature in APM and removed [zube]: Backlog labels Jan 16, 2023
@gbamparop gbamparop added apm:release-feature APM UI - Release Feature Goal 8.9 candidate and removed 8.8 candidate labels May 12, 2023
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Jun 18, 2023
dgieselaar added a commit that referenced this issue Jun 20, 2023
Closes #101920

This PR does three things:

- add a `terminate_after` parameter to the search request for the scripted metric agg. This is a configurable setting (`xpack.apm.serviceMapTerminateAfter`) and defaults to 100k. This is a shard-level parameter, so there's still the possibility of lots of shards individually returning 100k documents and the coordinating node running out of memory because it is collecting all these docs from individual shards. However, I suspect that there is already some protection in the reduce phase that will terminate the request with a stack_overflow_error without OOMing; I've reached out to the ES team to confirm whether this is the case.
- add `xpack.apm.serviceMapMaxTraces`: this sets the maximum number of traces to inspect in total, not just per search request. I.e., if `xpack.apm.serviceMapMaxTracesPerRequest` is 1, we simply split the traces into n chunks, so it doesn't really help with memory management. `serviceMapMaxTraces` refers to the total number of traces to inspect (a sketch of how these settings fit together follows the repro steps below).
- rewrite `getConnections` to use local mutation instead of immutability. I saw huge CPU usage (admittedly in a pathological scenario with 100s of services) in the `getConnections` function, because it uses a deduplication mechanism that is O(n²), so I rewrote it to O(n) (see the sketch after the before/after screenshots below). Here's a before:


![image](https://github.com/elastic/kibana/assets/352732/6c24a7a2-3b48-4c95-9db2-563160a57aef)

and after:

![image](https://github.com/elastic/kibana/assets/352732/c00b8428-3026-4610-aa8b-c0046e8f0e08)
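Roughly, the O(n) version amounts to keying connections by a composite id in a `Map` and mutating it locally, instead of scanning the accumulated array for duplicates on every insert; a simplified sketch with illustrative types, not the actual code:

```ts
interface ConnectionNode {
  id: string;
}

interface Connection {
  source: ConnectionNode;
  destination: ConnectionNode;
}

// Sketch: deduplicate connections in O(n) with a Map keyed on the
// source/destination pair, instead of an O(n²) scan per connection.
function getUniqueConnections(connections: Connection[]): Connection[] {
  const uniqueByKey = new Map<string, Connection>();
  for (const connection of connections) {
    const key = `${connection.source.id} -> ${connection.destination.id}`;
    if (!uniqueByKey.has(key)) {
      uniqueByKey.set(key, connection);
    }
  }
  return Array.from(uniqueByKey.values());
}
```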

To reproduce an OOM, start ES with a much smaller amount of memory:
`$ ES_JAVA_OPTS='-Xms236m -Xmx236m' yarn es snapshot`

Then run the synthtrace Service Map OOM scenario:
`$ node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`

Finally, navigate to `service-100` in the UI, and click on Service Map.
This should trigger an OOM.
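For completeness, a rough sketch of how the two settings described above might fit together; `fetchConnections` and the config shape are illustrative, not the actual APM plugin code:

```ts
import { chunk } from 'lodash';

interface ServiceMapConfig {
  serviceMapTerminateAfter: number; // per-shard terminate_after for each search
  serviceMapMaxTraces: number; // total number of traces to inspect
  serviceMapMaxTracesPerRequest: number; // traces per search request
}

// Sketch: cap the total number of traces up front, then chunk the remainder
// into individual search requests that each carry the per-shard limit.
async function getAllConnections(
  traceIds: string[],
  config: ServiceMapConfig,
  fetchConnections: (ids: string[], terminateAfter: number) => Promise<unknown[]>
): Promise<unknown[]> {
  const cappedTraceIds = traceIds.slice(0, config.serviceMapMaxTraces);

  const connections: unknown[] = [];
  for (const ids of chunk(cappedTraceIds, config.serviceMapMaxTracesPerRequest)) {
    connections.push(...(await fetchConnections(ids, config.serviceMapTerminateAfter)));
  }
  return connections;
}
```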
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jun 20, 2023
…159883)


(cherry picked from commit 1a9b241)
kibanamachine added a commit that referenced this issue Jun 20, 2023
…59883) (#160060)

# Backport

This will backport the following commits from `main` to `8.8`:
- [[APM] Circuit breaker and perf improvements for service map (#159883)](#159883)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)


Co-authored-by: Dario Gieselaar <[email protected]>
@dgieselaar dgieselaar added the apm:test-plan-done Pull request that was successfully tested during the test plan label Jul 3, 2023
@dgieselaar
Member Author

Verified that I can't take a cluster down under the same conditions as before the circuit breaker.
