[APM] Guard against OOM in scripted metric agg for service maps #101920
Pinging @elastic/apm-ui (Team:apm)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any update on this? This is literally crashing the cluster.
@iamjosh007 Not yet, but we're discussing re-prioritizing this work based on incoming feedback from customers. Are you also in touch with Elastic Support?
Yes, they referred me to this ticket. OOM issues have been hitting us hard for the last few months.
@iamjosh007 Thanks, that helps us figure out how many customers are running into this. Given you're already talking to Support, I would suggest keeping those conversations in the appropriate venue; they pull us in when needed. In this specific case we are already involved.
Closes #101920

This PR does three things:

- Add a `terminate_after` parameter to the search request for the scripted metric agg. This is a configurable setting (`xpack.apm.serviceMapTerminateAfter`) and defaults to 100k. It is a shard-level parameter, so there's still the possibility of lots of shards individually returning 100k documents and the coordinating node running out of memory because it is collecting all these docs from the individual shards. However, I suspect that there is already some protection in the reduce phase that will terminate the request with a stack_overflow_error without OOMing; I've reached out to the ES team to confirm whether this is the case.
- Add `xpack.apm.serviceMapMaxTraces`: the maximum number of traces to inspect in total, not just per search request. I.e., if `xpack.apm.serviceMapMaxTracesPerRequest` is 1, we simply chunk the traces into n chunks, so it doesn't really help with memory management. `serviceMapMaxTraces` limits the total number of traces to inspect.
- Rewrite `getConnections` to use local mutation instead of immutability (a minimal sketch of the idea follows the repro steps below). I saw huge CPU usage in the `getConnections` function (with, admittedly, a pathological scenario where there are hundreds of services) because it uses a deduplication mechanism that is O(n²), so I rewrote it to be O(n). Here's a before:

![image](https://github.com/elastic/kibana/assets/352732/6c24a7a2-3b48-4c95-9db2-563160a57aef)

and after:

![image](https://github.com/elastic/kibana/assets/352732/c00b8428-3026-4610-aa8b-c0046e8f0e08)

To reproduce an OOM, start ES with a much smaller amount of memory:

`$ ES_JAVA_OPTS='-Xms236m -Xmx236m' yarn es snapshot`

Then run the synthtrace Service Map OOM scenario:

`$ node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`

Finally, navigate to `service-100` in the UI and click on Service Map. This should trigger an OOM.
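As a companion to the `getConnections` bullet above, here is a minimal sketch of the O(n) deduplication idea: key each connection by a composite id and collect it into a Map that is mutated in place, instead of concatenating immutable arrays and re-scanning them for duplicates. The `ConnectionNode` and `Connection` shapes and the key format are simplified assumptions for illustration, not the actual Kibana types.

```ts
interface ConnectionNode {
  id: string;
}

interface Connection {
  source: ConnectionNode;
  destination: ConnectionNode;
}

function getConnections(paths: ConnectionNode[][]): Connection[] {
  // Map keyed by "sourceId -> destinationId": each candidate pair is a single
  // O(1) lookup, so deduplicating all path segments stays O(n) overall.
  const connectionsById = new Map<string, Connection>();

  for (const path of paths) {
    for (let i = 1; i < path.length; i++) {
      const source = path[i - 1];
      const destination = path[i];
      const key = `${source.id} -> ${destination.id}`;
      if (!connectionsById.has(key)) {
        connectionsById.set(key, { source, destination });
      }
    }
  }

  return Array.from(connectionsById.values());
}
```

The local mutation of a single Map replaces the previous pattern of rebuilding an array and scanning it for duplicates on every insertion, which is what made the old approach O(n²).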
[APM] Circuit breaker and perf improvements for service map (#159883) (cherry picked from commit 1a9b241)
[APM] Circuit breaker and perf improvements for service map (#159883) (#160060)

# Backport

This will backport the following commits from `main` to `8.8`:

- [[APM] Circuit breaker and perf improvements for service map (#159883)](#159883)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Dario Gieselaar <[email protected]>
Verified that I can't take a cluster down under the same conditions as before the circuit breaker.
We've seen reports that the scripted metric agg we use for service maps can cause Elasticsearch to OOM if the number of events is high. We can prevent this by setting a maximum number of events to be processed and throwing an error if that limit is exceeded. It's quite hard to return anything meaningful if we can't guarantee that we've seen the entire data set, and given that we'll be moving away from the scripted metric agg sooner rather than later, I'd suggest we just fail when the limit is hit and make the limit configurable. One document takes up about 1 KB of memory.
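For illustration, here is a hedged sketch combining the two guards discussed in this thread: a shard-level `terminate_after` on the search request (the approach the fix above takes) and the event-count limit proposed here, expressed as a counter in the scripted metric agg's map script that throws once the limit is exceeded. The index pattern, field names, script bodies, and `maxEvents` parameter are illustrative assumptions rather than the actual Kibana query.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative limits; the configurable defaults are described in the PR above.
const terminateAfter = 100_000; // cf. xpack.apm.serviceMapTerminateAfter
const maxEvents = 100_000; // hypothetical hard limit proposed in this issue

async function fetchServiceMapEvents(traceIds: string[]) {
  return client.search({
    index: 'traces-apm-*', // assumed index pattern
    size: 0,
    // Shard-level circuit breaker: each shard stops collecting after this many docs.
    terminate_after: terminateAfter,
    query: { terms: { 'trace.id': traceIds } },
    aggs: {
      service_map: {
        scripted_metric: {
          params: { maxEvents },
          init_script: 'state.eventCount = 0; state.events = [];',
          // Count every mapped document and fail fast rather than letting the
          // node accumulate events until it runs out of heap.
          map_script: `
            state.eventCount += 1;
            if (state.eventCount > params.maxEvents) {
              throw new IllegalArgumentException('Too many events for the service map');
            }
            // ... collect whatever fields are needed to build connections ...
          `,
          combine_script: 'return state;',
          reduce_script: 'return states;',
        },
      },
    },
  });
}
```

With this shape, exceeding the limit surfaces as a search error that the UI can turn into a "too much data" message, instead of an Elasticsearch node going down.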