[APM] limit service map scripted metric agg based on shard count (#186417)

## Summary

#179229

This PR addresses the need to limit the amount of data that the scripted metric aggregation in the service map processes in a single request, which can lead to timeouts and OOMs, often resulting in the user seeing [parent circuit breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#parent-circuit-breaker) errors and no service map visualization. This query can fire up to 20 times, depending on how many trace ids are fetched in the subsequent query, contributing further to exceeding the total allowable memory.

These changes will not remove the possibility of OOMs or circuit breaker errors. They do not account for multiple users or other processes running in Kibana; rather, they replace the current behavior of querying for an unknown number of documents with a hard limit and an easy way to tweak that limit.

## Changes

- Make `get_service_paths_from_trace_ids` "shard aware" by adding an initial query, `get_trace_ids_shard_data`, without the aggregations and with only the trace id filter and other filters, in order to see how many shards were searched
- Use a baseline of 2,576,980,377 bytes max from the new config `serverlessServiceMapMaxAvailableBytes` for all `get_service_paths_from_trace_ids` queries when hitting `/internal/apm/service-map`
- Calculate how many docs we should retrieve per shard, set that as `terminateAfter`, and also enforce it in the map phase to ensure we never send more than this number to reduce
- The calculation is `((serverlessServiceMapMaxAvailableBytes / average document size) / totalRequests) / numberOfShards` (see the sketch at the end of this description). For example:
  - 2,576,980,377 bytes / 495 bytes average doc size = 5,206,020 total docs
  - 5,206,020 total docs / 10 requests = 520,602 docs per query
  - 520,602 docs per query / 3 shards = **173,534 docs per shard**
  - Since 173,534 is greater than the default setting `serviceMapTerminateAfter`, docs per shard is capped at 100k
- Ensure that the `map_script` phase won't process duplicate events
- Refactor the `processAndReturnEvent` function to replace recursion with a loop, mitigating the risk of stack overflow and excessive memory consumption when processing deep trees

## Testing

### Testing that the scripted metric agg query does not exceed the request circuit breaker

- Start Elasticsearch with default settings
- On `main`, without these changes, update the request circuit breaker limit to 2mb:
  ```
  PUT /_cluster/settings
  {
    "persistent": {
      "indices.breaker.request.limit": "2mb"
    }
  }
  ```
- Run synthtrace: `node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`
- Go to the service map; you should see this error:
  <img width="305" alt="Screenshot 2024-06-20 at 2 41 18 PM" src="https://github.com/elastic/kibana/assets/1676003/517709e5-f5c0-46bf-a06f-5817458fe292">
- Check out this PR
- Set the APM Kibana setting to 2mb (binary): `xpack.apm.serverlessServiceMapMaxAvailableBytes: 2097152`. This represents the available space for the [request circuit breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker), since we aren't grabbing that value dynamically.
- Navigate to the service map; you should not get the error, and the service map should appear

---------

Co-authored-by: Carlos Crespo <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
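
For reference, here is a minimal TypeScript sketch of the per-shard limit calculation described under Changes. It is illustrative only: the function and parameter names (`calculateDocsPerShard`, `averageDocSizeBytes`, etc.) are assumptions and do not necessarily match the actual code in this PR, and the 100k `serviceMapTerminateAfter` ceiling is taken from the example above.

```ts
// Assumed default for serviceMapTerminateAfter, per the example in the description.
const DEFAULT_SERVICE_MAP_TERMINATE_AFTER = 100_000;

interface DocsPerShardParams {
  serverlessServiceMapMaxAvailableBytes: number; // memory budget for all requests
  averageDocSizeBytes: number; // estimated size of one event document
  totalRequests: number; // number of get_service_paths_from_trace_ids requests
  numberOfShards: number; // shard count from the initial get_trace_ids_shard_data query
  serviceMapTerminateAfter?: number; // existing per-shard ceiling
}

function calculateDocsPerShard({
  serverlessServiceMapMaxAvailableBytes,
  averageDocSizeBytes,
  totalRequests,
  numberOfShards,
  serviceMapTerminateAfter = DEFAULT_SERVICE_MAP_TERMINATE_AFTER,
}: DocsPerShardParams): number {
  // Total number of documents that fit in the available memory budget.
  const totalDocs = Math.floor(serverlessServiceMapMaxAvailableBytes / averageDocSizeBytes);
  // The budget is shared across all scripted metric agg requests for one service map load.
  const docsPerRequest = Math.floor(totalDocs / totalRequests);
  // terminate_after applies per shard, so divide by the shard count.
  const docsPerShard = Math.floor(docsPerRequest / numberOfShards);
  // Never exceed the existing serviceMapTerminateAfter ceiling.
  return Math.min(docsPerShard, serviceMapTerminateAfter);
}

// Example from the description: ~2.4 GB budget, 495-byte docs, 10 requests, 3 shards
// => 173,534 docs per shard, capped at the 100k serviceMapTerminateAfter default.
calculateDocsPerShard({
  serverlessServiceMapMaxAvailableBytes: 2_576_980_377,
  averageDocSizeBytes: 495,
  totalRequests: 10,
  numberOfShards: 3,
});
```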