[APM] limit service map scripted metric agg based on shard count (#18…
…6417)

## Summary

#179229

This PR limits the amount of data that the service map's scripted metric
aggregation processes in a single request. Unbounded requests can lead to
timeouts and OOMs, and users often see [parent circuit
breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#parent-circuit-breaker)
errors instead of the service map visualization. The query can fire up to
20 times, depending on how many trace ids are fetched in the subsequent
query, which further increases the risk of exceeding the total allowable
memory.

These changes will not remove the possibility of OOMs or circuit breaker
errors. They don't account for multiple users or other processes running
in Kibana. Instead, they replace the current behavior of querying an
unknown number of documents with a hard limit and an easy way to tweak
that limit.

## Changes
- Make `get_service_paths_from_trace_ids` "shard aware" by adding an
initial query, `get_trace_ids_shard_data`, that has no aggregations and
only the trace id filter and other filters, in order to see how many
shards were searched
- Use a baseline of 2_576_980_377 bytes max from the new config
`serviceMapMaxAllowableBytes` for all
`get_service_paths_from_trace_ids` queries when hitting
`/internal/apm/service-map`
- Calculate how many docs we should retrieve per shard, set that as
`terminateAfter`, and also enforce it in the map phase to ensure we never
send more than this number to the reduce phase
  - Calculation is: ((serviceMapMaxAllowableBytes / average document
  size) / totalRequests) / numberOfShards
  - Eg:
    - 2_576_980_377 / 495 avg doc size = 5,206,020 total docs
    - 5,206,020 total docs / 10 requests = 520,602 docs per query
    - 520,602 docs per query / 3 shards = **173,534 docs per shard**
    - Since 173,534 is greater than the default setting
    `serviceMapTerminateAfter`, docs per shard is 100k
- Ensure that the `map_script` phase won't process duplicate events
- Refactor the `processAndReturnEvent` function to replace recursion
with a loop to mitigate risks of stack overflow and excessive memory
consumption when processing deep trees
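
The shard-aware limit described in the list above can be sketched roughly as follows. This is an illustration, not the code from this PR: `ShardInfo`, `Budget`, and `docsPerShardLimit` are hypothetical names, the sketch assumes the shard count is read from the `_shards.total` field of the initial search response, and the 100,000 cap stands in for the default `serviceMapTerminateAfter` mentioned above.

```typescript
// Hypothetical shape: only the `_shards` part of an Elasticsearch search response.
interface ShardInfo {
  _shards: { total: number; successful: number; skipped: number; failed: number };
}

// Hypothetical bundle of the inputs named in the description.
interface Budget {
  maxAllowableBytes: number; // byte budget shared by all requests
  avgDocSizeInBytes: number; // assumed average document size
  numOfRequests: number; // how many times the query may fire
  terminateAfterCap: number; // existing per-shard cap (assumed 100k by default)
}

function docsPerShardLimit(response: ShardInfo, budget: Budget): number {
  const totalShards = response._shards.total;
  if (
    totalShards <= 0 ||
    budget.maxAllowableBytes <= 0 ||
    budget.avgDocSizeInBytes <= 0 ||
    budget.numOfRequests <= 0
  ) {
    throw new Error('all parameters must be > 0');
  }
  // Split the byte budget across requests, convert to docs, then split across shards.
  const bytesPerRequest = Math.floor(budget.maxAllowableBytes / budget.numOfRequests);
  const docsPerRequest = Math.floor(bytesPerRequest / budget.avgDocSizeInBytes);
  const docsPerShard = Math.floor(docsPerRequest / totalShards);
  // Never exceed the existing per-shard cap.
  return Math.min(docsPerShard, budget.terminateAfterCap);
}

const exampleResponse: ShardInfo = {
  _shards: { total: 3, successful: 3, skipped: 0, failed: 0 },
};

const exampleLimit = docsPerShardLimit(exampleResponse, {
  maxAllowableBytes: 2_576_980_377,
  avgDocSizeInBytes: 495,
  numOfRequests: 10,
  terminateAfterCap: 100_000,
});
console.log(exampleLimit); // 100000 (the uncapped value would be 173534)
```

Clamping with `Math.min` matches the worked example: the byte budget only lowers the per-shard limit when it is tighter than the existing cap.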

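The last two items in the list above (skipping duplicate events and replacing recursion with a loop) describe a general technique that can be illustrated outside the aggregation script. The sketch below is not the `processAndReturnEvent` implementation from this PR: `TraceEvent`, `indexEvents`, and `buildPath` are hypothetical names, and the event shape is an assumption made for the example.

```typescript
// Hypothetical event shape; the real script operates on Elasticsearch documents.
interface TraceEvent {
  id: string;
  parentId?: string;
  serviceName: string;
}

// Index events by id, keeping only the first occurrence of each id
// so duplicates are never processed twice.
function indexEvents(events: TraceEvent[]): Map<string, TraceEvent> {
  const byId = new Map<string, TraceEvent>();
  for (const event of events) {
    if (!byId.has(event.id)) {
      byId.set(event.id, event);
    }
  }
  return byId;
}

// Walk from an event up to its root with a loop instead of recursion.
// The `seen` set also stops the walk if the parent links form a cycle.
function buildPath(eventId: string, eventsById: Map<string, TraceEvent>): string[] {
  const path: string[] = [];
  const seen = new Set<string>();
  let current = eventsById.get(eventId);
  while (current !== undefined && !seen.has(current.id)) {
    seen.add(current.id);
    path.push(current.serviceName);
    current = current.parentId === undefined ? undefined : eventsById.get(current.parentId);
  }
  // The walk collected child-to-root order; return root-to-child.
  return path.reverse();
}

const indexed = indexEvents([
  { id: 'a', serviceName: 'frontend' },
  { id: 'b', parentId: 'a', serviceName: 'checkout' },
  { id: 'b', parentId: 'a', serviceName: 'checkout' }, // duplicate, ignored
  { id: 'c', parentId: 'b', serviceName: 'payments' },
]);
console.log(buildPath('c', indexed)); // [ 'frontend', 'checkout', 'payments' ]
```

Because the loop keeps its state in local variables, the depth of the tree no longer translates into call-stack depth.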

## Testing

### Testing that the scripted metric agg query does not exceed the request circuit breaker
- start Elasticsearch with default settings
- on `main`, without these changes, update the request circuit breaker
limit to 2mb:
```
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "2mb"
  }
}
```
- run synthtrace: `node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean`
- Go to the service map, and you should see this error:
<img width="305" alt="Screenshot 2024-06-20 at 2 41 18 PM"
src="https://github.com/elastic/kibana/assets/1676003/517709e5-f5c0-46bf-a06f-5817458fe292">

- check out this PR
- set the APM Kibana setting to 2mb (binary):
`xpack.apm.serviceMapMaxAllowableBytes: 2097152`. This
represents the available space for the [request circuit
breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker),
since we aren't grabbing that dynamically.
- navigate to the service map; the error should no longer appear and the
service map should render

---------

Co-authored-by: Carlos Crespo <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
3 people authored Jul 2, 2024
1 parent dbd1334 commit 75874ca
Showing 6 changed files with 314 additions and 137 deletions.
1 change: 1 addition & 0 deletions x-pack/plugins/observability_solution/apm/server/index.ts
@@ -26,6 +26,7 @@ const configSchema = schema.object({
  serviceMapFingerprintGlobalBucketSize: schema.number({
    defaultValue: 1000,
  }),
  serviceMapMaxAllowableBytes: schema.number({ defaultValue: 2_576_980_377 }), // 2.4GB
  serviceMapTraceIdBucketSize: schema.number({ defaultValue: 65 }),
  serviceMapTraceIdGlobalBucketSize: schema.number({ defaultValue: 6 }),
  serviceMapMaxTracesPerRequest: schema.number({ defaultValue: 50 }),
@@ -0,0 +1,31 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { calculateDocsPerShard } from './calculate_docs_per_shard';

describe('calculateDocsPerShard', () => {
  it('calculates correct docs per shard', () => {
    expect(
      calculateDocsPerShard({
        serviceMapMaxAllowableBytes: 2_576_980_377,
        avgDocSizeInBytes: 495,
        totalShards: 3,
        numOfRequests: 10,
      })
    ).toBe(173534);
  });
  it('handles zeros', () => {
    expect(() =>
      calculateDocsPerShard({
        serviceMapMaxAllowableBytes: 0,
        avgDocSizeInBytes: 0,
        totalShards: 0,
        numOfRequests: 0,
      })
    ).toThrow('all parameters must be > 0');
  });
});
@@ -0,0 +1,34 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

interface Params {
  serviceMapMaxAllowableBytes: number;
  avgDocSizeInBytes: number;
  totalShards: number;
  numOfRequests: number;
}

export const calculateDocsPerShard = ({
  serviceMapMaxAllowableBytes,
  avgDocSizeInBytes,
  totalShards,
  numOfRequests,
}: Params): number => {
  if (
    serviceMapMaxAllowableBytes <= 0 ||
    avgDocSizeInBytes <= 0 ||
    totalShards <= 0 ||
    numOfRequests <= 0
  ) {
    throw new Error('all parameters must be > 0');
  }
  const bytesPerRequest = Math.floor(serviceMapMaxAllowableBytes / numOfRequests);
  const totalNumDocsAllowed = Math.floor(bytesPerRequest / avgDocSizeInBytes);
  const numDocsPerShardAllowed = Math.floor(totalNumDocsAllowed / totalShards);

  return numDocsPerShardAllowed;
};