feat: add memory limiter to drop data when a soft limit is reached #1827
+81
−18
Problem
At the moment, if there is pressure in the pipeline for any reason and batches fail to export, they start building up in the queues of the collector exporter, and memory grows without bound.
Since we don't set any memory request or limit on the node collectors DaemonSet, the collectors will just go on consuming more and more of the available memory on the node.
Levels of Protections
To prevent the above issues, we apply a few levels of protection, listed from first line of defense to last resort:
352MiB. At this point, the Go runtime GC should kick in and start reclaiming memory aggressively.
384MiB. When heap allocations reach this amount, the collector starts dropping batches of data after they are exported from the batch processor, instead of streaming them down the pipeline.
512MiB. When the heap reaches this number, a forced GC is performed.
256MiB. This ensures we have at least this amount of memory to handle normal traffic, plus some slack for spikes, without running into OOM. The rest of the memory is consumed from the available memory on the node, which can be handy for more buffering but may also cause OOM if the node has no resources to spare.
Future Work
Add configuration options to set these values, preferably as a spectrum of trade-offs: "resource-stability", "resource-spikecapacity"
Drop the data as it is received, not after it is batched - Feature Request: Memory Limiter Processor opt-in configuration to drop data instead of refusing it open-telemetry/opentelemetry-collector#11726
Drop data at the receiver once that is implemented in the collector - Applying memory_limiter extension open-telemetry/opentelemetry-collector#9591
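Illustration
The thresholds listed under Levels of Protections can be read as a simple decision on the current heap size. The following is only a minimal sketch of that reading, not the code in this PR: the constant names, the `Action` type and the `decide` function are hypothetical. It also assumes that the 352MiB figure is applied as the Go runtime soft memory limit (via `debug.SetMemoryLimit`), and that batches keep being dropped once the forced-GC level is reached.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

const mib = 1 << 20

// Threshold values come from the PR description; the names are hypothetical.
const (
	gcPressureBytes = 352 * mib // Go GC starts reclaiming aggressively (assumed: runtime memory limit)
	dropBytes       = 384 * mib // batches are dropped instead of sent down the pipeline
	forcedGCBytes   = 512 * mib // a forced GC is performed
)

// Action is the protection applied for a given heap size.
type Action int

const (
	Pass        Action = iota // stream batches down the pipeline as usual
	DropBatches               // drop batches coming out of the batch processor
	ForceGC                   // force a GC (this sketch assumes dropping continues)
)

func (a Action) String() string {
	switch a {
	case DropBatches:
		return "drop-batches"
	case ForceGC:
		return "force-gc"
	default:
		return "pass"
	}
}

// decide maps the current heap allocation to the strongest applicable action.
func decide(heapBytes uint64) Action {
	switch {
	case heapBytes >= forcedGCBytes:
		return ForceGC
	case heapBytes >= dropBytes:
		return DropBatches
	default:
		return Pass
	}
}

func main() {
	// Assumption: the first protection level is the runtime's soft memory limit.
	debug.SetMemoryLimit(gcPressureBytes)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	action := decide(ms.HeapAlloc)
	if action == ForceGC {
		runtime.GC()
	}
	fmt.Printf("heap=%dMiB action=%s\n", ms.HeapAlloc/mib, action)
}
```

In the actual collector, the drop and forced-GC behavior is handled by the memory limiter in the pipeline rather than by hand-written checks like these; the sketch only shows how the numbers relate to each other.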