feat: add memory limiter to drop data when a soft limit is reached #1827

Conversation

blumamir
Collaborator

Problem

At the moment, if there is pressure in the pipeline for any reason and batches fail to export, they start building up in the queues of the collector exporter and memory grows unboundedly.

Since we don't set any memory request or limit on the node collector DaemonSet, the collectors will just go on to consume more and more of the available memory on the node:

  1. They will show a spike in resource consumption in the cluster metrics.
  2. They will starve other pods on the same node, which now have less spare memory to grow into.
  3. If the issue is not transient, memory will keep increasing over time.
  4. The data sitting in the retry buffers will keep the CPU busy attempting to retry the rejected or unsuccessful batches.

Levels of Protection

To prevent the above issues, we apply a few levels of protection, listed from first line of defense to last resort:

  1. Setting GOMEMLIMIT to 352MiB (currently a hardcoded constant). At this point, the Go runtime GC should kick in and start reclaiming memory aggressively.
  2. Setting the otel collector soft limit to 384MiB (currently a hardcoded constant). When heap allocations reach this amount, the collector starts dropping batches of data as they leave the batch processor, instead of streaming them down the pipeline.
  3. Setting the otel collector hard limit to 512MiB. When the heap reaches this number, a forced GC is performed.
  4. Setting the memory request to 256MiB. This ensures we have at least this amount of memory to handle normal traffic, plus some slack for spikes, without running into OOM. The rest of the memory is consumed from what is available on the node, which comes in handy for extra buffering, but may also cause OOM kills if the node has no free resources.
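The four levels above can be sketched as a collector config plus a container spec fragment. This is a minimal illustration based on the upstream memory_limiter processor fields (limit_mib, spike_limit_mib, where the soft limit is limit_mib minus spike_limit_mib) and a standard Kubernetes container spec; the actual Odigos manifests may wire these values differently:

```yaml
# Collector config sketch: the memory_limiter processor covers levels 2 and 3.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # hard limit: forced GC when the heap reaches 512MiB
    spike_limit_mib: 128  # soft limit = 512 - 128 = 384MiB: start dropping data
```

```yaml
# Container spec sketch: levels 1 and 4.
env:
  - name: GOMEMLIMIT
    value: "352MiB"       # Go runtime starts aggressive GC past this point
resources:
  requests:
    memory: 256Mi         # guaranteed baseline; no memory limit is set,
                          # so spikes can borrow free memory from the node
```

Note that the thresholds are ordered deliberately: GOMEMLIMIT (352MiB) fires first, then the soft limit (384MiB), then the hard limit (512MiB), so cheaper mitigations run before data is dropped.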

Future Work
