
Link healthcheckextension with memory limiter rejecting spans #30168

Closed
cskinfill opened this issue Dec 21, 2023 · 7 comments
Labels: closed as inactive, enhancement (New feature or request), extension/healthcheck (Health Check Extension), Stale

Comments

@cskinfill

Component(s)

extension/healthcheck

Is your feature request related to a problem? Please describe.

The memory limiter processor is rejecting spans. I would like to set up a readiness probe for the pod so that when the collector starts rejecting spans, the probe fails. This will stop more spans from coming to the pod until it can recover, and should cause clients to send spans to other pods.

Describe the solution you'd like

Have the health check extension fail if spans are being rejected.
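
As a rough sketch of the setup this request implies (ports, limits, and the pipeline below are illustrative defaults, not taken from this issue), the collector would run the memory_limiter processor alongside the health_check extension, and the pod's readiness probe would point at the extension's endpoint; the ask is that this endpoint start failing when the memory limiter begins refusing spans:

```yaml
# Illustrative collector config: memory_limiter in the traces pipeline,
# health_check extension exposing an HTTP endpoint for Kubernetes probes.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133        # extension's default port

receivers:
  otlp:
    protocols:
      http: {}

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch: {}

exporters:
  debug: {}                        # stand-in for the real exporter

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
---
# Illustrative pod spec fragment: readiness probe against the health check
# extension. Today this only reflects whether the collector is running; it
# does not fail when the memory limiter rejects spans, which is the request.
readinessProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 5
```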

Describe alternatives you've considered

A readiness probe that scrapes the self metrics of the otel collector pod to see if there are rejected spans.
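
For comparison, a minimal sketch of that alternative, assuming the collector's internal telemetry is scraped from the default Prometheus endpoint on port 8888 and keyed off the standard otelcol_processor_refused_spans counter. It also assumes the container image ships a shell and wget, which the stock collector images do not, and since the counter is cumulative a naive non-zero check like this would keep the pod unready forever:

```yaml
# Illustrative exec readiness probe that fails once any spans have been refused.
# Caveats: needs a shell + wget in the image, and the refused-spans counter is
# cumulative, so real logic would have to look at its rate rather than the total.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - '! wget -qO- http://localhost:8888/metrics | grep -Eq "otelcol_processor_refused_spans(_total)?({[^}]*})? [1-9]"'
  periodSeconds: 10
```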

Additional context

No response

cskinfill added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Dec 21, 2023
github-actions bot added the extension/healthcheck (Health Check Extension) label on Dec 21, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@SophieDeBenedetto

👋 hello! At GitHub, we've been running into this issue and are very interested in seeing this change funded 😄

We observe the following:

  • Some of our collectors were hitting a GC spiral due to memory pressure
  • Requests to those collectors continued to come in, causing increased latency on 2xx responses followed by an eventual increase in 5xx responses from those instances

We expected the readinessProbe to fail when the memory limiter started causing the collector to drop or refuse spans, but the probe continued to succeed, so more traffic kept being sent to those instances, resulting in the rise in 5xx responses to requests coming through the OTLP/HTTP receiver.

We’re interested in this behavior:

  • When a collector instance is experiencing memory pressure
  • Then the collector instance should not be considered healthy and should not continue to take traffic

I know we chatted about this a bit in Slack, but I thought I'd throw a comment here for transparency.

Thanks and do let us know if this is on the radar for fixing anytime soon!

@atoulme
Contributor

atoulme commented Apr 5, 2024

Please take a look at #30673 as it might offer a fix for this issue. We could use help to review and try this out.
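
For reference, #30673 adds an opt-in V2 mode of the health check extension that can factor component status into the readiness response. A sketch of the kind of configuration involved (field names are approximate, based on that proposal, and should be checked against the PR rather than treated as final):

```yaml
# Approximate shape of the proposed V2 health check configuration.
extensions:
  health_check:
    use_v2: true
    component_health:
      include_permanent_errors: true
      include_recoverable_errors: true
      recovery_duration: 5m
    http:
      endpoint: 0.0.0.0:13133
      status:
        enabled: true
        path: /health/status
```

With something like this in place, the remaining gap is the one described in the next comment: the memory limiter still has to report its error status for the extension to act on it.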

atoulme removed the needs triage (New item requiring triage) label on Apr 5, 2024
@mwear
Member

mwear commented Apr 8, 2024

I think that the foundations are in place to solve this problem, but the problem is likely not solved as is. #30673 introduces a version of the healthcheck extension based on component status reporting, which is a prerequisite. The next piece would be to update the memory limiter to report error statuses (via component status reporting) when it detects problematic conditions and to clear them when they resolve.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jun 10, 2024
Contributor

github-actions bot commented Aug 9, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 9, 2024