Traces/metrics not reported after a while using OTLP exporter #3392

Closed
Meemaw opened this issue Jul 7, 2023 · 15 comments

Meemaw commented Jul 7, 2023

Describe the bug
After a while, traces and some metrics stop being reported by the router when using the OTLP exporter with Datadog. The timing varies, but it is usually a few hours. When this happens, traces are always missing, while some metrics are still reported and others are not.

Example of metrics that are still reported:

  • apollo_router_http_requests_total

Example of metrics that disappear:

  • apollo_router_processing_time
  • apollo_router_session_count_total
  • apollo_router_cache_size
  • apollo_router_query_planning_time
  • ...

This happens on the latest version, but has been happening for a long time (at least half a year). I suspect this is a bug in the router, because restarting the deployment always fixes the issue.

It's obviously hard to reproduce this locally, so this issue is mostly for tracking and for finding out whether anyone else is experiencing similar problems.

garypen self-assigned this Jul 11, 2023

garypen commented Jul 20, 2023

@Meemaw We recently (1.24.0) added support for configuring temporality for our OTLP metrics. This is designed for use with metrics ingesters such as Datadog, which prefer delta aggregation to cumulative (the OTel default).

It may resolve the issues you are seeing. Sample configuration fragment:

...
        metrics:
          otlp:
            temporality: delta
... 

If you could try this and let us know if things improve, that would be helpful. As you note, it is difficult to debug/track, but we have found this improves reporting in the testing we have managed to perform.
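
For context, here is a fuller sketch of where that fragment sits in the router configuration. This is only a sketch, not the exact configuration discussed above: the endpoint value is a placeholder for whatever OTLP-capable agent or collector you run (for example, a Datadog Agent listening for gRPC on port 4317).

telemetry:
  metrics:
    otlp:
      # Placeholder address for your OTLP agent/collector; adjust to your environment.
      endpoint: "http://my-otel-agent:4317"
      # Use delta aggregation instead of the cumulative OTel default,
      # which ingesters such as Datadog prefer.
      temporality: delta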

Meemaw commented Jul 20, 2023

@garypen I'll try it and report how it goes.

There is actually one thing I noticed after the upgrade: the apollo_router_session_count_total metric is now reporting differently, showing negative values at times.

[Screenshot, 2023-07-20: apollo_router_session_count_total graph showing negative values]

garypen commented Jul 21, 2023

I think that's a different problem, because I've noticed that the apollo_router_session_count values are odd even with cumulative temporality. I agree that it's more obvious with delta. I'll file an issue for that.

garypen commented Jul 25, 2023

see: #3485

abernix commented Aug 9, 2023

Any updates on this, @Meemaw ? 😄

Meemaw commented Aug 9, 2023

@abernix we still see metrics/traces disappearing after a while on v1.26.0.

garypen commented Aug 10, 2023

@Meemaw That's disappointing. We have been using 1.26.0 with delta temporality successfully with Datadog for the last couple of weeks.

Meemaw commented Aug 10, 2023

@garypen That's only relevant for metrics, right? We're also not seeing traces, which shouldn't be affected by that change.

This is our config (in case you see anything wrong):

telemetry:
  metrics:
    common:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"
      temporality: delta
  tracing:
    trace_config:
      service_name: "${env.DD_SERVICE:-graphql-federation}"
      service_namespace: "${env.DD_ENV:-development}"
      sampler: "${env.DD_TRACE_SAMPLE_RATE:-1}"
      parent_based_sampler: true
      attributes:
        version: "${env.DD_VERSION:-development}"
    otlp:
      endpoint: "http://${env.DD_AGENT_HOST:-datadog}:4317"

We have some other services that use the OTLP gRPC endpoint, and they work without issues.

garypen commented Aug 10, 2023

@Meemaw It is only relevant for metrics, but you wrote: "we still see metrics/traces disappearing after a while on v1.26.0." so I was commenting on the metrics part of that. I should probably have made that clear.

I can't see anything wrong with your config.

Just out of interest, are any of your other functional services written in Rust and using the opentelemetry-rust crate?

Meemaw commented Aug 10, 2023

> Just out of interest, are any of your other functional services written in Rust and using the opentelemetry-rust crate?

No, others are in Go.

Meemaw commented Aug 10, 2023

@garypen Another observation: metrics emitted by us (in a custom Rust plugin) do not disappear.

abernix commented Aug 24, 2023

This is blocked until #3601 is done, so track that one first if you're curious about progress. ;)

garypen commented Aug 29, 2023

One other observation: I have noticed that if there is no activity around a particular metric for a while, for whatever reason, our Datadog widget just stops reporting data. It's as though it is waiting for more data to arrive before it resumes graphing. Could this be part of the problem you are seeing, @Meemaw? In other words, rather than previously visible metrics disappearing, perhaps they suddenly stop being updated and are then, maybe, updated again later.

Meemaw commented Aug 29, 2023

> Could this be part of the problem you are seeing, @Meemaw? In other words, rather than previously visible metrics disappearing, perhaps they suddenly stop being updated and are then, maybe, updated again later.

By "no activity", do you mean the router having no traffic and therefore not emitting metrics? We have constant, high-RPS traffic, so that would not be the case.

Meemaw commented Oct 23, 2023

@abernix @garypen This seems to be fixed in newer versions of the router 🎉

Meemaw closed this as completed Oct 23, 2023