datadog exporter ignores attributes #2066

Closed
Tracked by #2878
deweyjose opened this issue Nov 8, 2022 · 11 comments · Fixed by #3196 or #3421
@deweyjose
Contributor

Describe the bug
The Datadog exporter does not apply attributes configured in Router.yaml to spans. Our team needs to set the version tag at the root of all traces exported to Datadog for Deployment tracking.

To Reproduce
Steps to reproduce the behavior:

  tracing:
    trace_config:
      service_name: "${env.DD_SERVICE:-rusty-router}"
      attributes:
        version: "${env.DD_VERSION:-test}"
    propagation:
      datadog: true
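
For readers unfamiliar with the `${env.NAME:-default}` form used in the config above, here is a minimal Python sketch of the substitution as I understand it: use the environment variable when it is set, otherwise fall back to the text after `:-`. The `expand` helper and its regex are hypothetical and only illustrate the idea; this is not the router's implementation, and whether an empty (but set) variable also triggers the default is not covered in this thread.

```python
import re

# Hypothetical pattern for the ${env.NAME:-default} form seen in the config above.
_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")

def expand(value, environ):
    """Replace each ${env.NAME:-default} with environ[NAME], or the default if NAME is unset."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")
    return _PATTERN.sub(substitute, value)

# With DD_VERSION unset, the attribute value falls back to "test".
print(expand("${env.DD_VERSION:-test}", {}))
# With DD_VERSION set, the environment value wins.
print(expand("${env.DD_VERSION:-test}", {"DD_VERSION": "1.4.2"}))
```

So with `DD_VERSION` unset, the expected `version` attribute in this reproduction would be `test`.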

Expected behavior
We expect all traces to include a tag named version at the root of each span.

Output
We do not see the version tag.

Additional context

As a workaround we added a supergraph plugin that adds a span explicitly for this purpose, but it seems wasteful. It also does not help when Router receives invalid GraphQL requests that cause 400s, since those short-circuit any plugins we write, so our version spans are not applied.

@bnjjj
Contributor

bnjjj commented Nov 9, 2022

related to #1162

@BrynCooke
Contributor

This is fixed in the otel upgrade branch but we need a new otel release.

@lennyburdette
Contributor

The fix for the DD exporter is in an unreleased version of open-telemetry-datadog, but if you're building a custom binary you can apply a patch in your Cargo.toml:

router/Cargo.toml

Lines 43 to 49 in 3bb8a67

# TODO: to delete
# [patch.crates-io]
# opentelemetry = { git = "https://github.com/open-telemetry/opentelemetry-rust.git", rev = "e5ef3552efab2bdbf2f838023c37461cd799ab2c"}
# opentelemetry-http = { git = "https://github.com/open-telemetry/opentelemetry-rust.git", rev = "e5ef3552efab2bdbf2f838023c37461cd799ab2c"}
# opentelemetry-jaeger = { git = "https://github.com/open-telemetry/opentelemetry-rust.git", rev = "e5ef3552efab2bdbf2f838023c37461cd799ab2c"}
# opentelemetry-zipkin = { git = "https://github.com/open-telemetry/opentelemetry-rust.git", rev = "e5ef3552efab2bdbf2f838023c37461cd799ab2c"}
# opentelemetry-datadog = { git = "https://github.com/open-telemetry/opentelemetry-rust.git", rev = "e5ef3552efab2bdbf2f838023c37461cd799ab2c"}

I confirmed that the patches work.

An alternative is to configure your DD agent to listen for OTLP traces.

# datadog agent config
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

# router config
telemetry:
  tracing:
    trace_config:
      attributes:
        version: "2"
        env: "dev"
        test: "42"
    otlp:
      endpoint: 0.0.0.0:4317
      protocol: grpc
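
If you try the OTLP route, it can help to confirm that something is actually listening on the gRPC port before pointing the router at it. The following is a minimal sketch using only the Python standard library; the `otlp_port_open` helper name and the `127.0.0.1` default are my assumptions, and a successful TCP connection only shows that the port is open, not that the agent accepts OTLP traces.

```python
import socket

def otlp_port_open(host="127.0.0.1", port=4317, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

if __name__ == "__main__":
    print("OTLP gRPC port reachable:", otlp_port_open())
```

Run it on the host where the router runs; if it prints `False`, check the agent's `otlp_config` section and any firewall rules first.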

@abernix
Member

abernix commented Mar 21, 2023

I'll put the same comment here as I put on #1162 (comment):

This is now waiting on open-telemetry/opentelemetry-rust#965, which we're hoping comes soon.

(We are led to believe it's coming very soon. There's just maybe a patch in there somewhere that might be very relevant for Datadog, specifically.)

@abernix abernix removed the triage label Mar 21, 2023
@abernix
Member

abernix commented Mar 29, 2023

The fix for this is purportedly now in https://github.com/open-telemetry/opentelemetry-rust/releases/tag/v0.19.0. I'll see about getting the integration of that upgrade prioritized on our side.

@jmvtrinidad

@lennyburdette do you mean to uncomment the patch.crates-io and build? Thanks

@lennyburdette
Contributor

Yeah, though this might be pretty out-of-date now so I can't guarantee that it will still work.

@abernix
Member

abernix commented May 23, 2023

We are unfortunately still blocked on the OpenTelemetry update because of upstream dependencies. It is perhaps getting close though.

@abernix
Member

abernix commented May 31, 2023

Ok, in theory, this is fixed by https://github.com/tokio-rs/tracing-opentelemetry/releases/tag/v0.19.0 and we have an umbrella ticket #2878 that tracks that need (since it now unblocks many constraints, including this one, at least in theory). We're prioritizing making that update in the next week or two.

@garypen garypen self-assigned this Jun 1, 2023
@garypen
Contributor

garypen commented Jun 1, 2023

I have verified that this is fixed in my opentelemetry 0.19.0 PR: #3196

garypen added a commit that referenced this issue Jun 5, 2023
The update requires a change to the implementation and test update as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now gauges are all assumed to be as though they were created with
`false`. We had been providing `true` on our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- There are now two standard OTEL attributes:
```otel_scope_name="apollo/router",otel_scope_version=""``` added to
output and a number of tests had to be updated to accommodate that
change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.
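
To illustrate the second point in the list above, here is a small Python sketch of how a test helper could compare Prometheus output while ignoring the new `otel_scope_*` labels. The `parse_sample` and `strip_scope` helpers are hypothetical, the sample value `1` is made up, and the parser only handles the simple `name{labels} value` form; it is not how the router's tests were actually updated.

```python
import re

# Matches one label pair such as otel_scope_name="apollo/router".
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_sample(line):
    """Split 'name{k="v",...} value' into (name, labels, value)."""
    head, _, value = line.rpartition(" ")
    name, _, rest = head.partition("{")
    return name, dict(LABEL.findall(rest)), value

def strip_scope(labels):
    """Drop the otel_scope_* labels so older expectations can still be compared."""
    return {k: v for k, v in labels.items() if not k.startswith("otel_scope_")}

# The sample value 1 is made up for illustration.
line = 'apollo_router_cache_hit_count_total{otel_scope_name="apollo/router",otel_scope_version=""} 1'
name, labels, value = parse_sample(line)
print(name, labels, strip_scope(labels), value)
```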

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [x] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`
@o0Ignition0o
Contributor

Reopening the issue since we have to revert the upgrade until they release a patch. See #3242

@o0Ignition0o o0Ignition0o reopened this Jun 15, 2023
@garypen garypen mentioned this issue Jul 12, 2023
garypen added a commit that referenced this issue Jul 12, 2023
The update requires a change to the implementation and test update as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now gauges are all assumed to be as though they were created with
`false`. We had been providing `true` on our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- There are now two standard OTEL attributes:
```otel_scope_name="apollo/router",otel_scope_version=""``` added to
output and a number of tests had to be updated to accommodate that
change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.
 
Regarding that last point: the Prometheus spec mandates a naming format,
and the change was part of complying with that spec. This PR made the
change:
open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total
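
Anyone with dashboards or alerts on the old names will need to update them. As a rough sketch of the naming rule (counters carry a `_total` suffix), assuming a hypothetical `prometheus_counter_name` helper that is not part of otel or the router:

```python
def prometheus_counter_name(name):
    """Append the `_total` suffix used for counters, unless it is already present."""
    return name if name.endswith("_total") else f"{name}_total"

# The two router counters affected by the upgrade, as listed above.
for old in ("apollo_router_cache_hit_count", "apollo_router_cache_miss_count"):
    print(f"{old} -> {prometheus_counter_name(old)}")
```

This prints the same two renames listed above, and leaves a name that already ends in `_total` unchanged.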

It's good that our prometheus metrics are now spec compliant, but we
should note this in the release notes and (if possible) somewhere in our
documentation. I'll add it to the changeset at least.

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 
