Improve / formalize configuration of telemetry / logging / observability / metrics / tracing #3226

BrynCooke · 2023-06-06T15:43:45Z

Currently we have an experimental_logging section that we should bring out of experimental and extend with JSON formatting options. Several other users have also asked for much more control over what is output in both spans and logs, so it may be time to take a more holistic look.

The things that users have asked us for are:

Secrets redaction
The ability to exclude spans from logging output
The ability to disable attributes on spans
The ability to only have attributes on spans if headers are present on the root request.
Optional log events for when there are http or graphql errors
Optionally set the supergraph span to error if there were graphql errors in the response
The supergraph span is wrapped in an operation span that follows semantic conventions. Existing attributes on supergraph that come from semantic conventions are moved onto the new span.
Log levels are made easier to use (see the YAML below).

Config

Suggest the following format:

telemetry:

  tracing:
    common: # Renamed from trace_config
      max_attributes_per_event: 128
      max_attributes_per_span: 128
      max_attributes_per_link: 128
      max_events_per_span: 128
      max_links_per_span: 128
      parent_based_sampler: true
      sampler: always_on
      service_name: router
      service_namespace: "default"
      resource:
        d: e

      # Resources are otel config not represented in the yaml config


    propagation:
      baggage: false
      jaeger: false
      datadog: false
      request:
        header_name: "X-REQUEST-ID"
      trace_context: false
      zipkin: false

    otlp:
      enabled: true
      endpoint: "http://localhost:4317/v1/traces"




  metrics:
    common:
      service_namespace: "default"
      service_name: router
      buckets:
        - 0.1

      resource:
        test: foo
    prometheus:
      enabled: true
      path: /metrics
    otlp:
      enabled: true


  instruments:
    default_attribute_requirement_level: required
    router:
      http.server.active_requests: true
      my_instrument:
        value: unit
        type: counter
        unit: kb
        description: "my description"
        event: on_error
        attributes:
          http.response.status_code: false
          "my_attribute":
            response_header: "X-MY-HEADER"
            default: "unknown"
            redact: "foo"

    supergraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"
    subgraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"


  events:
    router:
      request: true
      response: false
      error: false
      test:
        message: "foo"
        level: info
        attributes:
          http.response.body.size: false


  spans:
    default_attribute_requirement_level: required
    legacy_request_span: true
    # The request span will be disappearing
    # router is the new root span
    router:
      attributes:
        dd.trace_id: false
        http.request.body.size: false
        http.response.body.size: false
        http.request.method: false
        http.request.method.original: false
        http.response.status_code: false
        network.protocol.name: false
        network.protocol.version: false
        network.transport: false
        error.type: false
        network.type: false
        trace_id: false
        user_agent.original: false
        client.address: false
        client.port: false
        http.route: false
        network.local.address: false
        network.local.port: false
        network.peer.address: false
        network.peer.port: false
        server.address: false
        server.port: false
        url.path: false
        url.query: false
        url.scheme: false
        "x-custom1":
          trace_id: datadog
        "x-custom2":
          response_header: "X-CUSTOM2"
          default: "unknown"
        "x-custom3":
          request_header: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom8":
          env: "ENV_VAR"

      #etc...
    supergraph:
      attributes:
        graphql.document: false
        graphql.operation.name: true
        graphql.operation.type: true

        "x-custom":
          query_variable: "arg1"
          default: "unknown"
          redact: ""
        "x-custom2":
          response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom6":
          operation_name: string
        "x-custom7":
          operation_name: hash
        "x-custom8":
          env: "ENV_VAR"
      #etc...
    subgraph:
      attributes:
        graphql.federation.subgraph.name: false
        graphql.operation.name: false
        graphql.operation.type: true
        "x-custom":
          subgraph_operation_name: string
          default: "unknown"
        "x-custom2":
          subgraph_response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"

Related: #1840, which also has some good ideas that have been folded into the above.
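
To make the intent of the custom attribute entries in the proposed spans config more concrete, here is a minimal Python sketch of how a single selector could be resolved to a value. It is not the router's implementation: the function name `resolve_attribute`, the dictionary shapes, and the case-insensitive header matching are assumptions made for illustration. Only the selector keys (`request_header`, `response_header`, `env`, `default`) come from the proposal, and `redact` is left out because its semantics are not spelled out here.

```python
from typing import Mapping, Optional


def resolve_attribute(
    selector: Mapping[str, str],
    request_headers: Mapping[str, str],
    response_headers: Mapping[str, str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Resolve one custom span attribute from a selector (hypothetical helper)."""
    # Assumption: header names are matched case-insensitively.
    req = {k.lower(): v for k, v in request_headers.items()}
    resp = {k.lower(): v for k, v in response_headers.items()}

    value = None
    if "request_header" in selector:
        value = req.get(selector["request_header"].lower())
    elif "response_header" in selector:
        value = resp.get(selector["response_header"].lower())
    elif "env" in selector:
        value = env.get(selector["env"])

    # Fall back to the configured default, which may itself be absent.
    if value is None:
        value = selector.get("default")
    return value


# Mirrors the "x-custom2" entry: a response header with a default.
selector = {"response_header": "X-CUSTOM2", "default": "unknown"}
found = resolve_attribute(selector, {}, {"x-custom2": "abc"}, {})
missing = resolve_attribute(selector, {}, {}, {})
```

In this sketch an attribute with no source value and no `default` resolves to `None`, which a caller could treat as "do not attach the attribute".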

Path forward

This is a fairly large set of changes and will need to be split into several tickets. Let's get agreement that this is the way forward and try to schedule it, as it is a blocker to adoption for some users and could bring major performance improvements due to reduced logging requirements.

Related issues

Implementation plan

There are a large number of items that need to be tackled. Once the new config structure has been implemented then we can start farming the tasks out to separate people.

Clean up existing config

Clean up endpoint processing: Move processing of endpoints to UriEndpoint and SocketEndpoint #3950
Make all exporters have an enabled field rather than using endpoint to activate: Add enabled field for telemetry exporters #3952
Deprecate Jaeger (Deprecating opentelemetry-jaeger open-telemetry/opentelemetry-rust#995): Improve telemetry documentation #3962
Remove use of Option in telemetry config: Remove the use of Option in telemetry code #3968
Unify resource handling #4034
Bring existing config into alignment between tracing and metrics #4043

New config structure

Add new Config with fields skipped for schema generation and serialization.
Create documentation for new tracing functionality.

Spans

Tracing

Add stdout exporter

Testing

Logging

Implement stdout logger
- Implement OTEL format
- Implement JSON format
- Implement text format

Instruments

Custom instruments defined by yaml #4319

Events

Custom events defined by yaml #4320


bnjjj · 2023-06-07T11:30:25Z

I suggest structuring format this way:

format:
    json:
      location: true | false
      filename: true | false
      line_number: true | false
      spans: true | false
    # text:
    #   location: true | false
    #   filename: true | false
    #   line_number: true | false
    #   spans: true | false

It's also related to discussion #1961.
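
As an illustration of the per-field toggles suggested for format, here is a minimal Python sketch of a JSON formatter whose output fields are switched on and off by configuration. This is an analogy only, not router code: the class name `JsonFormatter` is hypothetical, and only the `filename` and `line_number` toggles are modelled, since `location` and `spans` have no direct equivalent in Python's standard logging module.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """JSON log formatter with per-field toggles (illustrative only)."""

    def __init__(self, filename: bool = False, line_number: bool = False):
        super().__init__()
        self.filename = filename
        self.line_number = line_number

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Optional fields are only emitted when their toggle is enabled.
        if self.filename:
            entry["filename"] = record.filename
        if self.line_number:
            entry["line_number"] = record.lineno
        return json.dumps(entry)


record = logging.LogRecord(
    name="router", level=logging.INFO, pathname="/src/main.py", lineno=42,
    msg="request %s", args=("ok",), exc_info=None,
)
with_location = json.loads(
    JsonFormatter(filename=True, line_number=True).format(record)
)
without_location = json.loads(JsonFormatter().format(record))
```

Keeping the toggles off by default means the optional fields cost nothing unless they are requested.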

abernix · 2023-06-12T09:47:26Z

Should we close #1840 since its ideas have been folded into the above?

piclemx · 2023-06-21T14:56:38Z

Does this request include other destinations to send logs to?

BrynCooke · 2023-06-21T18:25:49Z

Not currently. But let's add it as this could be implemented using https://github.com/tokio-rs/tracing/tree/master/tracing-appender

kindermax · 2023-06-22T11:58:47Z

Hi, I may be wrong, but does the new logging configuration support injecting custom fields into JSON?
It would be very convenient to add a custom field to all logs, such as service: my-apollo-router.
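
The request above can be sketched in Python as a formatter that merges a fixed set of fields into every JSON log line. This is a hedged illustration of the idea rather than a proposal for the router's config: the class name `StaticFieldsJsonFormatter` is hypothetical, and letting per-record fields override static ones on a key collision is an assumption.

```python
import json
import logging
from typing import Mapping


class StaticFieldsJsonFormatter(logging.Formatter):
    """Adds the same static fields to every JSON log line (illustrative only)."""

    def __init__(self, static_fields: Mapping[str, str]):
        super().__init__()
        self.static_fields = dict(static_fields)

    def format(self, record: logging.LogRecord) -> str:
        entry = dict(self.static_fields)
        # Assumption: per-record fields win over static ones on collisions.
        entry.update({"level": record.levelname, "message": record.getMessage()})
        return json.dumps(entry)


formatter = StaticFieldsJsonFormatter({"service": "my-apollo-router"})
record = logging.LogRecord(
    name="router", level=logging.WARNING, pathname="main.py", lineno=1,
    msg="slow subgraph", args=(), exc_info=None,
)
line = json.loads(formatter.format(record))
```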

BrynCooke · 2023-06-23T07:45:18Z

Let's add this. I'll have a think about how to update the config.

Geal · 2023-06-23T08:03:39Z

Pointing out now that any initiative around logging configuration carries a huge risk of destroying performance, so this should be built carefully.

hrkfdn · 2023-06-26T07:59:15Z

Would also be cool if field names could be specified, i.e. to fit into a certain log structure, which would be necessary to make logs searchable/indexable. The example below would "rename" message to msg:

format:
    json:
        - type: spans
          name: spans
        - type: message
          name: msg

Wording of type and name is debatable of course. Also name could be optional if it's not deviating from type.
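
A minimal Python sketch of the renaming idea, assuming the list-of-fields shape from the example above: each entry selects a source field by `type` and emits it under `name`, with `name` falling back to `type` when omitted. The function `render_json_line` and the shape of the `event` dictionary are hypothetical.

```python
import json
from typing import Any, Dict, List, Mapping


def render_json_line(event: Mapping[str, Any], fields: List[Dict[str, str]]) -> str:
    """Emit only the selected fields, renamed as configured (illustrative only)."""
    out = {}
    for field in fields:
        source = field["type"]
        # `name` is optional and defaults to the source field name.
        target = field.get("name", source)
        if source in event:
            out[target] = event[source]
    return json.dumps(out)


fields = [{"type": "spans", "name": "spans"}, {"type": "message", "name": "msg"}]
event = {"message": "hello", "spans": ["router", "supergraph"], "level": "INFO"}
line = render_json_line(event, fields)
```

Fields not listed in the configuration (here `level`) are dropped, so the same mechanism covers both selecting and renaming.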

Bjohnson131 · 2023-07-31T17:52:56Z

Hello, my issue #3502 was linked here. I think that if I were to propose a location for this issue, maybe

telemetry -> redaction -> subgraph_errors (boolean)

would be the proper place? That would require you to change redaction from a list to an object.

Geal · 2023-08-29T09:48:38Z

related issues:

BrynCooke · 2023-08-29T14:37:47Z

I've spent some time thinking about this and updated the example.
I have not put in anything around customization of the JSON format. Can people who have indicated that they want this give some examples of why it is needed and what formats they are targeting? My fear is that even if we do something it won't be flexible enough to actually be useful. If there are specific well-known formats that you'd like to see then that would be different.

hrkfdn · 2023-08-29T14:44:32Z

Our use case is to follow company-wide log formats so that log lines are indexed properly. Being able to select and name fields would be nice (as outlined in #3226 (comment)), as well as populating specific fields with static values. Though I can see how the latter may be a little too specific.

yanns · 2023-08-29T14:48:20Z

> Our usecase is to follow company-wide log formats so that log lines are indexed properly. Being able to select and naming fields would be nice (as outlined in #3226 (comment)), as well as populating specific fields with static fields. Though I can see how the latter may be a little too specific.

same usecases here (naming fields + adding static fields)

BrynCooke · 2023-08-29T14:58:58Z

Can you post some examples of the formats that you are seeking to work with?

`flatten` doesn't do what we want. Implement custom deserializer and add tests. Part of #3226

Serde `flatten` doesn't do what we want it to do. This adds a custom `Deserializer` that does the right thing, first trying a custom attribute and then a standard attribute. Part of #3226
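
The try-order described above can be loosely illustrated in Python. The actual change is a Rust serde `Deserializer` whose details are not shown in this thread, so everything here is an assumption: the function name `parse_attributes`, and the rule that a mapping value means a custom attribute while a boolean means a standard attribute toggle (as in the proposed spans config).

```python
from collections.abc import Mapping
from typing import Any, Dict, Tuple


def parse_attributes(raw: Mapping) -> Tuple[Dict[str, dict], Dict[str, bool]]:
    """Split an attributes mapping into custom and standard entries (sketch)."""
    custom: Dict[str, dict] = {}
    standard: Dict[str, bool] = {}
    for name, value in raw.items():
        # Try the custom shape first (a selector mapping) ...
        if isinstance(value, Mapping):
            custom[name] = dict(value)
        # ... then fall back to the standard shape (an on/off toggle).
        elif isinstance(value, bool):
            standard[name] = value
        else:
            raise ValueError(
                f"attribute {name!r}: expected a selector mapping or a boolean"
            )
    return custom, standard


raw: Dict[str, Any] = {
    "graphql.operation.name": True,
    "x-custom": {"query_variable": "arg1", "default": "unknown"},
}
custom, standard = parse_attributes(raw)
```

Rejecting any other value type keeps a misconfigured entry from being silently ignored.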

Part of #3226

Many of the features are not implemented yet and will be enabled over time. Part of #3226

Note that this code is not currently used, but is needed for the other telemetry PR changes. Various issues with the telemetry next config were discovered during documentation. The code changes are addressed in this PR with a separate docs PR in the works. Part of #3226

BrynCooke · 2023-11-30T12:18:20Z

Just a quick update for people following this ticket.

We've spent considerable effort to make telemetry spans customizable via YAML and to follow the OTel semantic conventions. This will land in the next release.

Users with a commercial license will have fine-grained control over which attributes are present on spans and will be able to attach arbitrary attributes. Free users are able to use the less granular requirement levels that align with the OpenTelemetry semantic conventions.

Still TODO are instrument and event customization via YAML. We are hoping to get to these early next year, as they will allow users to set up conditional logging and metrics without having to reach for Rhai or a custom plugin.

Custom logging formats are also still on the table as well as exporting logs via opentelemetry bridge for collection via OTLP.

BrynCooke · 2024-05-31T09:38:32Z

Another update on this ticket.
Custom instruments and events are now possible in YAML, and users seem to be taking advantage of these new features!

One thing that is still unimplemented is redaction. So if you need this then get in touch.

oskargotte · 2024-06-28T09:09:58Z

> One thing that is still unimplemented is redaction. So if you need this then get in touch.

@BrynCooke What about #3502?
Are there any options to get the redacted subgraph errors included in the router logs but not in the GraphQL response? Or are there any plans to support it?

BrynCooke · 2024-07-19T16:27:28Z

@oskargotte We haven't had many folks asking for redaction. Some of them do redaction centrally via their logging software. We'll still need more people to ask for this feature.

Bjohnson131 · 2024-07-19T17:07:51Z

FWIW, logging redaction isn't really a standard feature for many programs that exist today, so it's understandable that this won't be supported.

This was referenced Jun 20, 2023

Router should mark fetch spans as error if the downstream response had errors #3123

Closed

Allow customization of otel error handling #3218

Open

Subgraph 400 responses do not mark the otel span status as an error #3219

Closed

BrynCooke mentioned this issue Jun 20, 2023

Improve logging experience #1840

Closed

4 tasks

BrynCooke mentioned this issue Jun 23, 2023

use the operation signature as graphql.document in spans #2703

Closed

6 tasks

BrynCooke changed the title ~~Formalize configuration of logging~~ Formalize configuration of telemetry Jul 10, 2023

chandrikas mentioned this issue Jul 10, 2023

Consolidated error rate metrics by extension field #2460

Open

This was referenced Jul 31, 2023

Errors not properly reported to Datadog #3516

Open

Add Option to redact subgraph errors in GraphQL response, but not in logs. #3502

Open

BrynCooke mentioned this issue Aug 7, 2023

Add the ability to add a value for a request header to the root span as an attribute #3543

Closed

abernix added the component/logging label Aug 7, 2023

abernix mentioned this issue Aug 7, 2023

Personal data logged out in the json format #2695

Closed

abernix changed the title ~~Formalize configuration of telemetry~~ Improve / formalize configuration of telemetry / logging / observability / metrics / tracing Aug 21, 2023

This was referenced Aug 21, 2023

Include specified correlation request header value(s) in all router log statements #3612

Closed

trace_id is not always displayed in json logs #3208

Closed

Geal mentioned this issue Aug 29, 2023

Router-Datadog spans tagging #3702

Closed

BrynCooke closed this as completed in #4061 Oct 25, 2023

BrynCooke reopened this Oct 25, 2023

BrynCooke mentioned this issue Oct 31, 2023

Fix serde deserialization for Extendable. #4123

Merged

6 tasks

BrynCooke linked a pull request Oct 31, 2023 that will close this issue

Fix serde deserialization for Extendable. #4123

Merged

6 tasks

BrynCooke pushed a commit that referenced this issue Oct 31, 2023

Fix serde deserialization for Extendable.

1ea731d

`flatten` doesn't do what we want. Implement custom deserializer and add tests. Part of #3226

BrynCooke pushed a commit that referenced this issue Oct 31, 2023

Fix serde deserialization for Extendable.

a7f1b78

`flatten` doesn't do what we want. Implement custom deserializer and add tests. Part of #3226

BrynCooke closed this as completed in #4123 Oct 31, 2023

bnjjj reopened this Oct 31, 2023

BrynCooke pushed a commit that referenced this issue Nov 2, 2023

Changes to telemetry configuration discovered during docs.

54ed87b

Part of #3226

BrynCooke mentioned this issue Nov 2, 2023

Changes to telemetry configuration discovered during docs #4133

Merged

6 tasks

BrynCooke pushed a commit that referenced this issue Nov 2, 2023

Changes to telemetry configuration discovered during docs.

12a4697

Part of #3226

BrynCooke pushed a commit that referenced this issue Nov 2, 2023

Documentation for new telemetry configuration.

791ac79

Many of the features are not implemented yet and will be enabled over time. Part of #3226

BrynCooke pushed a commit that referenced this issue Nov 2, 2023

Documentation for new telemetry configuration.

608874f

Many of the features are not implemented yet and will be enabled over time. Part of #3226

BrynCooke linked a pull request Nov 3, 2023 that will close this issue

Changes to telemetry configuration discovered during docs #4133

Merged

6 tasks

BrynCooke closed this as completed in #4133 Nov 3, 2023

BrynCooke reopened this Nov 3, 2023

This was referenced Nov 17, 2023

include_messages does not cover GraphQL errors from the subgraph #3851

Open

Add operation request metadata to access logs #4055

Open

BrynCooke mentioned this issue Nov 27, 2023

v1.33.2 router.yaml OpenTelemetry metric error occurred: Metrics error: Warning: Maximum data points for metric stream exceeded. Entry added to overflow. #4187

Open

o0Ignition0o mentioned this issue Dec 4, 2023

AWS X-Ray traceId handling in router #3445

Open

Geal mentioned this issue Jan 25, 2024

trace sampling headers are not sent if the trace is not sampled #4544

Closed

smyrick mentioned this issue Feb 7, 2024

Add subgraph span attribute for http status code #4614

Open

frittentheke mentioned this issue Dec 12, 2024

Make metric cardinality limit configurable (OpenTelemetry Protocol (OTLP) exporter) #6445

Open
