Consolidated error rate metrics by extension field #2460

lennyburdette · 2023-01-23T16:51:00Z

The customer goal is to capture error rates, often categorized by an arbitrary code field in the error extensions:

{ "data": null, "errors": [{ "message": "xxx", "extensions": { "code": "MY_CODE" } }] }

There are two ways to convert errors.*.extensions.code into a metric attribute:

On non-20x responses, use telemetry.metrics.common.attributes.{supergraph,subgraph}.errors.
On 20x responses, use telemetry.metrics.common.attributes.{supergraph,subgraph}.response.body.

The latter isn't specific to errors, and the JSON path feature does not handle arrays, so you have to pick a single error per attribute:

telemetry:
  metrics:
    common:
      attributes:
        supergraph: # Attribute configuration for requests to/responses from the router
          response:
            body:
              # Apply the value of the provided path of the router's response body as an attribute
              - path: .errors[0].extensions.status
                name: error_from_body

The combination of these conditions makes it difficult to create a single metric that counts the occurrence of a type of error that I can use to build alerts.

Describe the solution you'd like
I think my ideal solution is that telemetry.metrics.common.attributes.{supergraph,subgraph}.errors works with 20x responses. Maybe there's an optional configuration property to opt-in to this behavior.

telemetry:
  metrics:
    common:
      attributes:
        supergraph: 
            errors:
              include_20x_responses: true # NEW
              include_messages: false
              extensions:
                - name: subgraph_error_extended_type 
                  path: .code

Describe alternatives you've considered
Alternatively, we could add an errors property next to body for generating attributes from the response.

telemetry:
  metrics:
    common:
      attributes:
        supergraph: 
          response:
            errors:
              include_messages: false
              extensions:
                - name: subgraph_error_extended_type 
                  path: .code

This will result in one metric for 20x responses and one for non-20x responses but at least we'd be able to capture all the errors in a 20x response.

Additional context
Subgraphs that conform to the draft "GraphQL over HTTP" spec should use 200 OK for responses that include top-level errors: https://graphql.github.io/graphql-over-http/draft/#sec-application-json. We want the ability to capture top-level error rates in these situations.

The text was updated successfully, but these errors were encountered:

chandrikas · 2023-07-10T09:27:39Z

Can be grouped with umbrella config ticket: #3226

lennyburdette added raised by user triage labels Jan 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidated error rate metrics by extension field #2460

Consolidated error rate metrics by extension field #2460

lennyburdette commented Jan 23, 2023

chandrikas commented Jul 10, 2023

Consolidated error rate metrics by extension field #2460

Consolidated error rate metrics by extension field #2460

Comments

lennyburdette commented Jan 23, 2023

chandrikas commented Jul 10, 2023