Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidated error rate metrics by extension field #2460

Open
lennyburdette opened this issue Jan 23, 2023 · 1 comment
Open

Consolidated error rate metrics by extension field #2460

lennyburdette opened this issue Jan 23, 2023 · 1 comment

Comments

@lennyburdette
Copy link
Contributor

The customer goal is to capture error rates, often categorized by an arbitrary code field in the error extensions:

{ "data": null, "errors": [{ "message": "xxx", "extensions": { "code": "MY_CODE" } }] }

There are two ways to convert errors.*.extensions.code into a metric attribute:

  1. On non-20x responses, use telemetry.metrics.common.attributes.{supergraph,subgraph}.errors.
  2. On 20x responses, use telemetry.metrics.common.attributes.{supergraph,subgraph}.response.body.

The latter isn't specific to errors, and the JSON path feature does not handle arrays, so you have to pick a single error per attribute:

telemetry:
  metrics:
    common:
      attributes:
        supergraph: # Attribute configuration for requests to/responses from the router
          response:
            body:
              # Apply the value of the provided path of the router's response body as an attribute
              - path: .errors[0].extensions.status
                name: error_from_body

The combination of these conditions makes it difficult to create a single metric that counts the occurrence of a type of error that I can use to build alerts.

Describe the solution you'd like
I think my ideal solution is that telemetry.metrics.common.attributes.{supergraph,subgraph}.errors works with 20x responses. Maybe there's an optional configuration property to opt-in to this behavior.

telemetry:
  metrics:
    common:
      attributes:
        supergraph: 
            errors:
              include_20x_responses: true # NEW
              include_messages: false
              extensions:
                - name: subgraph_error_extended_type 
                  path: .code 

Describe alternatives you've considered
Alternatively, we could add an errors property next to body for generating attributes from the response.

telemetry:
  metrics:
    common:
      attributes:
        supergraph: 
          response:
            errors:
              include_messages: false
              extensions:
                - name: subgraph_error_extended_type 
                  path: .code 

This will result in one metric for 20x responses and one for non-20x responses but at least we'd be able to capture all the errors in a 20x response.

Additional context
Subgraphs that conform to the draft "GraphQL over HTTP" spec should use 200 OK for responses that include top-level errors: https://graphql.github.io/graphql-over-http/draft/#sec-application-json. We want the ability to capture top-level error rates in these situations.

@chandrikas
Copy link
Contributor

Can be grouped with umbrella config ticket: #3226

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants