Distinguish which subgraph threw error in prometheus metrics #1493

Closed
KennethWussmann opened this issue Aug 10, 2022 · 3 comments

Comments

@KennethWussmann

KennethWussmann commented Aug 10, 2022

Is your feature request related to a problem? Please describe.
We're coming from #1198 with a similar problem. When we apply the custom attributes of the returned errors they get added to the Prometheus metrics, but it's not possible to tell which subgraph threw the error.

Given configuration:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        router:
          response:
            body:
              - path: .errors[0].extensions.code
                name: error_code

This produces the following Prometheus metric:

http_request_duration_seconds_bucket{error_code="BAD_USER_INPUT", le="+Inf", namespace="graphql", service_name="apollo-router", status="200"}

The metric now contains the error code as configured, but its attributes don't tell us which subgraph or which operation caused the error. That information is essential for working with this metric.

As an additional plus, which was also mentioned in the other issue: .errors[0].extensions.code only retrieves the first error. Supporting the JSON path syntax .errors[*].extensions.code to retrieve all of them would be nice to have as well.
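To make the difference between the two paths concrete, here is a minimal Python sketch. It is not how the router evaluates paths; the response body and the helper names `first_error_code` and `all_error_codes` are made up for illustration.

```python
import json

# Hypothetical GraphQL response body with two errors (invented for this example).
body = json.loads("""
{
  "data": null,
  "errors": [
    {"message": "invalid id", "extensions": {"code": "BAD_USER_INPUT"}},
    {"message": "not allowed", "extensions": {"code": "FORBIDDEN"}}
  ]
}
""")


def first_error_code(body):
    """Rough equivalent of .errors[0].extensions.code: only the first error."""
    errors = body.get("errors") or []
    if not errors:
        return None
    return (errors[0].get("extensions") or {}).get("code")


def all_error_codes(body):
    """Rough equivalent of .errors[*].extensions.code: every error with a code."""
    errors = body.get("errors") or []
    return [
        e["extensions"]["code"]
        for e in errors
        if "code" in (e.get("extensions") or {})
    ]


first = first_error_code(body)
every = all_error_codes(body)
print(first)
print(every)
```

With only the first path, the second error code never reaches the metric attributes.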

Describe the solution you'd like
It would be great to be able to include the subgraph name and operation name in this metric.

Describe alternatives you've considered
We also tried the following config:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph: 
          all:
            static:
              - name: kind
                value: subgraph_request
            errors:
              include_messages: true
              extensions:
                - name: subgraph_error_code
                  path: .code

However, we later found out that this config only applies when the subgraph responds with an HTTP status other than 200 or can't be reached. Client errors like BAD_USER_INPUT are returned with HTTP status 200, so this config does not capture them.

@bnjjj
Contributor

bnjjj commented Aug 10, 2022

Did you try this configuration?

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph: 
          all:
            static:
              - name: kind
                value: subgraph_request
            response:
              body:
                - path: .errors[0].extensions.code
                  name: subgraph_error_code
            errors:
              extensions:
                - name: subgraph_error_code
                  path: .code

@KennethWussmann
Author

Hey @bnjjj, thanks for the quick answer.
Yes, that does indeed do what we were looking for. We didn't know about this configuration option [from looking at the docs](https://www.apollographql.com/docs/router/configuration/metrics#adding-custom-attributeslabels).

@bnjjj
Contributor

bnjjj commented Aug 11, 2022

TLDR: when the HTTP status code from the subgraph is != 200, you need to specify what you want in the errors section. If it's a GraphQL error with an HTTP status code == 200, you need to specify the configuration in the response section. They aren't mutually exclusive: you can configure both, as in the example above.
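The rule can be sketched as a small Python function. This is a simplified reading of the thread, not the router's implementation; the name `attribute_section` and the use of `None` for an unreachable subgraph are assumptions made for the example.

```python
def attribute_section(http_status, body):
    """Return which config section would supply the error attribute.

    http_status is the subgraph's HTTP status, or None if the subgraph
    could not be reached (an assumption of this sketch).
    """
    if http_status != 200:
        # Non-200 status or unreachable subgraph: configure under `errors`.
        return "errors"
    if body.get("errors"):
        # HTTP 200 carrying GraphQL errors: configure under `response`.
        return "response"
    # HTTP 200 without errors: there is no error attribute to extract.
    return None


print(attribute_section(500, {}))
print(attribute_section(200, {"errors": [{"extensions": {"code": "BAD_USER_INPUT"}}]}))
```

Since a subgraph can fail in either way, configuring both sections covers both branches.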
