
Metrics from error body #1198

Closed
lennyburdette opened this issue Jun 6, 2022 · 9 comments · Fixed by #1300

@lennyburdette
Contributor

GraphQL HTTP responses are typically 200 OK regardless of whether an error occurred. Generic APMs and monitoring tools don't have a way to distinguish between { data } and { data: null, errors: [...] } for error rate detection.

It's also common to add metadata to errors to replicate HTTP status code semantics like this:

{
  "data": null,
  "errors": [
    {
      "message": "Something went wrong",
      "extensions": {
        "status": "INTERAL_SERVER_ERROR"
      }
    }
  ]
}

The current Prometheus metrics provide counters for high-level HTTP status codes:

http_requests_total{status="200"} 12
http_requests_total{status="500"} 4
http_requests_total{status="200",subgraph="subgraph"} 11

Describe the solution you'd like
A customer has asked for counters based on data from within the error extensions.

With config similar to:

telemetry:
  metrics:
    prometheus:
      enabled: true
      response_counters:
        - errors.extensions.status

It would be great to get metrics like:

response_fields{path="errors.extensions.status",value="INTERNAL_SERVER_ERROR"} 12
response_fields{path="errors.extensions.status",value="INVALID_REQUEST"} 24
response_fields{subgraph="subgraph",path="errors.extensions.status",value="INVALID_REQUEST"} 8

(Note that the last metric in that example corresponds to the response from the fetch to the subgraph.)

Describe alternatives you've considered
Open to ideas!

Additional context
This is an enterprise customer request.

@bnjjj
Contributor

bnjjj commented Jun 8, 2022

I recently added a feature to add custom attributes to metrics, and I think we could extend this feature by adding an entry named from_response.
Example:

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        from_headers:
          - named: "content-type"
            rename: "payload_type"
            default: "application/json"
          - named: "x-custom-header-to-add"
        
        from_response:
          - path: errors.extensions.status
            rename: extended_status
            default: optional_default_value

@bnjjj
Contributor

bnjjj commented Jun 9, 2022

Regarding your example, I think I also forgot to add another configuration option specific to subgraphs:

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        from_headers:
          - named: "content-type"
            rename: "payload_type"
            default: "application/json"
          - named: "x-custom-header-to-add"
        from_response:
          - path: errors.extensions.status
            rename: extended_status
            default: optional_default_value
    subgraphs:
      my_subgraph:
        attributes:
          from_response:
            - path: errors.extensions.status
              rename: extended_status
              default: optional_default_value

@lennyburdette
Contributor Author

So the metrics would look like this?

http_requests_total{status="200"} 12
http_requests_total{status="200",extended_status="INTERNAL_SERVER_ERROR"} 12
http_requests_total{status="200",extended_status="INVALID_REQUEST"} 12
http_requests_total{status="200",subgraph="subgraph",extended_status="INTERNAL_SERVER_ERROR"} 11
http_requests_total{status="200",subgraph="subgraph",extended_status="INVALID_REQUEST"} 11

I think that would work!

@abernix
Member

abernix commented Jun 10, 2022

I haven't fully investigated what we have in terms of options here or the from_response implementation, but I'm a bit worried things are getting blurry from a configuration standpoint. I think that from_headers and from_response are more (but not completely) self-explanatory to me when they are within the subgraphs section, but the common section seems to throw things off in my mind, particularly because we're sitting in the position of a proxy and have both upstream (subgraphs) and downstream (client) concepts to think about.

Further, headers are classically part of both a request and a response (i.e., request.headers and response.headers), so thinking about a "response" and "headers" separately isn't a natural hierarchy in my mind, at least as I understand things right now. (Maybe I'm thinking about this wrong!)

It might make sense to refer to the GraphQL response (i.e., { data: ..., errors: ..., extensions: ... }) as ... well, graphql_response, but maybe, in a similar way to how the HTTP Fetch API standard has response.headers and response.json(), we want to think about this as "response.graphql" for both upstream and downstream?

I'll just bikeshed some ideas here around structure, but do we think something like the following adds clarity?

from_upstream_subgraph:
  request:
    header:
      - named: "content-type"
        rename: "payload_type"
        default: "application/json"
      - named: "x-custom-header-to-add"
    graphql:
      - path: errors.extensions.status
        rename: extended_status
        default: optional_default_value
from_downstream_client:
  # Do we need a section like this?

@lennyburdette
Contributor Author

lennyburdette commented Jun 10, 2022

Could the metrics configuration align with the four services (Router, Execution, QueryPlanner, Subgraph)? Router metrics would be "downstream" client request/response metrics, while Subgraph metrics would be "upstream" subgraph request/response metrics.

@BrynCooke
Contributor

BrynCooke commented Jun 14, 2022

We're going to have to decide how much we want the telemetry plugin to do in terms of collecting data. It's worth noting that at the point where we update the metrics, the content of the subgraph responses is not available to us by default right now.

Here is a suggestion that hopefully covers most cases, but it could also be used with Rhai in the case that users want to get something from the subgraphs; they would be required to populate the context (see the sketch after the config below).

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        request:
          - header:
               named: "content-type"
               rename: "payload_type"
               default: "application/json"
          - header:
              named: "x-custom-header-to-add"
        response:
          - body:
              path: errors.extensions.status
              name: extended_status
              default: optional_default_value
        context:
          - named: "foo"
            default: "application/json"

@BrynCooke
Contributor

BrynCooke commented Jun 16, 2022

We've settled on the following config:

telemetry:
  metrics: 
    common:
      attributes:
        router:
          static:
            - name: "version"
              value: "v1.0.0"
          request:
            - header:
                named: "content-type"
                rename: "payload_type"
                default: "application/json"
            - header:
                named: "x-custom-header-to-add"
          response:
            - body:
                path: errors.extensions.status
                name: extended_status
                default: optional_default_value
          context:
            - named: "foo"
          
        subgraph:
          all:
            static:
              - name: "version"
                value: "v1.0.0"
            request:
              - header:
                  named: "content-type"
                  rename: "payload_type"
                  default: "application/json"
              - header:
                  named: "x-custom-header-to-add"
            response:
              - body:
                  path: errors.extensions.status
                  name: extended_status
                  default: optional_default_value
            context:
              - named: "foo"
          subgraphs:
            products:
              static:
                - name: "version"
                  value: "v1.0.0"
              request:
                - header:
                    named: "content-type"
                    rename: "payload_type"
                    default: "application/json"
                - header:
                    named: "x-custom-header-to-add"
              response:
                - body:
                    path: errors.extensions.status
                    name: extended_status
                    default: optional_default_value
              context:
                - named: "foo"

@ramapalani

ramapalani commented Jul 12, 2022

Thanks a lot for this feature.

I'm trying to differentiate between these two errors (one subgraph returning an error vs. two subgraphs returning errors):
error1.txt
error2.txt

My Prometheus configuration:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        router:
          response:
            body:
            - path: .errors[0].extensions.type
              name: error_extended_type
        subgraph:
          all:
            response:
              body:
              - path: .errors[0].extensions.type
                name: subgraph_error_extended_type

router.response.body reads the first error properly if there is one.
I may not know how many more errors are possible, so I tried using subgraph.all.response.body, but this doesn't have any effect. Is there a way for me to capture all errors and not just the first error?

something like this:

http_requests_total{error_extended_type="SubrequestHttpError",subgraph="pandas", operation_name="AllPandas",status="200"} 2
http_requests_total{error_extended_type="SubrequestHttpError",subgraph="products", operation_name="AllPandas",status="200"} 1

@bnjjj
Contributor

bnjjj commented Jul 25, 2022

@ramapalani Ok, I think there are two different things we're missing here. The first could be to add support for JSON query syntax like .errors[*].extensions.type, but that would just create an array with all the error types inside, which is not what you want.

What is missing in our implementation, and could be useful for you, is to add an error subsection that reflects the same behavior we have in our plugin hooks. For example, instead of having:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph:
          all:
            response:
              body:
              - path: .errors[0].extensions.type
                name: subgraph_error_extended_type

you could have:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph:
          all:
            error: # Only works if it's a valid GraphQL error
              include_message: true # Will include the error message in a message attribute
              include_locations: true
              include_path: true
              extensions: # Include extension data
                name: subgraph_error_extended_type # Name of the attribute
                path: .type # JSON query path to fetch data from extensions

And by doing this, using your second example file, you would get metrics like this:

http_requests_total{error_extended_type="SubrequestHttpError",message="HTTP fetch failed from 'pandas': HTTP fetch failed from 'redacted': redacted",subgraph="pandas", operation_name="AllPandas",status="200"} 2
http_requests_total{error_extended_type="SubrequestHttpError",message="HTTP fetch failed from 'products': HTTP fetch failed from 'redacted': redacted",subgraph="products", operation_name="AllPandas",status="200"} 1
