
Metrics from error body #1198

Closed
lennyburdette opened this issue Jun 6, 2022 · 9 comments · Fixed by #1300

@lennyburdette
Contributor

GraphQL HTTP responses are typically 200 OK regardless of whether an error occurred. Generic APMs and monitoring tools don't have a way to distinguish between { data } and { data: null, errors: [...] } for error rate detection.

It's also common to add metadata to errors to replicate HTTP status code semantics like this:

{
  "data": null,
  "errors": [
    {
      "message": "Something went wrong",
      "extensions": {
        "status": "INTERAL_SERVER_ERROR"
      }
    }
  ]
}

The current Prometheus metrics provide counters for high-level HTTP status codes:

http_requests_total{status="200"} 12
http_requests_total{status="500"} 4
http_requests_total{status="200",subgraph="subgraph"} 11

Describe the solution you'd like
A customer has asked for counters based on data from within the error extensions.

With config similar to:

telemetry:
  metrics:
    prometheus:
      enabled: true
      response_counters:
        - errors.extensions.status

It would be great to get metrics like:

response_fields{path="errors.extensions.status",value="INTERNAL_SERVER_ERROR"} 12
response_fields{path="errors.extensions.status",value="INVALID_REQUEST"} 24
response_fields{subgraph="subgraph",path="errors.extensions.status",value="INVALID_REQUEST"} 8

(Note that the last metric in that example corresponds to the response from the fetch to the subgraph.)

Describe alternatives you've considered
Open to ideas!

Additional context
This is an enterprise customer request.

@bnjjj
Contributor

bnjjj commented Jun 8, 2022

I recently added a feature to add custom attributes to metrics, and I think we could extend this feature by adding an entry named from_response.
Example:

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        from_headers:
          - named: "content-type"
            rename: "payload_type"
            default: "application/json"
          - named: "x-custom-header-to-add"
        
        from_response:
          - path: errors.extensions.status
            rename: extended_status
            default: optional_default_value

@bnjjj
Contributor

bnjjj commented Jun 9, 2022

Regarding your example, I think I also forgot to add another configuration option specific to subgraphs:

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        from_headers:
          - named: "content-type"
            rename: "payload_type"
            default: "application/json"
          - named: "x-custom-header-to-add"
        from_response:
          - path: errors.extensions.status
            rename: extended_status
            default: optional_default_value
    subgraphs:
      my_subgraph:
        attributes:
          from_response:
            - path: errors.extensions.status
              rename: extended_status
              default: optional_default_value

@lennyburdette
Contributor Author

So the metrics would look like this?

http_requests_total{status="200"} 12
http_requests_total{status="200",extended_status="INTERNAL_SERVER_ERROR"} 12
http_requests_total{status="200",extended_status="INVALID_REQUEST"} 12
http_requests_total{status="200",subgraph="subgraph",extended_status="INTERNAL_SERVER_ERROR"} 11
http_requests_total{status="200",subgraph="subgraph",extended_status="INVALID_REQUEST"} 11

I think that would work!

@abernix
Member

abernix commented Jun 10, 2022

I haven't fully investigated what we have in terms of options here or the from_response implementation, but I'm a bit worried things are getting blurry from a configuration standpoint. I think that from_headers and from_response are more (but not completely) self-explanatory to me when they are within the subgraphs section, but the common section seems to throw things off in my mind, particularly because we're sitting in the position of a proxy and have both upstream (subgraphs) and downstream (client) concepts to think about.

Further, headers are classically part of both a request and a response (i.e., request.headers and response.headers), so thinking about a "response" and "headers" separately isn't a natural hierarchy in my mind, at least as I understand things right now. (Maybe I'm thinking about this wrong!)

It might make sense to refer to the GraphQL response (i.e., { data: ..., errors: ..., extensions: ... }) as ... well, graphql_response, but maybe, in a similar way to how the HTTP Fetch API standard has response.headers and response.json(), we want to think about this as "response.graphql" for both upstream and downstream?

I'll just bikeshed some ideas here around structure, but do we think something like the following adds clarity?

from_upstream_subgraph:
  request:
    header:
      - named: "content-type"
        rename: "payload_type"
        default: "application/json"
      - named: "x-custom-header-to-add"
    graphql:
      - path: errors.extensions.status
        rename: extended_status
        default: optional_default_value
from_downstream_client:
  # Do we need a section like this?

@lennyburdette
Contributor Author

lennyburdette commented Jun 10, 2022

Could the metrics configuration align with the four services (Router, Execution, QueryPlanner, Subgraph)? Router metrics would be "downstream" client request/response metrics, while Subgraph metrics would be "upstream" subgraph request/response metrics.

@BrynCooke
Contributor

BrynCooke commented Jun 14, 2022

We're going to have to decide how much we want the telemetry plugin to do in terms of collecting data. It's worth noting that at the point where we update the metrics, the content of the subgraph responses is not available to us by default right now.

Here is a suggestion that hopefully covers most cases, but it could also be used with Rhai in the case that users want to get something from the subgraphs; they would be required to populate the context (see the sketch after the config below).

telemetry:
  metrics:
    common:
      attributes:
        static:
          - name: "version"
            value: "v1.0.0"
        request:
          - header:
               named: "content-type"
               rename: "payload_type"
               default: "application/json"
          - header:
              named: "x-custom-header-to-add"
        response:
          - body:
              path: errors.extensions.status
              name: extended_status
              default: optional_default_value
        context:
          - named: "foo"
            default: "application/json"

@BrynCooke
Contributor

BrynCooke commented Jun 16, 2022

We've settled on the following config:

telemetry:
  metrics: 
    common:
      attributes:
        router:
          static:
            - name: "version"
              value: "v1.0.0"
          request:
            - header:
                named: "content-type"
                rename: "payload_type"
                default: "application/json"
            - header:
                named: "x-custom-header-to-add"
          response:
            - body:
                path: errors.extensions.status
                name: extended_status
                default: optional_default_value
          context:
            - named: "foo"
          
        subgraph:
          all:
            static:
              - name: "version"
                value: "v1.0.0"
            request:
              - header:
                  named: "content-type"
                  rename: "payload_type"
                  default: "application/json"
              - header:
                  named: "x-custom-header-to-add"
            response:
              - body:
                  path: errors.extensions.status
                  name: extended_status
                  default: optional_default_value
            context:
              - named: "foo"
          subgraphs:
            products:
              static:
                - name: "version"
                  value: "v1.0.0"
              request:
                - header:
                    named: "content-type"
                    rename: "payload_type"
                    default: "application/json"
                - header:
                    named: "x-custom-header-to-add"
              response:
                - body:
                    path: errors.extensions.status
                    name: extended_status
                    default: optional_default_value
              context:
                - named: "foo"

@ramapalani

ramapalani commented Jul 12, 2022

Thanks a lot for this feature.

I'm trying to differentiate between these two errors (one subgraph returning an error vs. two subgraphs returning errors):
error1.txt
error2.txt

My Prometheus configuration:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        router:
          response:
            body:
            - path: .errors[0].extensions.type
              name: error_extended_type
        subgraph:
          all:
            response:
              body:
              - path: .errors[0].extensions.type
                name: subgraph_error_extended_type

router.response.body reads the first error properly if there is one.
I may not know how many more errors are possible, so I tried using subgraph.all.response.body, but this doesn't have any effect. Is there a way for me to capture all errors and not just the first error?

something like this:

http_requests_total{error_extended_type="SubrequestHttpError",subgraph="pandas", operation_name="AllPandas",status="200"} 2
http_requests_total{error_extended_type="SubrequestHttpError",subgraph="products", operation_name="AllPandas",status="200"} 1

@bnjjj
Contributor

bnjjj commented Jul 25, 2022

@ramapalani Ok, I think there are two different things we're missing here. The first could be to add support for JSON query syntax like .errors[*].extensions.type, but that would just create an array with all the error types inside, which is not what you want.

What is missing in our implementation, and could be useful for you, is to add an error subsection that reflects the same behavior we have in our plugin hooks. For example, instead of having:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph:
          all:
            response:
              body:
              - path: .errors[0].extensions.type
                name: subgraph_error_extended_type

you could have:

telemetry:
  metrics:
    prometheus:
      enabled: true
    common:
      attributes:
        subgraph:
          all:
            error: # Only works if it's a valid GraphQL error
              include_message: true # Will include the error message in a message attribute
              include_locations: true
              include_path: true
              extensions: # Include extension data
                name: subgraph_error_extended_type # Name of the attribute
                path: .type # JSON query path to fetch data from extensions

And by doing this, using your second example file, you would get metrics like this:

http_requests_total{error_extended_type="SubrequestHttpError",message="HTTP fetch failed from 'pandas': HTTP fetch failed from 'redacted': redacted",subgraph="pandas", operation_name="AllPandas",status="200"} 2
http_requests_total{error_extended_type="SubrequestHttpError",message="HTTP fetch failed from 'products': HTTP fetch failed from 'redacted': redacted",subgraph="products", operation_name="AllPandas",status="200"} 1
