Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for federated tracing #16

Merged
merged 3 commits into from
Sep 10, 2019

Conversation

lennyburdette
Copy link
Contributor

Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the "extensions.ftv1" field. The gateway requests traces
by adding a special header to the GraphQL request, and combines
traces from all federated services into a single trace.

This change includes a Tracer that uses the graphql-ruby
tracing API to record field timings and info and store it
on the execution context. It also includes methods on the
ApolloFederation::Tracing module to pluck the info from the
context, convert it to an encoded string, and attach it to the
query result's extensions.

I used the Apollo Server typescript code as reference:

As well as an unfinished fork of apollo-tracing-ruby:

Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/

Addresses #14

cc @rylanc @zionts @glasser

@jturkel
Copy link
Contributor

jturkel commented Aug 6, 2019

Out of curiosity do you have thoughts on how non-federated use cases that send traces directly to the Apollo ingress endpoint should be supported e.g. what should live in this gem vs. apollo-tracing-ruby vs. graphql-ruby? We (Salsify) don't have any immediate plans to work on our fork of apollo-tracing-ruby because we decided not to move forward with Apollo due to some missing features.

@lennyburdette
Copy link
Contributor Author

@jturkel These are great questions and I honestly don't have strong opinions. My use cases are completely tied to managed schema federation with Apollo, but if someone makes a case for organizing this code differently, I'm all ears.

  • The federation tracing spec is markedly simpler than the full /api/ingress/traces API. Federated traces contain only the start/end/duration/node tree field on the protobuf message. I don't think it makes sense (yet) to share an implementation.
  • I like the idea of this gem being a one-stop-shop for federated services, so even if someone completes the implementation in apollo-tracing-ruby, I'd still want the user experience to be as simple as installing and configuring this gem.

That being said, thanks for the work done so far in apollo-tracing-ruby! It was hugely helpful in getting this to work.

@jturkel
Copy link
Contributor

jturkel commented Aug 8, 2019

Makes sense. I agree that the small amount of potentially shared code (i.e. the NodeMap + parts of the Tracer) doesn't warrant adding dependencies between the two gems. There may be scenarios where servers want to both return the trace information to a gateway server and report it directly to a Apollo (e.g. if the API can also be called from non-gateway clients) but I doubt computing trace information twice will add much overhead.

@kerryaustin
Copy link

Excited to try this, thanks for the hard work @lennyburdette !!! 🎉

@kerryaustin
Copy link

@rylanc @noaelad

@rylanc rylanc self-requested a review August 13, 2019 19:36
@rylanc
Copy link
Contributor

rylanc commented Aug 13, 2019

Sorry for the delay. I've been on vacation. I'll take a look at it this week

root: trace[:node_map].root,
)

json = result.to_h

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line be after you do your additional result changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, yes! not sure how this is working today

module Tracing
KEY = :ftv1

def self.use(schema)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind wrapping these in a class << self?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a preference between class << self vs extend self?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol never mind ... rubocop enforces module_function

@lennyburdette lennyburdette force-pushed the federated-tracing branch 3 times, most recently from 3c8d73e to 1a787c8 Compare August 20, 2019 16:31
@kerryaustin
Copy link

So.....can we get this merged?

@noaelad
Copy link

noaelad commented Sep 3, 2019

Sorry for the delay on this folks, I'll take a pass at it today

execute_field_lazy(data, &block)
else
yield
end
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the implementation of the tracer. I couldn't find any documentation of these four events (execute_query, execute_query_lazy, execute_field, execute_field_lazy) beyond just their names here:
https://graphql-ruby.org/api-doc/1.9.11/GraphQL/Tracing
So I don't understand what the difference is between the lazy and non-lazy ones and whether the tracer is tracking each one correctly. Can you shed more light on how you figured this out?

The "Step 1" "Step 2"... comments are also confusing - is there some documentation that guarantees the events always occur in that order?

Finally - why is the execute_query event used to record only a start_time timestamp and not an end_time timestamp after the block.call line? And similarly why is execute_query_lazy only recording the end_time? It seems like we're assuming that both of these events always occur in sequence. Is there documentation for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noaelad TBH, I probably wouldn't have figured this out on my own. I used this in-progress PR as reference: https://github.com/salsify/apollo-tracing-ruby/blob/feature/new-apollo-api/lib/apollo_tracing/tracer.rb

I believe the steps are guaranteed to execute in the order described in my comments. The naming is definitely confusing—might have to poke rmosolgo for an explaination.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is around relying on undocumented behavior in graphql-ruby, especially where that undocumented behavior doesn't make a lot of immediate sense to me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried setting both start_time and end_time in both execute_query and execute_query_lazy?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled your branch and tried playing around, when both methods (execute_query and execute_query_lazy) have the same implementation the only spec failures result from the fact that timestamps are increased twice and not just once (which makes sense in today's world). I'd prefer an implementation like this that is more robust and relies less on the fact that graphql-ruby happens to call both execute_query and execute_query_lazy in a specific sequence. WDYT?

By the way the specs would still be a brittle to changes in graphql-ruby but I don't have a good suggestion for mitigating that.

      def self.execute_query(data, &block)
        query = data.fetch(:query)
        return block.call unless query.context && query.context[:tracing_enabled]

        trace = query.context.namespace(ApolloFederation::Tracing::KEY)
        trace.merge!(
          start_time: Time.now.utc,
          start_time_nanos: Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond),
          node_map: NodeMap.new,
        )
        result = block.call
        trace.merge!(
          end_time: Time.now.utc,
          end_time_nanos: Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond),
        )

        result
      end

      def self.execute_query_lazy(data, &block)
        execute_query(data, &block)
      end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was a fun rabbit hole, but I think I understand tracing mechanics well now. I added a fat comment that explains the ordering of trace events. The key was figuring out that there are two execution phases: 1) non-lazy, 2) lazy. I could not figure out a way for execute_query_lazy to not fire, so I think it's still a good place to record ending times.

Here's a graphql-ruby test that shows execution order. It always has eql as the last step in the event list.

We could record an end time at the end of execute_query but it will always get thrown away.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this

@noaelad noaelad self-requested a review September 6, 2019 23:38
Copy link

@noaelad noaelad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noaelad
Copy link

noaelad commented Sep 9, 2019

@lennyburdette this makes sense to me, thanks for digging into tracing.
Can you please setup CircleCI with your github account to run the test suite on your fork, so we can run it through the test suite before merging?

Lenny Burdette added 2 commits September 9, 2019 17:33
Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the `"extensions.ftv1"` field. The gateway requests traces
by adding a special header to the GraphQL request, and combines
traces from all federated services into a single trace.

This change includes a Tracer that uses the graphql-ruby
[tracing API][t] to record field timings and info and store it
on the execution context. It also includes methods on the
`ApolloFederation::Tracing` module to pluck the info from the
context, convert it to an encoded string, and attach it to the
query result's extensions.

I used the Apollo Server typescript code as reference:
* https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts
* https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts

As well as an unfinished fork of apollo-tracing-ruby:
* https://github.com/salsify/apollo-tracing-ruby/blob/feature/new-apollo-api/lib/apollo_tracing/tracer.rb
* https://github.com/salsify/apollo-tracing-ruby/blob/feature/new-apollo-api/lib/apollo_tracing/trace_tree.rb

Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/

Addresses Gusto#14

[t]:https://graphql-ruby.org/queries/tracing.html
* `module_function` vs repetitive `self.`
* fix a call to `result.to_h` in `.attach_trace_to_result`
@lennyburdette lennyburdette force-pushed the federated-tracing branch 3 times, most recently from e59edf6 to 634edc2 Compare September 10, 2019 00:43
…est to cover lazy fields

will squash later
@lennyburdette
Copy link
Contributor Author

@noaelad got the tests passing!

Copy link

@noaelad noaelad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this very useful feature!

@noaelad noaelad merged commit 57ecc5b into Gusto:master Sep 10, 2019
grxy pushed a commit that referenced this pull request Sep 10, 2019
# [0.4.0](v0.3.2...v0.4.0) (2019-09-10)

### Features

* add support for federated tracing ([#16](#16)) ([57ecc5b](57ecc5b)), closes [#14](#14)
@grxy
Copy link
Collaborator

grxy commented Sep 10, 2019

🎉 This PR is included in version 0.4.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@grxy grxy added the released label Sep 10, 2019
@lennyburdette
Copy link
Contributor Author

Thank you for the review!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants