Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support gRPC #403

Merged
merged 9 commits into from
Apr 30, 2018
Merged

Support gRPC #403

merged 9 commits into from
Apr 30, 2018

Conversation

Jared-Prime
Copy link
Contributor

@Jared-Prime Jared-Prime commented Apr 13, 2018

This PR provides an implementation for distributed tracing for Ruby gRPC services and clients.

Screenshot from Integration Using Demo Application: https://github.com/Jared-Prime/grpc-demo

Screenshot from Integration Using Demo Application: https://github.com/Jared-Prime/grpc-demo

In gRPC nomenclature, an "interceptor" is similar to middleware. When running a service or a client, the #patch method prepends the datadog interceptor onto the default middleware call chain, ensuring traces span across the scope of the protocol implementation.

The following types of gRPC calls have been implemented and tested:

  • request/reply
  • streaming requests
  • streaming replies
  • bidirectional streaming

A demo application is available at https://github.com/Jared-Prime/grpc-demo . A version of this patch is undergoing internal testing at my workplace ( Kenna Security ) as well.

This Pull Request closes #379 if accepted and merged; it closes #378 ( a previous attempt at this feature ) as well.

Review, suggestions, and refinements are welcome :)

This was referenced Apr 13, 2018
@adomokos
Copy link

Hello,

we would need this feature as well. Can someone from the maintainers review it?

Much appreciated!

@delner
Copy link
Contributor

delner commented Apr 17, 2018

Hey @adomokos, we're as excited about this pull request as you are, and are actively working with @Jared-Prime on this. Rest assured its coming!

@adomokos
Copy link

💥 Thanks @delner!

Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great so far, some misc feedback.

# inject trace context information into gRPC metadata prior to
# sending the request to the server.
class Client < Base
def trace(keywords)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to catch and handle errors here. In the (unlikely) event the tracing code raises an error, the actual gRPC call would not happen. Generally speaking, the goal of trace code like this is to be minimally impactful, so it might be prudent to do some kind of begin rescue ensure else setup here.

Same applies to the Server code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain how to proceed with this error handling. Maybe you had something specific in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing:

begin
  keywords[:metadata].each do |header, value|
    span.set_tag(header, value)
  end

  Datadog::GRPCPropagator
    .inject!(span.context, keywords[:metadata])
rescue StandardError => e
  Datadog::Tracer.log.debug("GRPC client trace failed: #{e}")
end

yield

Or something to that effect. It isn't bulletproof exactly without really chopping up the code, but might help. You could also consider extracting that code into another method named something like annotate! then add the error handling in there, if that seems cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I get it now. I like the annotate! approach and will try that

# The `#intercept!` method is implemented in gRPC; this module
# will be prepended to the original class, effectively injecting
# our tracing middleware into the head of the call chain.
module InterceptWithDatadog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this constant need to be defined in GRPC::InterceptionContext? (As opposed to the Datadog::Contrib::GRPC one?) We generally try to avoid opening/modifying classes or injecting new constants into integration namespaces, if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does and is defined here. If you'd like, we can opt for the direct insertion of interceptors as in this example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand: for the default use :grpc scenario, we're not sure what RPC service to inject into, so we just prepend our module into the InterceptionContext yeah? If so, I think I should clarify what I meant; I think we can move GRPC::InterceptionContext::InterceptWithDatadog to Datadog::Contrib::GRPC::InterceptWithDatadog or something similar. The point of which is mostly to just avoid defining constants in the GRPC namespace, even if we do have to inject modules into constants defined there.

Copy link
Contributor

@delner delner Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different question here: what happens if a user defines something like:

# Activate default GRPC tracing
Datadog.configure do |c|
  c.use :grpc
end

# Override configuration for one particular client
alternate_client = Demo::Echo::Service.rpc_stub_class.new(
  'localhost:50051',
  :this_channel_is_insecure,
  :interceptors => [configured_interceptor]
)

Would this cause this alternate_client to have two Datadog interceptors in it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. The login in already_prepended? should prevent that from happening. The intercept! call happens during the RPC interaction, so it should only add the default datadog interceptor to the call chain if the trace has not begun and if no other datadog interceptor has been supplied.

end

def choose_datadog_interceptor(args)
if args.key?(:metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the reasoning here for how an interceptor is chosen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the gRPC Ruby core library, Client and Server interceptors have the same method names, but different keyword arguments. Clients always have a metadata keyword whereas Servers do not; servers always have a call keyword. A metadata is very similar to a header object in many HTTP libraries; it may be just a Hash in some cases. A call is an internal object ( a GRPC::ActiveCall ) that provides access to the ongoing communication between client and server; it provides access to the metadata sent by the client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, is there any more concrete/reliable way of differentiating between client and server calls? It'd be worthwhile if possible, otherwise we can just go with what will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not finding anything more concrete. I'll keep looking, but the internals of the grpc gem aren't revealing anything more direct.


def prepend_interceptor
::GRPC::InterceptionContext
.prepend(::GRPC::InterceptionContext::InterceptWithDatadog)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What versions of Ruby are we supporting with GRPC? A limited set is okay, but I can see this won't work with 1.9.3, and might not work with 2.0 (I think you might have to do send(:prepend)?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to it's use of keyword arguments, only 2.x is supported.

I think if you're uncomfortable with modifying the existing object then we could simply require the more involved configuration. Otherwise, I'll look into and test some earlier 2.x rubies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring the more involved configuration seems "cleaner" and simpler, however, I think for now, prepending into the GRPC module is okay, for the sake of having some out of the box, easy GRPC tracing.

Let's just verify its working with 2.0+ for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looks like the grpc gem itself has a dependency on jwt, which requires a Ruby >= 2.1. Since 2.1 is past EOL, how about we just require >= 2.2? A quick test on my local machine works out OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine with me; seems like a feature that should require sufficiently new Ruby anyways. We'll just have to be sure to add that to the documentation.

metadata[GRPC_METADATA_TRACE_ID] = context.trace_id.to_s
metadata[GRPC_METADATA_PARENT_ID] = context.span_id.to_s
metadata[GRPC_METADATA_SAMPLING_PRIORITY] = context.sampling_priority.to_s
metadata.delete(GRPC_METADATA_SAMPLING_PRIORITY) unless context.sampling_priority
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but can we just do:

metadata[GRPC_METADATA_SAMPLING_PRIORITY] = context.sampling_priority.to_s if context.sampling_priority

end

def sampling_priority
value = @metadata.fetch(GRPC_METADATA_SAMPLING_PRIORITY, -1).to_i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Datadog::Ext::Priority::USER_REJECT has a value of -1 which is legitimate. Don't we want to propagate that?

Something like?

value = @metadata.fetch(GRPC_METADATA_SAMPLING_PRIORITY, nil)
value.nil? ? nil : value.to_i
# Or you could do:
value && value.to_i


let(:configured_client_interceptor) do
Datadog::Contrib::GRPC::DatadogInterceptor::Client.new do |c|
c.service_name = 'cepsr'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, like the name!

require 'ddtrace'

RSpec.describe 'tracing on the client connection' do
subject { Datadog::Contrib::GRPC::DatadogInterceptor::Client.new }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest giving subject blocks names. Something like subject(:client) would be good here.

@@ -0,0 +1,88 @@
require 'grpc'

module GRPCHelper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal, but food for thought: consider turning this into an RSpec shared context that you can compose into the other example files.

Then you can stub the constants to make them more composable. A reason for doing so is that it allows the RSpec example to optionally override and tweak behavior, without having to define constants per example.

e.g. a brief example of this could look like:

# Shared context
RSpec.shared_context 'GRPC service' do
  def run_request_reply(address = '0.0.0.0:50052', client = nil)
    runner(address, client) { |c| c.basic(message_class.new) }
  end

  # Stubbed constant
  # This message class can be overidden in an example to change behavior
  let(:message_class) do
    stub_const('TestMessage', Class.new do
      class << self
         def marshal(_o)
           ''
         end
   
         def unmarshal(_o)
           new
         end
       end
    end)
  end
end

# Compose it into a test
RSpec.describe "GRPC test" do
  include_context 'GRPC service'
end

For a better example in practice, check out the Resque job shared context here: https://github.com/DataDog/dd-trace-rb/pull/405/files

This is only a suggestion though; only implement it if you think it will benefit these tests.

@delner
Copy link
Contributor

delner commented Apr 26, 2018

I think we're really close to merging this one now. Final steps:

  1. Need to try out the performance tests on our end. You had setup a repo for this already, yeah? Where could I find that?
  2. Add GRPC to a sample app to generate sample traces. Do you have any examples of a minimal GRPC setup that I could use as a reference?
  3. Add documentation to GettingStarted.md, in a new GRPC section.

@Jared-Prime
Copy link
Contributor Author

  1. https://github.com/Jared-Prime/grpc-demo/ should be ready. I've also pushed a docker image with Ruby GRPC and the latest version of this branch: https://hub.docker.com/r/jprime/grpc-demo/tags/
  2. The demo code here implements this protocol. The setup generally follows what happens in the benchmark script. I'll be happy to answer any questions or write some basic docs if the setups not clear.
  3. Would you like me to add it or would you like to add the documentation after testing and experimenting with the code a bit? Either way works for me.

Thanks!

@delner
Copy link
Contributor

delner commented Apr 26, 2018

I don't see why we can't add a rough draft for it now. Feel free to add something to GettingStarted.md, and we can revise it later if need be.


| Key | Description | Default |
| --- | --- | --- |
| ``service_name`` | Service name used for `grpc` instrumentation | grape |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use a default other than grape 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops 🤦‍♂️

@@ -1102,6 +1154,30 @@ Then you must enable the request queuing feature in the integration handling the

For Rack based applications, see the [documentation](#rack) for details for enabling this feature.

### HTTP request queuing
Copy link
Contributor

@delner delner Apr 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did this end up in this PR? I think this got merged to 0.13-dev already. Bad rebase? Looking at this, this PR has a lot of commits it shouldn't have in it (which have already been merged to 0.13-dev.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting. Fixing....

Jared Davis added 8 commits April 26, 2018 14:17
…C::Interceptor registry.This leads to cleaner separation between client side and server sidetracing implementation and testing.
use service name from the pin
- change module name to `Datadog::Contrib::GRPC::InterceptWithDatadog`
- allow sampling priority value less than zero ( for user reject )
- use named subjects in spec
@delner
Copy link
Contributor

delner commented Apr 26, 2018

Okay, I've tried this out in a sample application. Seems to be working.

screen shot 2018-04-26 at 5 19 45 pm

I think the big gap between the client and server span is clock drift.

@delner delner requested a review from palazzem April 27, 2018 17:36
@delner delner added integrations Involves tracing integrations community Was opened by a community member feature Involves a product feature labels Apr 27, 2018
@delner delner added this to the 0.13.0 milestone Apr 27, 2018
Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm satisfied with this. Given the size of this, I'm going to request one other review from another maintainer on our side, but otherwise I think we should be good to merge this.

Copy link
Contributor

@palazzem palazzem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive contribution! Thank you very much @Jared-Prime! It's a huge improvement to our supported libraries!

Datadog::GRPCPropagator
.inject!(span.context, metadata)
rescue StandardError => e
Datadog::Tracer.log.debug("GRPC client trace failed: #{e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Jared-Prime
Copy link
Contributor Author

Thank you for your help in completing this, @delner

@delner delner merged commit 42c9b0a into DataDog:0.13-dev Apr 30, 2018
@delner
Copy link
Contributor

delner commented Apr 30, 2018

Hey @Jared-Prime , I merged this to 0.13-dev but it appears to be breaking the CI. Could you take a look at this? https://circleci.com/gh/DataDog/dd-trace-rb/2608#tests/containers/1

Seems like these two examples are flaky, tend to break randomly:

1) gRPC integration test server stream behaves like associates child spans with the parent should eq 8432511801728020899
     Failure/Error: expect(child_span.parent_id).to eq parent_span.span_id
     
       expected: 8432511801728020899
            got: 0
     
       (compared using ==)
     Shared Example Group: "associates child spans with the parent" called from ./spec/ddtrace/contrib/grpc/integration_spec.rb:65
     # ./spec/ddtrace/contrib/grpc/integration_spec.rb:49:in `block (3 levels) in <top (required)>'

  2) gRPC integration test bidirectional stream behaves like associates child spans with the parent should eq 5439390562754383015
     Failure/Error: expect(child_span.parent_id).to eq parent_span.span_id
     
       expected: 5439390562754383015
            got: 0
     
       (compared using ==)
     Shared Example Group: "associates child spans with the parent" called from ./spec/ddtrace/contrib/grpc/integration_spec.rb:70
     # ./spec/ddtrace/contrib/grpc/integration_spec.rb:49:in `block (3 levels) in <top (required)>'

@Jared-Prime
Copy link
Contributor Author

@delner Do you think we need to set sampling priority on these two examples?

@delner
Copy link
Contributor

delner commented Apr 30, 2018

@Jared-Prime Probably. Distributed tracing is supposed to always distribute the same sampling_priority value via headers so spans within the same trace don't get randomly sampled out.

@delner
Copy link
Contributor

delner commented May 1, 2018

I think this has been solved in #412.

@delner
Copy link
Contributor

delner commented May 1, 2018

@Jared-Prime Just want to say, thanks for your amazing contribution! This is a really cool feature, and you did a great job with it, as well as being on top of things throughout the PR process. It's been a pleasure working with you on this from the maintainer side!

Looking forward to seeing others enjoy this new feature as well!

🎉 🎉 🎉

delner added a commit that referenced this pull request May 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Was opened by a community member feature Involves a product feature integrations Involves tracing integrations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants