Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: instrumentation for racecar #72

Merged

Conversation

chrisholmes
Copy link
Contributor

@chrisholmes chrisholmes commented Jun 28, 2022

This PR implements #15

Racecar is built on top of rdkafka, but it doesn't use the API that is implemented by the rdkafka instrumentation so an racecar specific adapter is required.

Racecar's main loop repeatedly calls a rdkafka consumer's #poll method and then sends the returned messages to its own consumer classes (aka processors).

This instrumentation works by patching the Runner class and then its processor (Racecar::Consumer objects) at initialization.

@plantfansam
Copy link
Contributor

plantfansam commented Jun 29, 2022

Thanks @chrisholmes! Are you set up with a tracing UI locally? If so, would it be possible to include a screenshot of a racecar trace? This isn't part of our PR template (yet!), but it might be nice to see what this looks like! If you're not set up to view traces locally and getting that going is onerous, do not worry about it — we can review as-is.

@chrisholmes
Copy link
Contributor Author

@plantfansam yep, makes sense. Here's the view in jaeger:

image

Copy link
Collaborator

@arielvalentin arielvalentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this submission!

I am curious about the choice to mixin the instrumentation separately of the processor object vs mixing into the built in instrumenter1 or decorating process, and process_batch.

Would you mind sharing your thought process behind the overall design?

Footnotes

  1. https://github.com/zendesk/racecar/blob/master/lib/racecar/instrumenter.rb

MINIMUM_VERSION = Gem::Version.new('2.0')

compatible do
Gem.loaded_specs['racecar'].version >= MINIMUM_VERSION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using Gem.loaded_specs whenever possible and reference the VERSION specified in the gem itself.

Some background into why: https://github.com/open-telemetry/opentelemetry-ruby/issues/988#issuecomment-1018676853

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let me know if I'd done what you'd expect.

@chrisholmes
Copy link
Contributor Author

Thanks again for this submission!

I am curious about the choice to mixin the instrumentation separately of the processor object vs mixing into the built in instrumenter1 or decorating process, and process_batch.

Would you mind sharing your thought process behind the overall design?

Footnotes

  1. https://github.com/zendesk/racecar/blob/master/lib/racecar/instrumenter.rb

I've tried a few different ways, but this form (though it is quite ugly) is the only format I can get the tests to pass so its spans are consistent to the other kafka instrumentations.

This is my understanding of the impact of the methods you've mentioned:

When patching process and process_batch on Racecar::Runner the tests will pass except that exceptions are not propagated up to the Otel layer when racecar has its pause functionality enabled. Frustratingly, with_pause doesn't propagate the exception it rescues.

As far as I can tell, it is not fully possible to adapt Racecar's instrumentation interface as the process_batch payload doesn't have access to the batch of messages or their headers so it is not possible to add links to the traces. The process_message payload does have access to the message headers so we could use it there.

What do you think about this? Is it worth omitting batch message links or errors for a simpler looking instrumentation?

@ericmustin ericmustin self-assigned this Jul 19, 2022
@ericmustin
Copy link
Contributor

assigned myself to this so i can remember to review while ariel is out, apols on the delay here chris.

@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch from 2c9c897 to 27d8ff9 Compare July 19, 2022 17:06
@chrisholmes chrisholmes requested a review from robbkidd as a code owner July 19, 2022 17:06
@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch 2 times, most recently from 059bcb2 to c0611b0 Compare July 19, 2022 17:24
Copy link
Contributor

@ericmustin ericmustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good.

There are a few TODOs you have that still need to be completed,

The only real blocker to me is it's not clear to me why we need to take rdkafka as a hard dependency and install the instrumentation, I'd like to understand this better and whether it would be possible to decouple them. If not, I'd like to make sure we respect any env var's that disable the rdkafka instrumentation as well as make sure there's no edge cases in version requirements.

Nice work as always, apols on the delay on review

Comment on lines 7 to 15
## TODO: Include the supported version to be tested here.
## Example:
# appraise 'rack-2.1' do
# gem 'rack', '~> 2.1.2'
# end

# appraise 'rack-2.0' do
# gem 'rack', '2.0.8'
# end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, this needs to be updated to run appraisals for racecar

end

install do |_config|
OpenTelemetry::Instrumentation::Rdkafka::Instrumentation.instance.install({})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here. So is the idea that the racecar instrumentation depends on the rdkafka instrumentation?

  1. Is it possible to check if the instrumentation is already installed before attempting to re-install it?
  2. What's the desired behavior if the rdkafka instrumentation has been intentionally disabled by the user via env var? Are we essentially overwriting that choice.
    2a. Hypothetical since there's no config options at the moment, but won't we automatically overwriting any configuration the user has supplied for their rdkafka instrumentation should it be aded in the future?

I guess i have to think a bit more closely about what the "right" behavior should be, but imo we should be respecting if this instrumentation has been disabled and, imo, its not clear to me why we need to take rdkafka as a dependancy.

Copy link
Contributor Author

@chrisholmes chrisholmes Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my intention here was to rely on the Rdkafka implementation of producer spans as adding a Racecar implementation would duplicate them like this:

racecar send span:
  rdkafka send span:
    consumer span

There is a similar dependency between the Rack instrumentation with Rails and Sinatra so I did assume it would be alright to follow that pattern, but it is being used slightly different here and less extensively.

  1. Is it possible to check if the instrumentation is already installed before attempting to re-install it?

Yes, there is an interface we could use.

  1. What's the desired behavior if the rdkafka instrumentation has been intentionally disabled by the user via env var? Are we essentially overwriting that choice.
    2a. Hypothetical since there's no config options at the moment, but won't we automatically overwriting any configuration the user has supplied for their rdkafka instrumentation should it be aded in the future?

Yes, I agree, there probably needs to be some more consideration here. If we don't want to load rdkafka's instrumentation then we should probably load some instrumentation here for producer spans.

As alternatives to this, I can think of the following options:

  • Duplicate the spans (there is precedence for this with Faraday and it wrapping of other clients like net/http)
  • Have the racecar producer instrumentation used untraced so that the rdkafka producer spans are dropped (I think this would break context propagation so it's probably not a good idea).

What do you think?


spec.add_dependency 'opentelemetry-api', '~> 1.0'
spec.add_dependency 'opentelemetry-instrumentation-base', '~> 0.21.0'
spec.add_dependency 'opentelemetry-instrumentation-rdkafka', '~> 0.2.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ~> 0.2.0? This is lower than the min version of our rdkafka instrumentation (0.10.0)

Should we be making the rdkafka instrumentation more permissive? As mentioned before, currently the coupling between these two instrumentations feels unclear and maybe a little bit unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it's like this actually. I'll update it depending on the conversation above.

@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch 2 times, most recently from 673b5e7 to a2395b6 Compare July 28, 2022 08:40
@ericmustin
Copy link
Contributor

hey @chrisholmes just ack'ing your followups, plan to catch up here otw. hectic week for me there's been some internal stuff going on at work.

one thing we discussed in the SIG mtg on tuesday was moving this approach to use ASN notifications. I think everyone agreed that if it could be accomplished, it would be preferrable. I'm going to try to see if i can't fix the issues you'd mentioned when you tried the approach, if it is manageable to fix, this would probably be the preferred way forward.

@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch from 66a0332 to 00db269 Compare July 28, 2022 15:25
@chrisholmes
Copy link
Contributor Author

hey @chrisholmes just ack'ing your followups, plan to catch up here otw. hectic week for me there's been some internal stuff going on at work.

thanks @ericmustin!

one thing we discussed in the SIG mtg on tuesday was moving this approach to use ASN notifications. I think everyone agreed that if it could be accomplished, it would be preferrable. I'm going to try to see if i can't fix the issues you'd mentioned when you tried the approach, if it is manageable to fix, this would probably be the preferred way forward.

Sure, I'm happy for you to try. I should have a working copy somewhere with my experiment too that I can push to a branch though I did it by implementing a customer Racer::Instrumenter rather than hoooking into ASN. For reference, the only real issue I had was with the Racecar's runner not providing batch message headers to the instrumentation layer. If we're happy to omit links for batch messages then it should be a simply implementation.

@arielvalentin
Copy link
Collaborator

@chrisholmes Thank you again for your patience and for being willing to try a different approach here.

How would you feel about submitting an upstream PR to see if @dschierbeck or @bquorning would be willing to include the batch message headers so we are able to use links?

@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch from d9819eb to ae6dee4 Compare October 13, 2022 21:18
@@ -50,6 +50,7 @@ Gem::Specification.new do |spec|
spec.add_dependency 'opentelemetry-instrumentation-net_http', '~> 0.21.0'
spec.add_dependency 'opentelemetry-instrumentation-pg', '~> 0.22.0'
spec.add_dependency 'opentelemetry-instrumentation-que', '~> 0.4.0'
spec.add_dependency 'opentelemetry-instrumentation-racecar', '~> 0.1.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but a thought: perhaps worth seeing how this performs in production circumstances before adding to instrumentation-all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering something like that. Has this happened before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but maybe @open-telemetry/ruby-contrib-maintainers might know? I don't think there's any problem with it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally speaking I'm ok with adding it. If folks that use all don't want it they should disable it.

I minor bump with the change log/Release notes should be enough of a signal

headers ||= {}

tracer.in_span("#{topic} send", attributes: attributes, kind: :producer) do
OpenTelemetry.propagation.inject(headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting the reliance on headers interface, which is documented in rdkafka here.

end

def start(_name, _id, payload)
attrs = attributes(payload)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but I don't think attributes is called anywhere else, so we could just keep this method in start 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this introduces a ABC rubocop failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. We can always enable/disable the cop but I don't think it's worth the time 😄

def start(_name, _id, payload)
attrs = attributes(payload)

parent_context = OpenTelemetry.propagation.extract(payload[:headers], getter: OpenTelemetry::Common::Propagation.symbol_key_getter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

til symbol_key_getter

def add_subscribers
require_relative 'process_message_subscriber'
subscriber = ProcessMessageSubscriber.new
::ActiveSupport::Notifications.subscribe('process_message.racecar', subscriber)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we link to what emits this?

Racecar::Instrumentation.instance.tracer
end

def start(_name, _id, payload)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we're satisfying the ActiveSupport::Subscriber interface here. If so, can we explicitly say so in a comment? IDK if it makes sense to subclass the Subscriber class, but that would perhaps be another way of achieving this.

attributes
end

def finish(name, id, payload)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this instrumentation is adapted from the ActiveSupport instrumentation's SpanSubscriber (link), so maybe these questions were answered over there. At the risk of being told to RTFM, are we sure that:

  1. start and finish will be called sequentially, with all context.attach calls matched with a context.detach call.
  2. start and finish will be invoked by the same Thread.

If user code attaches to a new context between start and finish, but neglects to detach from that context, then detach will not work as expected. Of course, they'd see an error in their logs, so ... that'll probably help them fix it.

Similarly, if finish is called in a different thread than start, it'll be using the wrong Context stack, since that's stored on Thread.current.

I don't know much about ActiveSupport notifications (or Rdkafka! or Racecar!), so I'm most likely being paranoid here, but thought it was worth asking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahayworth might have insight, since he worked on a similar thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, it's not 100% in manual, but ActiveSupport::Notifications will call in that order and in the same thread (they leave to subscribers to decide if async is needed). The code isn't terribly complicated so you could follow on the source

Copy link
Contributor

@plantfansam plantfansam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to see my open comments resolved at some point, but I don't think it's worth blocking merge. Happy to put this in the wild and see how it fares.

@arielvalentin arielvalentin enabled auto-merge (squash) October 25, 2022 14:49
auto-merge was automatically disabled November 2, 2022 22:38

Head branch was pushed to by a user without write access

@chrisholmes chrisholmes force-pushed the create_racecar_instrumentation branch from 835f9b3 to 7625877 Compare November 3, 2022 14:17
@chrisholmes
Copy link
Contributor Author

@arielvalentin @plantfansam: I've resolved some conflicts and I think that I've switched over to the new CI structure. Please let me know if I've missed anything.

@arielvalentin arielvalentin merged commit 7b87ce5 into open-telemetry:main Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants