Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate Rails instrumentation to use only ActiveSupport::Notifications #218

Open
5 of 11 tasks
ahayworth opened this issue Dec 6, 2022 · 26 comments
Open
5 of 11 tasks
Assignees
Labels
feature New feature or request help wanted Extra attention is needed instrumentation keep Ensures stale-bot keeps this issue/PR open

Comments

@ahayworth
Copy link
Contributor

ahayworth commented Dec 6, 2022

We recently uncovered an issue wherein our action_pack instrumentation caused a production problem. There are many ways to consider such an issue, but one way is to acknowledge that monkey-patching is fraught with peril, and that we should consider different methods of instrumentation.

For Rails, we should be able to use ActiveSupport::Notifications. We already use this to great effect in our action_view instrumentation, and we think we can expand this further to the rest of our Rails instrumentation.

To do this, we need to:

  • Verify that Rails 7 implements safer subscription start/end behavior (a crashing subscriber should not prevent us from finishing our spans correctly).
  • Audit the instrumentation we have for Rails, and figure out what ActiveSupport notifications could provide similar levels of instrumentation:
    • action_pack
    • action_view
    • active_job
      • We can't drop our patch for serializing metadata regardless of what we choose to do here.
      • We implement our enqueue instrumentation as an around_enqueue filter; which is precisely what Rails does.
      • perform.active_job will likely cover the rest.
    • active_record
      • This was one of the things preventing us from doing notifications the first time around: we only get sql.active_record and instantiation.active_record.
      • sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.
      • instantiation.active_record will capture callbacks, but it's pretty basic and still doesn't tell us what method we're really in.
    • active_support
      • This is already done, because it doesn't do anything on its own! 🎉
      • It's a base for using notifications for spans, which is precisely what we want.
      • Basically, previous contributors thought ahead to a future where it would be useful to have a generic "subscribe, make a span" kind of thing. Yay!
  • Determine how many (if any) notifications are generated entirely after the instrumented operation is completed, and how many are generated around the instrumented operation.
    • This has implications around accessing the current_span, which may be a deal-breaker.
  • Augment our custom span subscriber in any way necessary to support a more broad use-case than what it does now
    • We need to figure out if there is a better way to do this in Rails 7 and if we should adjust this span subscriber.
    • We also should look and see if we could backport the Rails 7 fixes here ... or get that backported upstream.
  • Work through instrumentation and convert them.
    • Depending on the outcome of the safety/backport investigations, we can decide whether or not we wish to keep the monkey-patched versions around for older Rails installations or not.
@ahayworth ahayworth added feature New feature or request help wanted Extra attention is needed instrumentation labels Dec 6, 2022
@ahayworth ahayworth self-assigned this Dec 6, 2022
@ahayworth
Copy link
Contributor Author

This is the PR that improved safety in Rails 7: rails/rails#43282

@plantfansam
Copy link
Contributor

plantfansam commented Dec 6, 2022

I think this is a great idea. Even if we discover issues with ASN, that sounds like a great reason to contribute fixes upstream (to Rails) 😄

@richardmcmillen
Copy link
Contributor

This is cool to see!

I noted in my MR here #204 that the ActionPack instrumentation could probably be refactored to use the same subscriber as I created in the ME. Just in case anyone wants to have a look.

@robbkidd
Copy link
Member

Talked about this during the SIG today. Attendees are fans of the idea to move away from monkeypatching and towards AS::N subscriptions.

@tjefferson08
Copy link

Can AS::N support current_span ?

One thing I do a lot in my apps:

# in a controller
def index
  OpenTelemetry::Trace.current_span.add_attributes(interesting_stuff)
  # ... rest of action
end

I'm not super familiar with AS::N, but I'd think deferring span creation to after the execution of index would break this use case.

@arielvalentin
Copy link
Collaborator

@ahayworth I started to look into this a little bit today and noticed a few constraints and dependencies that will make it difficult to use the existing AS Notifications instrumentation outside of a Rails application.

The SpanSubscriber has a director dependency on Rails and assumes it is installed (introduced here open-telemetry/opentelemetry-ruby#993):

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/active_support/lib/opentelemetry/instrumentation/active_support/span_subscriber.rb#L33

Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup and in some cases only loads a subset of ActiveSupport, ActiveJob or ActiveRecord. We currently do not use ActiveSupport instrumentation in the monolith but we do use the current ActiveRecord instrumentation.

We would need to make changes to the existing AS Notifications gem in order to remove this dependency on Rails.

arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Dec 27, 2022
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Dec 27, 2022
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Dec 27, 2022
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
@arielvalentin
Copy link
Collaborator

I've opened this PR, which should address issues with using AS Notifications outside of Rails applications

#242

@arielvalentin
Copy link
Collaborator

cc: #269 #268

arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Jan 12, 2023
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Jan 13, 2023
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this issue Jan 14, 2023
This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See open-telemetry#218
arielvalentin added a commit that referenced this issue Jan 14, 2023
* fix: Drop Rails dependency for ActiveSupport Instrumentation

This makes it possible to use this instrumentation in non-Rails applications that leverage AS::Notifications.

In addition to that it mitigates a race condition where a subscription may be registered by a separate thread before obtaining a lock on the notifications object.

See #218

* chore: Update linter rule for Metrics/MethodLength

In most cases long methods are going to be OK since we do not want to minimize the amount of method dispatches we have to make in our instrumentations.

* fix: Linter errors

* squash: linter fix

* squash: fix whitespace
@ahayworth
Copy link
Contributor Author

Apologies for disappearing for awhile on this - I took a break from all work for a long time. On the upside, I'm fully refreshed and ready to work on things. 😄

Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup

@arielvalentin thank you for making changes to support this use case! It's exceedingly uncommon (as you and I both know), but an important one nonetheless. ❤️

Can AS::N support current_span ?

@tjefferson08 It depends ™️ primarily on when the AS::Notification is created. Many things in rails do something like this (really inaccurate pseudocode, apologies):

inside_notification(name: blah) do
  do_some_work
end

In that case, accessing current_span should work well. However, if the notification is generated completely after the thing it instruments, then accessing current_span would not work as expected. We'll need to figure out how widespread each variant is within Rails, and I don't know the answer to that off the top of my head. I'll add it to the todo list.

@ahayworth
Copy link
Contributor Author

@arielvalentin I took a look through our instrumentation, mapping things to AS::N calls. Basically, active_record is really the sticky one - I'm not sure how we'd be able to get things like User#find with precisely the same semantics as we have before.

I'm curious what you think, take a look at the updated checklist with some notes. Maybe this is something we should talk to upstream about?

@arielvalentin
Copy link
Collaborator

arielvalentin commented Jan 24, 2023

What did dd-trace-rb beelines, NewRelic, or signal-fx do?

@arielvalentin
Copy link
Collaborator

cc: #293

@ahayworth
Copy link
Contributor Author

It looks like they just subscribe to the two events - instantiate and sql - and create pretty basic spans around them. They don't create different span names depending on the different methods called (#create vs #destroy, etc) like we do: https://github.com/DataDog/dd-trace-rb/tree/a7c8aa5d81b67d136665dfb86d2345b783fb2290/lib/datadog/tracing/contrib/active_record

@arielvalentin
Copy link
Collaborator

I think that's ok. Low cardinality spans names and high cardinality tags I think is the better way to go.

How do others feel? @open-telemetry/ruby-contrib-approvers

@arielvalentin
Copy link
Collaborator

Looking at the notification events I see some interesting things at least for ActiveRecord:

https://guides.rubyonrails.org/active_support_instrumentation.html#active-record

Key Value OTel Description
:sql SQL statement db.statement Optional since we probably favor DB Driver
:name Name of the operation span.name per the example it likes like to would be Post Load
:connection Connection object net.* we can extract attributes about the DB itself
:binds Bind parameters N/A
:type_casted_binds Typecasted bind parameters N/A
:statement_name SQL Statement name ? Unclear if this would be helpful to record. Is this the Prepared Statement Name?
:cached true is added when cached queries used ? sql.cached rails.activerecord.query.cached ? Unclear if this should be custom

@ericmustin
Copy link
Contributor

@ahayworth @arielvalentin wrt span name not having the same semantics in ddtracerb I think what’s worth understanding is that the ddtracerb software works tightly coupled to the datadog agent, where a combination of obfuscation+Normalization occurs on the sql statement (attached in datadogs tracing system as span.resource ). This creates the same behavior of “low cardinality span resource high cardinality span tags” as opentelemetry’s “low cardinality span name high cardinality span attributes” approach, but since opentelemetry lacks any standard sql obfuscation+normalization, and we don’t take that approach currently anyway for span name, @ahayworth is correct that migrating to this style of instrumentation would effectively be a breaking change to our instrumentation output (although not according to semver2.0 since pre 1.0 software offers no such guarantees). Whether or not it’s worth that break, idk. Perhaps we could introduce this gracefully, so users could stay on the old style of instrumentation, or we could use the 1.0 as an opportunity to change span naming convention.

@arielvalentin
Copy link
Collaborator

Perhaps the best way to move forward is to catalog ActiveSupport notifications and create a mapping guide, which may help better inform our decisions:

https://docs.google.com/spreadsheets/d/1T7uYBewovl6YGo0uAsvKLLSWO0z_m6-rRJJtzoPb264/edit

Here's my 2 cents:

I'm personally not very concerned about the breaking chances for span names, especially in cases like where the names deviate from the specification, which I think will be a welcomed change (e.g. ActionPack).

I'm in favor of ripping the bandaid off and skipping adding backward comparability for these. If folks want the old span names and instrumentation I'd advocate that we tell them to pin to an earlier version.

@ahayworth
Copy link
Contributor Author

@arielvalentin I wasn't able to edit that spreadsheet, so I made a copy and updated it here (you should have edit access through your work account): https://docs.google.com/spreadsheets/d/1mObwGGdJtmeG2YCWCQxWEwwpBEOA-3EXb0WT83bWc8I/edit#gid=379636133

I haven't mapped any additional types yet, but I did finish pulling over all the events and info from the rails guides. Click the little expandy-thing by each group to see them all. I'll try to do some mapping tomorrow! 😄

@github-actions
Copy link
Contributor

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

@github-actions github-actions bot added the stale Marks an issue/PR stale label Apr 27, 2023
@arielvalentin arielvalentin added keep Ensures stale-bot keeps this issue/PR open and removed stale Marks an issue/PR stale labels May 5, 2023
@arielvalentin
Copy link
Collaborator

👋🏼 @jhawthorn The OTel Ruby instrumentation currently relies on monkey patching Rails libraries and I am going to start to work on migrating them to use Active Support Notifications.

While investigating next steps for this implementation I saw that this PR rails/rails#43390 removes start/finish methods from the Subscriber class, which is what we currently use in some instrumentations to know when to start and finish a span.

Here is an example of a custom subscriber that relies on those template methods: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/racecar/lib/opentelemetry/instrumentation/racecar/process_message_subscriber.rb

This is important because it allows us propagate the parent span context so that when we create additional child spans they are properly linked.

If I understand the changes correctly there will be backward compatibility so we will be able to continue to use the start and stop methods after Rails 7.1, is that correct?

Is there a plan to drop support for the those lifecycle methods in favor of something else?

@arielvalentin
Copy link
Collaborator

Hello friends: @robbkidd @robbkidd @chrisholmes @ahayworth

I started a bit of an internal refactoring (#641) to see what it would take to separate the logic of payload handling and the subscriber registration logic so that users would be able to customize the logic around creating spans, setting span attributes or adding events, etc... however this refactoring has left me with a more questions than answers.

Why did we choose to set the span name to this instead of using the name of the event given to the subscriber? name.split('.')[0..1].reverse.join(' ').freeze

Did we ever intend on supporting regular expressions for the subscriber like AS::Notifications supports?

Did we ever intend on adding support for Span Links or did we implicitly only ever want to use Parent-Child?

The payload does not include the context object and instead has a token. Is the only way for children to access the context is to use the Thread Local Current Context?

Do we envision a scenario ever where AS::Notifications switches to out of thread processing making the current thread context stack invalid?

@arielvalentin
Copy link
Collaborator

We have started moving forward with the work here and have completed the migration for active_job and action_pack.

Here is what we have come up with so far:

  1. Prefer Semantic Conventions whenever possible
  2. Use the Rails defined Event Name when semantic conventions do not exist, i.e no name.split('.')[0..1].reverse.join(' ').freeze

@arielvalentin
Copy link
Collaborator

arielvalentin commented Dec 17, 2023

@jhawthorn @bensheldon @composerinteralia I started looking into switching ActiveRecord instrumentation to use sql.active_record notifications but wanted to highlight what @ahayworth points out in the description of this issue:

sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.

The current instrumentation patches AR persistence and query methods that add non-semconv internal spans but as a result they capture time spend in callbacks.

Do you think Rails upstream would be amenable to adding notifications around method AS methods so we are able to capture some callback timings?

If not, then I think we have to stick with patching AR and use sql.active_record to enrich spans with information about whether or not the query was cached or if it was performed async.

The lower level DB drivers are the ones performing sanitization for non-AR users but we could potentially have some cost savings here if the bind parameters are able to give us a hint that the query is already sanitized and used that in lieu of obfuscating the db query.

Some bug reports we received about our current instrumentation come from the fact that some gems patch AR incorrectly and change the method signature of the AR public API. I know it's not something that we should worry about, but I also do not want the instrumentation to fall behind changes in Rails that would result in the same problem.

I am considering updating the instrumentation to use argument forwarding instead of redefining method signatures:

         # This is the current definition in the gem 
          def decrement!(attribute, by = 1, touch: nil)
            tracer.in_span("#{self.class}#decrement!") do
              super
            end
          end
         # using forwarding to mitigate method signature changes
          def decrement!(...)
            tracer.in_span("#{self.class}#decrement!") do
              super
            end
          end

Taking that a step further I am considering proposing lower cardinality span names with semconv attributes:

         # using forwarding to mitigate method signature changes
          def decrement!(...)
            tracer.in_span("active_record#decrement!", attributes: { "code.namespace" => self.class.name }) do
              super
            end
          end

What are y'alls thoughts on this?

@jhawthorn
Copy link

sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.

payload[:name] captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production). There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.

>> ActiveSupport::Notifications.subscribe("sql.active_record") {|_,_,_,_,payload| p payload[:name] }
>> User.first
"User Load"
>> user = User.create!
"TRANSACTION"
"User Create"
"TRANSACTION"
>> user.destroy!
"TRANSACTION"
"User Destroy"
"TRANSACTION"

@arielvalentin
Copy link
Collaborator

payload[:name] captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production).
There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.

Then in these cases using the name key may be enough for naming spans.

What about measuring time spent in AR callbacks? AFAICT sql.active_record timing does not include that.

@olepbr
Copy link
Contributor

olepbr commented Nov 20, 2024

I believe #1258 is another example of "monkey-patching is fraught with peril" =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request help wanted Extra attention is needed instrumentation keep Ensures stale-bot keeps this issue/PR open
Projects
None yet
Development

No branches or pull requests

9 participants