Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sidekiq integration can't be combined with other integrations #535

Closed
kevinnio opened this issue Jun 30, 2023 · 21 comments
Closed

Sidekiq integration can't be combined with other integrations #535

kevinnio opened this issue Jun 30, 2023 · 21 comments
Labels
bug Something isn't working stale Marks an issue/PR stale

Comments

@kevinnio
Copy link
Contributor

kevinnio commented Jun 30, 2023

Hi. I'm new to opentelemetry-ruby so I may be missing something. I tried to make this work for a while, I looked around in the repo's issues, everything to no avail so now I'm asking here.

Whenever I include the sidekiq instrumentation in my configure block inside my rails app spans from other instrumentations are no longer generated, or at least not visible in the console anymore.

Let's ilustrate the issue by starting with only PG:

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'MyApp'
  c.use 'OpenTelemetry::Instrumentation::PG'
end

OTEL_TRACES_EXPORTER is set to console.

# app/models/my_model.rb
class MyModel < ApplicationModel
end

# app/jobs/my_job.rb
class MyJob 
  include Sidekiq::Job
  
  def perform
    MyModel.find_or_create_by!(id: 1)
  end
end

# in a rails console
MyJob.perform_async

Let's say I boot up sidekiq. As expected, I can see a bunch of PG traces in the console.

Now, let's add the Sidekiq:

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'MyApp'
  c.use 'OpenTelemetry::Instrumentation::PG'
  c.use 'OpenTelemetry::Instrumentation::Sidekiq'
end

Restart sidekiq and run the same job again. Now, only one single Sidekiq trace is visible in the console. The PG ones are missing. I expected to see one parent Sidekiq span and a bunch of PG child ones.

I tried with multiple instrumentations, not only PG. So far, this always happens.

I created an example app so you can replicate the issue: https://github.com/kevinnio/opentel-test. Please let me know if you need additional info.

bundle info:

* opentelemetry-sdk (1.2.1)
* opentelemetry-exporter-otlp (0.25.0)
* opentelemetry-instrumentation-pg (0.25.2)
* opentelemetry-instrumentation-sidekiq (0.24.1)
* rails (7.0.6)
* sidekiq (7.1.2)
* pg (1.5.3)
@kevinnio kevinnio added the bug Something isn't working label Jun 30, 2023
@arielvalentin
Copy link
Collaborator

Thanks for creating the repo to help recreate this issue!

I do not have much experience using Sidekiq and I was wondering if you see this problem when running the sample application and export the spans to local collector. Do you see the spans then?

Here is an example to get you started:

If you see all of the instrumentation spans being exported to the collector and then available in a backend like zipkin then it narrows down the problem to an issue with the console exporter. If not then I think we have to dig in to figure out what's missing or not working when your demo app registers in the individual instrumentations.

@kevinnio
Copy link
Contributor Author

Thanks for the quick reply!

I was in fact working with a local Grafana Tempo instance when I noticed the missing spans. I just used the console exporter to simplify the explanation. For good measure, I just tested with zipkin. I still see the same behavior: all spans dissapear except for the Sidekiq one. See screenshots attached.

Only PG instrumentation enabled:

image




Both PG and Sidekiq enabled:

image
image

I also noticed some warnings in the terminal running sidekiq when both instrumentations are enabled. Maybe related?

2023-06-30T04:52:21.595Z pid=1153 tid=c7t class=MyJob jid=18ca6a30e3a9f49d63db4bb4 INFO: start
W, [2023-06-29T22:52:21.721426 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.721612 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.721895 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.722923 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.723173 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.723351 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.723541 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.723712 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.724637 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.725160 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.725792 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.726269 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.726457 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.727614 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.727884 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.729123 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.729349 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.729638 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.731097 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.732190 #1153]  WARN -- : Calling finish on an ended Span.
W, [2023-06-29T22:52:21.735648 #1153]  WARN -- : Calling finish on an ended Span.
2023-06-30T04:52:21.735Z pid=1153 tid=c7t class=MyJob jid=18ca6a30e3a9f49d63db4bb4 elapsed=0.14 INFO: done

@arielvalentin
Copy link
Collaborator

Thanks for giving it a shot and reporting the results.

Looking at the example it occurred to me we actually did not have any automated tests against sidekiq 7 and I believe there are some incompatibilities in our latest instrumentation that needed to be fixed: #527 (comment)

I am surprised that you are getting any spans at all right now. Would you mind downgrading to sidekiq 6 and confirm that it works?

@kevinnio
Copy link
Contributor Author

kevinnio commented Jun 30, 2023

Just tested with both sidekiq 6.5.9 and 5.2.7. Same behavior.

I'm now working on a test outside of rails just to see if that's related.

@kevinnio
Copy link
Contributor Author

No luck. Exact same behavior.

https://github.com/kevinnio/opentel-por-test

@arielvalentin
Copy link
Collaborator

@plantfansam are y'all using sidekiq at Shopify? Are y'all experiencing this behavior?

@arielvalentin
Copy link
Collaborator

One more request for you @kevinnio. If you use the all instrumentation does it still happen?

@arielvalentin
Copy link
Collaborator

@chrisholmes shared some his experience working with this instrumentation:

2 hours ago
I might have run into this problem recently as we recently instrumented a sidekiq app but with different dependencies. I couldn’t get job spans to show until I enabled one of the extra instrumentation options :trace_processor_process_one.
However, we did also enable a different propagation option that isn’t in the issue you’ve linked.
I got it to work with:

'OpenTelemetry::Instrumentation::Sidekiq' => {
      propagation_style: :child,
      span_naming: :job_class,
      trace_processor_process_one: true
    }

I will have to look into those options to see what they do but it looks like the default configuration may be following the messaging specification and creating new root spans, however the fact that you are getting warnings saying the Span#finish is being invoked multiple times makes me think there is a bug in the instrumentation where it is not detaching the current span context correctly.

@kevinnio would you be interested in continuing to dig into this issue and potentially submitting a PR for a fix?

@stevenharman
Copy link
Contributor

stevenharman commented Jun 30, 2023

We are using Sidekiq (7.1.1), but are on an older version of the sidekiq instrumentation (0.23.0). We see the Sidekiq spans make it through to Honeycomb, and then see a bunch of PG, Faraday, ActiveRecord, etc… spans as child spans of the root Sidekiq span.
We are using the default configs, which means link propagation.

I'm going to upgrade our Sidekiq (we also run Sidekiq Enterprise/Pro, though I'm not sure that's got anything to do with it) to the latest and see if we're still working.

UPDATE. I upgraded to the latest Sidekiq releases (7.1.2, Pro: 7.1.3, Ent: 7.1.1) and I'm still seeing root Spans for Sidekiq, with lots of child spans for PG, NetHTTP, etc…

@kevinnio
Copy link
Contributor Author

kevinnio commented Jun 30, 2023

@stevenharman Thanks for the input. Just tested with 0.23.0 in my example app but sadly, the issue is still there.

@arielvalentin Sure. I'll give it a try.

@kevinnio
Copy link
Contributor Author

kevinnio commented Jul 1, 2023

Found the issue. The issue starts here:

Root cause

When :trace_processor_process_one is false (default behavior), the code will do the following in this order:

  1. Insert an untraced entry into Context.current. This means nothing will be traced for the duration of this block.
  2. Run Sidekiq::Processor#process_one through calling super, which actually executes the job.
  3. Remove the untraced entry.

This results in nothing inside the job being traced. That's why those child spans are missing.

Fixes

Now, one fix is setting trace_processor_process_one: true because that will prevent the untraced entry from being inserted into the context, but it produces a bunch of sidekiq::processor#process_one traces every time the method is called, even if no job is processed. I don't find them particularly useful, specially compared with the full job trace.

image

So, another possible fix would be to change the code and replace the patch for #process_one with this one

def process_one
  if instrumentation_config[:trace_processor_process_one]
    attributes = {}
    attributes[SemanticConventions::Trace::PEER_SERVICE] = instrumentation_config[:peer_service] if instrumentation_config[:peer_service]
    tracer.in_span('Sidekiq::Processor#process_one', attributes: attributes) { super }
  else
-   OpenTelemetry::Common::Utilities.untraced { super }
+   super
  end
end

This way everyone can have full job traces and if they wish to enable :trace_processor_process_one they can still do so.

What do you think of my suggestion? Should I submit a PR right away?

@arielvalentin
Copy link
Collaborator

Thanks for tracking that down. I don't know enough about this instrumentation to know why it's implemented this way and why that's the default.

I'll take a look at the git history and conversations in the PR to see if I am able to understand how we got here.

@kevinnio
Copy link
Contributor Author

kevinnio commented Jul 2, 2023

Forgot to mention that line is also related to those Calling finish on an ended Span warnings.

There's a check for an untraced entry here and when true SDK::Tracer#start_span will always return the current span instead of creating a new one, which in this case is the Sidekiq root span.

Looks like at this point the code is expecting no current span to exist so it will return nil, but since #start_root_span inserts one anyway it returns that instead. Possibly a bug? This will only happen when calling #start_root_span inside an untraced block.

@bolandoboy
Copy link

This is biting us in a different way (#544), but it all seems to come down to the untraced process_one span.

May I suggest to kill 2 birds with one stone by removing that span completely?

  1. Code and configuration would be simplified.
  2. If Sidekiq is instrumented as a messaging system, which seems to be the case because spans are created with messaging.* attributes, it will comply better with the new semantic conventions from this OTEP if a single deliver span is created on the consumer side.

@arielvalentin
Copy link
Collaborator

Thank you all for taking the time to investigate these issues and propose solutions.

I don't have any good feedback for y'all since I am not a Sidekiq user but given that you both have invested time in tracking down the issues I want to encourage you all to open PRs that changes the functionality to conform to messaging OTEPs.

If y'all could add appraisals for Sidekiq 7, that would also be greatly appreciated.

I will prioritize reviewing the changes as soon as they are in.

Thank you!

@kevinnio
Copy link
Contributor Author

kevinnio commented Jul 5, 2023

Thanks, @arielvalentin.

@monoviolento so we're going with completely removing the sidekiq::processor#process_one span, right? That would mean we can actually get rid of the whole patch for that method. I can submit a PR for that. If we do remove the patch, we'll also need to update the instrumentation settings and tests. I'll also try to add the appraisals for Sidekiq 7 myself.

@arielvalentin
Copy link
Collaborator

Thank you @kevinnio !

kevinnio added a commit to kevinnio/opentelemetry-ruby-contrib that referenced this issue Jul 6, 2023
According to the discussion in open-telemetry#535, we can better match spec by
avoiding the `sidekiq::processor#process_one` span altogether. This
fixes the issues described in open-telemetry#535 and open-telemetry#544.
@kevinnio
Copy link
Contributor Author

kevinnio commented Jul 6, 2023

Here are my PRs:

@arielvalentin
Copy link
Collaborator

Thank you! If it's not too much trouble could you add a screen shot of the instrumentation in action without process_one to the PR?

@kevinnio
Copy link
Contributor Author

kevinnio commented Jul 6, 2023

Screenshots added.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 6, 2023

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

@github-actions github-actions bot added the stale Marks an issue/PR stale label Aug 6, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale Marks an issue/PR stale
Projects
None yet
Development

No branches or pull requests

4 participants