
Traces lost on upgrading to Ruby 3.1.0 & dd-trace 1.1.0 #2101

Closed
adrys-lab opened this issue Jun 22, 2022 · 31 comments

Labels: bug (Involves a bug), community (Was opened by a community member)
@adrys-lab commented Jun 22, 2022

Current behaviour
We have been working on upgrading from Ruby 2.6.5 to Ruby 3.1.0.

As part of that upgrade, we also had to bump the dd-trace gem from 0.54.1 to 1.1.0.

Once this was done successfully, it seems we have lost APM trace data for our service.

image

The lost traces are the incoming HTTP requests received by our service (rack-request).

Note in the picture the release time of Ruby 3.1.0 + ddtrace 1.1.0.

It is strange because we seem to have lost some API traces while others still appear (some /v3/first_resource traces appear, but /v3/another_resource traces don't). The same goes for many lost ActiveAdmin traces.

Note also that the error counts have dropped:
image

So in general, we suspect we are losing data for HTTP requests.

Meanwhile, in terms of logs we do not seem to have lost any data (even though the APM data for this controller has disappeared, for instance):
image

By contrast, Sidekiq, for instance, is working fine:
image

So, potentially, it might be related to our Rack or Rails configuration, but we can't see what the issue could be 🤔.

Expected behaviour
All HTTP requests are traced correctly in APM.

Before-upgrade configuration

  • Ruby: 2.6.5
  • dd-trace: 0.54.1

require './components/platform/base/graphql/schema.rb'
require File.expand_path('../../lib/datadog_tracer', __dir__)

require 'datadog/statsd'
require 'ddtrace'

Datadog.configure do |c|
  service = 'web'
  hostname = DatadogTracer.resolve_datadog_hostname
  c.tracer(
    env: Rails.env,
    priority_sampling: true,
    hostname: hostname,
    tags: DatadogTracer.tags,
    enabled: ENV['DATADOG_TRACER_ENABLED'].present?)

  c.runtime_metrics_enabled = true
  c.runtime_metrics(statsd: Datadog::Statsd.new(hostname, DatadogTracer.statsd_port))

  c.use(:rack)
  c.use(:rails, service_name: service, distributed_tracing: true)
  c.use(:sidekiq, analytics_enabled: true, service_name: "#{service}-sidekiq", client_service_name: "#{service}-sidekiq")
  c.use(:rake, analytics_enabled: true, service_name: "#{service}-rake", quantize: { args: { show: :all } })
  c.use(:aws, service_name: "#{service}-aws")
  c.use :redis, service_name: "#{service}-redis-cache"
  c.use :redis, describes: { host: ENV['REDIS_HOST'] }, service_name: "#{service}-redis-sidekiq"
  c.use(:elasticsearch, service_name: "#{service}-elasticsearch")
  c.use(:shoryuken, service_name: "#{service}-shoryuken")

  # web backend-graphql-api
  c.use :graphql, schemas: [::Platform::Base::GraphQL::Schema], analytics_enabled: true, service_name: ENV['BACKEND_GRAPHQL_API'].present? ? 'backend-graphql-api' : 'web'
end

Current configuration (issue)

  • Ruby: 3.1.0
  • dd-trace: 1.1.0

require './components/platform/base/graphql/schema.rb'
require File.expand_path('../../lib/datadog_tracer', __dir__)

require 'datadog/statsd'
require 'ddtrace'

Datadog.configure do |c|
  service = 'web'
  hostname = DatadogTracer.resolve_datadog_hostname
  c.env = Rails.env
  c.agent.host = hostname
  c.tags = DatadogTracer.tags

  # runtime
  c.runtime_metrics.enabled = true
  c.runtime_metrics.statsd = Datadog::Statsd.new(hostname, DatadogTracer.statsd_port)

  # tracing
  c.tracing.enabled = ENV['DATADOG_TRACER_ENABLED'].present?
  c.tracing.priority_sampling = true
  c.tracing.report_hostname = true
  c.tracing.log_injection = true
  c.tracing.partial_flush.enabled = true

  # instrumenting
  c.tracing.instrument(:action_mailer, service_name: "#{service}-action-mailer")
  c.tracing.instrument(:active_job, service_name: "#{service}-active-job")
  c.tracing.instrument(:active_record, service_name: "#{service}-active-record")
  c.tracing.instrument(:active_support, service_name: "#{service}-active-support")
  c.tracing.instrument(:aws, service_name: "#{service}-aws")
  c.tracing.instrument(:elasticsearch, service_name: "#{service}-elasticsearch")
  c.tracing.instrument(:mysql2, service_name: "#{service}-mysql2")
  c.tracing.instrument(:rack)
  c.tracing.instrument(:rails, service_name: service, distributed_tracing: true)
  c.tracing.instrument(:rake, analytics_enabled: true, service_name: "#{service}-rake", quantize: { args: { show: :all } })
  c.tracing.instrument(:redis, service_name: "#{service}-redis-cache")
  c.tracing.instrument(:redis, describes: { host: ENV['REDIS_HOST'] }, service_name: "#{service}-redis-sidekiq")
  c.tracing.instrument(:shoryuken, service_name: "#{service}-shoryuken")
  c.tracing.instrument(:sidekiq, analytics_enabled: true, service_name: "#{service}-sidekiq", client_service_name: "#{service}-sidekiq")

  # web backend-graphql-api
  c.tracing.instrument(:graphql, schemas: [::Platform::Base::GraphQL::Schema], analytics_enabled: true, service_name: ENV['BACKEND_GRAPHQL_API'].present? ? 'backend-graphql-api' : 'web')
end

Current issue Environment

  • ddtrace version:
    1.1.0
  • Ruby version:
    3.1.0
  • Operating system:
  • Relevant library versions:
    Rails 6.1.6
adrys-lab added the bug and community labels on Jun 22, 2022
adrys-lab changed the title from "Traces lost on upgrading to Ruby 3.1.0" to "Traces lost on upgrading to Ruby 3.1.0 & dd-trace 1.1.0" on Jun 22, 2022
@adrys-lab (Author)

Do you have any news regarding this issue? 🙏

@delner (Contributor) commented Jun 27, 2022

Thanks for the detailed report @adrys-lab!

We'll be happy to take a look at this soon. It may be related to one or two other issues already opened on the board, so it may be addressed when those are.

I'll try to reproduce this if I can from what you've given me, but if I can't, I may need a little more help getting some more information.

delner self-assigned this on Jun 27, 2022
@adrys-lab (Author)

Hi @delner, please let us know once you find a solution, as we are still facing this issue.

As far as I understand, it's a known issue?

And it will be solved in later gem versions?

@muzfuz commented Jul 5, 2022

Hello @delner, just want to chime in to say we've found the same thing under the same circumstances as above (an upgrade from 0.5.4 to 1.1). We've rolled back for now.

It seems like the core facet was also changed; this was for a Rails application.

image

@delner (Contributor) commented Jul 6, 2022

Metric graphs don't tell the whole story here: because they are computed on a very specific set of dimensions (service, span name, etc), it's very possible the traces are still being produced and ingested at appropriate rates while the counts have been assigned to some other dimension if the service name changed. Basically, there are many possible explanations for a dip in this graph.

We need something more concrete. I need to see either that traces are flowing out of the Ruby tracer itself in smaller numbers, or that there is a difference in the metadata (different tag values).

We can (and will) test the core parts of the tracer more generally on our end, but if this is happening specifically in Rails apps, it may be difficult for us to accurately recreate the conditions under which this occurred in your application. Without more to go on, it will be much harder to fix your particular problem.

@muzfuz commented Jul 8, 2022

Cheers for getting back to this, let me try to provide more info.

First off, here's a graph of the overall trace ingestion rate for this same time period. This is taken from the default "APM Traces Estimated Usage" dashboard; I believe it's the datadog.estimated_usage.apm.ingested_traces metric. As you can see, there's a very clear drop when we rolled out the upgrade, and then an increase when we rolled back. (The remaining trace data belongs to other services in our system.)

image

Yes, you're right about the dimensions question; it's also something that occurred to us. First, we've verified that the service name did not change, but we did see something odd.

Below is the same graph for the Rails app, where the primary operation name is rack.request.

image

One thing I noticed after upgrading is that the primary operation name changed to rails.render_template and I had to toggle back to rack.request manually to see them. This is what rails.render_template looked like for the same time. Three things worth noting:

  • There was not a significant jump in this operation measure
  • It seems to have gotten slower
  • Our Rails app runs in API mode, meaning we do not render a significant amount of Rails templates - it's almost all JSON endpoints, so switching the primary operation like this was a bit odd.

image

image

I have flipped through the other operations available to us, and the traces did not show up there.

image

The upshot here is that those traces were definitely gone (rather than moved elsewhere), and we were seeing some strange behaviour with primary operations etc. I hope this helps clarify this, but I'm happy to provide more info if needed.

@delner (Contributor) commented Jul 11, 2022

Thanks for the detailed graphs and your continued patience on this one. I want to get to the bottom of it.

Looking over your configuration again, there are two things I want to evaluate further:

  1. I saw c.tracing.partial_flush.enabled = true was set in 1.1.0, whereas it wasn't in 0.54.1. If possible, I would recommend turning this feature off. I have reason to suspect it may be interfering with metrics, and I'd like to see what kind of effect, if any, it has.
  2. It looks like Rake instrumentation is on: I would also disable this for now. It works well for short-lived tasks, but it's been known to absorb traces from long-running tasks (such as ones that start web processes), thereby making traces look as if they disappeared when in fact they are being buffered. Perhaps this was mitigated by the activation of partial flushing.

Generally speaking, if the concern is inconsistency, then I'd like to strike feature parity between 0.x and 1.x where possible, and draw comparisons. Are you able to deploy these changes to some kind of staging or non-production environment? If you're able to share code with me so I can replicate and run it on my end, that would be even better.
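For reference, those two changes would look roughly like this against the 1.x configuration posted above (a sketch, not a drop-in replacement):

Datadog.configure do |c|
  # 1. Turn partial flushing back off while comparing metrics between versions.
  c.tracing.partial_flush.enabled = false

  # 2. Temporarily skip the Rake instrumentation, i.e. remove (or comment out) this line:
  # c.tracing.instrument(:rake, analytics_enabled: true, service_name: "#{service}-rake", quantize: { args: { show: :all } })
end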

@adrys-lab (Author)

Hi @delner, @muzfuz, sorry for the late answer, I was out on holiday.

After reading your responses: on our side, the number of render_template operations appears to be the same before and after the version upgrade.

Let's take one controller as an example and analyse its operations, comparing before the upgrade vs. current.

Controller Name
CallsController

rack.request
image

rails.action_controller
image

As a hint, note that we have several monitors alerting us about NO DATA for several controllers:

image


@delner, if we disable the Rack instrumentation, under which operation should the controller requests appear?

I can also test disabling partial_flush.

@adrys-lab (Author) commented Aug 1, 2022

Would it be useful to send you the OLD log tags + the NEW log tags?

I cannot send you trace tags because, as you can deduce, I cannot find the new ones for some controllers... 😞

@adrys-lab (Author)

Hi, good morning. Disabling partial_flush changed nothing.

And we are hesitant to disable the Rack instrumentation, since we don't know whether we would lose all traces or under which operation they would appear 🤔

@delner (Contributor) commented Aug 18, 2022

@adrys-lab I keep re-reading these graphs/screenshots, but too much context has been stripped out of them to make sense of what's going on. :/ We need to filter out the noise and create some clarity.

Let's set aside all the previous suggestions (e.g. partial flushing).

I think we need to examine this from two angles:

  1. Get a clean, simplified comparison in the UI without all the extra dimensions (no specific resources, different spans, etc...) to verify the most basic metric counts are what we expect. Make sure the trace structure is identical as well.
  2. Stress test the tracer end-to-end, verify no spans are dropped.

Regarding (2), we've already conducted some preliminary stress testing (on version 1.2.0), which showed virtually no traces being dropped internally between the Ruby tracer and the Datadog agent. This leads me to believe there's no loss of data. However, I'm working on another test to double-check that all traces come through to the UI. It's still possible there's some kind of metric miscomputation or sampling issue in the agent.

@adrys-lab I'm going to need more of your help on (1), as I can't replicate your problem locally. What I would like to see is just the aggregate trace count for a single service whose underlying spans have not changed (no new spans, service names unchanged, etc.) for both 0.54.2 and 1.3.0. Ideally I could see these graphs side by side, along with sample traces for each demonstrating that the trace structure has not changed.

I'll get back to you regarding the results of my end-to-end tests.

@delner (Contributor) commented Aug 18, 2022

I ran my end-to-end tests locally. Both ddtrace 0.54.2 and 1.3.0 look correct: 100% of traces are sent, received, and displayed on the request graph. I have no reason to suspect that the tracer or agent, even under high load, are dropping traces.

Assuming what we both observed of our respective applications is accurate, there are a few possible explanations:

  1. Service/resource/operation name changes between 0.x and 1.x moved requests between different graphs. They disappeared from one, appeared on another. If this is the case, then it's likely a misunderstanding of what should be expected in the UI. I think this is the most likely.
  2. Some instrumentation or setting activated in your application is misbehaving, either generating a trace in an unexpected way such that it doesn't consistently generate hits on the graph you're looking at...
  3. ...instrumentation is sometimes not generating traces at all. (Seems very unlikely.)

Either way, there doesn't appear to be any issue with trace submission. I think the most likely possibility is (1). @adrys-lab I think we should take a closer look at the graphs you have, together, so we can ensure the requests didn't simply move somewhere else.

I'd also invite you to try a test like mine below for your own app: feeding it a specific number of requests in a controlled environment, checking to see that all requests appear in the graph between both 0.x and 1.x. If you could show a drop in the graph again, then it would be reasonably likely there's something going on in the activated instrumentation.
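For illustration, a stripped-down version of that kind of test might look roughly like this (a sketch only: the acme-rack service name, the endpoint, and the request count are placeholders, and it assumes the ddtrace 1.x Rack middleware class):

require 'ddtrace'
require 'rack'
require 'rack/test'
require 'rspec/autorun'

Datadog.configure do |c|
  c.service = 'acme-rack'          # placeholder service name
  c.tracing.instrument(:rack)
end

# Bare Rack app with the 1.x trace middleware mounted explicitly.
ACME_APP = Rack::Builder.new do
  use Datadog::Tracing::Contrib::Rack::TraceMiddleware
  run ->(_env) { [200, { 'Content-Type' => 'text/plain' }, ['OK']] }
end.to_app

RSpec.describe 'tracer stress test' do
  include Rack::Test::Methods

  def app
    ACME_APP
  end

  it 'produces one rack.request trace per request' do
    10_000.times { get '/' }       # fixed, known request count
    expect(last_response).to be_ok
    # Then compare 10K against the service "Requests" graph and the
    # trace.rack.request.hits metric for both 0.x and 1.x.
  end
end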


Here are the results from my testing:

Test setup

Sample acme-rack trace

stress-tracer-1 3-trace

Rack application endpoint

stress-tracer-1 3-endpoint

RSpec stress test

stress-tracer-1 3-test

ddtrace 0.54.2 results

Request graph for acme-rack service (10K)

stress-tracer-0 54 2-10k-graph

Request graph for acme-rack service (100K)

stress-tracer-0 54 2-100k-graph-default-sampling

ddtrace 1.3.0 results

Request graph for acme-rack service (10K)

stress-tracer-1 3-10k-graph

Request graph for acme-rack service (100K)

stress-tracer-1 3-100k-graph

@zackse (Contributor) commented Aug 19, 2022

Hi @delner, thanks for your continued investigation on this. Can you confirm whether your test apps show the expected/matching counts for metrics as well as traces?

Not to interrupt this thread, but FWIW here are some additional data points from our setup. I apologize if I am mixing up the terminology, but to summarize our observations:

  • We upgraded from 0.52.0 to 1.3.0
  • We use distributed tracing (microservices calling others)
  • The Request counts shown on the APM/traces page are similar with both 0.52.0 and 1.3.0
  • The log counts (for a search string matching the completion of a request) are similar with both versions
  • The "Requests and Errors" counts on the APM/services page in the Service Summary section has a significant drop-off after upgrade
  • The metrics for trace.rack.request.hits, trace.rails.action_controller.hits, trace.rails.render_template.hits all have a significant drop-off after upgrade
  • We tried disabling priority sampling (forcing 100% ingestion, roughly as sketched below), but that didn't have any impact on the lower service request and metric counts
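A simplified sketch of that forced-ingestion setting (an approximation, not our exact configuration):

Datadog.configure do |c|
  # Keep every trace on the client side, so any drop in hit metrics
  # cannot be attributed to client-side sampling.
  c.tracing.sampling.default_rate = 1.0
end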

It sounds like you may be suggesting that some spans (and metrics?) were either relocated or are labeled differently such that they do not appear in the same place before/after upgrade.

@adrys-lab (Author) commented Aug 19, 2022

Same for us:

Traces dropped, while the log count is the same as before.

The metrics trace.rack.request.hits and trace.rails.action_controller.hits show a significant drop too.

If the service name or another tag had changed, we would not still be seeing the previous traces from before the configuration change.
Is there a way to give you access to our infrastructure for a deeper look? Or do we need to open a helpdesk ticket with Datadog?

Screenshot 2022-08-19 at 07 08 55

Screenshot 2022-08-19 at 07 09 18

@adrys-lab (Author)

If we go one layer above our Puma server, into the AWS infrastructure, all the metrics are unchanged: the AWS Envoy proxy still reports the same numbers.

So the traces/metrics are being lost on the application side.

image

@adrys-lab (Author) commented Aug 19, 2022

Our puma.requests.count metric, though, seems either flat or a bit higher (which could be due to an increase in traffic volume).

image

@delner (Contributor) commented Aug 19, 2022

@zackse

Can you confirm whether your test apps show the expected/matching counts for metrics as well as traces?

Yes, the metrics are correct as well. They are derived from the same source. Here's the equivalent graph in the metrics explorer (100K of 100K hits on 1.3.0):

stress-tracer-1 3-100k-request-hits-metric

The Request counts shown on the APM/traces page are similar with both 0.52.0 and 1.3.0

This is good, and would support my findings that there is no loss of requests or data.

The metrics for trace.rack.request.hits, trace.rails.action_controller.hits, trace.rails.render_template.hits all have a significant drop-off after upgrade

This is interesting, and still warrants further investigation. It's possible that, if the service name on some of the spans changed between versions, you would expect to see these hits recategorized under another graph. The drop in the metric alone is not sufficient to draw any conclusion. I will need more contextual detail about what data points the graph is summing up and how the underlying data points changed to make sense of it. I suspect trace.rails.action_controller.hits and trace.rails.render_template.hits drop because those are no longer top-level or measured spans (if they are part of the main service now).


The best test would be your app, in an isolated environment, throwing a fixed number of requests (10K/100K) at an endpoint that experienced the drop in hits, comparing the service "requests" graph, the "hits" graph, and example traces against their respective counterparts in 0.54.2 and 1.3.0. That would rule instrumentation or configuration in or out as the issue.

@delner (Contributor) commented Aug 19, 2022

@adrys-lab

If the service name or another tag had changed, we would not still be seeing the previous traces from before the configuration change.

I think there is a case where you would, if spans were inconsistently top-level or measured because of the relative shape of the trace.

Is there a way to provide you access to our infrastructure and have a deeper look ? or we need to open a helpdesk ticket in DD ?

Yes, I would open a ticket and have them escalate it to APM Ruby. You can say David Elner (Ruby APM engineer) requested that you open it and will take it up. Hopefully that should expedite things.

@zackse (Contributor) commented Aug 26, 2022

Hi @delner, thank you for your response and investigation.

We were able to replicate the problem where certain metrics/spans go missing in an isolated/test environment. It seems related to distributed tracing, where one service calls another: the spans/metrics for the downstream service (running ddtrace 1.3.0) are being recorded intermittently and the volume is significantly lower when the service is called indirectly.

Given a service A with ddtrace 0.52.0 that calls service B:

  • Service B w/ ddtrace 0.52.0: a trace where the top-level span is a service A resource includes child spans for service B, and service B's metrics are incremented accordingly
  • Service B w/ ddtrace 1.3.0: a trace for the same service A resource does not include spans for service B, and service B's metrics are not incremented

The 1.3.0 trace view does list logs for service B, and as discussed previously, the "request count" shown on the APM/traces page for service B matches before/after upgrade.

So it seems the logs from the two services are being tracked/correlated correctly, but the spans/metrics from service B have disappeared when service B is 1) upgraded to 1.3.0 and 2) downstream of another service.

When calling service B directly (a service B resource is the top-level of the trace) after upgrade to 1.3.0, however, spans/metrics are visible as before.

The behavior sounds similar to issue #2191, but we are not using OpenTracer.
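For context, the shape of the setup is conceptually like this (a simplified sketch with placeholder service names and an assumed Faraday-based HTTP client, not our real configuration):

require 'ddtrace'

# In service A's (caller's) initializer: instrument the outbound HTTP client
# so the trace context is propagated downstream via the x-datadog-* headers.
Datadog.configure do |c|
  c.service = 'service-a'
  c.tracing.instrument(:faraday, distributed_tracing: true)
end

# In service B's (callee's, the one upgraded to 1.3.0) initializer: the
# Rack/Rails integrations read the propagated headers, so B's spans should
# join A's trace rather than start a new one.
Datadog.configure do |c|
  c.service = 'service-b'
  c.tracing.instrument(:rack)
  c.tracing.instrument(:rails, distributed_tracing: true)
end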

@zackse (Contributor) commented Aug 27, 2022

Here is the trace flame graph before upgrading the downstream service to 1.3.0:

Screen Shot 2022-08-26 at 5 15 14 PM

After:
Screen Shot 2022-08-26 at 5 12 23 PM

@delner (Contributor) commented Aug 30, 2022

@zackse I took a bit of time to try to recreate this in our rails-six integration app (using Faraday), and was able to reproduce the issue.

As you said, it does appear to happen intermittently. However, the spans in the second half are not lost; they end up in their own, separate trace, which may make them appear as if they were lost.

I don't have an explanation for why this is happening yet, however now that I've been able to reproduce it locally, I'm confident we should be able to find an answer quickly. Thanks for sharing the tip about distributed tracing... I think this is the clue we needed to make progress on this. I'll keep you posted on my findings.

@zackse (Contributor) commented Aug 30, 2022

Thanks, @delner. We really appreciate the continued work on this. 👍

@adrys-lab (Author) commented Sep 8, 2022

Desperate-level-10 question:

Would adding the Rack middleware

Datadog::Contrib::Rack::TraceMiddleware

to the application bootstrap solve anything? 🤔

cc @delner

@delner (Contributor) commented Sep 15, 2022

@adrys-lab No, I don't believe so.

@delner (Contributor) commented Sep 15, 2022

Update: after working more closely with @zackse and his team, we think we've found the issue. It is related to how sampling is applied to Ruby apps on the receiving end of a distributed trace: they aren't appropriately sending spans to the agent for metric computation.

We have a few good ideas for how to fix this, and we expect to release a new version with the fix soon (likely no later than next week). I'll post further updates here as we have them!

@adrys-lab (Author)

Hi @delner, would that fix the first issue mentioned in this thread?

Thanks a lot for taking a close look and providing help; thanks @zackse!

I'll keep this issue bookmarked to track the fix version 🙏

@marcotc (Member) commented Sep 15, 2022

👋 @adrys-lab @zackse @muzfuz, we have a prerelease gem published that will fix this issue and we'd like anyone experiencing this problem to try it out.

The build is based on 1.4.0 and has minimal changes, so the risk is very small if you are running 1.4.0 today. We plan on releasing it as 1.4.1 if we can confirm that it fixes the issue in real deployment settings.

To use this prerelease, replace gem 'ddtrace' in your Gemfile with this snippet instead:

source 'http://gems.datadoghq.com/prerelease-v2' do
  gem 'ddtrace', '1.4.0.fix.distributed.priority.sampling.269950'
end

I suggest running this in staging first if possible, before deploying it to production.
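A quick way to double-check which build is actually loaded after bundling (just a sanity check using standard RubyGems APIs):

# From a console in the environment running the prerelease:
puts Gem.loaded_specs['ddtrace'].version
# expected: 1.4.0.fix.distributed.priority.sampling.269950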

@lyricsboy

Thanks @marcotc! I have confirmed that this prerelease behaves similarly to our 🐒-patched version in my candidate app in production. I see the correct trace.rack.request metrics and spans as expected.

@marcotc (Member) commented Sep 15, 2022

🙌 thank you so much @lyricsboy!

marcotc added this to the 1.4.1 milestone on Sep 15, 2022
@marcotc (Member) commented Sep 15, 2022

Version 1.4.1 has been released with this fix.

Thank you all for your patience and cooperation in solving this issue. 🙇

If the problem still persists please let us know!

marcotc closed this as completed on Sep 15, 2022
@adrys-lab (Author)

Traces are back!!!

I really appreciate the dedication and follow-up actions, as well as the patience here!

Thank you to everyone involved 🙏

image
