Traces lost on upgrading to Ruby 3.1.0 & dd-trace 1.1.0 #2101
Do you have any news regarding this issue? 🙏
Thanks for the detailed report @adrys-lab! We'll be happy to take a look at this soon. It may be related to one or two other issues already opened on the board, so it may be addressed when those are. I'll try to reproduce this if I can from what you've given me, but if I can't, I may need a little more help getting some more information.
Hi @delner, please let us know once you find a solution, as we are still facing this issue. As far as I understand, it's a known issue and will be solved in later gem versions?
Hello @delner just want to chime in to say we've found the same thing under the same circumstances as above (upgrade from 0.5.4 to 1.1). We've rolled back for now. It seems like the core facet was also changed - this was for a Rails application.
Metric graphs don't tell the whole story here: because they are computed on a very specific set of dimensions (service, span name, etc), it's very possible the traces are still being produced and ingested at appropriate rates while the counts have been assigned to some other dimension if the service name changed. Basically, there are many possible explanations for a dip in this graph. We need something more concrete. I need to see that traces are outflowing from the Ruby tracer itself either in smaller numbers, or with a difference in the metadata (different tag values). We can (and will) test the core parts of the tracer more generally on our end, but if this is happening specifically in Rails apps, it may be difficult for us to accurately recreate the conditions under which this occurred in your application. Without more to go on, it will be much harder to fix your particular problem.
Cheers for getting back to this, let me try to provide more info. First off, here's a graph of the overall trace ingestion rate for this same time period, taken from the default "APM Traces Estimated Usage" dashboard.
Yes, you're right about the dimensions question; it's something that occurred to us too. Firstly, we've verified that the service name did not change, but we did see something odd. Below is the same graph for the Rails app, filtered to the primary operation name. One thing I noticed after upgrading is that the primary operation name itself changed.
I have flipped through the other operations available to us, and the traces did not show up there. The upshot here is that those traces were definitely gone (rather than moved elsewhere), and we were seeing some strange behaviour with primary operations etc. I hope this helps clarify this, but I'm happy to provide more info if needed.
Thanks for the detailed graphs and your continued patience on this one. I want to get to the bottom of it. Looking over your configuration again, there are two things I want to evaluate further: the Rack instrumentation and partial flushing.
Generally speaking, if the concern is inconsistency, then I'd like to strike feature parity between 0.x and 1.x where possible and draw comparisons. Are you able to deploy these to some kind of staging or non-production environment? If you're able to share code with me so I can replicate and run it on my end, that would be even better.
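For reference, here's a rough sketch of what I mean by parity on the 1.x side (service name and options here are placeholders; adjust to whatever your actual initializer enables):

```ruby
# config/initializers/datadog.rb — illustrative sketch only, not your exact setup.
# On 0.54.x the equivalent would have been `c.use :rails` plus whatever tracer
# options you set; the goal is to enable the same pieces on both versions.
Datadog.configure do |c|
  c.service = 'my-service'                 # placeholder service name
  c.tracing.instrument :rails              # Rails integration (pulls in Rack)
  c.tracing.partial_flush.enabled = false  # toggle to compare with/without partial flushing
end
```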
Hi guys @delner @muzfuz, sorry for the late answer, I was out for the holidays. After reading your responses, it seems that for us the number of render_template operations is the same before and after the version upgrade. Let's take one controller as an example and analyse its operations, comparing before the upgrade vs. current. As a hint, you can check that we have several monitors alerting us about NO DATA for several controllers. @delner, if we disable the Rack instrumentation, under which operations should the controller requests appear? I can also test disabling partial_flush.
Would it be useful to send you the old log tags plus the new log tags? I cannot send you trace tags because, as you deduced, I cannot find the new ones for some controllers ... 😞
Hi, good morning. We have tried disabling it, and we are not sure whether we should disable the instrumentation for that integration.
@adrys-lab I keep re-reading these graphs/screenshots, but they are stripped of too much context to make any sense of. :/ We need to filter out the noise and create some clarity. Let's set aside all the previous suggestions (e.g. partial flushing); I think we need to examine this from two angles:
Regarding (2), we've already conducted some preliminary stress testing (on version 1.2.0), which showed virtually no traces being dropped internally between the Ruby tracer and the Datadog agent. This leads me to believe there's no loss of data. However, I'm working on another test to double-check that all traces come through to the UI. It's still possible there's some kind of metric miscomputation or sampling issue in the agent. @adrys-lab I'm going to need more of your help on (1), as I can't replicate your problem locally. What I would like to see is just the aggregate trace count for a single service whose underlying spans have not changed (no new spans, service names unchanged, etc.) for both 0.54.2 and 1.3.0. Ideally I could see these graphs side by side, along with sample traces for each demonstrating the trace has not changed. I'll get back to you regarding the results of my end-to-end tests.
Ran my end-to-end tests locally with both ddtrace versions. Assuming what we both observed of our respective applications is accurate, there are a few possible explanations:
Either way, there doesn't appear to be any issue with trace submission. I think the most likely possibility is (1). @adrys-lab I think we should take a closer look at the graphs you have, together, so we can ensure the requests didn't simply move somewhere else. I'd also invite you to try a test like mine below for your own app: feeding it a specific number of requests in a controlled environment and checking that all requests appear in the graph on both 0.x and 1.x. If you could show a drop in the graph again, then it would be reasonably likely there's something going on in the activated instrumentation. Here are the results from my testing:
Test setup:
Sample:
Hi @delner, thanks for your continued investigation on this. Can you confirm whether your test apps show the expected/matching counts for metrics as well as traces? Not to interrupt this thread, but FWIW here are some additional data points from our setup. I apologize if I am mixing up the terminology, but to summarize our observations:
It sounds like you may be suggesting that some spans (and metrics?) were either relocated or are labeled differently such that they do not appear in the same place before/after upgrade.
Same for us: traces dropped, and the log count is the same as before.
Yes, the metrics are correct as well. They are derived from the same source. Here's the equivalent graph in the metrics explorer (100K of 100K hits on 1.3.0):
This is good, and would support my findings that there is no loss of requests or data.
This is interesting, and still warrants further investigation. It's possible that, if the service name on some of the spans changed between versions, you would see these hits recategorized under another graph. The drop in the metric alone is not sufficient to draw any conclusion; I will need more contextual detail about which data points the graph is summing up and how the underlying data points changed to make sense of it. The best test would be your app, in an isolated environment, throwing a fixed number of requests (10/100K) at an endpoint that experienced the drop in hits, then comparing the service "requests" graph, "hits" graph, and example traces against their respective counterparts on 0.54.2 and 1.3.0. That would rule instrumentation or configuration in or out as an issue.
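Something as simple as the following would do for driving the fixed request count (a rough sketch; the host, endpoint and count are placeholders for whichever resource showed the drop):

```ruby
# send_requests.rb — rough sketch for generating a fixed, known number of requests
# against an endpoint that showed the drop. Host, endpoint and count are placeholders.
require 'net/http'
require 'uri'

uri = URI('http://localhost:3000/v3/first_resource') # endpoint that "lost" its traces
count = 10_000

successes = 0
count.times do
  response = Net::HTTP.get_response(uri)
  successes += 1 if response.is_a?(Net::HTTPSuccess)
end

puts "Sent #{count} requests, #{successes} returned 2xx"
# Then compare this number against the service "requests"/"hits" graphs and the
# trace counts on both 0.54.2 and 1.3.0.
```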
I think there is a case where you would see that, if spans were inconsistently marked top-level or measured because of the relative shape of the trace.
Yes, I would open a ticket and have them escalate it to APM Ruby. You can say David Elner (Ruby APM engineer) requested you open it and will take it up. Hopefully that should expedite things.
Hi @delner, thank you for your response and investigation. We were able to replicate the problem where certain metrics/spans go missing in an isolated/test environment. It seems related to distributed tracing, where one service calls another: the spans/metrics for the downstream service (running ddtrace 1.3.0) are recorded intermittently, and the volume is significantly lower when the service is called indirectly. Given a service A with ddtrace 0.52.0 that calls service B:
The 1.3.0 trace view does list logs for service B, and as discussed previously, the "request count" shown on the APM/traces page for service B matches before/after the upgrade. So it seems the logs from the two services are being tracked/correlated correctly, but the spans/metrics from service B have disappeared when service B is (1) upgraded to 1.3.0 and (2) downstream of another service. When calling service B directly (a service B resource is the top level of the trace) after the upgrade to 1.3.0, however, spans/metrics are visible as before. The behavior sounds similar to issue #2191, but we are not using OpenTracer.
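For context, the shape of the setup is roughly the following (heavily simplified sketch; service names, endpoints and options are illustrative, not our real configuration):

```ruby
require 'ddtrace'

# Service A (ddtrace 0.52.0): instrumented outbound HTTP, so the distributed
# tracing headers (x-datadog-trace-id, x-datadog-parent-id,
# x-datadog-sampling-priority) get injected into the request:
#
#   Datadog.configure do |c|
#     c.use :rails
#     c.use :http   # Net::HTTP instrumentation, propagates the trace context
#   end
#
#   Net::HTTP.get(URI('http://service-b.internal/some_resource'))

# Service B (ddtrace 1.3.0): receives the call through Rack/Rails and continues
# the distributed trace from the incoming headers:
Datadog.configure do |c|
  c.service = 'service-b'
  c.tracing.instrument :rails
end
```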
@zackse I took a bit of time to try to recreate this on our end, and as you said, it does appear to happen intermittently. However, the spans on the second half are not lost; they are within their own trace, which may make them appear as if they were lost. I don't have an explanation for why this is happening yet, but now that I've been able to reproduce it locally, I'm confident we should be able to find an answer quickly. Thanks for sharing the tip about distributed tracing... I think this is the clue we needed to make progress on this. I'll keep you posted on my findings.
Thanks, @delner. We really appreciate the continued work on this. 👍
Desperate-level-10 question: would it solve anything to add the Rack middleware manually in the application bootstrap? 🤔 cc @delner
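To be clear, I mean something like the following (the class path is my guess at the 1.x constant, and I understand the Rails integration normally inserts this middleware on its own):

```ruby
# config.ru — illustrative only; class path assumed from ddtrace 1.x
# (the Rails integration normally inserts this middleware automatically).
require_relative 'config/environment'

use Datadog::Tracing::Contrib::Rack::TraceMiddleware
run Rails.application
```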
@adrys-lab No I don't believe so.
Update: after working more closely with @zackse and his team, we think we've found that the issue is related to how sampling is applied to Ruby apps on the receiving end of a distributed trace: the tracer isn't appropriately sending spans to the agent for metric computation. We have a few good ideas on how to fix this, and we expect to release a new version with the fix soon (likely no later than next week). I'll post further updates here as we have them!
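In the meantime, if you want to see the symptom for yourselves, here's a rough diagnostic sketch (not an official tool) that logs the distributed sampling priority arriving at the downstream service next to the priority held by the active trace:

```ruby
# Rough diagnostic sketch for a Rails/Rack app on ddtrace 1.x: log the incoming
# distributed sampling priority alongside the priority on the active trace.
# Insert it after the Datadog Rack middleware so the distributed context has
# already been read.
class SamplingPriorityLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    incoming = env['HTTP_X_DATADOG_SAMPLING_PRIORITY'] # header set by the upstream service
    trace = Datadog::Tracing.active_trace
    Rails.logger.info(
      "dd diag: incoming priority=#{incoming.inspect}, " \
      "active trace priority=#{trace&.sampling_priority.inspect}"
    )
    @app.call(env)
  end
end
```

In a Rails app this could be wired up with something like `config.middleware.insert_after Datadog::Tracing::Contrib::Rack::TraceMiddleware, SamplingPriorityLogger` (middleware class path assumed from 1.x).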
👋 @adrys-lab @zackse @muzfuz, we have a prerelease gem published that will fix this issue, and we'd like anyone experiencing this problem to try it out. To use this prerelease, point your Gemfile at the prerelease source:

```ruby
source 'http://gems.datadoghq.com/prerelease-v2' do
  gem 'ddtrace', '1.4.0.fix.distributed.priority.sampling.269950'
end
```

I suggest running this in staging first if possible, before deployment to production.
Thanks @marcotc! I have confirmed that this prerelease behaves similarly to our 🐒-patched version in my candidate app in production, and I see the correct data coming through now.
🙌 thank you so much @lyricsboy!
Version 1.4.1 has been released with this fix. Thank you all for your patience and cooperation in solving this issue. 🙇 If the problem still persists, please let us know!
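If you were on the prerelease build, switching back to the released line is just a normal Gemfile update, for example:

```ruby
# Gemfile — replace the prerelease source block with the regular release
gem 'ddtrace', '~> 1.4', '>= 1.4.1'
```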
Current behaviour
We have been working on upgrading from Ruby 2.6.5 to Ruby 3.1.0.
As part of that upgrade, we also had to bump the dd-trace gem version from 0.54.1 to 1.1.0.
Once we had done this successfully, it seems we have lost APM trace data for our service.
The lost traces are the incoming HTTP requests to our service (rack-request).
Notice in the picture the release time of Ruby 3.1.0 + ddtrace 1.1.0.
It is weird because it seems we lost some API traces while others still appear (some /v3/first_resource traces appear, but /v3/another_resource traces don't).
The same goes for lots of lost ActiveAdmin traces.
Note also that the error count has dropped as well.
So, in general, we suspect we are losing data for HTTP requests.
Meanwhile, in terms of logs, it seems we have not lost any data (for instance, this controller still produces logs even though its APM traces have disappeared).
By contrast, everything is working fine for Sidekiq, for instance.
So it might potentially be related to our Rack or Rails configuration, but we don't see what the issue could be 🤔.
Expected behaviour
ALL the HTTP requests are traced correctly in the APM.
Before upgrade configuration
Ruby: 2.6.5
dd-trace: 0.54.1

Current configuration (issue)
Ruby: 3.1.0
dd-trace: 1.1.0
Current issue environment
ddtrace: 1.1.0
Ruby: 3.1.0
Rails: 6.1.6