Resque workers deadlocked from ddtrace after processing jobs #3015
Thanks @shanet for the detailed breakdown, and sorry that you're running into this issue! Indeed it looks like the [...] Looking at [...] To be honest, I'm not entirely sure what glibc is doing at this point, but before going deeper on glibc, there are a few things I'd suggest we look into at the Ruby level.
I hope that either of these may provide clues for the next steps. Thanks again for your patience on this issue. |
Hey @ivoanjo, thanks for taking a look at this so quickly and providing some ideas on what else to check.
I wouldn't place a ton of weight on this as bisecting this was very difficult. I learned that in some cases I had to let it run Resque jobs overnight before a worker would become hung. It's trivial to mark a commit as bad once a single worker hangs, but verifying one as good means accepting that a sufficient amount of time has passed without a worker hanging. As such, I cannot 100% conclusively say that this is where the problematic behavior was introduced, just that after letting many of these commits run for 12+ hours without problems this is where I landed.
I built a new image with the [...]
I then observed a hung Resque worker within about 15 minutes of starting this container. I guess that means the ddtrace extension is not responsible.
I went to the [...]
I checked seven hung Resque worker processes across two servers and found that the libraries in question were either:
Why the library being unloaded is inconsistent, and why it's these two specifically, raises more questions than answers... If it weren't for the fact that I know running ddtrace 0.x or (I believe) 1.2.0 fixes the issue, I would be doubting it's even a ddtrace problem at this point. Is there anything in ddtrace that may be affecting the loading/unloading of shared objects with Ruby? I'm really struggling to understand how this is related to ddtrace now. |
Thanks for the replies!
Yeah, I'm very curious about this myself. There's nothing in ddtrace that tries to intercept the loading/unloading of stuff. I half-wonder if what may be happening here is that ddtrace is the "straw that breaks the camel's back", but for now let's proceed assuming that the issue is caused directly by ddtrace.

I was looking into the ffi gem code and the stack you shared, and realized that Ruby has decided to run all of the finalizers before it exits. E.g. what's happened is that a Ruby process forked, then did its work, and now is cleaning up before the process shuts down. This means that the unloading of libraries doesn't really need to happen; in fact [...] ...but this also suggests to me that, as long as [...]

Looking at the glibc code again, there seems to be a lot of stuff that sets the "thread gscope flag" (in a way, it looks to be a per-thread lock-lookalike). Because the flag/lock-lookalike in [...] That is, can you try checking in gdb what's the [...]?

One other thing that came to mind is -- are you running resque with or without [...]?

Aside from that, I guess two "avoid the issue fast" workarounds could be:
|
I think I understand what's happening here. The issue is that, by default, Resque forks before each job it processes. According to POSIX, forking a multi-threaded program and then calling a non-async-signal-safe function, like [...]

Specifically, from the [...]:
So then in this case, ddtrace is running its background threads, resque forks to run a job, and then when the exited process is being cleaned up, it gets deadlocked on a mutex inside of the [...]

After all this I think we've been having this problem of deadlocked Resque workers for a long time now, but never noticed it because it was so difficult to reproduce. Something in ddtrace version 1.3.0 and beyond made it start happening far more frequently. I suppose maybe because there are more background threads now? I'm not sure; I just know now that it's extremely difficult, but possible, to reproduce this problem with versions prior to 1.3.0, including 0.x versions as well.

The solution I came up with was to simply set Resque's [...]

In terms of ddtrace, I'm not sure if there's a fix here, or really if it's even a ddtrace problem, as the issue lies with Resque's default forking model, I believe. Maybe it would be good to stop the ddtrace threads before forking and then start them again in the parent process to avoid this problem? There's already a patched [...]

Regardless, thank you for your help. Once I realized the library being closed was not ddtrace's, it made me turn my attention to glibc, and only then did I realize the issue was not with the ddtrace upgrade we did, but with Resque's forking model in multithreaded environments. |
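(The exact Resque setting isn't quoted above. For readers landing here: as a sketch only, assuming Resque's documented FORK_PER_JOB switch and a standard rake-based worker, disabling per-job forking looks roughly like this.)

# Sketch only: disable Resque's fork-per-job so the child-exit/dlclose path never runs.
# Shell form, when starting workers:
#   FORK_PER_JOB=false QUEUE=* bundle exec rake resque:work
#
# Or programmatically, when building a worker by hand:
require "resque"

worker = Resque::Worker.new("*")
worker.fork_per_job = false  # process jobs in the worker process itself, no fork
worker.work(5)               # poll interval in seconds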
Thanks for the feedback! Indeed the discussion on newrelic that you mentioned, plus the ones on the Resque side and even sidekiq, does sound suspiciously familiar... It's somewhat weird that it seems to happen that much more often with Resque -- the timing of its forking seems to cause this issue more often than with other tools in the Ruby ecosystem. I'll make sure to open a PR to document this in the "Known issues and suggested configurations" section of our docs, so hopefully other users won't run into this really sharp edge. |
I've been thinking about this a bit more and I suspect I understand why this seems to show up with Resque. I guess it's right there in the name of the feature -- fork per job. Considering you may have a bunch of machines running Resque, and they're processing many jobs, what the Resque folks accidentally built is a Ruby app fork-safety fuzzer -- it probably forks so much more often than anything else in the Ruby ecosystem that any once-in-a-blue-moon forking bugs start showing up every day. That would explain why these kinds of issues seem to show up around Resque first... |
Right, it is happening consistently in production for us, but outside of prod the only way I could reproduce it was by putting extremely high load through Resque, like on the order of hundreds of thousands of jobs being worked through by 10-15 workers. That's definitely a lot of forking!

Also, for anyone else that stumbles across this in the future, our security team wasn't super thrilled with setting [...], so here's what we're trying instead:

Rails.application.config.datadog_initialize = proc do
  Datadog.configure do |c|
    [...]
  end
end

Rails.application.config.datadog_initialize.call

Resque.before_fork do
  Datadog.shutdown!
end

Resque.after_fork do |job|
  Rails.application.config.datadog_initialize.call
end

We're still testing the above as I write this, but it's looking good so far. The downside to this is that it only handles ddtrace's threads. If there are any other fork-unsafe threads running before the fork, the same issue may occur, and those would need to be stopped & restarted as well. |
**What does this PR do?**: This PR adds the "Resque workers hang on exit" issue discussed in #3015 to the "Known issues and suggested configurations" section of our docs.

**Motivation**: This issue has come up a few times, e.g. in:

* #466
* #2379
* #3015

...so I've decided to document the issue and the best-known workaround so that other customers that unfortunately run into it may find our suggested solution.

**Additional Notes**: N/A

**How to test the change?**: Docs-only change.
Hi there, just chiming in to say we are experiencing this problem as well... except we're using Sidekiq Swarm. Sidekiq Swarm boots the app, then forks a desired number of child processes. We're going to try and roll out the same trick above and will report back. @ivoanjo would it be possible to do what Rails does and use [...]? |
Thanks for letting us know about Swarm as well. @ngan, is the issue easily reproducible for you, or something that happens more rarely? Having a reproducer for this issue would definitely help us experiment a bit with a few options (such as hooking on fork). Also, could you share your Datadog configuration as well? We'd like to look into whether some settings correlate with this happening more often vs less often. |
It's fairly reproducible in production for us. We process millions of jobs a day with ~1400 workers. We'll see 5 of them get stuck after 20 or so minutes. When we send a [...]
When we rolled out the workaround I mentioned above, we lost all Sidekiq traces 😓 so we had to revert. What's the official API/way to tell ddtrace to restart everything after forking? |
Ok, so it turns out we did it incorrectly. This is the proper way to apply the workaround for Sidekiq Swarm:
We're going to roll this out now and will report back... |
@ngan For reference, the solution I posted above did not end up working for us. In fact, I completely removed ddtrace from our Rails app and still experienced the deadlocked Resque workers. The severity is far less and I'm still not sure of the connection, but overall I believe the problem runs deeper, with forking Ruby processes when there are loaded shared objects from gems' native extensions. The only solution I could find was to stop Resque from forking per the env var I wrote about above, as I believe there's a fundamental architectural flaw with Resque's forking model in these situations. This is not a tenable solution for us due to security reasons, however, so I still need to find another solution, which is to be determined. But the point is that I don't believe this is unique to ddtrace anymore. |
@shanet Ah. Welp, we can now also confirm that we are still seeing stuck workers with the workaround in place. 😅 |
Yeah, our current suspicion is that this may not be entirely under ddtrace's control, but something we do may make the issue more likely to show up. If you shut down ddtrace before forking, then no ddtrace background threads should be running.
|
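(A quick way to sanity-check that claim in a console or a before_fork hook -- a sketch only, assuming, as seen earlier in this thread, that ddtrace names its worker threads after their classes, e.g. Datadog::Tracing::Workers::AsyncTransport.)

# Sketch only: verify no ddtrace background threads survive shutdown.
# Assumes thread names include "Datadog", as the AsyncTransport example above suggests.
Datadog.shutdown!
leftover = Thread.list.select { |t| t.name.to_s.include?("Datadog") }
warn "ddtrace threads still running: #{leftover.inspect}" unless leftover.empty?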
Hi all,
Maybe we should check on https://bugs.ruby-lang.org/projects/ruby-master/ and create an issue there if needed. My two cents. |
This seems dangerous to me. For example, with ddtrace, what if you have a background thread with data still in its buffer waiting to be sent over the network? By forcefully aborting the program you're not giving any threads the ability to exit gracefully, and you're opening yourself up to potential data loss.
I ran across that glibc bug report when I started investigating this (it's actually how I first knew I wasn't crazy with this whole thing and someone else was having the same issue!). The comment on it regarding async-signal-safe functions appears correct to me, though. That's not to say FFI is at fault either; it's correctly cleaning up after itself by closing the shared objects. That's partly why this issue is so tricky: there's no single piece of software that's doing something wrong, but the confluence of all of them put together creates this extremely odd edge case. That said, I haven't given up on this either. I've recently been trying an approach to kill all threads in an [...] |
I believe we solved our deadlocked Resque workers issue with a combination of approaches in this thread. Here's a summary:

First and foremost, I believe the proper fix to all of this is to not have Resque fork per job by setting the [...]

So, if you need to keep Resque forking, here's what I did. Define Resque before and after fork blocks as such:
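(The exact blocks aren't preserved in this copy of the thread; what follows is a sketch consistent with the workaround posted earlier, with the at_exit shutdown being my reading of the summary below rather than the author's verbatim code.)

# Reconstruction / sketch only -- not the author's exact code.
Resque.before_fork do
  Datadog.shutdown!  # stop ddtrace background threads in the parent before forking
end

Resque.after_fork do |_job|
  # Restart tracing inside the forked child...
  Rails.application.config.datadog_initialize.call
  # ...and stop the background threads again before the child exits, so exit-time
  # library cleanup doesn't race them (assumption, based on the description below).
  at_exit { Datadog.shutdown! }
end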
And then for the Datadog initialization:
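(Likewise a sketch, mirroring the initializer posted earlier in the thread; the actual Datadog settings are elided.)

# Reconstruction / sketch only -- the real settings go inside the configure block.
Rails.application.config.datadog_initialize = proc do
  Datadog.configure do |c|
    # [...] instrumentation and settings elided
  end
end

# Run it once at boot, in the parent process.
Rails.application.config.datadog_initialize.call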
The above is a combination of my previous attempts to shut down any background threads before the forked process exits and @CedricCouton's suggestion to make use of [...] However, calling [...]

The other component here is that this approach solved 95% of the deadlocked workers, but months ago I observed a single deadlocked worker stuck on a [...]

I think that about wraps it up. Since deploying all of this and the 1.x ddtrace gem there have been no deadlocked workers, which is a huge relief. My previous idea of using [...]

Thanks to everyone in this thread for your ideas. This was certainly an exceedingly difficult problem to debug and to find an acceptable fix for that didn't involve intrusive patches to Ruby, FFI, or even glibc. |
Thanks @shanet for sharing your conclusions. Indeed, it's a really tough nut to crack, a combination of weaknesses in the whole ecosystem, with dd-trace-rb also doing its part to trigger the issue by having background threads sending data :/ |
This has been added as a known issue to https://docs.datadoghq.com/tracing/trace_collection/dd_libraries/ruby/#resque-workers-hang-on-exit with our recommended workaround, so I'll close this ticket for now. Please feel free to re-open if the workaround doesn't work or you need any help with it :) |
@ivoanjo I wonder if you've ever considered decorating [...] An example of that approach: https://github.com/Shopify/grpc_fork_safety |
Adopting the monkey patching of [...] So, we're automatically restarting the threads currently as well...

The annoying problem in this case is the need to shut down the library in the parent process. This is where I hesitate a bit more:
|
Interesting. I'm not using the gem myself; I was just asking questions because I heard of some fork issues (cc @beauraF). And they had to call shutdown on [...] |
Yeah, something else I've used is simply to have a Mutex that wraps all the unsafe calls (typically anything that can call [...]) |
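(A minimal sketch of that mutex idea -- names are illustrative, not from any particular gem: take the same lock around every fork-unsafe call and around fork itself, so a fork can never start while one of those calls is in flight.)

require "socket"

module ForkGuard
  LOCK = Mutex.new

  # Wrap any call that is unsafe to interleave with fork (e.g. name resolution).
  def self.guard(&block)
    LOCK.synchronize(&block)
  end

  # Fork only while no guarded call is running.
  def self.fork(&block)
    LOCK.synchronize { Process.fork(&block) }
  end
end

# Illustrative usage:
ForkGuard.guard { Addrinfo.getaddrinfo("example.com", 443) }  # guarded lookup
pid = ForkGuard.fork { puts "child #{Process.pid}" }
Process.wait(pid) if pid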
@beauraF did you have an issue with Resque as well, or was it a different situation? I'd definitely like to know more, if you're up for sharing.
This makes a lot of sense! It occurs to me that a read/write lock would be the perfect fit for something like this -- we could have as many concurrent "readers" of [...]

We're discussing moving more of our networking code into the native Rust-built libdatadog gem, and such a refactoring would provide a good opportunity to also solve this issue. I'll go ahead and create an issue to revisit this. |
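(To illustrate the read/write-lock idea -- a sketch only, using concurrent-ruby, which is not necessarily what the library would actually use: background/network work takes the read lock so many operations can run concurrently, while fork takes the write lock, waiting for in-flight work and excluding new work while it runs.)

require "concurrent"  # concurrent-ruby gem, chosen here just for ReadWriteLock

FORK_LOCK = Concurrent::ReadWriteLock.new

# Many background/network operations may proceed at the same time...
def traced_network_call
  FORK_LOCK.with_read_lock do
    # e.g. flush a batch of spans over HTTP
  end
end

# ...but fork waits for all of them and blocks new ones while it runs.
def safe_fork(&block)
  FORK_LOCK.with_write_lock { Process.fork(&block) }
end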
Yup: https://bugs.ruby-lang.org/issues/20590 ruby/ruby#10864 |
Oh wow that's cool! I hadn't seen your PR haha, that's amazing :) |
Hey 👋
Yes, we had an issue with Resque. @CedricCouton and I are working together (see #3015 (comment)). To be precise, we in fact had issues with both Resque and pitchfork (without reforking). In the end, we fixed it with:

module Doctolib
  module O11y
    module Extension
      module ForkSafety
        module Datadog
          def _fork
            Doctolib::Datadog.stop # ::Datadog.shutdown!
            pid = super
            Doctolib::Datadog.start # ::Datadog.configure(&configuration)
            pid
          end
        end
      end
    end
  end
end

Also, we did the same for [...]

Happy to share everything you need, and even jump on a call if needed! |
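(For completeness, and not quoted in the comment above: a module like that only takes effect once it's prepended onto Process's singleton class, where Ruby 3.1+ defines the _fork hook -- presumably wired up with something along these lines.)

# Assumed wiring -- not shown in the thread. Ruby 3.1+ routes fork/Process.fork/Kernel#fork
# through Process._fork, so prepending lets the module wrap every fork call.
Process.singleton_class.prepend(Doctolib::O11y::Extension::ForkSafety::Datadog)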
@beauraF Interesting, I hadn't seen reports of it affecting pitchfork before, but thinking back to how it works, I can see how it may also trigger the issue.
I'd like to take you up on your offer! Can you drop me a quick e-mail at ivo.anjo at employerhq so we can set that up? |
Quick update here. While [...] So for anyone experiencing this on 3.3.x, I highly suggest upgrading to 3.3.5. As for [...] |
🙇 thanks for the heads-up! (And I'm VERY excited about ruby/ruby#10864 ) |
A few weeks ago we updated ddtrace in our application from version 0.53 to 1.11.1. Shortly after deploying this to production we noticed that some of our Resque workers would hang after they processed a job, resulting in the worker no longer processing further jobs and causing backups in the Resque queues. We identified the ddtrace upgrade as the source of this problem and reverted back to version 0.53 to resolve it for the time being.
Since then I've been working to identify what the root cause of this is. Here's what I know so far:
Git bisecting the ddtrace gem pointed me at commit 26ac04c06f87918eae773692dd428491bf8a11a4. This appears to add tracking of CPU-time for profiling.

Threads as displayed by GDB:
GDB backtrace of all threads:
Using GDB I also got a Ruby backtrace of the main thread. The other threads were marked as killed already by Ruby and did not have a control frame pointer anymore to read a backtrace from:
The one Datadog thread's full name is Datadog::Tracing::Workers::AsyncTransport per: [...]

I initially thought the problem was with the AsyncTransport class, but after looking at the concurrency logic there I was unable to find any problems with it. Furthermore, returning to the backtraces above, the problem seems to lie elsewhere. Ruby has all threads other than the main thread marked as killed already, so it would appear they joined the main thread cleanly.

The main thread, on the other hand, is attempting to unload a library when it becomes stuck, per the backtrace. Namely, it is stopping the fork (rb_f_fork and ruby_stop), then cleaning up with ruby_cleanup and library_free, before finally becoming deadlocked at futex_wait.

This backtrace, combined with the git bisecting of the ddtrace gem, leads me to believe there is an issue with something in the native extension preventing Ruby from continuing with its cleanup of the fork, resulting in a hung Resque worker process. This only appears to happen at high load with a sufficient number of Resque workers. It's also worth noting that with 10 Resque workers, I never saw them all become stuck. At most 5-6 of them would become stuck over the timespan of ~12 hours. I assume this had to do with why multiple workers were needed to replicate the problem in the first place.
There's one other notable situation that happened too. I've inspected many of these deadlocked processes via GDB while debugging this and they all had the same backtrace. However, in a single case, which I never saw a second time, the backtrace was different, with it being stuck on a getaddrinfo call. I don't know if this is relevant, since it ultimately still ends up stuck on a futex_wait call, but it stood out to me enough to save a copy of the backtrace when I saw it:

Our environment is as follows:

Our ddtrace initialization config:
It's also worth noting that I was only able to replicate this behavior on AWS ECS. I tried reproducing it on a local Linux desktop (Arch rather than Ubuntu) and, even when running 10 Resque workers, could not replicate the issue. I don't know if that makes it environment-specific or if, since it only happens under high load, the considerably higher hardware capabilities of my desktop compared to the limited AWS containers prevented it from occurring.
I've also reviewed #466 which seems like a similar issue, but appears to be unrelated.
Overall, I've spent nearly two weeks getting to this point, hoping to identify and solve the issue, or at least to obtain enough information to intelligently open a bug report. At this point I'm not sure what else to debug and am hoping for advice on what might be causing these deadlocks and how to further diagnose the problem. Thank you for reviewing.