2.x: Possible deadlock when using observeOn(Scheduler, boolean) #6146
Comments
What you are describing should not happen unless you either call onNext concurrently in the upstream or your downstream does not request. Please provide a minimal runnable code example demonstrating the problem.
The sub-Flowable where I'm seeing the deadlock is essentially the chain described in the issue text: a log statement, the observeOn(...), and further log statements downstream. (I call it a "sub-Flowable" to distinguish it from a larger surrounding Flowable whose code I have omitted.) That sub-Flowable is called in a loop, which itself runs inside the lambda of an outer operator.

In my test, the sub-Flowable is successfully executed a few times, but very rarely one of the later invocations will hang. For the iterations that do not hang, I can see all of the log statements being hit. For the iteration that does hang, I can see the logs before the observeOn(...) but not the ones after it.

I was also able to use VisualVM to do a thread dump. Most of the threads appear to be scheduler worker threads simply sitting in their pool.
There are 5 threads that are blocked/waiting that aren't simply sitting in a pool. One of them seems to be waiting on the sub-Flowable itself.
The other 4 blocked threads seem unrelated, but I can briefly describe them. The result of the sub-Flowable above is being used to write to a cache; two of the other blocked threads are attempting to clear that cache, and it is known that the clear operation will block until any pending writes complete. The cache sits above this sub-Flowable, so the sub-Flowable can't reference it (which would lead to deadlock). One of the other blocked threads is waiting for the cache clear operation from the other threads to succeed. The final blocked thread contains the terminal blockingGet().
The only thing notable about the logs is that the ones before the observeOn(...) appear, while the ones after it never do. The remaining memory in the JVM seems fine as well. Unfortunately, I can't really give you the entire test to reproduce the problem, as it is quite large and part of a proprietary code base. (And you'd have to run the test over and over again for about an hour on average to see the deadlock.) I'll work on writing a smaller test that I can share with you that hopefully reproduces the deadlock more consistently, but so far I haven't had any luck.
It is very likely
@jkarshin I'm interested in this; a unit test would be great 👍
Minor update: I've added logging around the downstream's subscription, requests, and cancellation. This seems to suggest that the downstream is successfully subscribing and requesting an item, and that it is not cancelling.

I'm still working on making the apparent deadlock easier to reproduce. I haven't been able to get a small unit test to show the same problem, but I have written a slightly smaller integration test that reproduces the deadlock more frequently. (Still too huge for me to share any useful code, but I'm working on it.)
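For reference, instrumentation of that sort might look like the following sketch; the hooks used (doOnSubscribe, doOnRequest, doOnCancel) are standard RxJava 2 operators, but the surrounding chain and log wording are placeholders, not the reporter's code:

```java
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

public class SubscriptionLoggingSketch {
    public static void main(String[] args) {
        Flowable.just("item")                                        // placeholder upstream
                .doOnNext(v -> log("before observeOn: " + v))
                .observeOn(Schedulers.io())
                .doOnSubscribe(s -> log("subscription arrived at this point in the chain"))
                .doOnRequest(n -> log("downstream requested " + n))
                .doOnCancel(() -> log("downstream cancelled"))
                .doOnNext(v -> log("after observeOn: " + v))
                .blockingLast();   // block for the result, loosely mirroring the terminal blocking call described

        log("done");
    }

    static void log(String msg) {
        System.out.println(Thread.currentThread().getName() + " - " + msg);
    }
}
```

Since request and cancellation signals travel upstream, placing these hooks just below the observeOn shows exactly what the downstream consumer is asking for at that point in the chain.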
I converted the Flowable from my previous code snippet into a unit-testable Flowable in an attempt to reproduce the hang in a unit test; however, I cannot get it to hang. I then cloned RxJava to insert log statements into the internals of the operators and scheduler involved. Long story short, I discovered that the thread that ends up stuck is running on a worker that the IO scheduler appears to hand out for reuse.
I don't know a lot about the scheduler internals, so this brings up a question: for the IO scheduler, the executors are cached so that they can be reused, correct? Is it possible that an executor can be recycled even if it hasn't completed its task? If so, that kind of recycling might be what is causing my deadlock.
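For context on that question, here is a small sketch (not from the issue) that exercises the worker checkout and return cycle directly through the public Scheduler API; it only illustrates the mechanism being asked about:

```java
import io.reactivex.Scheduler;
import io.reactivex.schedulers.Schedulers;

public class IoWorkerLifecycle {
    public static void main(String[] args) throws InterruptedException {
        // Check a worker out of the IO scheduler's cached pool.
        Scheduler.Worker worker = Schedulers.io().createWorker();

        worker.schedule(() ->
                System.out.println("running on " + Thread.currentThread().getName()));

        // Disposing the worker releases its underlying thread back to the cache,
        // where a later createWorker() call may be handed the same thread.
        worker.dispose();

        Scheduler.Worker another = Schedulers.io().createWorker();
        another.schedule(() ->
                System.out.println("possibly the same thread: " + Thread.currentThread().getName()));

        Thread.sleep(500);   // give the asynchronous tasks time to print
        another.dispose();
    }
}
```

The question in the thread is essentially whether a thread can land back in that cache while a task that was scheduled through it is still blocked.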
I'll try to capture that situation in a unit test later this week.
Try with Schedulers.newThread() so worker reuse is not in the picture. Also don't block, and don't use any traditional wait-notify. Confine cache management to one thread.
After much effort, I am unable to reproduce the deadlock in a unit test. With more logging, I've discovered that when the deadlock occurs in my production code, the observeOn appears to be handed a worker whose underlying ScheduledExecutorService is already occupied.

My initial thought was that the thread responsible for clearing one of my caches was somehow getting put back into the pool, but further logging indicates that this is not the case: I added logging to determine the name of the thread that was in the ScheduledExecutorService mentioned above, and surprisingly, it matched the name of the thread from which I was doing the logging. (Which was not the thread that would have been blocked clearing my cache.) This leads me to believe that because I am nesting Flowables, Singles, Completables, etc., and making blocking calls inside them, a thread can end up waiting on work that has been scheduled onto that same thread.

My next course of action is going to be to refactor my code base to try to eliminate these nested blocking calls and see if that eliminates the deadlock problem. That will probably take a few weeks, so feel free to close this ticket in the meantime; I can re-open it and renew my investigation if the problem persists. Thanks again for the help.
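As an illustration of the kind of refactor being considered (a generic sketch, not the reporter's code): a nested blockingGet() inside a mapping function can usually be replaced by composing the inner Single into the chain with flatMapSingle, so no worker thread ever sits parked waiting on work that has to run on another (or, in the worst case, the same) worker.

```java
import io.reactivex.Flowable;
import io.reactivex.Single;
import io.reactivex.schedulers.Schedulers;

public class NonBlockingRefactor {

    // Placeholder for some per-item asynchronous lookup used by the real code.
    static Single<String> lookup(int i) {
        return Single.fromCallable(() -> "value-" + i)
                .subscribeOn(Schedulers.io());
    }

    public static void main(String[] args) {
        // Blocking style: the lambda parks an IO worker until the inner Single finishes.
        Flowable.range(1, 3)
                .observeOn(Schedulers.io())
                .map(i -> lookup(i).blockingGet())
                .blockingSubscribe(System.out::println);

        // Non-blocking style: the inner Single is composed into the chain instead.
        Flowable.range(1, 3)
                .observeOn(Schedulers.io())
                .flatMapSingle(NonBlockingRefactor::lookup)
                .blockingSubscribe(System.out::println);
    }
}
```

The non-blocking form also matches the earlier advice in this thread to avoid blocking inside the standard schedulers.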
If a task executing on the standard schedulers does not respond to interruption, it may lead to premature worker reuse. Unfortunately, waiting for a worker to run out of tasks may lead to memory leaks or excess pool growth (i.e., the self-release can't keep up with the requests for more workers). Without seeing the actual code, I don't think we can help you much. I'd suggest using a dedicated, single-threaded scheduler to manage the cache and to retrieve items in a non-blocking fashion:

Observable.interval(1, TimeUnit.MINUTES, Schedulers.single())
    .doOnNext(v -> cache.clear())
    .subscribe();

Observable.fromIterable(items)
    .subscribeOn(Schedulers.single())
    .map(v -> cache.get(v))
    .observeOn(Schedulers.io())
    .doOnNext(v -> { /* work with the item */ })
    .subscribe(/* ... */);
I was too stubborn to give up on this, so I pressed on and made some progress. The logs corresponding to what I was describing in my previous post indicate that some work is done on thread 37, then it is returned to the pool, then more work is done on thread 37 (without taking 37 out of the pool), and then finally thread 37 tries to take itself out of the pool, at which point the deadlock occurs. Additionally, a thread dump shows that thread 37 is waiting in a blockingGet. Normally, when a thread is put into the pool, it isn't used until a different thread takes it out of the pool.
Nowhere else in the logs does a thread attempt to check itself out of the pool, nor does a thread continue to do work after it has been returned to the pool (without being taken out of the pool). This makes me think I've got some combination of RxJava operators that is causing a thread to be returned to the pool prematurely. I was also able to get a stack trace showing where thread 37 is returned to the pool prematurely.
Unfortunately, the stack trace doesn't reference my code at all, so I can't pinpoint which operators I am using, and I did find it strange that it contains a few calls I didn't expect. I'll try to reverse engineer the stack trace to find what combination of operators is causing that thread to get returned to the pool. If I can find the combination of operators, hopefully I can turn that into a unit test that consistently shows a thread being returned to the pool prematurely.
I think I know what's going on.
On your side, you could try making sure the blocking call does not run on one of the pooled worker threads. I'll think about how this could be addressed on the operator side.
I believe you are correct. I found a piece of my code similar to what you described, and I was able to turn it into a unit test that consistently returns a worker thread to the pool prematurely (RxJava 2.1.11).
That unit test hangs consistently on my machine and yields logs similar to the ones described above. Running against the 2.x branch of RxJava after the changes you made in response to my issue (#6167) fixes the hanging unit test and seems to fix my hanging production code. So, I will wait for the next RxJava release =) Thanks again!
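The reporter's actual unit test and its logs are not preserved above. As a rough illustration of the pattern discussed in this thread (a blocking get on an inner chain, executed from a lambda that is itself running on an IO worker), a sketch might look like the following; the operators, values, and structure are assumptions, and it is not guaranteed to reproduce the hang:

```java
import io.reactivex.Flowable;
import io.reactivex.Single;
import io.reactivex.schedulers.Schedulers;

public class NestedBlockingSketch {
    public static void main(String[] args) {
        String last = Flowable.range(1, 50)
                .subscribeOn(Schedulers.io())
                .flatMap(i -> {
                    // Blocking inside a lambda that is already executing on an IO
                    // worker: the nested chain below also hops onto IO workers.
                    String value = Single.fromCallable(() -> "value-" + i)
                            .subscribeOn(Schedulers.io())
                            .observeOn(Schedulers.io())
                            .blockingGet();
                    return Flowable.just(value);
                })
                .observeOn(Schedulers.io())
                .blockingLast();

        System.out.println("finished with " + last);
    }
}
```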
RxJava version: 2.1.11
Java: 1.8.0_181
I'm encountering an intermittent deadlock in a rather long Flowable, and I believe I've pinpointed it to an observeOn(...) call. (I've reached this conclusion through a series of log statements.) I haven't been able to trace through the test when the deadlock occurs, as it only occurs about once every 30-40 executions, and each execution takes about a minute. I've managed to reproduce the deadlock about a dozen times (each time, I've been adding more logging to figure out where things are getting stuck).

In the test case where I experience the occasional deadlock, I expect only 1 item to be emitted through this part of the Flowable. I can see log statements 1 and 2 indicating that the item reaches the observeOn(...) and that the upstream is finished, but logs 3 and 4 are never reached. (I forgot to add a doOnError(...) to make sure an exception isn't sneaking through and holding things up elsewhere, but I'm fairly confident there aren't any uncaught exceptions. I've added a doOnError(...) and am re-running my test now to make sure; I'll update my post once I have results.)

Because the logs are hit in this way, this leads me to believe the observeOn(...) is locking up somehow. What's really strange is that everything works fine most of the time. None of the downstream operators should be attempting to dispose the Flowable early, either. I believe my terminal operator is a blockingGet() on a Single, with no timeout or anything.

I also logged the total number of threads in my JVM to see if I'm leaking threads somewhere, but I'm only at 49 when the deadlock occurs. (I'm using the IO scheduler, which I believe is backed by an unbounded pool, so I can't imagine I would be running out of worker threads.) I do have other Flowables doing unrelated tasks in the background. All of those Flowables use the IO scheduler. Additionally, the upstream and downstream of the Flowable in my test also make use of the IO scheduler, but the deadlock always seems to happen here.
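The chain itself isn't shown in the issue, but based on the description above (logs 1 and 2 on the upstream side of the observeOn, logs 3 and 4 on the downstream side, and a terminal blockingGet() on a Single), its shape is roughly the following; the specific operators and the single item used here are assumptions:

```java
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

public class ObserveOnLoggingSketch {
    public static void main(String[] args) {
        String result = Flowable.just("the-one-item")            // placeholder upstream
                .doOnNext(v -> log("1: item about to enter observeOn"))
                .doOnComplete(() -> log("2: upstream finished"))
                .observeOn(Schedulers.io())
                .doOnNext(v -> log("3: item emerged from observeOn"))
                .doOnComplete(() -> log("4: downstream finished"))
                .singleOrError()                                  // expecting exactly one item
                .blockingGet();                                   // terminal blocking call, as described

        log("result = " + result);
    }

    static void log(String msg) {
        System.out.println(Thread.currentThread().getName() + " | " + msg);
    }
}
```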
I realize that's not a lot of info to go off of, but I figured I'd ask the experts in case there's something glaring that I'm missing, or if there's something else I can do to figure out what's going on.
Thanks in advance!