Deadlock using Caffeine cache and context propagation #13158
Comments
This is the stack trace of one of the deadlocked threads:
"executor-thread-1" #16 daemon prio=5 os_prio=0 cpu=0.00ms elapsed=106.25s allocated=140K defined_classes=5 tid=0x0000029697292000 nid=0x6610 waiting on condition [0x000000d557cfd000]
Doesn't ring a bell, but you also observed a CI issue in CP, so perhaps it's the same thing?
I have tried it and successfully ran JMeter with 1000 threads; I can't reproduce.
Hi @FroMage. I know that reproducing deadlocks isn't easy, but in this case it happens almost every time in my tests. I tested it on two different machines, on Linux and Windows, with Java 11 and Java 14.
Tried it again and still no deadlock. How many cores do you have?
Hi @FroMage, I have 8 cores. You can try decreasing the number of threads of the Quarkus executor from 200 to 10 using this configuration: quarkus.thread-pool.max-threads=10. However, I think I finally figured out the problem. I spent some hours investigating and concluded that it's a simple thread starvation issue. The quarkus-cache extension configures the Caffeine executor with the Quarkus ManagedExecutor, as we see in the code below:
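A minimal sketch of roughly what that wiring looks like, using Caffeine's public builder API (illustrative only, not the actual extension source; the managedExecutor parameter name is an assumption based on this discussion):

import java.util.concurrent.Executor;
import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

class CaffeineExecutorWiringSketch {
    // Illustrative only: when a managed executor is available it becomes
    // Caffeine's executor; when it is null, Caffeine falls back to
    // ForkJoinPool.commonPool() for its asynchronous computations.
    static AsyncCache<Object, Object> build(Executor managedExecutor) {
        Caffeine<Object, Object> builder = Caffeine.newBuilder();
        if (managedExecutor != null) {
            builder.executor(managedExecutor);
        }
        return builder.buildAsync();
    }
}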
When the CP extension is loaded, managedExecutor receives the current executor and Caffeine configures it as its executor. If CP is not loaded, managedExecutor receives null and Caffeine uses ForkJoinPool.commonPool(). I don't know if there is a reason to use the ManagedExecutor in Caffeine, but the solution to the problem is to pass executor as null when constructing the CaffeineCache object.
We use ManagedExecutor in this code to allow context propagation into cache computation threads.
Hi @gwenneg, thank you for the explanation. I'm facing thread starvation because the number of threads in the ManagedExecutor is limited and the threads that Quarkus uses to run the requests are shared with Caffeine. Do you think it is possible to create a ManagedExecutor only for Caffeine?
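For illustration, a minimal sketch of what a dedicated, context-propagating executor could look like using the MicroProfile Context Propagation builder API (the class name and pool size are assumptions, not code from the extension):

import org.eclipse.microprofile.context.ManagedExecutor;
import org.eclipse.microprofile.context.ThreadContext;

class DedicatedCacheExecutorSketch {
    // Hypothetical: an executor reserved for the cache computations, still
    // propagating context but not sharing the Quarkus request worker threads.
    static ManagedExecutor create() {
        return ManagedExecutor.builder()
                .maxAsync(10) // assumed size, tune as needed
                .propagated(ThreadContext.ALL_REMAINING)
                .build();
    }
}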
Propagating the context into cache computation and isolating the cache executor sound like conflicting goals to me, but I'm not a context propagation expert. @FroMage WDYT?
@ben-manes Is there a way to avoid that thread starvation issue by changing the way Caffeine is used in the cache extension?
I remember us discussing a feature for this. Assuming you didn't have that feature, then using a synchronous cache would execute the work on the calling thread. That sounds like what you want here, right? If you wanted both, but to perform the work on the caller, then you could complete the future yourself, such as:
CompletableFuture<Object> future = new CompletableFuture<Object>();
var prior = cache.asMap().putIfAbsent(key, future);
if (prior == null) {
  try {
    Object value = // ... compute the value on the calling thread
    future.complete(value);
    return value;
  } catch (Throwable t) {
    future.completeExceptionally(t);
    throw t;
  }
} else {
  // another caller inserted the future first; wait for it, honoring the lock timeout
  return prior.get(binding.lockTimeout(), TimeUnit.MILLISECONDS);
}
This would avoid using the executor for the computations, while still giving you the flexibility of future-based computations.
Thanks for your answer @ben-manes. I'll work on something based on that suggestion.
Thank you guys. If there is anything I can help with, please let me know.
I just submitted a PR which should fix the issue when the lock timeout is not used. I wonder how we should deal with deadlocks when the lock timeout is used.
Why is there a distinction? In the old & new code, if the caller is the one who is adding a new mapping, it can perform the computation itself. It is only if you want the caller who created the new mapping to also honor the lock timeout that the work needs to be delegated to a thread pool. In that case the starvation may still occur. Unless I'm mistaken, I think you can remove the executor and treat all cases the same way.
You're not mistaken, I didn't realize all cases could be treated the same way. I'll update the PR tomorrow. Thanks @ben-manes!
BTW, the reason why the ManagedExecutor is used is context propagation, as mentioned above. Also, in the future it's fairly possible that the Quarkus thread pool will get much faster access to ThreadLocals than other threads, so there is even more value in keeping it instead of spawning thread pools left and right.
Thanks for that information @FroMage. @ben-manes I updated the PR, I hope starvation is gone for good now! :) The extension code is much simpler now.
wow, that is so much cleaner and simpler. very nice.
Describe the bug
When using quarkus-cache together with the quarkus-smallrye-context-propagation extension, a method annotated with @CacheResult deadlocks with 200 or more concurrent calls.
Expected behavior
No deadlock
Actual behavior
When using a tool like JMeter to submit 200 or more simultaneous calls, all threads immediately deadlock on a method annotated with @CacheResult.
Removing the quarkus-smallrye-context-propagation extension, the issue doesn't occur anymore.
To Reproduce
I created an example project to reproduce this issue: https://github.com/sivelli/quarkus-cache-deadlock
But the code is really simple:
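For context, a hedged sketch of the general shape of such a reproducer (hypothetical names, not the code from the linked project): a JAX-RS resource whose method is annotated with @CacheResult and does a little work before returning.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import io.quarkus.cache.CacheResult;

// Hypothetical reproducer shape; the real project may differ.
@Path("/cached")
public class CachedResource {

    @GET
    @CacheResult(cacheName = "deadlock-demo")
    public String value() throws InterruptedException {
        Thread.sleep(100); // simulate work while concurrent callers pile up
        return "hello";
    }
}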
In pom.xml, if the quarkus-smallrye-context-propagation dependency is present, the problem happens when making 200 calls simultaneously.
Removing this dependency, no deadlock happens.
Steps to reproduce the behavior:
Environment:
uname -a or ver: Microsoft Windows [Version 10.0.18363.657]
java -version:
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.8+10, mixed mode)
mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: D:\Projetos\Java\apache-maven-3.6.3\bin..
Java version: 11.0.8, vendor: Red Hat, Inc., runtime: D:\Projetos\Java\jdk-11.0.8.10-2
Default locale: pt_BR, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"