
Deadlock using Caffeine cache and context propagation #13158

Closed
sivelli opened this issue Nov 6, 2020 · 20 comments · Fixed by #13244
Labels: area/cache, kind/bug

@sivelli (Contributor) commented Nov 6, 2020

Describe the bug
When using quarkus-cache together with the quarkus-smallrye-context-propagation extension, a method annotated with @CacheResult deadlocks under 200 or more concurrent calls.

Expected behavior
No deadlock

Actual behavior
When using a tool like JMeter to submit 200 or more simultaneous calls, all threads immediately deadlock on a method annotated with @CacheResult.
Removing the quarkus-smallrye-context-propagation extension makes the issue go away.

To Reproduce

I created an example project to reproduce this issue: https://github.com/sivelli/quarkus-cache-deadlock
But the code is really simple:

import io.quarkus.cache.CacheResult;
import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ExampleCache {
    @CacheResult(cacheName = "test")
    public Integer getLength(String param) {
        return param.length();
    }
}

import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/resteasy/hello")
public class ExampleResource {
    @Inject
    ExampleCache cache;

    AtomicInteger counter = new AtomicInteger();

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello(@QueryParam("name") String name) {
        System.out.println("request received " + counter.incrementAndGet());
        return "hello " + name + " Length:" + cache.getLength(Objects.toString(name, ""));
    }
}

In pom.xml, if the following dependency is present, the problem happens when making 200 simultaneous calls:

    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-smallrye-context-propagation</artifactId>
    </dependency>    

If this dependency is removed, no deadlock happens.

Steps to reproduce the behavior:

  1. Start the application
  2. Submit 200 or more simultaneous requests (run testCache.jmx)

Environment:

  • Output of uname -a or ver:
    Microsoft Windows [versão 10.0.18363.657]
  • Output of java -version:
    openjdk version "11.0.8" 2020-07-14
    OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
    OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.8+10, mixed mode)
  • Quarkus version: 1.9.2.Final
  • Output of mvn -version
    Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
    Maven home: D:\Projetos\Java\apache-maven-3.6.3\bin..
    Java version: 11.0.8, vendor: Red Hat, Inc., runtime: D:\Projetos\Java\jdk-11.0.8.10-2
    Default locale: pt_BR, platform encoding: Cp1252
    OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
sivelli added the kind/bug label Nov 6, 2020
@sivelli (Contributor, Author) commented Nov 6, 2020

This is the stack trace of one of the deadlocked threads:

"executor-thread-1" #16 daemon prio=5 os_prio=0 cpu=0.00ms elapsed=106.25s allocated=140K defined_classes=5 tid=0x0000029697292000 nid=0x6610 waiting on condition [0x000000d557cfd000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
- parking to wait for <0x000000070e334238> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.8/CompletableFuture.java:1796)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.8/ForkJoinPool.java:3128)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.8/CompletableFuture.java:1823)
at java.util.concurrent.CompletableFuture.get(java.base@11.0.8/CompletableFuture.java:1998)
at io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:46)
at io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(CacheResultInterceptor_Bean.zig:334)
at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
at org.acme.resteasy.ExampleCache_Subclass.getLength(ExampleCache_Subclass.zig:162)
at org.acme.resteasy.ExampleCache_ClientProxy.getLength(ExampleCache_ClientProxy.zig:185)
at org.acme.resteasy.ExampleResource.hello(ExampleResource.java:36)
at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.8/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.8/Method.java:566)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
at org.jboss.resteasy.core.ResourceMethodInvoker$$Lambda$259/0x00000008002dbc40.get(Unknown Source)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
- locked <0x0000000702595648> (a org.jboss.resteasy.core.interception.jaxrs.PostMatchContainerRequestContext)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher$$Lambda$247/0x00000008002d8040.run(Unknown Source)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.SynchronousDispatcher$$Lambda$254/0x00000008002d9c40.get(Unknown Source)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
- locked <0x00000007025956a0> (a org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:131)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:37)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:94)
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2046)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1578)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1452)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at java.lang.Thread.run(java.base@11.0.8/Thread.java:834)
at org.jboss.threads.JBossThread.run(JBossThread.java:479)

@gsmet (Member) commented Nov 9, 2020

@FroMage @Ladicek does it ring a bell?

@FroMage (Member) commented Nov 9, 2020

Doesn't ring a bell, but you also observed a CI issue in CP, so perhaps it's the same thing?

@FroMage (Member) commented Nov 9, 2020

I tried it and successfully ran JMeter with 1000 threads; I can't reproduce.

@sivelli (Contributor, Author) commented Nov 9, 2020

Hi @FroMage. I know that reproducing deadlocks isn't easy, but in this case it happens almost every time in my tests. I tested it on two different machines, using Linux and Windows, with Java 11 and Java 14.
I changed the code a little to force a cache miss on each request. It may help to reproduce the error.
Anyway, I'll keep investigating this issue. I had to disable all caching in my application to avoid this situation.

@FroMage (Member) commented Nov 10, 2020

Tried it again and still no deadlock. How many cores do you have?

@sivelli (Contributor, Author) commented Nov 10, 2020

Hi @FroMage, I have 8 cores. You can try decreasing the number of threads in the Quarkus executor from 200 to 10 with this configuration: quarkus.thread-pool.max-threads=10

However, I think I finally figured out the problem. I spent some hours investigating and concluded that it's a simple thread starvation issue.

The quarkus-cache extension configures the Caffeine executor with the Quarkus ManagedExecutor, as we can see in the code below:

@Recorder
public class CaffeineCacheBuildRecorder {

    private static final Logger LOGGER = Logger.getLogger(CaffeineCacheBuildRecorder.class);

    public void buildCaches(BeanContainer beanContainer,
            Set<CaffeineCacheInfo> cacheInfos) {
        // The number of caches is known at build time so we can use fixed initialCapacity and loadFactor for the caches map.
        Map<String, CaffeineCache> caches = new HashMap<>(cacheInfos.size() + 1, 1.0F);

        ManagedExecutor managedExecutor = Arc.container().instance(ManagedExecutor.class).orElse(null);

        for (CaffeineCacheInfo cacheInfo : cacheInfos) {
            if (LOGGER.isDebugEnabled()) {
                LOGGER.debugf(
                        "Building Caffeine cache [%s] with [initialCapacity=%s], [maximumSize=%s], [expireAfterWrite=%s] and [expireAfterAccess=%s]",
                        cacheInfo.name, cacheInfo.initialCapacity, cacheInfo.maximumSize, cacheInfo.expireAfterWrite,
                        cacheInfo.expireAfterAccess);
            }
            CaffeineCache cache = new CaffeineCache(cacheInfo, managedExecutor);
            caches.put(cacheInfo.name, cache);
        }

        beanContainer.instance(CacheRepository.class).setCaches(caches);
    }
}

When the context propagation extension is loaded, managedExecutor receives the current executor and Caffeine is configured to use it. If context propagation is not loaded, managedExecutor receives null and Caffeine falls back to ForkJoinPool.commonPool().
Using the Quarkus executor causes the deadlock: on a cache miss, Caffeine uses its executor to compute the value asynchronously, but if all executor threads are blocked in the cache interceptor, there is no thread left to compute the value, and the deadlock happens.
I don't know if there is a reason to use the ManagedExecutor in Caffeine, but the solution to the problem is to pass a null executor when constructing the CaffeineCache object.
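The starvation described above can be reproduced with plain JDK classes. The sketch below is illustrative (the class name and sizes are made up, not from the Quarkus code): a pool of size 1 stands in for the Quarkus executor, and a "request" task submits its cache computation back to the same pool, as Caffeine does when handed the ManagedExecutor.

```java
import java.util.concurrent.*;

public class StarvationDemo {

    static String simulate() throws Exception {
        // A request pool of size 1 stands in for the Quarkus executor.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Future<String> request = pool.submit(() -> {
                // The "request" thread asks the SAME pool to compute the value...
                Future<String> cacheComputation = pool.submit(() -> "value");
                try {
                    // ...then blocks waiting for it. No free thread is left to
                    // run the computation, so without the timeout this would
                    // wait forever: the deadlock.
                    return cacheComputation.get(500, TimeUnit.MILLISECONDS);
                } catch (TimeoutException starved) {
                    return "starved";
                }
            });
            return request.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate()); // prints "starved"
    }
}
```

With a timeout the demo exits instead of hanging; in the real interceptor there is no timeout on the single-caller path, so the threads park forever, exactly as in the thread dump above.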

@gwenneg (Member) commented Nov 10, 2020

We use ManagedExecutor in this code to allow context propagation into cache computation threads.

@sivelli (Contributor, Author) commented Nov 10, 2020

Hi @gwenneg, thank you for the explanation.

I'm facing thread starvation because the number of threads in the ManagedExecutor is limited and the threads Quarkus uses to run requests are shared with Caffeine. Do you think it is possible to create a ManagedExecutor only for Caffeine?

@gwenneg (Member) commented Nov 10, 2020

Propagating the context into cache computation and isolating the cache executor sound like conflicting goals to me, but I'm not a context propagation expert. @FroMage WDYT?

@gwenneg (Member) commented Nov 10, 2020

@ben-manes Is there a way to avoid that thread starvation issue by changing the way Caffeine is used in the cache extension?

@ben-manes commented Nov 10, 2020

I remember us discussing using AsyncCache in the CacheResultInterceptor, but I don't recall the reason why. Skimming the code, I think this might have been for the lockTimeout feature to allow the computation to continue while also not blocking other callers indefinitely who are also waiting for it.

Assuming you didn't have that feature, then using a synchronous cache would execute the work on the calling thread. That sounds like what you want here, right?

If you wanted both, but to perform the work on the caller, then you could complete the future yourself, such as:

CompletableFuture<Object> future = new CompletableFuture<Object>();
var prior = cache.asMap().putIfAbsent(key, future);
if (prior == null) {
  // This caller inserted the future, so it computes on its own thread.
  try {
    Object value = // ...
    future.complete(value);
    return value;
  } catch (Throwable t) {
    future.completeExceptionally(t);
    throw t;
  }
} else {
  // Another caller is already computing; wait on ITS future (not the unused
  // one created above), honoring the lock timeout.
  return prior.get(binding.lockTimeout(), TimeUnit.MILLISECONDS);
}

This would avoid using the executor for the computations, while still giving you the flexibility of future-based computations.
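A self-contained sketch of that pattern follows, using a plain ConcurrentHashMap where the interceptor would use cache.asMap(); the class name, the loader parameter, and the timeout handling are illustrative, not taken from Quarkus or Caffeine.

```java
import java.util.concurrent.*;
import java.util.function.Function;

// Illustrative stand-in for the interceptor logic: the first caller computes
// on its own thread, while later callers wait on the shared future with a timeout.
public class SameThreadCache {
    private final ConcurrentMap<String, CompletableFuture<Integer>> map =
            new ConcurrentHashMap<>();

    public Integer get(String key, Function<String, Integer> loader, long timeoutMs)
            throws Exception {
        CompletableFuture<Integer> future = new CompletableFuture<>();
        CompletableFuture<Integer> prior = map.putIfAbsent(key, future);
        if (prior == null) {
            try {
                Integer value = loader.apply(key); // runs on the caller's thread
                future.complete(value);
                return value;
            } catch (Throwable t) {
                future.completeExceptionally(t);
                map.remove(key, future); // don't keep failed computations
                throw t;
            }
        }
        // Another caller is computing; wait on its future, honoring the timeout.
        return prior.get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        SameThreadCache cache = new SameThreadCache();
        System.out.println(cache.get("hello", String::length, 100)); // computed: 5
        System.out.println(cache.get("hello", s -> -1, 100));        // cached: 5
    }
}
```

Because the computation runs on the calling thread, no executor is involved and the starvation cannot occur, while waiters still get future-based timeouts.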

@gwenneg (Member) commented Nov 11, 2020

Thanks for your answer @ben-manes. I'll work on something based on that suggestion.

gwenneg self-assigned this Nov 11, 2020
@sivelli (Contributor, Author) commented Nov 11, 2020

Thank you guys. If there is anything I can do to help, please let me know.

@gwenneg (Member) commented Nov 11, 2020

I just submitted a PR which should fix the issue when the lock timeout feature from the cache extension is disabled.

I wonder how we should deal with deadlocks when lock timeout is on though...

@ben-manes commented:

Why is there a distinction?

In both the old and new code, if the caller is the one adding a new mapping (isCurrentThreadComputation), then that caller does not honor the lock timeout and blocks until the computation completes. It is only the subsequent callers, who find an existing mapping, who honor the timeout to exit early, which can still be done by waiting on the future. If those behaviors are to remain, then you never need to delegate the computation to an Executor and can use the same-thread approach.

Only if you want the caller who created the new mapping to also honor the lock timeout does the work need to be delegated to a thread pool. In that case the starvation may still occur.

Unless I'm mistaken, I think you can remove the executor, use only getWithSyncComputation, and pass in the lock timeout for the case of waiting on a cache hit.

@gwenneg (Member) commented Nov 12, 2020

You're not mistaken; I didn't realize all cases could be treated the same way. I'll update the PR tomorrow. Thanks @ben-manes!

@FroMage (Member) commented Nov 12, 2020

By the way, the reason the ManagedExecutor starves you is that it uses the Quarkus thread pool underneath. That is by design, so that we avoid every extension having its own thread pool and ending up with hundreds of threads for no good reason.

Also, in the future it's quite possible that the Quarkus thread pool will get much faster access to ThreadLocals than other threads, so there is even more value in keeping it instead of spawning thread pools left and right.

@gwenneg (Member) commented Nov 12, 2020

Thanks for that information @FroMage.

@ben-manes I updated the PR; I hope starvation is gone for good now! :) The extension code is much simpler now.

@ben-manes commented:

wow, that is so much cleaner and simpler. very nice.
