
Deadlock using Caffeine cache and context propagation #13158

Closed
sivelli opened this issue Nov 6, 2020 · 20 comments · Fixed by #13244
Labels: area/cache, kind/bug

@sivelli (Contributor) commented Nov 6, 2020

Describe the bug
When using quarkus-cache together with the quarkus-smallrye-context-propagation extension, a method annotated with @CacheResult deadlocks under 200 or more concurrent calls.

Expected behavior
No deadlock

Actual behavior
When using a tool like JMeter to submit 200 or more simultaneous calls, all threads immediately deadlock on a method annotated with @CacheResult.
Removing the quarkus-smallrye-context-propagation extension makes the issue go away.

To Reproduce

I created an example project to reproduce this issue: https://github.com/sivelli/quarkus-cache-deadlock
But the code is really simple:

import io.quarkus.cache.CacheResult;
import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ExampleCache {
    @CacheResult(cacheName = "test")
    public Integer getLength(String param) {
        return param.length();
    }
}

import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/resteasy/hello")
public class ExampleResource {
    @Inject
    ExampleCache cache;

    AtomicInteger counter = new AtomicInteger();

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello(@QueryParam("name") String name) {
        System.out.println("request received " + counter.incrementAndGet());
        return "hello " + name + " Length:" + cache.getLength(Objects.toString(name, ""));
    }
}

In pom.xml, if the following dependency is present, the problem happens when making 200 simultaneous calls:

    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-smallrye-context-propagation</artifactId>
    </dependency>    

If this dependency is removed, no deadlock happens.

Steps to reproduce the behavior:

  1. Start the application
  2. Submit 200 or more simultaneous requests (run testCache.jmx)

Environment:

  • Output of uname -a or ver:
    Microsoft Windows [versão 10.0.18363.657]
  • Output of java -version:
    openjdk version "11.0.8" 2020-07-14
    OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
    OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.8+10, mixed mode)
  • Quarkus version: 1.9.2.Final
  • Output of mvn -version
    Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
    Maven home: D:\Projetos\Java\apache-maven-3.6.3\bin..
    Java version: 11.0.8, vendor: Red Hat, Inc., runtime: D:\Projetos\Java\jdk-11.0.8.10-2
    Default locale: pt_BR, platform encoding: Cp1252
    OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
sivelli added the kind/bug label Nov 6, 2020
@sivelli (Contributor, Author) commented Nov 6, 2020

This is the stack trace of one of the deadlocked threads:

"executor-thread-1" #16 daemon prio=5 os_prio=0 cpu=0.00ms elapsed=106.25s allocated=140K defined_classes=5 tid=0x0000029697292000 nid=0x6610 waiting on condition [0x000000d557cfd000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
- parking to wait for <0x000000070e334238> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/LockSupport.java:194)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.8/CompletableFuture.java:1796)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.8/ForkJoinPool.java:3128)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.8/CompletableFuture.java:1823)
at java.util.concurrent.CompletableFuture.get(java.base@11.0.8/CompletableFuture.java:1998)
at io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:46)
at io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(CacheResultInterceptor_Bean.zig:334)
at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
at org.acme.resteasy.ExampleCache_Subclass.getLength(ExampleCache_Subclass.zig:162)
at org.acme.resteasy.ExampleCache_ClientProxy.getLength(ExampleCache_ClientProxy.zig:185)
at org.acme.resteasy.ExampleResource.hello(ExampleResource.java:36)
at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.8/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.8/Method.java:566)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
at org.jboss.resteasy.core.ResourceMethodInvoker$$Lambda$259/0x00000008002dbc40.get(Unknown Source)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
- locked <0x0000000702595648> (a org.jboss.resteasy.core.interception.jaxrs.PostMatchContainerRequestContext)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher$$Lambda$247/0x00000008002d8040.run(Unknown Source)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.SynchronousDispatcher$$Lambda$254/0x00000008002d9c40.get(Unknown Source)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
- locked <0x00000007025956a0> (a org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:131)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:37)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:94)
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2046)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1578)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1452)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at java.lang.Thread.run(java.base@11.0.8/Thread.java:834)
at org.jboss.threads.JBossThread.run(JBossThread.java:479)

@gsmet (Member) commented Nov 9, 2020

@FroMage @Ladicek does it ring a bell?

@FroMage (Member) commented Nov 9, 2020

Doesn't ring a bell, but you also observed a CI issue in CP, so perhaps it's the same thing?

@FroMage (Member) commented Nov 9, 2020

I tried it and successfully ran JMeter with 1000 threads; I can't reproduce.

@sivelli (Contributor, Author) commented Nov 9, 2020

Hi @FroMage. I know that reproducing deadlocks isn't easy, but in this case it happens almost every time in my tests. I tested it on two different machines, using Linux and Windows, with Java 11 and Java 14.
I changed the code a little to force a cache miss on each request. It may help to reproduce the error.
Anyway, I'll keep investigating this issue. I had to disable all caching in my application to avoid this situation.

@FroMage (Member) commented Nov 10, 2020

Tried it again and still no deadlock. How many cores do you have?

@sivelli (Contributor, Author) commented Nov 10, 2020

Hi @FroMage, I have 8 cores. You can try decreasing the number of threads in the Quarkus executor from 200 to 10 with this configuration: quarkus.thread-pool.max-threads=10

However, I think I finally figured out the problem. I spent some hours investigating and concluded that it's a simple thread starvation issue.

The quarkus-cache extension configures the Caffeine executor with the Quarkus ManagedExecutor, as we can see in the code below:

@Recorder
public class CaffeineCacheBuildRecorder {

    private static final Logger LOGGER = Logger.getLogger(CaffeineCacheBuildRecorder.class);

    public void buildCaches(BeanContainer beanContainer,
            Set<CaffeineCacheInfo> cacheInfos) {
        // The number of caches is known at build time so we can use fixed initialCapacity and loadFactor for the caches map.
        Map<String, CaffeineCache> caches = new HashMap<>(cacheInfos.size() + 1, 1.0F);

        ManagedExecutor managedExecutor = Arc.container().instance(ManagedExecutor.class).orElse(null);

        for (CaffeineCacheInfo cacheInfo : cacheInfos) {
            if (LOGGER.isDebugEnabled()) {
                LOGGER.debugf(
                        "Building Caffeine cache [%s] with [initialCapacity=%s], [maximumSize=%s], [expireAfterWrite=%s] and [expireAfterAccess=%s]",
                        cacheInfo.name, cacheInfo.initialCapacity, cacheInfo.maximumSize, cacheInfo.expireAfterWrite,
                        cacheInfo.expireAfterAccess);
            }
            CaffeineCache cache = new CaffeineCache(cacheInfo, managedExecutor);
            caches.put(cacheInfo.name, cache);
        }

        beanContainer.instance(CacheRepository.class).setCaches(caches);
    }
}

When the context propagation extension is loaded, managedExecutor receives the current executor and Caffeine is configured to use it. If context propagation is not loaded, managedExecutor receives null and Caffeine falls back to ForkJoinPool.commonPool().
Using the Quarkus executor causes the deadlock: on a cache miss, Caffeine uses its executor to compute the value asynchronously, but if all executor threads are blocked in the cache interceptor, there is no thread left to compute the value, and the deadlock happens.
I don't know if there is a reason to use the ManagedExecutor in Caffeine, but the solution to the problem is to pass a null executor when constructing the CaffeineCache object.
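The starvation described above can be reproduced with plain JDK classes. The sketch below is illustrative (the class name and sizes are made up, not from the Quarkus code): a pool of size 1 stands in for the Quarkus executor, and a "request" task submits its cache computation back to the same pool, as Caffeine does when handed the ManagedExecutor.

```java
import java.util.concurrent.*;

public class StarvationDemo {

    static String simulate() throws Exception {
        // A request pool of size 1 stands in for the Quarkus executor.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Future<String> request = pool.submit(() -> {
                // The "request" thread asks the SAME pool to compute the value...
                Future<String> cacheComputation = pool.submit(() -> "value");
                try {
                    // ...then blocks waiting for it. No free thread is left to
                    // run the computation, so without the timeout this would
                    // wait forever: the deadlock.
                    return cacheComputation.get(500, TimeUnit.MILLISECONDS);
                } catch (TimeoutException starved) {
                    return "starved";
                }
            });
            return request.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate()); // prints "starved"
    }
}
```

With a timeout the demo exits instead of hanging; in the real interceptor there is no timeout on the single-caller path, so the threads park forever, exactly as in the thread dump above.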

@gwenneg (Member) commented Nov 10, 2020

We use ManagedExecutor in this code to allow context propagation into cache computation threads.

@sivelli (Contributor, Author) commented Nov 10, 2020

Hi @gwenneg, thank you for the explanation.

I'm facing thread starvation because the number of threads in the ManagedExecutor is limited and the threads Quarkus uses to run requests are shared with Caffeine. Do you think it is possible to create a ManagedExecutor only for Caffeine?

@gwenneg (Member) commented Nov 10, 2020

Propagating the context into cache computation and isolating the cache executor sound like conflicting goals to me, but I'm not a context propagation expert. @FroMage WDYT?

@gwenneg (Member) commented Nov 10, 2020

@ben-manes Is there a way to avoid that thread starvation issue by changing the way Caffeine is used in the cache extension?

@ben-manes commented Nov 10, 2020

I remember us discussing using AsyncCache in the CacheResultInterceptor, but I don't recall the reason why. Skimming the code, I think this might have been for the lockTimeout feature to allow the computation to continue while also not blocking other callers indefinitely who are also waiting for it.

Assuming you didn't have that feature, then using a synchronous cache would execute the work on the calling thread. That sounds like what you want here, right?

If you wanted both, but to perform the work on the caller, then you could complete the future yourself, such as:

CompletableFuture<Object> future = new CompletableFuture<Object>();
var prior = cache.asMap().putIfAbsent(key, future);
if (prior == null) {
  // This caller inserted the future, so it computes on its own thread.
  try {
    Object value = // ...
    future.complete(value);
    return value;
  } catch (Throwable t) {
    future.completeExceptionally(t);
    throw t;
  }
} else {
  // Another caller is already computing; wait on ITS future (not the unused
  // one created above), honoring the lock timeout.
  return prior.get(binding.lockTimeout(), TimeUnit.MILLISECONDS);
}

This would avoid using the executor for the computations, while still giving you the flexibility of future-based computations.
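A self-contained sketch of that pattern follows, using a plain ConcurrentHashMap where the interceptor would use cache.asMap(); the class name, the loader parameter, and the timeout handling are illustrative, not taken from Quarkus or Caffeine.

```java
import java.util.concurrent.*;
import java.util.function.Function;

// Illustrative stand-in for the interceptor logic: the first caller computes
// on its own thread, while later callers wait on the shared future with a timeout.
public class SameThreadCache {
    private final ConcurrentMap<String, CompletableFuture<Integer>> map =
            new ConcurrentHashMap<>();

    public Integer get(String key, Function<String, Integer> loader, long timeoutMs)
            throws Exception {
        CompletableFuture<Integer> future = new CompletableFuture<>();
        CompletableFuture<Integer> prior = map.putIfAbsent(key, future);
        if (prior == null) {
            try {
                Integer value = loader.apply(key); // runs on the caller's thread
                future.complete(value);
                return value;
            } catch (Throwable t) {
                future.completeExceptionally(t);
                map.remove(key, future); // don't keep failed computations
                throw t;
            }
        }
        // Another caller is computing; wait on its future, honoring the timeout.
        return prior.get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        SameThreadCache cache = new SameThreadCache();
        System.out.println(cache.get("hello", String::length, 100)); // computed: 5
        System.out.println(cache.get("hello", s -> -1, 100));        // cached: 5
    }
}
```

Because the computation runs on the calling thread, no executor is involved and the starvation cannot occur, while waiters still get future-based timeouts.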

@gwenneg (Member) commented Nov 11, 2020

Thanks for your answer @ben-manes. I'll work on something based on that suggestion.

gwenneg self-assigned this Nov 11, 2020
@sivelli (Contributor, Author) commented Nov 11, 2020

Thank you guys. If there is anything I can do to help, please let me know.

@gwenneg (Member) commented Nov 11, 2020

I just submitted a PR which should fix the issue when the lock timeout feature from the cache extension is disabled.

I wonder how we should deal with deadlocks when lock timeout is on though...

@ben-manes commented:

Why is there a distinction?

In both the old and new code, if the caller is the one adding a new mapping (isCurrentThreadComputation), then that caller does not honor the lock timeout and blocks until the computation completes. It is only the subsequent callers, who find an existing mapping, who honor the timeout to exit early, which can still be done by waiting on the future. If those behaviors are to remain, then you never need to delegate the computation to an Executor and can use the same-thread approach.

Only if you want the caller who created the new mapping to also honor the lock timeout does the work need to be delegated to a thread pool. In that case the starvation may still occur.

Unless I'm mistaken, I think you can remove the executor, use only getWithSyncComputation, and pass in the lock timeout for the case of waiting on a cache hit.

@gwenneg (Member) commented Nov 12, 2020

You're not mistaken; I didn't realize all cases could be treated the same way. I'll update the PR tomorrow. Thanks @ben-manes!

@FroMage (Member) commented Nov 12, 2020

By the way, the reason the ManagedExecutor starves you is that it uses the Quarkus thread pool underneath. That is by design, so that we avoid every extension having its own thread pool and ending up with hundreds of threads for no good reason.

Also, in the future it's quite possible that the Quarkus thread pool will get much faster access to ThreadLocals than other threads, so there is even more value in keeping it instead of spawning thread pools left and right.

@gwenneg (Member) commented Nov 12, 2020

Thanks for that information @FroMage.

@ben-manes I updated the PR; I hope starvation is gone for good now! :) The extension code is much simpler now.

@ben-manes commented:

wow, that is so much cleaner and simpler. very nice.
