
EventLoop thread blocked by EmitterProcessor.onNext(…) causes timeouts #1086

Closed
trishaarao79 opened this issue Jul 24, 2019 · 2 comments
Labels: type: bug (A general bug)
Milestone: 5.2.0

Comments

trishaarao79 commented Jul 24, 2019

Bug Report

Current Behavior

We recently started using Lettuce with Amazon ElastiCache. We're on the latest version, 5.1.7.RELEASE, with the default configuration. We have a 3-shard ElastiCache cluster with a master/slave configuration. One of our services deployed this Lettuce version in production and, within about 2-3 hours, started observing CommandTimeoutExceptions on PING against the ElastiCache nodes, along with a spike in CPU load. Other than PING, there is hardly any other Redis activity on the service nodes.

We took thread dumps from healthy and unhealthy instances; the dump from the unhealthy node is attached. Based on our findings we tried disabling the periodic topology refresh, and that seems to fix the command timeouts. We also observed errors while loading partitions.
All of the event loop threads are stuck in the TIMED_WAITING state inside the EmitterProcessor flow.
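For context, the connection events in the dump below go through the client's event bus (DefaultEventBus), which in 5.1.x is backed by an EmitterProcessor, so a slow or missing downstream consumer can end up parking the publishing netty thread. A minimal, purely illustrative sketch of subscribing to that bus via ClientResources (the handler and logging are ours, not part of Lettuce):

import io.lettuce.core.event.Event;
import io.lettuce.core.event.EventBus;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class EventBusSubscriptionSketch {

    public static void main(String[] args) {
        ClientResources resources = DefaultClientResources.create();

        // In Lettuce 5.1.x, DefaultEventBus publishes into an EmitterProcessor.
        // ConnectionEventTrigger.channelInactive(...) publishes connection events
        // from the netty event loop; if downstream consumers do not keep up, the
        // publishing thread can stall in EmitterProcessor.onNext(...).
        EventBus eventBus = resources.eventBus();

        // Illustrative consumer; a real service might feed these into metrics.
        eventBus.get().subscribe(event -> System.out.println("Lettuce event: " + event));

        resources.shutdown();
    }
}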

Thread dump from the unhealthy instance

"lettuce-eventExecutorLoop-1-3" #96 daemon prio=5 os_prio=0 tid=0x00007f7a2c073000 nid=0x185 waiting on condition [0x00007f7a04fd9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000ec810bf0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at io.lettuce.core.cluster.topology.RefreshFutures.awaitAll(RefreshFutures.java:54)
	at io.lettuce.core.cluster.topology.AsyncConnections.optionalGet(AsyncConnections.java:118)
	at io.lettuce.core.cluster.topology.ClusterTopologyRefresh.loadViews(ClusterTopologyRefresh.java:90)
	at io.lettuce.core.cluster.RedisClusterClient.doLoadPartitions(RedisClusterClient.java:871)
	at io.lettuce.core.cluster.RedisClusterClient.loadPartitions(RedisClusterClient.java:851)
	at io.lettuce.core.cluster.RedisClusterClient.reloadPartitions(RedisClusterClient.java:790)
	at io.lettuce.core.cluster.ClusterTopologyRefreshScheduler$ClusterTopologyRefreshTask.run(ClusterTopologyRefreshScheduler.java:204)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:73)
	at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"lettuce-nioEventLoop-4-1" #95 daemon prio=5 os_prio=0 tid=0x00007f7a39f22000 nid=0x184 runnable [0x00007f7a052da000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
	at reactor.core.publisher.EmitterProcessor.onNext(EmitterProcessor.java:262)
	at io.lettuce.core.event.DefaultEventBus.publish(DefaultEventBus.java:46)
	at io.lettuce.core.ConnectionEventTrigger.channelInactive(ConnectionEventTrigger.java:52)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.protocol.ConnectionWatchdog.channelInactive(ConnectionWatchdog.java:181)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.protocol.CommandHandler.channelInactive(CommandHandler.java:329)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.ChannelGroupListener.channelInactive(ChannelGroupListener.java:46)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.SslConnectionBuilder$SslChannelInitializer$2.channelInactive(SslConnectionBuilder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1078)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandle

Environment

  • Lettuce version(s): 5.1.7.RELEASE
  • Redis version: 5.x

Possible Solution

Disabling the periodic topology refresh eliminated the errors for us.
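For reference, a minimal sketch of how the periodic refresh can be disabled through the standard Lettuce 5.x cluster options (the endpoint below is a placeholder):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class DisablePeriodicRefresh {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create("rediss://my-elasticache-endpoint:6379"));

        // Turn off the scheduled topology refresh that runs on the
        // lettuce-eventExecutorLoop threads; adaptive refresh triggers
        // can still be enabled separately if needed.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(false)
                .build();

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
    }
}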

@trishaarao79 trishaarao79 added the type: bug (A general bug) label Jul 24, 2019
@trishaarao79 trishaarao79 reopened this Jul 24, 2019
@mp911de mp911de added this to the 5.2.0 milestone Jul 29, 2019
@mp911de mp911de changed the title from "CommandTimeoutExceptions with Lettuce Core 5.1.7-Release" to "EventLoop thread blocked by EmitterProcessor.onNext(…) causes timeouts" Jul 29, 2019
mp911de added a commit that referenced this issue Jul 29, 2019
Lettuce now uses DirectProcessor as a non-blocking event bus. DirectProcessor does not block callers of onNext(…). Previously, EmitterProcessor used a bounded queue that blocked the caller if downstream consumers did not consume events in time.
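A minimal sketch of the behavioral difference between the two Reactor processors as they existed at the time (payload strings and buffer size are illustrative only):

import reactor.core.publisher.DirectProcessor;
import reactor.core.publisher.EmitterProcessor;

public class ProcessorBehaviorSketch {

    public static void main(String[] args) {
        // EmitterProcessor buffers emissions in a bounded queue. Once the queue
        // is full and subscribers signal no further demand, onNext(...) parks the
        // calling thread until capacity frees up - in Lettuce that caller was a
        // netty event loop thread publishing connection events.
        EmitterProcessor<String> blockingBus = EmitterProcessor.create(16);

        // DirectProcessor keeps no buffer and never blocks the caller: events are
        // handed straight to currently subscribed consumers, and a subscriber
        // without outstanding demand receives an overflow error instead of
        // stalling the publisher.
        DirectProcessor<String> nonBlockingBus = DirectProcessor.create();

        nonBlockingBus.subscribe(event -> System.out.println("event: " + event));

        // Publishing from any thread returns immediately, regardless of how fast
        // downstream consumers are.
        nonBlockingBus.onNext("connection.inactive");
        nonBlockingBus.onComplete();
    }
}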
mp911de (Collaborator) commented Jul 29, 2019

That's fixed now.

HarishMalavade commented

Hi, is this issue fixed?

We have a similar AWS cluster set up in our environment and are seeing a lot of lettuce-timer and lettuce-eventExecutorLoop threads in the WAITING state, causing extended slowness in our Redis operations.

We are using RedisClusterClient to connect to our cluster in AWS, with lettuce-core 5.1.8.RELEASE.
