
EventLoop thread blocked by EmitterProcessor.onNext(…) causes timeouts #1086

Closed
trishaarao79 opened this issue Jul 24, 2019 · 2 comments
Labels: type: bug (A general bug)
Milestone: 5.2.0

Comments

trishaarao79 commented Jul 24, 2019

Bug Report

Current Behavior

We recently started using Lettuce with Amazon ElastiCache. We're on the latest version, 5.1.7.RELEASE, with the default configuration. We have a 3-shard ElastiCache cluster with a master/slave configuration. One of our services deployed this Lettuce version in production and, within about 2-3 hours, started observing CommandTimeoutExceptions on PING against the ElastiCache nodes, along with a spike in CPU load. Other than PING, there is hardly any other Redis activity on the service nodes.

We took thread dumps from healthy and unhealthy instances; the dump from the unhealthy node is attached. Based on our findings we tried disabling the periodic topology refresh, and that seems to fix the command timeouts. We also observed errors while loading partitions.
All of the event loop threads are stuck in the TIMED_WAITING state inside the EmitterProcessor flow.
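For context, the connection events in the dump below go through the client's event bus (DefaultEventBus), which in 5.1.x is backed by an EmitterProcessor, so a slow or missing downstream consumer can end up parking the publishing netty thread. A minimal, purely illustrative sketch of subscribing to that bus via ClientResources (the handler and logging are ours, not part of Lettuce):

import io.lettuce.core.event.Event;
import io.lettuce.core.event.EventBus;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class EventBusSubscriptionSketch {

    public static void main(String[] args) {
        ClientResources resources = DefaultClientResources.create();

        // In Lettuce 5.1.x, DefaultEventBus publishes into an EmitterProcessor.
        // ConnectionEventTrigger.channelInactive(...) publishes connection events
        // from the netty event loop; if downstream consumers do not keep up, the
        // publishing thread can stall in EmitterProcessor.onNext(...).
        EventBus eventBus = resources.eventBus();

        // Illustrative consumer; a real service might feed these into metrics.
        eventBus.get().subscribe(event -> System.out.println("Lettuce event: " + event));

        resources.shutdown();
    }
}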

Thread dump from the unhealthy instance

"lettuce-eventExecutorLoop-1-3" #96 daemon prio=5 os_prio=0 tid=0x00007f7a2c073000 nid=0x185 waiting on condition [0x00007f7a04fd9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000ec810bf0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at io.lettuce.core.cluster.topology.RefreshFutures.awaitAll(RefreshFutures.java:54)
	at io.lettuce.core.cluster.topology.AsyncConnections.optionalGet(AsyncConnections.java:118)
	at io.lettuce.core.cluster.topology.ClusterTopologyRefresh.loadViews(ClusterTopologyRefresh.java:90)
	at io.lettuce.core.cluster.RedisClusterClient.doLoadPartitions(RedisClusterClient.java:871)
	at io.lettuce.core.cluster.RedisClusterClient.loadPartitions(RedisClusterClient.java:851)
	at io.lettuce.core.cluster.RedisClusterClient.reloadPartitions(RedisClusterClient.java:790)
	at io.lettuce.core.cluster.ClusterTopologyRefreshScheduler$ClusterTopologyRefreshTask.run(ClusterTopologyRefreshScheduler.java:204)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:73)
	at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"lettuce-nioEventLoop-4-1" #95 daemon prio=5 os_prio=0 tid=0x00007f7a39f22000 nid=0x184 runnable [0x00007f7a052da000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
	at reactor.core.publisher.EmitterProcessor.onNext(EmitterProcessor.java:262)
	at io.lettuce.core.event.DefaultEventBus.publish(DefaultEventBus.java:46)
	at io.lettuce.core.ConnectionEventTrigger.channelInactive(ConnectionEventTrigger.java:52)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.protocol.ConnectionWatchdog.channelInactive(ConnectionWatchdog.java:181)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.protocol.CommandHandler.channelInactive(CommandHandler.java:329)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.ChannelGroupListener.channelInactive(ChannelGroupListener.java:46)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.lettuce.core.SslConnectionBuilder$SslChannelInitializer$2.channelInactive(SslConnectionBuilder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1078)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandle

Environment

  • Lettuce version(s): 5.1.7.RELEASE
  • Redis version: 5.x

Possible Solution

Disabling the periodic topology refresh eliminated the errors for us.
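For reference, a minimal sketch of how the periodic refresh can be disabled through the standard Lettuce 5.x cluster options (the endpoint below is a placeholder):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class DisablePeriodicRefresh {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create(
                RedisURI.create("rediss://my-elasticache-endpoint:6379"));

        // Turn off the scheduled topology refresh that runs on the
        // lettuce-eventExecutorLoop threads; adaptive refresh triggers
        // can still be enabled separately if needed.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(false)
                .build();

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
    }
}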

@trishaarao79 trishaarao79 added the type: bug (A general bug) label Jul 24, 2019
@trishaarao79 trishaarao79 reopened this Jul 24, 2019
@mp911de mp911de added this to the 5.2.0 milestone Jul 29, 2019
@mp911de mp911de changed the title from "CommandTimeoutExceptions with Lettuce Core 5.1.7-Release" to "EventLoop thread blocked by EmitterProcessor.onNext(…) causes timeouts" Jul 29, 2019
mp911de added a commit that referenced this issue Jul 29, 2019
Lettuce now uses DirectProcessor as a non-blocking event bus. DirectProcessor does not block callers of onNext(…). Previously, EmitterProcessor used a bounded queue that blocked the caller if downstream consumers did not consume events in time.
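A minimal sketch of the behavioral difference between the two Reactor processors as they existed at the time (payload strings and buffer size are illustrative only):

import reactor.core.publisher.DirectProcessor;
import reactor.core.publisher.EmitterProcessor;

public class ProcessorBehaviorSketch {

    public static void main(String[] args) {
        // EmitterProcessor buffers emissions in a bounded queue. Once the queue
        // is full and subscribers signal no further demand, onNext(...) parks the
        // calling thread until capacity frees up - in Lettuce that caller was a
        // netty event loop thread publishing connection events.
        EmitterProcessor<String> blockingBus = EmitterProcessor.create(16);

        // DirectProcessor keeps no buffer and never blocks the caller: events are
        // handed straight to currently subscribed consumers, and a subscriber
        // without outstanding demand receives an overflow error instead of
        // stalling the publisher.
        DirectProcessor<String> nonBlockingBus = DirectProcessor.create();

        nonBlockingBus.subscribe(event -> System.out.println("event: " + event));

        // Publishing from any thread returns immediately, regardless of how fast
        // downstream consumers are.
        nonBlockingBus.onNext("connection.inactive");
        nonBlockingBus.onComplete();
    }
}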
mp911de (Collaborator) commented Jul 29, 2019

That's fixed now.

HarishMalavade commented

Hi, is this issue fixed?

We have a similar AWS cluster set up in our environment and are seeing a lot of lettuce-timer and lettuce-eventExecutorLoop threads in the WAITING state, causing extended slowness in our Redis operations.

We are using RedisClusterClient to connect to our cluster in AWS, with lettuce-core 5.1.8.RELEASE.
