
PauseDetector acquisition hang in DefaultCommandLatencyCollector #1300

Closed
ackerL opened this issue Jun 1, 2020 · 5 comments
Labels
type: bug A general bug
Milestone

Comments


ackerL commented Jun 1, 2020

Bug Report

Current Behavior

Lettuce threads block after upgrading to the latest Lettuce version.

Stack trace

After my service onboarded the latest release, 5.3.0.RELEASE, it got stuck in a blocked state. The problem does not occur after I roll back the Lettuce version to `4.2.2.Final`.

It may be related to the Lettuce upgrade; I am not sure of the root cause. Here is the thread info for the Java process:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12459 nobody    20   0 6529m 2.2g  31m R 35.3 14.7   1438:04 lettuce-epollEv
12438 nobody    20   0 6529m 2.2g  31m R 33.9 14.7   1438:36 lettuce-epollEv
12468 nobody    20   0 6529m 2.2g  31m R 33.6 14.7   1437:31 lettuce-epollEv
12456 nobody    20   0 6529m 2.2g  31m R 32.6 14.7   1437:04 lettuce-epollEv
12462 nobody    20   0 6529m 2.2g  31m R 32.6 14.7   1438:36 lettuce-epollEv
12465 nobody    20   0 6529m 2.2g  31m R 32.6 14.7   1437:54 lettuce-epollEv
12467 nobody    20   0 6529m 2.2g  31m R 32.6 14.7   1437:44 lettuce-epollEv
12457 nobody    20   0 6529m 2.2g  31m R 32.3 14.7   1437:32 lettuce-epollEv

I picked the thread with PID 12459 and looked up its detailed state in the jstack output.

Thread 12468: (state = IN_JAVA)
 - io.lettuce.core.metrics.DefaultCommandLatencyCollector$DefaultPauseDetectorWrapper.getPauseDetector() @bci=4, line=362 (Compiled frame; information may be imprecise)
 - io.lettuce.core.metrics.DefaultCommandLatencyCollector.recordCommandLatency(java.net.SocketAddress, java.net.SocketAddress, io.lettuce.core.protocol.ProtocolKeyword, long, long) @bci=57, line=99 (Compiled frame)
 - io.lettuce.core.protocol.CommandHandler.recordLatency(io.lettuce.core.protocol.WithLatency, io.lettuce.core.protocol.ProtocolKeyword) @bci=82, line=787 (Compiled frame)
 - io.lettuce.core.protocol.CommandHandler.decode(io.netty.channel.ChannelHandlerContext, io.netty.buffer.ByteBuf, io.lettuce.core.protocol.RedisCommand) @bci=65, line=670 (Compiled frame)
 - io.lettuce.core.protocol.CommandHandler.decode(io.netty.channel.ChannelHandlerContext, io.netty.buffer.ByteBuf) @bci=118, line=596 (Compiled frame)
 - io.lettuce.core.protocol.CommandHandler.channelRead(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=252, line=565 (Compiled frame)
 - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) @bci=16, line=379 (Compiled frame)
 - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(io.netty.channel.AbstractChannelHandlerContext, java.lang.Object) @bci=31, line=365 (Compiled frame)
 - io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(java.lang.Object) @bci=7, line=357 (Compiled frame)
 - io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=2, line=1410 (Compiled frame)
 - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) @bci=16, line=379 (Compiled frame)
 - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(io.netty.channel.AbstractChannelHandlerContext, java.lang.Object) @bci=31, line=365 (Compiled frame)
 - io.netty.channel.DefaultChannelPipeline.fireChannelRead(java.lang.Object) @bci=5, line=919 (Compiled frame)
 - io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady() @bci=219, line=792 (Compiled frame)
 - io.netty.channel.epoll.EpollEventLoop.processReady(io.netty.channel.epoll.EpollEventArray, int) @bci=135, line=475 (Compiled frame)
 - io.netty.channel.epoll.EpollEventLoop.run() @bci=313, line=378 (Compiled frame)
 - io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=989 (Compiled frame)
 - io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Compiled frame)
 - io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=749 (Compiled frame)

Input Code

// your code here;

Expected behavior/code

Environment

  • Lettuce version(s): [5.3.0.RELEASE]
  • Redis version: [Redis 3.x]

Possible Solution

Additional context

@ackerL ackerL added the type: bug A general bug label Jun 1, 2020
@ackerL ackerL changed the title lettuce-epollEventGroup hang for along time lettuce-epollEventGroup hang for a long time Jun 1, 2020
Collaborator

mp911de commented Jun 5, 2020

Can you provide a reproducer? We recently made a change in the mentioned area to fix an issue; potentially, we introduced another bug.

@mp911de mp911de added this to the 5.3.1 milestone Jun 8, 2020
Collaborator

mp911de commented Jun 8, 2020

I decided to refactor the code for improved PauseDetector acquisition that guards against concurrent creation/shutdown of RedisClient (specifically DefaultCommandLatencyCollector).

mp911de added a commit that referenced this issue Jun 8, 2020
Previously, PauseDetectorWrapper was initialized once, and a later loop attempted to retrieve the PauseDetector on the assumption that the earlier code had initialized it.

Due to concurrency, it is possible for the PauseDetector instance to be released while another thread wants to obtain it. This gap could lead to infinite loops because the instance is never allocated again. The refactored code now includes the initialization in the acquisition loop, so concurrent initialization and release are properly guarded.
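For illustration, here is a minimal Java sketch of the gap described above. It is a simplified model, not the actual Lettuce implementation: the names PauseDetectorHolder, getPauseDetectorBuggy and getPauseDetectorFixed are made up for this example, and a plain Object stands in for the real PauseDetector.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the race; class and method names are illustrative
// and do not mirror the real DefaultCommandLatencyCollector internals.
class PauseDetectorHolder {

    // A plain Object stands in for the real PauseDetector instance.
    private final AtomicReference<Object> pauseDetector = new AtomicReference<>();

    // Old pattern: assumes some earlier code path already created the detector.
    // If a concurrent shutdown() resets the reference to null, this loop spins
    // forever because nothing will ever allocate a new instance.
    Object getPauseDetectorBuggy() {
        Object detector;
        while ((detector = pauseDetector.get()) == null) {
            // busy-wait: hangs indefinitely if no other thread re-initializes
        }
        return detector;
    }

    // Fixed pattern: the acquisition loop performs the initialization itself,
    // so a concurrent release only causes a retry instead of an endless wait.
    Object getPauseDetectorFixed() {
        for (;;) {
            Object detector = pauseDetector.get();
            if (detector != null) {
                return detector;
            }
            Object candidate = new Object(); // create a new detector
            if (pauseDetector.compareAndSet(null, candidate)) {
                return candidate;
            }
            // lost the race against another initializer or a release; retry
        }
    }

    void shutdown() {
        pauseDetector.set(null); // concurrent release that opens the gap
    }
}
```

The fixed variant never spins without making progress: if the reference is empty, the calling thread attempts the allocation itself, so a concurrent shutdown() can only force a retry.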
mp911de added a commit that referenced this issue Jun 8, 2020
@mp911de mp911de changed the title lettuce-epollEventGroup hang for a long time PauseDetector acquisition in DefaultCommandLatencyCollector hang for a long time Jun 8, 2020
@mp911de mp911de changed the title PauseDetector acquisition in DefaultCommandLatencyCollector hang for a long time PauseDetector acquisition hang in DefaultCommandLatencyCollector Jun 8, 2020
Collaborator

mp911de commented Jun 8, 2020

That's fixed and backported now.

@mp911de mp911de closed this as completed Jun 8, 2020
Author

ackerL commented Jun 19, 2020

I decided to refactor the code for improved PauseDetector acquisition that guards against concurrent creation/shutdown of RedisClient (specifically DefaultCommandLatencyCollector).

Thanks for your fix. After looking at the change in the commit, I think it makes sense.
One question about how you found the root cause: did you reproduce the issue or analyze the code? I find that the infinite loop in this issue is not easy to reproduce.

Collaborator

mp911de commented Jun 19, 2020

I found it through code analysis. I wasn't able to reproduce the issue, but there was always a potential race when RedisClient objects were started/stopped concurrently, so it made sense to expand the loop to cover these cases.
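For anyone who wants to try to provoke the race, a minimal stress sketch along the following lines might be a starting point. It is a hypothetical illustration, not a confirmed reproducer: it assumes a Redis server at redis://localhost:6379 and default client resources, and it simply creates and shuts down RedisClient instances concurrently while issuing PING so that command latencies get recorded.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stress sketch, not a confirmed reproducer: repeatedly create and
// shut down RedisClient instances from several threads while issuing commands,
// to exercise concurrent initialization/release around the latency collector.
// Assumes a Redis server listening on redis://localhost:6379.
public class ClientLifecycleStress {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                for (int round = 0; round < 1000; round++) {
                    RedisClient client = RedisClient.create("redis://localhost:6379");
                    try (StatefulRedisConnection<String, String> connection = client.connect()) {
                        connection.sync().ping(); // command completion records latency
                    } finally {
                        client.shutdown(); // releases resources concurrently with other threads
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
```

Even a sketch like this may need many iterations, or may never hit the window at all, since the hang requires a release to land exactly while another thread is waiting for the detector, which matches the observation that the issue is hard to reproduce.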
