
DefaultEndpoint.QUEUE_SIZE becomes out of sync, preventing command queueing #764

Closed
nivekastoreth opened this issue Apr 20, 2018 · 11 comments
Labels
type: bug A general bug
Milestone
Lettuce 4.4.5


nivekastoreth commented Apr 20, 2018

Observed Version(s): 5.0.3.RELEASE

Introduced in Version(s): 4.4.0.Final

  • Still exists in 5.0.3.RELEASE

Still visible in master? Unknown but likely

  • I've not tested the 5.x branch at all

Expected:
When the request queue size limit is hit, submitted commands are terminated early. Once the request queue drains, new commands are accepted again.

Actual:
I'm still in the process of determining exactly what is happening here. What I'm observing is that when a Redis instance is performing a task that blocks its foreground thread for a substantial amount of time (seconds up to minutes; details on how to do this below), the DefaultEndpoint can become wedged in a state where QUEUE_SIZE is stuck at a non-zero value. If that value is greater than clientOptions.getRequestQueueSize() - commands, validateWrite will never again accept any writes submitted to it.
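To make the symptom concrete, here is a minimal hedged sketch of what the wedged state looks like from the caller's side, using the async handle from the client setup shown further down (the one-second timeout is an arbitrary choice for illustration):

  import java.util.concurrent.TimeUnit

  // Sketch only: once QUEUE_SIZE is stuck above the limit, every new command is
  // rejected up front and fails with a RedisException, even though Redis is idle again.
  val ping = async.ping()
  try ping.get(1, TimeUnit.SECONDS)
  catch { case e: Exception => println(s"still wedged: ${e.getMessage}") }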


To Reproduce
Using the setup shown below, connect to Redis and verify that commands are processed correctly. Then submit a Redis SAVE command, and while that save is running (the large set created below is what makes the save slow; there are other ways to replicate this), submit more than requestQueueSize requests:

# Kick off a long-running SAVE in the background, then flood the endpoint with more
# than requestQueueSize requests while the server is still blocked.
sudo docker exec helix-redis redis-cli save &
for i in {1..100}; do cat payload.json | curl -v -H 'Content-Type: application/json' -d @- http://localhost:8000/redis-endpoint 2>&1; done

Performing the above, and then waiting for the save command to complete, results in the log file:

After the save operation has completed, submitting a single followup request results in the log file:


Speculation:
I believe the dequeue call is never made due, at least in part, to the following error, but I haven't yet tracked down the exact flow that leads to this state:

2018-04-20 14:43:53 UTC [lettuce-nioEventLoop-10-3] WARN  i.n.c.AbstractChannelHandlerContext - Failed to mark a promise as failure because it has failed already: DefaultChannelPromise@4a01a122(failure: io.lettuce.core.RedisException: Internal stack size exceeded: 10. Commands are not accepted until the stack size drops.), unnotified cause: io.lettuce.core.RedisException: Internal stack size exceeded: 10. Commands are not accepted until the stack size drops.
	at io.lettuce.core.protocol.CommandHandler.validateWrite(CommandHandler.java:441)
	at io.lettuce.core.protocol.CommandHandler.addToStack(CommandHandler.java:410)
	at io.lettuce.core.protocol.CommandHandler.writeSingleCommand(CommandHandler.java:367)
	at io.lettuce.core.protocol.CommandHandler.write(CommandHandler.java:334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
	at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute$$$capture(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

Setup:
Redis:

-- sudo docker exec redis redis-cli --eval redis-gen.lua , "large_set" 50000000 0

-- Parameters: set name, number of members to add, and a batch offset (all optional).
local set_name    = #ARGV >= 1 and ARGV[1] or 'large_set'
local entry_count = #ARGV >= 2 and tonumber(ARGV[2]) or 50000000
local entry_start = #ARGV >= 3 and tonumber(ARGV[3]) or 0

local memory = {}

-- Members are synthetic UUID-like strings, e.g. '00000000-0000-0000-0000-100000000000'.
local base = '00000000-0000-0000-0000-'
local min = 100000000000

local start = entry_count * entry_start
local stop = start + entry_count
for user = start, stop, 1 do
  redis.call("sadd", set_name, base .. (min + user))
end

-- Return the resulting cardinality so the caller can verify the set size.
memory[set_name] = redis.call("scard", set_name)
return cjson.encode(memory)

Client:

  import io.lettuce.core.{ClientOptions, RedisClient, RedisFuture, ScriptOutputType}
  import io.lettuce.core.ClientOptions.DisconnectedBehavior
  import io.lettuce.core.api.async.RedisAsyncCommands
  import io.lettuce.core.resource.DefaultClientResources

  val resources = DefaultClientResources.builder()
    .ioThreadPoolSize(11)
    .computationThreadPoolSize(11)
    .build()
  val client = RedisClient.create(resources)
  val options = ClientOptions.builder()
    .autoReconnect(true)
    .requestQueueSize(10) // low number here to make it easier to reproduce
    .pingBeforeActivateConnection(false)
    .cancelCommandsOnReconnectFailure(true)
    .disconnectedBehavior(DisconnectedBehavior.REJECT_COMMANDS)
    .build()
  client.setOptions(options)
  val connection = client.connect(uri) // uri is defined elsewhere in the application

  // later on, submitting requests via (digest, keys and values come from the caller):
  val async: RedisAsyncCommands[String, String] = connection.async()
  val future: RedisFuture[String] = async.evalsha[String](digest.value, ScriptOutputType.VALUE, keys, values: _*)
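A hedged usage sketch of how a queue-overflow rejection surfaces to this caller; the callback wiring below is an assumption, not part of the original application (RedisFuture extends CompletionStage, so whenComplete is available):

  import java.util.function.BiConsumer

  // The early termination itself is the expected behaviour; the bug is that it keeps
  // happening after the queue has drained again.
  future.whenComplete(new BiConsumer[String, Throwable] {
    override def accept(value: String, error: Throwable): Unit =
      if (error != null) println(s"command terminated early: ${error.getMessage}")
      else println(s"script result: $value")
  })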

P.S. This println also seems removable:


mp911de commented Apr 20, 2018

Thanks a lot for your extensive bug report. I fixed the println issue via #765. I need to have a look at what happens with the queue size.

nivekastoreth (Author)

Just a quick note that the save mechanism is just one way of replicating the behaviour, but we first noticed this in production when attempting to delete or unlink a very large Set while under load. I don't believe that there is any real difference between the two use cases from the point of view of the lettuce client, but figured it might be worth mentioning.


mp911de commented Apr 23, 2018

This issue arises when n commands are written to the transport (where n is the queue limit) and an additional n commands are written while the previous ones have not yet completed.

This fits the description where a long-running action causes commands to pile up.
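A hedged client-side sketch of that window, against the setup from the original report (the key names and the count of 30 are arbitrary; requestQueueSize is 10 as configured there):

  // With the server blocked (e.g. by SAVE), fire well over 2 * requestQueueSize
  // commands so that the overflow commands take the rejection-plus-retry path.
  val probes = (1 to 30).map(i => async.get(s"probe-$i"))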


mp911de commented Apr 23, 2018

After further analysis, this bug is a consequence of the retry mechanism combined with the queue-limit protection. Writing a command to a full queue completes the command exceptionally and throws an exception. The ChannelPromise notifies RetryListener, which attempts to retry the command.

The command gets back into the event loop. CommandHandler.write(…) sees that the command is already done, so it is not written again. This no-op causes the resulting future to never complete, as there is nothing to write/flush.

There are two things to do:

  1. Do not resubmit commands that are done.
  2. Complete no-op writes ourselves.
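A rough sketch of those two remediation points follows. It is illustrative only and not the actual lettuce patch: the helper names (FixSketch, onWriteFailed, writeCommand, scheduleRetry) are invented for the sketch, and only RedisCommand.isDone() and the netty write/promise calls are real API.

  import io.lettuce.core.protocol.RedisCommand
  import io.netty.channel.{ChannelHandlerContext, ChannelPromise}

  object FixSketch {

    // 1. Retry path: never resubmit a command that is already done. A command that
    //    was completed exceptionally by the queue-limit check must not re-enter the
    //    pipeline, otherwise it turns into a write that can never complete.
    def onWriteFailed(command: RedisCommand[_, _, _], cause: Throwable): Unit =
      if (!command.isDone) scheduleRetry(command)

    // 2. Write path: when the command is already done, the write is a no-op, so the
    //    handler has to complete the promise itself; otherwise the listeners that
    //    decrement QUEUE_SIZE never fire and the endpoint stays wedged.
    def writeCommand(ctx: ChannelHandlerContext, command: RedisCommand[_, _, _],
                     promise: ChannelPromise): Unit =
      if (command.isDone) promise.trySuccess()
      else ctx.write(command, promise)

    private def scheduleRetry(command: RedisCommand[_, _, _]): Unit = () // placeholder
  }

The actual change landed in CommandHandler, as the commits referenced below describe; the sketch only mirrors the shape of the two checks.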

nivekastoreth (Author)

This lines up with what I had been seeing. I had thought about not using RetryListener (i.e. Reliability.AT_MOST_ONCE), but since the reliability mode is controlled by clientOptions.isAutoReconnect(), that has repercussions I wasn't comfortable with.
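As described above, the only built-in way to bypass the RetryListener is to disable auto-reconnect; a hedged configuration sketch (not a recommendation, since it also drops automatic reconnects):

  // Sketch only: autoReconnect(false) makes the endpoint use at-most-once delivery
  // (no RetryListener), at the cost of losing automatic reconnects.
  val atMostOnceOptions = ClientOptions.builder()
    .autoReconnect(false)
    .requestQueueSize(10)
    .build()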

mp911de added this to the Lettuce 4.4.5 milestone Apr 23, 2018
mp911de added commits that referenced this issue Apr 23, 2018
CommandHandler now completes a write promise if the write completes without actually writing a message. Previously, no-op writes did not complete and rendered an invalid state (e.g. queue size was not decremented).

mp911de commented Apr 23, 2018

That's fixed now. Snapshots are available. Care to give 5.0.4.BUILD-SNAPSHOT a spin (lettuce-core-5.0.4.BUILD-20180423.123048-8)?
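If it helps, a hedged sbt snippet for pulling the snapshot into the Scala client above; the resolver URL is an assumption (lettuce snapshots were typically staged on the Sonatype OSS snapshots repository):

  // Assumed resolver; the coordinates and version come from this thread.
  resolvers += "sonatype-oss-snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
  libraryDependencies += "io.lettuce" % "lettuce-core" % "5.0.4.BUILD-SNAPSHOT"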

nivekastoreth (Author)

This looks great on my end, at least in my cobbled-together reproduction environment. I'll see if I can get this tested on production hardware later today, but I expect it to pass there as well.

Thanks!


mp911de commented Apr 23, 2018

Cool, thanks a lot. Closing this one as resolved. Feel free to reopen the issue if the problem persists.

mp911de closed this as completed Apr 23, 2018
nivekastoreth (Author)

Do you have an ETA on when you'll be releasing 5.0.4 with this fix?


mp911de commented Apr 25, 2018

I added release dates to the milestones.

nivekastoreth (Author)

Awesome, thanks.
