
DefaultEndpoint.QUEUE_SIZE becomes out of sync, preventing command queueing #764

Closed
nivekastoreth opened this issue Apr 20, 2018 · 11 comments
Labels
type: bug A general bug
Milestone
Lettuce 4.4.5


nivekastoreth commented Apr 20, 2018

Observed Version(s): 5.0.3.RELEASE

Introduced in Version(s): 4.4.0.Final

  • Still exists in 5.0.3.RELEASE

Still visible in master? Unknown but likely

  • I've not tested the 5.x branch at all

Expected:
When the request queue size limit is hit, submitted commands are terminated early. Once the request queue drains, new commands are accepted again.

Actual:
I'm still in the process of determining exactly what is happening here. What I'm observing is that when a Redis instance is performing a task that blocks its foreground thread for a substantial amount of time (seconds up to minutes; details on how to do this below), the DefaultEndpoint can become wedged in a state where QUEUE_SIZE is stuck at a non-zero value. If that value is greater than clientOptions.getRequestQueueSize() - commands, validateWrite will never again accept any writes submitted to it.
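To make the symptom concrete, here is a minimal hedged sketch of what the wedged state looks like from the caller's side, using the async handle from the client setup shown further down (the one-second timeout is an arbitrary choice for illustration):

  import java.util.concurrent.TimeUnit

  // Sketch only: once QUEUE_SIZE is stuck above the limit, every new command is
  // rejected up front and fails with a RedisException, even though Redis is idle again.
  val ping = async.ping()
  try ping.get(1, TimeUnit.SECONDS)
  catch { case e: Exception => println(s"still wedged: ${e.getMessage}") }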


To Reproduce
Using the setup shown below, connect to Redis and verify that commands are processed correctly. Then submit a Redis SAVE command, and while that save is running (the large set created below is what makes the save slow; there are other ways to replicate this), submit more than requestQueueSize requests:

# Kick off a long-running SAVE in the background, then flood the endpoint with more
# than requestQueueSize requests while the server is still blocked.
sudo docker exec helix-redis redis-cli save &
for i in {1..100}; do cat payload.json | curl -v -H 'Content-Type: application/json' -d @- http://localhost:8000/redis-endpoint 2>&1; done

Performing the above, and then waiting for the save command to complete, results in the log file:

After the save operation has completed, submitting a single followup request results in the log file:


Speculation:
I believe the dequeue call is never made due, at least in part, to the following error, but I haven't yet tracked down the exact flow that leads to this state:

2018-04-20 14:43:53 UTC [lettuce-nioEventLoop-10-3] WARN  i.n.c.AbstractChannelHandlerContext - Failed to mark a promise as failure because it has failed already: DefaultChannelPromise@4a01a122(failure: io.lettuce.core.RedisException: Internal stack size exceeded: 10. Commands are not accepted until the stack size drops.), unnotified cause: io.lettuce.core.RedisException: Internal stack size exceeded: 10. Commands are not accepted until the stack size drops.
	at io.lettuce.core.protocol.CommandHandler.validateWrite(CommandHandler.java:441)
	at io.lettuce.core.protocol.CommandHandler.addToStack(CommandHandler.java:410)
	at io.lettuce.core.protocol.CommandHandler.writeSingleCommand(CommandHandler.java:367)
	at io.lettuce.core.protocol.CommandHandler.write(CommandHandler.java:334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
	at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute$$$capture(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

Setup:
Redis:

-- sudo docker exec redis redis-cli --eval redis-gen.lua , "large_set" 50000000 0

-- Parameters: set name, number of members to add, and a batch offset (all optional).
local set_name    = #ARGV >= 1 and ARGV[1] or 'large_set'
local entry_count = #ARGV >= 2 and tonumber(ARGV[2]) or 50000000
local entry_start = #ARGV >= 3 and tonumber(ARGV[3]) or 0

local memory = {}

-- Members are synthetic UUID-like strings, e.g. '00000000-0000-0000-0000-100000000000'.
local base = '00000000-0000-0000-0000-'
local min = 100000000000

local start = entry_count * entry_start
local stop = start + entry_count
for user = start, stop, 1 do
  redis.call("sadd", set_name, base .. (min + user))
end

-- Return the resulting cardinality so the caller can verify the set size.
memory[set_name] = redis.call("scard", set_name)
return cjson.encode(memory)

Client:

  import io.lettuce.core.{ClientOptions, RedisClient, RedisFuture, ScriptOutputType}
  import io.lettuce.core.ClientOptions.DisconnectedBehavior
  import io.lettuce.core.api.async.RedisAsyncCommands
  import io.lettuce.core.resource.DefaultClientResources

  val resources = DefaultClientResources.builder()
    .ioThreadPoolSize(11)
    .computationThreadPoolSize(11)
    .build()
  val client = RedisClient.create(resources)
  val options = ClientOptions.builder()
    .autoReconnect(true)
    .requestQueueSize(10) // low number here to make it easier to reproduce
    .pingBeforeActivateConnection(false)
    .cancelCommandsOnReconnectFailure(true)
    .disconnectedBehavior(DisconnectedBehavior.REJECT_COMMANDS)
    .build()
  client.setOptions(options)
  val connection = client.connect(uri) // uri is defined elsewhere in the application

  // later on, submitting requests via (digest, keys and values come from the caller):
  val async: RedisAsyncCommands[String, String] = connection.async()
  val future: RedisFuture[String] = async.evalsha[String](digest.value, ScriptOutputType.VALUE, keys, values: _*)
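A hedged usage sketch of how a queue-overflow rejection surfaces to this caller; the callback wiring below is an assumption, not part of the original application (RedisFuture extends CompletionStage, so whenComplete is available):

  import java.util.function.BiConsumer

  // The early termination itself is the expected behaviour; the bug is that it keeps
  // happening after the queue has drained again.
  future.whenComplete(new BiConsumer[String, Throwable] {
    override def accept(value: String, error: Throwable): Unit =
      if (error != null) println(s"command terminated early: ${error.getMessage}")
      else println(s"script result: $value")
  })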

P.S. This println also seems removable:


mp911de commented Apr 20, 2018

Thanks a lot for your extensive bug report. I fixed the println issue via #765. I need to have a look at what happens with the queue size.

nivekastoreth (Author)

Just a quick note that the save mechanism is just one way of replicating the behaviour, but we first noticed this in production when attempting to delete or unlink a very large Set while under load. I don't believe that there is any real difference between the two use cases from the point of view of the lettuce client, but figured it might be worth mentioning.


mp911de commented Apr 23, 2018

This issue arises when n commands are written to the transport (where n is the queue limit) and an additional n commands are written while the previous ones have not yet completed.

This fits the description where a long-running action causes commands to pile up.
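A hedged client-side sketch of that window, against the setup from the original report (the key names and the count of 30 are arbitrary; requestQueueSize is 10 as configured there):

  // With the server blocked (e.g. by SAVE), fire well over 2 * requestQueueSize
  // commands so that the overflow commands take the rejection-plus-retry path.
  val probes = (1 to 30).map(i => async.get(s"probe-$i"))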


mp911de commented Apr 23, 2018

After further analysis, this bug is a consequence of the retry mechanism combined with the queue-limit protection. Writing a command to a full queue completes the command exceptionally and throws an exception. The ChannelPromise notifies RetryListener, which attempts to retry the command.

The command gets back into the event loop. CommandHandler.write(…) sees that the command is already done, so it is not written again. This no-op causes the resulting future to never complete, as there is nothing to write/flush.

There are two things to do:

  1. Do not resubmit commands that are done.
  2. Complete no-op writes ourselves.
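A rough sketch of those two remediation points follows. It is illustrative only and not the actual lettuce patch: the helper names (FixSketch, onWriteFailed, writeCommand, scheduleRetry) are invented for the sketch, and only RedisCommand.isDone() and the netty write/promise calls are real API.

  import io.lettuce.core.protocol.RedisCommand
  import io.netty.channel.{ChannelHandlerContext, ChannelPromise}

  object FixSketch {

    // 1. Retry path: never resubmit a command that is already done. A command that
    //    was completed exceptionally by the queue-limit check must not re-enter the
    //    pipeline, otherwise it turns into a write that can never complete.
    def onWriteFailed(command: RedisCommand[_, _, _], cause: Throwable): Unit =
      if (!command.isDone) scheduleRetry(command)

    // 2. Write path: when the command is already done, the write is a no-op, so the
    //    handler has to complete the promise itself; otherwise the listeners that
    //    decrement QUEUE_SIZE never fire and the endpoint stays wedged.
    def writeCommand(ctx: ChannelHandlerContext, command: RedisCommand[_, _, _],
                     promise: ChannelPromise): Unit =
      if (command.isDone) promise.trySuccess()
      else ctx.write(command, promise)

    private def scheduleRetry(command: RedisCommand[_, _, _]): Unit = () // placeholder
  }

The actual change landed in CommandHandler, as the commits referenced below describe; the sketch only mirrors the shape of the two checks.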

nivekastoreth (Author)

This lines up with what I had been seeing. I had thought about not using RetryListener (i.e. Reliability.AT_MOST_ONCE), but since the reliability mode is controlled by clientOptions.isAutoReconnect(), that has repercussions I wasn't comfortable with.
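As described above, the only built-in way to bypass the RetryListener is to disable auto-reconnect; a hedged configuration sketch (not a recommendation, since it also drops automatic reconnects):

  // Sketch only: autoReconnect(false) makes the endpoint use at-most-once delivery
  // (no RetryListener), at the cost of losing automatic reconnects.
  val atMostOnceOptions = ClientOptions.builder()
    .autoReconnect(false)
    .requestQueueSize(10)
    .build()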

mp911de added this to the Lettuce 4.4.5 milestone Apr 23, 2018
mp911de added commits that referenced this issue Apr 23, 2018
CommandHandler now completes a write promise if the write completes without actually writing a message. Previously, no-op writes did not complete and rendered an invalid state (e.g. queue size was not decremented).

mp911de commented Apr 23, 2018

That's fixed now. Snapshots are available. Care to give 5.0.4.BUILD-SNAPSHOT a spin (lettuce-core-5.0.4.BUILD-20180423.123048-8)?
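If it helps, a hedged sbt snippet for pulling the snapshot into the Scala client above; the resolver URL is an assumption (lettuce snapshots were typically staged on the Sonatype OSS snapshots repository):

  // Assumed resolver; the coordinates and version come from this thread.
  resolvers += "sonatype-oss-snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
  libraryDependencies += "io.lettuce" % "lettuce-core" % "5.0.4.BUILD-SNAPSHOT"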

nivekastoreth (Author)

This looks great on my end, at least in my cobbled-together reproduction environment. I'll see if I can get this tested on production hardware later today, but I expect it to pass there as well.

Thanks!


mp911de commented Apr 23, 2018

Cool, thanks a lot. Closing this one as resolved. Feel free to reopen the issue if the problem persists.

mp911de closed this as completed Apr 23, 2018
nivekastoreth (Author)

Do you have an ETA on when you'll be releasing 5.0.4 with this fix?


mp911de commented Apr 25, 2018

I added release dates to the milestones.

nivekastoreth (Author)

Awesome, thanks.
