Unnecessary copying of byteBuf in CommandHandler.decode() #725

Closed · gszpak opened this issue Mar 7, 2018 · 11 comments
Labels: type: feature

Comments

@gszpak (Contributor) commented Mar 7, 2018

I am using snapshots of 5.0.x.

I noticed that reading a reply to a large pipeline (hundreds of thousands of commands) takes a lot of time.

It is caused by calling discardReadBytes on every properly decoded command in CommandHandler.decode. More precisely, I am talking about this line: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L565

I wrote a simple benchmark to verify my hypothesis: https://gist.github.com/gszpak/c7c782df696922f3660352bfa22cedfd

Here are the results:

Number of queries: 100000 running time: 415 ms
Number of queries: 200000 running time: 692 ms
Number of queries: 300000 running time: 1046 ms
Number of queries: 400000 running time: 4811 ms
Number of queries: 500000 running time: 28003 ms

As you can see, querying 500k keys took 28s. Here is the output of JvmTop for this run:

  95.56% (    34.88s) io.netty.util.internal.PlatformDependent0.copyMemory()
   1.64% (     0.60s) io.lettuce.core.protocol.CommandHandler.decode()
   0.35% (     0.13s) io.lettuce.core.metrics.CommandLatencyId.hashCode()
   0.31% (     0.11s) ...lettuce.core.protocol.CommandArgs$ByteBufferArgument.()
   0.30% (     0.11s) io.netty.util.CharsetUtil.decoder()
   0.30% (     0.11s) io.netty.buffer.PoolThreadCache.allocate()
   0.29% (     0.11s) io.lettuce.core.protocol.RedisStateMachine.decode()
   0.27% (     0.10s) io.lettuce.core.protocol.AsyncCommand.completeResult()
   0.27% (     0.10s) io.lettuce.core.protocol.AsyncCommand.hashCode()
   0.23% (     0.08s) io.lettuce.core.protocol.DefaultEndpoint.drainCommands()
   0.19% (     0.07s) io.netty.channel.unix.FileDescriptor.readAddress()
   0.19% (     0.07s) io.netty.channel.epoll.Native.epollWait0()
   0.10% (     0.04s) io.netty.buffer.AbstractByteBuf.checkIndex()

Here is what happens:
When you send a very large pipeline, Redis sends a BULK reply. Such a reply is fully loaded into commandHandler.byteBuf before being decoded. Then Lettuce starts decoding commands one by one, and after every command byteBuf.discardReadBytes() is called. This shifts the buffer's readerIndex back to 0, which actually involves copying the part of the buffer that has not been read yet. Hence the io.netty.util.internal.PlatformDependent0.copyMemory() call taking 95% of the run time.
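
To make the cost concrete, here is a minimal standalone sketch (plain Netty buffer API, not Lettuce code; command count and reply size are arbitrary) that contrasts calling discardReadBytes() after every decoded reply with a single discard at the end, for the case where a whole batch of replies is already buffered:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class DiscardReadBytesCost {

        // Simulates decoding `commands` replies of `replySize` bytes each that are
        // already fully buffered, optionally compacting the buffer after every reply.
        static long simulate(int commands, int replySize, boolean discardPerCommand) {
            ByteBuf buffer = Unpooled.buffer(commands * replySize);
            buffer.writeZero(commands * replySize);   // pretend all replies have arrived

            long start = System.nanoTime();
            for (int i = 0; i < commands; i++) {
                buffer.skipBytes(replySize);          // "decode" one reply
                if (discardPerCommand) {
                    // shifts the unread remainder to index 0 -> copies it on every call
                    buffer.discardReadBytes();
                }
            }
            if (!discardPerCommand) {
                buffer.discardReadBytes();            // compact once at the end
            }
            long elapsedNanos = System.nanoTime() - start;
            buffer.release();
            return elapsedNanos;
        }

        public static void main(String[] args) {
            int commands = 20_000;
            int replySize = 64;
            System.out.printf("per-command discard: %d ms%n", simulate(commands, replySize, true) / 1_000_000);
            System.out.printf("single discard:      %d ms%n", simulate(commands, replySize, false) / 1_000_000);
        }
    }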

@mp911de (Collaborator) commented Mar 7, 2018

Thanks a lot for digging into that. We have a JMH test suite. Care to include a JMH benchmark along with your PR #726?

@gszpak (Contributor, Author) commented Mar 7, 2018

Sure, will add it soon.

@gszpak (Contributor, Author) commented Mar 7, 2018

I dug a bit deeper into this case and figured out that I had not fully understood the cause of the issue in the first place. More precisely, it is not true that there is one big reply which is fully loaded into a buffer before being decoded. Here is what really happens:

  1. CommandHandler writes the commands, calling addToStack - but the command is not actually added yet. It is added only after the promise associated with this write is completed! (lines 418 - 422 in CommandHandler.java; see the sketch after this list)
    So, the stack is still empty.
  2. Because the write is big enough - like 500k commands in my case - the first replies from Redis start to arrive BEFORE the write promise is completed. So the first channelRead events are processed while the stack is still empty. Therefore, this canDecode check fails: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L532 , the replies are not decoded and the buffer grows. In my case it accumulated ~5 MB before it started decoding the commands.
  3. At some point the write promise is completed, all sent commands are actually added to the stack, and the next channelRead event triggers decoding of the whole buffer - which involves a lot of discardReadBytes calls.
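
For illustration, a minimal sketch of the pattern described in step 1 (this is not the actual CommandHandler code; a plain Object stands in for RedisCommand): the command becomes visible to the decoder only once its write promise succeeds, while reply bytes may already be arriving.

    import java.util.ArrayDeque;

    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Simplified sketch of the pattern from step 1 (not the actual CommandHandler):
    // a command becomes visible to the decoder only after its write promise succeeds.
    class DeferredStackHandler extends ChannelDuplexHandler {

        // Stand-in for the real protocol stack of RedisCommand objects.
        private final ArrayDeque<Object> stack = new ArrayDeque<>();

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
            // The command is added to the stack only once the write fully succeeded ...
            promise.addListener(future -> {
                if (future.isSuccess()) {
                    stack.add(msg);
                }
            });

            // ... but the bytes can reach Redis (and replies can start arriving) before
            // that listener runs, so channelRead() may still see an empty stack.
            ctx.write(msg, promise);
        }
    }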

There is just one thing that bothers me: is it possible that all channelRead events are processed before the write promise is completed? If so, that would cause reads to freeze until the next channelRead event is received.

@mp911de (Collaborator) commented Mar 8, 2018

Not exactly sure about your analysis.

  1. True for non-void promises. The command is added to the protocol/decoding stack after it's fully and successfully written. This is to avoid commands in the stack that weren't written entirely to the transport and to guarantee the order of writes (that's the more important bit).
  2. Redis doesn't start processing/replying on partial receives for a single command. Do you have an example?
    Also 2: Responses are received in chunks: a channel read always reads a certain amount of bytes per read (that's configurable via the message size estimator; see the note after this list) but you will likely require multiple reads for a whole response. Received bytes are written into an intermediate buffer and decoding (state machine) attempts to make progress and emit responses on the fly.
  3. True for the first part. channelRead(…) attempts to decode/read from the remaining buffer until it can complete the command. At that time the buffer contains only a subsequence of response bytes that are not yet decoded. It might contain more, but the reader index is set to the position at which the next decoding will start.
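
As a side note on the second point: on the receive side, the number of bytes Netty pulls in per read is governed by the channel's RecvByteBufAllocator. A minimal, standalone illustration of that knob (plain Netty bootstrap, not Lettuce's client configuration; the fixed 64 KiB value is arbitrary):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.FixedRecvByteBufAllocator;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public class ReceiveChunkSizeExample {

        public static void main(String[] args) {
            // Each channelRead() delivers at most one allocator-sized chunk, so a large
            // pipelined response is spread over many reads and has to be accumulated.
            Bootstrap bootstrap = new Bootstrap()
                    .group(new NioEventLoopGroup(1))
                    .channel(NioSocketChannel.class)
                    // Fixed 64 KiB receive buffers; Netty's default is an adaptive allocator.
                    .option(ChannelOption.RCVBUF_ALLOCATOR, new FixedRecvByteBufAllocator(64 * 1024))
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // protocol handlers (a decoder such as CommandHandler) would go here
                        }
                    });
            // bootstrap.connect("localhost", 6379) would be issued where appropriate
        }
    }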

Netty stores outbound writes in ChannelOutboundBuffer along with the promise. As soon as all bytes are successfully written to the channel, the promise completes along with the stack write. Netty handles all I/O for a single channel on a single thread. In a truly simultaneously bi-directional system, we would have to add to the protocol stack at an earlier time to provide a context for responses.

A read and a write cannot happen concurrently (that was different in Netty 3). The bound channel thread can handle only a single task at a time; it's either reading or writing. Since Redis does not reply (should not?) to commands that are currently being sent, we're on the safe side. Yet it bothers me, because a freeze must not happen. Interesting point, however.

@gszpak (Contributor, Author) commented Mar 8, 2018

Just to clarify: I'm not sure I got the whole flow right (I'm not a Netty expert :)), but I'm sure the buffer grows because channelRead is called many times while the stack is empty (so the bytes comprising the subsequent responses cannot be decoded). If you would like to reproduce it, please try the following:

  1. Add a check before this line that the stack is empty: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L532
  2. Send a very large batch of commands (batch == manually flushed) - in my case it was 500k commands.

Ad. 2.: Sorry, forgot to mention I was talking about batch writes. And my point was, before the whole batch is successfully written, Redis can already reply to the commands from the beginning of the batch.
 

Here is another hypothesis (after glancing over DefaultPromise, which Netty's DefaultChannelPromise inherits from):

  • Let's say the write is executed on eventLoop e.
  • When setSuccess is executed on the write promise, e.execute(method) is called for every listener method of the promise (according to DefaultPromise.java).
  • From the docs of execute: "Executes the given command at some time in the future". So my guess is that a method passed to addListener is not executed immediately, but instead added to the task queue of e (see the sketch after this list).
  • After the writeBatch is finished, e starts the loop again.
  • First it reads replies (which arrived in the meantime) from the socket and processes them in channelRead.
  • Then it finally executes the promise's listener, which adds the commands to the stack.
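
A small standalone sketch (plain Netty, nothing Lettuce-specific) showing that execute() called from within the event loop thread queues the task instead of running it inline; whether pending socket reads are processed before such a queued task depends on the event loop's run loop, which is exactly what the hypothesis above is about:

    import io.netty.channel.DefaultEventLoopGroup;
    import io.netty.channel.EventLoop;

    public class EventLoopExecuteOrder {

        public static void main(String[] args) throws InterruptedException {
            DefaultEventLoopGroup group = new DefaultEventLoopGroup(1);
            EventLoop loop = group.next();

            loop.execute(() -> {
                System.out.println("outer task: start");
                // Submitted from the event loop thread itself: the task is only queued,
                // so it runs after the current task (and anything already pending) returns.
                loop.execute(() -> System.out.println("inner task: runs last"));
                System.out.println("outer task: end");
            });

            Thread.sleep(100);               // crude wait so the demo output is visible
            group.shutdownGracefully();
        }
    }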

WDYT, is this scenario possible?

@mp911de (Collaborator) commented Mar 8, 2018

Ad. 2.: Sorry, forgot to mention I was talking about batch writes. And my point was, before the whole batch is successfully written, Redis can already reply to the commands from the beginning of the batch.

You're absolutely right, I didn't consider this aspect, and Redis can indeed start answering while the last command isn't written yet. The last command in the write will trigger another response, which then initiates decoding of the whole batch.

I haven't looked into the eventloop implementation but your description sounds reasonable.

Right now I see the following possibilities for addressing that loophole:

  1. Call channelRead/decode after the promise is executed - solves the never-read case, but commands still pile up while the data might already be available to decode.
  2. Add commands to the stack upon write: Allows early decoding, but what if the batch write fails upon the last command? We no longer know which command failed, because at that time the encoded commands are already written to a buffer.
  3. Write each command via write(…) so every command gets its own promise (see the sketch after this list): Increases the number of overall promises, but we solve the issues that arise with 1) and 2). Also, the receive buffer never grows to a size that might negatively impact compaction.
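
A hedged sketch of what option 3 could look like (the class, the addToStack placeholder and the plain generic Command type are illustrative assumptions, not the real Lettuce write path): each command gets its own write and promise, so it can be put onto the stack as soon as its own bytes are written instead of waiting for the whole batch.

    import java.util.List;

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Hedged sketch of option 3: one write(...) and one promise per command, so each
    // command can be registered on the decoding stack as soon as its own bytes are
    // written. `Command` and addToStack(...) are stand-ins for the real Lettuce types.
    final class PerCommandWriteSketch<Command> {

        void writeBatch(ChannelHandlerContext ctx, List<Command> batch) {
            for (Command command : batch) {
                ChannelPromise promise = ctx.newPromise();
                promise.addListener(future -> {
                    if (future.isSuccess()) {
                        addToStack(command);      // early, per-command stack registration
                    }
                });
                ctx.write(command, promise);      // individual promise per command
            }
            ctx.flush();                          // still a single flush for the whole batch
        }

        private void addToStack(Command command) {
            // placeholder for pushing the command onto the protocol stack
        }
    }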

@mp911de (Collaborator) commented Mar 14, 2018

I ran the benchmark (command creation overhead etc. removed):

Calling discardReadBytes() after each command

Benchmark                                                  Mode  Cnt       Score       Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     135,495 ±     7,990  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     133,663 ±    10,557  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1355,452 ±    80,159  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   13268,616 ±  1690,193  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  206053,679 ± 12626,892  ns/op

Calling discardSomeReadBytes() after 16 read cycles outside the decode loop

Benchmark                                                  Mode  Cnt       Score       Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     131,636 ±     8,448  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     138,199 ±    12,895  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1518,567 ±  1727,857  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11708,630 ±  1036,565  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  121271,371 ± 13380,699  ns/op

Calling discardReadBytes() after 16 read cycles outside the decode loop

Benchmark                                                  Mode  Cnt       Score      Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     137,077 ±    6,424  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     136,734 ±    6,509  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1292,053 ±  130,484  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11882,836 ±  647,510  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  118570,635 ± 8408,583  ns/op

Calling discardReadBytes() after 1 read cycle outside the decode loop

Benchmark                                                  Mode  Cnt       Score      Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     134,220 ±   10,779  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     136,320 ±   10,366  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1270,895 ±   63,624  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11601,653 ±  300,898  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  119552,878 ± 9189,821  ns/op

@mp911de (Collaborator) commented Mar 14, 2018

Based on the benchmark results above I'd propose:

  1. Switch back to discardReadBytes() to retain the buffer release behavior. discardSomeReadBytes() allows the buffer to grow to several megabytes before it is cleared.
  2. Move discardReadBytes() out of the decoding loop (see the sketch below). This has the most impact. Whether bytes are discarded after 16 decode runs or after each run makes very little difference.
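
Roughly, the shape proposal 2 suggests for the decode path, as a hedged sketch (canDecode and decodeCommand are placeholders, not the actual CommandHandler API):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Hedged sketch of proposal 2 (not the actual CommandHandler): decode as many
    // complete replies as possible per channelRead() and compact the buffer once,
    // outside the decode loop. canDecode/decodeCommand are placeholders.
    class CompactOncePerReadHandler extends ChannelInboundHandlerAdapter {

        private final ByteBuf buffer = Unpooled.buffer();

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ByteBuf input = (ByteBuf) msg;
            try {
                buffer.writeBytes(input);
                while (canDecode(buffer)) {
                    decodeCommand(ctx, buffer);   // no discardReadBytes() per command any more
                }
            } finally {
                input.release();
                buffer.discardReadBytes();        // single compaction per read cycle
            }
        }

        private boolean canDecode(ByteBuf buffer) {
            return false;                         // placeholder: stack non-empty and a complete reply buffered
        }

        private void decodeCommand(ChannelHandlerContext ctx, ByteBuf buffer) {
            // placeholder: advance readerIndex while decoding one complete reply
        }
    }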

WDYT? I've polished up the code already, so no need to change the pull request.

@gszpak (Contributor, Author) commented Mar 14, 2018

I'm a little surprised that discardSomeReadBytes() can let the buffer grow so big - it should shrink the buffer whenever readerIndex exceeds half of its capacity.
Anyway, moving discardReadBytes() out of the loop sounds like a good idea.

So, should I leave #726 as is? I don't see any changes on the 5.0.x branch?

@mp911de (Collaborator) commented Mar 14, 2018

I haven't pushed the changes yet; I wanted to get your feedback first.

mp911de added this to the Lettuce 4.5.0 milestone Mar 15, 2018
mp911de pushed a commit that referenced this issue Mar 15, 2018
…dler's buffer #725

Add benchmarks for channelRead added to CommandHandler JMH test suite.

CommandHandlerBenchmark tests the whole flow now - both writes and reads.
Also, creation of commands was moved to benchmark methods -
after one usage they become not writable which causes the benchmark
to give incorrect results.

Original pull request: #726
mp911de added a commit that referenced this issue Mar 15, 2018
Move back to discardReadBytes() but discard bytes outside the decoding loop to not enforce cleanup upon each decoded command. Tweak JMH benchmarks to not include command creation overhead caused by IntStream and element collection. Tweak commands in test to never return done state so commands can be reused for all benchmark runs.

Original pull request: #726
@mp911de (Collaborator) commented Mar 15, 2018

That's fixed now. Thanks a lot!
