Unnecessary copying of byteBuf in CommandHandler.decode() #725

Closed · gszpak opened this issue Mar 7, 2018 · 11 comments
Labels: type: feature

Comments

@gszpak (Contributor) commented Mar 7, 2018

I am using snapshots of 5.0.x.

I noticed that reading a reply to a large pipeline (hundreds of thousands of commands) takes a lot of time.

It is caused by calling discardReadBytes on every properly decoded command in CommandHandler.decode. More precisely, I am talking about this line: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L565

I wrote a simple benchmark to verify my hypothesis: https://gist.github.com/gszpak/c7c782df696922f3660352bfa22cedfd

Here are the results:

Number of queries: 100000 running time: 415 ms
Number of queries: 200000 running time: 692 ms
Number of queries: 300000 running time: 1046 ms
Number of queries: 400000 running time: 4811 ms
Number of queries: 500000 running time: 28003 ms

As you can see, querying 500k keys took 28s. Here is the output of JvmTop for this run:

  95.56% (    34.88s) io.netty.util.internal.PlatformDependent0.copyMemory()
   1.64% (     0.60s) io.lettuce.core.protocol.CommandHandler.decode()
   0.35% (     0.13s) io.lettuce.core.metrics.CommandLatencyId.hashCode()
   0.31% (     0.11s) ...lettuce.core.protocol.CommandArgs$ByteBufferArgument.()
   0.30% (     0.11s) io.netty.util.CharsetUtil.decoder()
   0.30% (     0.11s) io.netty.buffer.PoolThreadCache.allocate()
   0.29% (     0.11s) io.lettuce.core.protocol.RedisStateMachine.decode()
   0.27% (     0.10s) io.lettuce.core.protocol.AsyncCommand.completeResult()
   0.27% (     0.10s) io.lettuce.core.protocol.AsyncCommand.hashCode()
   0.23% (     0.08s) io.lettuce.core.protocol.DefaultEndpoint.drainCommands()
   0.19% (     0.07s) io.netty.channel.unix.FileDescriptor.readAddress()
   0.19% (     0.07s) io.netty.channel.epoll.Native.epollWait0()
   0.10% (     0.04s) io.netty.buffer.AbstractByteBuf.checkIndex()

Here is what happens:
When you send a very large pipeline, Redis sends a BULK reply. Such a reply is fully loaded into commandHandler.byteBuf before being decoded. Then Lettuce starts decoding commands one by one, and after every command byteBuf.discardReadBytes() is called. This shifts the buffer's readerIndex back to 0, which actually involves copying the part of the buffer that has not been read yet. Hence the io.netty.util.internal.PlatformDependent0.copyMemory() call taking 95% of the run time.
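
To make the cost concrete, here is a minimal standalone sketch (plain Netty buffer API, not Lettuce code; command count and reply size are arbitrary) that contrasts calling discardReadBytes() after every decoded reply with a single discard at the end, for the case where a whole batch of replies is already buffered:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class DiscardReadBytesCost {

        // Simulates decoding `commands` replies of `replySize` bytes each that are
        // already fully buffered, optionally compacting the buffer after every reply.
        static long simulate(int commands, int replySize, boolean discardPerCommand) {
            ByteBuf buffer = Unpooled.buffer(commands * replySize);
            buffer.writeZero(commands * replySize);   // pretend all replies have arrived

            long start = System.nanoTime();
            for (int i = 0; i < commands; i++) {
                buffer.skipBytes(replySize);          // "decode" one reply
                if (discardPerCommand) {
                    // shifts the unread remainder to index 0 -> copies it on every call
                    buffer.discardReadBytes();
                }
            }
            if (!discardPerCommand) {
                buffer.discardReadBytes();            // compact once at the end
            }
            long elapsedNanos = System.nanoTime() - start;
            buffer.release();
            return elapsedNanos;
        }

        public static void main(String[] args) {
            int commands = 20_000;
            int replySize = 64;
            System.out.printf("per-command discard: %d ms%n", simulate(commands, replySize, true) / 1_000_000);
            System.out.printf("single discard:      %d ms%n", simulate(commands, replySize, false) / 1_000_000);
        }
    }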

@mp911de (Collaborator) commented Mar 7, 2018

Thanks a lot for digging into that. We have a JMH test suite. Care to include a JMH benchmark along with your PR #726?

@gszpak (Contributor, Author) commented Mar 7, 2018

Sure, will add it soon.

@gszpak (Contributor, Author) commented Mar 7, 2018

I dug a bit deeper into this case and figured out that I had not fully understood the cause of the issue in the first place. More precisely, it is not true that there is one big reply which is fully loaded into a buffer before being decoded. Here is what really happens:

  1. CommandHandler writes the commands, calling addToStack - but the command is not actually added yet. It is added only after the promise associated with this write is completed! (lines 418 - 422 in CommandHandler.java; see the sketch after this list)
    So, the stack is still empty.
  2. Because the write is big enough - like 500k commands in my case - the first replies from Redis start to arrive BEFORE the write promise is completed. So the first channelRead events are processed while the stack is still empty. Therefore, this canDecode check fails: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L532 , the replies are not decoded and the buffer grows. In my case it accumulated ~5 MB before it started decoding the commands.
  3. At some point the write promise is completed, all sent commands are actually added to the stack, and the next channelRead event triggers decoding of the whole buffer - which involves a lot of discardReadBytes calls.
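
For illustration, a minimal sketch of the pattern described in step 1 (this is not the actual CommandHandler code; a plain Object stands in for RedisCommand): the command becomes visible to the decoder only once its write promise succeeds, while reply bytes may already be arriving.

    import java.util.ArrayDeque;

    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Simplified sketch of the pattern from step 1 (not the actual CommandHandler):
    // a command becomes visible to the decoder only after its write promise succeeds.
    class DeferredStackHandler extends ChannelDuplexHandler {

        // Stand-in for the real protocol stack of RedisCommand objects.
        private final ArrayDeque<Object> stack = new ArrayDeque<>();

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
            // The command is added to the stack only once the write fully succeeded ...
            promise.addListener(future -> {
                if (future.isSuccess()) {
                    stack.add(msg);
                }
            });

            // ... but the bytes can reach Redis (and replies can start arriving) before
            // that listener runs, so channelRead() may still see an empty stack.
            ctx.write(msg, promise);
        }
    }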

There is just one thing that bothers me: is it possible that all channelRead events are processed before the write promise is completed? If so, that would cause reads to freeze until the next channelRead event is received.

@mp911de (Collaborator) commented Mar 8, 2018

Not exactly sure about your analysis.

  1. True for non-void promises. The command is added to the protocol/decoding stack after it's fully and successfully written. This is to avoid commands in the stack that weren't written entirely to the transport and to guarantee the order of writes (that's the more important bit).
  2. Redis doesn't start processing/replying on partial receives for a single command. Do you have an example?
    Also 2: Responses are received in chunks: a channel read always reads a certain amount of bytes per read (that's configurable via the message size estimator; see the note after this list) but you will likely require multiple reads for a whole response. Received bytes are written into an intermediate buffer and decoding (state machine) attempts to make progress and emit responses on the fly.
  3. True for the first part. channelRead(…) attempts to decode/read from the remaining buffer until it can complete the command. At that time the buffer contains only a subsequence of response bytes that are not yet decoded. It might contain more, but the reader index is set to the position at which the next decoding will start.
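
As a side note on the second point: on the receive side, the number of bytes Netty pulls in per read is governed by the channel's RecvByteBufAllocator. A minimal, standalone illustration of that knob (plain Netty bootstrap, not Lettuce's client configuration; the fixed 64 KiB value is arbitrary):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.FixedRecvByteBufAllocator;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public class ReceiveChunkSizeExample {

        public static void main(String[] args) {
            // Each channelRead() delivers at most one allocator-sized chunk, so a large
            // pipelined response is spread over many reads and has to be accumulated.
            Bootstrap bootstrap = new Bootstrap()
                    .group(new NioEventLoopGroup(1))
                    .channel(NioSocketChannel.class)
                    // Fixed 64 KiB receive buffers; Netty's default is an adaptive allocator.
                    .option(ChannelOption.RCVBUF_ALLOCATOR, new FixedRecvByteBufAllocator(64 * 1024))
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // protocol handlers (a decoder such as CommandHandler) would go here
                        }
                    });
            // bootstrap.connect("localhost", 6379) would be issued where appropriate
        }
    }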

Netty stores outbound writes in ChannelOutboundBuffer along with the promise. As soon as all bytes are successfully written to the channel, the promise completes along with the stack write. Netty handles all I/O for a single channel on a single thread. In a truly simultaneously bi-directional system, we would have to add to the protocol stack at an earlier time to provide a context for responses.

A read and a write cannot happen concurrently (that was different in Netty 3). The bound channel thread can handle only a single task at a time; it's either reading or writing. Since Redis does not reply (should not?) to commands that are currently being sent, we're on the safe side. Yet it bothers me, because a freeze must not happen. Interesting point, however.

@gszpak (Contributor, Author) commented Mar 8, 2018

Just to clarify: I'm not sure I got the whole flow right (I'm not a Netty expert :)), but I'm sure the buffer grows because channelRead is called many times while the stack is empty (so the bytes comprising the subsequent responses cannot be decoded). If you would like to reproduce it, please try the following:

  1. Add a check before this line that the stack is empty: https://github.com/lettuce-io/lettuce-core/blob/e3641b465d7ad2e7fd5be09006f1087e5df9e919/src/main/java/io/lettuce/core/protocol/CommandHandler.java#L532
  2. Send a very large batch of commands (batch == manually flushed) - in my case it was 500k commands.

Ad. 2.: Sorry, forgot to mention I was talking about batch writes. And my point was, before the whole batch is successfully written, Redis can already reply to the commands from the beginning of the batch.
 

Here is another hypothesis (after glancing over DefaultPromise, which Netty's DefaultChannelPromise inherits from):

  • Let's say the write is executed on eventLoop e.
  • When setSuccess is executed on the write promise, e.execute(method) is called for every listener method of the promise (according to DefaultPromise.java).
  • From the docs of execute: "Executes the given command at some time in the future". So my guess is that a method passed to addListener is not executed immediately, but instead added to the task queue of e (see the sketch after this list).
  • After the writeBatch is finished, e starts the loop again.
  • First it reads replies (which arrived in the meantime) from the socket and processes them in channelRead.
  • Then it finally executes the promise's listener, which adds the commands to the stack.
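
A small standalone sketch (plain Netty, nothing Lettuce-specific) showing that execute() called from within the event loop thread queues the task instead of running it inline; whether pending socket reads are processed before such a queued task depends on the event loop's run loop, which is exactly what the hypothesis above is about:

    import io.netty.channel.DefaultEventLoopGroup;
    import io.netty.channel.EventLoop;

    public class EventLoopExecuteOrder {

        public static void main(String[] args) throws InterruptedException {
            DefaultEventLoopGroup group = new DefaultEventLoopGroup(1);
            EventLoop loop = group.next();

            loop.execute(() -> {
                System.out.println("outer task: start");
                // Submitted from the event loop thread itself: the task is only queued,
                // so it runs after the current task (and anything already pending) returns.
                loop.execute(() -> System.out.println("inner task: runs last"));
                System.out.println("outer task: end");
            });

            Thread.sleep(100);               // crude wait so the demo output is visible
            group.shutdownGracefully();
        }
    }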

WDYT, is this scenario possible?

@mp911de (Collaborator) commented Mar 8, 2018

Ad. 2.: Sorry, forgot to mention I was talking about batch writes. And my point was, before the whole batch is successfully written, Redis can already reply to the commands from the beginning of the batch.

You're absolutely right, I didn't consider this aspect, and Redis can indeed start answering while the last command isn't written yet. The last command in the write will trigger another response, which then initiates decoding of the whole batch.

I haven't looked into the eventloop implementation but your description sounds reasonable.

Right now I see the following possibilities for addressing that loophole:

  1. Call channelRead/decode after the promise is executed - solves the never-read case, but commands still pile up while the data might already be available to decode.
  2. Add commands to the stack upon write: Allows early decoding, but what if the batch write fails upon the last command? We no longer know which command failed, because at that time the encoded commands are already written to a buffer.
  3. Write each command via write(…) so every command gets its own promise (see the sketch after this list): Increases the number of overall promises, but we solve the issues that arise with 1) and 2). Also, the receive buffer never grows to a size that might negatively impact compaction.
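
A hedged sketch of what option 3 could look like (the class, the addToStack placeholder and the plain generic Command type are illustrative assumptions, not the real Lettuce write path): each command gets its own write and promise, so it can be put onto the stack as soon as its own bytes are written instead of waiting for the whole batch.

    import java.util.List;

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Hedged sketch of option 3: one write(...) and one promise per command, so each
    // command can be registered on the decoding stack as soon as its own bytes are
    // written. `Command` and addToStack(...) are stand-ins for the real Lettuce types.
    final class PerCommandWriteSketch<Command> {

        void writeBatch(ChannelHandlerContext ctx, List<Command> batch) {
            for (Command command : batch) {
                ChannelPromise promise = ctx.newPromise();
                promise.addListener(future -> {
                    if (future.isSuccess()) {
                        addToStack(command);      // early, per-command stack registration
                    }
                });
                ctx.write(command, promise);      // individual promise per command
            }
            ctx.flush();                          // still a single flush for the whole batch
        }

        private void addToStack(Command command) {
            // placeholder for pushing the command onto the protocol stack
        }
    }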

@mp911de (Collaborator) commented Mar 14, 2018

I ran the benchmark (command creation overhead etc. removed):

Calling discardReadBytes() after each command

Benchmark                                                  Mode  Cnt       Score       Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     135,495 ±     7,990  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     133,663 ±    10,557  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1355,452 ±    80,159  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   13268,616 ±  1690,193  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  206053,679 ± 12626,892  ns/op

Calling discardSomeReadBytes() after 16 read cycles outside the decode loop

Benchmark                                                  Mode  Cnt       Score       Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     131,636 ±     8,448  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     138,199 ±    12,895  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1518,567 ±  1727,857  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11708,630 ±  1036,565  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  121271,371 ± 13380,699  ns/op

Calling discardReadBytes() after 16 read cycles outside the decode loop

Benchmark                                                  Mode  Cnt       Score      Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     137,077 ±    6,424  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     136,734 ±    6,509  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1292,053 ±  130,484  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11882,836 ±  647,510  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  118570,635 ± 8408,583  ns/op

Calling discardReadBytes() after 1 read cycle outside the decode loop

Benchmark                                                  Mode  Cnt       Score      Error  Units
CommandHandlerBenchmark.measureNettyWriteAndRead           avgt    5     134,220 ±   10,779  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1     avgt    5     136,320 ±   10,366  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch10    avgt    5    1270,895 ±   63,624  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch100   avgt    5   11601,653 ±  300,898  ns/op
CommandHandlerBenchmark.measureNettyWriteAndReadBatch1000  avgt    5  119552,878 ± 9189,821  ns/op

@mp911de (Collaborator) commented Mar 14, 2018

Based on the benchmark results above I'd propose:

  1. Switch back to discardReadBytes() to retain the buffer release behavior. discardSomeReadBytes() allows the buffer to grow to several megabytes before it is cleared.
  2. Move discardReadBytes() out of the decoding loop (see the sketch below). This has the most impact. Whether bytes are discarded after 16 decode runs or after each run makes very little difference.
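
Roughly, the shape proposal 2 suggests for the decode path, as a hedged sketch (canDecode and decodeCommand are placeholders, not the actual CommandHandler API):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Hedged sketch of proposal 2 (not the actual CommandHandler): decode as many
    // complete replies as possible per channelRead() and compact the buffer once,
    // outside the decode loop. canDecode/decodeCommand are placeholders.
    class CompactOncePerReadHandler extends ChannelInboundHandlerAdapter {

        private final ByteBuf buffer = Unpooled.buffer();

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ByteBuf input = (ByteBuf) msg;
            try {
                buffer.writeBytes(input);
                while (canDecode(buffer)) {
                    decodeCommand(ctx, buffer);   // no discardReadBytes() per command any more
                }
            } finally {
                input.release();
                buffer.discardReadBytes();        // single compaction per read cycle
            }
        }

        private boolean canDecode(ByteBuf buffer) {
            return false;                         // placeholder: stack non-empty and a complete reply buffered
        }

        private void decodeCommand(ChannelHandlerContext ctx, ByteBuf buffer) {
            // placeholder: advance readerIndex while decoding one complete reply
        }
    }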

WDYT? I've polished up the code already, so no need to change the pull request.

@gszpak (Contributor, Author) commented Mar 14, 2018

I'm a little surprised that discardSomeReadBytes() can let the buffer grow so big - it should shrink the buffer whenever readerIndex exceeds half of its capacity.
Anyway, moving discardReadBytes() out of the loop sounds like a good idea.

So, should I leave #726 as is? I don't see any changes on the 5.0.x branch?

@mp911de (Collaborator) commented Mar 14, 2018

I haven't pushed the changes yet; I wanted to get your feedback first.

mp911de added this to the Lettuce 4.5.0 milestone Mar 15, 2018
mp911de pushed a commit that referenced this issue Mar 15, 2018
…dler's buffer #725

Add benchmarks for channelRead added to CommandHandler JMH test suite.

CommandHandlerBenchmark tests the whole flow now - both writes and reads.
Also, creation of commands was moved to benchmark methods -
after one usage they become not writable which causes the benchmark
to give incorrect results.

Original pull request: #726
mp911de added a commit that referenced this issue Mar 15, 2018
Move back to discardReadBytes() but discard bytes outside the decoding loop to not enforce cleanup upon each decoded command. Tweak JMH benchmarks to not include command creation overhead caused by IntStream and element collection. Tweak commands in test to never return done state so commands can be reused for all benchmark runs.

Original pull request: #726
@mp911de (Collaborator) commented Mar 15, 2018

That's fixed now. Thanks a lot!
