
[Java][FlightRpc] server zero-copy doesn't work if padding buffers are needed to serialise response #40039

Closed
tolmalev opened this issue Feb 11, 2024 · 3 comments · Fixed by #40042

Comments

@tolmalev (Contributor)

Describe the bug, including details regarding any error messages, version, and platform.

ArrowBufRetainingCompositeByteBuf is not supposed to copy data into new Netty buffers. To achieve this, it extends CompositeByteBuf and passes the existing Arrow buffers in as components.

However, the CompositeByteBuf constructor takes two parameters: the maximum number of components (maxNumComponents) and the list of component buffers. If the number of buffers exceeds maxNumComponents, it consolidates, merging some buffers into a newly allocated buffer.

ArrowBufRetainingCompositeByteBuf passes maxNumComponents = backingBuffers.size() + 1 rather than buffers.size() + 1. When padding is used, buffers contains additional byte buffers for the padding, so buffers.size() > backingBuffers.size() + 1.
As a result, zero-copy does not happen and CompositeByteBuf.consolidateIfNeeded() creates a new copy of the data.
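
To see the consolidation in isolation, here is a minimal standalone sketch against plain Netty (not Arrow code; the buffer contents are placeholders, and exact consolidation details can vary across Netty 4.1.x versions):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class ConsolidationDemo {
  public static void main(String[] args) {
    ByteBuf a = Unpooled.wrappedBuffer(new byte[] {1, 2});
    ByteBuf b = Unpooled.wrappedBuffer(new byte[] {3, 4});
    ByteBuf c = Unpooled.wrappedBuffer(new byte[] {5, 6}); // think: an extra padding buffer

    // maxNumComponents = 2, but three components are added: the composite
    // consolidates, i.e. allocates a fresh buffer and copies a, b and c into
    // it -- exactly the copy that zero-copy is supposed to avoid.
    CompositeByteBuf composite = Unpooled.compositeBuffer(2).addComponents(true, a, b, c);

    // After consolidation this typically prints 1 instead of 3.
    System.out.println("components: " + composite.numComponents());
    composite.release();
  }
}
```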

Fun fact: I found this while trying to debug why a simple client-server benchmark runs exactly 2x faster when the result has 199,999 rows than when it has 200,000 rows. The number of columns didn't matter, only the number of rows.

Fun fact 2: it is the zero-copy version that is slower, not the version that makes the additional memory copy. If I remove listener.setUseZeroCopy(true); from the producer implementation, both versions show the same results.
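
For context, the zero-copy toggle lives in the producer; a sketch of the shape (assuming a NoOpFlightProducer subclass, with the VectorSchemaRoot population elided):

```java
import org.apache.arrow.flight.FlightProducer.CallContext;
import org.apache.arrow.flight.FlightProducer.ServerStreamListener;
import org.apache.arrow.flight.NoOpFlightProducer;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.vector.VectorSchemaRoot;

class ZeroCopyProducer extends NoOpFlightProducer {
  private final VectorSchemaRoot root; // assumed to be populated elsewhere

  ZeroCopyProducer(VectorSchemaRoot root) {
    this.root = root;
  }

  @Override
  public void getStream(CallContext context, Ticket ticket, ServerStreamListener listener) {
    listener.setUseZeroCopy(true); // the line removed in the comparison above
    listener.start(root);
    listener.putNext();   // serialises the current contents of root
    listener.completed();
  }
}
```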

Component(s)

FlightRPC, Java

@tolmalev (Contributor Author)

In fact, the same problem is present on the non-zero-copy path as well.
The CompositeByteBuf created for the non-zero-copy path will have maxNumComponents < byteBufs.size() and will make an unnecessary memory copy.
Most vectors are represented by 2 byte buffers, and if padding is needed there will be roughly one additional padding byte buffer per vector. So after roughly 2/3 of byteBufs have been added as components, CompositeByteBuf will allocate a new buffer for all previously added components and make a memory copy; only the rest will be added zero-copy. A worked count follows the snippet below.

```java
final int maxNumComponents = Math.max(2, bufs.size() + 1);
final ImmutableList<ByteBuf> byteBufs = ImmutableList.<ByteBuf>builder()
    .add(initialBuf)
    .addAll(allBufs) // allBufs includes any padding buffers, so this list can be larger than bufs.size() + 1
    .build();
if (tryZeroCopyWrite) {
  bb = new ArrowBufRetainingCompositeByteBuf(maxNumComponents, byteBufs, bufs);
} else {
  // Don't retain the buffers in the non-zero-copy path since we're copying them
  bb = new CompositeByteBuf(UnpooledByteBufAllocator.DEFAULT, /* direct */ true, maxNumComponents, byteBufs);
}
```
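
To make the counting concrete (hypothetical numbers, using 30 fixed-width columns as in the benchmarks below, and the worst case of one padding buffer per vector):

```java
public class ComponentCountDemo {
  public static void main(String[] args) {
    int vectors = 30;
    int bufs = vectors * 2;                       // validity + data buffer per vector
    int paddingBufs = vectors;                    // worst case: one padding buffer per vector
    int maxNumComponents = Math.max(2, bufs + 1); // 61, as computed above
    int byteBufs = 1 + bufs + paddingBufs;        // 91: initialBuf + vector buffers + padding
    // 61 < 91: after ~2/3 of the components are added, the composite
    // consolidates everything added so far into one freshly copied buffer.
    System.out.println(maxNumComponents < byteBufs); // true
  }
}
```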

@tolmalev (Contributor Author) commented Feb 11, 2024

This maxNumComponents limit causes another issue further down the road, inside the netty codebase in DefaultHttp2ConnectionEncoder.

The encoder uses a CoalescingBufferQueue that merges small buffers to optimize writes, but there is a special optimisation when the buffer is a CompositeByteBuf; see CoalescingBufferQueue::compose:

```java
if (cumulation instanceof CompositeByteBuf) {
    CompositeByteBuf composite = (CompositeByteBuf) cumulation;
    composite.addComponent(true, next);
    return composite;
}
```

This code assumes that the CompositeByteBuf was created by this class with the correct maxNumComponents, so that it never needs to merge components. But Arrow passes its own CompositeByteBuf in, which breaks that assumption.

This call to composite.addComponent can cause an additional memory copy once the maxNumComponents limit has been reached, and that always happens if the CompositeByteBuf didn't merge buffers when it was created, because netty tries to coalesce the small buffers Arrow sends ahead of the message body.

So, to summarize: the current logic for setting maxNumComponents always leads to one additional memory copy.
It happens either immediately, when the CompositeByteBuf is created, or later, just before netty writes the message to the output stream; a sketch of the deferred case follows.
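
A minimal sketch of that second, deferred copy (plain Netty again, placeholder buffers): a composite created exactly at its component limit gets one more buffer added, which pushes it over maxNumComponents and forces a consolidating copy.

```java
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class LateCopyDemo {
  public static void main(String[] args) {
    // At the limit (2 components, maxNumComponents = 2): no copy yet.
    CompositeByteBuf composite = Unpooled.compositeBuffer(2)
        .addComponents(true,
            Unpooled.wrappedBuffer(new byte[] {1}),
            Unpooled.wrappedBuffer(new byte[] {2}));
    System.out.println("before: " + composite.numComponents()); // 2

    // One more component -- analogous to netty coalescing a small buffer in
    // front of the message body -- exceeds the limit and consolidates.
    composite.addComponent(true, Unpooled.wrappedBuffer(new byte[] {3}));
    System.out.println("after: " + composite.numComponents()); // typically 1

    composite.release();
  }
}
```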

I think the correct solution might be to set maxNumComponents to some high number instead of trying to predict its exact value.
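
Applied to the ArrowMessage snippet quoted earlier, that would look roughly like this (the direction #40042 ultimately takes):

```java
// Reusing the variables from the snippet above: make the limit effectively
// unbounded so Netty never silently consolidates (copies) components.
final int maxNumComponents = Integer.MAX_VALUE;
if (tryZeroCopyWrite) {
  bb = new ArrowBufRetainingCompositeByteBuf(maxNumComponents, byteBufs, bufs);
} else {
  // Don't retain the buffers in the non-zero-copy path since we're copying them
  bb = new CompositeByteBuf(UnpooledByteBufAllocator.DEFAULT, /* direct */ true, maxNumComponents, byteBufs);
}
```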

I believe it might be one of the Java performance problems described here: #13980

@tolmalev (Contributor Author) commented Feb 12, 2024

Traces showing where the problem happens:

  1. If padding is needed: the memory copy happens in a flight-server-default-executor-* thread.
  2. If padding is not needed: the memory copy happens in a grpc-nio-worker-ELG-* thread when netty writes data to the socket.

(Screenshots: profiler traces from 2024-02-12 showing both cases.)

lidavidm pushed a commit that referenced this issue Feb 12, 2024
…ry memory copies (#40042)

### Rationale for this change
Described in detail in the issue: #40039

Summary: the ArrowMessage class uses CompositeByteBuf to avoid memory copies, but its `maxNumComponents` is calculated incorrectly; as a result memory copies are still performed, which significantly hurts server performance.

### What changes are included in this PR?
Change `maxNumComponents` to `Integer.MAX_VALUE`, because we never want to silently merge large buffers into one.

Users can leave useZeroCopy=false (the default), in which case the library copies the data into a new buffer before handing it to Netty to write.

### Are these changes tested?

**TestPerf: 30% throughput boost**
```
BEFORE
Transferred 100000000 records totaling 3200000000 bytes at 877.812629 MiB/s. 28764164.218015 record/s. 7024.784185 batch/s.

AFTER
Transferred 100000000 records totaling 3200000000 bytes at 1145.333893 MiB/s. 37530301.022096 record/s. 9165.650116 batch/s.
```

Also tested with a simple client-server application, where I saw an even more significant performance boost when padding isn't needed.

Two tests with zero-copy set to true:
**50 batches, 30 columns (Int32), 199999 rows in each batch**
- before change: throughput ~25Gbit/s (memory copy happens in `grpc-nio-worker-ELG-*`)
- after change: throughput ~32Gbit/s (20% boost)

**50 batches, 30 columns (Int32), 200k rows in each batch**
- before change: throughput ~15Gbit/s (much slower than with 199999 rows because the memory copy happens in the `flight-server-default-executor-*` thread and blocks the server from writing the next batch)
- after change: throughput ~32Gbit/s (**115% boost**)
* Closes: #40039

Authored-by: Lev Tolmachev <[email protected]>
Signed-off-by: David Li <[email protected]>
lidavidm added this to the 16.0.0 milestone Feb 12, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ecessary memory copies (apache#40042)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…ecessary memory copies (apache#40042)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…ecessary memory copies (apache#40042)

kou pushed a commit to apache/arrow-java that referenced this issue Nov 25, 2024
…ry memory copies (#40042)
