
[Java][FlightRpc] server zero-copy doesn't work if padding buffers are needed to serialise response #40039

Closed
tolmalev opened this issue Feb 11, 2024 · 3 comments · Fixed by #40042

Comments

@tolmalev (Contributor)

Describe the bug, including details regarding any error messages, version, and platform.

ArrowBufRetainingCompositeByteBuf is not supposed to copy data into new Netty buffers. To achieve this, it extends CompositeByteBuf and passes the existing Arrow buffers in as components.

However, the CompositeByteBuf constructor takes two parameters: the maximum number of components (maxNumComponents) and the list of component buffers. If the number of buffers exceeds maxNumComponents, it consolidates, merging some buffers into a newly allocated buffer.

ArrowBufRetainingCompositeByteBuf passes maxNumComponents = backingBuffers.size() + 1 rather than buffers.size() + 1. When padding is used, buffers contains additional byte buffers for the padding, so buffers.size() > backingBuffers.size() + 1.
As a result, zero-copy does not happen and CompositeByteBuf.consolidateIfNeeded() creates a new copy of the data.
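
To see the consolidation in isolation, here is a minimal standalone sketch against plain Netty (not Arrow code; the buffer contents are placeholders, and exact consolidation details can vary across Netty 4.1.x versions):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class ConsolidationDemo {
  public static void main(String[] args) {
    ByteBuf a = Unpooled.wrappedBuffer(new byte[] {1, 2});
    ByteBuf b = Unpooled.wrappedBuffer(new byte[] {3, 4});
    ByteBuf c = Unpooled.wrappedBuffer(new byte[] {5, 6}); // think: an extra padding buffer

    // maxNumComponents = 2, but three components are added: the composite
    // consolidates, i.e. allocates a fresh buffer and copies a, b and c into
    // it -- exactly the copy that zero-copy is supposed to avoid.
    CompositeByteBuf composite = Unpooled.compositeBuffer(2).addComponents(true, a, b, c);

    // After consolidation this typically prints 1 instead of 3.
    System.out.println("components: " + composite.numComponents());
    composite.release();
  }
}
```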

Fun fact: I found this while trying to debug why a simple client-server benchmark runs exactly 2x faster when the result has 199,999 rows than when it has 200,000 rows. The number of columns didn't matter, only the number of rows.

Fun fact 2: it is the zero-copy version that is slower, not the version that makes the additional memory copy. If I remove listener.setUseZeroCopy(true); from the producer implementation, both versions show the same results.
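
For context, the zero-copy toggle lives in the producer; a sketch of the shape (assuming a NoOpFlightProducer subclass, with the VectorSchemaRoot population elided):

```java
import org.apache.arrow.flight.FlightProducer.CallContext;
import org.apache.arrow.flight.FlightProducer.ServerStreamListener;
import org.apache.arrow.flight.NoOpFlightProducer;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.vector.VectorSchemaRoot;

class ZeroCopyProducer extends NoOpFlightProducer {
  private final VectorSchemaRoot root; // assumed to be populated elsewhere

  ZeroCopyProducer(VectorSchemaRoot root) {
    this.root = root;
  }

  @Override
  public void getStream(CallContext context, Ticket ticket, ServerStreamListener listener) {
    listener.setUseZeroCopy(true); // the line removed in the comparison above
    listener.start(root);
    listener.putNext();   // serialises the current contents of root
    listener.completed();
  }
}
```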

Component(s)

FlightRPC, Java

@tolmalev (Contributor Author)

In fact, the same problem is present on the non-zero-copy path as well.
The CompositeByteBuf created for the non-zero-copy path will have maxNumComponents < byteBufs.size() and will make an unnecessary memory copy.
Most vectors are represented by 2 byte buffers, and if padding is needed there will be roughly one additional padding byte buffer per vector. So after roughly 2/3 of byteBufs have been added as components, CompositeByteBuf will allocate a new buffer for all previously added components and make a memory copy; only the rest will be added zero-copy. A worked count follows the snippet below.

```java
final int maxNumComponents = Math.max(2, bufs.size() + 1);
final ImmutableList<ByteBuf> byteBufs = ImmutableList.<ByteBuf>builder()
    .add(initialBuf)
    .addAll(allBufs) // allBufs includes any padding buffers, so this list can be larger than bufs.size() + 1
    .build();
if (tryZeroCopyWrite) {
  bb = new ArrowBufRetainingCompositeByteBuf(maxNumComponents, byteBufs, bufs);
} else {
  // Don't retain the buffers in the non-zero-copy path since we're copying them
  bb = new CompositeByteBuf(UnpooledByteBufAllocator.DEFAULT, /* direct */ true, maxNumComponents, byteBufs);
}
```
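
To make the counting concrete (hypothetical numbers, using 30 fixed-width columns as in the benchmarks below, and the worst case of one padding buffer per vector):

```java
public class ComponentCountDemo {
  public static void main(String[] args) {
    int vectors = 30;
    int bufs = vectors * 2;                       // validity + data buffer per vector
    int paddingBufs = vectors;                    // worst case: one padding buffer per vector
    int maxNumComponents = Math.max(2, bufs + 1); // 61, as computed above
    int byteBufs = 1 + bufs + paddingBufs;        // 91: initialBuf + vector buffers + padding
    // 61 < 91: after ~2/3 of the components are added, the composite
    // consolidates everything added so far into one freshly copied buffer.
    System.out.println(maxNumComponents < byteBufs); // true
  }
}
```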

@tolmalev (Contributor Author) commented Feb 11, 2024

This maxNumComponents limit causes another issue further down the road, inside the netty codebase in DefaultHttp2ConnectionEncoder.

The encoder uses a CoalescingBufferQueue that merges small buffers to optimize writes, but there is a special optimisation when the buffer is a CompositeByteBuf; see CoalescingBufferQueue::compose:

```java
if (cumulation instanceof CompositeByteBuf) {
    CompositeByteBuf composite = (CompositeByteBuf) cumulation;
    composite.addComponent(true, next);
    return composite;
}
```

This code assumes that the CompositeByteBuf was created by this class with the correct maxNumComponents, so that it never needs to merge components. But Arrow passes its own CompositeByteBuf in, which breaks that assumption.

This call to composite.addComponent can cause an additional memory copy once the maxNumComponents limit has been reached, and that always happens if the CompositeByteBuf didn't merge buffers when it was created, because netty tries to coalesce the small buffers Arrow sends ahead of the message body.

So, to summarize: the current logic for setting maxNumComponents always leads to one additional memory copy.
It happens either immediately, when the CompositeByteBuf is created, or later, just before netty writes the message to the output stream; a sketch of the deferred case follows.
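
A minimal sketch of that second, deferred copy (plain Netty again, placeholder buffers): a composite created exactly at its component limit gets one more buffer added, which pushes it over maxNumComponents and forces a consolidating copy.

```java
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class LateCopyDemo {
  public static void main(String[] args) {
    // At the limit (2 components, maxNumComponents = 2): no copy yet.
    CompositeByteBuf composite = Unpooled.compositeBuffer(2)
        .addComponents(true,
            Unpooled.wrappedBuffer(new byte[] {1}),
            Unpooled.wrappedBuffer(new byte[] {2}));
    System.out.println("before: " + composite.numComponents()); // 2

    // One more component -- analogous to netty coalescing a small buffer in
    // front of the message body -- exceeds the limit and consolidates.
    composite.addComponent(true, Unpooled.wrappedBuffer(new byte[] {3}));
    System.out.println("after: " + composite.numComponents()); // typically 1

    composite.release();
  }
}
```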

I think the correct solution might be to set maxNumComponents to some high number instead of trying to predict its exact value.
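
Applied to the ArrowMessage snippet quoted earlier, that would look roughly like this (the direction #40042 ultimately takes):

```java
// Reusing the variables from the snippet above: make the limit effectively
// unbounded so Netty never silently consolidates (copies) components.
final int maxNumComponents = Integer.MAX_VALUE;
if (tryZeroCopyWrite) {
  bb = new ArrowBufRetainingCompositeByteBuf(maxNumComponents, byteBufs, bufs);
} else {
  // Don't retain the buffers in the non-zero-copy path since we're copying them
  bb = new CompositeByteBuf(UnpooledByteBufAllocator.DEFAULT, /* direct */ true, maxNumComponents, byteBufs);
}
```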

I believe it might be one of the Java performance problems described here: #13980

@tolmalev (Contributor Author) commented Feb 12, 2024

Traces showing where the problem happens:

  1. If padding is needed: the memory copy happens in a flight-server-default-executor-* thread.
  2. If padding is not needed: the memory copy happens in a grpc-nio-worker-ELG-* thread when netty writes data to the socket.

(Screenshots: profiler traces from 2024-02-12 showing both cases.)

lidavidm pushed a commit that referenced this issue Feb 12, 2024
…ry memory copies (#40042)

### Rationale for this change
Described in detail in the issue: #40039

Summary: the ArrowMessage class uses CompositeByteBuf to avoid memory copies, but its `maxNumComponents` is calculated incorrectly; as a result memory copies are still performed, which significantly hurts server performance.

### What changes are included in this PR?
Change `maxNumComponents` to `Integer.MAX_VALUE`, because we never want to silently merge large buffers into one.

Users can leave useZeroCopy=false (the default), in which case the library copies the data into a new buffer before handing it to Netty to write.

### Are these changes tested?

**TestPerf: 30% throughput boost**
```
BEFORE
Transferred 100000000 records totaling 3200000000 bytes at 877.812629 MiB/s. 28764164.218015 record/s. 7024.784185 batch/s.

AFTER
Transferred 100000000 records totaling 3200000000 bytes at 1145.333893 MiB/s. 37530301.022096 record/s. 9165.650116 batch/s.
```

Also tested with a simple client-server application, where I saw an even more significant performance boost when padding isn't needed.

Two tests with zero-copy set to true:
**50 batches, 30 columns (Int32), 199999 rows in each batch**
- before change: throughput ~25Gbit/s (memory copy happens in `grpc-nio-worker-ELG-*`)
- after change: throughput ~32Gbit/s (20% boost)

**50 batches, 30 columns (Int32), 200k rows in each batch**
- before change: throughput ~15Gbit/s (much slower than with 199999 rows because the memory copy happens in the `flight-server-default-executor-*` thread and blocks the server from writing the next batch)
- after change: throughput ~32Gbit/s (**115% boost**)
* Closes: #40039

Authored-by: Lev Tolmachev <[email protected]>
Signed-off-by: David Li <[email protected]>
lidavidm added this to the 16.0.0 milestone Feb 12, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ecessary memory copies (apache#40042)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…ecessary memory copies (apache#40042)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…ecessary memory copies (apache#40042)

kou pushed a commit to apache/arrow-java that referenced this issue Nov 25, 2024
…ry memory copies (#40042)
