[C++][Java] Arrow Flight C++/Java performance comparison #13980

Open
Lagrang opened this issue Aug 26, 2022 · 20 comments

@Lagrang

Lagrang commented Aug 26, 2022

Hi
I'm trying to understand the difference in performance between Arrow Flight for C++ and Java. I ran the benchmarks for both languages and got a ~20x throughput difference.
Benchmark parameters:

  • threads/streams: 16/16
  • batch size: 8192
  • records per stream: 10_000_000
  • network: localhost

Hardware:
CPU: 2x Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz
RAM: 512GB

C++ (arrow/flight/flight_benchmark.cc)

Testing method: DoGet
Using spawned TCP server
Server running with pid 382642
Server host: localhost
Server port: 31337
Server host: localhost
Server port: 31337
Number of perf runs: 10
Number of concurrent gets/puts: 16
Batch size: 262080
Batches read: 195360
Bytes read: 51200000000
Nanos: 2453011703
Speed: 19905.4 MB/s
Throughput: 79640.9 batches/s
Latency mean: 174 us
Latency quantile=0.5: 95 us
Latency quantile=0.95: 546 us
Latency quantile=0.99: 1920 us
Latency max: 147827 us

Java (flight-core/src/test/java/org/apache/arrow/flight/perf/TestPerf.java)

Transferred 160000000 records totaling 5120000000 bytes at 641.102193 MiB/s. 21007636.651566 record/s. 2565.032435 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 582.299625 MiB/s. 19080794.111082 record/s. 2329.764961 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 848.258903 MiB/s. 27795747.747999 record/s. 3393.860800 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 896.877756 MiB/s. 29388890.295311 record/s. 3588.383505 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1292.493243 MiB/s. 42352418.587486 record/s. 5171.230310 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1003.844027 MiB/s. 32893961.061297 record/s. 4016.352646 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1065.951924 MiB/s. 34929112.654969 record/s. 4264.844655 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1463.970739 MiB/s. 47971393.190316 record/s. 5857.307109 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1229.642808 MiB/s. 40292935.524028 record/s. 4919.767427 batch/s.
Transferred 160000000 records totaling 5120000000 bytes at 1297.904113 MiB/s. 42529721.971291 record/s. 5192.879053 batch/s.
Summary: 
Average throughput: 1032.234533 MiB/s, standard deviation: 278.068050 MiB/s

Is this throughput difference expected? Can the Java Flight server be tuned somehow to get closer to the C++ version?

@lidavidm
Member

The benchmarks are nowhere near comparable.

The C++ version spawns 16 threads, as you can see in the output. The Java version spawns one thread per Endpoint, and defaults to two endpoints. The Java version uses smaller batches (4095 * 4 * 8 ~= 128 KiB) whereas the C++ one uses ~256 KiB.

The Java version also appears to suffer from JVM warmup. The Java output also isn't clear, but it appears to be summing the two threads together.

It does seem the per-thread performance is not quite as good, but there are so many differences between the two benchmarks that I wouldn't take this as anything remotely definitive.
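
As a rough check of the warmup point, the ten Java throughput figures posted above can be split into an early and a late half. The sketch below is only an illustration: the class and method names are hypothetical and it is not part of TestPerf. It also recomputes the summary line; the reported standard deviation is reproduced by the population formula (dividing by n).

```java
import java.util.Arrays;

final class WarmupCheck {
    // Per-run throughput in MiB/s, copied from the Java output above.
    static final double[] RUNS = {
        641.102193, 582.299625, 848.258903, 896.877756, 1292.493243,
        1003.844027, 1065.951924, 1463.970739, 1229.642808, 1297.904113
    };

    // Mean of xs[from..to), NaN for an empty range.
    static double mean(double[] xs, int from, int to) {
        return Arrays.stream(xs, from, to).average().orElse(Double.NaN);
    }

    // Population standard deviation (divides by n, not n - 1).
    static double populationStdDev(double[] xs) {
        double m = mean(xs, 0, xs.length);
        double sumSq = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(sumSq / xs.length);
    }

    public static void main(String[] args) {
        System.out.printf("mean=%.3f stddev=%.3f%n",
            mean(RUNS, 0, RUNS.length), populationStdDev(RUNS));
        System.out.printf("first half=%.1f second half=%.1f%n",
            mean(RUNS, 0, 5), mean(RUNS, 5, 10));
    }
}
```

The first five runs average roughly 852 MiB/s and the last five roughly 1212 MiB/s, which is consistent with a warmup effect but does not prove one.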

@lidavidm
Member

lidavidm commented Aug 26, 2022

Ah, I missed this initially, but did you modify the benchmarks already?

@Lagrang
Author

Lagrang commented Aug 26, 2022

@lidavidm yes, I adjusted the Java version to use the same parameters as the C++ one. They should behave more or less similarly :)
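
The figures already posted are consistent with matching batch sizes. A minimal sketch, using only numbers from the first Java output line and the C++ output (the class and method names are hypothetical): dividing records/s by batch/s gives the records per batch, and dividing total bytes by total records gives the bytes per record.

```java
final class BatchSizeCheck {
    // Records per batch, recovered from two rates measured over the same run.
    static long recordsPerBatch(double recordsPerSecond, double batchesPerSecond) {
        return Math.round(recordsPerSecond / batchesPerSecond);
    }

    // Bytes per record, from the totals in a "Transferred ..." line.
    static long bytesPerRecord(long totalBytes, long totalRecords) {
        return totalBytes / totalRecords;
    }

    public static void main(String[] args) {
        long records = recordsPerBatch(21007636.651566, 2565.032435);
        long bytes = bytesPerRecord(5_120_000_000L, 160_000_000L);
        System.out.println(records + " records/batch x " + bytes
            + " bytes/record = " + (records * bytes) + " bytes/batch");
        // The unmodified Java default quoted earlier in the thread: 4095 * 4 * 8.
        System.out.println("default Java batch: " + (4095L * 4 * 8) + " bytes");
    }
}
```

This gives 8190 records of 32 bytes each, i.e. 262080 bytes per batch, the same value the C++ benchmark prints as "Batch size". The unmodified default works out to 131040 bytes.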

@lidavidm
Member

Can you share all modifications?

@Lagrang
Author

Lagrang commented Aug 26, 2022

Sure, here you can find the changes in TestPerf class.

@lidavidm
Member

I would try removing the backpressure wait and seeing how that helps:

(this will require a lot of memory, you may have to manually add sleep() calls to space out the batches a bit)

The issue with gRPC Java is that it uses a fixed, small buffer for backpressure, so effectively every write will trigger backpressure and artificially throttle the producer, regardless of actual network conditions. This has been known for years but upstream is not interested in fixing it: grpc/grpc-java#5433

Unfortunately without it, Java never applies any backpressure and instead you tend to just OOM. I believe C++ does the 'right' thing and automatically applies backpressure (by blocking the send call) above some threshold which I do not recall, but which I think is actually based on network conditions.

@lidavidm
Member

Or you can build gRPC yourself with that threshold modified as a test (e.g. setting it to some megabytes).
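
Why a small fixed threshold makes "effectively every write" trigger backpressure can be sketched with a toy model. This is not gRPC code: the 32 KiB threshold is an assumed value used only for illustration (the thread does not state the real one), 4 MiB stands in for "some megabytes", and the class and method names are hypothetical. The model counts how many batches a producer can write before the buffered bytes reach the threshold, with nothing draining in between.

```java
final class ThresholdSketch {
    // Number of writes accepted before the stream would report "not ready",
    // assuming readiness means buffered bytes < thresholdBytes and no drain.
    static long writesBeforeNotReady(long thresholdBytes, long batchBytes) {
        long buffered = 0;
        long writes = 0;
        while (buffered < thresholdBytes) {
            buffered += batchBytes;
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        long batchBytes = 262080L; // batch size from the benchmark output
        System.out.println("32 KiB threshold (assumed): "
            + writesBeforeNotReady(32L * 1024, batchBytes) + " write(s)");
        System.out.println("4 MiB threshold (assumed): "
            + writesBeforeNotReady(4L * 1024 * 1024, batchBytes) + " write(s)");
    }
}
```

Under these assumptions a single batch already exceeds the small threshold, so the producer has to wait for a readiness signal after every write, while a threshold of a few megabytes lets it queue several batches first.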

@Lagrang
Author

Lagrang commented Aug 26, 2022

> the issue with gRPC Java is that it has a fixed, small buffer it uses for backpressure, so effectively every write will trigger backpressure and artificially throttle the producer, regardless of actual network conditions. This has been known for years but the upstream is not interested in fixing it: grpc/grpc-java#5433

Yes, we also found the same issue and tried running without backpressure, but saw no change in performance (neither in the benchmark nor in real code).

@lidavidm
Member

In that case I will have to build and profile it, unless you have tried this already too?

@Lagrang
Author

Lagrang commented Aug 26, 2022

We already tried profiling Flight in our production service. After applying some fixes to ArrowMessage to prevent network buffer copying, the profile showed only "pure" network traces (Netty calls to sendmsg). I will provide a flamegraph later.

@lidavidm
Member

Ah, the final thing I would suggest (unless you've already done this?), at least for this benchmark, is to create a new flight client for each stream (since otherwise all the clients share a single TCP connection). This is what C++ does.

But that doesn't seem like it would apply for the production service.

What are the fixes? I thought we had optimizations for that (but it has been a while since we last looked at it and something may have been overlooked/changed)

@Lagrang
Author

Lagrang commented Aug 26, 2022

> Ah, the final thing I would suggest (unless you've already done this?), at least for this benchmark, is to create a new flight client for each stream (since otherwise all the clients share a single TCP connection). This is what C++ does.

Oh, yes, I forgot to try it, will do

> What are the fixes? I thought we had optimizations for that (but it has been a while since we last looked at it and something may have been overlooked/changed)

You can take a look here.

@Lagrang
Author

Lagrang commented Aug 26, 2022

> Ah, the final thing I would suggest (unless you've already done this?), at least for this benchmark, is to create a new flight client for each stream (since otherwise all the clients share a single TCP connection). This is what C++ does.

I got this result after this change:
Average throughput: 9299.281056 MiB/s, standard deviation: 3461.357041 MiB/s
This is a lot better.
We tried the same change in production before (somehow I missed it in the benchmark) and also saw a big improvement in throughput. But, sadly, it's still not even close to C++...
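
To put the remaining gap in numbers, the C++ speed can be recomputed from the "Bytes read" and "Nanos" lines of its output and compared with the two Java averages. A minimal sketch (the class and method names are hypothetical); dividing bytes by 1024 * 1024 reproduces the printed 19905.4 figure, so the C++ "MB/s" value appears to be in the same unit as the Java MiB/s values.

```java
final class ThroughputRatio {
    // Throughput in MiB/s from a byte count and an elapsed time in nanoseconds.
    static double mibPerSecond(long bytes, long nanos) {
        return bytes / (1024.0 * 1024.0) / (nanos / 1e9);
    }

    public static void main(String[] args) {
        double cpp = mibPerSecond(51_200_000_000L, 2_453_011_703L);
        double javaSharedClient = 1032.234533;   // original Java average, MiB/s
        double javaClientPerStream = 9299.281056; // after one client per stream

        System.out.printf("C++: %.1f MiB/s%n", cpp);
        System.out.printf("C++ / Java (shared client): %.1fx%n", cpp / javaSharedClient);
        System.out.printf("C++ / Java (client per stream): %.1fx%n", cpp / javaClientPerStream);
    }
}
```

On these figures the gap shrinks from roughly 19x to roughly 2.1x after switching to one client per stream, with the caveat that the Java standard deviation is large relative to its mean.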

@lidavidm
Member

Thanks for checking.

I filed https://issues.apache.org/jira/browse/ARROW-17537 and will look at these things.

I'm not sure what is left. At this point it seems like an issue with the gRPC implementations, but there may be other factors (e.g. Netty can optionally use JNI code and epoll instead of relying on JVM facilities, but this may not necessarily be an improvement).

@Lagrang
Author

Lagrang commented Aug 26, 2022

> e.g. Netty can optionally use JNI code and epoll instead of relying on JVM facilities, but this may not necessarily actually be an improvement

Yes, we also tried epoll and didn't observe any change.

@lidavidm
Member

Ok, thanks.

Sorry, this is still on my backlog to take a deeper look at, but I probably won't have time before Arrow 10.0.0 (mid-late October) :/

@Lagrang
Author

Lagrang commented Sep 27, 2022

Ok, I understand👍

@tolmalev
Contributor

Hey @Lagrang!
If you're still interested in this issue it would be great to see the same comparison results with a recent fix from this PR: #40042
In some of the tests I saw a ~110% boost in performance, so there is a chance that Java is now much closer to C++.

@kou changed the title from "Arrow Flight C++/Java performance comparison" to "[C++][Java] Arrow Flight C++/Java performance comparison" on Feb 13, 2024
@Lagrang
Author

Lagrang commented Feb 13, 2024

Hello @tolmalev !
Your PR looks pretty much the same as the changes I made to Arrow in the scope of this issue. I described a similar change here. The fix eliminated unnecessary copying, but it didn't give me a performance boost in my case.
Anyway, it's great to see the same fix upstream now 👍

@jduo
Member

jduo commented Apr 9, 2024

> grpc/grpc-java#5433

Worth mentioning that grpc/grpc-java#5433 is fixed now, and the changes have been integrated into Flight in #41051
