Quarkus/Vertx/Netty HTTP performance #1
Similarly, I see that the REST endpoints in Quarkus are not marked as non-blocking. More information about which endpoints to mark as non-blocking is here: https://quarkus.io/blog/resteasy-reactive-smart-dispatch/ - things have changed "recently".
Thanks for pointing out the problems. I will try to work on them this weekend. The performance tests use JMH and therefore test the web server on a loopback interface - while this isn't the most representative use case for a web server, we thought it was good enough for a "high level" perspective of its performance.
Regarding the second question, I'd like to read more about the issues. Can you share any resources with me so that I can delve into it more? One thing that might differ: I'm aware that other applications are more likely to consume resources on localhost than on a dedicated server, but if the tests are repeated multiple times - or better, verified on a different machine - I tend to take that into account.
Thanks, answers inline
You can decide how many threads/cores go to the client and how many to the server(s) - e.g. via taskset - which keeps the client and the server(s) from competing for the same resources. I'm not a great fan of using JMH for macrobenchmarks anyway, and I would suggest using Hyperfoil instead (https://hyperfoil.io/), which avoids the infamous Coordinated Omission measurement issue and can easily provide comparisons/metrics/graphs; it also has a wrapper that emulates `wrk`/`wrk2`, which makes it easier to use. NOTE: using an external load generator tool doesn't mean that you don't have to keep the client and server(s) on separate resources.
Agreed, but if you check with Linux commands you'll find that the loopback interface is effectively "no-IRQ": it is not affected by the soft/hard IRQs (interrupts) associated with a real network interface. This means that if an application is optimized for throughput but doesn't try to size packets in a way that has "mechanical sympathy" with a real network interface (which would mean creating more interrupts on receive, for example, consuming CPU as well), you won't see it, and it will give the false impression that one stack is "better" than another, while they have been developed to benefit different use cases that are not captured by the loopback interface.
you mean https://redhatperf.github.io/post/type-check-scalability-issue/ ?
Thanks @franz1981, the comment was super enlightening! By the way, I have read about ...
Actually it is not. And it's not a delayed 1st April joke 🤣
Well, "backporting" was the wrong term; I should have used "porting". I refer to:
I've had some time to work on it over the weekend and here is the battle plan:
For all runs I'll create a JSON file with exact statistics. Then we can analyse the variations in them.
1. baseline test output

The result is recorded in a Gist: https://gist.github.com/novoj/90b090cf3adc9116f2a99d7e78fa4e4a
Yep, please consider another factor: in the baseline, as in the other version, both Quarkus and Netty are performing blocking operations...
2. upgraded versions test output

The result is recorded in a Gist: https://gist.github.com/novoj/4adcef0c9deef3964d191844b36f700b

Bumped versions:
No bigger changes to the original code were made (this is going to be part of step 3).
Just to help understand whatever results are here so far:
I have similar feelings about it. We're newbies to Netty integration, but the implementation seems pretty straightforward to me: https://github.com/FgForrest/HttpServerEvaluationTest/blob/version-from-2023-03/server/src/main/java/one/edee/oss/http_server_evaulation_test/server/netty/GraphQLHandler.java

As I said - the next step would be optimizing the code. I'll try to run the Netty tests separately and look at the profiler output. At least we have a first observation - it's not easy for a newbie to write a well-performing GQL service on Netty :)

I see two bigger memory allocations in the code:

```java
byte[] body = new byte[request.content().readableBytes()];
```

and

```java
final ByteBuf bodyBuffer = ctx.alloc().buffer(body.length);
```

Is there a better approach to it? In my colleague's implementation there is:

```java
ServerBootstrap b = new ServerBootstrap();
b.option(ChannelOption.SO_BACKLOG, 1024);
b.option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
```

So I guess the pooled allocator is already configured there. I will continue later this week. Thank you for your suggestions!
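To the allocation question - a minimal sketch of one way to avoid the two copies (this is my assumption, not the project's actual handler; `executeQuery` is only a placeholder for the existing GraphQL evaluation): read the body straight from the request's ByteBuf and encode the response straight into a buffer obtained from the channel's pooled allocator.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.util.CharsetUtil;

ByteBuf handle(ChannelHandlerContext ctx, FullHttpRequest request) {
    // decode the request body directly from the (pooled) ByteBuf - no intermediate byte[]
    String body = request.content().toString(CharsetUtil.UTF_8);
    // placeholder for the existing GraphQL evaluation
    String json = executeQuery(body);
    // encode the response as UTF-8 directly into a buffer from the channel's allocator
    return ByteBufUtil.writeUtf8(ctx.alloc(), json);
}
```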
The real problem is that GraphQL should never block the Netty event loop, or it will cause massive scalability degradation: is the GraphQL manager performing supposedly blocking operations?
I think so - based on this blog post: https://www.graphql-java.com/blog/threads - let's see what the switch to non-blocking brings. On the other hand - there is no I/O, just rendering the constant string to the output. I would've thought that switching to async here would only bring another overhead. |
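For reference, a minimal sketch of what "switching to async" means with plain graphql-java (an assumption on my part - the project's wrapper may differ); note that with purely synchronous data fetchers the work still runs on the calling thread, which matches the suspicion that it only adds overhead here.

```java
import graphql.ExecutionInput;
import graphql.ExecutionResult;
import graphql.GraphQL;
import java.util.concurrent.CompletableFuture;

CompletableFuture<ExecutionResult> executeAsync(GraphQL graphQL, String query) {
    ExecutionInput input = ExecutionInput.newExecutionInput().query(query).build();
    // returns immediately; completes when all (possibly asynchronous) data fetchers finish -
    // with synchronous fetchers the evaluation still happens on the calling thread
    return graphQL.executeAsync(input);
}
```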
Maybe, or maybe not: without profiling it's difficult to say, unless they have a very clear explanation of what's going on under the hood.
I agree. I have been surprised many times already :) |
One qq:
and
these suggest that the benchmark is creating a whole new connection for each single request; that's not what HTTP 1.1 does by default (so-called persistent connections) - and it doesn't seem to happen for the other servers; why? Both removing the `Connection: close` HTTP header and no longer closing the connection after the response is sent (more than) double the performance of the Netty server; and I already see other low-hanging fruit there.
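For illustration, a minimal sketch (assumed - not the project's actual handler) of a keep-alive-friendly write path: only send `Connection: close` and close the channel when the client actually asked for it.

```java
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.handler.codec.http.FullHttpResponse;
import io.netty.handler.codec.http.HttpHeaderNames;
import io.netty.handler.codec.http.HttpHeaderValues;
import io.netty.handler.codec.http.HttpUtil;

void writeResponse(ChannelHandlerContext ctx, FullHttpRequest request, FullHttpResponse response) {
    HttpUtil.setContentLength(response, response.content().readableBytes());
    if (HttpUtil.isKeepAlive(request)) {
        // HTTP/1.1 default: keep the connection open so it can be reused for the next request
        ctx.writeAndFlush(response);
    } else {
        response.headers().set(HttpHeaderNames.CONNECTION, HttpHeaderValues.CLOSE);
        ctx.writeAndFlush(response).addListener(ChannelFutureListener.CLOSE);
    }
}
```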
Sorry for the delay - I do this in my spare time and the week has been quite busy. I've incorporated your recommendations and the results have changed quite a lot. The following numbers relate to the current version of the code, where the GraphQL evaluation is moved to an async thread (I hope I did it right) and the connection closing was eliminated.
These results represent a version where only the logic was moved to the async thread, but the connection was still closed after each request:
It looks like the connection closing was the culprit. I have to discuss with my colleague why he closed the connection in this implementation. I've also run a profiler on the JMH run, with these results: the full JFR file can be downloaded from here: https://drive.google.com/file/d/1X2VOhCz8mhetPQHiebwwhAd3olj8ZPbE/view?usp=sharing The GraphQL implementation takes 54%, write and flush 15%, NioEventLoop 21%.
This commit updates the Vert.x implementation - I couldn't find a way to add the
So it turns out that Results with the
Results with the
Sent a comment at c888d2e#r107958651
This applies to Quarkus, not Vert.x; sorry if I didn't make it clear. For Quarkus, I suggest 2 things to try:
And see which performs better. Re GraphQL: it is an allocating machine, see the allocation flamegraph: most of the allocations for Netty come from there; it doesn't really matter what Netty does or doesn't do, because it isn't the major bottleneck there (after solving the connection-close bug).
After reading the code called by GraphQL twice, I strongly discourage using the async version: it is not doing anything asynchronously, probably due to some method not being implemented :"(
I'm experimenting with it just now - this code:

```java
import io.smallrye.common.annotation.NonBlocking;

@POST @NonBlocking
public GraphQLResponse graphQL(GraphQLRequest request) {
    return graphQLManager.execute(request);
}
```

doesn't bring any performance gains, and neither does this:

```java
@POST @NonBlocking
public Uni<GraphQLResponse<Object>> graphQL(GraphQLRequest request) {
    return Uni.createFrom().completionStage(graphQLManager.executeAsync(request));
}
```

Unfortunately it didn't bring any performance gains. Regarding the Netty note - so the correct way is to create my own thread pool and delegate the work there? I tried it and the numbers were slightly worse, so I removed it again and aligned the thread count with the number of processors, which led to a performance increase. I'll re-run the entire suite with the current code version and link the results here.
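As a side note on the thread-pool question above - the Netty-idiomatic way to move blocking work off the event loop is to register the handler with its own executor group when building the pipeline, rather than managing a hand-rolled pool inside the handler. A sketch (assumptions: `GraphQLHandler` has a no-arg constructor, and the pipeline details only roughly match the project):

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

public class GraphQLChannelInitializer extends ChannelInitializer<SocketChannel> {

    // sized to the CPU count, as discussed above
    private final DefaultEventExecutorGroup blockingGroup =
            new DefaultEventExecutorGroup(Runtime.getRuntime().availableProcessors());

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline().addLast(new HttpServerCodec());
        ch.pipeline().addLast(new HttpObjectAggregator(1 << 20));
        // GraphQLHandler callbacks now run on blockingGroup threads instead of the event loop
        ch.pipeline().addLast(blockingGroup, new GraphQLHandler());
    }
}
```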
Re Quarkus, I believe it's better NOT to use the async call on GraphQL (see #1 (comment)) and to just mark the endpoint:

```java
@POST @NonBlocking
public GraphQLResponse graphQL(GraphQLRequest request) {
    return graphQLManager.execute(request);
}
```

TBH I haven't profiled Quarkus yet, but the same idea applies: use the very latest versions of everything, if possible.
At #1 (comment) I basically said not to rely on the async API of GraphQL (which is not doing what you think it's doing) and to stick with the synchronous one: in short, just do what the original Netty code was doing (minus the connection closing), and you will get a decent starting point from a performance point of view.
The results are here: https://gist.github.com/novoj/cef56bd940a015b4cfb1ad389d2b6705

Comparison with the original version:

The only implementation that is not faster is the Quarkus one. There must be something wrong there, but I haven't been able to find out what today.
Another thing I have noted (and it applies to everyone: Netty/Vert.x/Quarkus) is that GraphQL internally creates many UUIDs which sadly share a Java monitor; meaning that wrapping N instances of GraphQL in a FastThreadLocal (or just saving it as a handler field, by peeking it from a FastThreadLocal) would avoid that heavy contention and further improve performance.
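A minimal sketch of the FastThreadLocal idea (assuming graphql-java's `GraphQL` engine and Netty's `FastThreadLocal`; `GraphQLHolder` is a hypothetical name): one engine instance per event-loop thread, so the contended monitor inside a shared instance is never hit from multiple threads.

```java
import graphql.GraphQL;
import graphql.schema.GraphQLSchema;
import io.netty.util.concurrent.FastThreadLocal;

public final class GraphQLHolder {

    private final FastThreadLocal<GraphQL> perThread;

    public GraphQLHolder(GraphQLSchema schema) {
        this.perThread = new FastThreadLocal<>() {
            @Override
            protected GraphQL initialValue() {
                // each (event-loop) thread builds and keeps its own engine instance
                return GraphQL.newGraphQL(schema).build();
            }
        };
    }

    public GraphQL get() {
        // peek the per-thread instance, e.g. from the Netty handler or a handler field
        return perThread.get();
    }
}
```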
Splitting the |
I could provide a UNIX script that makes use of Hyperfoil too, if you do that :) That's a bit more complex because it has to trigger profiling, but you would need to pass the application to be run as a separate script argument.
That would be great! Ok, I'm going to prepare it during this week.
@franz1981 the servers currently open different ports so that they can run in parallel. Should I leave it, or do you need them to reuse the same port because you'll run them sequentially one after another? What is the expected procedure for stopping them (now they wait for system input or Ctrl+C)?
It seems the
Now, what happens with your settings re taskset? Is Quarkus using nearly all cores?
The utilization distribution seems OK - JMH doesn't use up all of its resources:
And for the Netty server:
How many cores have you dedicated to Quarkus? If you see 100% it means it is using a single core, which is not good (unless you left it just a single core). That's why it's better to have a client capable of producing high load with just a few threads (we use Hyperfoil for this reason, but wrk and others are fine).
@franz1981 all servers are now executed with taskset. The interesting fact is that not even Netty fully utilizes the CPUs - but it's for sure more than Quarkus does. I think I have something wrong with Quarkus, but the application.properties seems pretty minimalistic, so it should be using the defaults.
The problem is still present re taskset:
You are making the servers and the client share the same physical cores. E.g. if the client uses logical core 6 and the server uses logical core 0, they are running on the same physical core 0!! If you disable hyper-threading it becomes clearer.
Ok, I've disabled the hyperthreading - now I have only 6 CPUs listed:
When I execute the Netty server (which performs best) without any taskset:
When I start the server with taskset:
Using napkin math - the Netty server had 4 out of 6 CPUs assigned - so it should have done proportionally less. I observe similar behaviour for the Quarkus JVM (without taskset):
And when executed with taskset, I get the following:
I've discovered this post: https://stackoverflow.com/questions/26219500/provide-more-than-one-processor-with-taskset which suggests that I might be using taskset incorrectly.
But the results for Quarkus look like I'm capped at 1 CPU in all scenarios - the utilization never goes over 100% for the Quarkus server.
The problem IMO is that the client is not good enough to push enough load while capped to run on a single core. How many threads/cores is the client using? How many connections does it create? Does it use a non-blocking or blocking mechanism? I hope to have time at some point today to run this myself.
I had some time and:
Sorry, I went out with my daughter (it's actually nice here in the Czech Republic :) ) ... I had the following post half-written up:

*The benchmark is here: https://github.com/FgForrest/HttpServerEvaluationTest/blob/isolated-modules/performance_tests/src/main/java/one/edee/oss/http_server_evaulation_test/jmh/ServersBenchmark.java It uses the HttpClient defaults - but that's something we wanted for the client. We're developing a server-side API and we will not have the client side under our control. So we want to record and compare the web server performance with the default settings of the client (unless there are some really dumb presets, which we don't expect in this case). Another counter-argument is - when the client side is exactly the same for measuring both the Netty performance and the Quarkus one, it should have a similar impact and both servers should produce similar numbers.*

The next step I planned was to attach a profiler and generate the execution flamegraph - but you were faster. There is also another endpoint, so I've run the same performance tests against it - it should be our baseline, with no GraphQL involved there. The results were interesting.

Netty:
Flamegraph: So the results are more or less the same.
Quarkus:
But the Quarkus results went through the roof! I've double-checked that the endpoint is really being called by placing a breakpoint, and it is. Flamegraph: The limit here is probably the client:
Quarkus still uses only 100% of a CPU (a single core).
What would be interesting is to check how many connections are established against the server. I see that Quarkus is responding on an HTTP/2 endpoint while this is not happening with the other servers (Netty, for example, is not answering back over HTTP/2). I am pretty sure there is a single connection, and that's why it is using a single event loop (Quarkus is backed by Netty, which allocates connections round-robin across the I/O threads).
We use the default Java HttpClient, but it's shared among all the JMH threads (because the state is benchmark-scoped). Internal pool: according to this post: https://stackoverflow.com/questions/53617574/how-to-keep-connection-alive-in-java-11-http-client the client uses an internal HTTP connection pool, and since the tested URL is always the same, the connection might get reused. The implementation is here: jdk.internal.net.http.HttpConnection#getConnection - and I see it uses the connection pool only for the HTTP/1.1 protocol, not for HTTP/2. If I try calling the Quarkus endpoint via curl I see:
But the Java client prefers HTTP/2, so the protocol might differ in that case. I have no more time now - I will have to get back to it later.
@novoj I think that the problem is that the behaviour with HTTP/2 is very different, and only Quarkus supports it (the other benchmarked servers don't), making it behave in an unexpected way!

```java
@Setup(Level.Trial)
public void setUp() throws URISyntaxException {
    client = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .build();
    /* ... */
}
```

This will fix the performance of Quarkus because it makes the client behave the same across the different frameworks - I've already verified that the numbers are now good :)
Let me know, when you have the chance, whether the proposed fix works for you @novoj
What fix do you mean? The enforcement of HTTP/1.1 on the client?
Yep! Thanks!
I've had some break before leaving, so here are the results. Enforcing HTTP/1.1:
And pidstat now shows more than a single core:
The flamegraph looks like this: So the numbers are aligned with the other servers. I've also tried another change - switching ... When I did this and removed the ... The latest version is in the GitHub repo.
The change to
And I can tell you that Quarkus will soon improve on this one, thanks to this benchmark.
I can close this one as completed at this point, and I'm eager to see the next results in the blog post :P
Yes, I'm counting on it.
The branch is updated - I re-ran all the performance tests and got the following results:
When I used taskset with 3 CPUs for the server and 2 CPUs for JMH (leaving 1 CPU for the system), I got these results:
The article https://evitadb.io/blog/03-choosing-http-server was updated.
Thanks @novoj ❤️
I feel pretty ashamed we made so many mistakes in the perf tests. Thanks for the guidance.
Hey, you are doing it in your free time (today is Sunday!!!), moved by interest and passion, so don't be ashamed, really.
Netty HTTP performance has changed A LOT recently, hence I suggest the following:
For Netty, again: please use what I've done here -> https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Java/netty/src/main/java/hello
It should include fixes for a JDK issue (which would affect Quarkus as well, but not only Quarkus - see https://redhatperf.github.io/post/type-check-scalability-issue/ for more info), the right way to handle what the Netty pipeline produces, AND how to populate the pipeline as well.
Do the same with Vert.x: please use the latest version.
The same goes for Quarkus. I'm a core developer on all three frameworks, thanks :)
Personal suggestion: please don't run things on localhost, really, unless someone forces you to do it at gunpoint. It's not representative in any form, given that localhost won't use IRQs.