
Quarkus/Vertx/Netty HTTP performance #1

Closed
franz1981 opened this issue Mar 30, 2023 · 82 comments

@franz1981

Netty HTTP performance has changed A LOT recently, so I suggest the following:

For Netty, again: please use what I've done here -> https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Java/netty/src/main/java/hello

It includes fixes for a JDK issue (which would affect Quarkus as well, but not only Quarkus - see https://redhatperf.github.io/post/type-check-scalability-issue/ for more info), the right way to handle what the Netty pipeline produces, AND how to populate the pipeline as well.

Do the same with Vertx: please use the latest.
The same goes for Quarkus. I'm a core developer on all three frameworks, thanks :)

Personal suggestion: please don't run things on localhost, really, unless someone forces you to at gunpoint. It's not representative in any form, given that localhost won't use IRQs.

@franz1981
Author

Similarly, I see that the REST endpoints in Quarkus are not using the @NonBlocking annotation for the hello world one: it makes a lot of difference, perf-wise.

More information about which endpoints to mark as non-blocking here: https://quarkus.io/blog/resteasy-reactive-smart-dispatch/

Things have changed "recently"

@franz1981 franz1981 changed the title Netty HTTP performance Quarkus/Vertx/Netty HTTP performance Mar 30, 2023
@novoj
Contributor

novoj commented Mar 31, 2023

Thanks for pointing out the problems. I will try to work on them this weekend.
However, I'd like to ask a few "dumb" questions now:

The performance tests use JMH and therefore test the web server on the loopback interface - while this isn't the most representative use case for a web server, we thought it was good enough for a "high-level" view of its performance.

  1. Since the JMH uses the number of threads equal to the number of CPUs, we thought that the @NonBlocking approach would not help anywhere - there should be no delay on the network and there is no disk I/O involved (we only test the "echo" service). That's why we used the synchronous approach for all implementations. If I remember correctly, the tests were using 100% of the CPU - but I'll check again this weekend.
  2. Testing on localhost - we have done similar performance tests for evitaDB itself (results can be found here: https://evitadb.io/research/evaluation/evaluation), both on rented Digital Ocean VPS and on physical hardware at the university. We also ran tests locally during development, and although the absolute numbers were different, the relative differences were similar. So we concluded that testing on localhost might not be precise, but in general it gives similar results, so we started to accept it (with a reservation) because it's the easiest and fastest way to get results for iterative development. We used dedicated test runs on an external server, running for a longer period of time on VPS only for confirmation.

Regarding the second question, I'd like to read more about the issues. Can you share any resources with me so that I can delve into it more? One thing that might differ is the -server Java argument (https://docs.oracle.com/cd/E22289_01/html/821-1274/configuring-the-default-jvm-and-java-arguments.html) - but this can be set for the tests even on localhost. I've never heard of the IRQ problem you write about.

I'm aware that other applications are more likely to consume resources on localhost than on a dedicated server, but if the tests are repeated multiple times or, better, verified on a different machine, I tended to take them into account.

@franz1981
Author

franz1981 commented Mar 31, 2023

Thanks, answers inline

Since the JMH uses the number of threads equal to the number of CPUs, we thought that the @NonBlocking approach would not help anywhere

You can decide how many threads run the @Benchmark method by using -t on the JMH command line.
This will leave enough cores available to dedicate to the server application.
Re the server, I suggest confining it to specific cores using cgroups or taskset; this helps the JVM ergonomics report the right number of cores to the framework configuration, without requiring you to set it manually.

E.g. taskset -c 0,2,4,6 java -jar <quarkus app>.jar ensures that all the JVM's threads are confined to those cores, including GC and compiler threads and framework threads (so the event loop is automatically sized with the right count, thanks to Runtime.getRuntime().availableProcessors(), which would report 4 in the example above), and including the blocking threads used when endpoints are not marked @NonBlocking and don't use any "async" return type, e.g. Multi/Uni/CompletableFuture/etc.
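As a stdlib-only illustration (hypothetical names, not actual Quarkus or Netty code) of why taskset alone is enough: frameworks size their event loops from the processor count the JVM reports, and that count respects the taskset/cgroup mask on modern JDKs:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EventLoopSizing {
    // Under `taskset -c 0,2,4,6 java ...` this returns 4 on a modern JDK,
    // so the "event loop" pool below is sized to 4 automatically.
    public static int eventLoopThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int threads = eventLoopThreads();
        // Stand-in for a framework's event loop group.
        ExecutorService eventLoops = Executors.newFixedThreadPool(threads);
        System.out.println("event loop threads = " + threads);
        eventLoops.shutdown();
    }
}
```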

This helps by not making the client and server(s) compete for and share the same resources.
Re @NonBlocking: it's just a good way to avoid the thread context switches that a blocking thread executor causes, and it requires NOT performing blocking operations on the I/O threads - e.g. a hello world endpoint is perfect in this sense.

I'm not a great fan of using JMH for macrobenchmarks anyway, and I would suggest using Hyperfoil instead

(https://hyperfoil.io/), which avoids the infamous Coordinated Omission measurement issue and can easily provide comparisons/metrics/graphs; it also has a wrapper that emulates `wrk`/`wrk2`, which makes it easier to use.
In case the workload you have to emulate doesn't allow using a tool like that, it's fine to use JMH (sigh!).

NOTE: using an external load generator tool doesn't mean that you don't have to taskset server/clients - quite the opposite: you now have the obligation to do it for the client as well (which is simpler than using -t with JMH, since -t won't "pin" threads to cores as taskset does).

So we concluded that testing on localhost might not be precise, but in general it gives similar results, so we started to accept it (with a reservation) because it's the easiest and fastest way to get results for iterative development. We used dedicated test runs on an external server, running for a longer period of time on VPS only for confirmation.

Agreed, but if you check with Linux commands you'll find that the loopback interface is flagged "noIRQ" because it isn't affected by soft/hard IRQs, i.e. the interrupts associated with a real network interface. This means that if an application is optimized for throughput and tries to size packets to be more mechanically sympathetic to a real network interface (which would, for example, generate more interrupts on receive, consuming CPU as well), you won't see the effect, and the test will give the false impression that one stack is "better" than another, while they may have been developed to benefit different use cases that are simply not captured by the loopback interface.
That said, I agree that in CPU-bound scenario(s) it can still be OK to perform such tests, but I always recommend some rounds on proper hardware.

Regarding the second question, I'd like to read more about the issues. Can you share any resources with me so that I can delve into it more

you mean https://redhatperf.github.io/post/type-check-scalability-issue/ ?

@novoj
Contributor

novoj commented Mar 31, 2023

Thanks @franz1981, the comment was super enlightening!

By the way, I have read about JDK-8180450 - but diagnosing such a problem is way beyond my level. Kudos to the team, and I'm glad it has been backported to the JDK so we can benefit from the findings.

@franz1981
Author

and I'm glad it has been backported to the JDK

Actually it is not. And it's not a delayed 1st April joke 🤣

@novoj
Contributor

novoj commented Apr 2, 2023

Well, backporting was the wrong term; I should have said "porting". I refer to:

Thanks to the reports of our developers and the rest of Java Community (including Netflix and Intel engineers), the scalability JDK issue now has raised a priority and a fix will be targeted for the Java 21 release.

@novoj
Contributor

novoj commented Apr 2, 2023

I've had some time to work on it over the weekend and here is the battle plan:

  1. measure a base-line: I will make measurements on my development machine (Intel(R) Core(TM) i7-10750H @ 2.60GHz - 6 cores, 12 threads) on one year old versions that match the original measurements and results published in the article; I will assign 6 cores to the servers, assign 4 cores to the JMH client, leave 2 cores to the system (this is currently running)
  2. measure improvements in current versions: I will upgrade all servers to current versions (95% is done in the version-from-2023-03 branch and there is only one subtle bug left in the Quarkus implementation) and run the test suite with exactly the same parameters as the baseline.
  3. tweak the code according to @franz1981's recommendations here and run the test suite with exactly the same parameters as the baseline.
  4. move the JMH client to another machine: I will leave the server on my developer laptop with taskset affinity still at 6 out of 12 CPU threads, the JMH client will be moved to my other Windows machine on the same local 1GB network, now the client should not affect or limit the server

For all runs I'll create a JSON file with exact statistics. Then we analyse the variations in them.

@novoj
Contributor

novoj commented Apr 2, 2023

1. base-line test output

The result is recorded in Gist: https://gist.github.com/novoj/90b090cf3adc9116f2a99d7e78fa4e4a
Visualization available here

@franz1981
Author

Yep. Please consider another factor, in the baseline as well as the other version:
https://github.com/FgForrest/HttpServerEvaluationTest/blob/main/server/src/main/java/one/edee/oss/http_server_evaulation_test/server/netty/GraphQLHandler.java#L50

both Quarkus and Netty are performing blocking operations...
I suppose they should use the async version instead (adding @NonBlocking on Quarkus)

@novoj
Contributor

novoj commented Apr 4, 2023

2. upgraded versions test output

The result is recorded in Gist: https://gist.github.com/novoj/4adcef0c9deef3964d191844b36f700b
Visualization available here

Bumped versions:

Web server            Original version   New version
Quarkus               2.7.5.Final        2.16.5.Final
Spring Boot MVC       2.6.5              3.0.5
Spring Boot WebFlux   2.6.5              3.0.5
Micronaut             3.4.1              3.8.8
Microhttp             0.7                0.7
Netty                 4.1.75.Final       4.1.90.Final
Nanohttpd             2.3.1              2.3.1
Javalin               4.4.0              5.4.2
Vert.x                4.2.6              4.4.1
Undertow              2.2.16.Final       2.3.5.Final

No bigger changes were made to the original code (that is going to be part of step 3).

@franz1981
Author

Just to help with interpreting whatever results are here so far:

@novoj
Contributor

novoj commented Apr 4, 2023

I have similar feelings about it. We're newbies to Netty integration, but the implementation seems pretty straightforward to me: https://github.com/FgForrest/HttpServerEvaluationTest/blob/version-from-2023-03/server/src/main/java/one/edee/oss/http_server_evaulation_test/server/netty/GraphQLHandler.java

As I said, the next step will be optimizing the code. I'll try to run the Netty tests separately and look at the profiler output. At least we have a first observation - it's not easy for a newbie to write a well-performing GQL service on Netty :)

I see two bigger memory allocations in the code:

byte[] body = new byte[request.content().readableBytes()];

and

final ByteBuf bodyBuffer = ctx.alloc().buffer(body.length);

Is there a better approach to it? In the colleague implementation there is:

ServerBootstrap b = new ServerBootstrap();
b.option(ChannelOption.SO_BACKLOG, 1024);
b.option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);

So I guess alloc() just retrieves an already allocated buffer from the pool. On the other hand, I don't see an explicit buffer.release() call (but as I scanned the code it should be called later along the way by Netty itself, so it shouldn't be a problem). I thought I could replace the code that allocates byte[] arrays with ByteBuffers, but Jackson probably won't allow it.
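For what it's worth, one possible direction, sketched under the assumption of Netty 4.x plus Jackson on the classpath (`GraphQLRequest` is the repo's own type): Jackson can read from and write to the pooled `ByteBuf` through Netty's stream adapters, which removes both intermediate `byte[]` copies. The inbound buffer is released by the pipeline after `channelRead`, and ownership of the outbound buffer passes to `ctx.write(...)`, which releases it once flushed:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.fasterxml.jackson.databind.ObjectMapper;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufInputStream;
import io.netty.buffer.ByteBufOutputStream;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http.FullHttpRequest;

// Excerpt-style sketch: JSON (de)serialization straight on the ByteBuf.
final class JsonCodec {
    private final ObjectMapper mapper = new ObjectMapper();

    GraphQLRequest parse(FullHttpRequest request) throws IOException {
        // Reads the readable bytes in place: no byte[] copy of the body.
        try (InputStream in = new ByteBufInputStream(request.content())) {
            return mapper.readValue(in, GraphQLRequest.class);
        }
    }

    ByteBuf serialize(ChannelHandlerContext ctx, Object response) throws IOException {
        ByteBuf buf = ctx.alloc().buffer(); // pooled, from the channel's allocator
        try (OutputStream out = new ByteBufOutputStream(buf)) {
            mapper.writeValue(out, response);
        }
        return buf; // released by Netty once written to the socket
    }
}
```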

I will continue later this week. Thank you for your suggestions!

@franz1981
Author

The real problem is that GraphQL should never block the Netty event loop, or it will cause massive scalability degradation: is the GraphQL manager performing supposedly blocking operations?
I suggest profiling the server with async-profiler's lock profiling to find out.

@novoj
Contributor

novoj commented Apr 4, 2023

I think so - based on this blog post: https://www.graphql-java.com/blog/threads - let's see what the switch to non-blocking brings. On the other hand, there is no I/O, just rendering a constant string to the output. I would have thought that switching to async here would only add overhead.

@franz1981
Author

franz1981 commented Apr 4, 2023

Maybe, maybe not: without profiling it's difficult to say, unless they have a very clear explanation of what it does under the hood.
Sadly Netty, like Vert.x and Quarkus, is very sensitive to blocking of the event loop thread.

@novoj
Contributor

novoj commented Apr 4, 2023

I agree. I have been surprised many times already :)

@franz1981
Author

franz1981 commented Apr 7, 2023

@novoj

One qq:

        // Tell the client we're going to close the connection.
        response.headers().set(CONNECTION, CLOSE);

and

            ChannelFuture f = ctx.write(response);
            f.addListener(ChannelFutureListener.CLOSE);

these suggest that the benchmark is creating a whole new connection for each single request; that's not what HTTP/1.1 suggests by default (so-called persistent connections) - and it doesn't seem to happen for the other servers; why?

Removing both the Connection: close HTTP header and the closing of the connection when the response is sent (more than) doubles the performance of the Netty server; and I can already see other low-hanging fruit there.
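For reference, a hedged sketch (Netty 4.x API; `bodyBuffer`, `request`, and `ctx` follow the snippets quoted above) of honoring HTTP/1.1 persistent connections and closing only when the client asks for it:

```java
// Inside channelRead0 of a SimpleChannelInboundHandler<FullHttpRequest>:
boolean keepAlive = HttpUtil.isKeepAlive(request);
FullHttpResponse response =
        new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK, bodyBuffer);
response.headers()
        .set(HttpHeaderNames.CONTENT_TYPE, "application/json")
        .setInt(HttpHeaderNames.CONTENT_LENGTH, bodyBuffer.readableBytes());
if (keepAlive) {
    // Advertise keep-alive explicitly for HTTP/1.0 clients; no-op for HTTP/1.1.
    HttpUtil.setKeepAlive(response, true);
    ctx.write(response);
} else {
    // Close only when the client sent "Connection: close".
    ctx.write(response).addListener(ChannelFutureListener.CLOSE);
}
```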

@novoj
Contributor

novoj commented Apr 7, 2023

Sorry for the delay - I do this in my spare time and the week has been quite busy.

I've included your recommendations. The results have changed quite a lot. The following numbers relate to the current version of the code, where the GraphQL evaluation is moved to an async thread (I hope I did it right) and the connection closing was eliminated.

Benchmark                                                    Mode  Cnt      Score      Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_NettyServer  thrpt    5  28337.495 ± 1275.362  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_NettyServer      avgt    5    138.122 ±    5.174  us/op

These results represent a version where only the logic was moved to an async thread, but the connection was still closed after each request:

Benchmark                                                    Mode  Cnt      Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_NettyServer  thrpt    5  15442.104 ± 673.331  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_NettyServer      avgt    5    257.802 ±  11.021  us/op

It looks like the connection closing was the culprit. I have to discuss this with a colleague (why he closed the connection in this implementation).

I've also run a profiler on the JMH run, with these results:

[profiler flamegraph screenshot]

Full JFR file can be downloaded from here: https://drive.google.com/file/d/1X2VOhCz8mhetPQHiebwwhAd3olj8ZPbE/view?usp=sharing

The GraphQL implementation takes 54%, 15% write and flush, 21% NioEventLoop processSelectedKeys.

@novoj
Contributor

novoj commented Apr 7, 2023

This commit updates the Vert.x implementation - I couldn't find a way to add the @NonBlocking annotation to the code, but I did find out that the handler method executes the code on the event loop thread, while the blockingHandler does not. Instead, the work is done on a separate thread. According to the Javadoc:

Specify a blocking request handler for the route. This method works just like {@link #handler(Handler)} excepted that it will run the blocking handler on a worker thread so that it won't block the event loop. Note that it's safe to call context.next() from the blocking handler as it will be executed on the event loop context (and not on the worker thread).

So it turns out that blockingHandler is actually non-blocking for the event loop. That's pretty counterintuitive. Also, the javadoc for io.vertx.ext.web.Route#handler does not mention that the code is executed on an event loop thread and does not recommend the "blocking" alternative. Although the implementer should read the javadocs, I can understand that it's easy to make mistakes in this case.
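The two dispatch modes can be sketched side by side (Vert.x 4.x web API; `executeGraphQL` is a hypothetical stand-in for the shared GraphQL call, so this needs the vertx-web dependency to compile):

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.handler.BodyHandler;

public class VertxDispatchModes {
    // Hypothetical stand-in for the shared GraphQL evaluation.
    static String executeGraphQL(String query) { return "{}"; }

    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        Router router = Router.router(vertx);
        router.route().handler(BodyHandler.create()); // make the body available

        // handler(): runs on the event loop thread and must never block.
        router.post("/graphql-el").handler(ctx ->
                ctx.response().end(executeGraphQL(ctx.body().asString())));

        // blockingHandler(): runs on a worker thread; despite the name, this
        // is the variant that keeps the event loop unblocked.
        router.post("/graphql-worker").blockingHandler(ctx ->
                ctx.response().end(executeGraphQL(ctx.body().asString())));

        vertx.createHttpServer().requestHandler(router).listen(8080);
    }
}
```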

Results with the handler implementation:

Benchmark                                                    Mode  Cnt      Score      Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_VertXServer  thrpt    5  13350.199 ± 1051.297  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_VertXServer      avgt    5    294.011 ±    8.472  us/op

Results with the blockingHandler implementation:

Benchmark                                                    Mode  Cnt      Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_VertXServer  thrpt    5  22699.185 ± 794.541  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_VertXServer      avgt    5    171.575 ±   4.230  us/op

@franz1981
Author

franz1981 commented Apr 7, 2023

@novoj

Sent a comment at c888d2e#r107958651

I couldn't find a way to add the @NonBlocking annotation to the code

this applies to Quarkus, not Vert.x; sorry if I didn't make that clear.

For Quarkus, I suggest two things to try:

  • use a @NonBlocking annotation on the endpoint and use the synchronous GraphQL API
  • use the async GraphQL API and return a CompletableFuture, which is supported and acknowledged as a "non-blocking" type

And see which performs better.

Re GraphQL: it's an allocating machine, see:
[allocation flamegraph screenshot]

This is the allocation flamegraph: most of the allocations for Netty come from there; whatever Netty does or doesn't do isn't really the problem, because it isn't the major bottleneck (after fixing the connection-close bug).

@franz1981
Author

After reading the code called by GraphQL twice, I strongly discourage using the async version: it isn't doing anything asynchronously, probably due to some method not being implemented :"(

@novoj
Contributor

novoj commented Apr 7, 2023

I'm experimenting with it just now - the code:

import io.smallrye.common.annotation.NonBlocking;

@POST @NonBlocking
public GraphQLResponse graphQL(GraphQLRequest request) {
    return graphQLManager.execute(request);
}

doesn't bring any performance gains (and neither does @Blocking) - the results are practically the same. I got the following variant working when I added the quarkus-resteasy-mutiny extension to the project:

@POST @NonBlocking
public Uni<GraphQLResponse<Object>> graphQL(GraphQLRequest request) {
    return Uni.createFrom().completionStage(graphQLManager.executeAsync(request));
}

Unfortunately, it didn't bring any performance gains.

Regarding the Netty note - so the correct way is to create my own thread pool and delegate the work there? I tried it and the numbers were slightly worse. So I removed it again and aligned the thread count with the number of processors, which led to a performance increase.

So I re-ran the entire suite with the current code version and will link the results here.

@franz1981
Author

@novoj

Re Quarkus, I believe it's better NOT to use the async GraphQL call (see #1 (comment)) and just mark
the endpoint as you did:

    @POST @NonBlocking
    public GraphQLResponse graphQL(GraphQLRequest request) {
        return graphQLManager.execute(request);
    }

TBH I haven't profiled Quarkus yet, but the same idea applies: use the very latest versions of everything, if possible.
The fact that it doesn't bring any improvement means something is not right: consider that the usual ranking is
Vert.x slightly better than Netty (no joke, there are good reasons for that!) and Quarkus 10-15% worse than Vert.x, depending on what it's doing.
If you get anything different from this, something is wrong or broken.

Regarding the Netty note - so the correct way is to create my own threadpool and delegate the work there?

At #1 (comment) I basically said not to rely on the async API of GraphQL (which isn't doing what you think it's doing) and to stick with the synchronous one: in short, just do what the original Netty code was doing (but without closing the connections); that will give you a decent starting point from a performance point of view.
In the real world, while doing capacity planning for something like Netty/Quarkus/Vert.x that uses both the event loops AND a blocking thread pool, you should size them according to your cores and expected usage; if you are both super I/O-intensive and super blocking-call-intensive, it isn't a good idea to make the pools compete for and share all the cores - size them to divide the cores equally (half for I/O, half for blocking calls). It defeats the OS scheduler's load balancing a bit, but it really doesn't make sense to make them fight for CPU time if you know they will both be busy.
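The half-and-half sizing can be sketched with plain JDK executors (a simplified, stdlib-only stand-in for an event loop group and a blocking pool; the names are illustrative, not any framework's API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Split the available cores between I/O (event loop) threads and
    // blocking-call threads instead of letting both pools span all cores.
    public static int ioThreads(int cores)       { return Math.max(1, cores / 2); }
    public static int blockingThreads(int cores) { return Math.max(1, cores - cores / 2); }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService ioPool = Executors.newFixedThreadPool(ioThreads(cores));
        ExecutorService blockingPool = Executors.newFixedThreadPool(blockingThreads(cores));
        System.out.printf("cores=%d io=%d blocking=%d%n",
                cores, ioThreads(cores), blockingThreads(cores));
        ioPool.shutdown();
        blockingPool.shutdown();
    }
}
```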

@novoj
Contributor

novoj commented Apr 7, 2023

The results are here: https://gist.github.com/novoj/cef56bd940a015b4cfb1ad389d2b6705
Visualization here

Comparison with the original version

The only implementation that is not faster is the Quarkus one. There must be something wrong there, but I haven't been able to find out what today.

@franz1981
Author

franz1981 commented Apr 9, 2023

Another thing I have noted (and this applies to everyone: Netty/Vert.x/Quarkus) is that GraphQL internally creates many UUIDs which sadly share a Java monitor; wrapping N GraphQL instances in a FastThreadLocal (or just saving one as a handler field, by peeking it from a FastThreadLocal) would prevent such heavy contention and further improve performance.
@novoj I don't know if the results you shared are using the latest versions of everything: I had a few issues due to how the benchmarks run (many sharing common dependencies) - it would be better for each server to have its own module and dependencies.
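The per-thread caching idea can be sketched with the JDK's plain `ThreadLocal` (Netty's `FastThreadLocal` plays the same role, just faster on Netty's own threads; `Engine` below is a hypothetical stand-in for the GraphQL instance whose internals contend on a monitor):

```java
public class PerThreadGraphQL {
    // Hypothetical stand-in for a GraphQL instance with shared-monitor internals.
    static class Engine {
        String execute(String query) { return "data:" + query; }
    }

    // One engine per (event loop) thread: requests on different threads
    // never touch the same instance, so its internal monitor is uncontended.
    private static final ThreadLocal<Engine> ENGINE =
            ThreadLocal.withInitial(Engine::new);

    public static Engine engine() { return ENGINE.get(); }

    public static void main(String[] args) throws Exception {
        Engine onMain = engine();
        // The same thread always sees the same cached instance.
        System.out.println(onMain == engine()); // true
        Engine[] onOther = new Engine[1];
        Thread t = new Thread(() -> onOther[0] = engine());
        t.start();
        t.join();
        // A different thread gets its own instance: no cross-thread contention.
        System.out.println(onMain == onOther[0]); // false
    }
}
```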

@novoj
Contributor

novoj commented Apr 9, 2023

  1. GraphQL - we know that GraphQL Java has its own performance issues, but I don't know of other GraphQL implementations in the Java ecosystem, so we need to stick with it. All implementations share the same GraphQL logic - so they should have the same performance issues. The performance tests try to measure only the performance penalty added by the web server itself in the GraphQL integration scenario.

  2. shared dependencies - you're right, I struggled with this myself during the version upgrade. That's why I had to move the Spring Boot server implementations to separate POMs and fiddle with their pom.xml. I can split the `server` module into separate ones that would also be started separately. It would just require starting multiple JVMs. It may affect the tests in the sense that during the JMH test suite all web servers will be running at the same time, sharing the resources of the host, so potentially fighting for cores/memory access (even if only one web server is under load at a time).

Splitting the server module is a matter of an hour or two - which I could have this week. If you think it would help, I'll do it.

@franz1981
Author

I could provide a UNIX script that makes use of Hyperfoil too, if you do that :)
You can try it yourself using this one as a blueprint:
https://github.com/franz1981/quarkus-profiling-workshop/blob/master/scripts/benchmark.sh

That one is a bit more complex because it has to trigger profiling, but you just need to pass the application to be run as a separate script argument.
Then you can produce charts with ease as well; that's a built-in Hyperfoil feature.

@novoj
Contributor

novoj commented Apr 9, 2023

That would be great! Ok, I'm going to prepare it during this week.

@novoj
Contributor

novoj commented Apr 10, 2023

@franz1981 the servers currently open different ports so that they can run in parallel. Should I leave it, or do you need them to reuse the same port because you'll run them sequentially one after another? What is the expected procedure for stopping them (now they wait for a system input or Ctrl+C)?

@novoj
Contributor

novoj commented May 28, 2023

It seems the pidstat -G java command reports CPU utilization in a misleading way; I need to add a sampling interval to see useful results:

jno@jnonb:~ $ pidstat -G java
Linux 5.15.0-48-generic (jnonb) 	28.5.2023 	_x86_64_	(12 CPU)

11:51:08      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:51:08     1000    172805    6,93    0,78    0,00    0,00    7,71     2  java
11:51:08     1000    173636    0,04    0,00    0,00    0,00    0,04     8  java
11:51:08     1000    173700    0,51    0,27    0,00    0,00    0,79     7  java
jno@jnonb:~ $ pidstat 1 -G java
Linux 5.15.0-48-generic (jnonb) 	28.5.2023 	_x86_64_	(12 CPU)

11:51:12      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:51:13     1000    172805   91,09    8,91    0,00    0,00  100,00     2  java
11:51:13     1000    173700   41,58   21,78    0,00    0,00   63,37     7  java

@franz1981
Author

Now, what happens with your taskset settings? Is Quarkus using nearly all the cores?

@novoj
Contributor

novoj commented May 28, 2023

The utilization distribution seems OK - the JMH client doesn't exhaust the resources:

11:56:16      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:56:17     1000    175590   95,00    5,00    0,00    0,00  100,00     3  http_server_eva (Quarkus native)
11:56:17     1000    175729   22,00   12,00    0,00    0,00   34,00     8  java (JMH)
11:50:29      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:50:30     1000    172805   89,00   11,00    0,00    0,00  100,00     2  java (Quarkus JVM)
11:50:30     1000    173700   41,00   25,00    0,00    0,00   66,00     7  java (JMH)

And for Netty server:

12:01:24      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
12:01:25     1000    177391  178,00   50,00    0,00    0,00  228,00     3  java (Netty)
12:01:25     1000    177522  136,00  115,00    0,00    0,00  251,00     6  java (JMH)

@franz1981
Author

How many cores have you dedicated to Quarkus? If you see 100% it means it is using a single core, which is not good (unless you dedicated only a single core to it). That's why it's better to have a client capable of producing high load with just a few threads (we use Hyperfoil for this reason, but wrk and others are fine).

@novoj
Contributor

novoj commented May 28, 2023

@franz1981 all servers are executed now with taskset -c 0-5

The interesting fact is that not even Netty fully utilizes the CPUs:

[CPU utilization screenshot]

But it's certainly more than the 100% we observe in the case of Quarkus.
I don't think the way I measure it is wrong - if it were, I'd see the same problem with the other servers (such as Netty), and I don't see them capped at 100%.

I think I have something wrong with Quarkus, but the application.properties seems pretty minimalistic, so it should use the defaults.

@franz1981
Author

franz1981 commented May 28, 2023

The taskset problem is still present.

all servers are executed now with taskset -c 0-5

You are making the servers and the client share the same physical cores.

E.g. if the client uses logical core 6 and the server uses core 0, they are running on the same physical core 0!!

If you disable hyper-threading it becomes clearer.

@novoj
Contributor

novoj commented May 28, 2023

Ok, I've disabled the hyperthreading - now I have only 6 CPUs listed:

jno@jnonb:/www/oss/HttpServerEvaluationTest (isolated-modules) $ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz

When I execute the Netty server (which performs the best) without any taskset command, I get the following utilization and results:

12:48:16      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
12:48:17     1000     12655  175,00   51,00    0,00    0,00  226,00     5  java
12:48:17     1000     15355  122,00  118,00    0,00    0,00  240,00     4  java
Benchmark                                                    Mode  Cnt      Score      Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_NettyServer  thrpt    5  29644.674 ± 4356.935  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_NettyServer      avgt    5    196.437 ±    0.595  us/op

When I start the server with taskset: taskset -c 0-3 java -jar ./netty_server/target/servers.jar
and run JMH with taskset: taskset -c 4 java -jar performance_tests/target/benchmarks.jar .*Netty.* -wi 1 -i 5 -f 1
I get the following utilization and results:

13:14:00      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:14:01     1000     16696   44,00   17,00    0,00    0,00   61,00     1  java
13:14:01     1000     16955   29,00   30,00    0,00    0,00   59,00     4  java
Benchmark                                                    Mode  Cnt     Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_NettyServer  thrpt    5  9479.548 ± 134.995  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_NettyServer      avgt    5   104.831 ±   0.878  us/op

Using napkin math: the Netty server had 4 out of 6 CPUs assigned, so it should have done 29644 * (4/6) = 19762.66 reqs/sec (or better, since the CPUs are reserved for the process), but it did only about half of the expected rate.

I observe similar behaviour for the Quarkus JVM (without taskset):

13:31:20      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:31:21     1000     30735   94,00    6,00    0,00    0,00  100,00     3  java
13:31:21     1000     31054   36,00   24,00    0,00    0,00   60,00     5  java
Benchmark                                                      Mode  Cnt      Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_QuarkusServer  thrpt    5  17887.315 ± 705.428  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_QuarkusServer      avgt    5    334.959 ±   2.454  us/op

And when executed with taskset I get the following:

13:49:08      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:49:09     1000     32834   61,00   14,00    0,00    0,00   75,00     3  java
13:49:09     1000     33115   18,00   20,00    0,00    0,00   38,00     4  java
Benchmark                                                      Mode  Cnt     Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_QuarkusServer  thrpt    5  8893.156 ± 170.491  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_QuarkusServer      avgt    5   115.536 ±   0.681  us/op

I've discovered this post: https://stackoverflow.com/questions/26219500/provide-more-than-one-processor-with-taskset which suggests that I might be using taskset with the wrong arguments. But even using this variant, or this one - taskset --all-tasks --cpu-list 0-3 - didn't bring different results. Using taskset, even in an environment with hyper-threading disabled, makes the throughput results worse (but the latency better!) than running without it.

@novoj
Contributor

novoj commented May 28, 2023

But the Quarkus results look as if I'm capped at 1 CPU in all scenarios - the utilization never goes over 100% for the Quarkus server.

@franz1981
Author

franz1981 commented May 28, 2023

The problem IMO is that the client is not good enough to push enough load while capped to run on a single core.

How many threads/cores is the client using? How many connections does it create? Does it use a blocking or non-blocking mechanism?

I hope to have time to run this myself at some point today.
From profiling Netty last time, I can predict that the performance will likely be the same for every server tested, because GraphQL is really the dominant factor.

@franz1981
Author

franz1981 commented May 28, 2023

I had some time and:

  • only 1 (!!) event loop is fully used, despite the correct number being created; it requires some investigation on my side, really
  • most of the CPU cycles are spent in something very expensive that should happen just once, i.e. io/smallrye/graphql/cdi/CdiLookupService.getInstance (see it in violet)

[flamegraph screenshot]

@novoj
Contributor

novoj commented May 28, 2023

Sorry, I went out with my daughter (it's actually nice here in the Czech Republic :) ) ... I had the following post half-written:

*The benchmark is here: https://github.com/FgForrest/HttpServerEvaluationTest/blob/isolated-modules/performance_tests/src/main/java/one/edee/oss/http_server_evaulation_test/jmh/ServersBenchmark.java

The HttpClient is created here: https://github.com/FgForrest/HttpServerEvaluationTest/blob/isolated-modules/performance_tests/src/main/java/one/edee/oss/http_server_evaulation_test/jmh/ServerState.java

It uses the HttpClient defaults - but that's something we wanted for the client. We're developing a server-side API and won't have the client side under our control. So we want to record and compare the web server performance with the default settings of the client (unless there are some really dumb presets, which we don't expect in the case of java.net.http.HttpClient).

Another counter-argument: when the client side is exactly the same for measuring both Netty and Quarkus, it should have a similar impact and both servers should produce similar numbers.*

The next step I planned was to attach a profiler and generate an execution flamegraph - but you were faster. There is also another endpoint (/hello), which simply returns static text.

So I've run the same performance tests against this endpoint, which should be our baseline - no GraphQL involved there. The results were interesting:

Netty:

Benchmark                                                    Mode  Cnt      Score     Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_NettyServer  thrpt    5  25114.343 ± 144.033  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_NettyServer      avgt    5    253.076 ±  44.961  us/op

Flamegraph:

[flamegraph screenshot]

So the results are more or less the same.

Quarkus:

Benchmark                                                      Mode  Cnt      Score      Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_QuarkusServer  thrpt    5  66951.590 ± 2576.214  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_QuarkusServer      avgt    5     90.658 ±    3.607  us/op

But the Quarkus results went through the roof! I've double-checked that the endpoint is really called by placing a breakpoint, and it is.

Flamegraph:

[flamegraph screenshot]

The limit here is probably the client:

17:07:22      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
17:07:23     1000     63037   77,00   23,00    0,00    0,00  100,00     0  java
17:07:23     1000     64397  121,00   60,00    0,00    0,00  181,00     1  java

Quarkus still uses only 100% CPU.

@franz1981
Author

franz1981 commented May 28, 2023

What would be interesting is to check how many connections are established against the server. I see that Quarkus is responding on an HTTP/2 endpoint, while that is not happening with the other servers (Netty, for example, is not answering back over HTTP/2).

I am pretty sure there is a single connection, and that's why it is using a single event loop (Quarkus is backed by Netty, which allocates connections round-robin across the I/O threads).
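The round-robin allocation can be illustrated with a minimal sketch (this is not Netty's actual implementation, only the assignment logic described above): with a single connection, only the first event loop ever receives work.

```java
public class RoundRobinSketch {
    public static void main(String[] args) {
        int eventLoops = 4; // e.g. one I/O thread per core
        int next = 0;
        // each new connection is pinned to the next loop in order
        for (int connection = 0; connection < 3; connection++) {
            int loop = next++ % eventLoops;
            System.out.println("connection " + connection + " -> event loop " + loop);
        }
        // a benchmark that opens a single connection therefore saturates loop 0 only
    }
}
```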

@novoj
Contributor

novoj commented May 29, 2023

We use the default Java HttpClient, but it's shared among all the JMH threads (because the state is marked as @Trial) - according to the documentation here it should be OK. The client should prefer HTTP/2 if the server supports it; the client uses an internal thread pool and makes the requests asynchronously internally (but in the tests we wrap them in a synchronous call):

[screenshot]

Internal pool:

[screenshot]
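The sync-over-async wrapping mentioned above boils down to blocking on the returned future; a minimal sketch using a plain CompletableFuture (standing in for HttpClient.sendAsync, which also returns a CompletableFuture - this is not the benchmark's exact code):

```java
import java.util.concurrent.CompletableFuture;

public class SyncOverAsync {
    // stand-in for HttpClient.sendAsync(...): completed on a pool thread
    static CompletableFuture<String> sendAsync() {
        return CompletableFuture.supplyAsync(() -> "response");
    }

    public static void main(String[] args) {
        // the JMH thread blocks here until the internal pool completes the request
        String body = sendAsync().join();
        System.out.println(body); // prints "response"
    }
}
```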

According to this post: https://stackoverflow.com/questions/53617574/how-to-keep-connection-alive-in-java-11-http-client the client uses an internal HTTP connection pool, and since the tested URL stays the same, the connection might get reused. The implementation is in jdk.internal.net.http.HttpConnection#getConnection, and I see it uses the connection pool only for the HTTP/1.1 protocol - not for HTTP/2.
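The protocol preference can be checked directly on the client object without sending any request; this sketch only inspects the configured version:

```java
import java.net.http.HttpClient;

public class ClientVersionCheck {
    public static void main(String[] args) {
        // the default builder prefers HTTP/2 (falling back to 1.1 per request
        // when the server does not support it)
        HttpClient defaultClient = HttpClient.newHttpClient();
        // a pinned client always speaks HTTP/1.1
        HttpClient pinnedClient = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        System.out.println(defaultClient.version()); // HTTP_2
        System.out.println(pinnedClient.version());  // HTTP_1_1
    }
}
```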

If I try calling the Quarkus endpoint via curl I see:

jno@jnonb:~ $ curl -v 'http://localhost:8089/graphql' -H 'Accept-Encoding: gzip, deflate, br' -H 'Content-Type: application/json' -H 'Accept: application/json' -H 'Connection: keep-alive' -H 'Origin: altair://-' --data-binary '{"query":"query Echo {\n    echo(message: \"hello\") {\n        message\n    }\n}","variables":{}}' --compressed
*   Trying 127.0.0.1:8089...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8089 (#0)
> POST /graphql HTTP/1.1
> Host: localhost:8089
> User-Agent: curl/7.68.0
> Accept-Encoding: gzip, deflate, br
> Content-Type: application/json
> Accept: application/json
> Connection: keep-alive
> Origin: altair://-
> Content-Length: 98
> 
* upload completely sent off: 98 out of 98 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/json
< content-length: 37
< 
* Connection #0 to host localhost left intact
{"data":{"echo":{"message":"hello"}}}

But the Java client prefers HTTP/2, so the protocol might differ in that case. I have no more time now - I'll have to get back to it later.

@franz1981
Author

franz1981 commented May 29, 2023

@novoj I think the problem is that the behaviour with HTTP/2 is very different and only Quarkus supports it (unlike the other servers benchmarked), making it behave in an unexpected way!
You should force HttpClient to use HTTP/1.1 in order to have a consistent experience across frameworks:

    @Setup(Level.Trial)
    public void setUp() throws URISyntaxException {
        client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();

        /* ... */
    }
This will fix the Quarkus performance figures because it makes the client behave the same across the different frameworks - I've already verified that the numbers are now good :)

@franz1981
Author

Let me know, when you have the chance, whether the proposed fix works for you @novoj

@novoj
Contributor

novoj commented May 31, 2023

What fix do you mean? The enforcement of HttpClient.Version.HTTP_1_1?! If so, I will for sure - I'm leaving for holiday tomorrow, so I'll get to it next week.

@franz1981
Author

The enforcement of HttpClient.Version.HTTP_1_1

Yep! thanks!

novoj added a commit that referenced this issue Jun 1, 2023
@novoj
Contributor

novoj commented Jun 1, 2023

I had a short break before leaving, so here are the results:

Enforcing HttpClient.Version.HTTP_1_1 did make a difference:

Benchmark                                                      Mode  Cnt      Score      Error  Units
ServersThroughputBenchmark.graphQLApiEchoQuery_QuarkusServer  thrpt    5  27698.199 ± 2692.612  ops/s
ServersLatencyBenchmark.graphQLApiEchoQuery_QuarkusServer      avgt    5    217.066 ±   22.781  us/op

And pidstat now shows more than a single core:

07:58:59      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:59:00     1000     71139  187,00   50,00    0,00    0,00  237,00     2  java
07:59:00     1000     71576  136,00   91,00    0,00    0,00  227,00     4  java

07:59:00      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:59:01     1000     71139  196,00   39,00    0,00    0,00  235,00     2  java
07:59:01     1000     71576  124,00  101,00    0,00    0,00  225,00     4  java

07:59:01      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:59:02     1000     71139  189,00   47,00    0,00    0,00  236,00     2  java
07:59:02     1000     71576  135,00   91,00    0,00    0,00  226,00     4  java

07:59:02      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:59:03     1000     71139  193,00   43,00    0,00    0,00  236,00     2  java
07:59:03     1000     71576  126,00  100,00    0,00    0,00  226,00     4  java

07:59:03      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:59:04     1000     71139  171,00   42,00    0,00    0,00  213,00     2  java
07:59:04     1000     71576  120,00   85,00    0,00    0,00  205,00     4  java

The flamegraph looks like this:

[flamegraph screenshot]

So the numbers are aligned with other servers.

I've also tried another change - switching @State(Scope.Benchmark) to @State(Scope.Thread), which is more accurate for this situation since it creates a different HttpClient per JMH thread, giving each of them a separate connection pool.

When I did this and removed the HttpClient.Version.HTTP_1_1 enforcement, the Quarkus performance went to 35k reqs/sec on the same machine. So HTTP/2 brings another performance gain. But in order to keep the numbers comparable with the other servers, I downgraded the protocol to HttpClient.Version.HTTP_1_1 again.
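The effect of @State(Scope.Thread) - one client, and hence one connection pool, per JMH worker thread - can be mimicked with a ThreadLocal; this is illustrative only, not JMH's actual mechanism:

```java
import java.net.http.HttpClient;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadClientSketch {
    // like @State(Scope.Thread): every worker thread gets its own client instance
    static final ThreadLocal<HttpClient> CLIENT =
            ThreadLocal.withInitial(HttpClient::newHttpClient);

    public static void main(String[] args) throws InterruptedException {
        Set<HttpClient> distinct = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> distinct.add(CLIENT.get()));
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(distinct.size()); // 4: one client (and pool) per thread
    }
}
```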

Latest version is in the GitHub repo.

@novoj
Contributor

novoj commented Jun 1, 2023

The change to @State(Scope.Thread) also positively impacted the Netty numbers, which grew to 33k reqs/sec.

@franz1981
Author

And I can tell you that Quarkus will soon improve on this one, thanks to this benchmark

@franz1981
Author

I can close this one as completed at this point, and I'm eager to see the next results in the blog post :P

@novoj
Contributor

novoj commented Jun 5, 2023

Yes, I'm counting on it.

@novoj
Contributor

novoj commented Jun 17, 2023

The branch isolated-modules was merged into the main branch.

I re-ran all the performance tests and got the following results:

| Server, library or framework | JMH Score (ops/s) | Min JMH Score (ops/s) | Max JMH Score (ops/s) |
|---|---|---|---|
| Netty | 32310 | 31372 | 33248 |
| microhttp | 31344 | 30597 | 32092 |
| Vert.x | 30463 | 28915 | 32010 |
| Javalin | 26649 | 24502 | 28796 |
| Undertow | 25214 | 22444 | 27985 |
| Micronaut | 22169 | 19626 | 24712 |
| Quarkus | 21269 | 19650 | 22887 |
| Spring Boot WebFlux | 20016 | 18677 | 21355 |
| Spring Boot MVC | 15550 | 15059 | 16041 |
| Quarkus (in native mode) | 15433 | 14516 | 16351 |
| NanoHTTPD | 9400 | 9068 | 9733 |

When I used taskset with 3 CPUs for the server and 2 CPUs for JMH (leaving 1 CPU for the system), I got these results:

| Server, library or framework | JMH Score (ops/s) | Min JMH Score (ops/s) | Max JMH Score (ops/s) |
|---|---|---|---|
| Netty | 25598 | 24251 | 26944 |
| microhttp | 25230 | 22878 | 27582 |
| Vert.x | 23316 | 22447 | 24186 |
| Undertow | 23056 | 22583 | 23529 |
| Javalin | 23025 | 22082 | 23968 |
| Micronaut | 18354 | 18097 | 18611 |
| Quarkus | 17429 | 17172 | 17686 |
| Spring Boot WebFlux | 16549 | 15965 | 17134 |
| Quarkus (in native mode) | 11245 | 10812 | 11678 |
| Spring Boot MVC | 11113 | 10842 | 11384 |
| NanoHTTPD | 5297 | 5102 | 5493 |

The article https://evitadb.io/blog/03-choosing-http-server was updated.

@franz1981
Author

Thanks @novoj ❤️
I will give it another shot next week and create an issue to update it again when the Quarkus GraphQL improvements land (soon)

@novoj
Contributor

novoj commented Jun 18, 2023

I feel pretty ashamed that we made so many mistakes in the perf tests. Thanks for the guidance.
I'll be happy to retest the results with the new Quarkus version.

@franz1981
Author

Hey, you are doing this in your free time (today is Sunday!!!), driven by interest and passion, so don't be ashamed, really.
Only those who try hard fail and then succeed - and with benchmarking, succeeding is just a temporary state anyway :P
