Performance fixes based on hotspot analysis in load tests #11148

bcluap · 2020-08-02T19:58:58Z

These 2 changes result from hotspot analysis of a load test using Jaeger extension and lots of Application scoped beans. The changes result in both the .equals and String.format falling off the hotspot list and improve throughput by more than 1%.
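The Jaeger part of the change is not shown in this thread, so the following is only a minimal sketch of the kind of replacement the review discussion mentions (`String.format` versus `Long.toHexString`). The class name `HexIdSketch` and the unpadded `"%x"` pattern are assumptions made for illustration, not taken from the actual diff.

```java
final class HexIdSketch {

    // Slower path: every call creates a Formatter and parses the pattern string.
    static String viaFormat(long id) {
        return String.format("%x", id);
    }

    // Faster path: direct conversion with no pattern parsing.
    static String viaToHexString(long id) {
        return Long.toHexString(id);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 255L, Long.MAX_VALUE, -1L};
        for (long id : samples) {
            // Both columns should show the same hex text for each sample.
            System.out.println(viaFormat(id) + " " + viaToHexString(id));
        }
    }
}
```

If the real code pads the value (for example with a width in the pattern), the two calls are no longer interchangeable and the padding would have to be done separately.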

Sync from master

gsmet

Looks interesting. Could you rebase? There is a first commit that looks weird.

mkouba · 2020-08-03T10:14:57Z

independent-projects/arc/runtime/src/main/java/io/quarkus/arc/impl/AbstractSharedContext.java

@@ -116,6 +116,10 @@ public boolean equals(Object obj) {
                return false;
            }
            Key other = (Key) obj;
+            // Shortcut removes hotspot on contextual.equals
+            if (contextual == other.contextual) {


Hm, at the moment we don't implement equals()/hashCode() for generated bean classes so !contextual.equals(other.contextual) should be translated to !(contextual == other.contextual). In other words, this modification could save one java.lang.Object.equals(Object) invocation (which is very likely negligible although it removes the equals method from the hot path). Long.toHexString() saves probably a lot more because String.format() creates a formatter object, parses the string, etc.
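To make the comment above concrete, here is a minimal, self-contained sketch of a key class with the proposed identity shortcut. `KeyEqualsSketch` and its nested `Key` are stand-ins written for this illustration; the real `AbstractSharedContext.Key` wraps a `Contextual`, which is replaced by a plain `Object` here.

```java
import java.util.Objects;

final class KeyEqualsSketch {

    // Stand-in for AbstractSharedContext.Key (assumption: simplified to wrap any Object).
    static final class Key {
        final Object contextual;

        Key(Object contextual) {
            this.contextual = Objects.requireNonNull(contextual);
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof Key)) {
                return false;
            }
            Key other = (Key) obj;
            // Proposed shortcut: a plain reference comparison before the virtual call.
            if (contextual == other.contextual) {
                return true;
            }
            // Fallback kept so that a future equals() on bean classes still works.
            return contextual.equals(other.contextual);
        }

        @Override
        public int hashCode() {
            return contextual.hashCode();
        }
    }

    public static void main(String[] args) {
        Object bean = new Object();
        System.out.println(new Key(bean).equals(new Key(bean)));         // same bean instance
        System.out.println(new Key(bean).equals(new Key(new Object()))); // different beans
    }
}
```

When the wrapped class does not override `equals()`, the fallback reduces to the same reference comparison, which is why the saving is expected to be a single `Object.equals(Object)` invocation.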

@mkouba what you're saying is we should drop the equals() call altogether?

Nope. I'd like to keep it. The fact that we don't implement it now does not mean we'll never need to implement it.

OK. But if it ended up being a hot spot for Paul, there's something weird going on.

It's hard to say without the app and test sources. CC @bcluap

So, I just saw that the profiler used is the one from VisualVM. It injects bytecode and can trigger deoptimization, which is why it can mislead you toward this kind of hotspot.
When working on low-level optimizations (and avoiding a method call is very low level), you should use low-level tools like async-profiler. We have a page that explains how to use it: https://github.com/quarkusio/quarkus/blob/master/TROUBLESHOOTING.md

So again, the hotspot for the equals method call is almost certainly a profiler artifact, not a real hotspot.
And a 1% difference between two load test runs is very small, so it can fall within measurement error.
Such small performance enhancements should be validated via microbenchmarking using JMH ...

Such small performance enhancements should be validated via microbenchmarking using JMH ...

Yes, we talked about this in the comments below, e.g. #11148 (comment).

If you have many possible concrete types implementing Contextual, the dispatch to find the right equals implementation becomes a very expensive megamorphic call.
+1 for the shortcut

As mentioned above we removed the need for equals() from the hot path completely...

As mentioned above we removed the need for equals() from the hot path completely...

Sure that's even better. I just meant to suggest a possible explanation for the equals to be - in some contexts - really not that efficient. Removing the field is even better!

bcluap · 2020-08-03T10:19:56Z

Yea that .equals one was a bit weird but it popped up as the hottest method under io.quarkus in my tests. It was being called 2.5m times per second... The change resulted in it coming off the hot list.


mkouba · 2020-08-03T10:51:38Z

It was being called 2.5m times per second...
The change resulted in it coming off the hot list.

Sure, because the method is not invoked anymore. It's replaced with the if_acmpne instruction which is definitely faster. What happens if you remove this optimization? Can you still see some improvement?

gsmet · 2020-08-03T11:59:13Z

Hmmmm. Is it normal that this method is called so many times?

mkouba · 2020-08-03T12:04:16Z

Hmmmm. Is it normal that this method is called so many times?

It depends. It's a load test anyway. But every time you invoke a method upon a client proxy we have to do a lookup, and in this case create a new key and invoke Map.get().

bcluap · 2020-08-03T12:27:24Z

This was the analysis I found. This was run now without the change:

[image: image.png]

And with the change:

[image: image.png]

I can't understand why it makes a big enough impact to take AbstractSharedContext$Key.equals off the list, as it is still called and is just one hop away from the bean's default .equals, but it does make a difference, both in my test throughput and in the hotspot analysis.

mkouba · 2020-08-03T12:36:46Z

This was the analysis I found. This was run now without the change:

I can't see the attached images...

bcluap · 2020-08-03T12:41:39Z

bcluap · 2020-08-03T12:57:09Z

The reason the .equals is called so much is due to traces like this:
at io.quarkus.arc.impl.AbstractSharedContext$Key.equals(AbstractSharedContext.java:118)
at java.base/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:940)
at io.quarkus.arc.impl.ComputingCache.getValueIfPresent(ComputingCache.java:45)
at io.quarkus.arc.impl.AbstractSharedContext.get(AbstractSharedContext.java:34)
at guru.jini.arch.impl.aaa.RoleBasedGateKeeper_ClientProxy.arc$delegate(RoleBasedGateKeeper_ClientProxy.zig:70)
at guru.jini.arch.impl.aaa.RoleBasedGateKeeper_ClientProxy.checkPermissions(RoleBasedGateKeeper_ClientProxy.zig:316)

Every method call to a managed bean (app scoped or request scoped) results in a lookup in the ConcurrentHashMap to find the correct delegate.
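The lookup-per-call pattern described above can be sketched roughly as follows. This is an illustration only, not ArC's generated code: `ProxyLookupSketch`, `SharedContext`, `GateKeeper` and the lookup counter are hypothetical names, and the real context goes through `ComputingCache` rather than calling `computeIfAbsent` directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class ProxyLookupSketch {

    // Wrapper key, allocated again for every lookup.
    static final class Key {
        final Object contextual;
        Key(Object contextual) { this.contextual = contextual; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && contextual.equals(((Key) o).contextual);
        }
        @Override public int hashCode() { return contextual.hashCode(); }
    }

    static final class SharedContext {
        private final Map<Key, Object> instances = new ConcurrentHashMap<>();
        final AtomicLong lookups = new AtomicLong(); // instrumentation for the sketch only

        @SuppressWarnings("unchecked")
        <T> T get(Object contextual, Supplier<T> creator) {
            lookups.incrementAndGet();
            // New Key per call; the map then has to call Key.equals() to match it.
            return (T) instances.computeIfAbsent(new Key(contextual), k -> creator.get());
        }
    }

    interface GateKeeper { boolean checkPermissions(String user); }

    static final class GateKeeperImpl implements GateKeeper {
        public boolean checkPermissions(String user) { return !user.isEmpty(); }
    }

    // Hand-written stand-in for a generated client proxy.
    static final class GateKeeperClientProxy implements GateKeeper {
        private static final Object BEAN = new Object(); // stands in for the bean metadata
        private final SharedContext context;
        GateKeeperClientProxy(SharedContext context) { this.context = context; }

        private GateKeeper delegate() {
            return context.get(BEAN, GateKeeperImpl::new);
        }

        public boolean checkPermissions(String user) {
            return delegate().checkPermissions(user); // one context lookup per business call
        }
    }
}
```

With around 200 proxied bean calls per request, as described in this comment, the context is consulted around 200 times per request in this model.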

We have a large number of helpers in our architecture which use other helpers and these are all ApplicationScoped so they can be easily injected when needed. As a call jumps around a bit and uses persistence and other things these quickly add up and can end up using say 200 managed beans in a complex call.

I've subsequently changed them all to Singletons to avoid the proxying but it's probably still worth trying to optimise the resolution of beans as much as possible.

mkouba · 2020-08-03T13:30:46Z

I think that we need more information. These two snapshots don't seem to be comparable. The total time of the first one is 79 732 ms and the total time of the second one is 114 359 ms. Also after the change, the AbstractSharedContext.Key.equals(Object) with 0.5% disappears but HashMapLocalCache_ClientProxy.get() comes up with 0.2%. One snapshot contains VertxHttpHeaders.getAll(CharSequence) with 0.1% but the other one only contains VertxHttpHeaders.get0(CharSequence) with 0.3%...

However, I really want to find the problem. So I'm going to update the ClientProxyInvocationBenchmark microbenchmark to see how your modification improves the throughput.

Maybe we could find even more optimizations. In general, we could skip the lookup for @ApplicationScoped beans completely but then we would have to "clear the cached instance" if a bean is destroyed (rare use case but legal and possible) to fulfill the spec requirements.
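One way to picture the idea in the last paragraph is a proxy that remembers its delegate and forgets it again when the bean is destroyed. This is a sketch under assumptions: `CachingProxy`, `delegate()` and `onBeanDestroyed()` are hypothetical names, and nothing here reflects how ArC actually implements or plans to implement it.

```java
import java.util.function.Supplier;

final class CachedDelegateSketch {

    static final class CachingProxy<T> {
        private final Supplier<T> contextLookup; // the normal (slower) context lookup
        private volatile T cached;               // cleared when the bean is destroyed

        CachingProxy(Supplier<T> contextLookup) {
            this.contextLookup = contextLookup;
        }

        T delegate() {
            T local = cached;
            if (local == null) {
                // A race here is harmless as long as the context itself
                // always hands out the same instance for the bean.
                local = contextLookup.get();
                cached = local;
            }
            return local;
        }

        // Must be called when the bean instance is destroyed, otherwise the proxy
        // would keep serving a stale instance.
        void onBeanDestroyed() {
            cached = null;
        }
    }
}
```

The cost of this design is the extra bookkeeping on destruction: every proxy holding a cached reference has to be notified, which is the "clear the cached instance" step mentioned above.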

gsmet · 2020-08-03T14:37:43Z

Maybe we can limit the PR to the change in the Jaeger extension and get this part merged and backported?

Then you can pursue your discussions on the rest?

mkouba · 2020-08-03T15:03:10Z

Maybe we can limit the PR to the change in the Jaeger extension and get this part merged and backported?

+1

Then you can pursue your discussions on the rest?

I wonder if it could be a problem of hash collisions and a large number of application scoped beans (so that equals() is called many times to find the correct bean instance). I'll prepare a branch to test this theory...
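The hash-collision theory can be illustrated with a small experiment that counts `equals()` calls during lookups. The sketch below uses a hypothetical `Key` whose hash code can be forced to collide; it demonstrates the mechanism only and makes no claim about what actually happened in the load test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CollisionSketch {

    static long equalsCalls = 0; // single-threaded counter, good enough for the sketch

    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int hashCode() {
            return collide ? 42 : id; // 42 for every key forces all entries into one bin
        }
    }

    // Returns how many times equals() ran while looking up every bean once.
    static long countEqualsCalls(int beans, boolean collide) {
        Map<Key, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < beans; i++) {
            map.put(new Key(i, collide), "bean" + i);
        }
        equalsCalls = 0; // ignore the equals() calls made while populating the map
        for (int i = 0; i < beans; i++) {
            // A fresh key per lookup, so the map cannot match by reference.
            if (map.get(new Key(i, collide)) == null) {
                throw new IllegalStateException("missing bean " + i);
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        System.out.println("distinct hashes:  " + countEqualsCalls(200, false));
        System.out.println("colliding hashes: " + countEqualsCalls(200, true));
    }
}
```

If collisions were the cause, one would expect `equals()` to return `false` often, which is a cheap thing to check in the running application.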

bcluap · 2020-08-03T15:46:04Z

I did a simple test for that and found very few occasions where the method returned false, so I don't think it's that. I did a micro benchmark of sorts in the running load test, using System.nanoTime timings around the body of the function. With the change commented out, the average of 1000 random results was 55ns. With it present it was 26ns. So if we are doing, say, 200 bean calls for 1 business method, then that's about 0.0058ms additional. In my test, each business method has a latency of 0.27ms on average, so 0.0058ms is about 2% of 0.27ms, which aligns with my findings.

    long start = System.nanoTime();
    try {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (!(obj instanceof Key)) {
            return false;
        }
        Key other = (Key) obj;
        // if (contextual == other.contextual) {
        //     return true;
        // }
        if (!contextual.equals(other.contextual)) {
            return false;
        }
        return true;
    } finally {
        long latency = System.nanoTime() - start;
        if (System.currentTimeMillis() % 100 == 0) {
            System.out.println(latency);
        }
    }

Paul Carter-Brown
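The arithmetic in this comment can be written out as a short calculation. The figures (55ns, 26ns, 200 bean calls, 0.27ms) are the ones quoted above; the class and method names are made up for this sketch.

```java
final class LatencyEstimateSketch {

    // Extra time per request, in milliseconds, caused by the slower equals() path.
    static double extraMillisPerRequest(double nanosWithout, double nanosWith, int beanCalls) {
        return (nanosWithout - nanosWith) * beanCalls / 1_000_000.0;
    }

    // Fraction of the request latency that the extra time represents.
    static double shareOfLatency(double extraMillis, double requestLatencyMillis) {
        return extraMillis / requestLatencyMillis;
    }

    public static void main(String[] args) {
        double extra = extraMillisPerRequest(55, 26, 200); // (55 - 26) ns * 200 calls
        double share = shareOfLatency(extra, 0.27);
        System.out.printf("extra per request: %.4f ms, share of latency: %.1f%%%n",
                extra, share * 100);
    }
}
```

The estimate is only as good as the 55ns and 26ns inputs, which come from `System.nanoTime()` calls wrapped around a very short method.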


mkouba · 2020-08-04T12:26:21Z

Hm, nanosecond improvements are often hard to prove unless you use a proper microbenchmark tool such as JMH, use it properly, and interpret the results correctly (which is not trivial ;-). That said, I don't question your results; I just wanted to note that microbenchmarks are tricky.

In any case, I've prepared a branch where we get rid of the Key class completely and rely directly on bean identifiers:
https://github.com/mkouba/quarkus/tree/shared-context-opt
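The linked branch is not reproduced here, so the following is only a guess at the general shape of "no Key class, bean identifiers as map keys". `IdentifierKeySketch`, `Bean` and `SharedContext` are hypothetical names for this illustration and should not be read as the branch's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdentifierKeySketch {

    // Hypothetical bean descriptor; only the identifier and a factory matter here.
    static final class Bean<T> {
        final String identifier;
        final Supplier<T> creator;

        Bean(String identifier, Supplier<T> creator) {
            this.identifier = identifier;
            this.creator = creator;
        }
    }

    static final class SharedContext {
        private final Map<String, Object> instances = new ConcurrentHashMap<>();

        @SuppressWarnings("unchecked")
        <T> T get(Bean<T> bean) {
            // The identifier is used directly as the key: no wrapper allocation,
            // and no call into a bean-specific equals() implementation.
            Object existing = instances.get(bean.identifier);
            if (existing == null) {
                existing = instances.computeIfAbsent(bean.identifier, id -> bean.creator.get());
            }
            return (T) existing;
        }
    }
}
```

A `String` key caches its hash code after the first computation, and `String.equals` is a single well-known implementation, which may be part of why such a layout could help. That reasoning is an assumption, not something stated in the thread.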

ClientProxyInvocationBenchmark shows a 2-7% improvement compared to v1.6.1 (note that we are talking about ~100 million invocations per second and an app context with ~200 beans).

@bcluap It would be great if you could give it a try with your load test... Could you also modify this PR to only include the Jaeger change? Thanks!

gsmet · 2020-08-04T12:56:32Z

Yes, please let's get this PR only about the Jaeger change (and remove the other commit too). That way I can backport that one right away to 1.7.

bcluap · 2020-08-04T13:20:32Z

Unfortunately the changes slowed things down from around 20500 tps using the master version to 19500 tps using your version. Now there is a hotspot on the get itself as per attached picture.

bcluap · 2020-08-04T13:49:22Z

I created a new PR for Jaeger: #11197

mkouba · 2020-08-04T14:31:46Z

@bcluap Thanks for the PR.

Unfortunately the changes slowed things down...

That's funny. I'm getting curious about what exactly your load test does ;-). You're talking about tps? Does it mean "transactions per second"? If so, what kind of transactions are involved?

Also what kind of tool/profiler are you using?

Now there is a hotspot on the get itself...

That kind of makes sense because it's called many many times...

bcluap · 2020-08-04T14:35:13Z

Hi Martin, I'm actually busy creating a simple test case which can be called with ab so you can see what I see. Paul


bcluap · 2020-08-04T17:12:32Z

I'm battling to explain what I'm seeing. I'll answer your questions, but here is a test project which forces a number of bean calls per REST request: https://github.com/bcluap/quarkus-examples

Run the project in there and hit it with:

    ab -k -c100 -t10 -n100000000 http://localhost:8080/test1/go\?beanCalls\=200

Repeat a number of times and get an average. Try with different implementations of the AbstractSharedContext and see. For this project, Martin's change gives about a 10% increase in throughput (requests per second) compared to the default in master. On my large test project, however, Martin's is very slightly slower. Weird.

Paul

On Tue, Aug 4, 2020 at 6:35 PM Loïc Mathieu commented on this pull request, in independent-projects/arc/runtime/src/main/java/io/quarkus/arc/impl/AbstractSharedContext.java:

In JVM mode, JIT inlining will make the code path the same after no more than 1000 invocations (the short method threshold), so this cannot be a hotspot. In native mode, I don't know if there is any inlining. Anyway, a method call is very cheap, so it may be a profiling artifact. Some profilers inject bytecode, so inlining doesn't happen and we see this kind of false hotspot. This kind of hack can go wrong at some point, so I suggest dropping it. Side question: which profiler was used? async-profiler doesn't inject bytecode, so it may not show this line as a hotspot.

mkouba · 2020-08-05T07:01:06Z

On my large test project, however, Martin's is very slightly slower. Weird.

@bcluap That's hard to guess. We would have to analyze your application to see which parts are involved. I suppose it's not possible to share your test project, right?

loicmathieu · 2020-08-05T08:39:11Z

@bcluap please use async-profiler if possible, or Java Mission Control.
You can follow our guide here: https://github.com/quarkusio/quarkus/blob/master/TROUBLESHOOTING.md
Other profilers have sampling bias that can mess with low level optimizations.

bcluap · 2020-08-05T10:18:35Z

Thanks for the feedback. I did some JMH profiling last night and can confirm that Martin's changes do improve the bean lookups quite a lot. I also did some more load tests and can see that we are working within the margin of error of the tests, so it's very difficult to establish that they actually slow my test down, as the results are not perfectly consistent.

You are spot on that VisualVM is not ideal for micro-optimisations, but as a quick first pass it does pick up a lot of issues. At first, I thought that maybe the beans had a more complex .equals implementation, in which case the change I suggested would have made perfect sense. But seeing that the change actually just avoids one small method call, it will likely be inlined as you say.

Paul


gsmet · 2020-08-07T09:14:44Z

I'm closing this one. We can iterate in subsequent PRs if we find room for other optimizations.

Merge pull request #1 from quarkusio/master

37f534b

Sync from master

boring-cyborg bot added area/arc Issue related to ARC (dependency injection) area/jaeger labels Aug 2, 2020

gsmet requested changes Aug 2, 2020

gsmet requested a review from mkouba August 2, 2020 21:37

Performance improvements

fcbab4e

mkouba reviewed Aug 3, 2020

gsmet added the triage/on-ice Frozen until external concerns are resolved label Aug 4, 2020

mkouba mentioned this pull request Aug 5, 2020

ArC - cleanup and optimizations #11217

Merged

gsmet closed this Aug 7, 2020

gsmet added triage/out-of-date This issue/PR is no longer valid or relevant and removed triage/on-ice Frozen until external concerns are resolved labels Aug 7, 2020
