Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance fixes based on hotspot analysis in load tests #11148

Closed
wants to merge 2 commits into from
Closed

Performance fixes based on hotspot analysis in load tests #11148

wants to merge 2 commits into from

Conversation

bcluap
Copy link

@bcluap bcluap commented Aug 2, 2020

These 2 changes result from hotspot analysis of a load test using Jaeger extension and lots of Application scoped beans. The changes result in both the .equals and String.format falling off the hotspot list and improve throughput by more than 1%.

@boring-cyborg boring-cyborg bot added area/arc Issue related to ARC (dependency injection) area/jaeger labels Aug 2, 2020
Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks interesting. Could you rebase? There is a first commit that looks weird.

@gsmet gsmet requested a review from mkouba August 2, 2020 21:37
@@ -116,6 +116,10 @@ public boolean equals(Object obj) {
return false;
}
Key other = (Key) obj;
// Shortcut removes hotspot on contextual.equals
if (contextual == other.contextual) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, at the moment we don't implement equals()/hashCode() for generated bean classes so !contextual.equals(other.contextual) should be translated to !(contextual == other.contextual). In other words, this modification could save one java.lang.Object.equals(Object) invocation (which is very likely negligible although it removes the equals method from the hot path). Long.toHexString() saves probably a lot more because String.format() creates a formatter object, parses the string, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkouba what's you're saying is we should drop the equals() call altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. I'd like to keep it. The fact that we don't implement it now does not mean we'll never need to implement it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. But if it ended up being a hot spot for Paul, there's something weird going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to say without the app and test sources. CC @bcluap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I just saw that the profiler used is the one from VisualVM, it inject some bytecode and can generates deoptimization.
That's why it can misleading you to such kind of hotspot.
When working on lowlevel optimizations (and avoiding a method call is very low level), you should use low level stuff like async-profiler, we have a page that expalain how to use it: https://github.com/quarkusio/quarkus/blob/master/TROUBLESHOOTING.md

So again, the hotspot for the equals method call is certainly a profiler artefact not a real hotspot.
And 1% difference between two load testing run is very slight so can go into the recording error range.
Such small performance enhancements should be validated via microbenchmarking using JMH ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such small performance enhancements should be validated via microbenchmarking using JMH ...

Yes, we talked about this in the comments below, e.g. #11148 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have many possible concrete types implementing Contextual, the dispatch to find the right equals implementation becomes a very expensive megamorphic call.
+1 for the shortcut

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above we removed the need for equals() from the hot path completely...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above we removed the need for equals() from the hot path completely...

Sure that's even better. I just meant to suggest a possible explanation for the equals to be - in some contexts - really not that efficient. Removing the field is even better!

@bcluap
Copy link
Author

bcluap commented Aug 3, 2020 via email

@mkouba
Copy link
Contributor

mkouba commented Aug 3, 2020

It was being called 2.5m times per second...
The change resulted in it coming off the hot list.

Sure, because the method is not invoked anymore. It's replaced with the if_acmpne instruction which is definitely faster. What happens if you remove this optimization? Can you still see some improvement?

@gsmet
Copy link
Member

gsmet commented Aug 3, 2020

Hmmmm. Is it normal that this method is called so many times?

@mkouba
Copy link
Contributor

mkouba commented Aug 3, 2020

Hmmmm. Is it normal that this method is called so many times?

It depends. It's a load test anyway. But every time you invoke a method upon a client proxy we have to do a lookup, and in this case create a new key and invoke Map.get().

@bcluap
Copy link
Author

bcluap commented Aug 3, 2020 via email

@mkouba
Copy link
Contributor

mkouba commented Aug 3, 2020

This was the analysis I found. This was run now without the change:

I can't see the attached images...

@bcluap
Copy link
Author

bcluap commented Aug 3, 2020

image (3)
image (2)

@bcluap
Copy link
Author

bcluap commented Aug 3, 2020

The reason the .equals is called so much is due to traces like this:
at io.quarkus.arc.impl.AbstractSharedContext$Key.equals(AbstractSharedContext.java:118)
at java.base/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:940)
at io.quarkus.arc.impl.ComputingCache.getValueIfPresent(ComputingCache.java:45)
at io.quarkus.arc.impl.AbstractSharedContext.get(AbstractSharedContext.java:34)
at guru.jini.arch.impl.aaa.RoleBasedGateKeeper_ClientProxy.arc$delegate(RoleBasedGateKeeper_ClientProxy.zig:70)
at guru.jini.arch.impl.aaa.RoleBasedGateKeeper_ClientProxy.checkPermissions(RoleBasedGateKeeper_ClientProxy.zig:316)

Every method call to a managed bean (app scoped or request scoped) results in a lookup in the concurrenthashmap to find the correct delegate.

We have a large number of helpers in our architecture which use other helpers and these are all ApplicationScoped so they can be easily injected when needed. As a call jumps around a bit and uses persistence and other things these quickly add up and can end up using say 200 managed beans in a complex call.

I've subsequently changed them all to Singletons to avoid the proxying but it's probably still worth trying to optimise the resolution of beans as much as possible.

@mkouba
Copy link
Contributor

mkouba commented Aug 3, 2020

I think that we need more information. These two snapshots don't seem to be comparable. The total time of the first one is 79 732 ms and the total time of the second one is 114 359 ms. Also after the change, the AbstractSharedContext.Key.equals(Object) with 0.5% disappears but HashMapLocalCache_ClientProxy.get() comes up with 0.2%. One snapshot contains VertxHttpHeaders.getAll(CharSequence) with 0.1% but the other one only contains VertxHttpHeaders.get0(CharSequence) with 0.3%...

However, I really want to find the problem. So I'm going to update the ClientProxyInvocationBenchmark microbenchmark to see how your modification improves the throughput.

Maybe we could find even more optimizations. In general, we could skip the lookup for @ApplicationScoped beans completely but then we would have to "clear the cached instance" if a bean is destroyed (rare use case but legal and possible) to fulfill the spec requirements.

@gsmet
Copy link
Member

gsmet commented Aug 3, 2020

Maybe we can limit the PR to the change in the Jaeger extension and get this part merged and backported?

Then you can pursue your discussions on the rest?

@mkouba
Copy link
Contributor

mkouba commented Aug 3, 2020

Maybe we can limit the PR to the change in the Jaeger extension and get this part merged and backported?

+1

Then you can pursue your discussions on the rest?

I wonder if it could be a problem of hash collisions and a large number of application scoped beans (so that equals() is called many times to find the correct bean instance). I'll prepare a branch to test this theory...

@bcluap
Copy link
Author

bcluap commented Aug 3, 2020 via email

@mkouba
Copy link
Contributor

mkouba commented Aug 4, 2020

Hm, nanosecond improvements are often hard to prove unless you use a proper microbenchmark tool such as jmh properly and interpret the results correctly (which is not trivial ;-). That said, I don't question your results I just wanted to note that microbenchmarks are tricky.

In any case, I've prepared a branch where we get rid of the Key class completely and rely directly on bean identifiers:
https://github.com/mkouba/quarkus/tree/shared-context-opt

ClientProxyInvocationBenchmark shows 2-7% improvement compared to v1.6.1. (note that we talk about ~ 100 millions invocations per second and app context with ~ 200 beans).

@bcluap It would be great if you could give it a try with your load test... Could you also modify this PR to only include the Jaeger change? Thanks!

@gsmet
Copy link
Member

gsmet commented Aug 4, 2020

Yes, please let's get this PR only about the Jaeger change (and remove the other commit too). That way I can backport that one right away to 1.7.

@bcluap
Copy link
Author

bcluap commented Aug 4, 2020

image

Unfortunately the changes slowed things down from around 20500 tps using the master version to 19500 tps using your version. Now there is a hotspot on the get itself as per attached picture.

@bcluap
Copy link
Author

bcluap commented Aug 4, 2020

I created a new PR for Jaeger: #11197

@mkouba
Copy link
Contributor

mkouba commented Aug 4, 2020

@bcluap Thanks for the PR.

Unfortunately the changes slowed things down...

That's funny. I'm getting curious about what exactly does your load test do ;-). You're talking about tps? Does it mean "transactions per second". If so what kind of transactions are involved?

Also what kind of tool/profiler are you using?

Now there is a hotspot on the get itself...

That kind of makes sense because it's called many many times...

@gsmet gsmet added the triage/on-ice Frozen until external concerns are resolved label Aug 4, 2020
@bcluap
Copy link
Author

bcluap commented Aug 4, 2020 via email

@bcluap
Copy link
Author

bcluap commented Aug 4, 2020 via email

@mkouba
Copy link
Contributor

mkouba commented Aug 5, 2020

On my large test project, however, Martins is very slightly slower. Weird.

@bcluap That's hard to guess. We would have to analyze your application to see which parts are involved. I suppose it's not possible to share your test project, right?

@loicmathieu
Copy link
Contributor

@bcluap please use async-profiler if possible, or Java Mission Control.
You can follow our guide here: https://github.com/quarkusio/quarkus/blob/master/TROUBLESHOOTING.md
Other profilers have sampling bias that can mess with low level optimizations.

@bcluap
Copy link
Author

bcluap commented Aug 5, 2020 via email

@gsmet
Copy link
Member

gsmet commented Aug 7, 2020

I'm closing this one. We can iterate in subsequent PRs if we find room for other optimizations.

@gsmet gsmet closed this Aug 7, 2020
@gsmet gsmet added triage/out-of-date This issue/PR is no longer valid or relevant and removed triage/on-ice Frozen until external concerns are resolved labels Aug 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/arc Issue related to ARC (dependency injection) area/jaeger triage/out-of-date This issue/PR is no longer valid or relevant
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants