JAXP-intensive workload on OpenJ9 seems to be about 5% slower than HotSpot #6642
Comments
What is the setting for transparent huge pages on the machine in question? Testing with transparent huge pages set to always may help performance.
I'll try
@andrewcraik Re-ran test 9 with
FYI, memory in the container:
@kgibm that is still a 5% improvement if I am reading this correctly - e.g. comparing 12 to 1, which is the effect of the change?
Note that OpenJ9 plans to better support madvise once we have investigated #6156, which is currently targeted to OpenJ9 0.16.
@andrewcraik The last column is the relative difference. So OpenJ9 is 5% slower than HotSpot. I haven't been comparing tests to each other - I'm just comparing the J9 experimental change to HotSpot baseline. (Since the HotSpot baseline test is always the same, we could compare all the tests together, but I avoid that because I'd rather compare tests run at approximately the same time to each other to reduce any time-based variability). I treat each test independently: three runs for OpenJ9, three runs for HotSpot, take the average of each and then calculate the relative difference:
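In code form, that comparison amounts to something like the sketch below (the run times are made up and the spreadsheet's exact formula is not reproduced here; names are illustrative):

```java
import java.util.Arrays;

// Illustrative only: averages the three wall-clock times per JVM and reports the
// relative difference against the HotSpot baseline (positive = OpenJ9 slower).
public class RelativeDiff {
    static double mean(double[] runs) {
        return Arrays.stream(runs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] openj9Runs  = {52.1, 52.4, 51.9};   // hypothetical seconds
        double[] hotspotRuns = {49.6, 49.8, 49.5};   // hypothetical seconds
        double relDiff = (mean(openj9Runs) - mean(hotspotRuns)) / mean(hotspotRuns);
        System.out.printf("Relative difference: %.1f%%%n", relDiff * 100);
    }
}
```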
@kgibm right, what I am saying is that, taking the baseline configuration, the delta was 10% (e.g. configuration 1). Changing the transparent huge page setting to always narrowed that gap to 5% - e.g. OpenJ9 is still slower, but less so than in configuration 1. That performance improvement is expected to be delivered in OpenJ9 for 0.16. There is a 5% delta still to be investigated. I am interested to see that -Xverify:none had an impact - there was and is work being done to improve the performance of the verification logic; some of the gains from that will likely also be available in 0.16 - FYI @DanHeidinga since he runs the team of folks who are helping with that work.
FYI @mpirvu due to the AOT numbers above which seem 'interesting'
@andrewcraik Oh I see what you mean. This
BTW, test ideas 2-9 were just random ideas I grabbed from the air :) Just trying to be thorough before opening the issue.
@andrewcraik I ran test 1 with I also re-did the whole spreadsheet to make it clear which test each run is based on. I also added a t-test column (two-tailed, heteroscedastic) and most of them are
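For reference, a two-tailed heteroscedastic t-test is the Welch form; a minimal sketch of the statistic behind such a spreadsheet column is below (sample values are made up, and the two-tailed p-value would come from a t-distribution CDF, which spreadsheet functions compute directly):

```java
import java.util.Arrays;

// Sketch of the Welch (unequal-variance) t statistic behind a two-tailed,
// heteroscedastic t-test; the sample values are hypothetical.
public class WelchTTest {
    static double mean(double[] x) { return Arrays.stream(x).average().orElse(Double.NaN); }

    static double sampleVariance(double[] x) {
        double m = mean(x), ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    public static void main(String[] args) {
        double[] a = {52.1, 52.4, 51.9};  // e.g. OpenJ9 runs (made up)
        double[] b = {49.6, 49.8, 49.5};  // e.g. HotSpot runs (made up)
        double va = sampleVariance(a) / a.length, vb = sampleVariance(b) / b.length;
        double t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
        // Welch-Satterthwaite degrees of freedom; |t| and df then go into a
        // t-distribution CDF to obtain the two-tailed p-value.
        double df = Math.pow(va + vb, 2)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
        System.out.printf("t = %.3f, df = %.2f%n", t, df);
    }
}
```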
@kgibm thanks for the clarification - I am surprised about the overheads seemingly caused by -Xtrace:none -Xverify:none. Would it be possible to get a perf record of configs 1 and 9 during iteration 3 on your setup with -Xjit:perfTool specified? This will show how time is distributed between the various JVM components etc.
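For anyone following along, a typical way to collect such a profile looks roughly like the following; the -D JAXP properties from the issue's command line are omitted for brevity, the output file name is arbitrary, and -Xjit:perfTool is intended to let perf attribute samples to JIT-compiled methods:

```
# Illustrative only: heap options and workload size are taken from the issue's own
# command line; the perf flags and output file name here are assumptions.
perf record -g -o jaxp-config1.data \
    java -Xjit:perfTool -Xms1024m -Xmx1024m -jar jaxpperformance.jar 100000
perf report -i jaxp-config1.data
```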
FYI @vijaysun-omr due to general interest in perf deltas.
@andrewcraik I was worried somebody would ask for perf :) Unfortunately, Docker (at least on Mac) doesn't support perf. I tried running the container with
Note also that I'm traveling to another customer tomorrow for a week so I'll be mostly unavailable, but I've attached the standalone program in the original issue description if others want to try to reproduce locally.
Also, you already mentioned this, but just for others that might miss this, it would certainly be interesting to understand why
AOT is supposed to help only when many methods need to be compiled in a short period of time. This benchmark runs for 50+ seconds and I don't think that the JAXP benchmark has that many methods to compile.
I think understanding the -Xtrace:none and -Xverify:none issues will help inform where the rest of the delta is coming from.
@kgibm To continue the discussion of the OpenJ9 problem - I have access to a Linux perf setup so I'll see if I can get the test case running there. I'll see if I can reproduce the same kind of gaps as you see in your container and see if a profile shows anything useful. If you manage to get any profiling data or find out anything else do let me know, since there seem to be some strange things in play.
So I have the test running on my setup and I can see a delta between HotSpot and OpenJ9 even on bare metal (I removed the container support option etc so I'm just looking at the native machine performance to simplify things a bit). I can also see that -Xverify:none and -Xtrace:none are able to save about 2/3 of the gap I see. The profile does show hot and scorching compilations. The hottest method seems to be StringBuilder's ensureCapacityImpl so there is a lot of string building going on (makes sense based on the code). Going to dig a bit more at what is going on etc. Still some mysteries to solve.
Great to hear that it's reproducible, thanks Andrew!
Has it been 7 days already? We had a couple of days off here in Canada so I guess it has... Anyway I have still been looking at this to try and figure out what is going on. I've been doing some experiments with:
So some configurations and some notes - all runs are averages of 3 on a Skylake Linux perf box. I'm pinning to 4 cores.
The fastest configuration in my testing so far is a latest JVM with shareclasses enabled and
There remains about 5% to the HotSpot performance on my setup. I need to now build some test JITs to test the effect of enabling nextGenHCR during startup as well as the new JProfiling implementation. I did do a quick check of the end to end profile for the benchmark. There is a non-trivial amount of time in the bytecode interpreter loop (a bit under 1/3 of the profile). There is a reasonable amount of time consumed by JIT compilation and the bulk of the remainder is in JIT code. Top method looked to be StringBuilder.append (hence the constant arraylength optimizations above).
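For readers unfamiliar with the option, "shareclasses enabled" refers to OpenJ9's -Xshareclasses shared classes cache (which also lets AOT-compiled code be reused across runs); a generic invocation looks something like the sketch below - the cache name and size are arbitrary, not the configuration used above:

```
# Illustrative only: enables OpenJ9's shared classes cache (and with it reuse of
# AOT-compiled code across runs); the cache name and -Xscmx size are arbitrary.
java -Xshareclasses:name=jaxpbench -Xscmx80m -Xms1024m -Xmx1024m \
    -jar jaxpperformance.jar 100000
```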
@andrewcraik Can you also do a run with
So I have done a few more experiments. Enabling nextGenHCR during startup doesn't seem to make much of a difference to the performance. Using the new JProfiling implementation for high optimization compiles that @r30shah seems to cut the time to around 44.5 seconds which helps, but there is still more on the table. More investigation to come. @DanHeidinga Looks like there is a lot of class loading going on for part of the run - many many MANY instances of GregorSamsa being loaded - looks to be related to https://xml.apache.org/xalan-j/xsltc/xsltc_native_api.html. Something about stylesheet processing. Perhaps it is our class loading performance that is letting us down? I'll have to dig at that next week. HotSpot appears to load a similar number of dynamically generated classes.
FYI. The default class name XSLTC uses for the class it generates for an XSLT stylesheet is GregorSamsa. That was the case both with the XSLTC native API that Andrew mentions and also under the covers when using XSLTC as part of a JAXP implementation. If the customer's code is recompiling the XSLT stylesheet each time it's used, it will generate a new set of classes each time - those classes will be generated, loaded and the methods JIT compiled, as required, and old classes dropped on the floor, even if the XSLT stylesheet is not changing. At least, that's how things stood when I was part of the team working on Apache-Xalan (including XSLTC) and IBM's JAXP implementation.
Looking at the benchmark code, I see it is repeatedly calling
We used to recommend to customers that they call
Of course, if the real scenario that the benchmark is modelling involves a situation where XSLT stylesheets are really used just once -- perhaps because the stylesheets themselves are dynamically generated -- then calling
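To make the recommendation above concrete, a minimal sketch of the compile-once pattern using the standard JAXP API (class and variable names are illustrative, not taken from the benchmark):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Compile the stylesheet once (one set of generated GregorSamsa classes),
// then create a lightweight Transformer from the cached Templates per use.
public class CachedStylesheet {
    private final Templates templates;

    public CachedStylesheet(String xslt) throws Exception {
        this.templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    public String transform(String xml) throws Exception {
        StringWriter out = new StringWriter();
        // newTransformer() on a Templates object reuses the compiled stylesheet;
        // Transformer instances are not thread-safe, so create one per call.
        templates.newTransformer()
                .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```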
So looking at the benchmark as attached previously the
@hzongaro thank you for the insights into the XML parser - neither @DanHeidinga nor I are experts on the inner details as you are.
So I can confirm that moving the lines:
out of the
@andrewcraik Do you have profiles of the application? I'd like to see both which Java calls are hot and which VM calls are hot to see if there's an obvious bottleneck in the class load path.
With the modified benchmark the total runtime has dropped below 12s - we are now dealing with quite a short-running benchmark if run in the same configuration, which puts the benchmark into a domain where startup performance is important. If I double the work to keep the execution time over 10s to minimize the start-up/ramp-up artifacts then I see the following:
Now if I drop the data set size back to the original 100000:
The difference in work makes class load / startup / rampup dominate. Adding -Xverify:none has no measurable impact at the original data set size. We will keep having a look at the class loading aspect, but @kgibm I'm wondering if the benchmark is representative of the actual workload? Is the original workload recreating templates / transformers all the time? Could a Java code optimization get them into a mode where the OpenJ9 performance is better (or is that the mode they are in - in which case the benchmark isn't simulating the bottleneck they are seeing)? @DanHeidinga I do have profiles - I'll pass them to you offline so you can study them to see what you find since they are a bit unwieldy to post here.
Part of the problem may just be that we have a lot of code running in the interpreter if each of the stylesheets is created and a lot of the code is only run once.
Interesting finds.
It's quite possible that this benchmark is not representative. I wrote this code from scratch based on HealthCenter profiling of the actual workload and then mimicking some of the XML document structure and XPaths used, but it's quite possible that the application/stack product that runs the actual workload does not call newTemplates/newTransformer; I will ask to run some trace on the actual workload (I'm presuming
@kgibm an
If the source is available, I'd start by grepping the code base for
Otherwise, I'd go with @andrewcraik's
Note, the method names are not complete. They need to be prepended with the correct Java package using
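As an illustration (not the exact trace specification proposed in this thread), an OpenJ9 method trace for the two calls of interest might look like the following, with the fully qualified XSLTC factory class taken from the -D options in the issue description and application.jar as a placeholder:

```
# Illustrative only: print method-trace points to the console for newTemplates and
# newTransformer on the XSLTC TransformerFactoryImpl; application.jar is a placeholder.
java "-Xtrace:print=mt,methods={com/sun/org/apache/xalan/internal/xsltc/trax/TransformerFactoryImpl.newTemplates*,com/sun/org/apache/xalan/internal/xsltc/trax/TransformerFactoryImpl.newTransformer*}" \
    -jar application.jar
```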
Thanks all, I'll ask for that tracing (first
For reference, I will propose:
Thanks for everyone's time on this, but we finally re-created the full customer environment internally, ran the above trace, and there is only a single call to
Java -version output
OpenJ9 -version:
HotSpot -version:
Summary of problem
A customer observes about a 5% performance difference between J9 and HotSpot (particularly for IBM Java, but the same is seen with OpenJ9). The workload is JAXP-heavy (especially XPath and XSLT) and we may have recreated the problem with the attached standalone Java microbenchmark, which shows AdoptOpenJDK+J9 is about 3.5% slower than AdoptOpenJDK+HotSpot even after trying various J9 tuning (the best seemingly being -Xtrace:none -Xverify:none). Verbose garbage collection shows the proportion of time in J9 is ~1.5% and in HotSpot is ~1.6%, so GC doesn't seem to be the issue.
Notes:
-Xms1024m -Xmx1024m.
Test results:
Longer-running tests (e.g. for JIT) based on test 9 were also run and didn't help and only became worse:
The sample program is jaxpperformance.jar in the attached zip (its source is in the JAXPPerformanceCode directory - ultimately just JAXPPerformance.java).
The simplest execution of the sample program is:
time java -Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl -Djavax.xml.xpath.XPathFactory=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl -Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Xms1024m -Xmx1024m -jar jaxpperformance.jar 100000
(Along with -XX:+UseContainerSupport on J9 if running in Docker.)
Other notes:
tests.txt (in particular, including the timestamp when the test started if you want to correlate to nmon). Results are only captured for columns 3 and 4. The second and third iterations of each experiment were run after the first round of tests.
hcd is in the zip.
-Djavax.xml* options aren't strictly needed, but avoid any potential pointless factory lookups which we had seen during some benchmarks.
The Docker container sees 4 CPU core threads with these specs:
Diagnostic files
j9_hotspot_jaxp_diff.zip