
JAXP-intensive workload on OpenJ9 seems to be about 5% slower than HotSpot #6642

Closed
kgibm opened this issue Jul 31, 2019 · 35 comments

@kgibm
Contributor

kgibm commented Jul 31, 2019

Java -version output

OpenJ9 -version:

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
Eclipse OpenJ9 VM (build openj9-0.15.1, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20190717_368 (JIT enabled, AOT enabled)
OpenJ9   - 0f66c6431
OMR      - ec782f26
JCL      - f147086df1e based on jdk8u222-b10)

HotSpot -version:

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)

Summary of problem

A customer observes about a 5% performance difference between J9 and HotSpot (particularly for IBM Java, but the same is seen with OpenJ9). The workload is JAXP-heavy (especially XPath and XSLT) and we may have recreated the problem with the attached standalone Java microbenchmark, which shows AdoptOpenJDK+J9 is about 3.5% slower than AdoptOpenJDK+HotSpot even after trying various J9 tuning (the best seemingly being -Xtrace:none -Xverify:none). Verbose garbage collection shows the proportion of time spent in GC is ~1.5% on J9 and ~1.6% on HotSpot, so GC doesn't seem to be the issue.

Notes:

  • Baseline is -Xms1024m -Xmx1024m.
  • Each row is an independent experiment with the change from baseline noted in the second column. That change is only made on J9, and the HotSpot results are just a re-run of baseline.
  • Each experiment is run three times for each JVM and the averages are the third-to-last and second-to-last columns.
  • The last column is the OpenJ9 relative increase compared to HotSpot, calculated from the averages.

Test results:

| # | Test | OpenJ9 1 (s) | HotSpot 1 (s) | OpenJ9 2 (s) | HotSpot 2 (s) | OpenJ9 3 (s) | HotSpot 3 (s) | OpenJ9 Average (s) | HotSpot Average (s) | OpenJ9 Relative Increase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Baseline | 55.98 | 49.34 | 54.89 | 48.95 | 54.12 | 50.41 | 55.00 | 49.57 | 10.95% |
| 2 | -Xaggressive | 57.38 | 51.44 | 54.65 | 49.09 | 54.48 | 50.01 | 55.50 | 50.18 | 10.61% |
| 3 (2nd run) | -Xshareclasses:name=shr | 58.62 | 49.67 | 57.28 | 49.24 | 57.29 | 48.94 | 57.73 | 49.28 | 17.14% |
| 4 (4th run) | -Xshareclasses:name=shr -Xscmx400M '-Xjit:dontDowngradeToCold,useHigherMethodCounts,forceAOT' '-Xaot:dontDowngradeToCold,useHigherMethodCounts,forceAOT' | 61.76 | 50.72 | 59.67 | 49.09 | 59.55 | 49.70 | 60.33 | 49.84 | 21.05% |
| 5 | -Xquickstart | 59.62 | 48.94 | 60.06 | 48.93 | 58.87 | 49.18 | 59.52 | 49.02 | 21.42% |
| 6 | -XcompilationThreads1 | 55.33 | 48.95 | 56.97 | 48.88 | 55.01 | 48.71 | 55.77 | 48.85 | 14.17% |
| 7 | -Xtrace:none | 57.26 | 49.69 | 54.62 | 49.41 | 54.22 | 49.49 | 55.37 | 49.53 | 11.78% |
| 8 | -Xverify:none | 51.59 | 49.38 | 51.93 | 48.96 | 53.21 | 48.74 | 52.24 | 49.03 | 6.56% |
| 9 | -Xtrace:none -Xverify:none | 51.64 | 49.57 | 51.13 | 49.27 | 49.89 | 48.76 | 50.89 | 49.20 | 3.43% |

Longer-running tests (e.g. to give the JIT more time) based on test 9 were also run; they didn't help, and the gap only became worse:

| # | Loops | OpenJ9 (s) | HotSpot (s) | OpenJ9 Relative Increase |
| --- | --- | --- | --- | --- |
| 10 | 1000000 | 525.3 | 453.5 | 15.83% |
| 11 | 10000000 | 5,676.03 | 4,455.77 | 27.39% |

The sample program is jaxpperformance.jar in the attached zip (its source is in the JAXPPerformanceCode directory - ultimately just JAXPPerformance.java).

The simplest execution of the sample program is:

time java \
  -Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl \
  -Djavax.xml.xpath.XPathFactory=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl \
  -Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl \
  -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl \
  -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl \
  -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory \
  -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl \
  -Xms1024m -Xmx1024m -jar jaxpperformance.jar 100000

(Along with -XX:+UseContainerSupport on J9 if running in Docker.)
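For orientation only, here is a hypothetical sketch of the kind of per-iteration JAXP work such a microbenchmark performs (the XML, XPath, and class name below are invented for illustration; the real code is JAXPPerformance.java in the attached zip):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JaxpSketch {
    public static void main(String[] args) throws Exception {
        String xml = "<orders><order id='1'><item>widget</item></order></orders>";
        String xslt = "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:template match='/'><xsl:copy-of select='.'/></xsl:template>"
                + "</xsl:stylesheet>";
        int loops = args.length > 0 ? Integer.parseInt(args[0]) : 1000;

        for (int i = 0; i < loops; i++) {
            // Parse the XML document
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Evaluate an XPath expression against it
            NodeList items = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/orders/order/item", doc, XPathConstants.NODESET);
            if (items.getLength() == 0) {
                throw new IllegalStateException("XPath matched nothing");
            }

            // Compile and apply an XSLT stylesheet; the per-iteration newTemplates()
            // call is the pattern discussed further down in this issue
            Templates templates = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xslt)));
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }
}
```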

Other notes:

  • Nmon was run for the duration of all tests and the output is in the zip.
  • Verbose garbage collection was only run for the first iteration of both baselines and the files are in the zip.
  • The output of all tests is in tests.txt (in particular, including the timestamp when each test started if you want to correlate with nmon). Results are only captured for columns 3 and 4 (the first OpenJ9 and HotSpot runs). The second and third iterations of each experiment were run after the first round of tests.
  • After the above tests were all run, the J9 baseline was run again with HealthCenter (using this technique) and the hcd is in the zip.
  • The -Djavax.xml* options aren't strictly needed, but they avoid any potentially pointless factory lookups, which we had seen during some benchmarks.
  • The tests were run on Fedora Linux 30 running in Docker on top of a MacBook Pro 2017 with low background activity verified in Activity Monitor.
  • The linperf_RESULTS.tar.gz is from a post-run of the J9 baseline plus linperf.sh. A system dump was captured during that test in case it's needed. Additional javacores from that test were excluded from the upload to fit within GitHub's upload size limits.

The Docker container sees 4 CPU core threads with these specs:

model name	: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
cache size	: 8192 KB

Diagnostic files

j9_hotspot_jaxp_diff.zip

@andrewcraik
Contributor

What is the setting for transparent huge pages on the machine in question? Testing with transparent huge pages set to always may help performance.

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

> What is the setting for transparent huge pages on the machine in question? Testing with transparent huge pages set to always may help performance.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

I'll try always and report back...

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

@andrewcraik Re-ran test 9 with transparent_hugepage=always and the relative difference is still about 5%:

| # | Test | OpenJ9 1 (s) | HotSpot 1 (s) | OpenJ9 2 (s) | HotSpot 2 (s) | OpenJ9 3 (s) | HotSpot 3 (s) | OpenJ9 Average (s) | HotSpot Average (s) | OpenJ9 Relative Increase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | Test 9 w/ transparent_hugepage=always | 51.14 | 48.46 | 50.04 | 47.67 | 49.88 | 47.7 | 50.35 | 47.94 | 5.03% |
[was@1c12f673e1f5 ~]$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
always
[was@1c12f673e1f5 ~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

FYI, memory in the container:

[was@1c12f673e1f5 ~]$ cat /proc/meminfo
MemTotal:       12281300 kB
MemFree:        10901932 kB
MemAvailable:   11377204 kB
Buffers:           53748 kB
Cached:           646036 kB
SwapCached:            0 kB
Active:           950964 kB
Inactive:         298696 kB
Active(anon):     538880 kB
Inactive(anon):    15176 kB
Active(file):     412084 kB
Inactive(file):   283520 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        547596 kB
Mapped:           238120 kB
Shmem:              4184 kB
Slab:              63568 kB
SReclaimable:      34548 kB
SUnreclaim:        29020 kB
KernelStack:       10352 kB
PageTables:         8020 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7189220 kB
Committed_AS:    3367240 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePages:     38912 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       54700 kB
DirectMap2M:     5187584 kB
DirectMap1G:     9437184 kB

@andrewcraik
Contributor

@kgibm that is still a 5% improvement if I am reading this correctly, e.g. comparing test 12 to test 1, which is the effect of the change?

@andrewcraik
Contributor

Note that OpenJ9 plans to better support madvise once we have investigated #6156, which is currently targeted for OpenJ9 0.16.

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

@andrewcraik The last column is the relative difference. So OpenJ9 is 5% slower than HotSpot.

I haven't been comparing tests to each other - I'm just comparing the J9 experimental change to HotSpot baseline. (Since the HotSpot baseline test is always the same, we could compare all the tests together, but I avoid that because I'd rather compare tests run at approximately the same time to each other to reduce any time-based variability). I treat each test independently: three runs for OpenJ9, three runs for HotSpot, take the average of each and then calculate the relative difference: (B-A)/A where A=HotSpot, B=J9.
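For example, applying that to test 1's averages gives (55.00 - 49.57) / 49.57 ≈ 0.1095, i.e. the 10.95% in the last column.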

@andrewcraik
Contributor

@kgibm right, what I am saying is that with the baseline configuration the delta was 10% (i.e. configuration 1). Changing the transparent huge page setting to always narrowed that gap to 5% - OpenJ9 is still slower, but less so than in configuration 1. That performance improvement is expected to be delivered in OpenJ9 0.16. There is still a 5% delta to be investigated. I am interested to see that -Xverify:none had an impact - there was and is work being done to improve the performance of the verification logic, and some of the gains from that will likely also be available in 0.16. FYI @DanHeidinga since he runs the team of folks who are helping with that work.

@andrewcraik
Contributor

FYI @mpirvu due to the AOT numbers above which seem 'interesting'

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

@andrewcraik Oh I see what you mean. This transparent_hugepage=always test 12 was based on test 9, so I ran it with -Xtrace:none -Xverify:none as well. Test 9 was 7% faster than test 1, but test 12 was only 5% faster than test 1, so it doesn't seem that transparent_hugepage=always had any improvement (although of course sample sizes here are small and variability may be high, so I'd put it into the error range). I'll run a test with transparent_hugepage=always based on test 1...

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

BTW, test ideas 2-9 were just random ideas I grabbed from the air :) Just trying to be thorough before opening the issue.

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

@andrewcraik I ran test 1 with transparent_hugepage=always and it didn't seem to give any benefit (J9 about 10% slower). It seems the major benefit is from -Xtrace:none -Xverify:none.

I also re-did the whole spreadsheet to make it clear which test each run is based on, and added a t-test column (two-tailed, heteroscedastic); most of the results have p < 0.05 (in other words, the difference between the averages is statistically "significant"). Note that I re-arranged the columns compared to the original table to make it easier to run the t-test.

| # | Test | OpenJ9 1 (s) | OpenJ9 2 (s) | OpenJ9 3 (s) | HotSpot 1 (s) | HotSpot 2 (s) | HotSpot 3 (s) | OpenJ9 Average (s) | HotSpot Average (s) | OpenJ9 Relative Increase | t-test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Baseline (-Xms1G -Xmx1G; 100000 loops) | 55.98 | 54.89 | 54.12 | 49.34 | 48.95 | 50.41 | 55.00 | 49.57 | 10.95% | 0.00 |
| 2 | Test 1 w/ -Xaggressive | 57.38 | 54.65 | 54.48 | 51.44 | 49.09 | 50.01 | 55.50 | 50.18 | 10.61% | 0.01 |
| 3 (2nd run) | Test 1 w/ -Xshareclasses:name=shr | 58.62 | 57.28 | 57.29 | 49.67 | 49.24 | 48.94 | 57.73 | 49.28 | 17.14% | 0.00 |
| 4 (4th run) | Test 1 w/ -Xshareclasses:name=shr -Xscmx400M '-Xjit:dontDowngradeToCold,useHigherMethodCounts,forceAOT' '-Xaot:dontDowngradeToCold,useHigherMethodCounts,forceAOT' | 61.76 | 59.67 | 59.55 | 50.72 | 49.09 | 49.7 | 60.33 | 49.84 | 21.05% | 0.00 |
| 5 | Test 1 w/ -Xquickstart | 59.62 | 60.06 | 58.87 | 48.94 | 48.93 | 49.18 | 59.52 | 49.02 | 21.42% | 0.00 |
| 6 | Test 1 w/ -XcompilationThreads1 | 55.33 | 56.97 | 55.01 | 48.95 | 48.88 | 48.71 | 55.77 | 48.85 | 14.17% | 0.01 |
| 7 | Test 1 w/ -Xtrace:none | 57.26 | 54.62 | 54.22 | 49.69 | 49.41 | 49.49 | 55.37 | 49.53 | 11.78% | 0.02 |
| 8 | Test 1 w/ -Xverify:none | 51.59 | 51.93 | 53.21 | 49.38 | 48.96 | 48.74 | 52.24 | 49.03 | 6.56% | 0.01 |
| 9 | Test 1 w/ -Xtrace:none -Xverify:none | 51.64 | 51.13 | 49.89 | 49.57 | 49.27 | 48.76 | 50.89 | 49.20 | 3.43% | 0.07 |
| 10 | Test 9 w/ 1000000 loops | 525.3 | | | 453.5 | | | 525.30 | 453.50 | 15.83% | |
| 11 | Test 9 w/ 10000000 loops | 5,676.03 | | | 4,455.77 | | | 5676.03 | 4455.77 | 27.39% | |
| 12 | Test 9 w/ transparent_hugepage=always | 51.14 | 50.04 | 49.88 | 48.46 | 47.67 | 47.7 | 50.35 | 47.94 | 5.03% | 0.01 |
| 13 | Test 1 w/ transparent_hugepage=always | 51.83 | 52.86 | 53.22 | 47.69 | 47.77 | 47.78 | 52.64 | 47.75 | 10.24% | 0.01 |
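For reference, a minimal sketch of how the relative increase and the two-tailed, heteroscedastic (Welch) p-value for a row could be reproduced, assuming Apache Commons Math 3 is on the classpath (the spreadsheet itself was produced separately):

```java
import org.apache.commons.math3.stat.inference.TTest;

public class BenchmarkStats {
    public static void main(String[] args) {
        // Test 1 samples from the table above
        double[] openj9  = { 55.98, 54.89, 54.12 };
        double[] hotspot = { 49.34, 48.95, 50.41 };

        double j9Avg = average(openj9);
        double hsAvg = average(hotspot);

        // Relative increase: (B - A) / A, where A = HotSpot and B = OpenJ9
        System.out.printf("Relative increase: %.2f%%%n", 100.0 * (j9Avg - hsAvg) / hsAvg);

        // Two-sample, two-tailed t-test without assuming equal variances (Welch)
        double p = new TTest().tTest(openj9, hotspot);
        System.out.printf("p-value: %.3f%n", p);
    }

    private static double average(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}
```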

@andrewcraik
Contributor

@kgibm thanks for the clarification - I am surprised by the overhead that -Xtrace:none -Xverify:none seemingly removes. Would it be possible to get a perf record of configs 1 and 9 during iteration 3 on your setup with -Xjit:perfTool specified? This will show how time is distributed between the various JVM components etc.

@andrewcraik
Contributor

FYI @vijaysun-omr due to general interest in perf deltas.

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

@andrewcraik I was worried somebody would ask for perf :) Unfortunately, Docker (at least on Mac) doesn't support perf. I tried running the container with --privileged and it didn't help. I do have a Thinkpad with RHEL on it so I'll see if I can dust that off today.

Note also that I'm traveling to another customer tomorrow for a week so I'll be mostly unavailable, but I've attached the standalone program in the original issue description if others want to try to reproduce locally.

@kgibm
Contributor Author

kgibm commented Jul 31, 2019

Also, you already mentioned this, but just for others that might miss this, it would certainly be interesting to understand why -Xtrace:none -Xverify:none are helping so much (btw, that was a suggestion from Gary DeVal to try), but the bigger issue for the customer is the remaining ~5% difference.

@mpirvu
Contributor

mpirvu commented Jul 31, 2019

AOT is supposed to help only when many methods need to be compiled in a short period of time. This benchmark runs for 50+ seconds and I don't think that the JAXP benchmark has that many methods to compile.
I am also surprised by the improvement brought by -Xtrace:none -Xverify:none. These two are useful for short-running apps, which is not the case here.

@andrewcraik
Contributor

andrewcraik commented Jul 31, 2019

I think understanding the -Xtrace:none and -Xverify:none issues will help inform where the rest of the delta is coming from.

@andrewcraik
Contributor

@kgibm To continue the discussion of the OpenJ9 problem - I have access to a Linux perf setup, so I'll see if I can get the test case running there, reproduce the same kind of gap you see in your container, and check whether a profile shows anything useful. If you manage to get any profiling data or find out anything else, do let me know, since there seem to be some strange things in play.

@andrewcraik
Contributor

So I have the test running on my setup and I can see a delta between HotSpot and OpenJ9 even on bare metal (I removed the container support option etc. so I'm just looking at the native machine performance to simplify things a bit). I can also see that -Xverify:none and -Xtrace:none save about 2/3 of the gap I see. The profile does show hot and scorching compilations. The hottest method seems to be StringBuilder's ensureCapacityImpl, so there is a lot of string building going on (makes sense based on the code).

Going to dig a bit more at what is going on etc. Still some mysteries to solve.

@kgibm
Contributor Author

kgibm commented Aug 1, 2019

Great to hear that it's reproducible, thanks Andrew!

@andrewcraik
Contributor

Has it been 7 days already? We had a couple of days off here in Canada so I guess it has...

Anyway I have still been looking at this to try and figure out what is going on. I've been doing some experiments with:

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-201907290413-b10)
Eclipse OpenJ9 VM (build master-e4050392f, JRE 1.8.0 Linux amd64-64-Bit Compressed References 20190729_373 (JIT enabled, AOT enabled)
OpenJ9   - e4050392f
OMR      - 8acafad8
JCL      - 46e5e62038e based on jdk8u222-b10)

So, some configurations and some notes - all runs are averages of 3 on a Skylake Linux perf box, pinned to 4 cores.

| Config | Time (s) | Notes |
| --- | --- | --- |
| Baseline | 48.09 | this is with default shareclasses for bootstrap classes |
| -Xshareclasses:none | 50.07 | share classes helps |
| -Xverify:none | 45.41 | verify:none boosts performance |
| -Xtrace:none | 48.208 | no effect from trace:none |
| -Xverify:none & -Xtrace:none | 46.08 | benefit of verify:none is not increased by trace:none |
| TR_OptimizeForConstantLengthArrayCopy=1 & -Xverify:none | 45.83 | no effect from constant-length arraycopy opts (one hunch for the delta) |
| -Xverify:none & -Xquickstart | 52.25 | the throughput tradeoff with quickstart is not worth it |

The fastest configuration in my testing so far is the latest JVM with shareclasses enabled and -Xverify:none.

There remains about a 5% gap to HotSpot performance on my setup. I now need to build some test JITs to test the effect of enabling nextGenHCR during startup as well as the new JProfiling implementation.

I did do a quick check of the end to end profile for the benchmark. There is a non-trivial amount of time in the bytecode interpreter loop (a bit under 1/3 of the profile). There is a reasonable amount of time consumed by JIT compilation and the bulk of the remainder is in JIT code. Top method looked to be StringBuilder.append (hence the constant arraylength optimizations above).

@DanHeidinga
Member

@andrewcraik Can you also do a run with -verbose:class to see if there is significant classloading that continues to occur throughout? Does JAXP generate classes at runtime? Is there something we're doing that Hotspot isn't that causes them to be continually regenerated?

@andrewcraik
Contributor

So I have done a few more experiments. Enabling nextGenHCR during startup doesn't seem to make much of a difference to the performance. Using the new JProfiling implementation for high optimization compiles from @r30shah seems to cut the time to around 44.5 seconds, which helps, but there is still more on the table. More investigation to come.

@DanHeidinga Looks like there is a lot of class loading going on for part of the run - many many MANY instances of GregorSamsa being loaded - looks to be related to https://xml.apache.org/xalan-j/xsltc/xsltc_native_api.html. Something about stylesheet processing. Perhaps it is our class loading performance that is letting us down? I'll have to dig at that next week. HotSpot appears to load a similar number of dynamically generated classes.

@hzongaro
Member

> many many MANY instances of GregorSamsa being loaded

FYI. The default class name XSLTC uses for the class it generates for an XSLT stylesheet is GregorSamsa. That was the case both with the XSLTC native API that Andrew mentions and also under the covers when using XSLTC as part of a JAXP implementation.

If the customer's code is recompiling the XSLT stylesheet each time it's used, it will generate a new set of classes each time - those classes will be generated, loaded and the methods JIT compiled, as required, and old classes dropped on the floor, even if the XSLT stylesheet is not changing. At least, that's how things stood when I was part of the team working on Apache-Xalan (including XSLTC) and IBM's JAXP implementation. Looking at the benchmark code, I see it is repeatedly calling TransformerFactory.newTemplates() to create a Templates object, and then calling Templates.newTransformer(), so that appears to be what it is doing - recompiling the XSLT stylesheet each time it's used.

We used to recommend to customers that they call TransformerFactory.newTemplates() once, and then create a Transformer object from that each time they needed to apply the stylesheet to an XML input document. That avoids repeatedly recompiling the same XSLT stylesheet into new classes each time. That would yield much better performance with both IBM's JDK and Hotspot, back in those days - and likely with OpenJ9 today, and would probably result in a more meaningful comparison. Otherwise, it's a bit like recompiling and reloading a method every time you need to run it.
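A minimal sketch of that recommended pattern (the file names are hypothetical; this is not the benchmark's actual code):

```java
import java.io.File;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CachedStylesheetExample {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once; the resulting Templates object is
        // thread-safe and can be cached for the life of the application.
        TransformerFactory factory = TransformerFactory.newInstance();
        Templates cachedTemplates =
                factory.newTemplates(new StreamSource(new File("stylesheet.xsl")));

        // For each transformation (e.g. each benchmark iteration), create a
        // lightweight Transformer from the cached Templates instead of
        // recompiling the stylesheet with newTemplates().
        for (int i = 0; i < 3; i++) {
            Transformer transformer = cachedTemplates.newTransformer();
            transformer.transform(new StreamSource(new File("input.xml")),
                                  new StreamResult(new File("output-" + i + ".xml")));
        }
    }
}
```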

Of course, if the real scenario that the benchmark is modelling involves a situation where XSLT stylesheets are really used just once -- perhaps because the stylesheets themselves are dynamically generated -- then calling TransformerFactory.newTemplates() might better match what the customer's actual code needs to do.

@andrewcraik
Contributor

So looking at the benchmark as attached previously, the static void run method calls TransformerFactory.newInstance() once. Inside the static void benchmark method, a new Templates object is created on each iteration of the core loop. I am going to experiment by moving this logic out of the loop and allowing the same transformer to be reused. This would simulate the code transform being suggested by @hzongaro.

@hzongaro thank you for the insights into the XML parser - neither @DanHeidinga nor I know the inner details the way you do.

@andrewcraik
Contributor

So I can confirm that moving the lines:

final Templates xsltTemplate = transFact.newTemplates(xsltSource);
final Transformer transformer = xsltTemplate.newTransformer();

out of the benchmark function's loop stops the huge amount of class loading. With those lines hoisted, the difference between -Xverify:none and the default is hard to measure.

@DanHeidinga
Member

@andrewcraik Do you have profiles of the application? I'd like to see both which java calls are hot and which VM calls are hot to see if there's an obvious bottleneck in the classload path.

@andrewcraik
Contributor

With the modified benchmark the total runtime has dropped below 12s - we are now dealing with quite a short-running benchmark in the same configuration, which puts it into a domain where startup performance is important.

If I double the work to keep the execution time over 10s to minimize the start-up/ramp-up artifacts then I see the following:

| Config | Runtime (s) |
| --- | --- |
| openjdk | 13.03 |
| openj9 | 10.40 |
| openj9 + shareclasses | 10.12 |
| openj9 + jProfiling | 10.79 |
| openj9 + jProfiling + shareclasses | 9.67 |

Now if I drop the data set size back to the original 100000:

| Config | Runtime (s) |
| --- | --- |
| openjdk | 7.37 |
| openj9 + jProfiling + shareclasses | 9.39 |

The difference in work makes class load / startup / ramp-up dominate. Adding -Xverify:none has no measurable impact at the original data set size. -Xquickstart hurts performance since we need steady-state performance as well as startup in this scenario - it is end-to-end with high opt compiles required (as evidenced by the gains from JProfiling).

We will keep having a look at the class loading aspect, but @kgibm I'm wondering if the benchmark is representative of the actual workload? Is the original workload recreating templates / transformers all the time? Could a Java code optimization get them into a mode where the OpenJ9 performance is better (or is that the mode they are in - in which case the benchmark isn't simulating the bottleneck they are seeing)?

@DanHeidinga I do have profiles - I'll pass them to you offline so you can study them and see what you find, since they are a bit unwieldy to post here.

@andrewcraik
Contributor

Part of the problem may just be that we have a lot of code running in the interpreter if each of the stylesheets is compiled anew and a lot of the generated code is only run once.

@kgibm
Contributor Author

kgibm commented Aug 12, 2019

Interesting finds.

> @kgibm I'm wondering if the benchmark is representative of the actual workload? Is the original workload recreating templates / transformers all the time?

It's quite possible that this benchmark is not representative. I wrote this code from scratch based on HealthCenter profiling of the actual workload and then mimicking some of the XML document structure and XPaths used, but it's quite possible that the application/stack product that runs the actual workload does not call newTemplates/newTransformer; I will ask to run some trace on the actual workload (I'm presuming -Xtrace is the easiest way to check if they're calling these methods after warmup).

@andrewcraik
Contributor

@kgibm an -Xverbose:class will show how much class loading is going on, though there may be a better -Xtrace option for that - @DanHeidinga, thoughts? A perf record of steady state with -Xjit:perfTool would show where time is really going in the system - an -Xtrace for method entry would be a poor man's approximation of that, I think. @DanHeidinga again, any additional thoughts on trace to see if the benchmark is representative?

@DanHeidinga
Member

If the source is available, I'd start by grepping the code base for TransformerFactory.newTemplates() calls, as that's the context that will determine whether the XSLT stylesheet is allowed to change or not. If that shows the stylesheet doesn't change (or mostly doesn't change), it provides the opportunity to cache the Templates.newTransformer() objects.

Otherwise, I'd go with @andrewcraik's -Xverbose:class as a good starting point and then maybe narrow it with something like this which will dump a stack trace on the call:

"-Xtrace:none,maximal=mt,methods=TransformerFactory.newTemplates,trigger=method{TransformerFactory.newTemplates,jstacktrace}"

Note, the method names above are not complete. They need to be prefixed with the correct Java package, using / instead of . as the separator.

@kgibm
Contributor Author

kgibm commented Aug 12, 2019

Thanks all, I'll ask for that tracing (first -verbose:class + -Xtrace, and if that doesn't show heavy classloading after warmup, then perf) and report back (a huge stack product is involved on top of WAS, so a grep won't be easy).

@kgibm
Contributor Author

kgibm commented Aug 12, 2019

For reference, I will propose:

-Xverify:none
-Xverbose:class
-Xtrace:none,maximal=mt,methods=com/sun/org/apache/xalan/internal/xsltc/trax/TransformerFactoryImpl.newTemplates,trigger=method{com/sun/org/apache/xalan/internal/xsltc/trax/TransformerFactoryImpl.newTemplates,jstacktrace}
-Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
-Djavax.xml.xpath.XPathFactory=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl
-Djavax.xml.xpath.XPathFactory:http://java.sun.com/jaxp/xpath/dom=com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl

@kgibm
Contributor Author

kgibm commented Feb 18, 2020

Thanks for everyone's time on this, but we finally re-created the full customer environment internally, ran the above trace, and there is only a single call to TransformerFactory.newTemplates, so this standalone reproduction was a false positive.

kgibm closed this as completed Feb 18, 2020