OpenJ9 is 2 seconds slower than Azul openjdk9 when running this mandelbrot bench... #39
Comments
I am able to see the difference too - 23 seconds with openjdk vs 28 seconds with openj9 (my underlying system is MacOS). Looks like the perf tool is not fully functional in docker, so performance comparison at a high level is going to be laborious. @tobespc - is healthcenter functional for measuring openjdk performance? In my understanding, the tr.jit does not compete in the micro-benchmark space; instead the defaults are tuned to fit more real-world-like workloads. @mstoodle While the key method computeRow seems to have been compiled to its fullest (scorching) level, the time it took to decide that the method is super-hot may be a deciding factor. I guess there should be a tunable to match performance for this workload, though I don't know what it is. @0xdaryl
+ (hot) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB312FF4-00007FE9BB314178 MethodInProgress - Q_SZ=17 Q_SZI=17 QW=105 j9m=00000000026A1580 bcsz=250 DLT@60 compThread=0 CpuLoad=100%(100%avg) JvmCpu=98%
+ (warm) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB3141C0-00007FE9BB314548 OrdinaryMethod - Q_SZ=16 Q_SZI=16 QW=104 j9m=00000000026A1580 bcsz=250 GCR compThread=0 CpuLoad=100%(100%avg) JvmCpu=98%
+ (profiled very-hot) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB316858-00007FE9BB318118 OrdinaryMethod 96.70% T Q_SZ=0 Q_SZI=0 QW=100 j9m=00000000026A1580 bcsz=250 compThread=0 CpuLoad=100%(100%avg) JvmCpu=100%
+ (scorching) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB318160-00007FE9BB318C28 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=100 j9m=00000000026A1580 bcsz=250 compThread=0 CpuLoad=100%(100%avg) JvmCpu=100%
The one highlight I can see is that it was first DLT (Dynamic Loop Transfer) compiled before graduating to the normal optimization flow. Let me test with this one off. |
ah! what is the option to turn off dlt? |
Disabling DLT:
Zulu9 Results:
OpenJ9 Results: [~2 seconds slower]
OpenJ9 (and "-Xjit:disableDynamicLoopTransfer") Results: [~2 seconds slower]
OpenJ9 (and "-Xjit:optLevel=scorching") Results: [Comparable to Zulu9]
References: |
@babsingh - great result, thanks! |
I reran the bench with -Xgcpolicy:metronome to make sure #42 wasn't adding to the problem and the results were a little slower, so the two issues have no overlap. |
So -Xjit:optLevel=scorching is not the same as the method compiling to scorching by itself. The very-hot profiled step is a special compile that adds runtime instrumentation to the method body, which is then exploited by the scorching compile. When you set optLevel=scorching you bypass this, so optLevel=scorching ends up being very similar to optLevel=hot. I agree that with the runtime of the benchmark at around 25s, the profile will be dominated by JIT startup - doing the profiling etc. takes time. We don't normally try to tune compilation aggressiveness on the command line; these test cases are ideal for helping make the compilation heuristics better. Gathering a runtime profile and a steady-state profile with a tool like perf can help show what methods are running and where opportunities might exist.
Another test that may be worth doing is to fix the heap size for both JDKs and compare the performance. The default initial heap is 8MB with a max of 512MB. A verbose GC log could help show whether GC time is an issue for J9, and fixing the heap to something like 256MB may reduce heap overhead (fixing the heap means using -Xmx256M -Xms256M). Dynamically growing the heap is an expensive operation, and J9 tries to keep the heap as small as possible; this comes at the cost of more garbage collection happening before the heap initially grows, which can slow down startup for J9 in some scenarios. |
I am not that familiar with Zulu9 and its defaults for things like the heap. @andrewcraik is correct that the heap settings may be making the 1.5-2 second difference here. Some verbose:gc logs would definitely help identify whether some small tweaks to the heap settings could improve the overall performance. |
Some thoughts; I haven't had a chance to look very deeply at it. This 2 second gap is actually on the order of 8% slower. Note that this benchmark is mostly focused on one method (computeRow) and does a lot of work with doubles. It does some file output in the hot method and, because of the buffer size (8192) as well as the problem size provided (16000), that code is active in the measurement, so the JIT inlines it (and quite aggressively at scorching, sigh). The main inner loop code looks reasonably consolidated, though I didn't look closely enough to make sure we aren't missing optimization opportunities. Double math isn't something I would say our JIT is highly tuned for, although we do a lot of the usual kinds of things, so that might explain (some of?) the difference. Instruction selection could be another possible place to look for this 8% gap. It might be worth comparing the times when the buffer only gets written at the end (i.e. lowering that 16000 argument to something like 8000) just to see what happens on both sides. The main method doesn't look like it's allocating lots of objects, other than maybe via that PrintStream object, so I'm dubious GC will affect perf much, but I've been wrong before :) (or maybe :( ). |
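For context, a rough sketch of why the file output ends up inside the hot method: pixel results are packed into a byte buffer that is flushed whenever it fills, so with a wide image the write path runs during the timed computation and the JIT inlines it. All names below are invented for illustration (see the attached MandelbrotBench.tar.gz for the real source); this is not the benchmark's actual code.

```java
// Illustrative-only sketch of the computeRow shape (names invented).
class RowSketch {
    static byte computePixel(int x, int y, int n) {
        double cr = 2.0 * x / n - 1.5, ci = 2.0 * y / n - 1.0; // map pixel to complex plane
        double zr = 0.0, zi = 0.0;
        int i = 50;                                            // escape-iteration limit
        do {
            double nzr = zr * zr - zi * zi + cr;               // all double arithmetic
            zi = 2.0 * zr * zi + ci;
            zr = nzr;
        } while (zr * zr + zi * zi <= 4.0 && --i > 0);
        return (byte) (i == 0 ? 1 : 0);                        // 1 = point stayed bounded
    }

    static int computeRow(int y, int n, byte[] buf, int bufPos, java.io.OutputStream out)
            throws java.io.IOException {
        for (int x = 0; x < n; x++) {
            buf[bufPos++] = computePixel(x, y, n);             // real bench packs 8 pixels/byte
            if (bufPos == buf.length) {                        // 8192-byte buffer is full...
                out.write(buf, 0, bufPos);                     // ...so I/O happens mid-loop
                bufPos = 0;
            }
        }
        return bufPos;                                         // carry partial buffer forward
    }
}
```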
Did some more testing based on feedback from @andrewcraik and @charliegracie. openjdk seems to use a 32M initial heap, so some difference there. A few GCs occurred in the j9 run, while there were none in the openjdk case. I used the same heap settings for j9 as well. Looking at the verbose:gc log, I see the GC efforts were insignificant:
bash$ grep "durationms" j9.log | awk -F' ' '{print $4 $5}'
contextid="4"durationms="1.603"
durationms="3.647"/>
contextid="16"durationms="1.533"
durationms="1.681"/>
contextid="28"durationms="2.073"
durationms="2.222"/>
contextid="40"durationms="1.778"
durationms="1.923"/>
contextid="52"durationms="1.962"
durationms="2.120"/> amounting to only ~20 ms . So back to JIT. Now tried what @mstoodle suggested: With 8K as the input, still j9 is lagging: bash$ ./run.sh openjdk
usingjre: java
real 0m5.393s
user 0m5.340s
sys 0m0.030s
bash$ ./run.sh j9jdk
usingjre: /root/openj9-openjdk-jdk9/build/linux-x86_64-normal-server-release/jdk/bin/java
real 0m6.578s
user 0m6.470s
sys 0m0.080s
bash$
If I increase the input count, the difference also increases proportionately. |
@charliegracie zulu9 should be openjdk9 for the most part. I don't know for sure if there's any difference between the two other than azul built and packaged this version of openjdk9. |
@gireeshpunathil I wonder if there is a difference between the scorching compile of the method that we got via specifying optLevel=scorching and the one reached via the upgrade mechanism in the JIT without fixing optLevel=scorching. The other possibility is that the time it takes us to get up to scorching is why we have a difference; is it possible to run the test case for longer (inside a harness or with a larger input) and see if the difference is still the same? |
Perf statistics that may help to better understand OpenJ9 behaviour. With OpenJ9 default settings:
OpenJ9 and -Xjit:optLevel=scorching:
Zulu9 default settings:
|
The optLevel=scorching setting will cut out all recompilation activity, which saves instructions and cache references. For small micro-benchmarks aggressive compilation can work well, but for large applications it will significantly affect how fast the application starts, etc. I think we should focus on the question asked by @vijaysun-omr: can we run it longer to see what the steady state is like without the startup? Possibly put the Mandelbrot calculation in a loop and calculate n Mandelbrots to the same level one after the other, timing each one as an iteration of the benchmark. This will remove startup and ramp-up and compare the steady-state code quality, which will show how much of the delta you are seeing is due to start-up effects from the GC and JIT and how much is related to steady-state code quality. |
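A minimal harness along those lines might look like the sketch below. The class name and the `Mandelbrot.compute(size, out)` entry point are assumptions (the real bench may need a small refactor to expose a callable method); the point is just to run the same computation repeatedly in one JVM so early iterations absorb JIT/GC ramp-up and later iterations approximate steady state.

```java
// Hypothetical steady-state harness; Mandelbrot.compute(...) is an assumed entry point.
public class MandelbrotSteadyState {
    public static void main(String[] args) throws Exception {
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 16000;
        int reps = args.length > 1 ? Integer.parseInt(args[1]) : 10;
        java.io.OutputStream devNull = new java.io.OutputStream() {  // discard output so I/O
            @Override public void write(int b) {}                    // doesn't dominate timing
            @Override public void write(byte[] b, int off, int len) {}
        };
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            Mandelbrot.compute(size, devNull);   // assumed call into the existing bench
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("iteration " + i + ": " + ms + " ms");
        }
    }
}
```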
One thing I noticed running in a docker container on my Mac (so probably not the best measurement environment) is that the very-hot-with-profiling compile is the last one, which isn't exactly ideal. If you run with |
I ran for 64000, which allowed me to collect a full log for the benchmark (including very hot with profiling and scorching compiles). The log generated by |
@andrewcraik regarding "put the Mandelbrot calculation in a loop and calculate n Mandelbrots to the same level one after the other": I gathered some perf statistics. The perf issue doesn't resolve in steady state. OpenJ9 default settings:
Zulu9 default setting:
OpenJ9 + -Xjit:optLevel=scorching:
OpenJ9 + -Xjit:quickProfile:
|
@andrewcraik Also tried this --> increased OpenJ9 default settings:
Zulu9 default settings:
OpenJ9 + -Xjit:optLevel=scorching:
OpenJ9 + -Xjit:quickProfile:
|
hmm, that is an interesting finding - it seems like there is opportunity for further compiles at higher opt levels, or for further optimization somehow. Getting a verbose compilation log with and without quickProfile would show the methods getting to higher opt levels (the option would be to add verbose={comp*,sampling},vlog=/path/to/log/file to the JIT command line). My initial thought is that there must be a subset of methods that are not being escalated to higher opt levels in the default configuration but would benefit from it. @mpirvu any thoughts? |
@andrewcraik Ran with |
The instructions per cycle are markedly higher for us than for Zulu9. I wonder if there is a better choice of x86 instructions for what the core loop (which is almost all double arithmetic) is doing; i.e. maybe one or more operations in the loop can be fused. @0xdaryl and @andrewcraik, thoughts? You can see the core loop in the log attached to an earlier comment. |
I don't know if it will be helpful to you guys at this point, but I added this bench (and a parallel version) to my project here. It uses JMH, so you can launch this specific bench with |
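For readers who haven't used it, a JMH benchmark for this kind of workload typically looks like the sketch below. This is not @markehammons' actual code; the class name, parameters, and the `Mandelbrot.render(size)` call are illustrative assumptions.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(3)                      // multiple forks reduce run-to-run noise
@Warmup(iterations = 5)       // warmup iterations absorb JIT ramp-up
@Measurement(iterations = 10)
public class MandelbrotJmhBench {
    @Param({"4000", "8000", "16000"})
    int size;

    @Benchmark
    public byte[] mandelbrot() {
        return Mandelbrot.render(size);   // hypothetical call into the ported bench
    }
}
```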
Looking at the Java code, the vast majority of the time should be spent in this method:
For the quickProfile case, the scorching body is available after 486ms (very close to the previous case).
So looking at the core loop (which in the attached log was loop 5 with frequency 10000), there is a pattern of repeated double-width adds, subs, and multiplies. I can see one obvious silliness that is happening. Consider the following from the evaluation of tree root n94n: The two floating point constants are 2 and 4 respectively, so we might be able to do something better there. There also looks to be an opportunity to exploit the VFMADD/VFMADDSUB instructions to reduce the instruction count further - there may also be scope for a packed VFMADD/VFMADDSUB. These newer fused instructions are not currently supported by the OpenJ9 x86 code generator. At a higher level, this loop is counted and has an early termination condition with the floating point test. The loop is unrolled as uncounted but could be unrolled as counted, since we know it will run either 50 times or until the floating point condition is satisfied. The unroller could do a better job moving the induction variable updates out of the way, but they are probably small compared to the floating point cost. |
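To make the fusion opportunity concrete: the constants 2 and 4 and the 50-iteration bound come from the standard Mandelbrot escape recurrence, and each step is a multiply feeding an add or subtract, which is exactly the shape a VFMADD-style instruction covers. Below is a hedged sketch (my reconstruction, not the benchmark's source) of that arithmetic written with Java's Math.fma, purely to show the shape the code generator could fuse, not a proposed source change.

```java
// Reconstruction of the escape recurrence: zr' = zr*zr - zi*zi + cr, zi' = 2*zr*zi + ci,
// exit when zr*zr + zi*zi > 4.0 or after 50 iterations. Math.fma has been in java.lang.Math
// since Java 9 and computes a*b + c with a single rounding, the same work a VFMADD-style
// x86 instruction does; results can therefore differ from the unfused code in the last bit.
static int escapeIterationsFma(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    int i = 50;                                               // counted bound
    do {
        double nzr = Math.fma(zr, zr, Math.fma(-zi, zi, cr)); // zr*zr + (-zi*zi + cr)
        zi = Math.fma(2.0 * zr, zi, ci);                      // (2*zr)*zi + ci
        zr = nzr;
    } while (Math.fma(zr, zr, zi * zi) <= 4.0 && --i > 0);    // early floating point exit
    return 50 - i;
}
```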
I would add that VFMADD/VFMADDSUB are three-operand instructions, so getting three-operand support into the code gen would be the first step toward exploiting these operations. |
I may have to correct my movsd comments - it looks like there are future uses so they may be valid. |
I'm thinking "At a higher level this loop is counted and has an early termination condition with the floating point test. The loop is unrolled as uncounted but could be unrolled as counted since we know it will run either 50 times or until the floating point condition is satisfied. The unroller could do a better job moving the induction variable updates out of the way, but they are probably small compared to the floating point cost." is probably worth pursuing next since the delta is small enough and this loop is so tight that those couple of instructions could be a non trivial contributor. Plus the run with optLevel=scorching run being faster than Zulu9 suggests that we don't need to necessarily solve the floating point issues to get to that point (though we should open an issue to also improve the floating point handling here). |
let me see if I can:
|
I do not want us to look at zulu9 generated code. We should pursue the above lead regarding unrolling instead. |
@vijaysun-omr, thanks. My rationale was that if we can compare at higher levels, we can compare at any level in pursuit of getting to the bottom of the issue. But sure, I will follow your judgement. |
@markehammons This should now be resolved. The change was a fix to the BlockSplitter optimization - it was setting an incorrect frequency on a basic block that it generated during optimization, which disrupted the layout of the loop and the subsequent loop optimizations. With the OMR fix included in OpenJ9 you should see a performance boost in line with the scores reported with -Xjit:quickProfile or optLevel=scorching. |
@andrewcraik thanks, I'll rebench and confirm this weekend! |
I lied, I did it last night. According to my benchmarks, OpenJ9 on default settings now beats Zulu9 on this bench, with its best-case performance being 638ms faster, its average case 617ms faster, and its worst-case performance 705ms faster! Good job on optimizing this! Edit: I should note that I didn't use any forks in this bench, and I should probably do 3 to get stronger results. A rerun with all the shootout benches I've ported to JMH has Zulu9 performing nearly as well as OpenJ9's best case and worst case, but still losing out by about 100ms. |
Thanks for circling back for us, @markehammons ! |
@markehammons is there anything left to do in this issue, in your opinion, or do you think we can close it off? |
Closing, please re-open if there is still an issue. |
Hi @markehammons! We haven't heard from you in a while :) . I know this issue has been closed for a long time, but I just wanted to reach out to see if you're still using OpenJ9 and to ask if you have any further feedback on your experiences with either the JVM or the project itself. We have a weekly web meeting (using Zoom) on Wednesdays at 11am EST, and we'd be very interested to hear your feedback, good or not, at that call if you would be willing to join us. The Zoom link is usually posted in our #planning Slack channel the morning of the meeting (very easy to join; please see the "What's new" page at https://www.eclipse.org/openj9/oj9_whatsnew.html)
I'll join. I have been very busy with other projects, and haven't had much chance to test openj9 recently but I'll try to help out still. |
I compared Azul OpenJDK9 with OpenJ9 to test the performance of OpenJ9 against a relatively standard OpenJDK, and found that OpenJ9 is around 2 seconds slower running the bench.
MandelbrotBench.tar.gz
My system is Fedora 26, and my OpenJ9 -version is