
OpenJ9 is 2 seconds slower than Azul openjdk9 when running this mandelbrot bench... #39

Closed
markehammons opened this issue Sep 17, 2017 · 37 comments


@markehammons

I compared Azul OpenJDK9 with OpenJ9 to test the performance of OpenJ9 against relatively standard OpenJDK, and found that OpenJ9 is around 2 seconds slower running the bench.

MandelbrotBench.tar.gz

My system is Fedora 26, and my OpenJ9 -version is

openjdk 9-internal
OpenJDK Runtime Environment (build 9-internal+0-adhoc.jenkins.openjdk)
Eclipse OpenJ9 VM (build 2.9, JRE 9 Linux amd64-64 Compressed References 20170915_6 (JIT enabled, AOT enabled)
J9VM - cea1ed7
JIT  - cea1ed7
OMR  - 617de12
OpenJDK  - 83f5cd0 based on )
@gireeshpunathil
Contributor

I am able to see the difference too - 23 seconds with openjdk vs 28 seconds with openj9. (My underlying system is macOS.)

It looks like the perf tool is not fully functional in docker, so a high-level performance comparison is going to be laborious.

@tobespc - is healthcenter functional for measuring openjdk performance?

In my understanding, the tr.jit does not compete in the micro-benchmark space; instead, its defaults are tuned for more real-world-like workloads. @mstoodle

While the key method computeRow seems to have been compiled to its fullest (scorching) level, the time it took to decide that the method is super-hot may be a deciding factor. I guess there should be a tunable to match performance for this workload, though I don't know what it is. @0xdaryl

+ (hot) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB312FF4-00007FE9BB314178 MethodInProgress - Q_SZ=17 Q_SZI=17 QW=105 j9m=00000000026A1580 bcsz=250 DLT@60 compThread=0 CpuLoad=100%(100%avg) JvmCpu=98%
+ (warm) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB3141C0-00007FE9BB314548 OrdinaryMethod - Q_SZ=16 Q_SZI=16 QW=104 j9m=00000000026A1580 bcsz=250 GCR compThread=0 CpuLoad=100%(100%avg) JvmCpu=98%
+ (profiled very-hot) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB316858-00007FE9BB318118 OrdinaryMethod 96.70% T Q_SZ=0 Q_SZI=0 QW=100 j9m=00000000026A1580 bcsz=250 compThread=0 CpuLoad=100%(100%avg) JvmCpu=100%
+ (scorching) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FE9BB318160-00007FE9BB318C28 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=100 j9m=00000000026A1580 bcsz=250 compThread=0 CpuLoad=100%(100%avg) JvmCpu=100%

The only highlight I can see is that it was first DLT (Dynamic Loop Transfer) compiled before graduating to the normal optimization flow. Let me test with this turned off.

@gireeshpunathil
Contributor

ah! what is the option to turn off dlt?

@babsingh
Contributor

babsingh commented Sep 18, 2017

Disabling DLT [-Xjit:disableDynamicLoopTransfer] doesn't help. The performance issue can be resolved by using the following JVM cmdline option: -Xjit:optLevel=scorching.

Zulu9 Results:

source run.sh "../zulu9_jdk9.0.0/bin/java"
using jre: ../zulu9_jdk9.0.0/bin/java

real	0m24.478s
user	0m24.522s
sys	0m0.023s

OpenJ9 Results: [~2 seconds slower]

source run.sh "../jdk-9+181/bin/java"
using jre: ../jdk-9+181/bin/java

real	0m26.266s
user	0m26.294s
sys	0m0.178s

OpenJ9 (and "-Xjit:disableDynamicLoopTransfer") Results: [~2 seconds slower]

source run.sh "../jdk-9+181/bin/java -Xjit:disableDynamicLoopTransfer"
using jre: ../jdk-9+181/bin/java -Xjit:disableDynamicLoopTransfer

real	0m26.258s
user	0m26.266s
sys	0m0.184s

OpenJ9 (and "-Xjit:optLevel=scorching") Results: [Comparable to Zulu9]

source run.sh "../jdk-9+181/bin/java -Xjit:optLevel=scorching"
using jre: ../jdk-9+181/bin/java -Xjit:optLevel=scorching

real	0m23.676s
user	0m24.001s
sys	0m0.310s


@gireeshpunathil
Contributor

@babsingh - great result, thanks!
So what is the bottom line here: is the benchmark too tiny for the tr.jit to take advantage of runtime profiling - i.e., the time needed to learn the method's characteristics is a large fraction of the method's total run time? Or is it something else? Are there other tunables, not method-specific, that can influence the compilation decisions - for example, the profiling frequency?

@markehammons
Author

I reran the bench with -Xgcpolicy:metronome to make sure #42 wasn't adding to the problem; the results were a little slower, so the two issues have no overlap.

@andrewcraik
Contributor

So -Xjit:optLevel=scorching is not the same as the method being compiled to scorching by itself. The very-hot profiled step is a special compile that adds runtime instrumentation to the method body, which is then exploited by the scorching compile. When you set optLevel=scorching you bypass this step, so optLevel=scorching ends up very similar to optLevel=hot.

I agree that, with the benchmark running for around 25s, the profile will be dominated by JIT start-up - doing the profiling etc. takes time. We don't normally try to tune compilation aggressiveness on the command line. These test cases are ideal for helping make the compilation heuristics better. Gathering a runtime profile and a steady-state profile with a tool like perf can help show which methods are running and where opportunities might exist.
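
For instance, something along these lines might work (a sketch only; the jdk path and benchmark arguments simply mirror the runs above, and resolving JIT-compiled frames to Java method names typically needs an additional perf map/agent):

perf record -F 99 -g -- ../jdk-9+181/bin/java mandelbrot 16000
perf report --stdio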

Another test which may be worth doing is to fix the heap size for both JDKs and compare the performance. The default initial heap is 8MB with a max of 512MB. A verbose GC log could help show whether GC time is an issue for J9, and fixing the heap to something like 256MB may reduce heap overhead (fixing the heap means using -Xmx256M -Xms256M). Dynamically growing the heap is an expensive operation, and J9 tries to keep the heap as small as possible; this comes at the penalty of more garbage collection happening before the heap initially grows, which can slow down startup for J9 in some scenarios. An example invocation is sketched below.
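
For example, reusing the run.sh wrapper from the earlier runs (a sketch only; the heap values are the ones suggested above, and -verbose:gc prints the GC log):

source run.sh "../jdk-9+181/bin/java -Xms256M -Xmx256M -verbose:gc"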

@charliegracie
Contributor

I am not that familiar with Zulu9 and its defaults for things like the heap. @andrewcraik is correct that the heap settings may be making the 1.5-2 second difference here. Some verbose:gc logs would definitely help identify whether some small tweaks to the heap settings could improve the overall performance.

@mstoodle
Contributor

mstoodle commented Sep 18, 2017

Some thoughts; I haven't had a chance to look very deeply at it. This 2 second gap is actually on the order of 8% slower.

Note that this benchmark is mostly focused in one method (computeRow) and does a lot of stuff with doubles. It does some file output in the hot method and, because of the buffer size (8192) as well as the problem size provided (16000), that code is active in the measurement and so the JIT inlines it (and quite aggressively at scorching, sigh). The main inner loop code looks reasonably consolidated, though I didn't look to make sure we weren't missing optimization opportunities. Double math isn't something I would say our JIT is highly tuned for although we do a lot of the usual kinds of things, so that might explain (some of?) the difference. Instruction selection could be another possible place to look for this 8% gap.

It might be worth comparing the times when the buffer only gets written at the end (i.e. lowering that 16000 argument to something like 8000) just to see what happens on both sides.

The main method doesn't look like it's allocating lots of objects, other than maybe via that PrintStream object, so I'm dubious GC will affect perf much, but I've been wrong before :) (or maybe :( ).

@gireeshpunathil
Contributor

Did some more testing based on feedback from @andrewcraik and @charliegracie.

openjdk seems to use a 32M initial heap, so there is some difference there. A few GCs occurred in the j9 run, while there were none in the openjdk case. I used the same heap settings for j9 as well and got some improvement, but not on the scale of seconds.

Looking at the verbose:gc log, I see the GC effort was insignificant:

bash$ grep "durationms" j9.log | awk -F' ' '{print $4 $5}'
contextid="4"durationms="1.603"
durationms="3.647"/>
contextid="16"durationms="1.533"
durationms="1.681"/>
contextid="28"durationms="2.073"
durationms="2.222"/>
contextid="40"durationms="1.778"
durationms="1.923"/>
contextid="52"durationms="1.962"
durationms="2.120"/>

amounting to only ~20 ms.

So back to the JIT. I tried what @mstoodle suggested: with 8K as the input, j9 is still lagging:

bash$ ./run.sh openjdk
usingjre: java

real	0m5.393s
user	0m5.340s
sys	0m0.030s
bash$ ./run.sh j9jdk
usingjre: /root/openj9-openjdk-jdk9/build/linux-x86_64-normal-server-release/jdk/bin/java

real	0m6.578s
user	0m6.470s
sys	0m0.080s
bash$

If I increase the input count, the difference also increases, proportionately.

@markehammons
Author

@charliegracie Zulu9 should be openjdk9 for the most part. I don't know for sure if there's any difference between the two other than that Azul built and packaged this version of openjdk9.

@vijaysun-omr
Contributor

@gireeshpunathil I wonder if there is a difference between the scorching compile of the method that we got via specifying optLevel=scorching vs. getting to scorching via the JIT's upgrade mechanism without fixing the opt level.

The other possibility is that the time it takes us to get up to scorching is why we see a difference; is it possible to run the test case for longer (inside a harness or with a larger input) and see if the difference is still the same?

@babsingh
Contributor

Here are perf statistics that may help to better understand OpenJ9 behaviour. With optLevel=scorching, the number of instructions drops by ~20% and cache usage roughly doubles. It does seem like the JIT is compiling/optimizing quicker with optLevel=scorching.

OpenJ9 default settings:

Performance counter stats for '../jdk-9+181/bin/java mandelbrot 16000':

    93,228,726,387      cycles                                                        (60.01%)
   147,595,639,195      instructions              #    1.58  insn per cycle           (80.00%)
        19,814,955      cache-references                                              (80.00%)
         1,862,484      cache-misses              #    9.399 % of all cache refs      (80.01%)
     2,543,880,869      bus-cycles                                                    (79.99%)

      25.896081203 seconds time elapsed

OpenJ9 and -Xjit:optLevel=scorching:

 Performance counter stats for '../jdk-9+181/bin/java -Xjit:optLevel=scorching mandelbrot 16000':

    86,358,957,861      cycles                                                        (60.01%)
   121,699,582,700      instructions              #    1.41  insn per cycle           (80.00%)
        33,944,092      cache-references                                              (79.99%)
         4,356,769      cache-misses              #   12.835 % of all cache refs      (80.01%)
     2,364,621,279      bus-cycles                                                    (79.99%)

      23.639411169 seconds time elapsed

Zulu9 default settings:

 Performance counter stats for '../zulu9_jdk9.0.0/bin/java mandelbrot 16000':

    86,471,100,248      cycles                                                        (59.98%)
    76,977,227,571      instructions              #    0.89  insn per cycle           (80.00%)
        16,489,455      cache-references                                              (80.00%)
         2,721,538      cache-misses              #   16.505 % of all cache refs      (80.02%)
     2,384,954,480      bus-cycles                                                    (80.00%)

      24.536771062 seconds time elapsed

@andrewcraik
Contributor

The optLevel=scorching setting cuts out all recompilation activity, which saves instructions and cache references. For small micro-benchmarks aggressive compilation can work well, but for large applications it will significantly affect how fast the application starts. I think we should focus on the question asked by @vijaysun-omr: can we run it longer to see what the steady state is like without the startup? Possibly put the Mandelbrot calculation in a loop and calculate n Mandelbrots to the same level one after the other, timing each one as an iteration of the benchmark (see the sketch below). This will remove startup and ramp-up and compare steady-state code quality, which will show how much of the delta you are seeing is due to start-up effects from the GC and JIT and how much is related to steady-state code quality.
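
A minimal sketch of such a harness (assuming the benchmark's computation can be factored into a callable method; the class and method names here are illustrative, not the benchmark's actual API):

// Hypothetical steady-state harness: run the same Mandelbrot computation
// several times in one JVM and time each iteration separately, so the later
// iterations reflect steady-state code quality rather than JIT/GC ramp-up.
public class MandelbrotSteadyStateHarness {
    public static void main(String[] args) {
        int size = Integer.parseInt(args[0]);   // e.g. 16000
        int runs = Integer.parseInt(args[1]);   // e.g. 10
        for (int i = 1; i <= runs; i++) {
            long start = System.nanoTime();
            computeMandelbrot(size);            // placeholder for the real work
            double elapsed = (System.nanoTime() - start) / 1e9;
            System.out.printf("Run %d: elapsed time = %.9f seconds%n", i, elapsed);
        }
    }

    // Stand-in for the benchmark's computation, assumed to be refactored out of
    // its main() and to write its output to a throwaway buffer.
    private static void computeMandelbrot(int size) {
        // ... actual Mandelbrot computation goes here ...
    }
}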

@mstoodle
Contributor

mstoodle commented Sep 18, 2017

One thing I noticed running in a docker container on my Mac (so probably not the best measurement environment) is that the very-hot-with-profiling compile is the last one, which isn't exactly ideal. Running with java -Xjit:quickProfile, I saw a 3 second improvement (not that I'm recommending people use that option as the "right" approach... just trying to diagnose the issue).

@mstoodle
Contributor

mstoodle commented Sep 18, 2017

I ran for 64000, which allowed me to collect a full log for the benchmark (including very hot with profiling and scorching compiles). The log generated by '-Xjit:{*computeRow*}(log=log.dmp,traceFull)' is here, if anyone wants to peruse (gzip file):
https://app.box.com/v/jitLog-ComputeRow

@babsingh
Contributor

@andrewcraik In regard to "put the Mandelbrot calculation in a loop and calculate n Mandelbrots to the same level one after the other": I gathered some perf statistics. The perf issue does not resolve in steady state.

OpenJ9 default setting:

Run 1: elapsed time = 26.043635038 seconds
Run 2: elapsed time = 25.965893799 seconds
Run 3: elapsed time = 25.781609066 seconds
Run 4: elapsed time = 25.781954754 seconds
Run 5: elapsed time = 25.592928397 seconds
Run 6: elapsed time = 26.038676642 seconds
Run 7: elapsed time = 26.393416152 seconds
Run 8: elapsed time = 25.398758643 seconds
Run 9: elapsed time = 25.333281808 seconds
Run 10: elapsed time = 25.285969633 seconds

   923,707,151,350      cycles                                                      
 1,458,786,766,267      instructions              #    1.58  insn per cycle         
        71,725,532      cache-references                                            
        20,272,118      cache-misses              #   28.263 % of all cache refs    
    25,401,176,439      bus-cycles                                                  

     257.885232042 seconds time elapsed

Zulu9 default setting:

Run 1: elapsed time = 23.418343667 seconds
Run 2: elapsed time = 23.415317452 seconds
Run 3: elapsed time = 23.479185654 seconds
Run 4: elapsed time = 23.468103129 seconds
Run 5: elapsed time = 23.434137855 seconds
Run 6: elapsed time = 23.379058791 seconds
Run 7: elapsed time = 23.422241448 seconds
Run 8: elapsed time = 23.384524801 seconds
Run 9: elapsed time = 23.372130895 seconds
Run 10: elapsed time = 23.389335308 seconds

   853,837,566,489      cycles                                                      
   762,939,672,607      instructions              #    0.89  insn per cycle         
        29,690,956      cache-references                                            
         5,070,921      cache-misses              #   17.079 % of all cache refs    
    23,209,424,461      bus-cycles                                                  

     234.273548669 seconds time elapsed

OpenJ9 + -Xjit:optLevel=scorching:

Run 1: elapsed time = 22.926882356 seconds
Run 2: elapsed time = 22.745778744 seconds
Run 3: elapsed time = 22.747955596 seconds
Run 4: elapsed time = 22.731840778 seconds
Run 5: elapsed time = 22.747453338 seconds
Run 6: elapsed time = 22.983480399 seconds
Run 7: elapsed time = 22.814466641 seconds
Run 8: elapsed time = 22.795242145 seconds
Run 9: elapsed time = 22.755443293 seconds
Run 10: elapsed time = 22.701920895 seconds

   831,475,814,808      cycles                                                      
 1,178,343,568,845      instructions              #    1.42  insn per cycle         
        64,910,063      cache-references                                            
         7,169,076      cache-misses              #   11.045 % of all cache refs    
    22,621,426,474      bus-cycles                                                  

     228.232725858 seconds time elapsed

OpenJ9 + -Xjit:quickProfile:

Run 1: elapsed time = 22.958591357 seconds
Run 2: elapsed time = 22.971043626 seconds
Run 3: elapsed time = 22.931530618 seconds
Run 4: elapsed time = 22.855747613 seconds
Run 5: elapsed time = 22.856325383 seconds
Run 6: elapsed time = 22.941845896 seconds
Run 7: elapsed time = 22.851320409 seconds
Run 8: elapsed time = 22.845156092 seconds
Run 9: elapsed time = 22.907897378 seconds
Run 10: elapsed time = 22.86562445 seconds

   836,039,408,770      cycles                                                      
 1,226,470,282,710      instructions              #    1.47  insn per cycle         
        70,846,081      cache-references                                            
         6,657,698      cache-misses              #    9.397 % of all cache refs    
    22,731,595,492      bus-cycles                                                  

     229.206880038 seconds time elapsed

@babsingh
Contributor

@andrewcraik I also tried this: increased input_size from 16000 to 80000 (5x), which gives a ~25x increase in elapsed time (from ~25s to ~600s). A ~5-10% regression is still observed between default OpenJ9 and Zulu9. The effects of start-up/ramp-up should have disappeared after increasing the benchmark's run time, so the perf issue still exists in steady state. The performance issue is resolved with -Xjit:quickProfile and -Xjit:optLevel=scorching; with these options, OpenJ9 performs ~3-4% better than Zulu9.

OpenJ9 default settings:

Performance counter stats for '../jdk-9+181/bin/java mandelbrot 80000':

 2,296,297,923,507      cycles                                                      
 3,643,778,763,097      instructions              #    1.59  insn per cycle         
       135,753,939      cache-references                                            
        20,346,417      cache-misses              #   14.988 % of all cache refs    
    62,645,370,306      bus-cycles                                                  

     636.138118316 seconds time elapsed

Zulu9 default settings:

 Performance counter stats for '../zulu9_jdk9.0.0/bin/java mandelbrot 80000':

 2,131,810,460,200      cycles                                                      
 1,906,522,418,148      instructions              #    0.89  insn per cycle         
        65,875,157      cache-references                                            
        29,427,958      cache-misses              #   44.672 % of all cache refs    
    58,660,310,634      bus-cycles                                                  

     596.293777171 seconds time elapsed

OpenJ9 + -Xjit:optLevel=scorching:

 Performance counter stats for '../jdk-9+181/bin/java -Xjit:optLevel=scorching mandelbrot 80000':

 2,076,905,640,417      cycles                                                      
 2,945,923,168,612      instructions              #    1.42  insn per cycle         
       199,875,814      cache-references                                            
        56,213,121      cache-misses              #   28.124 % of all cache refs    
    57,246,606,156      bus-cycles                                                  

     580.659259508 seconds time elapsed

OpenJ9 + -Xjit:quickProfile:

 Performance counter stats for '../jdk-9+181/bin/java -Xjit:quickProfile mandelbrot 80000':

 2,069,668,721,467      cycles                                                      
 2,950,008,303,349      instructions              #    1.43  insn per cycle         
       133,625,782      cache-references                                            
        34,583,084      cache-misses              #   25.881 % of all cache refs    
    56,683,085,417      bus-cycles                                                  

     577.466478655 seconds time elapsed

@andrewcraik
Contributor

hmm that is an interesting finding - it seems like there is opportunity for further compiles at higher opt levels, or for further optimization somehow. Getting a verbose compilation log with and without quickProfile would show the methods getting to higher opt levels (the option would be to add verbose={comp*,sampling},vlog=/path/to/log/file to the JIT command line, as sketched below). My initial thought is that there must be a subset of methods that are not being escalated to higher opt levels in the default configuration but that would benefit from it. @mpirvu any thoughts?
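
For example (a sketch reusing the run.sh wrapper from the earlier runs; the vlog paths are just placeholders):

source run.sh "../jdk-9+181/bin/java -Xjit:verbose={comp*,sampling},vlog=/tmp/vlog_default.txt"
source run.sh "../jdk-9+181/bin/java -Xjit:quickProfile,verbose={comp*,sampling},vlog=/tmp/vlog_quickProfile.txt"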

@babsingh
Contributor

@andrewcraik Ran with input_size == 48000. JIT verbose logs:
i) without quickProfile jitlog_no_quickProfile.20170918.161140.24341.txt
ii) with quickProfile jitlog_quickProfile.20170918.161529.24513.txt

@vijaysun-omr
Contributor

The instructions-per-cycle figure is markedly higher for us than for Zulu9. I wonder if there is a better choice of x86 instructions to do what the core loop, which is almost all double arithmetic, is doing; i.e. maybe one or more operations in the loop can be fused. @0xdaryl and @andrewcraik, thoughts? You can see the core loop in the log attached in an earlier comment.

@markehammons
Author

I don't know if it will be helpful to you guys at this point, but I added this bench (and a parallel version) to my project here.

It uses JMH, so you can launch this specific bench with sbt "jmh:run -i 10 -t numberOfThreads -f1 -jvm /home/mhammons/jdk-9+181/bin/java benchmarksgame.mandelbrot2". I'm still running an initial run, but I'm getting the same results as @babsingh with 5 warmup iterations of 2 minutes each with 8 instances of the bench running simultaneously.

@mpirvu
Contributor

mpirvu commented Sep 18, 2017

Looking at the Java code, the vast majority of the time should be spent in this method: private void computeRow(int y). It does reach scorching, so the performance is dictated by how fast it can get to scorching and by the quality of the generated code.
Looking at the vlogs posted by @babsingh, for the non-quickProfile case we see
(scorching) Compiling mandelbrot$Mandelbrot.computeRow(I)V OrdinaryMethod j9m=00000000010DD628 **t=454**...
+ (scorching) mandelbrot$Mandelbrot.computeRow(I)V @ 00007FA9FF7497C0-00007FA9FF74A428 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=100 j9m=00000000010DD628 bcsz=250 **time=23283us** ...
so after (454+23)=477ms we have a scorching body. For something that lasts close to 4 minutes this is not that bad.

For the quickprofile case the scorching body is available after 486ms (very close to the previous case).
Based on this alone I am going to guess that the quality of the generated code is the more likely cause of the perf drop.

@andrewcraik
Contributor

So, looking at the core loop (which in the attached log was loop 5 with frequency 10000), there is a pattern of repeated double-width adds, subtracts, and multiplies. I can see one obvious silliness that is happening. Consider the following from the evaluation of tree root n94n:
[0x7f9b1e0eb670] subsd FPR_1713, FPR_1712
[0x7f9b1e0eb710] addsd FPR_1713, FPR_1716
[0x7f9b1e0eb840] movsd FPR_1719, FPR_1713 <--- don't need this could keep using 1713 in mulsd
[0x7f9b1e0eb8e0] mulsd FPR_1719, FPR_1713
[0x7f9b1e0ebb60] movsd FPR_1720, qword ptr [FPRCONSTANT]
[0x7f9b1e0ebc00] mulsd FPR_1715, FPR_1720
[0x7f9b1e0ebca0] mulsd FPR_1714, FPR_1715
[0x7f9b1e0ebd40] addsd FPR_1714, FPR_1717
[0x7f9b1e0ebe70] movsd FPR_1721, FPR_1714 <--- don't need this could keep using 1714 in mulsd
[0x7f9b1e0ebf10] mulsd FPR_1721, FPR_1714
[0x7f9b1e0ec040] movsd FPR_1722, FPR_1719 <--- don't need this could keep using 1719 in addsd
[0x7f9b1e0ec0e0] addsd FPR_1722, FPR_1721
[0x7f9b1e0ec360] movsd FPR_1723, qword ptr [FPRCONSTANT]
[0x7f9b1e0ec400] ucomisd FPR_1722, FPR_1723

The two floating point constants are 2 and 4 respectively, so we might be able to do something better there. There also looks to be an opportunity to exploit the VFMADD/VFMADDSUB instructions to reduce the instruction count further - there may also be scope for a packed VFMADD/VFMADDSUB. These newer fused instructions are not currently supported by the OpenJ9 x86 code generator.
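
For illustration, here is a sketch (assuming the inner loop follows the standard benchmarks-game mandelbrot form, which the instruction sequence above appears to implement) of how that arithmetic could be expressed with fused multiply-adds via java.lang.Math.fma, available since Java 9. This shows the shape of computation a VFMADD-style instruction would serve; it is not a claim about the benchmark's actual source:

// Illustration only: one escape-time iteration of the standard mandelbrot
// update written with fused multiply-adds. The 2.0 and 4.0 constants match
// the two FP constants noted above.
static boolean mandelbrotStep(double[] z, double cr, double ci) {
    double zr = z[0], zi = z[1];
    double zrNext = Math.fma(zr, zr, -zi * zi) + cr; // zr' = zr*zr - zi*zi + cr
    double ziNext = Math.fma(2.0 * zr, zi, ci);      // zi' = 2*zr*zi + ci
    z[0] = zrNext;
    z[1] = ziNext;
    // escape test: zr'^2 + zi'^2 > 4.0
    return Math.fma(zrNext, zrNext, ziNext * ziNext) > 4.0;
}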

At a higher level this loop is counted and has an early termination condition with the floating point test. The loop is unrolled as uncounted but could be unrolled as counted since we know it will run either 50 times or until the floating point condition is satisfied. The unroller could do a better job moving the induction variable updates out of the way, but they are probably small compared to the floating point cost.

@andrewcraik
Contributor

I would add that VFMADD/VFMADDSUB are three-operand instructions, so getting three-operand support into the code gen would be the first step toward exploiting these operations.

@andrewcraik
Contributor

I may have to correct my movsd comments - it looks like there are future uses so they may be valid.

@vijaysun-omr
Contributor

I'm thinking

"At a higher level this loop is counted and has an early termination condition with the floating point test. The loop is unrolled as uncounted but could be unrolled as counted since we know it will run either 50 times or until the floating point condition is satisfied. The unroller could do a better job moving the induction variable updates out of the way, but they are probably small compared to the floating point cost."

is probably worth pursuing next, since the delta is small enough and this loop is so tight that those couple of instructions could be a non-trivial contributor. Plus, the optLevel=scorching run being faster than Zulu9 suggests that we don't necessarily need to solve the floating point issues to get to that point (though we should open an issue to also improve the floating point handling here).

@gireeshpunathil
Contributor

let me see if I can:

  • get the generated code with forced scorching and with natural graduation
  • get the generated code for zulu

@vijaysun-omr
Contributor

I do not want us to look at zulu9 generated code. We should pursue the above lead regarding unrolling instead.

@gireeshpunathil
Contributor

@vijaysun-omr, thanks. My rationale was that if we can compare at higher levels, we can compare at any level in pursuit of getting to the bottom of the issue. But sure, I will follow your judgement.

@mstoodle mstoodle added the perf label Sep 20, 2017
@andrewcraik
Contributor

@markehammons This should now be resolved.

The change was a fix to the BlockSplitter optimization - it was setting an incorrect frequency on a basic block that it generated during optimization, which disrupted the layout of the loop and subsequent loop optimizations.

With the OMR fix included in OpenJ9 you should see a performance boost in line with the scores reported with -Xjit:quickProfile or optLevel=scorching.

@markehammons
Author

@andrewcraik thanks, I'll rebench and confirm this weekend!

@markehammons
Author

markehammons commented Sep 29, 2017

I lied, I did it last night. According to my benchmarks, OpenJ9 on default settings now beats Zulu9 on this bench, with its best-case performance being 638ms faster, its average case 617ms faster, and its worst-case performance 705ms faster!

benchmarksgame-jmh.zip

good job on optimizing this!

Edit: I should note that I didn't use any forks in this bench, and I should probably do 3 to get stronger results. A rerun with all the shootout benches I've ported to JMH has Zulu9 performing nearly as well as OpenJ9's best case and worst case, but still losing out by about 100ms.

@mstoodle
Contributor

Thanks for circling back for us, @markehammons!

@mstoodle
Contributor

@markehammons is there anything left to do in this issue, in your opinion, or do you think we can close it off?

@pshipton
Member

Closing, please re-open if there is still an issue.

@mstoodle
Contributor

mstoodle commented Feb 27, 2018

Hi @markehammons! We haven't heard from you in a while :) I know this issue has been closed for a long time, but I just wanted to reach out to see if you're still using OpenJ9, and to ask whether you have any further feedback on your experiences with either the JVM or the project itself.

We have a weekly web meeting (using Zoom) on Wednesdays at 11am EST. We would be very interested to hear your feedback, good or not, on that call if you would be willing to join us. The Zoom link is usually posted in our #planning Slack channel the morning of the meeting (it is very easy to join; please see the "What's new" page at https://www.eclipse.org/openj9/oj9_whatsnew.html).

@markehammons
Author

I'll join. I have been very busy with other projects and haven't had much chance to test openj9 recently, but I'll still try to help out.
