Non-deterministic NPEs when running Clojure #12191
Comments
Given the intermittency, it may be worth running without the JIT. We may have a bug in the JIT compiler that introduces such intermittency. You can run without the JIT by adding |
Cannot reproduce it with |
It is likely JIT then, but not conclusive. What is the reproducibility rate on your hardware? Does it fail 1/10 for example or 1/100, etc.? That will dictate how the investigation should proceed. I'll also try and spin up a VM and see if I can reproduce using your instructions above. That may greatly speed up the investigation. |
0/1000 on -Xint, ~5/21 on JIT |
Thanks! So 1/4 roughly. That's fairly good to investigate. I'm going to try and reproduce on a VM now and see if I can get the same reproduction rate. |
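For anyone who wants to measure a reproduction rate like the 5/21 figure above on their own hardware, here is a minimal sketch of a counting harness. The class and method names (`FailureRate`, `countFailures`, `command`) are hypothetical and not part of this issue; the command to run is supplied by the caller, and a run is counted as a failure if it exits non-zero or throws.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

public class FailureRate {
    /** Runs the trial `runs` times; counts runs that return non-zero or throw. */
    public static int countFailures(int runs, Callable<Integer> trial) {
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            try {
                if (trial.call() != 0) {
                    failures++;
                }
            } catch (Exception e) {
                failures++;
            }
        }
        return failures;
    }

    /** Wraps an external command (supplied by the caller) as a trial returning its exit code. */
    public static Callable<Integer> command(String... cmd) {
        return () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: java FailureRate <runs> <command> [args...]");
            return;
        }
        int runs = Integer.parseInt(args[0]);
        String[] cmd = Arrays.copyOfRange(args, 1, args.length);
        int failures = countFailures(runs, command(cmd));
        System.out.printf("%d/%d runs failed (%.1f%%)%n", failures, runs, 100.0 * failures / runs);
    }
}
```

Running the same harness once with the JIT enabled and once with it disabled gives the two rates to compare.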
It's frequent enough that new Clojure users using J9 run into it nearly immediately |
There are newer builds you could try to see if the issue is resolved. The Java 15 support cycle is done and Java 16 is replacing it shortly. Java 16 on 0.25.0 m2 https://github.com/AdoptOpenJDK/openjdk16-binaries/releases/tag/jdk-16%2B36_openj9-0.25.0-m2
Java 16 on 0.26.0 m1 https://github.com/AdoptOpenJDK/openjdk16-binaries/releases/tag/jdk-16%2B36_openj9-0.26.0-m1
|
I used Peter's second link above and can reproduce an NPE which happens much earlier in the run if I force JIT compilation on one specific method at no optimization:
It is deterministically reproducible. If I exclude this method we progress but get an NPE in a similar method. Here is the command line dump for reference:
I've confirmed when we run with |
Here are the first couple of stack slots which are all interpreted from the above core dump:
I guess we would have to figure out what bytecode index 542 in this top method is doing and how it possibly relates to the JIT compiled method. @ghadishayban any ideas on this? Subscribing @0xdaryl as this could very well be an x86 codegen issue given it reproduces at |
Bytecode 542 in
|
tried it on
same issues. |
Do you have any idea how this relates to |
If I set jit count=0 and compile only reduce1::invokeStatic, it's 100% reproducible
|
I used @ghadishayban's command above to confirm:
This generates a javacore, JIT trace log, and system core dump when the exception occurs. Looking at the stackslots of the failed thread we see the following:
Then mapping the PC to the JIT log we can see:
The SIGSEGV occurred at offset 1F6 when we dereferenced
It looks like the receiver object was NULL here. This node
It's a bit odd because the bytecode seems to zero out symref #353 after the load, see node
Looking at the CFG of the method it looks like this: There is a backedge in Attaching the log for reference: |
|
Still fails even with disabling typical suspect things:
I'm not really sure how to proceed here. Given the above analysis there exists a path through which we can get the observed NPE, but the fact that it only happens when we JIT compile this one method is troublesome. There must be something wrong happening. @0xdaryl / @liqunl any ideas on how to proceed for this one? We basically have a test which deterministically fails with NPE at |
Will look at the trace log and reproduce on my vm |
Here is the trace of calling this synthesized bad method directly
results in an NPE:
Bytecode for the offending method:
|
@ghadishayban Thanks. Could you upload the class file? |
GitHub didn't accept the class file, so I delivered it in Slack. The script to create BadJump.class needs dependencies:
Here is the script:
compile and run it and it creates BadJump.class ...to which you can link this Runner class:
...which if you compile and run with
...results in an NPE and a trace. |
@liqunl pls create a PR for the v0.26.0-release branch as well. We may want to update the IBM 21_02 release branch as well if this affects Java 8. |
I tested the proposed patch with OpenJ9 16 + Clojure: seems to be resolved. Thanks everyone for diagnosing this tricky bug! |
This bug also manifests on 8 & 11, versions below @liqunl @pshipton
|
Thanks @ghadishayban for the unit test. That helped a lot! The fix delivered #12221 backports this to the 0.26 release which will be generated for Java 8, 11, and 16. We use the same JIT compiler for all of those JVM levels so the fix will be there in 0.26 releases. This release is scheduled for April 23, 2021: |
Hello kind folks,
I have an interesting issue with OpenJ9 running Clojure. Running Clojure on OpenJ9 throws non-deterministic NPEs, and this appears to be related to the "locals clearing" characteristic of Clojure's bytecode. Locals clearing is when a reference in a bytecode local variable is set to null using
aconst_null; astore X
before the last usage of that local variable (within a compilation unit, the function). I will explain how to repro, then I will explain what is happening in the specific bytecode.
Speculation: The compiler is applying an incorrect optimization of this locals clearing pattern across control flow / jumps.
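To make the pattern concrete, here is a rough Java analogue. It is a sketch only: `LocalsClearingSketch` and its `reduce1` are hypothetical stand-ins, not Clojure's actual compiler output, and this code is not expected to reproduce the failure. It shows the intended semantics: on the path where the function reference is used for the last time, the reference is copied, the local is nulled, and only then is the call made; on the looping path the local must stay intact because the next iteration still needs it.

```java
import java.util.List;
import java.util.function.BinaryOperator;

public class LocalsClearingSketch {
    // Hypothetical reduce-style function; names are illustrative only.
    public static <T> T reduce1(BinaryOperator<T> f, T val, List<T> coll) {
        for (int i = 0; i < coll.size(); i++) {
            T item = coll.get(i);
            if (i == coll.size() - 1) {
                // Final use of f: copy the reference, clear the local, then call.
                // Roughly: aload f; aconst_null; astore f; ...; invoke
                BinaryOperator<T> callee = f;
                f = null;
                return callee.apply(val, item);
            }
            // f is needed again on the next iteration, so it is not cleared here.
            val = f.apply(val, item);
        }
        return val; // empty collection: f was never used
    }

    public static void main(String[] args) {
        System.out.println(reduce1(Integer::sum, 0, List.of(1, 2, 3, 4))); // prints 10
    }
}
```

If a compiler wrongly treated the clearing store on the exit path as also applying to the looping path, `f` would be observed as null on a later iteration, which is the kind of miscompilation the speculation above describes.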
Env
Repro by building Clojure
git clone https://github.com/clojure/clojure.git
cd clojure
mvn -Plocal -Dmaven.test.skip=true package
then run the test until it fails:
Alternative repro
Install the Clojure CLI, then run:
When it fails non-deterministically, it outputs:
which points here.
This failure indicates that f is null, and it's trying to call (.invoke NULL val (first s)). f should never be null within this function.
Here is the annotated bytecode for this function: