avx512 features no longer detected in target images in v1.11 #56177
This code shouldn't have been allowed to load. @vchuravy |
For debugging this, before loading JLD2 do

ENV["JULIA_DEBUG"] = "loading"
using JLD2

You should see a debug line naming the cache file that was loaded. Then run the following (basically you need to replace the extension of that file with .ji):

Base.parse_image_targets(Base.parse_cache_header("/depot/compiled/v1.11/JLD2/bla_blah.ji")[7])

What do you get here? |
julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.11/JLD2/O1EyT_NIQbS.ji")[7])
1-element Vector{Base.ImageTarget}:
cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves) |
Ok, to be clear, you should do that test in the situation where you get the "wrong" JLD2 with the incompatible code. This image doesn't seem to have the avx512 feature (assuming JLD2 is indeed the offending package here). |
That's exactly what I did. In fact, the same .so file is reported on both machines, and the output is exactly the same. |
In 1.10, I get
julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.10/JLD2/O1EyT_NQjXZ.ji")[7])
1-element Vector{Base.ImageTarget}:
cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)
This does include avx512, so I guess something causes this to be stored incorrectly in 1.11. |
After some further testing, I found that the issue first appears in 1.11.0-rc4, with rc3 unaffected. The image targets reported by the two release candidates differ as well:
:~> julia +1.11.0-rc3 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)]
:~> julia +1.11.0-rc4 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)]
|
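To see exactly which features disappeared between two such outputs, the two `features_en` lists can be compared mechanically. The following is only a sketch: the helper names `features_of` and `missing_features` are made up for illustration, and it assumes the single-target output format shown in the rc3 and rc4 lines above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Julia): compare the features_en lists of
# two lines printed by Base.current_image_targets().

# Print the features of one target line, one per line, sorted.
features_of() {
    printf '%s\n' "$1" \
        | sed -n 's/.*features_en=(\([^)]*\)).*/\1/p' \
        | tr -d ' ' | tr ',' '\n' | LC_ALL=C sort
}

# Print the features present in the first line but absent from the second.
missing_features() {
    a=$(mktemp) && b=$(mktemp) || return 1
    features_of "$1" >"$a"
    features_of "$2" >"$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Called as `missing_features "$rc3_line" "$rc4_line"`, with each variable holding one of the printed lines, it lists only the features that were dropped.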
That's very interesting. Would you be able to run git bisect? This is the diff: v1.11.0-rc3...v1.11.0-rc4, there are only 35 commits between the two versions, but honestly at a quick glance I can't spot a change which would affect that. |
I don't have the time to git bisect right now, might be able to do it later, but I can confirm the avx512 features are also gone on skylake-avx512:
$ julia -E 'Base.current_image_targets()'
Base.ImageTarget[skylake-avx512; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, pku, sahf, lzcnt, prfchw, xsavec, xsaves)] |
From checking the automatic builds 50c1ea8 appears to be the first commit affected. |
It'd be very surprising if that was the commit affecting this, I can't see that affecting features detection: Lines 1735 to 1738 in 1f935af
For the record, the issue seems to be solved on
$ julia +1.10 -E 'Base.current_image_targets()'
Base.ImageTarget[znver3; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +1.11 -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sha, pku, shstk, gfni, vaes, vpclmulqdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd)]
$ julia +nightly -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +nightly -e 'using InteractiveUtils; versioninfo()'
Julia Version 1.12.0-DEV.1421
Commit d36417b8230 (2024-10-17 17:37 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 384 × AMD EPYC 9654 96-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores) |
That makes it sound like it might be capturing some values from the build machines and not correctly getting those from JULIA_CPU_TARGET (aka #54093)? |
Although note that loading (specifically staticdata.c) is supposed to reject loading a pkgimage that requires more features than are present on the current machine, even if loading.jl makes a mistake, to prevent issues like this. So there are multiple levels of errors and failures here. |
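The rejection rule described in the previous comment amounts to a subset check: an image is loadable only if every feature it requires is also available on the host. The sketch below illustrates that rule on comma-separated feature lists. It is not how staticdata.c implements the check, and the function name `features_subset` is invented for this example.

```shell
#!/bin/sh
# Illustration only: succeed if every feature in the comma-separated list $1
# (what the image requires) also appears in the list $2 (what the host has).
features_subset() {
    host=",$(printf '%s' "$2" | tr -d ' '),"
    for f in $(printf '%s' "$1" | tr ',' ' '); do
        case "$host" in
            *",$f,"*) ;;   # feature is available on the host
            *) printf 'missing feature: %s\n' "$f" >&2; return 1 ;;
        esac
    done
    return 0
}
```

With the lists from this thread, an image requiring `avx512f` would be rejected on a host whose list lacks it, which is the behaviour the loader is expected to enforce.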
That'd be znver2, not cascadelake, nor skylake-avx512 nor znver4, and if you compare the features on 1.11.0(-rc4) in #56177 (comment), #56177 (comment) and #56177 (comment) (I used two different clusters) the sets are all different (my skylake-avx512 has
The current setting of
|
Ok, with

git bisect reset
git bisect start
git bisect good v1.11.0-rc3
git bisect bad v1.11.0-rc4
git bisect run ./bisect.sh

and the following bisect.sh

#!/bin/bash
export JULIA_CPU_TARGET="generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1);x86-64-v4,-rdrnd,base(1)"
make cleanall
make -j
./julia -E 'Base.current_image_targets()' | grep avx512f

I confirmed that #55729 is indeed the culprit on the v1.11 release branch:
Note that setting
Note that I did this on an avx512 machine, which rules out bad caching properties of the CPU on the build machine: the issue happens regardless of what's the build machine. But this also doesn't reproduce on 255162c, the merge commit of #55729 on |
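One caveat with a bisect script like the one above is that a commit that fails to build is reported as bad, because the final `grep` finds nothing. `git bisect run` treats exit code 125 as "skip this commit", so a slightly more defensive variant can separate build failures from a genuinely missing feature. This is a sketch, not the script used in this thread, and the helper name `bisect_step` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper for a bisect script.
# $1: build command; $2: probe command whose output lists the image features.
bisect_step() {
    if ! sh -c "$1" >/dev/null 2>&1; then
        return 125   # build failed: ask `git bisect run` to skip this commit
    fi
    if sh -c "$2" 2>/dev/null | grep -qw avx512f; then
        return 0     # good: avx512f is reported
    fi
    return 1         # bad: avx512f is missing
}

# Intended use at the end of bisect.sh (left commented out here), after
# exporting JULIA_CPU_TARGET as in the original script:
# bisect_step "make cleanall && make -j" "./julia -E 'Base.current_image_targets()'"
# exit $?
```

The `-w` flag makes `grep` match `avx512f` only as a whole word. It is supported by GNU and BSD grep but is not required by POSIX.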
#55729 perhaps unwisely added precompiling |
That's indeed the issue! It's the call to |
This change by itself doesn't do anything significant on `master`, but when backported to the v1.11 branch it'll address #56177. However it'd be great if someone could tell _why_ this fixes that issue, because it looks very unrelated.
Co-authored-by: Ian Butterworth
This change by itself doesn't do anything significant on `master`, but when backported to the v1.11 branch it'll address #56177. However it'd be great if someone could tell _why_ this fixes that issue, because it looks very unrelated.
Co-authored-by: Ian Butterworth
(cherry picked from commit f36f342)
I guess this can be closed now that #56239 has been merged in the backports branch. |
I use julia on a heterogeneous compute cluster, consisting of many different nodes with different CPUs, but using a single shared file system. This has created problems before: In 1.10, precompilation is triggered each time a project is used on a different node. However, this was easily fixed by using separate projects. In 1.11 this no longer seems to be the case.

For example, let's say I create two projects, `env1` and `env2`. I load `env1` on my workstation (Intel Xeon W-2223), add a package (say JLD2) and precompile. Then I load `env2` on a compute node (AMD EPYC 7302) and add the same package. Despite the different CPU, no precompilation is triggered. Then, when I try to run some code, julia crashes on an invalid instruction.

The instruction in question appears to be `vpbroadcastq` from the AVX-512 instruction set, which, indeed, the Intel Xeon W-2223 supports and the AMD EPYC 7302 does not.

If I instead use `env2` on another node with e.g. an Intel Xeon E5-2698 v3, which also does not support AVX-512, the behaviour is different: precompilation is triggered, and no error is thrown.

If I start with `env2` on the compute node, everything works fine, until I use `env1` on my workstation, after which the same error occurs on the compute node.
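On Linux nodes, a quick way to confirm which machines in such a cluster actually advertise AVX-512 is to look at the `flags` line of `/proc/cpuinfo`. The helper below is a minimal sketch with an invented name, `cpu_has_flag`. It assumes the usual Linux cpuinfo layout and accepts an alternative file path so it can be tried on a saved copy.

```shell
#!/bin/sh
# Hypothetical helper: succeed if CPU flag $1 appears as a whole word in the
# first "flags" line of a cpuinfo file (default: /proc/cpuinfo, Linux only).
cpu_has_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qxF "$flag"
}
```

For example, `cpu_has_flag avx512f && echo yes` run on each node shows whether that node reports the foundation AVX-512 feature.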