
make arraymancer work with devel #505

Closed
ringabout opened this issue Apr 18, 2021 · 19 comments · Fixed by #542

Comments

@ringabout
Contributor

ringabout commented Apr 18, 2021

https://pipelines.actions.githubusercontent.com/ZRinn1OrR0LWxU3iWy1StaQcZRN2kXW9lHWwDJ3esfasrmfdRn/_apis/pipelines/1/runs/18644/signedlogcontent/8?urlExpires=2021-04-18T15%3A29%3A02.1515769Z&urlSigningMethod=HMACV1&urlSignature=ES5o3LHv4%2Bky4FEHjiSAzReQNwqIZDrkd2X%2B6tFYxcU%3D

2021-04-18T13:40:09.5027645Z                  from /usr/lib/gcc/x86_64-linux-gnu/7/include/x86intrin.h:48,
2021-04-18T13:40:09.5028566Z                  from /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:11:
2021-04-18T13:40:09.5031638Z /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c: In function ‘gebb_ukernel_int64_x86_AVX512_tensorZtest95operators95blas_37425’:
2021-04-18T13:40:09.5034234Z /usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h:282:1: error: inlining failed in call to always_inline ‘_mm512_setzero_si512’: target specific option mismatch
2021-04-18T13:40:09.5035235Z  _mm512_setzero_si512 (void)
2021-04-18T13:40:09.5035871Z  ^~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5036752Z /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:23985:9: note: called from here
2021-04-18T13:40:09.5037563Z   AB13_1 = _mm512_setzero_si512();
2021-04-18T13:40:09.5038016Z   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5039014Z In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:45:0,
2021-04-18T13:40:09.5040103Z                  from /usr/lib/gcc/x86_64-linux-gnu/7/include/x86intrin.h:48,
2021-04-18T13:40:09.5041044Z                  from /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:11:
2021-04-18T13:40:09.5042887Z /usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h:282:1: error: inlining failed in call to always_inline ‘_mm512_setzero_si512’: target specific option mismatch
2021-04-18T13:40:09.5043892Z  _mm512_setzero_si512 (void)
2021-04-18T13:40:09.5044319Z  ^~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5045218Z /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:23982:9: note: called from here
2021-04-18T13:40:09.5046020Z   AB13_0 = _mm512_setzero_si512();
2021-04-18T13:40:09.5046491Z   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5047498Z In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:45:0,
2021-04-18T13:40:09.5048610Z                  from /usr/lib/gcc/x86_64-linux-gnu/7/include/x86intrin.h:48,
2021-04-18T13:40:09.5049536Z                  from /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:11:
2021-04-18T13:40:09.5051321Z /usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h:282:1: error: inlining failed in call to always_inline ‘_mm512_setzero_si512’: target specific option mismatch
2021-04-18T13:40:09.5052304Z  _mm512_setzero_si512 (void)
2021-04-18T13:40:09.5052743Z  ^~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5053612Z /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c:23979:9: note: called from here
2021-04-18T13:40:09.5054412Z   AB12_1 = _mm512_setzero_si512();
2021-04-18T13:40:09.5054871Z   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
2021-04-18T13:40:09.5055686Z compilation terminated due to -fmax-errors=3.
2021-04-18T13:40:09.5058447Z Error: execution of an external compiler program 'gcc -c  -w -fmax-errors=3   -I/home/runner/work/Nim/Nim/lib -I/home/runner/work/Nim/Nim/pkgstemp/arraymancer/tests -o /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c.o /home/runner/.cache/nim/tests_cpu_d/@mtensor@stest_operators_blas.nim.c' failed with exit code: 1

and https://github.com/nim-lang/Nim/runs/2374356299

@auxym
Contributor

auxym commented Dec 20, 2021

Currently getting the same error with Nim 1.4.8, 1.6.0, 1.6.2...

Is there a known workaround?

Running Arch Linux; the error happens whether I use --d:blas=cblas or not. Example program that causes the error:

import arraymancer
let d = [[1, 2, 3], [4, 5, 6]].toTensor()
let x = d * d

git bisect gave me this commit as the culprit: d6b7984

Which isn't particularly recent. Running an i5-4310U cpu, which doesn't have avx512. Something going wrong with cpu detection maybe?

@auxym
Contributor

auxym commented Dec 20, 2021

auxym@48b721e

Tracked it down a bit. This will compile AND print "BRANCH 2" on my PC. This shows that cpuinfo seems to be working correctly. HOWEVER, if we change the 1st branch to dispatch(x86_AVX512), then compiling fails.

I'm not sure what's going on. Maybe a Nim compiler bug leading to the 1st branch being codegen'd even though it shouldn't?

@Vindaar
Collaborator

Vindaar commented Dec 20, 2021

Thanks for digging into this. I've seen the error myself, but never encountered it in practice. I never attempted to fix it, because I was hoping it'd be a trivial fix for mratsim. At the same time, I simply lack experience dealing with stuff like AVX, and the error message is rather opaque.

Your git bisect and further digging is really appreciated though! Maybe that's enough for me to figure out a solution.

@SteadBytes

I've been able to reproduce this issue on an i7-8665U CPU, which also does not support AVX-512, and have also been able to reproduce the behaviour from @auxym's patch. I've done some further digging but have come up empty so far 😓. I may well be wrong here, but the error messages seem to suggest that gcc is rejecting compilation of the AVX-512-specific C code that Nim generates (even though it wouldn't actually be called at runtime):

/home/ben/.cache/nim/t_d/@mt.nim.c:1976:18: note: called from here
 1976 |         AB13_0 = _mm512_setzero_si512();

That function in particular:

/* Create vectors with repeated elements */

static  __inline __m512i __DEFAULT_FN_ATTRS512
_mm512_setzero_si512(void)
{
  return __extension__ (__m512i)(__v8di){ 0, 0, 0, 0, 0, 0, 0, 0 };
}

@Vindaar
Collaborator

Vindaar commented Dec 22, 2021

Thanks for looking into it @SteadBytes as well!

Just dug myself a bit: If I pass -mavx512dq to the C compiler it compiles successfully.

@auxym
Contributor

auxym commented Dec 22, 2021

Great news! Can confirm that --passC:-mavx512dq does work. Thanks for figuring that out.

Out of curiosity, why? Wouldn't that enable avx512 extensions? How does that work on a cpu without avx512?

Next step, what do we do about it? Document it? Can it be added automatically to build flags when arraymancer is imported? Does that limit arraymancer to usage on x86 and/or gcc compiler only?

@SteadBytes

SteadBytes commented Dec 22, 2021

Fantastic - nice find @Vindaar!

@auxym I think what's happening is that passing -mavx512dq allows GCC to generate the AVX-512 code regardless of whether the native platform supports it. Things would then go badly if those code paths were hit at runtime which I think may be the case here.

diff --git a/arraymancer.nimble b/arraymancer.nimble
index 3a021e9..93448cf 100644
--- a/arraymancer.nimble
+++ b/arraymancer.nimble
@@ -188,7 +188,7 @@ task all_tests, "Run all tests - Intel MKL + Cuda + OpenCL + OpenMP":
 #   test "tests_cpu_remainder", switch, split = true

 task test, "Run all tests - Default BLAS & Lapack":
-  test "tests_cpu", "", split = false
+  test "tests_cpu", " --passC:-mavx512dq", split = false

 task test_arc, "Run all tests under ARC - Default BLAS & Lapack":
   test "tests_cpu", "--gc:arc", split = false
nimble test
...
Hint: /usr/src/app/tests/ml/test_metrics  [Exec]

[Suite] [ML] Metrics
Traceback (most recent call last)
/usr/src/app/tests/ml/test_metrics.nim(46) test_metrics
/usr/src/app/tests/ml/test_metrics.nim(25) main
/usr/src/app/src/arraymancer/ml/metrics/accuracy_score.nim(26) accuracy_score
/usr/src/app/src/arraymancer/tensor/ufunc.nim(29) astype
/usr/src/app/src/arraymancer/tensor/higher_order_applymap.nim(43) map
/usr/src/app/src/arraymancer/tensor/ufunc.nim(29) :anonymous
SIGILL: Illegal operation.
Illegal instruction (core dumped)
Error: execution of an external program failed: '/usr/src/app/tests/ml/test_metrics '

The failing test calls accuracy_score which performs floating point operations that I think are causing the Illegal instruction error due to the -mavx512dq flag.

result = (y_true ==. y_pred).astype(float).mean

Inspecting a core dump using GDB shows that the illegal instruction was vmovdqu64 %zmm2, (%rsi):

Program terminated with signal SIGILL, Illegal instruction.
#0  0x000000000044c61f in astype.t_1514 ()
(gdb) layout asm

[screenshot: gdb `layout asm` view of the faulting instruction]

VMOVDQU64 with zmm registers is AVX-512 only.

@Vindaar
Collaborator

Vindaar commented Dec 23, 2021

I don't fully see a good solution. As things stand right now:

  • we only detect CPU features at runtime
  • to support AVX512 we need an additional compilation flag
  • that compilation flag makes it look like AVX512 is supported even if it is not

So this leads me to believe that the best solution for now is to simply disable AVX512 by default and add a Nim -d:avx512 flag that the user has to hand if their CPU supports it. In that case we pass the -mavx512dq flag to the C compiler. Unfortunately, I don't have an AVX512 capable CPU to test if the code even works there.

@SteadBytes

SteadBytes commented Dec 23, 2021

What seems very odd here is that I can reproduce the original compilation error on a CPU with AVX-512:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 140
Model name:            11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Stepping:              1
CPU MHz:               2995.201
BogoMIPS:              5990.40
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              12288K
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid avx512_vp2intersect flush_l1d arch_capabilities

Using the same reproducer saved as avx512.nim:

nim c -r avx512.nim
...
inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
  334 | _mm512_setzero_si512 (void)
      | ^~~~~~~~~~~~~~~~~~~~
/home/bsteadman/.cache/nim/t_d/@mt.nim.c:1955:18: note: called from here
 1955 |         AB13_1 = _mm512_setzero_si512();
      |                  ^~~~~~~~~~~~~~~~~~~~~~
...

Compiling with the -mavx512dq flag succeeds as does executing the resulting binary (because AVX 512 is actually available this time).

@auxym
Contributor

auxym commented Dec 23, 2021

So this leads me to believe that the best solution for now is to simply disable AVX512 by default and add a Nim -d:avx512 flag that the user has to hand if their CPU supports it. In that case we pass the -mavx512dq flag to the C compiler. Unfortunately, I don't have an AVX512 capable CPU to test if the code even works there.

This seems like a good solution to me, and something that can be implemented via the `{.booldefine.}` and `{.passC.}` pragmas, then gating the AVX-512 code behind compile-time `when`s instead of runtime `if`s. Or maybe even in a nim.cfg file, though its syntax for conditionals etc. seems to be undocumented (?).

The main downside is that -mavx512dq is gcc-specific, yes? Then we lose compatibility with clang, MSVC, etc.? TBH that sounds like a reasonable tradeoff to me, at least as a quick initial fix: use gcc if you want AVX-512 support. Perhaps later we can check which C compiler is used and add different flags for AVX-512.

Another option would be using cpuinfo at compile time to automatically discover CPU features, but IMO that might not be the best idea. Even if the user is on an AVX-512 CPU, they might not want an AVX-512 build, e.g. if they want to share the binary with others. A "vanilla" build as the default sounds like the best choice, IMO.

@SteadBytes

Unfortunately, I don't have an AVX512 capable CPU to test if the code even works there.

Take a look at my earlier comment #505 (comment)

I agree that disabling AVX-512 by default is probably a good idea. However I feel like something is missing here - this was working, right? 🤔

The main downside of it is that -mavx512dq is gcc-specific, yes?

FWIW, clang also has -mavx512dq.

Even if the user is on a avx512 cpu, they might not want an avx512 build, eg if they want to share it with others. Sounds like a "vanilla" build as a default would be the best choice, IMO.

For GCC/clang this is handled with the -march compiler flag e.g. -march=native, -march=skylake-avx512 etc. These set a bunch of #defines for the available CPU features allowing for the corresponding conditional code to be written.

It seems like disabling AVX-512 by default and then opting in at compile time rather than runtime CPU feature detection will more reliably produce working code - at the expense of having to compile with specific options and generating less "universal" code. I'm personally fine with that trade off but I, of course, cannot speak for everyone 😅

Vindaar added a commit to Vindaar/Arraymancer that referenced this issue Dec 29, 2021
This commit fixes issue mratsim#505, which is a codegen issue arising from
generated C code that is only valid if the `-mavx512dq` compilation
flag is handed to the C compiler.

The issue is that:
- we cannot generate the code without the compilation flag (and
compile it successfully)
- handing the compilation flag makes the code compile, but forces
every CPU to attempt to run the AVX512 code, which is an illegal
operation for CPUs that do not support it.

For this reason AVX512 support is hidden behind a `-d:avx512`
compilation flag that needs to be handed. The resulting binary then
*will* use AVX512 for gemm.
Vindaar added a commit that referenced this issue Dec 29, 2021
* do not compile with AVX512 support by default

* update README with a note about `-d:avx512`

* fix whitespace in README
@Vindaar
Collaborator

Vindaar commented Dec 29, 2021

Now we disable AVX512 by default. If anyone comes up with a solution that makes it work automagically if available, but still compiles on all targets, I'm all ears.

@mratsim
Owner

mratsim commented Jan 3, 2022

I think there is a change in how Nim passes file-specific compilation flags in 1.6.
This suffers from the same:

ringabout added a commit to nim-lang/Nim that referenced this issue Jan 3, 2022
The cause of arraymancer failure has been tracked here: mratsim/Arraymancer#505
And it was fixed by mratsim/Arraymancer#542
@auxym
Contributor

auxym commented Jan 5, 2022

@mratsim I did reproduce with 1.4.8 (and some older versions too if I recall correctly)

@Vindaar
Collaborator

Vindaar commented Jan 5, 2022

I think there is a change in how Nim passes file-specific compilation flags in 1.6. This suffers from the same:

I'm not sure this is related.
Before, we didn't even have any file-specific compilation flags related to AVX512, no?

@mratsim
Owner

mratsim commented Jan 5, 2022

We do

Arraymancer/nim.cfg

Lines 77 to 83 in 20e0dc3

gemm_ukernel_sse.always = "-msse"
gemm_ukernel_sse2.always = "-msse2"
gemm_ukernel_sse4_1.always = "-msse4.1"
gemm_ukernel_avx.always = "-mavx"
gemm_ukernel_avx_fma.always = "-mavx -mfma"
gemm_ukernel_avx2.always = "-mavx2"
gemm_ukernel_avx512.always = "-mavx512f -mavx512dq"

@SteadBytes

SteadBytes commented Jan 5, 2022

@mratsim I did reproduce with 1.4.8 (and some older versions too if I recall correctly)

That's interesting. I'm able to compile the example fine on 1.4.8 🤔 1.6.0 and above fails as @mratsim suggested.

status-im/nim-blscurve#133 (comment)

@Vindaar
Collaborator

Vindaar commented Jan 8, 2022

For reference also here (posted initially here status-im/nim-blscurve#133 (comment)):

I just did a git bisect on one of the test cases that are broken with the AVX512 error with arraymancer before #542 was merged. The Nim PR that introduced the regression is:

nim-lang/Nim#17311

which nicely enough already has a still open "fix arraymancer regression" TODO!

@SteadBytes

Nice find @Vindaar!

which nicely enough already has a still open "fix arraymancer regression" TODO!

Haha, the infamous TODO that doesn't get done strikes again 😆 On the plus side, hopefully that means there's a fix to be had sooner rather than later 🤞

PMunch pushed a commit to PMunch/Nim that referenced this issue Mar 28, 2022
The cause of arraymancer failure has been tracked here: mratsim/Arraymancer#505
And it was fixed by mratsim/Arraymancer#542