
Add CPUID submodule #37479

Merged: 2 commits merged into JuliaLang:master from mg/cpuid on Sep 17, 2020

Conversation

@giordano (Contributor) commented Sep 8, 2020

This is a small internal module which defines the API to query the Instruction
Set Architecture (ISA) of the current CPU. This can be used for example to
select an artifact in a JLL package which is compatible with the current CPU.

This is related to #37320, and might change to accommodate some needs of that PR.

Fix #36367. CC: @staticfloat.
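
For a concrete picture of the intended use, here is a minimal sketch of artifact selection based on a runtime feature query; the CPUID.test_cpu_feature call and the JL_X86_avx2 constant name are assumptions based on the API discussed later in this thread, not final names:

# Illustrative only: pick a JLL artifact variant based on what the current
# CPU supports; the constant and artifact names here are hypothetical.
variant = CPUID.test_cpu_feature(CPUID.JL_X86_avx2) ? "libfoo.avx2" : "libfoo.generic"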

@yuyichao (Contributor) commented Sep 8, 2020

As I said in #36367, this should not be a public API, since using this to generate code is wrong. A strictly private API would be fine, and it should just use the existing C function instead.

@yuyichao (Contributor) commented Sep 8, 2020

Also, this is capturing the build-time environment, which is wrong. Note that the selection of an external binary must also be done at runtime, not at package installation or precompilation time. This is a problem shared by a lot of packages that must be fixed before they can be accepted into Base/stdlib.

@yuyichao (Contributor) left a review comment

Fix the detection time; do not export a Julia API.

@staticfloat (Member)

As I said in #36367 this should not be a public API since using this to generate code is wrong.

Sorry, I read through again and it's just not connecting for me. How would this be used for code generation?

@yuyichao (Contributor) commented Sep 8, 2020

For example, LoopVectorization.jl must not use (and, as I understand it, currently is using) anything like this to determine what code to generate.
The target used for Julia code and the actual hardware target are usually the same. However, whenever they disagree, the hardware one is almost always the wrong one to use.

And again, having some dirty hack that works for library loading most of the time is fine, as long as it's clear that no one outside of Base should use it and that the API may be deleted at any time without notice.

@giordano (Contributor, Author) commented Sep 8, 2020

Fair point about detecting the features at runtime. Regarding the API, my understanding is that something is considered "public" if its documentation/docstring is exposed in the manual, which isn't the case in this pull request and wasn't my intention.

Regarding the use of jl_test_cpu_feature, should we manually copy the numeric values of all the features in features_*.h to this module, and manually maintain a mapping between each value and the feature it represents, keeping this always in sync with the features defined on the C side?

@yuyichao (Contributor) commented Sep 8, 2020

Regarding the API, my understanding is that something is considered "public" if its documentation/docstring is exposed in the manual, which isn't the case in this pull request and wasn't my intention.

For now this can just go in the module that does the loading. That'll lower the chance of someone using it.

Regarding the use of jl_test_cpu_feature, should we manually copy the numeric values of all the features in features_*.h to this module, and manually maintain a mapping between each value and the feature it represents, keeping this always in sync with the features defined on the C side?

No.

You can define them by generating a .jl file from the features_*.h headers.

You can just run a Julia script through the C preprocessor. We do that for quite a few other headers.

@giordano (Contributor, Author) commented Sep 9, 2020

We do that for quite a few other headers.

Any example to look at?

@yuyichao (Contributor) commented Sep 9, 2020

julia/base/Makefile

Lines 24 to 34 in 4a412e3

$(BUILDDIR)/pcre_h.jl: $(PCRE_INCL_PATH)
@$(call PRINT_PERL, $(CPP) -D PCRE2_CODE_UNIT_WIDTH=8 -dM $< | perl -nle '/^\s*#define\s+PCRE2_(\w*)\s*\(?($(PCRE_CONST))\)?u?\s*$$/ and print index($$1, "ERROR_") == 0 ? "const $$1 = Cint($$2)" : "const $$1 = UInt32($$2)"' | LC_ALL=C sort > $@)
$(BUILDDIR)/errno_h.jl:
@$(call PRINT_PERL, echo '#include <errno.h>' | $(CPP) -dM - | perl -nle 'print "const $$1 = Int32($$2)" if /^#define\s+(E\w+)\s+(\d+)\s*$$/' | LC_ALL=C sort > $@)
$(BUILDDIR)/file_constants.jl: $(SRCDIR)/../src/file_constants.h
@$(call PRINT_PERL, $(CPP_STDOUT) -DJULIA $< | perl -nle 'print "$$1 0o$$2" if /^(\s*const\s+[A-z_]+\s+=)\s+(0[0-9]*)\s*$$/; print "$$1" if /^\s*(const\s+[A-z_]+\s+=\s+([1-9]|0x)[0-9A-z]*)\s*$$/' > $@)
$(BUILDDIR)/uv_constants.jl: $(SRCDIR)/../src/uv_constants.h $(build_includedir)/uv/errno.h
@$(call PRINT_PERL, $(CPP_STDOUT) "-I$(LIBUV_INC)" -DJULIA $< | tail -n 16 > $@)
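
For context, the output of such a preprocessor pass is just a flat list of Julia const definitions, one per feature. A hypothetical sketch of what a generated features file might contain (constant names and numeric values are illustrative, not the actual output of any rule in this PR):

# Hypothetical excerpt of a generated features_h.jl-style file; the real
# values would come from the JL_FEATURE_DEF entries in src/features_*.h.
const JL_X86_avx = UInt32(9)
const JL_X86_avx2 = UInt32(11)
const JL_AArch64_crc = UInt32(0)
const JL_AArch64_aes = UInt32(4)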

@codecov-commenter

Codecov Report

Merging #37479 into master will increase coverage by 0.03%.
The diff coverage is 88.73%.


@@            Coverage Diff             @@
##           master   #37479      +/-   ##
==========================================
+ Coverage   87.54%   87.57%   +0.03%     
==========================================
  Files         351      351              
  Lines       71009    71009              
==========================================
+ Hits        62166    62188      +22     
+ Misses       8843     8821      -22     
Impacted Files Coverage Δ
stdlib/Printf/src/Printf.jl 70.99% <ø> (+1.10%) ⬆️
base/compiler/ssair/passes.jl 88.74% <50.00%> (-0.26%) ⬇️
base/compiler/ssair/ir.jl 83.47% <80.00%> (+0.08%) ⬆️
base/compiler/tfuncs.jl 86.54% <84.00%> (-0.11%) ⬇️
stdlib/REPL/src/REPL.jl 81.37% <94.44%> (-0.19%) ⬇️
base/compiler/abstractinterpretation.jl 85.86% <100.00%> (ø)
base/compiler/validation.jl 88.88% <100.00%> (+0.10%) ⬆️
base/ryu/shortest.jl 96.69% <100.00%> (+1.65%) ⬆️
base/strings/substring.jl 93.44% <100.00%> (+0.05%) ⬆️
base/process.jl 89.59% <0.00%> (-1.01%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f13e11c...4a9ed1e.

@giordano (Contributor, Author) commented Sep 9, 2020

OK, I exported jl_test_cpu_feature, and a Perl script kindly provided by Elliot generates a Julia file with the numerical values of the features. Now there is a function to get the list of features.

This module is not going to stay in Sys; I'll move it later. I've also marked this PR as a draft because some details of the API need to be ironed out, depending on exactly how we need to use the microarchitectures in the Artifacts module.
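
For reference, the Julia-side check can be little more than a ccall into the exported C function; a minimal sketch, assuming jl_test_cpu_feature takes the feature index as a UInt32 (the constant name in the usage line is hypothetical):

# Minimal sketch: query a single CPU feature at runtime through the exported
# C entry point; the constants come from the generated features file.
test_cpu_feature(feature::UInt32) = ccall(:jl_test_cpu_feature, Bool, (UInt32,), feature)
# e.g. test_cpu_feature(JL_AArch64_crc)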

base/sysinfo.jl: review thread resolved (outdated)
@giordano giordano force-pushed the mg/cpuid branch 2 times, most recently from fc6e8f7 to 14cd2d1 on September 12, 2020 22:32
@giordano giordano marked this pull request as ready for review September 12, 2020 22:34
@giordano (Contributor, Author)

I've moved the module under the new BinaryPlatforms module, so I believe I've now addressed all of the original comments.

@staticfloat (Member) left a review comment

I think the largest piece missing from this is any kind of mapping for aarch64 and armv7l. Once we have those pieces, this looks exactly like what I'm looking for.

base/binaryplatforms.jl: review thread resolved
base/binaryplatforms.jl: review thread resolved (outdated)
@giordano (Contributor, Author) commented Sep 15, 2020

I addressed the last two comments.

@giordano (Contributor, Author)

On the aarch64 build machine I get the following:

julia> using Base.BinaryPlatforms.CPUID

julia> for f in filter(n -> startswith(String(n), "JL_AArch64"), names(CPUID; all=true))
           CPUID.test_cpu_feature(getfield(CPUID, f)) && println(f)
       end
JL_AArch64_aes
JL_AArch64_crc
JL_AArch64_sha2

@yuyichao any clue why only these features are set? In particular, the machine is armv8.1-a, shouldn't JL_AArch64_v8_1a be set as well?

julia> CPUID.test_cpu_feature(CPUID.JL_AArch64_v8_1a)
false

@yuyichao (Contributor)

ISA version is impossible to detect (it's not a thing, basically). The only way to set it is through guessing. What is the CPU? Looking at the list, the only v8.1a I've seen is thunderx2....

# https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html. Keep in sync with
# `arch_march_isa_mapping`.
const ISAs_by_family = Dict(
"x86_64" => (
Member

Does it make sense to make these enums, so you know from the type what the possible values are?

Member

I like Dicts better, so that we can add to them via Pkg hooks in the future.

@KristofferC (Member) commented Sep 16, 2020

I didn't mean to change the Dict, just the entries: "core2", "nehalem", etc. So you know that the keys are arches and not, say, the life's work of Shakespeare, and you can reflect on the instances of the enum.

@staticfloat (Member) commented Sep 16, 2020

If we change keys like "core2" from a string to an enum type, how do we add a new value to that enum type at runtime? What I'm considering is that Julia 1.6 will be around for 3+ years, and in that time we will want to add new ISAs, which we will be able to do from within a package by running something like Base.BinaryPlatforms.ISAs_by_family["x86_64"]["cannonlake"] = ....

We can already reflect on keys(ISAs_by_family["x86_64"]), what does an enum type give us?

@KristofferC (Member) commented Sep 16, 2020

If we want to be able to interactively add new ISAs at runtime from packages, then it clearly can't be an enum. I didn't know that was desirable, especially since it seems like there are multiple data structures here that need to be kept in sync, and the features are generated from the C code, which made it look like the data here is the single source of truth. But if that's not the case, then yeah, we can't use an enum.

Contributor

which we will be able to do from within a package by running something like Base.BinaryPlatforms.ISAs_by_family["x86_64"]["cannonlake"] = ....

Note that nothing from this PR will be a stable API.

Contributor (Author)

I changed how arch_march_isa_mapping is defined below, which should make it easier to keep that dictionary in sync with this one.

@giordano (Contributor, Author)

What is the CPU?

The features shown in /proc/cpuinfo are

fp asimd evtstrm aes pmull sha1 sha2 crc32

I'm a bit confused by the comments in

// JL_FEATURE_DEF(aes, 3, 0) // HWCAP_AES. Implied by `aes`
JL_FEATURE_DEF(aes, 4, 0) // HWCAP_PMULL, ID_AA64ISAR0_EL1.AES == 2
What does it mean that aes is implied by aes? The feature at line 15 is called aes (like the one on line 14), but the comment mentions PMULL; is this correct?

@yuyichao (Contributor)

The features shown in /proc/cpuinfo are

.... but what CPU is it...

What does it mean that aes is implied by aes? The feature at line 15 is called aes (like the one in line 14) but the comment mentions PMULL, is this correct?

Yes, it is correct. It says HWCAP_AES. Implied by `aes`. The exported name is chosen to match the LLVM one, and LLVM's aes actually means HWCAP_PMULL (ID_AA64ISAR0_EL1.AES == 2) rather than HWCAP_AES (ID_AA64ISAR0_EL1.AES == 1 || ID_AA64ISAR0_EL1.AES == 2). The commented-out name also says aes simply because there's nothing else I can think of to put there...

@giordano (Contributor, Author) commented Sep 16, 2020

.... but what CPU is it...

The list of features was the only information I could find in /proc/cpuinfo; no reference to the model. lscpu says ThunderX 88XX.

@yuyichao (Contributor) commented Sep 16, 2020

The list of features was the only information I could find in /proc/cpuinfo; no reference to the model. lscpu says ThunderX 88XX.

What's the full content of /proc/cpuinfo?

lscpu says ThunderX 88XX

ThunderX should be v8

@giordano (Contributor, Author)

What's the full content of /proc/cpuinfo?

processor       : 0
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x0a1
CPU revision    : 1

(repeated a bunch of times for all CPUs)

@yuyichao (Contributor)

CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x0a1
CPU revision    : 1

Yes, that's a thunderx88, and we should be detecting it correctly. Why do you believe it's v8.1a?

@staticfloat (Member)

I was told the machine was a ThunderX2. Heh. Well, this is good progress!

@staticfloat (Member)

Confirmed that on a Graviton2, this successfully identifies it as v8.2-a!

@yuyichao (Contributor)

Since this detection is unreliable (it's based on known names), I would not recommend relying on it. You should check for the features you need instead.

@staticfloat (Member)

We get to set the features we need; e.g. we're generating binaries in BB by using compiler flags like "-march=armv8.2-a". Can you point me to the right table to use to translate from the JL_AArch64_XXX flags to ensure that any binary generated with -march=armv8.2-a can run on that CPU?

@yuyichao (Contributor)

Well, that's the part that needs to be fixed. v8.2-a comes with features (I don't remember which ones off the top of my head) that cannot be reliably detected.

@staticfloat (Member)

Fixed by whom? Is it the LLVM detection that is unreliable?

@yuyichao (Contributor)

Fixed by whom?

Fixed by you, by changing the compilation flags.

Is it the LLVM detection that is unreliable?

No, there's simply no way at the hardware level to detect which version it is. The only thing available is the feature set. And the Linux kernel also does not expose all the features that LLVM may use.

@staticfloat (Member)

So what you're saying is that we should avoid the -march=armv8.X-a flags completely, instead use armv8-a+feature1+feature2+feature3, and only use features that can be reliably detected from Julia?

@yuyichao (Contributor)

Yes. At least until someone can find a way to reliably detect all features in all 8.*-a versions. It would be OK to give them aliases like v8.1-a, but that shouldn't be the thing passed to the compiler.
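
To make the feature-based flag concrete, here is a small illustration of composing such a -march string from a feature list (a sketch only; the extension spellings are assumed GCC/Clang aarch64 -march modifiers and are not fixed by this PR):

# Illustrative only: build a feature-based -march flag instead of a versioned one.
features = ["lse", "crc", "rdma", "aes", "sha2"]
march_flag = "-march=armv8-a+" * join(features, "+")
# -> "-march=armv8-a+lse+crc+rdma+aes+sha2"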

@staticfloat (Member)

Okay, this we can work with. So you suggest that we look through what is defined in https://github.com/JuliaLang/julia/blob/master/src/features_aarch64.h, take those as the official list of flags that can be reliably detected on AArch64, and map those to what is implied by each armv8.X-a level. We can then set up BB to define each "named" armv8.x-a level as that set of features, and then we can test them here in the CPUID module. And we'll completely ignore the JL_AArch64_v8_Xa flags?

@yuyichao (Contributor)

Yes.

@yuyichao (Contributor)

All of those map to the kernel ABI, so that's about as stable as it can be. It's possible that more could be added later for earlier versions, but doing that shouldn't break anything.

@staticfloat (Member) commented Sep 16, 2020

As a data point, I looked at what archspec does (the microarchitecture management library used by Spack), and it looks like they just use what you can get from /proc/cpuinfo, so they're in the same position as us, and they also only use families of feature flags.

Example mappings for thunderx2 and graviton2.

I'm going to collect here a list of features that we should map to: I'm going to edit this comment until I'm happy with it. :)

First off, some useful resources:

Notably, I see that there are no base armv8.X-a levels that require common extensions such as the crypto extensions (e.g. (aes, sha2) on armv8.1- and (aes, sha2, sha3, sm4) on armv8.2+), or the sve extension, which we may want to support.

Questions:

  • fp and asimd are both commented out in the header, presumably because they'll always be on. Can we un-comment them so that we have less implicit knowledge and it's easier to do exhaustive tests?

  • Given your domain knowledge of cores, do you think there are useful levels that we should compile for? For x86_64 it seems clear that the avx, avx2, etc... feature boundaries are useful stepping stones for performance. In this case, I can only guess that we should support some base levels but I'm not sure what optional extensions we should build for, if any. Right now, I'm leaning toward us supporting the following:

    • armv8.0-a -> [fp, asimd] - matches rpi4 in aarch64 mode, and thunderx88
    • armv8.1-a -> [fp, asimd, lse, crc, rdma]
    • armv8.2-a_crypto -> [fp, asimd, lse, crc, rdma, aes, sha2] - matches thunderx2, carmel, graviton2, and even Apple A12z (according to pytorch's cpuinfo)
    • armv8.4-a_crypto_sve -> [fp, asimd, lse, crc, rdma, fp16fml, dotprod, aes, sha2, sve] - matches a64fx
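
One way to encode the levels listed above on the Julia side is a plain table from march name to implied features, in the spirit of the ISAs_by_family dictionary discussed earlier; this is a hypothetical sketch of the proposal, not an API from this PR:

# Hypothetical sketch: proposed AArch64 march levels and the feature sets they
# imply, taken from the list above; names and structure are illustrative.
const aarch64_marches = Dict(
    "armv8_0"            => [:fp, :asimd],
    "armv8_1"            => [:fp, :asimd, :lse, :crc, :rdma],
    "armv8_2_crypto"     => [:fp, :asimd, :lse, :crc, :rdma, :aes, :sha2],
    "armv8_4_crypto_sve" => [:fp, :asimd, :lse, :crc, :rdma, :fp16fml, :dotprod, :aes, :sha2, :sve],
)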

@yuyichao (Contributor)

First off, some useful resources:

You can just use our definition.

julia/src/processor_arm.cpp

Lines 230 to 341 in 4864161

constexpr auto generic = get_feature_masks();
constexpr auto armv8a_crc = get_feature_masks(crc);
constexpr auto armv8a_crc_crypto = armv8a_crc | get_feature_masks(aes, sha2);
constexpr auto armv8_1a = armv8a_crc | get_feature_masks(v8_1a, lse, rdm); // lor
constexpr auto armv8_1a_crypto = armv8_1a | get_feature_masks(aes, sha2);
constexpr auto armv8_2a = armv8_1a | get_feature_masks(v8_2a, ccpp);
constexpr auto armv8_2a_crypto = armv8_2a | get_feature_masks(aes, sha2);
constexpr auto armv8_3a = armv8_2a | get_feature_masks(v8_3a, jsconv, complxnum, rcpc);
constexpr auto armv8_3a_crypto = armv8_3a | get_feature_masks(aes, sha2);
constexpr auto armv8_4a = armv8_3a | get_feature_masks(v8_4a, dit, rcpc_immo, fmi);
constexpr auto armv8_4a_crypto = armv8_4a | get_feature_masks(aes, sha2);
constexpr auto armv8_5a = armv8_4a | get_feature_masks(v8_5a, sb, ccdp, altnzcv, fptoint);
constexpr auto armv8_6a = armv8_5a | get_feature_masks(v8_6a, i8mm, bf16);
// For ARM cores, the features required can be found in the technical reference manual
// The relevant register values and the features they are related to are:
// ID_AA64ISAR0_EL1:
// .AES: aes, pmull
// .SHA1: sha1
// .SHA2: sha2, sha512
// .CRC32: crc
// .Atomic: les
// .RDM: rdm
// .SHA3: sha3
// .SM3: sm3 (sm4)
// .SM4: sm4
// .DP: dotprod
// .FHM: fp16fml
// .TS: fmi, altnzcz
// .RNDR: rand
// ID_AA64ISAR1_EL1
// .JSCVT: jsconv
// .FCMA: complxnum
// .LRCPC: rcpc, rcpc_immo
// .DPB: ccpp, ccdp
// .SB: sb
// .APA/.API: paca (pa)
// .GPA/.GPI: paga (pa)
// .FRINTTS: fptoint
// .I8MM: i8mm
// .BF16: bf16
// .DGH: dgh
// ID_AA64PFR0_EL1
// .FP: fullfp16
// .SVE: sve
// .DIT: dit
// .BT: bti
// ID_AA64PFR1_EL1.SSBS: ssbs
// ID_AA64MMFR2_EL1.AT: uscat
// ID_AA64ZFR0_EL1
// .SVEVer: sve2
// .AES: sve2-aes, sve2-pmull
// .BitPerm: sve2-bitperm
// .SHA3: sve2-sha3
// .SM4: sve2-sm4
// .F32MM: f32mm
// .F64MM: f64mm
constexpr auto arm_cortex_a34 = armv8a_crc;
constexpr auto arm_cortex_a35 = armv8a_crc;
constexpr auto arm_cortex_a53 = armv8a_crc;
constexpr auto arm_cortex_a55 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs);
constexpr auto arm_cortex_a57 = armv8a_crc;
constexpr auto arm_cortex_a65 = armv8_2a | get_feature_masks(rcpc, fullfp16, ssbs);
constexpr auto arm_cortex_a72 = armv8a_crc;
constexpr auto arm_cortex_a73 = armv8a_crc;
constexpr auto arm_cortex_a75 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16);
constexpr auto arm_cortex_a76 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs);
constexpr auto arm_cortex_a77 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs);
constexpr auto arm_cortex_a78 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); // spe
constexpr auto arm_cortex_x1 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); // spe
constexpr auto arm_neoverse_e1 = armv8_2a | get_feature_masks(rcpc, fullfp16, ssbs);
constexpr auto arm_neoverse_n1 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs);
constexpr auto arm_zeus = armv8_4a | get_feature_masks(sve, i8mm, bf16, fullfp16, ssbs, rand);
constexpr auto cavium_thunderx = armv8a_crc_crypto;
constexpr auto cavium_thunderx88 = armv8a_crc_crypto;
constexpr auto cavium_thunderx88p1 = armv8a_crc_crypto;
constexpr auto cavium_thunderx81 = armv8a_crc_crypto;
constexpr auto cavium_thunderx83 = armv8a_crc_crypto;
constexpr auto cavium_thunderx2t99 = armv8_1a_crypto;
constexpr auto cavium_thunderx2t99p1 = cavium_thunderx2t99;
constexpr auto cavium_octeontx2 = armv8_2a_crypto;
constexpr auto fujitsu_a64fx = armv8_2a | get_feature_masks(sha2, fullfp16, sve, complxnum);
constexpr auto hisilicon_tsv110 = armv8_2a_crypto | get_feature_masks(dotprod, fullfp16);
constexpr auto hxt_phecda = armv8a_crc_crypto;
constexpr auto marvell_thunderx3t110 = armv8_3a_crypto;
constexpr auto nvidia_denver1 = generic; // TODO? (crc, crypto)
constexpr auto nvidia_denver2 = armv8a_crc_crypto;
constexpr auto nvidia_carmel = armv8_2a_crypto | get_feature_masks(fullfp16);
constexpr auto apm_xgene1 = generic;
constexpr auto apm_xgene2 = generic; // TODO?
constexpr auto apm_xgene3 = generic; // TODO?
constexpr auto qualcomm_kyro = armv8a_crc_crypto;
constexpr auto qualcomm_falkor = armv8a_crc_crypto | get_feature_masks(rdm);
constexpr auto qualcomm_saphira = armv8_4a_crypto;
constexpr auto samsung_exynos_m1 = armv8a_crc_crypto;
constexpr auto samsung_exynos_m2 = armv8a_crc_crypto;
constexpr auto samsung_exynos_m3 = armv8a_crc_crypto;
constexpr auto samsung_exynos_m4 = armv8_2a_crypto | get_feature_masks(dotprod, fullfp16);
constexpr auto samsung_exynos_m5 = samsung_exynos_m4;
constexpr auto apple_a7 = armv8a_crc_crypto;
constexpr auto apple_a10 = armv8a_crc_crypto | get_feature_masks(rdm);
constexpr auto apple_a11 = armv8_2a_crypto | get_feature_masks(fullfp16);
constexpr auto apple_a12 = armv8_3a_crypto | get_feature_masks(fullfp16);
constexpr auto apple_a13 = armv8_4a_crypto | get_feature_masks(fp16fml, fullfp16, sha3);
constexpr auto apple_s4 = apple_a12;
constexpr auto apple_s5 = apple_a12;
I believe this is the most complete list you can get from anywhere.

fp and asimd are both commented out in the header, presumably because they'll always be on. Can we un-comment them so that we have less implicit knowledge and it's easier to do exhaustive tests?

No, they are just required features. Soft floating point is not supported. I'm not sure what can be tested about something that just isn't supposed to work.

Given your domain knowledge of cores, do you think there are useful levels that we should compile for?

It was "clear" for x86 only because there are changes on the vector registers that somewhat overshadows other changes. There are actually also a few other extensions like bit manipulation and cache (prefetch, invalidate, flush) manipulations, half precisions, that doesn't really fit into the sse/avx evolution timeline. Of course since there are only two major players there aren't too many lines of succession. Even then, AMD is known for deprecating old and unused extensions in newer processors and actually removing them so it's not actually not as simple as the illusion one might get from just looking at the vector registers.

On AArch64, the only extra SIMD level that exists is SVE (and SVE2, I guess), so there just isn't the same evolution line that overshadows everything else. Also, SVE may or may not actually work with our compiler, and unless someone gets a chip/server from Fujitsu or is working on Fugaku, you won't be able to actually use the instructions. There is at least still another useful enough line of succession on aarch64, though, which is the memory model. The 8.1 atomics (i.e. lse) and later ones like rcpc are interesting enough (and, according to Keno, good enough to run rr), so that's one feature that's worth supporting, though probably only for libraries that do their own locking, etc.

Other than the atomics, there are basically about the same number of other random interesting features. They are just not buried among a million avx512 flavors... The usable hardware with these extensions is scarce enough that I don't really know which ones people use. (If anyone wants to teach me how to hack a mid-to-high-end phone/tablet so that I can run gdb on it, I'll be very interested...) OTOH, despite the dozens of cores and hundreds of SoC names, the actual unique features are pretty limited. If you have a look at the feature lists we have, most cores are basically either v8-a or v8.2-a, potentially with extensions that agree with the Cortex-A line fairly well. Apple seems to be the most aggressive player and has been bumping their ISA version on every new chip since the A11, but unless someone can figure out how to run Linux on there, it won't be very relevant for Linux libraries.

So in the end, I'd say just going with what known cores have is probably fine. This is also how I decided to suggest thunderx originally as an additional target, since at the time it seemed like it would be the most immediate "server" platform that people could use. Feature-wise this shouldn't give too much variance to support (it seems that v8 and v8.2,rcpc,fullfp16 would be good, maybe adding carmel if people care a lot, since it seems to be missing rcpc). I'm not sure if there's anything that's worth optimizing for a particular core, but I kind of doubt it. With more and more players (Apple and Marvell being the main exceptions) using straight or tweaked ARM Cortex lines, the performance model is probably not going to diverge too much...

@staticfloat (Member)

Thanks, that's really helpful, Yichao!

No, they are just required features. Soft floating point is not supported. I'm not sure what can be tested about something that just isn't supposed to work.

It's not about testing with a platform that doesn't have fp; it's about having our feature lists be consistent with the rest of the world. It's annoying to have to carry the cognitive load of remembering that, even though /proc/cpuinfo specifies fp and asimd, the fact that Julia doesn't advertise support for those flags doesn't mean it's not compatible; we just have to remember that they are automatically included.

I'm not sure if there's anything that's worth optimizing for a particular core, but I kind of doubt it. With more and more players (Apple and Marvell being the main exceptions) using straight or tweaked ARM Cortex lines, the performance model is probably not going to diverge too much...

So it sounds like just two marches for now may be okay:

  • armv8 => armv8.0-a+fp+asimd (yes, I know fp and asimd are already on by default, I just want to be explicit)
  • armv8_2_crypto => armv8.0-a+fp+asimd+lse+crc+rdma+aes+sha2

I'm not sure how important rcpc or fullfp16 is; is it worthwhile adding an armv8_2_crypto_rcpc? If you don't think it's that big of a deal, I vote that we just go with these two for now and we can add more later.

@yuyichao (Contributor)

It's annoying to have to carry the cognitive load of remembering that, even though /proc/cpuinfo specifies fp and asimd, the fact that Julia doesn't advertise support for those flags doesn't mean it's not compatible; we just have to remember that they are automatically included.

Well, you are free to ignore these anywhere in the stack if you want to copy-paste from /proc/cpuinfo, etc. However, note that they are not the LLVM names, so if anything, if they are added here they will not be called that (the LLVM names are neon and fp-armv8, I believe), and these aren't the only ones either, so you always have to know the translation. That's why I added documentation mapping between the LLVM name, the HWCAP name, and the feature register bits.

Also, the two are always exactly the same according to the ARM architecture manual, so allowing both would be even more confusing. It's the same reason many other features are commented out (e.g. asimdhp, sm3, svei8mm, svebf16).

@yuyichao (Contributor)

I'm not sure how important rcpc or fullfp16 is; is it worthwhile adding an armv8_2_crypto_rcpc? If you don't think it's that big of a deal, I vote that we just go with these two for now and we can add more later.

According to Keno, rcpc is suspected to be required by rr. I don't think it'll be important for many external libraries (and still, I believe very few of these are important for external libraries, and the rest can and should do internal dispatch...).

@giordano giordano force-pushed the mg/cpuid branch 2 times, most recently from 19e5300 to d73df58 on September 16, 2020 23:25
This is a small internal module which defines the API to query the Instruction
Set Architecture (ISA) of the current CPU.  This can be used for example to
select an artifact in a JLL package which is compatible with the current CPU.
@staticfloat (Member) left a review comment

LGTM, I say ready to merge.

Thanks @yuyichao for his careful attention, and @giordano for his work putting this together!

@staticfloat (Member) commented Sep 17, 2020

According to Keno, rcpc is suspected to be required by rr.

He's not 100% certain, so we're going to do some experiments to figure out exactly what is required by building rr with different marches and seeing what we need. Once we know, we can add a mapping here (or even include it dynamically via a Pkg hook that I'm writing).

@staticfloat staticfloat merged commit 0f782b2 into JuliaLang:master Sep 17, 2020
@giordano giordano deleted the mg/cpuid branch September 17, 2020 22:12
@chriselrod (Contributor)

For example, LoopVectorization.jl must not use (and, as I understand it, currently is using) anything like this to determine what code to generate.
The target used for Julia code and the actual hardware target are usually the same. However, whenever they disagree, the hardware one is almost always the wrong one to use.

And again, having some dirty hack that works for library loading most of the time is fine, as long as it's clear that no one outside of Base should use it and that the API may be deleted at any time without notice.

@yuyichao I didn't see this. It would be great to have an API that provides features that are safe to rely on.
LoopVectorization hasn't relied on CpuId for a couple of months, instead querying LLVM with getHostCPUFeatures. I made the change from CpuId because some people experienced crashes when running on avx512 hardware while LLVM did not list it among the features (the recent bug on Macs).

A more abstracted API would be generally useful. It'd be great to have a reliable FMA_NATIVE, for example.

@yuyichao (Contributor)

The answer from LLVM was wrong as well.

@chriselrod (Contributor) commented Sep 18, 2020

It solved the issue on Macs, but it is incorrect when starting Julia with -Civybridge. I haven't noticed any issues outside of specifying the arch on startup yet.
Is there any way to do this correctly?

Aside from "don't generate machine-specific code from Julia at all".
Maybe someday I can try implementing it in LLVM, but experimenting is much easier in Julia -- at least for me -- and the library is still immature, lacking many capabilities I intend it to have. Meaning there is still a lot more experimenting to be done.

Unless I'm mistaken, getHostCPUFeatures should be correct as long as LLVM's target is the host, which will be the case unless users deliberately specified something else when starting Julia.

@echo >> $@
endef

$(BUILDDIR)/features_h.jl: ../src/features_x86.h ../src/features_aarch32.h ../src/features_aarch64.h
Member

Broke out-of-dir builds:

vchuravy@odin ~/b/julia-replloader [2]> make -j
make[1]: *** No rule to make target '../src/features_x86.h', needed by 'features_h.jl'.  Stop.
make[1]: *** Waiting for unfinished jobs....
make: *** [/home/vchuravy/src/julia/Makefile:66: julia-base] Error 2
make: *** Waiting for unfinished jobs....
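
Presumably the fix is to prefix the header paths with $(SRCDIR), as the other rules above already do (e.g. $(SRCDIR)/../src/features_x86.h), so the prerequisites resolve in out-of-tree builds.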

Development

Successfully merging this pull request may close these issues.

A CPUID library in Base or stdlib
7 participants