A CPUID library in Base or stdlib #36367

Closed
staticfloat opened this issue Jun 19, 2020 · 24 comments · Fixed by #37479
@staticfloat
Member

As @giordano and I work on microarchitecture-specific tarballs in BinaryBuilder, we are currently importing a modified version of CpuId.jl. However, if Base is going to select artifacts based on the current CPU type, it makes sense for something with capabilities similar to CpuId.jl to become a stdlib at the least, if not part of Base.

We already have some CPUID code in src/processor*.cpp; should we export jl_test_cpu_feature() and simply maintain a Julia mapping of the flags? The way I see it, we have three options:

  • Export the C function, create a simple wrapper stdlib that has various flag definitions and invokes a ccall() to probe CPUID bits.
  • Leave the C code alone, duplicate the functionality in pure Julia (similar to CpuId.jl), but have it shipped by default with Julia.
  • Force all packages that want to use microarchitecture-specific artifacts to do the artifact selection themselves (this will only work for lazy artifacts; no Pkg.add()-time artifacts allowed)

I am in favor of (1) or (2), and between the two I'm in favor of (2) because it is the most maintainable.
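For concreteness, a minimal sketch of what option (1) could look like. The feature index below is a hypothetical placeholder; the real values would be generated from the features_*.h headers rather than hard-coded:

const X86_AVX2 = 43  # hypothetical placeholder index, not the real value

# jl_test_cpu_feature returns a C int; nonzero means the feature is present.
test_cpu_feature(feature::Integer) =
    ccall(:jl_test_cpu_feature, Cint, (Cint,), feature) != 0

has_avx2() = test_cpu_feature(X86_AVX2)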

@Keno
Member

Keno commented Jun 19, 2020

Yes, I think having CPUID functionality in Base would be good. Base seems like the right place for it, since it seems conceivable that the loading code would have to be aware of it. I think importing CpuId.jl as-is for now would be fine, though in the long run we should make sure to remove as much duplication as possible.

@yuyichao
Contributor

yuyichao commented Jun 19, 2020

  1. The detection should not really happen at build time: Pkg.add should not do any architecture detection; it should happen at runtime/load time. Right now you can get away with doing it at build time, because it only runs on the user's machine, but that won't work for stdlibs and Base, or for a shared home directory.
  2. You shouldn't really use CpuId.jl, since it does not agree with the target Julia uses; you should use the one we ship. I was previously thinking of exporting a more user-friendly string-based API, but just exporting the ccall should work as well. We can use the ccall optimization to make it work, and use aliases to provide the right generic names (i.e. the user shouldn't need to know that vfp4 provides fma on ARM).

Other than exporting the ccall, the constants should also be exported, and the function should only be used through names, never through the raw values. There should be something like CPU.X86.fma and CPU.AArch32.neon, and you can define them by generating a .jl file from the features_*.h headers.
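For example, a hypothetical shape for such a generated file (all indices here are placeholders; the generator would emit the real values from the headers):

module CPU

module X86
const sse41 = 10  # placeholder
const fma   = 20  # placeholder
end

module AArch32
const neon = 1  # placeholder
const vfp4 = 2  # placeholder
end

# Query strictly by name; nonzero from the C side means "present".
test_feature(feature::Integer) =
    ccall(:jl_test_cpu_feature, Cint, (Cint,), feature) != 0

end

# Usage: CPU.test_feature(CPU.X86.fma), never CPU.test_feature(20).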

@yuyichao
Contributor

Actually, the ccall will also lie to you about the features, so a new intrinsic is needed to get an answer consistent with what the LLVM target actually is. The ccall could be used most of the time for loading external libraries, but not for anything that uses vector calls.

@yuyichao
Contributor

yuyichao commented Jun 20, 2020

Also, what's an example where you want to load libraries optimized for different uarchs?

@giordano
Contributor

@yuyichao
Contributor

Ok. Sure. As long as it’s not the intended way to fix the compiler version regression.

Though, as I mentioned in the other issue, it should be possible to just patch the library to support dispatch within the library itself. It should be in the library's own interest to do so, since loading multiple libraries won’t benefit any normal library users...

@Keno
Member

Keno commented Jun 20, 2020

Ok. Sure. As long as it’s not the intended way to fix the compiler version regression.

Why not? There's no reason for each library to ship its own micro-arch dispatching code.

@yuyichao
Contributor

As long as it’s not the intended way to fix the compiler version regression.

Because the mpfr issue doesn't even get the full performance back when compiled with gcc 4.8 on the core2 target, and it's not using any new instructions; it's purely a compiler issue. This would simply be the wrong fix.

There's no reason for each library to ship its own micro-arch dispatching code.

First, this is not related to why this is the wrong fix for the regression.

And if the library benefits from newer instructions, it should support dispatch directly. That doesn't mean the library has to implement the dispatch itself; that's what GCC's target_clones is for... As I said, the reason is that

loading multiple libraries won’t benefit any normal library users...

and I don't believe Julia is the most important user for most libraries.

@yuyichao
Contributor

And IIUC, icc also does this automatically, so there should be little reason for the library maintainer to do it manually...

@kimikage
Contributor

The detection should not really happen at build time. Pkg.add should not do any architecture detection. It should be done at runtime/load time.

This issue is also related to #33011. In that case the detection must be done at runtime/load time, so a CPUID feature in Base would be useful, although it requires careful usage.

@yuyichao
Contributor

That is correct, and that is exactly why things that seem to work in a package can easily be wrong.

For #33011, it is aimed at Julia code, so it must agree with the LLVM target and not the runtime environment; that's why I said a separate intrinsic is needed. It's also exactly why I mentioned vfp4: people who want to query the existence of fma should write something like CPU.hasfma() rather than CPU.test_feature(ARM::vfp4).

But for the purpose of loading external libraries, as long as the library to be loaded doesn't have vector types as arguments, the ABI is not affected, so loading a library that doesn't match the LLVM target isn't much of a problem, as long as the loaded library can run. That's why I think it's still OK to just use a ccall of jl_test_cpu_feature in this case and export all the constants. Both should be reasonably stable (as long as no one uses the numerical values directly in code), but I don't want to wrap jl_test_cpu_feature in an easy-to-use Julia function, since it is too easy to misuse.
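To illustrate the aliasing idea (reusing the hypothetical test_cpu_feature wrapper and placeholder constants from the sketches above):

# Callers ask a generic question; the platform-specific flag stays hidden.
if Sys.ARCH === :x86_64 || Sys.ARCH === :i686
    hasfma() = test_cpu_feature(X86_FMA)   # hypothetical constant
elseif Sys.ARCH === :arm
    hasfma() = test_cpu_feature(ARM_VFP4)  # vfp4 implies fma on 32-bit ARM
else
    hasfma() = false
end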

@giordano
Contributor

I've just found that in libllvm there is LLVMGetHostCPUFeatures:

ccall(:LLVMGetHostCPUFeatures, Cstring, ())

gives the list of CPU features, also on architectures different from x86. For example, on AWS Graviton 2:

julia> features = filter(f -> startswith(f, "+"), split(unsafe_string(ccall(:LLVMGetHostCPUFeatures, Cstring, ())), ","))
4-element Array{SubString{String},1}:
 "+neon"
 "+fp-armv8"
 "+crypto"
 "+crc"

Can you see any problem with using this function? It would be good enough for our purposes: it's more general than cpuid, and it doesn't require extra code in Base.
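One side note if we go this route: the LLVM-C headers say the string returned by LLVMGetHostCPUFeatures needs to be released with LLVMDisposeMessage, so a tidier sketch might be:

# Copy the feature string to the Julia side, then release LLVM's allocation.
p = ccall(:LLVMGetHostCPUFeatures, Ptr{Cchar}, ())
features = split(unsafe_string(p), ",")
ccall(:LLVMDisposeMessage, Cvoid, (Ptr{Cchar},), p)

# Keep only the enabled features, stripping the leading '+'.
enabled = Set(lstrip(f, '+') for f in features if startswith(f, "+"))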

@yuyichao
Contributor

For your purpose of loading libraries, its answer is correct enough

as long as the library to be loaded doesn't have vector type as argument

It's an unstable API though.

@staticfloat
Member Author

@yuyichao we've noticed that LLVMGetHostCPUFeatures is pretty sparse on armv8. For instance, the Graviton 2 processor that @giordano just posted about is an armv8.2 processor with some extra extensions, and we know that many more extensions should be listed in that LLVMGetHostCPUFeatures output. Is that something that might be fixed by a newer LLVM, or should we be calling a different LLVM function? As it stands, I don't think the output of LLVMGetHostCPUFeatures will be sufficient for us: we need to do things like ensure that lse is enabled on this chip, and although we know it is enabled (because this is an armv8.2-a chip, and lse is on by default for anything armv8.1-a and higher), it's not showing up in the LLVM features here.

@yuyichao
Contributor

AFAICT there are no newer LLVM functions to use. As I said, exporting jl_test_cpu_feature is fine, and it already supports lse. It's as correct as LLVMGetHostCPUFeatures, so just don't use it for anything that affects Julia code. Since it shouldn't be used for Julia code, calling it via ccall should be fine; otherwise it would be easily confused with future functions that are meant for Julia code.

@staticfloat
Member Author

@Keno it seems that LLVM's HostCPUFeatures on ARM is not specific enough for us; for instance, it would be difficult to tell whether an aarch64 processor has enough features to run rr given just this information. Is this something you would prefer we try to fix inside LLVM, relying on a newer LLVM in Julia, or should we just create our own CPUID library, either in C or in Julia?

@Keno
Member

Keno commented Jul 8, 2020

it would be difficult to tell whether an aarch64 processor has enough features to run rr given just this information

Hmm, really? Does it not have the ISA version (e.g. ARMv8.3 or something)? That should be sufficient for us to tell. Anyway, didn't @yuyichao say in the previous comment that we essentially already have this?

@staticfloat
Member Author

Does it not have the ISA version (e.g. ARMv8.3 or something)?

No, you can see in #36367 (comment) that it does not, sadly. Unless we're just asking the wrong function.

Exporting jl_test_cpu_feature() is fine as well, but we will have to maintain a list of CPU features that it knows how to test for, which may be somewhat onerous. If you think that's the way to go, then let's just do that.

@Keno
Member

Keno commented Jul 8, 2020

Don't we have that list in C already? Presumably we could just add a new function that uses the same information and just returns a list of all the applicable features?

@yuyichao
Contributor

yuyichao commented Jul 8, 2020

to run rr

I didn't know that's supported now. What are the required features?

Does it not have the ISA version (e.g. ARMv8.3 or something)

That's not actually detectable AFAICT. Only features are.

Exporting jl_test_cpu_feature() is fine as well, but we will have to maintain a list of CPU features that it knows how to test for, which may be somewhat onerous. If you think that's the way to go, then let's just do that.

Errr, well, that's exactly what jl_test_cpu_feature is doing...

@Keno
Member

Keno commented Jul 8, 2020

I didn't know that's supported now.

I wrote that recently?

What are the required features?

lse atomics, but they've been filling out the whole set in subsequent ISA versions, so the vanilla 8.1 atomics are not sufficient. So far we've only tested Graviton 2, where it works well.

@yuyichao
Contributor

yuyichao commented Jul 8, 2020

Does it require not using any ll-sc instructions? Does LLVM/gcc/libc support doing that?

@Keno
Member

Keno commented Jul 8, 2020

Yes and yes.

@yuyichao
Contributor

yuyichao commented Jul 8, 2020

If it works on neoverse-n1, then I assume the additional one is rcpc, so any cortex-a75+ (including a55 and a65) should be fine.

Is the use of LSE instructions an ABI requirement? I know they couldn't do a lock-free 128-bit atomic read without lse, so it would make sense that without lse, 128-bit atomics require a libcall anyway. I imagine it should be legal for <=64-bit atomics to use ll-sc, though...


So there's a chance AArch64 might get rr support before AMD does?
