Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpk: some GitHub CI runners have MPK enabled but fail to run MPK code #7445

Closed
abrown opened this issue Nov 1, 2023 · 5 comments · Fixed by #7513
Closed

mpk: some GitHub CI runners have MPK enabled but fail to run MPK code #7445

abrown opened this issue Nov 1, 2023 · 5 comments · Fixed by #7513

Comments

@abrown
Copy link
Contributor

abrown commented Nov 1, 2023

GitHub CI runners are showing some strange behavior: on certain runners (unknown which ones), the CPUID bits claim that MPK is supported, but running any MPK code (e.g., RDPKRU) causes a SIGILL crash. #7446 disables MPK until this is resolved.

Some instances of this failure:

abrown added a commit to abrown/wasmtime that referenced this issue Nov 1, 2023
GitHub CI runners are showing some strange behavior: on certain runners
(unknown which ones), the CPUID bits claim that MPK is supported, but
running any MPK code (e.g., `RDPKRU`) causes a `SIGILL` crash. This
change disables MPK until bytecodealliance#7445 is resolved.
abrown added a commit to abrown/wasmtime that referenced this issue Nov 1, 2023
We discovered that certain CPUs claim that MPK is available but then
fail when running MPK instructions; MPK support was temporarily disabled
in bytecodealliance#7446; this change re-enables it.

Closes bytecodealliance#7445.
@abrown
Copy link
Contributor Author

abrown commented Nov 1, 2023

Finally got a reproduction that includes the lscpu output: https://github.com/bytecodealliance/wasmtime/actions/runs/6722788583/job/18272718138?pr=7448#step:4:20. Something must be different about how that AMD EPYC 7763 sets up its CPUID bits.

@alexcrichton
Copy link
Member

Nice! Seems like this should be easy enough to manage by updating the supported check to require Intel CPUs and filter out AMD ones then?

@abrown
Copy link
Contributor Author

abrown commented Nov 1, 2023

That's an option; I was also thinking about looking up how the Linux kernel sets the pku flag. They probably have to deal with this exact issue there. There might even more special cases there to be aware of...

github-merge-queue bot pushed a commit that referenced this issue Nov 1, 2023
GitHub CI runners are showing some strange behavior: on certain runners
(unknown which ones), the CPUID bits claim that MPK is supported, but
running any MPK code (e.g., `RDPKRU`) causes a `SIGILL` crash. This
change disables MPK until #7445 is resolved.
@nagisa
Copy link
Contributor

nagisa commented Nov 7, 2023

I can reproduce this issue (the SIGILL) locally & reliably on my 2nd generation EPYC and can test things out it helps anyway.

abrown added a commit to abrown/wasmtime that referenced this issue Nov 9, 2023
In bytecodealliance#7446 I disabled MPK support temporarily due to failures in CI runs.
Looking into this further in bytecodealliance#7445, I discovered that it is due to how
`has_cpuid_bit_set` works on different x86 machines: Intel's `CPUID`
instruction reports support for MPK in a certain leaf bit, AMD does it
some other (unknown?) way. The CI problem boiled down to occasional runs
on AMD machines that would fail with `SIGILL` because the AMD machine
reported that it had MPK support when it really did not. This change
fixes the issue by first checking if the CPU vendor string is
`GenuineIntel` before inspecting the MPK `CPUID` leaf bit.

Closes bytecodealliance#7445.
github-merge-queue bot pushed a commit that referenced this issue Nov 9, 2023
* mpk: reenable MPK support with vendor string check

In #7446 I disabled MPK support temporarily due to failures in CI runs.
Looking into this further in #7445, I discovered that it is due to how
`has_cpuid_bit_set` works on different x86 machines: Intel's `CPUID`
instruction reports support for MPK in a certain leaf bit, AMD does it
some other (unknown?) way. The CI problem boiled down to occasional runs
on AMD machines that would fail with `SIGILL` because the AMD machine
reported that it had MPK support when it really did not. This change
fixes the issue by first checking if the CPU vendor string is
`GenuineIntel` before inspecting the MPK `CPUID` leaf bit.

Closes #7445.

* review: use `u32::from_le_bytes` to self-document the string check
abrown added a commit to abrown/wasmtime that referenced this issue Nov 22, 2023
When testing, there are certain CPU-dependent features that influence
Cranelift's codegen (e.g., availability of AVX512 instructions). This
additional CI step logs the current CPU information to aid in
troubleshooting, such as the MPK-related troubleshooting over in bytecodealliance#7445.
Also, if we let this run in CI for a while, we may be able to run
queries on the logs to determine how often jobs run on servers with
certain features enabled.
abrown added a commit to abrown/wasmtime that referenced this issue Nov 22, 2023
When testing, there are certain CPU-dependent features that influence
Cranelift's codegen (e.g., availability of AVX512 instructions). This
additional CI step logs the current CPU information to aid in
troubleshooting, such as the MPK-related troubleshooting over in bytecodealliance#7445.
Also, if we let this run in CI for a while, we may be able to run
queries on the logs to determine how often jobs run on servers with
certain features enabled.

prtest:full
github-merge-queue bot pushed a commit that referenced this issue Nov 22, 2023
* ci: log CPU details when testing

When testing, there are certain CPU-dependent features that influence
Cranelift's codegen (e.g., availability of AVX512 instructions). This
additional CI step logs the current CPU information to aid in
troubleshooting, such as the MPK-related troubleshooting over in #7445.
Also, if we let this run in CI for a while, we may be able to run
queries on the logs to determine how often jobs run on servers with
certain features enabled.

prtest:full

* Add Windows variant of 'lscpu'

* Add MacOS variant of 'lscpu'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants