Add more SIMD platform-intrinsics #117953
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @b-naber (or someone else) soon. Please see the contribution instructions for more information. Namely, in order to minimize review-time lag, PR authors and assigned reviewers should ensure that the review label stays updated.
r? rust-lang/project-portable-simd

r? @workingjubilee as a SIMD person that has bors perms :D
Some changes occurred in compiler/rustc_codegen_cranelift cc @bjorn3

Marking this as ready for review. I tested it end to end in my project, with changes to the portable-simd crate linked above. It's been working well.

@RalfJung Letting you know about a new SIMD intrinsic because of rust-lang/miri#1912 (comment)
Please add documentation somewhere for the behavior of these intrinsics -- in particular documentation for any cases where it is UB.

I guess the question is where to add those docs...

I'll add it to portable-simd for now, then? This will be an interesting one for Miri, because memory is not accessed when the corresponding mask element is disabled. For example, it's legal to pass a memory address that is invalid if the mask is all zeros.

Yes, that sounds very important to describe precisely. I assume you ensured that all codegen backends implement this in a way that matches that spec (i.e., avoids the UB when the mask is zero).
Yes, LLVM describes it this way: the LangRef specifies that masked-off lanes are never accessed, so they cannot fault. And for Cranelift, I implemented it with a scalar loop and guarded each memory access with an if statement.
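To make that lowering concrete, here is a minimal sketch in plain Rust (not the actual Cranelift IR) of the scalar-loop-with-guard semantics, assuming a hypothetical 4-lane f32 vector:

```rust
/// Illustrative model of the guarded scalar-loop lowering described above.
unsafe fn masked_load_f32x4(
    mask: [i32; 4],
    pointer: *const f32,
    passthrough: [f32; 4],
) -> [f32; 4] {
    let mut out = passthrough;
    for lane in 0..4 {
        // The guard ensures a disabled lane never touches memory, so the
        // pointer is not dereferenced at all when the mask is all zeros.
        if mask[lane] != 0 {
            out[lane] = *pointer.add(lane);
        }
    }
    out
}
```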
Hi @workingjubilee, is there anything I can do to help expedite the review process?
Sorry, I had like 200 notifications.
Now I have <100.
I agree with Ralf that the documentation for these intrinsics ought to live inside this repo. That we documented them more meticulously in portable-simd is just an accident of no one else beating us to it. A brief sentence and then "this corresponds roughly to XYZ instruction" is fine for now; we can coalesce them later.
@workingjubilee Thanks, I've pushed out a commit addressing your comments.
@workingjubilee OK, I think that should cover it. What's the final verdict on the order of these arguments? I think we want `(mask, pointer, values)`.
Yes, that ordering is good by me. Hypothetically you could use this to argue for a different order, but I think the order you have is fine.
@workingjubilee I'm happy with the state of the PR now. The validation coverage feels fine to me; the only downside is the diagnostic messages shown to the user. I'm fairly sure that other intrinsics (especially the multi-argument ones) also suffer from this. I don't think this PR regresses on what we already have, I'm just highlighting a deficiency that already existed.
Probably! This is almost certainly an inadequately tested part of the compiler. In any case, it is fine for monomorphization errors to be gross, or even simply unhelpful, we just want them to be correct if we are emitting them at all. |
Are we good to merge then, or is there something else we should do in this PR? |
Everything looks good now modulo a nit.
A test seems to be absent?
Please rebase to minimize the intervening history a bit.
r=me with that added.
tests/codegen/simd-intrinsic/simd-intrinsic-generic-masked-load.rs
@bors delegate=farnoy
✌️ @farnoy, you can now approve this pull request!
This maps to the LLVM intrinsics: llvm.masked.load and llvm.masked.store
Force-pushed from e271119 to 97ae509
r=GuillaumeGomez Rollup of 6 pull requests. Successful merges:
- rust-lang#117953 (Add more SIMD platform-intrinsics)
- rust-lang#118057 (dedup for duplicate suggestions)
- rust-lang#118638 (More `rustc_mir_dataflow` cleanups)
- rust-lang#118702 (Strengthen well known check-cfg names and values test)
- rust-lang#118734 (Unescaping cleanups)
- rust-lang#118766 (Lower some forgotten spans)

r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#117953 - farnoy:masked-load-store, r=workingjubilee

Add more SIMD platform-intrinsics
- [x] simd_masked_load
  - [x] LLVM codegen - llvm.masked.load
  - [x] cranelift codegen - implemented but untested
- [ ] simd_masked_store
  - [x] LLVM codegen - llvm.masked.store
  - [ ] cranelift codegen

Also added a run-pass test to test both intrinsics, and additional build-fail & check-fail tests to cover validation for both intrinsics.
Rustup: Pulls in rust-lang/rust#117953 (in preparation for implementing those intrinsics)
```rust
// CHECK-LABEL: @load_f32x2
#[no_mangle]
pub unsafe fn load_f32x2(mask: Vec2<i32>, pointer: *const f32,
                         values: Vec2<f32>) -> Vec2<f32> {
```
What is the meaning of "values" here? It seems strange that a load is told the values to load...?
These are the fallback/passthrough values used for lanes that are masked off, i.e. lanes whose corresponding bits in `mask` are all zeros.
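To illustrate the passthrough behavior, a hedged model of the semantics (plain Rust, not the intrinsic itself): with a mask that enables only lane 0, lane 1 of the result comes from `values`.

```rust
fn main() {
    let memory = [10.0_f32, 20.0];
    let mask = [-1_i32, 0]; // all-ones enables lane 0; zero masks off lane 1
    let values = [1.0_f32, 2.0]; // fallback/passthrough
    // Model of simd_masked_load(mask, memory.as_ptr(), values):
    let result = [
        if mask[0] != 0 { memory[0] } else { values[0] },
        if mask[1] != 0 { memory[1] } else { values[1] },
    ];
    assert_eq!(result, [10.0, 2.0]); // lane 1 kept the passthrough value
}
```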
Ah, the name `values` doesn't really convey that. Thanks!
```rust
#[no_mangle]
pub unsafe fn store_f32x2(mask: Vec2<i32>, pointer: *mut f32, values: Vec2<f32>) {
    // CHECK: call void @llvm.masked.store.v2f32.p0(<2 x float> {{.*}}, ptr {{.*}}, i32 {{.*}}, <2 x i1> {{.*}})
    simd_masked_store(mask, pointer, values)
}
```
What are the alignment requirements on the pointer for the store to be valid? (And same question for the load.)
The pointer should be aligned to the individual element, not the whole vector. So when loading `f32x2`, the size_of is 8, but the alignment of f32 is 4, therefore the pointer should be aligned to 4. This holds for both load & store.
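Concretely, a quick check of those numbers (using `[f32; 2]` to stand in for the vector's payload):

```rust
use std::mem::{align_of, size_of};

fn main() {
    // The vector's payload is 8 bytes wide...
    assert_eq!(size_of::<[f32; 2]>(), 8);
    // ...but only element (4-byte) alignment is required of the pointer.
    assert_eq!(align_of::<f32>(), 4);
}
```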
What if the mask is all-0, i.e., no load/store actually happens. Is the pointer still required to be aligned?
Even with alignment checking, masked elements never generate an access for the load, thus no hardware implementation requires it. If you have strong reasoning for arguing otherwise, we could make it UB regardless?

Otherwise: no, it's a non-event, like `unsafe { invalid_ptr.add(0) }`.
fwiw, x86-64 doesn't ever generate an alignment-check exception for these, even if you turn alignment checking on, while the other architectures are explicit that the logic is that predication prevents the access, and therefore the alignment check.
Yes they do.
I believe everything that's been raised so far has been addressed. Thanks @RalfJung, and apologies if my answers earlier were unhelpful. I wasn't aware that the alignment requirement is only relaxed through the `from_array`/`to_array` path, relying on the optimizer to elide those later.
@farnoy can you link to the PR(s) that address this?
Mostly I've been asking for documentation. #118853 addresses that, please double-check there that the docs for these new intrinsics match what you implemented now. :)
```rust
// Alignment of T, must be a constant integer value:
let alignment_ty = bx.type_i32();
let alignment = bx.const_i32(bx.align_of(values_ty).bytes() as i32);
```
I think this should be `bx.align_of(values_elem).bytes()`, not `values_ty`. Technically, the LLVM intrinsic accepts any power of two as alignment, so we could relax this down to 1. Am I understanding this correctly, or does it not make a difference @workingjubilee?
Having checked the Arm ARM, the RISC-V V spec, and the Intel SDM, we should continue to require element alignment at least. We can expect any masked-load-supporting implementation to support masked loads and stores on element boundaries. However, they may reject accesses on byte boundaries (except where the element is u8, of course), or they may just implement the load/store inefficiently if unaligned, which is honestly about as bad.
There are definitely reasons to do unaligned loads (deserialization, packet framing, etc.) that benefit from vectorization, so I think we should at least make writing `{}_unaligned` versions possible. Though I can't remember, what's the requirement for scatter/gather?
Same, element alignment.
We can adjust them to take a const parameter for alignment, if we want, I guess?
I can only guess the intent, but I am fairly sure about what `Simd::load` currently does. :)

If it is intended to allow entirely unaligned pointers, it should be implemented as `ptr.cast::<Self>().read_unaligned()`. Though I guess that would run into issues with non-power-of-2 vectors as the comment indicates, but those aren't properly supported currently anyway.
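For comparison, a hedged sketch of the two possible contracts, written as hypothetical standalone helpers rather than the actual `Simd` API (note that `[f32; N]` only carries element alignment, while `Simd` types are usually vector-aligned):

```rust
/// Asserts the alignment of `[f32; N]` (i.e. of f32); UB if `ptr` is less aligned.
unsafe fn load_aligned<const N: usize>(ptr: *const f32) -> [f32; N] {
    ptr.cast::<[f32; N]>().read()
}

/// Makes no alignment assertion at all, so entirely unaligned pointers are fine.
unsafe fn load_unaligned<const N: usize>(ptr: *const f32) -> [f32; N] {
    ptr.cast::<[f32; N]>().read_unaligned()
}
```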
Opened rust-lang/portable-simd#382 for the load question.
I will open a PR to fix the alignment passed down to LLVM IR. The optimizer converts the masked load to an unmasked, aligned load because the alignment is too high.
If we go with element alignment for now, does this require a code change? The documentation makes it seem like `bx.align_of(values_ty)` is not guaranteed to equal `bx.align_of(values_elem)` (implementation-defined), so this line should probably be corrected?
Correct.
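For reference, a sketch of the corrected codegen line, assuming `values_elem` is the element type already in scope (as in the review comment above):

```rust
// Pass the alignment of the element, not of the whole vector type, so the
// optimizer cannot legally promote this to an unmasked, vector-aligned load:
let alignment_ty = bx.type_i32();
let alignment = bx.const_i32(bx.align_of(values_elem).bytes() as i32);
```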
I don't have opinions either way here. It's just important that we pick a rule and then apply it consistently everywhere. (Which is also IMO why we shouldn't land new intrinsics without documentation of their exact safety requirements. Just because portable SIMD intrinsics got a pass on this in the past doesn't mean we should continue doing that.)
I certainly wasn't intending to be freewheeling here, or I would not have asked a zillion small questions myself. :^) The problem is slightly of the "remembering all the questions I should be asking" nature, especially when there have already been multiple rounds of review.
I guess the main list of questions to keep in mind is:
- For memory (i.e. pointer) operations: what is the alignment asserted by the op? (For SIMD types, you should almost always be picking the alignment of the element or the alignment of the vector.)
- For memory operations, is there a condition in which that alignment is not asserted by the intrinsic? (Also, please answer no, especially if there is any doubt about how LLVM will interpret it.)
- For memory operations, what happens if you operate on uninit memory? It's UB, right?
- For masked operations, what happens for masks derived from alternating `[true, false, ..]`? (i.e. "what is the normal behavior?")
- For masked operations, what happens on all-1s? Is there any special event, or does it strengthen any assertions?
- For masked operations, what happens on all-0s? Is there any special event, or does it weaken any assertions?
- For indexing operations, what happens when your index is "out-of-bounds"?
- Sooo... have you checked the LLVMIR for the answers to all of the above questions? How about after optimizations?
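The three mask questions above can be made concrete with a hedged model of the intended semantics (plain Rust mirroring the scalar-loop lowering, rather than calling the intrinsic directly):

```rust
/// Illustrative model: only enabled lanes touch memory.
unsafe fn model_masked_load<const N: usize>(
    mask: [bool; N],
    ptr: *const f32,
    passthrough: [f32; N],
) -> [f32; N] {
    let mut out = passthrough;
    for i in 0..N {
        if mask[i] {
            out[i] = *ptr.add(i);
        }
    }
    out
}

fn main() {
    let data = [10.0_f32, 20.0, 30.0, 40.0];
    let pt = [1.0_f32, 2.0, 3.0, 4.0];
    unsafe {
        // Alternating mask: enabled lanes load, disabled lanes keep passthrough.
        let r = model_masked_load([true, false, true, false], data.as_ptr(), pt);
        assert_eq!(r, [10.0, 2.0, 30.0, 4.0]);

        // All-1s: behaves like a plain element-aligned vector load.
        let r = model_masked_load([true; 4], data.as_ptr(), pt);
        assert_eq!(r, data);

        // All-0s: a non-event; even a dangling pointer is never dereferenced.
        let dangling = core::ptr::NonNull::<f32>::dangling().as_ptr() as *const f32;
        let r = model_masked_load([false; 4], dangling, pt);
        assert_eq!(r, pt);
    }
}
```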
> For memory operations, what happens if you operate on uninit memory? It's UB, right?
Note that we do not ever treat uninit memory specially for memory operations in Rust. The UB on uninit derives entirely from the fact that uninit memory violates the validity invariant of most types. So the question that should be asked is: is this an untyped operation, working on raw abstract machine bytes, or is it a typed operation, and at which type?
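A hedged illustration of that typed-vs-untyped distinction using standard-library operations (`std::ptr::copy_nonoverlapping` is documented to perform an untyped copy):

```rust
use std::mem::MaybeUninit;
use std::ptr;

fn main() {
    let src: MaybeUninit<f32> = MaybeUninit::uninit();
    let mut dst: MaybeUninit<f32> = MaybeUninit::uninit();
    unsafe {
        // Untyped: copies raw (possibly uninit) bytes without ever producing
        // an f32 value, so this is not UB.
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), 1);

        // Typed: `read` asserts validity at f32, so reading uninit memory
        // here would be UB.
        // let _x: f32 = src.as_ptr().read();
    }
}
```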
r=workingjubilee Fix alignment passed down to LLVM for simd_masked_load. Follow-up to rust-lang#117953. The alignment for a masked load operation should be that of the element/lane, not the vector as a whole. It can produce miscompilations after the LLVM optimizer notices the higher alignment and promotes this to an unmasked, aligned load followed by a blend/select - https://rust.godbolt.org/z/KEeGbevbb
Rollup merge of rust-lang#118864 - farnoy:masked-load-store-fixes, r=workingjubilee