Implement SIMD-specific functions #16
Comments
I saw the non-temporal stuff in std::arch when I wrote
(Non-)temporal hinting lets you mark loads and stores that should not be cached, because you read and write that location only once, so the rest of the algorithm has no need to frequently revisit it. Basically, it's a way to say "please do not shatter my cache with these particular reads/writes; I am using the cache for the ENTIRE REST of my algorithm." This is a good link: https://vgatherps.github.io/2018-09-02-nontemporal/ My apologies if you already knew that much, but if you did, I would be interested to know what other information you feel would be needed?
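To make the hint concrete (my illustration, not code from the thread): a minimal sketch using the non-temporal store intrinsic that std::arch already exposes on x86-64. The function name `stream_zeroes` and the zero-fill workload are invented for the example.

```rust
// Sketch: non-temporal stores via std::arch on x86-64 (SSE2 is baseline there).
// Assumes `dst` is 16-byte aligned and `len` is a multiple of 16.
#[cfg(target_arch = "x86_64")]
unsafe fn stream_zeroes(dst: *mut u8, len: usize) {
    use std::arch::x86_64::{__m128i, _mm_setzero_si128, _mm_sfence, _mm_stream_si128};
    let zero: __m128i = _mm_setzero_si128();
    let mut p = dst as *mut __m128i;
    let end = dst.add(len) as *mut __m128i;
    while p < end {
        // Store bypassing the cache: a hint that we will not read this again soon.
        _mm_stream_si128(p, zero);
        p = p.add(1);
    }
    // Non-temporal stores are weakly ordered; fence before other code observes them.
    _mm_sfence();
}
```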
I do not see nontemporal loads and stores as "SIMD-specific" (but maybe we just meant "SIMD loads"!), but I will acknowledge their heightened usage in SIMD. Obviously scatter/gather is!
Other loads/stores that should be implemented are masked loads/stores, strided loads/stores (often faster than gather/scatter), and combinations thereof. There are also vector reductions, e.g. reduce-add for things like dot products; see the sketch below.
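As an illustration of the reduce-add case (my sketch, not code from the thread): a dot product with nightly portable SIMD. The function `dot` and the lane count are assumptions, and the `SimdFloat` trait path has moved between nightlies.

```rust
// Nightly-only; crate root needs #![feature(portable_simd)].
use std::simd::{f32x8, num::SimdFloat};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = f32x8::splat(0.0);
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let x = f32x8::from_slice(&a[i * 8..]);
        let y = f32x8::from_slice(&b[i * 8..]);
        acc += x * y; // elementwise multiply-accumulate
    }
    // Horizontal reduction: one reduce-add at the end, not one per iteration.
    let mut sum = acc.reduce_sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i]; // scalar tail
    }
    sum
}
```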
In #45 we found out that nontemporal stores are not really supported in the codegen backends: not in LLVM, and as far as I'm aware not in Cranelift either. There are also no plans to expose this as a scalar API yet, so I took it off the list. That still leaves all the other stuff, of course.
How difficult would it be to expose masked loads & stores? Gather/scatter are already supported, despite being more complex, and both are backed by LLVM intrinsics. I tried working around it by using those gather operations, but the only optimization LLVM seems to do on them is recognizing a broadcast/splat operation:

```rust
// Nightly-only; crate root needs #![feature(portable_simd)].
use std::simd::{u8x8, usizex8, Mask};

let bytestring: &[u8] = b"01234567";
// SAFETY: every enabled index (all zeroes) is in bounds of `bytestring`.
let broadcast = unsafe {
    u8x8::gather_select_unchecked(
        bytestring,
        Mask::splat(true),                              // every lane enabled
        usizex8::from_array([0, 0, 0, 0, 0, 0, 0, 0]),  // all lanes read index 0
        u8x8::splat(0),                                 // fallback for disabled lanes
    )
};
```

This translates to a single broadcast instruction on AVX, but any divergence in the indices causes inefficient emulation, like when the indices are 0 through 7. Even when specifically compiling for a
I believe it's straightforward; like you said, LLVM supports it. We only need to add an intrinsic to rustc that emits it.
Alternatively, we could add an optimization to rustc and/or LLVM that detects a vector of successive indices/pointers and converts it to a masked load/store.
That sounds intimidating to me. I'll focus on getting the masked load/store intrinsics running and adding new methods to
I've submitted the rustc changes here: rust-lang/rust#117953 (so far only masked loads, no stores yet)
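To make the semantics concrete (my illustration, not the PR's code): a scalar model of a masked load, where disabled lanes take a fallback value instead of touching memory, which is what lets the tail of a buffer be loaded without reading out of bounds. The function name `masked_load_model` is hypothetical.

```rust
// Scalar model of masked-load semantics: lanes with a false mask bit yield
// `or[i]` and never touch memory. As with the real intrinsic, the caller
// must mask off any lane whose index would be out of bounds.
fn masked_load_model<const N: usize>(src: &[i32], mask: [bool; N], or: [i32; N]) -> [i32; N] {
    core::array::from_fn(|i| if mask[i] { src[i] } else { or[i] })
}

// Usage: load a 3-element tail into 8 lanes; lanes 3.. take the fallback 0.
let tail = [10, 20, 30];
let mask = core::array::from_fn(|i| i < tail.len());
let lanes: [i32; 8] = masked_load_model(&tail, mask, [0; 8]);
assert_eq!(lanes, [10, 20, 30, 0, 0, 0, 0, 0]);
```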
An incomplete list (motivated by a reddit comment)
[ ] nontemporal load/store (not really SIMD, but maybe; see `nontemporal_store`)