Add support for AArch64 CRC32 instructions #6
Hi, thanks for this! The benchmarks certainly look promising.
I've left comments in-line to be addressed.
Forgot to say this, but if you wanted to use LLVM intrinsics instead of inline assembly, you may be able to use the following:
extern {
    #[link_name = "llvm.aarch64.crc32x"]
    pub fn crc32x(a: i32, b: i64) -> i32;
}
188eab0 to 0fa7925
Now using intrinsics and detection via stdsimd, which should be added by rust-lang/stdarch#612 :) So waiting on that.
src/specialized/aarch64.rs
Outdated
let mut ptr4;
let mut ptr8;

if len != 0 && ((ptr as usize) & 1) != 0 {
Could this perhaps use the recently stabilized align_to method on slices to handle the bulk of the alignment logic here?
ooh, this is a very nice method! (and the chunks_exact iterator too)
A quick attempt at using it here though made performance significantly worse. I'll investigate that later.
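For readers unfamiliar with align_to, here is a minimal sketch of how it can replace hand-written pointer alignment checks. It is not the code from this PR: crc32_step and crc32_word are hypothetical helpers, and crc32_word is a software stand-in for the 64-bit hardware instruction so the sketch runs on any target.

```rust
/// Bitwise CRC-32 update for one byte (reflected polynomial 0xEDB88320).
/// Hypothetical helper, used here only so the sketch is portable.
fn crc32_step(mut crc: u32, byte: u8) -> u32 {
    crc ^= u32::from(byte);
    for _ in 0..8 {
        // mask is all ones when the low bit is set, otherwise zero.
        let mask = (crc & 1).wrapping_neg();
        crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
    }
    crc
}

/// Hypothetical stand-in for a 64-bit hardware CRC instruction.
/// to_ne_bytes recovers the bytes in the order they had in memory.
fn crc32_word(crc: u32, word: u64) -> u32 {
    word.to_ne_bytes().iter().fold(crc, |c, &b| crc32_step(c, b))
}

/// CRC-32 of `data`, splitting it into an unaligned head, an aligned
/// middle of u64 words, and a tail of leftover bytes.
pub fn crc32(data: &[u8]) -> u32 {
    // SAFETY: every bit pattern is a valid u64, so viewing the aligned
    // middle of a byte slice as u64 values is sound.
    let (head, words, tail) = unsafe { data.align_to::<u64>() };

    let crc = !0u32;
    let crc = head.iter().fold(crc, |c, &b| crc32_step(c, b));
    let crc = words.iter().fold(crc, |c, &w| crc32_word(c, w));
    let crc = tail.iter().fold(crc, |c, &b| crc32_step(c, b));
    !crc
}
```

The head and tail are each shorter than one word, so the byte-at-a-time loops stay cheap while the middle slice can be fed to the wide instruction.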
oh, it just wasn't inlining the intrinsics' wrappers because I removed the target_feature attr. lol.
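To illustrate why the attribute matters, here is a hedged sketch of a wrapper that keeps #[target_feature(enable = "crc")] so the intrinsics can inline into it, with runtime detection and a portable fallback. It assumes a toolchain where core::arch::aarch64::__crc32d and __crc32b are available on stable; the names crc32_hw and crc32_sw are hypothetical and not taken from this PR.

```rust
/// Portable bitwise CRC-32 update (reflected polynomial 0xEDB88320).
/// Hypothetical fallback for CPUs or targets without the CRC extension.
fn crc32_sw(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hardware path. The target_feature attribute lets the compiler inline
/// the intrinsics here; without it they are emitted as calls.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "crc")]
unsafe fn crc32_hw(mut crc: u32, data: &[u8]) -> u32 {
    use core::arch::aarch64::{__crc32b, __crc32d};

    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // The instruction consumes the least significant byte first.
        crc = __crc32d(crc, u64::from_le_bytes(buf));
    }
    for &byte in chunks.remainder() {
        crc = __crc32b(crc, byte);
    }
    crc
}

/// CRC-32 of `data`, using the CRC extension when the CPU reports it.
pub fn crc32(data: &[u8]) -> u32 {
    let init = !0u32;
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("crc") {
            // SAFETY: the "crc" feature was detected at runtime just above.
            let crc = unsafe { crc32_hw(init, data) };
            return !crc;
        }
    }
    !crc32_sw(init, data)
}
```

On non-AArch64 targets the hardware function is compiled out entirely and crc32 always takes the software path.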
@alexcrichton What would the timeline look like to get the crc intrinsics change shipped to nightly?
@myfreeweb It looks like the
That's all for this project of course.
Rebased and updated for the new intrinsic names (rust-lang/stdarch#626). Let's wait for them to land in nightly.
It looks like the change to the intrinsic names has landed in nightly 🎉 Let me know if you want any help pushing this over the finish line!
Cool. Removed the temporary
Excellent, thanks for all your effort!
This should eventually be done using intrinsics, but core::arch::aarch64 doesn't have any crc32 intrinsics right now.
Ideally, CPU capabilities should be checked too, but stdsimd doesn't do that on FreeBSD on non-x86 CPUs (elf_aux_info) yet. (And all my machines run FreeBSD :D) CRC is mandatory in ARMv8.1 anyway, and there are very few v8.0 chips without it.
See comments.
Some fun bench runs!
tfw a humble ARM Cortex-A72 @ 2.18GHz (Rockchip RK3399, cpuset -l4-5):
matches a Ryzen 7 1700 @ 3.85GHz (well, in one test)
while the Cortex-A53 (@ 1.6GHz, Rockchip RK3399, cpuset -l0-3) is that much worse than the A72:
and Cavium ThunderX (Scaleway's KVM VPS) has terrible CRC32 units in particular:
upd: my phone: Qualcomm Snapdragon 660 (Kryo V2, 2.2GHz, weird big.LITTLE management?):
upd: Amazon EC2 a1 instance (Graviton, also A72) — looks like more cache than RK3399
upd: Packet c2.large.arm (Ampere eMAG)
upd: Marvell MACCHIATObin (A72 @ 1.6GHz)
upd: Marvell MACCHIATObin (A72 @ 2.0GHz)
upd: Amazon EC2 m6g (Graviton2, Neoverse N1)
upd: Apple M1 Max (MacBook Pro) thanks weatherlight — impressive baseline, but unimpressive HW crc32 units
upd: Ryzen 9 5950X @ PBO for comparison