-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use the _addcarry and _subborrow intrinsics when available #141
Conversation
I also tried applying I also experimented with the |
It's worth noting that |
The build failure for rustc 1.32 and 1.31 should be expected, given that these intrinsics were stabilized with rustc 1.33 |
I would like to keep 1.31 compatibility for now -- nice as the start of 2018 edition, and that's already an increase from num-bigint's compatibility. If nothing else, that will also make sure we have a
When you tested that, were you running on i686, or x86_64 patched back to 32-bit digits?
It's not bad for now, but we'll need a new approach if we scale out a lot of arch intrinsics. Maybe it would duplicate less if we abstracted this closer to |
I've never done one of these before, but I'll look into it and amend the pull request
I tried both just editing the build.rs script to stop emitting
I considered this, and I decided not to because I was worried about creating surprising situations where a contributor writes code that overflows on one platform but not another, etc. Or they try to pass the borrow/carry field along to another function, which affects the type inference, but only on some platforms, etc I'm not too deeply against it if that's the direction you want to go. |
I think we already need CI to make sure all of this works both in generic code and arch-specialized, so I'm not too worried about surprises in the carry/borrow type. We can document that variability, and also that they only expect to be 0/1 regardless of type. |
I decided to do a more thorough benchmark compilation, and found that for x86_64 forced to 32-bit digits, using _addcarry_u32 actually did make a difference. But on genuine x86, it makes things the same or worse. Is 32-bit digits on x86_64 something worth worrying about? Seems like the only way it would happen is if someone like me forced it off to test something. |
Not really. I was hoping the result would go the other way, showing benefit on native 32-bit. But now I feel wary that even in the 64-bit case, the benefits you've seen might be very fickle depending on specific CPUs, etc. How much was the actual improvement you saw? |
I probably should have shared benchmarks in the first place. Here they are: x86_64-pc-windows-msvc
x86_64-pc-windows-gnu
i686-pc-windows-msvc
i686-pc-windows-gnu
On both the msvc and gnu toolchains, there's a 10-20% improvement for big problems on x86_64. But on both toolchains, there's actually a 0-10% drop in performance for i686 |
…64, applied rustfmt
a18350c
to
029a3df
Compare
029a3df
to
e03bbc1
Compare
Sorry for leaving this so long! I rebased your branch, and cleaned up the feature conditions a bit (possibly subjective). Then I ran the benchmarks myself on Fedora 33, for both |
I'm glad to hear it was an improvement, and no problem on the delay. Looking at the changes to cfg stuff, I think it's much more clear and upfront what the intention is. |
bors r+ |
When compiling for x86_64, with "u64_digit" enabled, some benchmarks are improved by using
_addcarry_u64
instead of the custom-writtenadc
function, and using_subborrow_u64)
instead of the custom-writtensbb
function.The fib and fib2 benchmarks improved the most, most benchmarks improved a little, and a few were worse within the margin of error.
The only benchmark that did legitimately worse was the
gcd_euclid
family, but there's a comment after those benchmarks saying// Integer for BigUint now uses Stein for gcd
. the stein benchmarks showed improvements with this change.Looking at the generated assembly, it was generating adcq instructions both before and after the change, but post-change the code using adc is a little shorter. It's possible that the intrinsic provided just enough of a hint to the compiler that it was able to optimize some things away. The compiler wasn't generating sbb instructions at all, so this adds them -- and once nice thing is that this change eliminates signed->unsigned conversions.
Let me know if you'd prefer a different away to organize the platform-specific code.