Use the _addcarry and _subborrow intrinsics when available #141

ejmahler · 2020-03-24T05:15:43Z

When compiling for x86_64 with "u64_digit" enabled, some benchmarks are improved by using _addcarry_u64 instead of the custom-written adc function, and _subborrow_u64 instead of the custom-written sbb function.
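For readers who have not used the intrinsic, here is a minimal sketch of the two styles being compared. It is not the crate's actual code: the names adc_wide and adc_fast are hypothetical, and the portable version assumes a 128-bit accumulator.

```rust
/// Portable add-with-carry: widen to 128 bits, add, then split the result.
fn adc_wide(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    let sum = u128::from(a) + u128::from(b) + u128::from(carry);
    *out = sum as u64; // low 64 bits
    (sum >> 64) as u8 // carry out, always 0 or 1
}

/// Intrinsic-backed version, only compiled on x86_64.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
fn adc_fast(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    // _addcarry_u64 needs no CPU feature beyond baseline x86_64.
    unsafe { core::arch::x86_64::_addcarry_u64(carry, a, b, out) }
}

/// Every other target falls back to the portable code.
#[cfg(not(target_arch = "x86_64"))]
fn adc_fast(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    adc_wide(carry, a, b, out)
}
```

Both functions take the carry in as the first argument and return the carry out, so they can be swapped without touching callers.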

The fib and fib2 benchmarks improved the most, most benchmarks improved a little, and a few were worse within the margin of error.

The only benchmarks that did legitimately worse were the gcd_euclid family, but there's a comment after those benchmarks saying "// Integer for BigUint now uses Stein for gcd", and the Stein benchmarks showed improvements with this change.

Looking at the generated assembly, the compiler was emitting adcq instructions both before and after the change, but after the change the code using adc is a little shorter. It's possible that the intrinsic provided just enough of a hint to the compiler that it was able to optimize some things away. The compiler wasn't generating sbb instructions at all before, so this change adds them, and one nice thing is that it eliminates signed->unsigned conversions.
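To illustrate the signed->unsigned point, here is a hedged sketch of a subtract-with-borrow written with a signed wide accumulator next to one that calls _subborrow_u64. The names sbb_signed and sbb_fast are hypothetical, and the signed version is only an assumption about the general style, not a copy of the crate's sbb.

```rust
/// Portable subtract-with-borrow using a signed 128-bit intermediate.
/// The result is cast back to an unsigned digit at the end.
fn sbb_signed(borrow: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    let diff = i128::from(a) - i128::from(b) - i128::from(borrow);
    *out = diff as u64; // signed -> unsigned conversion, keeps the low 64 bits
    (diff < 0) as u8 // borrow out, always 0 or 1
}

/// Intrinsic-backed version: stays in unsigned types throughout.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
fn sbb_fast(borrow: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    unsafe { core::arch::x86_64::_subborrow_u64(borrow, a, b, out) }
}

/// Fallback for targets without the intrinsic.
#[cfg(not(target_arch = "x86_64"))]
fn sbb_fast(borrow: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    sbb_signed(borrow, a, b, out)
}
```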

Let me know if you'd prefer a different way to organize the platform-specific code.

ejmahler · 2020-03-24T05:18:10Z

I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.

I also experimented with the _mulx_u64 intrinsic in the mac_with_carry function, but it didn't even generate the mulx instruction and made benchmarks significantly worse.
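For context, a multiply-accumulate-with-carry step can be sketched portably as below. The signature is an assumption made for illustration and may differ from the crate's mac_with_carry. Note that in core::arch, _mulx_u64 is declared with the bmi2 target feature, unlike _addcarry_u64.

```rust
/// Computes a + b * c + *carry as a 128-bit value, returns the low digit,
/// and leaves the high digit in `carry` for the next step.
/// Hypothetical signature, chosen for illustration only.
fn mac_with_carry(a: u64, b: u64, c: u64, carry: &mut u64) -> u64 {
    // The largest possible value is exactly 2^128 - 1, so this cannot overflow.
    let wide = u128::from(a) + u128::from(b) * u128::from(c) + u128::from(*carry);
    *carry = (wide >> 64) as u64;
    wide as u64
}
```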

ejmahler · 2020-03-24T05:19:54Z

It's worth noting that _addcarry_u64 was stabilized with rustc 1.33, so this PR would require either an MSRV increase or some extra logic in the build.rs file.

ejmahler · 2020-03-24T06:08:17Z

The build failures for rustc 1.32 and 1.31 are expected, given that these intrinsics were stabilized with rustc 1.33.

cuviper · 2020-03-24T20:24:20Z

I would like to keep 1.31 compatibility for now -- nice as the start of 2018 edition, and that's already an increase from num-bigint's compatibility. If nothing else, that will also make sure we have a u64_digit target in CI that still uses the plain code. So yeah, another autocfg test in the build script would work.
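A minimal sketch of how a build-script probe and the source code could fit together: the build script prints a "cargo:rustc-cfg=..." line when the intrinsic is available, and the code selects an implementation on that cfg. The cfg name use_addcarry below is an assumption for illustration, and the fallback shown is just one portable way to write it.

```rust
// Compiled only when the build script has emitted
// `cargo:rustc-cfg=use_addcarry` (hypothetical cfg name).
#[cfg(all(use_addcarry, target_arch = "x86_64"))]
#[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
fn adc(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    unsafe { core::arch::x86_64::_addcarry_u64(carry, a, b, out) }
}

// Plain code, used on older compilers and on other architectures.
#[cfg(not(all(use_addcarry, target_arch = "x86_64")))]
fn adc(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    let (sum, c1) = a.overflowing_add(b);
    let (sum, c2) = sum.overflowing_add(u64::from(carry));
    *out = sum;
    // At most one of the two additions can overflow.
    u8::from(c1 | c2)
}
```

Because the cfg is off unless the probe succeeds, a toolchain that predates the intrinsic still builds the plain path, which is what keeps a u64_digit target on the portable code in CI.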

> I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.

When you tested that, were you running on i686, or x86_64 patched back to 32-bit digits?

> Let me know if you'd prefer a different way to organize the platform-specific code.

It's not bad for now, but we'll need a new approach if we scale out a lot of arch intrinsics.

Maybe it would duplicate less if we abstracted this closer to adc/sbb? I think we could work with arguments in the same order, just different carry/borrow types, and let inference deal with that difference in the callers.

ejmahler · 2020-03-25T04:01:21Z

> I would like to keep 1.31 compatibility for now -- nice as the start of 2018 edition, and that's already an increase from num-bigint's compatibility. If nothing else, that will also make sure we have a u64_digit target in CI that still uses the plain code. So yeah, another autocfg test in the build script would work.

I've never done one of these before, but I'll look into it and amend the pull request.

> > I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.
>
> When you tested that, were you running on i686, or x86_64 patched back to 32-bit digits?

I tried both: editing the build.rs script to stop emitting u64_digit on x86_64, and compiling with the i686 msvc target. I compared before/after in each case.

> Maybe it would duplicate less if we abstracted this closer to adc/sbb? I think we could work with arguments in the same order, just different carry/borrow types, and let inference deal with that difference in the callers.

I considered this and decided against it, because I was worried about creating surprising situations where a contributor writes code that overflows on one platform but not another, or passes the borrow/carry value along to another function, which affects type inference on only some platforms.

I'm not too deeply against it if that's the direction you want to go.

cuviper · 2020-03-25T15:03:47Z

I think we already need CI to make sure all of this works both in generic code and arch-specialized, so I'm not too worried about surprises in the carry/borrow type. We can document that variability, and also that the values are only ever expected to be 0 or 1 regardless of type.
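The idea of keeping the argument order fixed while letting the carry type vary by platform could look roughly like the sketch below. The module name imp, the alias Carry, and the helper add_assign_digits are all hypothetical. The caller never names the carry type and only relies on it being 0 or 1.

```rust
#[cfg(target_arch = "x86_64")]
mod imp {
    /// The intrinsic uses a one-byte carry flag.
    pub type Carry = u8;

    #[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
    pub fn adc(carry: Carry, a: u64, b: u64, out: &mut u64) -> Carry {
        unsafe { core::arch::x86_64::_addcarry_u64(carry, a, b, out) }
    }
}

#[cfg(not(target_arch = "x86_64"))]
mod imp {
    /// The generic path carries a double-width value that is still only 0 or 1.
    pub type Carry = u128;

    pub fn adc(carry: Carry, a: u64, b: u64, out: &mut u64) -> Carry {
        let sum = Carry::from(a) + Carry::from(b) + carry;
        *out = sum as u64;
        sum >> 64
    }
}

/// Adds `b` into `a` in place (little-endian digits, equal lengths assumed)
/// and reports whether a carry came out of the top digit.
pub fn add_assign_digits(a: &mut [u64], b: &[u64]) -> bool {
    let mut carry = 0; // type inferred from imp::adc on each platform
    for (x, y) in a.iter_mut().zip(b) {
        carry = imp::adc(carry, *x, *y, x);
    }
    carry != 0
}
```

Converting the final carry to a bool at the boundary is one way to keep the platform-dependent type from leaking into other signatures.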

ejmahler · 2020-03-26T06:28:07Z

I decided to do a more thorough benchmark compilation, and found that for x86_64 forced to 32-bit digits, using _addcarry_u32 actually did make a difference. But on genuine x86, it makes things the same or worse.
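For reference, a sketch of the 32-bit digit variant: _addcarry_u32 exists under both core::arch::x86 and core::arch::x86_64, so one function can cover both targets. The names adc32 and add_u64_via_u32 are hypothetical. The second helper only illustrates that with 32-bit digits a 64-bit addition takes two carry steps.

```rust
#[cfg(target_arch = "x86")]
use core::arch::x86 as arch;
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64 as arch;

/// 32-bit add-with-carry through the intrinsic where it exists.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
fn adc32(carry: u8, a: u32, b: u32, out: &mut u32) -> u8 {
    unsafe { arch::_addcarry_u32(carry, a, b, out) }
}

/// Portable fallback using a 64-bit intermediate.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn adc32(carry: u8, a: u32, b: u32, out: &mut u32) -> u8 {
    let sum = u64::from(a) + u64::from(b) + u64::from(carry);
    *out = sum as u32;
    (sum >> 32) as u8
}

/// Adds two 64-bit values as a pair of 32-bit digits each.
fn add_u64_via_u32(a: u64, b: u64) -> (u64, u8) {
    let (mut lo, mut hi) = (0u32, 0u32);
    let carry = adc32(0, a as u32, b as u32, &mut lo);
    let carry = adc32(carry, (a >> 32) as u32, (b >> 32) as u32, &mut hi);
    ((u64::from(hi) << 32) | u64::from(lo), carry)
}
```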

Is 32-bit digits on x86_64 something worth worrying about? Seems like the only way it would happen is if someone like me forced it off to test something.

cuviper · 2020-04-02T18:33:08Z

Is 32-bit digits on x86_64 something worth worrying about?

Not really. I was hoping the result would go the other way, showing benefit on native 32-bit. But now I feel wary that even in the 64-bit case, the benefits you've seen might be very fickle depending on specific CPUs, etc. How much was the actual improvement you saw?

ejmahler · 2020-04-10T02:47:44Z

I probably should have shared benchmarks in the first place. Here they are:

x86_64-pc-windows-msvc

master:
test fib2_100             ... bench:       1,068 ns/iter (+/- 6)
test fib2_1000            ... bench:      13,776 ns/iter (+/- 226)
test fib2_10000           ... bench:     715,450 ns/iter (+/- 22,143)
test fib_100              ... bench:         723 ns/iter (+/- 15)
test fib_1000             ... bench:       8,026 ns/iter (+/- 407)
test fib_10000            ... bench:     401,122 ns/iter (+/- 10,114)
test fib_to_string        ... bench:         217 ns/iter (+/- 2)

addcarry_instrinsic:
test fib2_100             ... bench:       1,040 ns/iter (+/- 16)
test fib2_1000            ... bench:      13,087 ns/iter (+/- 519)
test fib2_10000           ... bench:     599,950 ns/iter (+/- 49,125)
test fib_100              ... bench:         732 ns/iter (+/- 13)
test fib_1000             ... bench:       8,084 ns/iter (+/- 273)
test fib_10000            ... bench:     347,810 ns/iter (+/- 30,973)
test fib_to_string        ... bench:         225 ns/iter (+/- 14)

x86_64-pc-windows-gnu

master:
test fib2_100             ... bench:         982 ns/iter (+/- 14)
test fib2_1000            ... bench:      13,680 ns/iter (+/- 786)
test fib2_10000           ... bench:     717,770 ns/iter (+/- 52,712)
test fib_100              ... bench:         749 ns/iter (+/- 12)
test fib_1000             ... bench:       7,327 ns/iter (+/- 138)
test fib_10000            ... bench:     384,290 ns/iter (+/- 18,288)
test fib_to_string        ... bench:         225 ns/iter (+/- 1)

addcarry_instrinsic:
test fib2_100             ... bench:         968 ns/iter (+/- 8)
test fib2_1000            ... bench:      12,187 ns/iter (+/- 547)
test fib2_10000           ... bench:     583,310 ns/iter (+/- 43,455)
test fib_100              ... bench:         782 ns/iter (+/- 43)
test fib_1000             ... bench:       7,300 ns/iter (+/- 104)
test fib_10000            ... bench:     333,940 ns/iter (+/- 9,086)
test fib_to_string        ... bench:         226 ns/iter (+/- 19)

i686-pc-windows-msvc

master:
test fib2_100             ... bench:       2,066 ns/iter (+/- 43)
test fib2_1000            ... bench:      24,621 ns/iter (+/- 968)
test fib2_10000           ... bench:   1,629,165 ns/iter (+/- 85,346)
test fib_100              ... bench:       1,133 ns/iter (+/- 8)
test fib_1000             ... bench:      14,107 ns/iter (+/- 336)
test fib_10000            ... bench:     807,895 ns/iter (+/- 24,295)
test fib_to_string        ... bench:         300 ns/iter (+/- 22)

addcarry_instrinsic:
test fib2_100             ... bench:       2,002 ns/iter (+/- 94)
test fib2_1000            ... bench:      23,803 ns/iter (+/- 4,320)
test fib2_10000           ... bench:   1,674,750 ns/iter (+/- 89,655)
test fib_100              ... bench:       1,083 ns/iter (+/- 18)
test fib_1000             ... bench:      14,480 ns/iter (+/- 1,098)
test fib_10000            ... bench:     935,170 ns/iter (+/- 62,175)
test fib_to_string        ... bench:         298 ns/iter (+/- 16)

i686-pc-windows-gnu

master:
test fib2_100             ... bench:       1,348 ns/iter (+/- 9)
test fib2_1000            ... bench:      24,365 ns/iter (+/- 3,929)
test fib2_10000           ... bench:   1,630,710 ns/iter (+/- 22,994)
test fib_100              ... bench:       1,000 ns/iter (+/- 10)
test fib_1000             ... bench:      13,136 ns/iter (+/- 516)
test fib_10000            ... bench:     799,360 ns/iter (+/- 6,903)
test fib_to_string        ... bench:         304 ns/iter (+/- 6)

addcarry_instrinsic:
test fib2_100             ... bench:       1,231 ns/iter (+/- 7)
test fib2_1000            ... bench:      21,380 ns/iter (+/- 2,944)
test fib2_10000           ... bench:   1,654,720 ns/iter (+/- 58,099)
test fib_100              ... bench:         954 ns/iter (+/- 7)
test fib_1000             ... bench:      12,638 ns/iter (+/- 1,719)
test fib_10000            ... bench:     923,640 ns/iter (+/- 8,906)
test fib_to_string        ... bench:         831 ns/iter (+/- 24)

On both the msvc and gnu toolchains, there's a 10-20% improvement for big problems on x86_64. But on both toolchains, there's actually a 0-10% drop in performance for i686.
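The percentages can be checked directly from the ns/iter figures above. The sketch below copies the two largest benchmarks for each target and computes the relative change; the helper names are hypothetical. By this arithmetic the x86_64 rows come out roughly 13% to 19% faster, while the fib_10000 rows on i686 come out closer to 16% slower.

```rust
/// Relative change from `before` to `after` in percent (negative means faster).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// (target, benchmark, master ns/iter, addcarry_instrinsic ns/iter),
/// copied from the benchmark listings above.
const RESULTS: [(&str, &str, f64, f64); 8] = [
    ("x86_64-pc-windows-msvc", "fib2_10000", 715_450.0, 599_950.0),
    ("x86_64-pc-windows-msvc", "fib_10000", 401_122.0, 347_810.0),
    ("x86_64-pc-windows-gnu", "fib2_10000", 717_770.0, 583_310.0),
    ("x86_64-pc-windows-gnu", "fib_10000", 384_290.0, 333_940.0),
    ("i686-pc-windows-msvc", "fib2_10000", 1_629_165.0, 1_674_750.0),
    ("i686-pc-windows-msvc", "fib_10000", 807_895.0, 935_170.0),
    ("i686-pc-windows-gnu", "fib2_10000", 1_630_710.0, 1_654_720.0),
    ("i686-pc-windows-gnu", "fib_10000", 799_360.0, 923_640.0),
];

/// One formatted line per row, e.g. target, benchmark, signed percentage.
fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|&(target, bench, before, after)| {
            format!("{:<24} {:<11} {:+.1}%", target, bench, percent_change(before, after))
        })
        .collect()
}
```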

cuviper · 2020-10-30T22:03:27Z

Sorry for leaving this so long! I rebased your branch, and cleaned up the feature conditions a bit (possibly subjective). Then I ran the benchmarks myself on Fedora 33, for both i686 and x86_64-unknown-linux-gnu, and on two different CPUs: Intel i7-7700K and AMD Ryzen 7 3800X. In all cases, the intrinsics look like a clear winner to me!

ejmahler · 2020-11-02T02:24:12Z

I'm glad to hear it was an improvement, and no problem on the delay.

Looking at the changes to the cfg conditions, I think the intention is now much clearer and more upfront.

cuviper · 2020-11-02T20:31:12Z

bors r+

bors · 2020-11-02T20:40:28Z

Build succeeded:

cuviper mentioned this pull request Jun 26, 2020

Reduce carry/borrow size to 1 digit #157

Closed

ejmahler added 5 commits October 30, 2020 11:52

Use the addcarry intrinsic when avilable

57fdf6a

Include intrinsics from core::arch, not std::arch

4e20fc3

Moved the platform-specific code to adc and sbb, added a build.res entry

0be00d8

Backed out adc and sbb parameter rearrangement, implemented u32 for x…

49ff7b7

…64, applied rustfmt

fixed copy/paste error

0cc50c9

cuviper force-pushed the addcarry_instrinsic branch from a18350c to 029a3df Compare October 30, 2020 21:56

cuviper added 2 commits October 30, 2020 15:00

Unify addcarry probing for x86_64/x86

e3971e6

Restructure adc/sbb to match addcarry/subborrow

e03bbc1

cuviper force-pushed the addcarry_instrinsic branch from 029a3df to e03bbc1 Compare October 30, 2020 22:00

bors bot merged commit b3d48f4 into rust-num:master Nov 2, 2020

ejmahler deleted the addcarry_instrinsic branch November 2, 2020 21:29

vrmiguel mentioned this pull request Sep 9, 2021

graph, store: bump num-bigint and bigdecimal graphprotocol/graph-node#2781

Closed
