perf/factor ~ deduplicate divisors #1571
Conversation
Expected performance gain is ⪆20%, but I am missing precise measurements: the only computer I can use for this is currently my laptop, and as the weather got rather warm here, I'm having trouble benchmarking precisely due to thermal throttling. Moreover, the performance figures in the commit messages are likely wrong and need to be measured again, as I reordered the commits.
Thinking further about it, I should submit everything up to 25176ee (i.e. using a flat vector, eliding most heap allocations, etc.) as a separate PR that doesn't involve heavy maths or algorithmics. That should be easier to review if I split it into a few PRs with narrower scope, and it means those parts can be reviewed/merged while I'm still working on this. :)
Rebased on top of #1572 after those commits got extracted.
Unrelated CI failure:
Test failure should be fixed by #1586.
Rebased onto 'master'.
Overall, this looks to be ready for merge, and narrows the performance gap with GNU factor:

```
$ ## 10,000,001 factorizations
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq 0 $((10 ** 7)) | {exe} > /dev/null"
Benchmark #1: seq 0 10000000 | factor > /dev/null
  Time (mean ± σ):      2.974 s ±  0.088 s    [User: 2.364 s, System: 0.809 s]
  Range (min … max):    2.881 s …  3.117 s    10 runs

Benchmark #2: seq 0 10000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):     10.221 s ±  0.152 s    [User: 10.139 s, System: 0.292 s]
  Range (min … max):    9.901 s … 10.331 s    10 runs

Benchmark #3: seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):     10.585 s ±  0.141 s    [User: 10.580 s, System: 0.204 s]
  Range (min … max):   10.254 s … 10.693 s    10 runs

Summary
  'seq 0 10000000 | factor > /dev/null' ran
    3.44 ± 0.11 times faster than 'seq 0 10000000 | ../../../target/release/factor > /dev/null'
    3.56 ± 0.12 times faster than 'seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null'
```

```
$ ## 101 factorizations
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq $((10 ** 7)) $((10 ** 5)) $((2 * (10 ** 7))) | {exe} > /dev/null"
Benchmark #1: seq 10000000 100000 20000000 | factor > /dev/null
  Time (mean ± σ):       4.7 ms ±   0.5 ms    [User: 0.8 ms, System: 6.9 ms]
  Range (min … max):     3.7 ms …   6.6 ms    399 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

Benchmark #2: seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):       6.8 ms ±   0.6 ms    [User: 1.0 ms, System: 9.3 ms]
  Range (min … max):     5.7 ms …   9.4 ms    279 runs

Benchmark #3: seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):       7.5 ms ±   0.6 ms    [User: 1.1 ms, System: 9.8 ms]
  Range (min … max):     6.3 ms …   9.8 ms    283 runs

Summary
  'seq 10000000 100000 20000000 | factor > /dev/null' ran
    1.45 ± 0.21 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null'
    1.60 ± 0.22 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null'
```

Thanks to @nbraud for the initial heavy lifting! There are some unrelated new clippy warnings that I'll fix in a subsequent PR (see #1603).
This way, we can easily replace u8 with a larger type when moving to support larger integers.
The new type can be used to represent in-progress factorisations, which contain non-prime factors.
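For illustration, a minimal sketch of such a flat-vector divisor multiset with `u8` exponents; the names and method bodies here are assumptions, not this PR's actual code:

```rust
// Hypothetical sketch; names are illustrative, not the code from this PR.
type Exponent = u8; // easy to swap for a wider type to support larger integers

#[derive(Clone, Debug, Default)]
struct Decomposition(Vec<(u64, Exponent)>); // flat list of (divisor, exponent)

impl Decomposition {
    /// Record `factor` with multiplicity `exp`, merging with any existing entry.
    fn add(&mut self, factor: u64, exp: Exponent) {
        debug_assert!(exp > 0);
        if let Some(entry) = self.0.iter_mut().find(|(d, _)| *d == factor) {
            entry.1 += exp;
        } else {
            self.0.push((factor, exp));
        }
    }
}
```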
~18% faster than BTreeMap, and ~5% faster than 'master'
~2.9% faster than the previous commit, ~11% faster than “master” overall.
~7% slowdown, paves the way for upcoming improvements
~17% faster, many optimisation opportunities still missed >:)
The invariant is checked by a `debug_assert!`, and follows from the previous commit, as `dec` and `factors` only ever contain coprime numbers:
- true at the start: factors = ∅ and dec = { n¹ };
- on every loop iteration, we pull an element `f` out of `dec` and either:
  - discover it is prime, and add it to `factors`;
  - split it into a number of coprime factors, which get reinserted into `dec`; the invariant is maintained, as all divisors of `f` are coprime with all numbers in `dec` and `factors` (since `f` itself is coprime with them).

As we only ever add elements to `Decomposition` objects that are coprime with the existing ones, all elements are distinct.
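A minimal sketch of the loop this invariant describes; `is_prime` and `split_into_coprime_factors` are hypothetical stand-ins for the real primality test and splitting routine, not this PR's actual implementation:

```rust
// Sketch of the factorisation loop, with the coprimality invariant made explicit.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

fn is_prime(_n: u64) -> bool {
    unimplemented!() // e.g. a Miller-Rabin test
}

fn split_into_coprime_factors(_n: u64) -> Vec<(u64, u8)> {
    unimplemented!() // must return pairwise-coprime, non-trivial factors
}

fn factorize(n: u64) -> Vec<(u64, u8)> {
    let mut factors: Vec<(u64, u8)> = Vec::new(); // confirmed prime factors
    let mut dec = vec![(n, 1u8)]; // pending, possibly composite factors

    while let Some((f, exp)) = dec.pop() {
        // Invariant: `f` is coprime with every element of `dec` and `factors`.
        debug_assert!(dec.iter().chain(&factors).all(|&(g, _)| gcd(f, g) == 1));
        if is_prime(f) {
            factors.push((f, exp));
        } else {
            // All divisors of `f` are coprime with everything already in
            // `dec` and `factors`, so reinserting them keeps the invariant.
            for (d, k) in split_into_coprime_factors(f) {
                dec.push((d, exp * k));
            }
        }
    }
    factors
}
```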
This avoids allocating on the heap when factoring most numbers, without using much space on the stack. This is ~3.5% faster than the previous commit, and ~8.3% faster than “master”.
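A sketch of what that looks like with the `smallvec` crate; the inline capacity of 4 is an illustrative guess, not the value used in this PR:

```rust
use smallvec::SmallVec;

// A u64 has few distinct prime factors, so a small inline buffer keeps most
// factorisations entirely on the stack; larger ones spill to the heap
// transparently.
type Exponent = u8;
struct Decomposition(SmallVec<[(u64, Exponent); 4]>);
```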
@uutils/maintainers, if there are no objections, I'll merge this on Monday.
Closed via merge commit effb94b.
Deduplicates divisors in the `factor()` function and `Factors` datastructure:
- introduce `Decomposition`, a representation of divisor multisets, common to `factor()` and `Factors`;
- optimise `Decomposition`, first by switching to a flat vector representation, then by using `smallvec` to stack-allocate the space in most cases;
- optimise the `fmt` implementation for `Factors`, avoiding a data copy to sort the factors.

This is marked WiP, as the implementation is rather ugly and leaves some performance on the table: `Decomposition` could be optimised again, knowing we do not ever add the same factor twice (so we can skip looking for it).
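For illustration, a sketch of a `fmt` implementation that avoids the sorting copy; the type layout and the keep-it-sorted-on-insertion invariant here are assumptions, not this PR's actual code:

```rust
use std::fmt;

// Assumes the factor list is kept sorted by divisor on insertion, so
// `Display` can walk it in place instead of sorting a temporary copy.
struct Factors(Vec<(u64, u8)>);

impl fmt::Display for Factors {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for &(factor, exp) in &self.0 {
            // Print each prime once per unit of multiplicity, GNU-style.
            for _ in 0..exp {
                write!(f, " {}", factor)?;
            }
        }
        Ok(())
    }
}
```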