-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast implementation of digits with base a power of 2 #31722
Conversation
Have you tried benchmarking inside a function? Can't the compiler do constant propagation by itself here? |
Like this? julia> function test()
n = 1234567890123456
digits(n, base = 2)
end
test (generic function with 1 method)
julia> @btime test();
74.073 ns (1 allocation: 496 bytes)
julia> function test()
n = 1234567890123456
digits(n, base = 10)
end
test (generic function with 1 method)
julia> @btime test();
215.014 ns (1 allocation: 208 bytes) |
With this change @@ -755,9 +755,17 @@ function digits!(a::AbstractVector{T}, n::Integer; base::Integer = 10) where T<:
isempty(a) && return a
if base > 0
- for i in eachindex(a)
- n, d = divrem(n, base)
- a[i] = d
+ if ispow2(base)
+ k = trailing_zeros(base)
+ c = base - 1
+ for i in eachindex(a)
+ a[i] = (n >> (k * (i - 1))) & c
+ end
+ else
+ for i in eachindex(a)
+ n, d = divrem(n, base)
+ a[i] = d
+ end
end
else
# manually peel one loop iteration for type stability we can have specialised implementation for all bases that are powers of 2. Here are benchmarks with the proposed change to the PR: # Before
julia> @btime digits(1, base = 2);
33.756 ns (1 allocation: 96 bytes)
julia> @btime digits(1234567890123456, base = 2);
660.050 ns (1 allocation: 496 bytes)
julia> @btime digits(81985529216486895, base = 2);
759.219 ns (1 allocation: 544 bytes)
julia> @btime digits(1, base = 16);
34.774 ns (1 allocation: 96 bytes)
julia> @btime digits(81985529216486895, base = 16);
219.475 ns (1 allocation: 208 bytes)
# After
julia> @btime digits(1, base = 2);
33.044 ns (1 allocation: 96 bytes)
julia> @btime digits(1234567890123456, base = 2);
111.649 ns (1 allocation: 496 bytes)
julia> @btime digits(81985529216486895, base = 2);
122.859 ns (1 allocation: 544 bytes)
julia> @btime digits(1, base = 16);
32.785 ns (1 allocation: 96 bytes)
julia> @btime digits(81985529216486895, base = 16);
58.451 ns (1 allocation: 208 bytes) Note that performance of |
Ok, I've fixed the first index as suggested by Rafael and implemented the fast method for all powers of 2, as well as added more tests. |
Bump 🙂 |
I guess the question is who should make a decision here? This seems like a straightforward speedup. And which version should we go with? |
I have no idea 😂
I'd suggest the last one with any power of 2, I think it's a common case to see the expansion in base 8 and 16, not only base 2. Anyway, I can easily revert to the previous version if we want to prioritise |
Should there be a faster |
It doesn't seem that julia> Profile.clear(); @profile for n in 1:1_000_000
digits(1234567890123456, base = 2)
end
julia> Profile.print()
133 ./task.jl:268; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
133 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:118; macro expansion
133 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:86; eval_user_input(::Any, ::REPL.REPLBackend)
133 ./boot.jl:330; eval(::Module, ::Any)
1 ./intfuncs.jl:755; #digits!#326(::Int64, ::typeof(digits!), ::Array{Int64...
1 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
131 .../stdlib/v1.3/Profile/src/Profile.jl:25; top-level scope
131 ./REPL[17]:2; macro expansion
131 ./none:0; #digits
131 ./intfuncs.jl:710; #digits#324
131 ./none:0; #digits
131 ./intfuncs.jl:714; #digits#325
34 ./array.jl:441; zeros
26 ./array.jl:445; zeros
26 ./boot.jl:413; Type
26 ./boot.jl:404; Type
8 ./array.jl:446; zeros
8 ./array.jl:295; fill!(::Array{Int64,1}, ::Int64)
7 ./array.jl:766; setindex!
1 ./range.jl:595; iterate
1 ./promotion.jl:403; ==
93 ./none:0; #digits!
3 ./int.jl:0; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
9 ./int.jl:444; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
3 ./intfuncs.jl:0; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
1 ./intfuncs.jl:752; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
2 ./intfuncs.jl:755; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
75 ./intfuncs.jl:762; #digits!#326(::Int64, ::typeof(digits!), ::Array{In...
15 ./array.jl:766; setindex!
10 ./int.jl:293; &
10 ./int.jl:437; >>
38 ./int.jl:444; >>
4 ./array.jl:766; <<
27 ./int.jl:439; <<
7 ./int.jl:424; <=
4 ./none:0; #ndigits
4 ./intfuncs.jl:544; #ndigits#322
3 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
1 ./int.jl:49; <
1 ./intfuncs.jl:516; ndigits0z(::Int64, ::Int64) Actually, Lines 459 to 461 in 93c9ae4
Also in a non-optimised case (e.g., julia> Profile.clear(); @profile for n in 1:1_000_000
digits(1234567890123456, base = 3)
end
julia> Profile.print()
484 ./task.jl:268; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
484 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:118; macro expansion
484 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:86; eval_user_input(::Any, ::REPL.REPLBackend)
484 ./boot.jl:330; eval(::Module, ::Any)
1 ./intfuncs.jl:752; #digits!#326(::Int64, ::typeof(digits!), ::Array{Int64...
1 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
1 .../stdlib/v1.3/Profile/src/Profile.jl:0; top-level scope
477 .../stdlib/v1.3/Profile/src/Profile.jl:25; top-level scope
477 ./REPL[22]:2; macro expansion
477 ./none:0; #digits
477 ./intfuncs.jl:710; #digits#324
477 ./none:0; #digits
477 ./intfuncs.jl:714; #digits#325
60 ./array.jl:441; zeros
53 ./array.jl:445; zeros
53 ./boot.jl:413; Type
53 ./boot.jl:404; Type
7 ./array.jl:446; zeros
2 ./array.jl:294; fill!(::Array{Int64,1}, ::Int64)
4 ./array.jl:295; fill!(::Array{Int64,1}, ::Int64)
4 ./array.jl:766; setindex!
384 ./none:0; #digits!
96 ./array.jl:0; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
30 ./intfuncs.jl:0; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
9 ./intfuncs.jl:752; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
6 ./int.jl:136; abs
6 ./int.jl:96; flipsign
4 ./intfuncs.jl:755; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
1 ./abstractarray.jl:917; isempty
1 ./array.jl:200; length
7 ./intfuncs.jl:766; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
7 ./number.jl:105; divrem
7 ./int.jl:228; div
238 ./intfuncs.jl:767; #digits!#326(::Int64, ::typeof(digits!), ::Array{I...
225 ./array.jl:0; setindex!
13 ./array.jl:766; setindex!
33 ./none:0; #ndigits
33 ./intfuncs.jl:544; #ndigits#322
2 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
30 ./intfuncs.jl:516; ndigits0z(::Int64, ::Int64)
5 ./intfuncs.jl:0; ndigits0zpb(::Int64, ::Int64)
1 ./intfuncs.jl:454; ndigits0zpb(::Int64, ::Int64)
1 ./intfuncs.jl:456; ndigits0zpb(::Int64, ::Int64)
1 ./int.jl:136; abs
1 ./int.jl:96; flipsign
7 ./intfuncs.jl:470; ndigits0zpb(::Int64, ::Int64)
6 ./int.jl:0; div
1 ./int.jl:230; div
8 ./intfuncs.jl:474; ndigits0zpb(::Int64, ::Int64)
4 ./intfuncs.jl:475; ndigits0zpb(::Int64, ::Int64)
4 ./int.jl:54; *
3 ./intfuncs.jl:476; ndigits0zpb(::Int64, ::Int64)
3 ./int.jl:53; + |
I've changed the condition for the fast path to ispow2(base) && n >= 0 && n isa Base.BitInteger
Regarding julia> Profile.clear(); @profile for n in 1:1_000_000
digits(1234567890123456, base = 4)
end
julia> Profile.print()
123 ./task.jl:268; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
123 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:118; macro expansion
123 ...e/julia/stdlib/v1.3/REPL/src/REPL.jl:86; eval_user_input(::Any, ::REPL.REPLBackend)
123 ./boot.jl:330; eval(::Module, ::Any)
2 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
118 .../stdlib/v1.3/Profile/src/Profile.jl:25; top-level scope
118 ./REPL[20]:2; macro expansion
118 ./none:0; #digits
118 ./intfuncs.jl:710; #digits#324
118 ./none:0; #digits
118 ./intfuncs.jl:714; #digits#325
37 ./array.jl:441; zeros
23 ./array.jl:445; zeros
23 ./boot.jl:413; Type
23 ./boot.jl:404; Type
14 ./array.jl:446; zeros
8 ./array.jl:294; fill!(::Array{Int64,1}, ::Int64)
1 ./abstractarray.jl:212; eachindex
1 ./abstractarray.jl:95; axes1
1 ./abstractarray.jl:75; axes
1 ./array.jl:155; size
6 ./array.jl:295; fill!(::Array{Int64,1}, ::Int64)
5 ./array.jl:766; setindex!
1 ./range.jl:595; iterate
1 ./promotion.jl:403; ==
45 ./none:0; #digits!
5 ./int.jl:444; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
2 /tmp/foo.jl:0; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
4 /tmp/foo.jl:2; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
2 ./int.jl:136; abs
2 ./int.jl:96; flipsign
2 /tmp/foo.jl:5; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
31 /tmp/foo.jl:12; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
10 ./array.jl:766; setindex!
1 ./int.jl:293; &
1 ./int.jl:437; >>
17 ./int.jl:444; >>
14 ./int.jl:439; <<
1 ./int.jl:424; <=
2 ./int.jl:437; >>
1 /tmp/foo.jl:15; #digits!#19(::Int64, ::typeof(digits!), ::Array{Int...
1 ./abstractarray.jl:212; eachindex
1 ./abstractarray.jl:95; axes1
1 ./abstractarray.jl:75; axes
1 ./array.jl:155; size
36 ./none:0; #ndigits
36 ./intfuncs.jl:544; #ndigits#322
3 ./intfuncs.jl:513; ndigits0z(::Int64, ::Int64)
2 ./int.jl:49; <
33 ./intfuncs.jl:516; ndigits0z(::Int64, ::Int64)
3 ./intfuncs.jl:0; ndigits0zpb(::Int64, ::Int64)
1 ./intfuncs.jl:454; ndigits0zpb(::Int64, ::Int64)
14 ./intfuncs.jl:470; ndigits0zpb(::Int64, ::Int64)
14 ./int.jl:0; div
5 ./intfuncs.jl:474; ndigits0zpb(::Int64, ::Int64)
8 ./intfuncs.jl:475; ndigits0zpb(::Int64, ::Int64)
8 ./int.jl:54; *
2 ./intfuncs.jl:476; ndigits0zpb(::Int64, ::Int64)
2 ./int.jl:53; + Here |
How about @@ -460,6 +460,10 @@ function ndigits0zpb(x::Integer, b::Integer)
b == 8 && return (sizeof(x)<<3 - leading_zeros(x) + 2) ÷ 3
b == 16 && return sizeof(x)<<1 - leading_zeros(x)>>2
b == 10 && return bit_ndigits0z(x)
+ if ispow2(b)
+ dv, rm = divrem(sizeof(x)<<3 - leading_zeros(x), trailing_zeros(b))
+ return iszero(rm) ? dv : dv + 1
+ end
end
d = 0 in Before this change: julia> @btime ndigits(81985529216486895, base = 4);
29.934 ns (0 allocations: 0 bytes)
julia> @btime digits(81985529216486895, base = 4);
106.863 ns (1 allocation: 336 bytes) After it: julia> @btime ndigits(81985529216486895, base = 4);
8.963 ns (0 allocations: 0 bytes)
julia> @btime digits(81985529216486895, base = 4);
85.248 ns (1 allocation: 336 bytes) I think it'd be nicer to remove the special cases for 2, 8, 16, but this generic method I came up with is a bit slower than those super-optimised cases. |
Any comment on the change to |
That seems good to me. In general, as long as it's a) correct, b) faster, and c) not too insane, I'm happy with any of these changes. |
Ok, I installed the proposed change. Here are some benchmarks: before the PR: julia> for base in (2, 3, 4, 5, 8, 10, 16),
n in (1, 3, 1234567890123456, 81985529216486895)
println("n = $n, base = $base")
@btime digits($n, base = $base)
end
n = 1, base = 2
33.569 ns (1 allocation: 96 bytes)
n = 3, base = 2
38.626 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 2
642.904 ns (1 allocation: 496 bytes)
n = 81985529216486895, base = 2
759.684 ns (1 allocation: 544 bytes)
n = 1, base = 3
37.212 ns (1 allocation: 96 bytes)
n = 3, base = 3
44.564 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 3
432.618 ns (1 allocation: 336 bytes)
n = 81985529216486895, base = 3
509.099 ns (1 allocation: 368 bytes)
n = 1, base = 4
36.202 ns (1 allocation: 96 bytes)
n = 3, base = 4
37.109 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 4
341.088 ns (1 allocation: 288 bytes)
n = 81985529216486895, base = 4
399.398 ns (1 allocation: 336 bytes)
n = 1, base = 5
37.231 ns (1 allocation: 96 bytes)
n = 3, base = 5
37.755 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 5
307.689 ns (1 allocation: 256 bytes)
n = 81985529216486895, base = 5
362.034 ns (1 allocation: 288 bytes)
n = 1, base = 8
33.838 ns (1 allocation: 96 bytes)
n = 3, base = 8
33.675 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 8
232.650 ns (1 allocation: 224 bytes)
n = 81985529216486895, base = 8
273.321 ns (1 allocation: 240 bytes)
n = 1, base = 10
34.165 ns (1 allocation: 96 bytes)
n = 3, base = 10
34.032 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 10
218.868 ns (1 allocation: 208 bytes)
n = 81985529216486895, base = 10
246.200 ns (1 allocation: 224 bytes)
n = 1, base = 16
33.623 ns (1 allocation: 96 bytes)
n = 3, base = 16
33.525 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 16
183.280 ns (1 allocation: 192 bytes)
n = 81985529216486895, base = 16
221.509 ns (1 allocation: 208 bytes) After the PR: julia> for base in (2, 3, 4, 5, 8, 10, 16),
n in (1, 3, 1234567890123456, 81985529216486895)
println("n = $n, base = $base")
@btime digits($n, base = $base)
end
n = 1, base = 2
33.610 ns (1 allocation: 96 bytes)
n = 3, base = 2
34.253 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 2
110.220 ns (1 allocation: 496 bytes)
n = 81985529216486895, base = 2
123.984 ns (1 allocation: 544 bytes)
n = 1, base = 3
37.236 ns (1 allocation: 96 bytes)
n = 3, base = 3
44.465 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 3
430.688 ns (1 allocation: 336 bytes)
n = 81985529216486895, base = 3
509.145 ns (1 allocation: 368 bytes)
n = 1, base = 4
34.196 ns (1 allocation: 96 bytes)
n = 3, base = 4
34.354 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 4
81.159 ns (1 allocation: 288 bytes)
n = 81985529216486895, base = 4
86.555 ns (1 allocation: 336 bytes)
n = 1, base = 5
37.245 ns (1 allocation: 96 bytes)
n = 3, base = 5
37.056 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 5
306.630 ns (1 allocation: 256 bytes)
n = 81985529216486895, base = 5
359.809 ns (1 allocation: 288 bytes)
n = 1, base = 8
33.450 ns (1 allocation: 96 bytes)
n = 3, base = 8
33.501 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 8
59.394 ns (1 allocation: 224 bytes)
n = 81985529216486895, base = 8
61.872 ns (1 allocation: 240 bytes)
n = 1, base = 10
34.125 ns (1 allocation: 96 bytes)
n = 3, base = 10
35.073 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 10
217.121 ns (1 allocation: 208 bytes)
n = 81985529216486895, base = 10
244.356 ns (1 allocation: 224 bytes)
n = 1, base = 16
33.395 ns (1 allocation: 96 bytes)
n = 3, base = 16
33.392 ns (1 allocation: 96 bytes)
n = 1234567890123456, base = 16
54.078 ns (1 allocation: 192 bytes)
n = 81985529216486895, base = 16
57.518 ns (1 allocation: 208 bytes) I also included some benchmarks for bases not powers of 2 (3, 5, 10), to make sure that performance is effectively unaffected in those cases. |
a[i] = d | ||
if ispow2(base) && n >= 0 && n isa Base.BitInteger | ||
k = trailing_zeros(base) | ||
c = base - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You make the assumption that base
behaves like Base.BitIntegers
, so I suggest you write in the condition 2 lines above: && base <= typemax(Int)
and then base = Int(base)
. (We can't write base = Int(base)
unconditionally at the top of the function, because we support bases which don't fit in an Int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your concern is that c
would always be an Int
or larger, even if base
is of a smaller type, what about instead
c = base - one(base)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that trailing_zeros(base)
won't necessarily be implemented for all integer types, and that base - 1
also assumes 2-complement implementation, so seems safer to just use Int
(or typeof(n)
which is Base.BitIntgers
).
Moreover, even if base :: BigInt
which is implementing 2-complements logic, it's wasteful in the loop below to convert everything to BigInt
with the &
operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Note, however, that while a base larger than typemax
is possible in digits!
, it doesn't currently work in digits
because ndigits
internally truncates the base to Int
:
julia> digits(9223372036854775810, base = 9223372036854775808)
ERROR: InexactError: trunc(Int64, 9223372036854775808)
Stacktrace:
[1] throw_inexacterror(::Symbol, ::Any, ::Int128) at ./boot.jl:560
[2] checked_trunc_sint at ./boot.jl:582 [inlined]
[3] toInt64 at ./boot.jl:631 [inlined]
[4] Type at ./boot.jl:710 [inlined]
[5] ndigits0zpb(::Int128, ::Int128) at ./intfuncs.jl:455
[6] ndigits0z(::Int128, ::Int128) at ./intfuncs.jl:520
[7] #ndigits#322 at ./intfuncs.jl:548 [inlined]
[8] #ndigits at ./none:0 [inlined]
[9] #digits#325 at ./intfuncs.jl:718 [inlined]
[10] #digits at ./none:0 [inlined]
[11] #digits#324 at ./intfuncs.jl:714 [inlined]
[12] (::getfield(Base, Symbol("#kw##digits")))(::NamedTuple{(:base,),Tuple{Int128}}, ::typeof(digits), ::Int128) at ./none:0
[13] top-level scope at REPL[2]:1
julia> digits!(zeros(Int, 2), 9223372036854775810, base = 9223372036854775808)
2-element Array{Int64,1}:
2
1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, on the other hand it works with BigInt
. We may have to refine these "casts" in a later iteration. Maybe a future improvement here would be then to convert base
to the smallest bit integer type ">= Int" which can represent it.
k = trailing_zeros(base) | ||
c = base - 1 | ||
for i in eachindex(a) | ||
a[i] = (n >> (k * (i - firstindex(a)))) & c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: here it's assumed that firstindex(x)
is an integer and that the indices of a
are consecutive, which is not enforced by the AbstractArray
interface apparently. The same assumption is made below in the base < 0
branch, so anyway this is IMHO not blocking merging this PR.
The improvements are great! I didn't try myself to find corner cases where this could be slower, but it seems like mostly increasing performance, fo the price of only a small increase of complexity, so I say let's go with it. I will merge after you fix the |
The failing travis is "No output has been received in the last 30m0s" so assuming it's unrelated. |
It seems this might have caused the performance regression reported at https://discourse.julialang.org/t/julia-v1-3-runs-this-script-at-least-10x-slower-than-julia-v1-1-on-my-computer/27815. Possibly just because it makes |
Just out of curiosity, right now the functions are set up like that: digits(...)=digits!(..., fitting array, ...)
digits!(...)= complicated stuff why wouldn't it make sense to flip that over and make digits an iterator? -> digits(...)=iterator with equally complicated stuff
function digits!(...,dest,...)
dest.=digits(...)
end The general question here is, whether there are any cases where an iterator is slower than the corresponding array (in the non-mutating case |
An iterator would be a reasonable API but would be a breaking change. A custom array type which calculates the digits on demand would be less of a change but I'm not sure that can be done efficiently (I guess the main cost is mostly the integer divides (the One fun alternative might be to do a custom array type where the digits are calculated up front but bitpacked into the bits of the next larger integer type. Then you get extremely efficient non-allocating storage as well as the simplicity of calculating the digits in a single pass. |
How about keeping the custom array only as iterator and collecting the whole structure into an abstract array only when the first random access via getindex is made? That way we get superior performance for all iterations and have only an O(d) extra overhead when needed for random access. EDIT: On the other hand, why didn't we incorporate the power of two optimization of this method into the div and remainder functions? That way we wouldn't specialize the caller to do a better job than the actual function made for this purpose. |
This PR adds a fast specialised implementation of
digits(::Integer, base = 2)
. @ararslan said that makingbase = 2
a special case wouldn't be a bad idea. Performance with other bases shouldn't be affected.Before this PR:
After this PR: