Have jl_realloc_aligned /try/ realloc, then fallback to malloc if not aligned #32320

Open
wants to merge 2 commits into
base: master

NHDaly
Member

@NHDaly NHDaly commented Jun 14, 2019

Since there is no system realloc_aligned on non-Windows systems, we
simply call realloc and hope that it either grows the existing
allocation in place or, if it has to move it, returns a new allocation
that happens to be correctly aligned. If not, we manually redo the
allocation via (malloc_aligned, copy, free).
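
The strategy described above can be sketched in C roughly as follows. This is a minimal illustration, not the code from this PR: the name `realloc_aligned_sketch` and its signature are hypothetical, `posix_memalign` stands in for `malloc_aligned`, and it assumes a platform (such as glibc) where a block obtained from `posix_memalign` may be passed to `realloc` and `free`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the actual jl_realloc_aligned.
   `align` must be a power of two and a multiple of sizeof(void *). */
void *realloc_aligned_sketch(void *p, size_t newsz, size_t oldsz, size_t align)
{
    assert(newsz > 0);
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    /* Fast path: let the allocator grow the block in place or move it. */
    void *q = realloc(p, newsz);
    if (q == NULL)
        return NULL;                 /* p is untouched and still valid */
    if (((uintptr_t)q & (align - 1)) == 0)
        return q;                    /* still aligned, nothing to redo */

    /* Slow path: the block moved to a misaligned address, so redo the
       allocation by hand (aligned malloc, copy, free). */
    void *r = NULL;
    if (posix_memalign(&r, align, newsz) != 0) {
        /* Unlike plain realloc, the old block cannot be handed back here,
           because realloc already consumed it. A real implementation has
           to decide how to report this; the sketch simply gives up. */
        free(q);
        return NULL;
    }
    memcpy(r, q, oldsz < newsz ? oldsz : newsz);
    free(q);
    return r;
}
```

The fast path costs a single `realloc` call plus one mask test, so the copy is only paid when the allocator actually hands back a misaligned block.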

Note that this makes growing large arrays drastically faster:

$ julia-master -e 'b = collect(1:2^30); @time push!(b, 1);'
  5.433267 seconds (121 allocations: 327.686 MiB, 0.02% gc time)
$ ./julia -e 'b = collect(1:2^30); @time push!(b, 1);'
  0.003441 seconds (128 allocations: 327.689 MiB, 31.64% gc time)

Down to 0.003441 seconds from the previous 5.433267 seconds to double an
array of 2^30 integers. :)
(Of course, if you're growing an array from empty, one element at a time, this cost
would be amortized over all those insertions, so for such a task, this should end
up being around a 2x improvement.)
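
The "around 2x" estimate can be checked with a small counting model. The sketch below is an illustration under stated assumptions, not a benchmark: it assumes the buffer starts at capacity 1 and exactly doubles when full, and that growing in place costs nothing. The function name `writes_for_pushes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element writes needed to push `n` items one at a time into a buffer that
   doubles when full. If `copy_on_grow` is nonzero, every growth also copies
   the existing elements (the malloc + copy + free path); otherwise growth is
   assumed to be free (the ideal case of realloc extending in place). */
size_t writes_for_pushes(size_t n, int copy_on_grow)
{
    assert(n <= SIZE_MAX / 4);       /* keep cap *= 2 from overflowing */
    size_t len = 0, cap = 1, writes = 0;
    for (size_t i = 0; i < n; i++) {
        if (len == cap) {
            if (copy_on_grow)
                writes += len;       /* copy every existing element */
            cap *= 2;
        }
        writes += 1;                 /* store the new element */
        len += 1;
    }
    return writes;
}
```

For `n` a power of two, the copying variant performs `2n - 1` element writes and the in-place variant performs `n`, which is where the factor of roughly two comes from.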

This is the solution to problem (2.) of #28588.

@NHDaly
Member Author

NHDaly commented Jun 14, 2019

I want to add some benchmarks and get some good proper performance testing in before we merge this, but the initial results seem promising! It's easy, clean, and seems like a reasonable improvement. :)

@chethega
Contributor

Could we try only using this for large arrays? I.e. in form of a jl_array_realloc?

The reason is that realloc only preserves alignment when it goes through mremap / specialized syscalls. Consider the experiment below, where a 64-byte-aligned block comes back with only 32-byte or 16-byte alignment:

julia> for N in [10_000]
       for i=1:3
       @show N,i
       p=ccall(:memalign, Ptr{Nothing}, (Csize_t,Csize_t), 64, N)
       #p=ccall(:malloc, Ptr{Nothing}, (Csize_t,), N)
       @show p, trailing_zeros(convert(UInt, p))
       @time pp = ccall(:realloc, Ptr{Nothing}, (Ptr{Nothing}, Csize_t,), p, 2*N)
       @show pp, trailing_zeros(convert(UInt, pp))
       println()
       end
       end
(N, i) = (10000, 1)
(p, trailing_zeros(convert(UInt, p))) = (Ptr{Nothing} @0x000055d77c0cefc0, 6)
  0.000005 seconds
(pp, trailing_zeros(convert(UInt, pp))) = (Ptr{Nothing} @0x000055d77c0ecfe0, 5)

(N, i) = (10000, 2)
(p, trailing_zeros(convert(UInt, p))) = (Ptr{Nothing} @0x000055d77c102340, 6)
  0.000002 seconds
(pp, trailing_zeros(convert(UInt, pp))) = (Ptr{Nothing} @0x000055d77c0c3ea0, 5)

(N, i) = (10000, 3)
(p, trailing_zeros(convert(UInt, p))) = (Ptr{Nothing} @0x000055d77c102340, 6)
  0.000002 seconds
(pp, trailing_zeros(convert(UInt, pp))) = (Ptr{Nothing} @0x000055d77c10cc50, 4)
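
The same experiment can be reproduced directly in C. This is a sketch assuming a POSIX system; the helper names `alignment_log2` and `report_realloc_alignment` are hypothetical, and the alignment printed after `realloc` depends entirely on the allocator, so no particular output should be expected.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Number of trailing zero bits in an address, i.e. log2 of its alignment. */
unsigned alignment_log2(uintptr_t addr)
{
    assert(addr != 0);
    unsigned n = 0;
    while ((addr & 1) == 0) {
        addr >>= 1;
        n++;
    }
    return n;
}

/* Allocate n bytes at 64-byte alignment, double the block with plain
   realloc, and print the alignment before and after. Returns 0 on success. */
int report_realloc_alignment(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n) != 0)
        return -1;
    unsigned before = alignment_log2((uintptr_t)p);

    void *pp = realloc(p, 2 * n);
    if (pp == NULL) {
        free(p);
        return -1;
    }
    printf("n=%zu before=2^%u after=2^%u\n",
           n, before, alignment_log2((uintptr_t)pp));
    free(pp);
    return 0;
}
```

Calling `report_realloc_alignment(10000)` mirrors the `N = 10_000` case above; trying much larger sizes is one way to probe where a given allocator starts preserving alignment.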

@NHDaly
Member Author

NHDaly commented Jun 19, 2019

Mmm bummer. Yeah that seems like a reasonable approach to me... But then the trouble is picking what that cutoff should be?

@chethega
Contributor

chethega commented Jun 19, 2019

One possibility is to go with glibc cutoffs and see whether other users complain. That would be e.g. here, i.e. 32 MB. AFAIK Windows is perfectly capable of aligned realloc, so there is no issue there. Any Apple or BSD users want to chime in?

A somewhat ugly approach could be to have some global variable that tracks the largest observed realloc alignment failure, and use malloc+memcpy for everything at or below this size. Checking against this is cheap (as long as we don't share a cache line with anything that is frequently written in multithreaded contexts). The actual limits are mostly unpredictable (glibc uses a dynamic threshold), but that way we would adapt to whatever our allocator/OS does.
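
A minimal sketch of that adaptive threshold, using C11 atomics, might look like the following. All names here (`largest_misaligned_realloc`, `should_try_realloc`, `note_misaligned_realloc`) are hypothetical and nothing like this exists in the PR. Relaxed memory ordering is assumed to be enough because the value is only a heuristic hint.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest request size for which realloc has been seen to return a
   misaligned block. Hypothetical global; starts at 0 (no failures seen). */
static _Atomic size_t largest_misaligned_realloc = 0;

/* True if `sz` is larger than every size that has failed so far, so plain
   realloc is worth trying; otherwise go straight to malloc + memcpy. */
bool should_try_realloc(size_t sz)
{
    return sz > atomic_load_explicit(&largest_misaligned_realloc,
                                     memory_order_relaxed);
}

/* Record that realloc returned a misaligned block for a request of `sz`
   bytes. The stored value only ever grows. */
void note_misaligned_realloc(size_t sz)
{
    assert(sz > 0);
    size_t seen = atomic_load_explicit(&largest_misaligned_realloc,
                                       memory_order_relaxed);
    while (sz > seen &&
           !atomic_compare_exchange_weak_explicit(&largest_misaligned_realloc,
                                                  &seen, sz,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* A failed exchange refreshed `seen`; loop while sz is still larger. */
    }
}
```

The read in `should_try_realloc` is a single relaxed load, and writes happen only when a new, larger failure is observed, so the variable should stay mostly read-only. Keeping it off cache lines that other threads write frequently is left to the surrounding code.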

@stillyslalom
Contributor

Any updates on this? Inefficient push!-ing creates a bottleneck when collecting Generators of SizeUnknown, as seen in this thread: https://discourse.julialang.org/t/how-to-obtain-indices-of-an-array-satisfying-boolean-condition/37780/19

@PallHaraldsson
Contributor

@NHDaly, on current nightly (i.e. without this change) I actually see worse timings, up to the point where the process gets killed (and "100% of which was recompilation" is simply wrong?!):

julia +nightly -e 'b = collect(1:2^28); @time push!(b, 1);'
  1.320491 seconds (105 allocations: 3.250 GiB, 0.14% gc time, 0.67% compilation time: 100% of which was recompilation)

julia +nightly -e 'b = collect(1:2^29); @time push!(b, 1);'
  2.661248 seconds (105 allocations: 6.500 GiB, 0.07% gc time, 0.33% compilation time: 100% of which was recompilation)

time julia +nightly -e 'b = collect(1:2^30); @time push!(b, 1);'
/bin/bash: line 1: 1761177 Killed                  julia +nightly -e 'b = collect(1:2^30); @time push!(b, 1);'

real	0m13,528s
user	0m2,567s
sys	0m9,877s

I have "10060,6 free" in top.
