Array push! seems to grow amortized O(N)... #28588

Closed
NHDaly opened this issue Aug 11, 2018 · 26 comments · Fixed by #40453

Comments

@NHDaly
Member

NHDaly commented Aug 11, 2018

Average insertion time for inserting n elements:
[plot: julia front/back insertion time, average]

Total insertion time for inserting n elements:
[plot: julia front/back insertion time, total]

Preliminarily, this behavior seems to be true for both push-front and push-back since 0.5 onwards. Still running the profiles.

@vtjnash
Member

vtjnash commented Aug 11, 2018

general array-insert is O(n), only push/pop (e.g. insert at any fixed offset from begin or end) are expected to be O(1)

@KristofferC
Member

This is normal push! though?

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

So... this is probably an orthogonal, unrelated problem, but @JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:

julia/src/julia_internal.h

Lines 840 to 844 in a2149f8

void *b = jl_malloc_aligned(sz, align);
if (b != NULL) {
memcpy(b, d, oldsz > sz ? sz : oldsz);
free(d);
}

This seems to be because there doesn't exist a reasonable method to realloc aligned data, so we have to manually re-do the _aligned_malloc. However, this StackOverflow post suggests that we may be better off attempting a realloc, then testing the resulting alignment, and then only re-doing the _aligned_malloc if it's needed (because the alignment is off):
https://stackoverflow.com/a/9078627/751061

That's especially true when the arrays get huge -- because in that case it's pretty hard to imagine realloc not producing an aligned memory region, so it's probably worth the potential double allocation.

EDIT: This would make it slower, but should be an amortized constant increase, not an order of magnitude increase.

@NHDaly changed the title from "Array insertion seems to grow amortized O(N)..." to "Array push! seems to grow amortized O(N)..." on Aug 11, 2018
@NHDaly
Member Author

NHDaly commented Aug 11, 2018

Sorry, the problem description wasn't clear. Yes, this is push! and pushfirst!.

@vtjnash
Member

vtjnash commented Aug 11, 2018

So... this is probably an orthogonal, unrelated problem, but JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:

This is still an O(1) operation (and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation)

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

This is still an O(1) operation

Yeah, agreed. That's what I meant by saying this is orthogonal. I shouldn't have qualified that with "probably". Sorry. :)

and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation

Huh, yeah, that's interesting. I'd be interested to hear more about that, but it makes sense.

@JeffBezanson
Member

I still think we should use the technique of calling realloc, checking the alignment, and only copying if necessary. realloc is essential at large array sizes, and at that point also highly likely to be aligned.

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

(I'm happy to branch this conversation to another thread, though, if we want?)

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

So, to continue the original conversation here.

We discovered the actual problem. To summarize what we discussed offline at JuliaCon:

  • We came to believe: Arrays grow following the standard "double when full" algorithm, up until a single array uses >= 1% of the computer's RAM, at which point it switches to only growing by exactly the space needed to append the elements! i.e. it grows by one for each insertion if using push!, making each insertion O(N)!

However, upon more investigation with Valentin, we think that's not quite right. The behavior we noticed above should actually be summarized as follows:

  • if an array attempts to grow by more than 1% of RAM, cap its growth at 1% of RAM.

And that actually seems somewhat reasonable! That change was introduced here:
139a73a

Now, despite that being significantly more reasonable than we originally thought, I still think it might be somewhat related to the performance described above. It still means that above a certain size, the arrays grow by a constant amount -- just with a much larger constant than growing by 1. So I think that would still result in quadratic growth of the total cost to insert N elements?

So this still bears some thinking about: is that acceptable?
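
For intuition, here's a tiny standalone model of the total bytes copied when growing a buffer under "double when full", with and without a cap. The 16 GB RAM figure and the 4 GB target are made-up numbers for illustration; this is a sketch, not the runtime's code:

// toy_growth_cost.c -- illustration only; RAM size and target are assumptions
#include <stdio.h>
#include <stdint.h>

// Total bytes memcpy'd while growing a buffer from 16 bytes up to `target`,
// doubling when full but never growing by more than `cap` bytes at once.
static uint64_t copied_bytes(uint64_t target, uint64_t cap) {
    uint64_t capacity = 16, copied = 0;
    while (capacity < target) {
        uint64_t inc = capacity < cap ? capacity : cap;  // "double, but capped"
        copied += capacity;                              // each growth copies the old buffer
        capacity += inc;
    }
    return copied;
}

int main(void) {
    const uint64_t ram = 16ULL << 30;      // pretend the machine has 16 GB of RAM
    const uint64_t target = 4ULL << 30;    // grow an array to 4 GB

    uint64_t capped   = copied_bytes(target, ram / 100);    // growth capped at 1% of RAM
    uint64_t doubling = copied_bytes(target, UINT64_MAX);   // pure doubling, no cap

    printf("capped at 1%% of RAM: copied %.1fx the final size\n", (double)capped / target);
    printf("pure doubling:       copied %.1fx the final size\n", (double)doubling / target);
    return 0;
}

Pure doubling copies less than 2x the final size no matter how large the array gets (amortized O(1) per append), while the capped strategy copies roughly target/(2*cap) times the final size -- so the total work still grows quadratically in the array length, just with a much gentler constant than growing by one element at a time.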

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

BUT THEN, something else has changed that introduced a significant slowdown in array growth between julia 0.4 and 0.5. As far as I can tell, it's a constant slowdown of around 1.7x.

We see a pretty big slowdown between 0.4 and 0.5, and it's not related to the above change; the above change was introduced 5 years ago, in julia 0.3.

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

In particular, here's the change between 0.4 and 0.5 for push!
[screenshot: push! insertion time, 0.4 vs 0.5]

0.4 looks deceptively flat, but it actually has the same shape, just scaled down by a constant:
[screenshot: push! insertion time, 0.4 only]

So all of them suffer from the same problem of greater-than-O(1) average-time append, but from 0.5 onwards it got slower still. So there are two unrelated "problems" here.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Sorry, and just to clarify, the v0.4 to v0.5 regression seems to be always around 2x:

julia0.7> v05[2] ./ v04[2]
>> 6-element Array{Float64,1}:
 2.048994351707916
 1.7865544298045446
 1.1380501128347926
 0.9317471797454173
 0.9720312759644577
 3.3472547019000554

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

So, NEXT STEPS:

  1. We should have some kind of quick fix for this problem. I think the simplest thing would be to increase that 1% RAM cap to something like 5%. Who cares about the asymptotic complexity: it's fine if your growth becomes "quadratic" if there are only going to be at most 20 more growth operations. 100 is kind of a lot, but maybe 20 is okay. Or 10. We could do 10% maybe? idk.

  2. Find and fix the 2x array growth regression between julia 0.4 and julia 0.5.

  3. Investigate other cool growth strategies:

    • we goofed around, talking about other growth functions we could use besides the current discontinuous function: "double until some cutoff, then always add a constant amount". Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe faster than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

@oscardssmith
Member

what about 1.0? where is that relative to .5?

@StefanKarpinski
Member

StefanKarpinski commented Oct 5, 2018

  • Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe faster than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

Something in the Sigmoid family maybe? Probably best to write down the properties we want it to have and then derive a function that does that.
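
To make that concrete, here's one hypothetical shape (purely illustrative; the budget parameter and the constants are made up, and this is not a proposal for the runtime): a growth factor that starts near 2x for small arrays and eases toward a small relative increment as the array approaches some memory budget.

// Hypothetical tapering growth rule -- a sketch of the idea, not a tuned proposal.
#include <stddef.h>

static size_t next_capacity(size_t len, size_t budget) {
    double t = (double)len / (double)budget;       // 0 = tiny array, 1 = at the budget
    if (t >= 1.0)
        return len + len / 16 + 1;                 // past the budget: still grow ~6%
    double factor = 1.0 + 1.0 / (1.0 + 4.0 * t);   // ~2.0 when t is near 0, ~1.2 near 1
    return (size_t)((double)len * factor) + 1;
}

Note that as long as the factor stays bounded above 1 (here it never drops below ~1.06), insertion stays amortized O(1); it's only when growth degenerates into a constant byte increment, as with the current 1%-of-RAM cap, that the total cost becomes quadratic.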

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

CC: @vchuravy

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

what about 1.0? where is that relative to .5?

I didn't actually measure 1.0, because I assumed it was the same as 0.7. I can do that in a bit, though. Assuming it is, 0.7 was the same as 0.5.

@oscardssmith
Member

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

@nalimilan
Member

Leaving breadcrumbs to #25879, which might not be needed if we fix the performance of push!.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

Ah, sorry, I didn't include a screenshot, but here are the full results:

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

[screenshot: insertion-time benchmarks, julia 0.4 through 1.0]

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

-- EDIT: I forgot I ran the 0.4 and 0.5 tests on julia-debug, but 1.0 on julia. After switching them all to julia, the 0.5 and 1.0 profiles are almost identical. --

Okay, profiling update. (Is there a better place to be posting these thoughts/updates? Maybe in a performance slack channel or something? let me know.)

So @vchuravy helped me set up a profiling script, which I then profiled on my Mac using Xcode's Instruments:

# profilegrowth.jl  (Note, this file is targetting julia 0.4 and 0.5)
function testfunc(capacity)
  a = zeros(capacity÷2)
  ccall(:jl_array_sizehint, Void, (Array, Cint), a, capacity)  # For the julia 1.0 test, I changed Void to Nothing
end

testfunc(10)

function test(iterations)
  for _ in 1:iterations
    testfunc(10^8)
  end
end

test(1)
test(10)

Some interesting early observations:

  1. The total time is significantly different, 2.8secs in v0.4.7 vs 6.1 secs in v0.5.2 (or around 2x).
    • Weirdly, it seems like it's not just growing the array (jl_array_grow_end) that got slower -- everything seems to have gotten slower... Creating the original zeroed-out array takes about twice as long in 0.5, and allocating the array takes ~10x as long (but starting from a very small value, so probably insignificant).
  2. Anyway, the thing we're interested in is that jl_array_grow_end went from 254ms at 7.8% of the profile to 3.29secs at 47.4% of the profile, which seems pretty bad.

julia v0.4:
[screenshot: Instruments profile, julia v0.4]

julia v0.5:
[screenshot: Instruments profile, julia v0.5]

EDIT: And julia 1.0's profile is almost identical to 0.5's. (What I thought was a speed-up was actually just due to using julia vs julia-debug.)
And for comparison, julia 1.0's profile looks very similar to 0.5's, percentage-wise. Everything is faster, with a total time of 5.98secs instead of 8.2secs, but the percentages are similar, and jl_array_grow_end takes 2.87secs at 44% of the profile:
[screenshot: Instruments profile, julia v1.0]

(zero-filling in 1.0 is now slightly faster than in 0.4 though, so that's cool!)


So then, digging into those profiles, from what I can tell, the extra time seems to come from the fact that 0.5 and 1.0 are actually calling memcpy or memmove, whereas 0.4 isn't!:

julia 0.4:
[screenshot: julia 0.4 call tree]

julia 0.5:
[screenshot: julia 0.5 call tree]

julia 1.0:
[screenshot: julia 1.0 call tree]

So actually, as best I can tell, this might be a symptom of the problem Jeff originally identified above (#28588 (comment)) and which I broke out into a separate issue here (#29526).

That is, 0.4 was avoiding the expensive operation of actually copying the data, since most of the time realloc can get away with just growing the memory in place!


Am I understanding that correctly? And if so, is that only true because this is a toy example that does nothing besides allocate that array? In the real world, would we expect memory to be more fragmented, so that this optimization matters less? I think it would probably be good to profile a bunch of typical workloads and compare the CPU time spent in jl_array_grow_end.

This would also explain the ~2x slowdown: we're now touching all the data twice, once during allocation and once during the grow event.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Oops, I just realized I used julia-debug binaries for 0.4 and 0.5, but julia for 1.0. The results are still similar, but let me update the numbers above.

@KristofferC
Member

KristofferC commented Oct 5, 2018

I watched a talk recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193); it compares std::vector and their in-house vector class. The relevant discussion is about 4 minutes long.

I paraphrase:

  • One of the things that happens with a vector, as you push back you have to grow and resize the array.
  • std::vector: create temporary, move the old to the new, resize... This is fine...
  • We wrote our growth capacity like this: we used realloc...
  • There are two artifacts: realloc doesn't have to allocate new memory if the memory allocation library says that you can just grow
  • Unable to get std::vector to ever match the performance.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

@KristofferC Yeah, so the equivalent C++ (compiled with -O3) has a profile that looks more like the 0.5/1.0 versions:
[screenshot: Instruments profile, C++ version]

It spends ~30% of its time zeroing the initial array, and an equal 30% of time copying that data to a new array.

C++ is also hindered by spending an extra 30% of its time zeroing out the rest of the array after growing, whereas julia leaves that data uninitialized.

It's hard to directly compare the total times, since the julia runs included parsing and compiling the code as well, and I'm not sure how much of that contributed to the final profiles, but still, it seems that the C++ is actually a good amount faster despite doing the extra work that 0.5 and 1.0 do... It would be interesting to know why! :)

Here's the c++ file I used:

// Compiled via: clang++ --std=c++11 -O3 profilegrowth.cc
#include <vector>
using namespace std;

void testfunc(int capacity) {
  auto a = vector<int>(capacity/2);
  a.resize(capacity);
}
void test(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    testfunc((int)1e8);
  }
}
int main() {
  // "Warm up" To maintain correlation with the julia script which runs these to precompile
  testfunc(10);
  test(1);
  // Actual test
  test(10);
  return 0;
}

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

watched a video recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193)

This is a fun talk, thanks for sharing! Agreed that it seems quite relevant, and that your summary is good.

@NHDaly
Member Author

NHDaly commented May 15, 2019

Hi again!
So I've been (finally) doing some more investigation into this, and want to share my thoughts.

First, a recap:

There are two problems we've discovered:

  1. Array growth is capped at 1% physical RAM, causing quadratic total insertion time complexity
    • After the size of the array reaches 1% of physical memory, each subsequent growth event will grow the array by another 1% of RAM (i.e. a constant-size growth).
    • Having any constant-sized growth instead of a scaled growth factor leads to quadratic total insertion time instead of linear (i.e. amortized O(N) per insertion instead of amortized O(1)).
    • We've been doing this since v0.3 (in limit xtra buffer allocation #6667)
  2. Julia v0.5 stopped using realloc for growing the array, causing ~2x regression
    • Julia v0.5 stopped using realloc for aligned data, instead performing malloc, copy, free. This is because realloc doesn't guarantee alignment.
    • Based on the performance profiles above, this seems to cause a consistent ~2x slowdown, especially at large sizes. Presumably this is due to the extra time spent copying and freeing, rather than a single (often in-place) realloc.

The solution to problem (2.) is probably to follow the suggested behavior in this stackoverflow post:
https://stackoverflow.com/a/9078627/751061
Namely, to attempt realloc, check for alignment (which will probably be fine), and only fall back to malloc, copy, free if the result is misaligned. However, the legalese is complicated enough to make me wary that this will be unsafe in some weird edge case... It seems worth doing, but it scares me a little.
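
A minimal sketch of that fallback, assuming the jl_malloc_aligned helper from the snippet quoted earlier plus standard realloc/free. It only works on platforms where the aligned buffer ultimately came from a malloc-compatible allocator (i.e. not Windows' _aligned_malloc), which is exactly the edge-case legalese in question:

// Sketch only -- untested; the caller must know the allocation is realloc-compatible
// (posix_memalign/malloc-based), and `align` must be a power of two.
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

void *jl_malloc_aligned(size_t sz, size_t align);   // existing helper, as in the snippet above

static void *try_realloc_aligned(void *d, size_t sz, size_t oldsz, size_t align)
{
    void *b = realloc(d, sz);                  // may grow in place and skip the copy entirely
    if (b == NULL)
        return NULL;
    if (((uintptr_t)b & (align - 1)) == 0)     // large blocks are usually already aligned
        return b;
    // Alignment was lost: fall back to the current malloc/memcpy/free path.
    void *c = jl_malloc_aligned(sz, align);
    if (c != NULL) {
        memcpy(c, b, oldsz > sz ? sz : oldsz);
        free(b);
    }
    return c;
}

In the common case (realloc keeps or produces an aligned block) we never touch the data at all; the double allocation only happens when the alignment check fails.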

Problem (1.), however, is straightforward. All the literature I have read on this topic says you should never grow by a constant amount; growth should always be a scale factor of the current array size. So we should fix that. The remaining question is: what should the growth factor be?

Picking a scaling factor

There is apparently a lot of interesting debate on this topic.

It's summarized well in the wikipedia article on dynamic arrays, which also includes a table of the growth factors in various languages:
https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

Implementation                  Growth factor (a)
Java ArrayList [1]              1.5 (3/2)
Python PyListObject [7]         ~1.125 (n + n >> 3)
Microsoft Visual C++ 2013 [8]   1.5 (3/2)
G++ 5.2.0 [5]                   2
Clang 3.6 [5]                   2
Facebook folly/FBVector [9]     1.5 (3/2)

Introductory CS courses explain that you double an array when it needs to grow. However, some sources such as Facebook's folly claim that 2x growth is provably the worst choice, "because it never allows the vector to reuse any of its previously-allocated memory":
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md

I think that this stackoverflow answer explained the concept well:
https://stackoverflow.com/a/1100426/751061

From what I understand, conceptually any growth factor below the golden ratio (~1.618) eventually allows reusing previously freed allocations, but people tend to prefer 1.5 for other reasons as well.
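
To see why, here's a tiny idealized model (an illustration of the folly argument, assuming all previously freed blocks sit next to each other and get coalesced, which real allocators only approximate): it reports the first growth step at which the space freed so far could hold the next request.

// Idealized reuse model -- illustration only, not allocator reality.
#include <stdio.h>

static void first_reuse(double factor) {
    double freed = 0.0, cur = 16.0;
    for (int step = 1; step <= 64; ++step) {
        double next = cur * factor;        // size of the next allocation request
        if (freed >= next) {               // freed space could already hold the new block
            printf("factor %.2f: reuse possible after %d growths\n", factor, step);
            return;
        }
        freed += cur;                      // the old block is freed after the copy
        cur = next;
    }
    printf("factor %.2f: no reuse within 64 growths\n", factor);
}

int main(void) {
    first_reuse(1.5);   // freed space overtakes the next request after a few growths
    first_reuse(2.0);   // freed space is always one block short of the next request
    return 0;
}

With factor 2 the freed blocks always sum to slightly less than the next request (16 + 32 + ... is always one starting-block short of the next power of two), so the allocator can never hand the old space back; with 1.5 the freed space overtakes the request after about five growths.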

Changing growth factor based on physical RAM

The approach julia took to limit allocation size as you approach the total system RAM is interesting. One thing we could maybe consider is gradually slowing the growth factor as you get larger, rather than switching to a constant growth size like we do now. (For example, we'd discussed other interesting growth functions like sqrt(x) in the past.)

But I think that this is probably a bad idea. Here are some reasons:

  1. Scaling down the growth factor as the array size scales up is close to effectively having a constant-size growth increment. I'm not sure, but my gut tells me that doing this by any amount would make our total insertion time complexity super-linear.
  2. The current limit is based on a percentage of physical RAM, but these days, with virtual memory, it's often totally reasonable to keep an array that is larger than physical RAM!
    • Especially as more computers have SSDs, swapping to disk is significantly less expensive than in the past, and features like overcommit let you allocate more than RAM without slowing down or OOMing until you actually touch the memory.
  3. No one else seems to do this, which makes me think it's not a good idea.

Next Steps

So given all of that, we have a few decisions to make. I'll list them and my proposals here:

  • Remove the entire "over allocation" protection mechanism vs. change it to scale down gradually? [NHDaly: I propose we remove it]
  • What should our scale factor be? [NHDaly: I propose 1.5, but python's very low growth rate is interesting.]
  • Related: should we allow users to configure the growth strategy in any way? [NHDaly: could be useful, but lower priority.]

How does all of that sound? :)

@NHDaly
Member Author

NHDaly commented May 15, 2019

I've opened #32035, which implements the proposals above and which we can use as a strawperson to investigate the best fix.
