Array push! seems to grow amortized O(N)... #28588

Closed
NHDaly opened this issue Aug 11, 2018 · 26 comments · Fixed by #40453

Comments

@NHDaly
Member

NHDaly commented Aug 11, 2018

Average insertion time for inserting n elements:
[plot: julia front/back insertion time, average]

Total insertion time for inserting n elements:
[plot: julia front/back insertion time, total]

Preliminarily, this behavior seems to be true for both push-front and push-back since 0.5 onwards. Still running the profiles.

@vtjnash
Member

vtjnash commented Aug 11, 2018

general array-insert is O(n), only push/pop (e.g. insert at any fixed offset from begin or end) are expected to be O(1)

@KristofferC
Member

This is normal push! though?

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

So... this is probably an orthogonal, unrelated problem, but @JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:

julia/src/julia_internal.h

Lines 840 to 844 in a2149f8

void *b = jl_malloc_aligned(sz, align);
if (b != NULL) {
memcpy(b, d, oldsz > sz ? sz : oldsz);
free(d);
}

This seems to be because there doesn't exist a reasonable method to realloc aligned data, so we have to manually re-do the _aligned_malloc. However, this StackOverflow post suggests that we may be better off attempting a realloc, then testing the resulting alignment, and then only re-doing the _aligned_malloc if it's needed (because the alignment is off):
https://stackoverflow.com/a/9078627/751061

That's especially true when the arrays get huge -- because in that case it's pretty hard to imagine realloc not producing an aligned memory region, so it's probably worth the potential double allocation.

EDIT: This would make it slower, but should be an amortized constant increase, not an order of magnitude increase.

@NHDaly changed the title from "Array insertion seems to grow amortized O(N)..." to "Array push! seems to grow amortized O(N)..." on Aug 11, 2018
@NHDaly
Member Author

NHDaly commented Aug 11, 2018

Sorry, the problem description wasn't clear. Yes, this is push! and pushfirst!.

@vtjnash
Member

vtjnash commented Aug 11, 2018

So... this is probably an orthogonal, unrelated problem, but JeffBezanson and I have found that we're manually doing a malloc, memcpy, free every time an array is grown, rather than using realloc:

This is still an O(1) operation (and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation)

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

This is still an O(1) operation

Yeah, agreed. That's what I meant by saying this is orthogonal. I shouldn't have qualified that with "probably". Sorry. :)

and this implementation is better for future threading changes and will be required when switching to the Buffer-based pure-julia Array implementation

Huh, yeah, that's interesting. I'd be interested to hear more about that, but it makes sense.

@JeffBezanson
Member

I still think we should use the technique of calling realloc, checking the alignment, and only copying if necessary. realloc is essential at large array sizes, and at that point also highly likely to be aligned.

@NHDaly
Member Author

NHDaly commented Aug 11, 2018

(I'm happy to branch this conversation to another thread, though, if we want?)

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

So, to continue the original conversation here.

We discovered the actual problem. To summarize what we discussed offline at JuliaCon:

  • We came to believe: Arrays grow following the standard "double when full" algorithm, up until a single array uses >= 1% of the computer's RAM, at which point it switches to only growing by exactly the space needed to append the elements! i.e. it grows by one for each insertion if using push!, making each insertion O(N)!

However, upon more investigation with Valentin, we think that's not quite right. The behavior we noticed above should actually be summarized as follows:

  • if an array attempts to grow by more than 1% of RAM, cap its growth at 1% of RAM.

And that actually seems somewhat reasonable! That change was introduced here:
139a73a

Now, despite that being significantly more reasonable than we originally thought, I still think it might be somewhat related to the performance described above. It still means that above a certain size, the arrays grow by a constant amount -- just with a much larger constant than growing by 1. So I think that would still result in quadratic growth of the total cost to insert N elements?

So this still bears some thinking about: is that acceptable?
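
For intuition, here's a tiny standalone model of the total bytes copied when growing a buffer under "double when full", with and without a cap. The 16 GB RAM figure and the 4 GB target are made-up numbers for illustration; this is a sketch, not the runtime's code:

// toy_growth_cost.c -- illustration only; RAM size and target are assumptions
#include <stdio.h>
#include <stdint.h>

// Total bytes memcpy'd while growing a buffer from 16 bytes up to `target`,
// doubling when full but never growing by more than `cap` bytes at once.
static uint64_t copied_bytes(uint64_t target, uint64_t cap) {
    uint64_t capacity = 16, copied = 0;
    while (capacity < target) {
        uint64_t inc = capacity < cap ? capacity : cap;  // "double, but capped"
        copied += capacity;                              // each growth copies the old buffer
        capacity += inc;
    }
    return copied;
}

int main(void) {
    const uint64_t ram = 16ULL << 30;      // pretend the machine has 16 GB of RAM
    const uint64_t target = 4ULL << 30;    // grow an array to 4 GB

    uint64_t capped   = copied_bytes(target, ram / 100);    // growth capped at 1% of RAM
    uint64_t doubling = copied_bytes(target, UINT64_MAX);   // pure doubling, no cap

    printf("capped at 1%% of RAM: copied %.1fx the final size\n", (double)capped / target);
    printf("pure doubling:       copied %.1fx the final size\n", (double)doubling / target);
    return 0;
}

Pure doubling copies less than 2x the final size no matter how large the array gets (amortized O(1) per append), while the capped strategy copies roughly target/(2*cap) times the final size -- so the total work still grows quadratically in the array length, just with a much gentler constant than growing by one element at a time.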

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

BUT THEN, something else has changed that introduced a significant slowdown in array growth between julia 0.4 and 0.5. As far as I can tell, it's a constant slowdown of around 1.7x.

We see a pretty big slowdown between 0.4 and 0.5, and it's not related to the above change; the above change was introduced 5 years ago, in julia 0.3.

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

In particular, here's the change between 0.4 and 0.5 for push!
[screenshot: push! insertion time, 0.4 vs 0.5]

0.4 looks deceptively flat, but it actually has the same shape, just scaled down by a constant:
[screenshot: push! insertion time, 0.4 only]

So all of them suffer from the same problem of greater-than-O(1) average-time append, but from 0.5 onwards it got slower still. So there are two unrelated "problems" here.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Sorry, and just to clarify, the v0.4 to v0.5 regression seems to be always around 2x:

julia0.7> v05[2] ./ v04[2]
>> 6-element Array{Float64,1}:
 2.048994351707916
 1.7865544298045446
 1.1380501128347926
 0.9317471797454173
 0.9720312759644577
 3.3472547019000554

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

So, NEXT STEPS:

  1. We should have some kind of quick fix for this problem. I think the simplest thing would be to increase that 1% RAM cap to something like 5%. Who cares about the asymptotic complexity: it's fine if your growth becomes "quadratic" if there are only going to be at most 20 more growth operations. 100 is kind of a lot, but maybe 20 is okay. Or 10. We could do 10% maybe? idk.

  2. Find and fix the 2x array growth regression between julia 0.4 and julia 0.5.

  3. Investigate other cool growth strategies:

    • we goofed around, talking about other growth functions we could use besides the current discontinuous function: "double until some cutoff, then always add a constant amount". Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe faster than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

@oscardssmith
Member

what about 1.0? where is that relative to .5?

@StefanKarpinski
Member

StefanKarpinski commented Oct 5, 2018

  • Could we imagine some continuous function that behaves nicely such that it at least doubles -- maybe faster than doubles -- at the beginning, and then tapers off near the end of RAM? Maybe!

Something in the Sigmoid family maybe? Probably best to write down the properties we want it to have and then derive a function that does that.
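
To make that concrete, here's one hypothetical shape (purely illustrative; the budget parameter and the constants are made up, and this is not a proposal for the runtime): a growth factor that starts near 2x for small arrays and eases toward a small relative increment as the array approaches some memory budget.

// Hypothetical tapering growth rule -- a sketch of the idea, not a tuned proposal.
#include <stddef.h>

static size_t next_capacity(size_t len, size_t budget) {
    double t = (double)len / (double)budget;       // 0 = tiny array, 1 = at the budget
    if (t >= 1.0)
        return len + len / 16 + 1;                 // past the budget: still grow ~6%
    double factor = 1.0 + 1.0 / (1.0 + 4.0 * t);   // ~2.0 when t is near 0, ~1.2 near 1
    return (size_t)((double)len * factor) + 1;
}

Note that as long as the factor stays bounded above 1 (here it never drops below ~1.06), insertion stays amortized O(1); it's only when growth degenerates into a constant byte increment, as with the current 1%-of-RAM cap, that the total cost becomes quadratic.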

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

CC: @vchuravy

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

what about 1.0? where is that relative to .5?

I didn't actually measure 1.0, because I assumed it was the same as 0.7. I can do that in a bit, though. Assuming it is, 0.7 was the same as 0.5.

@oscardssmith
Member

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

@nalimilan
Member

Leaving breadcrumbs to #25879, which might not be needed if we fix the performance of push!.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Oh, all the benchmarks were labeled 0.4 or 0.5, so I assumed.

Ah, sorry, I didn't include a screenshot, but here are the full results:

Here are the plots comparing 0.4-0.7:
https://plot.ly/~nhdaly/1/
https://plot.ly/~nhdaly/3/

[screenshot: insertion-time benchmarks, julia 0.4 through 1.0]

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

-- EDIT: I forgot I ran the 0.4 and 0.5 tests on julia-debug, but 1.0 on julia. After switching them all to julia, the 0.5 and 1.0 profiles are almost identical. --

Okay, profiling update. (Is there a better place to be posting these thoughts/updates? Maybe in a performance slack channel or something? let me know.)

So @vchuravy helped me set up a profiling script, which I then profiled on my Mac using Xcode's Instruments:

# profilegrowth.jl  (Note, this file is targetting julia 0.4 and 0.5)
function testfunc(capacity)
  a = zeros(capacity÷2)
  ccall(:jl_array_sizehint, Void, (Array, Cint), a, capacity)  # For the julia 1.0 test, I changed Void to Nothing
end

testfunc(10)

function test(iterations)
  for _ in 1:iterations
    testfunc(10^8)
  end
end

test(1)
test(10)

Some interesting early observations:

  1. The total time is significantly different, 2.8secs in v0.4.7 vs 6.1 secs in v0.5.2 (or around 2x).
    • Weirdly, it seems like it's not just growing the array (jl_array_grow_end) that got slower -- everything seems to have gotten slower... Creating the original zeroed-out array takes about twice as long in 0.5, and allocating the array takes ~10x as long (but starting from a very small value, so probably insignificant).
  2. Anyway, the thing we're interested in is that jl_array_grow_end went from 254ms at 7.8% of the profile to 3.29secs at 47.4% of the profile, which seems pretty bad.

julia v0.4:
[screenshot: Instruments profile, julia v0.4]

julia v0.5:
[screenshot: Instruments profile, julia v0.5]

EDIT: And julia 1.0's profile is almost identical to 0.5's. (What I thought was a speed-up was actually just due to using julia vs julia-debug.)
And for comparison, julia 1.0's profile looks very similar to 0.5's, percentage-wise. Everything is faster, with a total time of 5.98secs instead of 8.2secs, but the percentages are similar, and jl_array_grow_end takes 2.87secs at 44% of the profile:
[screenshot: Instruments profile, julia v1.0]

(zero-filling in 1.0 is now slightly faster than in 0.4 though, so that's cool!)


So then, digging into those profiles, from what I can tell, the extra time seems to come from the fact that 0.5 and 1.0 are actually calling memcpy or memmove, whereas 0.4 isn't!:

julia 0.4:
[screenshot: julia 0.4 call tree]

julia 0.5:
[screenshot: julia 0.5 call tree]

julia 1.0:
[screenshot: julia 1.0 call tree]

So actually, as best I can tell, this might be a symptom of the problem Jeff originally identified above (#28588 (comment)) and which I broke out into a separate issue here (#29526).

That is, 0.4 was avoiding the expensive operation of actually copying the data, since most of the time realloc can get away with just growing the memory in place!


Am I understanding that correctly? And if so, is that only true because this is a toy example that does nothing besides allocate that array? In the real world, would we expect memory to be more fragmented, so that this optimization matters less? I think it would probably be good to profile a bunch of typical workloads and compare the CPU time spent in jl_array_grow_end.

This would also explain the ~2x slowdown: we're now touching all the data twice, once during allocation and once during the grow event.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

Oops, I just realized I used julia-debug binaries for 0.4 and 0.5, but julia for 1.0. The results are still similar, but let me update the numbers above.

@KristofferC
Member

KristofferC commented Oct 5, 2018

I watched a talk recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193); it compares std::vector and their in-house vector class. The relevant discussion is about 4 minutes long.

I paraphrase:

  • One of the things that happens with a vector, as you push back you have to grow and resize the array.
  • std::vector: create temporary, move the old to the new, resize... This is fine...
  • We wrote our growth capacity like this: we used realloc...
  • There are two artifacts: realloc doesn't have to allocate new memory if the memory allocation library says that you can just grow
  • Unable to get std::vector to ever match the performance.

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

@KristofferC Yeah, so the equivalent C++ (compiled with -O3) has a profile that looks more like the 0.5/1.0 versions:
[screenshot: Instruments profile, C++ version]

It spends ~30% of its time zeroing the initial array, and an equal 30% of time copying that data to a new array.

C++ is also hindered by spending an extra 30% of its time zeroing out the rest of the array after growing, whereas julia leaves that data uninitialized.

It's hard to directly compare the total times, since the julia runs included parsing and compiling the code as well, and I'm not sure how much of that contributed to the final profiles, but still, it seems that the C++ is actually a good amount faster despite doing the extra work that 0.5 and 1.0 do... It would be interesting to know why! :)

Here's the c++ file I used:

// Compiled via: clang++ --std=c++11 -O3 profilegrowth.cc
#include <vector>
using namespace std;

void testfunc(int capacity) {
  auto a = vector<int>(capacity/2);
  a.resize(capacity);
}
void test(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    testfunc((int)1e8);
  }
}
int main() {
  // "Warm up" To maintain correlation with the julia script which runs these to precompile
  testfunc(10);
  test(1);
  // Actual test
  test(10);
  return 0;
}

@NHDaly
Member Author

NHDaly commented Oct 5, 2018

watched a video recently that seems quite relevant to this (https://youtu.be/2YXwg0n9e7E?t=2193)

This is a fun talk, thanks for sharing! Agreed that it seems quite relevant, and that your summary is good.

@NHDaly
Member Author

NHDaly commented May 15, 2019

Hi again!
So I've been (finally) doing some more investigation into this, and want to share my thoughts.

First, a recap:

There are two problems we've discovered:

  1. Array growth is capped at 1% physical RAM, causing quadratic total insertion time complexity
    • After the size of the array reaches 1% of physical memory, each subsequent growth event will grow the array by another 1% of RAM (i.e. a constant-size growth).
    • Having any constant-sized growth instead of a scaled growth factor leads to quadratic total insertion time instead of linear (i.e. amortized O(N) per insertion instead of amortized O(1)).
    • We've been doing this since v0.3 (in limit xtra buffer allocation #6667)
  2. Julia v0.5 stopped using realloc for growing the array, causing ~2x regression
    • Julia v0.5 stopped using realloc for aligned data, instead performing malloc, copy, free. This is because realloc doesn't guarantee alignment.
    • Based on the performance profiles above, this seems to cause a consistent ~2x slowdown, especially at large sizes. Presumably this is due to the extra time spent copying and freeing, rather than a single (often in-place) realloc.

The solution to problem (2.) is probably to follow the suggested behavior in this stackoverflow post:
https://stackoverflow.com/a/9078627/751061
Namely, to attempt realloc, check for alignment (which will probably be fine), and only fall back to malloc, copy, free if the result is misaligned. However, the legalese is complicated enough to make me wary that this will be unsafe in some weird edge case... It seems worth doing, but it scares me a little.
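
A minimal sketch of that fallback, assuming the jl_malloc_aligned helper from the snippet quoted earlier plus standard realloc/free. It only works on platforms where the aligned buffer ultimately came from a malloc-compatible allocator (i.e. not Windows' _aligned_malloc), which is exactly the edge-case legalese in question:

// Sketch only -- untested; the caller must know the allocation is realloc-compatible
// (posix_memalign/malloc-based), and `align` must be a power of two.
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

void *jl_malloc_aligned(size_t sz, size_t align);   // existing helper, as in the snippet above

static void *try_realloc_aligned(void *d, size_t sz, size_t oldsz, size_t align)
{
    void *b = realloc(d, sz);                  // may grow in place and skip the copy entirely
    if (b == NULL)
        return NULL;
    if (((uintptr_t)b & (align - 1)) == 0)     // large blocks are usually already aligned
        return b;
    // Alignment was lost: fall back to the current malloc/memcpy/free path.
    void *c = jl_malloc_aligned(sz, align);
    if (c != NULL) {
        memcpy(c, b, oldsz > sz ? sz : oldsz);
        free(b);
    }
    return c;
}

In the common case (realloc keeps or produces an aligned block) we never touch the data at all; the double allocation only happens when the alignment check fails.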

Problem (1.), however, is straightforward. All the literature I have read on this topic says you should never grow by a constant amount; growth should always be a scale factor of the current array size. So we should fix that. The remaining question is: what should the growth factor be?

Picking a scaling factor

There is apparently a lot of interesting debate on this topic.

It's summarized well in the wikipedia article on dynamic arrays, which also includes a table of the growth factors in various languages:
https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

Implementation                  Growth factor (a)
Java ArrayList [1]              1.5 (3/2)
Python PyListObject [7]         ~1.125 (n + n >> 3)
Microsoft Visual C++ 2013 [8]   1.5 (3/2)
G++ 5.2.0 [5]                   2
Clang 3.6 [5]                   2
Facebook folly/FBVector [9]     1.5 (3/2)

Introductory CS courses explain that you double an array when it needs to grow. However, some sources such as Facebook's folly claim that 2x growth is provably the worst choice, "because it never allows the vector to reuse any of its previously-allocated memory":
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md

I think that this stackoverflow answer explained the concept well:
https://stackoverflow.com/a/1100426/751061

From what I understand, conceptually any growth factor below the golden ratio (~1.618) eventually allows reusing previously freed allocations, but people tend to prefer 1.5 for other reasons as well.
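
To see why, here's a tiny idealized model (an illustration of the folly argument, assuming all previously freed blocks sit next to each other and get coalesced, which real allocators only approximate): it reports the first growth step at which the space freed so far could hold the next request.

// Idealized reuse model -- illustration only, not allocator reality.
#include <stdio.h>

static void first_reuse(double factor) {
    double freed = 0.0, cur = 16.0;
    for (int step = 1; step <= 64; ++step) {
        double next = cur * factor;        // size of the next allocation request
        if (freed >= next) {               // freed space could already hold the new block
            printf("factor %.2f: reuse possible after %d growths\n", factor, step);
            return;
        }
        freed += cur;                      // the old block is freed after the copy
        cur = next;
    }
    printf("factor %.2f: no reuse within 64 growths\n", factor);
}

int main(void) {
    first_reuse(1.5);   // freed space overtakes the next request after a few growths
    first_reuse(2.0);   // freed space is always one block short of the next request
    return 0;
}

With factor 2 the freed blocks always sum to slightly less than the next request (16 + 32 + ... is always one starting-block short of the next power of two), so the allocator can never hand the old space back; with 1.5 the freed space overtakes the request after about five growths.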

Changing growth factor based on physical RAM

The approach julia took to limit allocation size as you approach the total system RAM is interesting. One thing we could maybe consider is gradually slowing the growth factor as you get larger, rather than switching to a constant growth size like we do now. (For example, we'd discussed other interesting growth functions like sqrt(x) in the past.)

But I think that this is probably a bad idea. Here are some reasons:

  1. Scaling down the growth factor as the array size scales up is close to effectively having a constant-size growth increment. I'm not sure, but my gut tells me that doing this by any amount would make our total insertion time complexity super-linear.
  2. The current limit is based on a percentage of physical RAM, but these days, with virtual memory, it's often totally reasonable to keep an array that is larger than physical RAM!
    • Especially as more computers have SSDs, swapping to disk is significantly less expensive than in the past, and features like overcommit let you allocate more than RAM without slowing down or OOMing until you actually touch the memory.
  3. No one else seems to do this, which makes me think it's not a good idea.

Next Steps

So given all of that, we have a few decisions to make. I'll list them and my proposals here:

  • Remove the entire "over allocation" protection mechanism vs. change it to scale down gradually? [NHDaly: I propose we remove it]
  • What should our scale factor be? [NHDaly: I propose 1.5, but python's very low growth rate is interesting.]
  • Related: should we allow users to configure the growth strategy in any way? [NHDaly: could be useful, but lower priority.]

How does all of that sound? :)

@NHDaly
Member Author

NHDaly commented May 15, 2019

I've opened #32035, which implements the proposals above and which we can use as a strawperson to investigate the best fix.
