migrate ArrayBuilder to new GrowableBuffer #1542

ianna · 2022-07-11T14:22:44Z

replaces #1529

GrowableBuffer has been modified to remove the dependencies on the std containers and utilities.
GrowableBuffer has been moved to an awkward subdirectory. The cmake INTERFACE has been updated to allow the following include:

#include "awkward/GrowableBuffer.h"

C++ tests have been added, and their cmake build configuration.
ArrayBuilder has been updated to use the new GrowableBuffer.

ianna · 2022-07-13T15:51:27Z

@jpivarski - while debugging the failing test - it looks like to_buffers is called twice somewhere in to_list:

def test_big():
    a = ak._v2.highlevel.ArrayBuilder(initial=90)
    for i in range(2000):
        if i == 200:
            tmp = a.snapshot()
        a.boolean(i % 2 == 0)
    print("to_list(a)")
    assert to_list(a) == [True, False] * 1000
    print("----> to_list(tmp)")
    assert to_list(tmp) == [True, False] * 100
    print("DONE!")

tests/v2/test_0018-fromiter-fillable.py::test_big ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 200
to_list(a)
ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 2000
ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 2000
----> to_list(tmp)
DONE!

also, the buffers dictionary would have two node0-data buffers in this case - one with 200 entries and another with 2000. Would it cause any problems? Shall we go for a unique id?

jpivarski · 2022-07-13T16:18:45Z

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.

Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.

All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

ianna · 2022-07-13T20:22:37Z

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.

Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.

All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

Thanks! I should have mentioned that the example is taken from the main branch. I saw this behaviour in the PR and re-checked that it is the same without it.

codecov · 2022-07-14T10:27:07Z

Codecov Report

Merging #1542 (c81f7e8) into main (78a6535) will decrease coverage by 0.12%.
The diff coverage is 58.27%.

Impacted Files	Coverage Δ
src/awkward/_v2/contents/indexedoptionarray.py	`89.14% <54.54%> (-0.70%)`	⬇️
src/awkward/_v2/highlevel.py	`70.77% <57.60%> (-2.76%)`	⬇️
src/awkward/_v2/_connect/numba/arrayview.py	`96.76% <100.00%> (ø)`
src/awkward/_v2/types/recordtype.py	`86.72% <100.00%> (-0.35%)`	⬇️
src/awkward/_v2/types/numpytype.py	`88.09% <0.00%> (-1.20%)`	⬇️
src/awkward/_v2/record.py	`76.86% <0.00%> (-0.18%)`	⬇️
src/awkward/_v2/contents/content.py	`76.61% <0.00%> (ø)`
src/awkward/_v2/operations/ak_is_none.py	`96.55% <0.00%> (+0.39%)`	⬆️
src/awkward/_v2/operations/ak_type.py	`53.84% <0.00%> (+2.56%)`	⬆️

ianna · 2022-07-14T11:25:02Z

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.
Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.
All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

Thanks! I should have mentioned that the example is taken from the main branch. I saw this behaviour in the PR and re-checked that it is the same without it.

This is understood - the second to_buffers is coming from the "form" method - no extra copies are done there.

ianna · 2022-07-14T14:17:26Z

@jpivarski - I'm done with this PR. Please, have a look when you have time. Thanks!

jpivarski

I'm approving this, though I have questions about it. See below—if you're going to keep the idea of having new panels be bigger than old panels (potentially a good idea, but one I hadn't thought of), you'll need a way to pass that parameter down, rather than hard-code it at 1.5. (I have no idea if that's a good value or not.)

These things could be addressed in a new PR, but if you're going to revert the many instances of options → initial back to → options, then perhaps it should be in this PR so the main history will have a smaller diff. The new options could be named BuilderOptions, rather than ArrayBuilderOptions, because of the generality.

jpivarski · 2022-07-14T22:06:26Z

src/awkward/_v2/cpp-headers/awkward/GrowableBuffer.h

-      if (length_[ptr_.size()-1] == reserved_[ptr_.size()-1]) {
-        add_panel(reserved_[ptr_.size()-1]);
+      if (ptr_->length() == ptr_->reserved()) {
+        add_panel((size_t)ceil(ptr_->reserved() * 1.5));


The panels grow as they're added? That's not a bad idea: at first, you don't know how big they need to be, but if you've been adding a lot of panels, you can be more confident that large panels are better.

However, if that's the case, then it needs to be possible to parameterize the resize, not just fix it as 1.5. The old algorithm's resize=1.5 came from the fact that it was a replacement algorithm, and there's some mathematical reason why the golden ratio is the best resize factor (for defragmentation). Adding panels and not removing the old ones is a different dynamic—there's less fragmentation, for one thing—so the best factor might not be 1.5. Eventually, you'll want to do a performance test, and that would involve varying these two parameters, initial and resize.

Those are the two parameters that the old ArrayBuilderOptions carried, and this PR converts a lot of options (which can be extended by modifying ArrayBuilderOptions) into initial (which can't be extended). If we're already wanting to reintroduce a factor, are you sure you want to do this?

On the other hand, I don't yet know that the optimal resize isn't 1.0. Maybe you won't be needing it. But having an ArrayBuilderOptions, or at least a "BuilderOptions", could be more future-proof.

I think, the resize parameter should not be set by a user. Also, perhaps, it should not be a fixed number. I agree, we should profile this first. GrowableBuffer can have a "growable strategy" hint, for example. I'll merge it as is - this will help @ManasviGoyal to rebase her PR.

migrate ArrayBuilder to new GrowableBuffer

d46073f

ianna mentioned this pull request Jul 11, 2022

array builder with new growable buffer #1529

Closed

ianna marked this pull request as draft July 11, 2022 14:23

ianna added 20 commits July 11, 2022 17:19

update to c++17 and cleanup warnings

43b639f

move GrowableBuffer header under awkward and add a interface path to it

4ac7052

build cpp header only tests

7382003

add tests-cpp

0b00d95

add CMakeFile

2c700cd

add CMakeLists.txt

0139354

remove addtest

68b5911

add move constructor and cleanup

3da93cf

cleanup

60c871b

use static function to copy data

0f3d600

revert back to c++11

6913b44

add an array builder cpp test

08aeb8f

back to C++11 for Ubuntu: The CXX compiler identification is GNU 4.8.2

0eb4572

remove debug printout

2e9c3a5

rename

6121fcf

move away from std::vector, use a linked list instead

ea07e69

keep track of the filled data panels length

24326bb

revert

d14c500

add method append and ref

d0c960e

add clear method

8682784

ianna added 3 commits July 14, 2022 08:53

add a copy number to the form_key

c2ddf57

remove debug printout

14595d7

skip concatenate for EmptyBuffer

ff7d0fc

ianna marked this pull request as ready for review July 14, 2022 11:25

ianna requested a review from jpivarski July 14, 2022 11:25

ianna added 2 commits July 14, 2022 15:36

revert test

8d7647e

add a test

9557a4b

ianna mentioned this pull request Jul 14, 2022

configure cpp header-only tests #1528

Closed

use array specialization and cleanup

c81f7e8

jpivarski approved these changes Jul 14, 2022

View reviewed changes

ianna merged commit 0f1b35c into main Jul 15, 2022

ianna deleted the ianna/array-builder-v2 branch July 15, 2022 05:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

migrate ArrayBuilder to new GrowableBuffer #1542

migrate ArrayBuilder to new GrowableBuffer #1542

ianna commented Jul 11, 2022 •

edited

Loading

ianna commented Jul 13, 2022

jpivarski commented Jul 13, 2022

ianna commented Jul 13, 2022

codecov bot commented Jul 14, 2022 •

edited

Loading

ianna commented Jul 14, 2022

ianna commented Jul 14, 2022

jpivarski left a comment

jpivarski Jul 14, 2022

ianna Jul 15, 2022

migrate ArrayBuilder to new GrowableBuffer #1542

migrate ArrayBuilder to new GrowableBuffer #1542

Conversation

ianna commented Jul 11, 2022 • edited Loading

ianna commented Jul 13, 2022

jpivarski commented Jul 13, 2022

ianna commented Jul 13, 2022

codecov bot commented Jul 14, 2022 • edited Loading

Codecov Report

ianna commented Jul 14, 2022

ianna commented Jul 14, 2022

jpivarski left a comment

Choose a reason for hiding this comment

jpivarski Jul 14, 2022

Choose a reason for hiding this comment

ianna Jul 15, 2022

Choose a reason for hiding this comment

ianna commented Jul 11, 2022 •

edited

Loading

codecov bot commented Jul 14, 2022 •

edited

Loading