Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate ArrayBuilder to new GrowableBuffer #1542

Merged
merged 27 commits into from
Jul 15, 2022
Merged

migrate ArrayBuilder to new GrowableBuffer #1542

merged 27 commits into from
Jul 15, 2022

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Jul 11, 2022

replaces #1529

  • GrowableBuffer has been modified to remove the dependencies on the std containers and utilities.
  • GrowableBuffer has been moved to an awkward subdirectory. The cmake INTERFACE has been updated to allow the following include:
#include "awkward/GrowableBuffer.h"
  • C++ tests have been added, and their cmake build configuration.
  • ArrayBuilder has been updated to use the new GrowableBuffer.

@ianna ianna marked this pull request as draft July 11, 2022 14:23
@ianna
Copy link
Collaborator Author

ianna commented Jul 13, 2022

@jpivarski - while debugging the failing test - it looks like to_buffers is called twice somewhere in to_list:

def test_big():
    a = ak._v2.highlevel.ArrayBuilder(initial=90)
    for i in range(2000):
        if i == 200:
            tmp = a.snapshot()
        a.boolean(i % 2 == 0)
    print("to_list(a)")
    assert to_list(a) == [True, False] * 1000
    print("----> to_list(tmp)")
    assert to_list(tmp) == [True, False] * 100
    print("DONE!")
tests/v2/test_0018-fromiter-fillable.py::test_big ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 200
to_list(a)
ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 2000
ArrayBuilder::to_buffers 0
BoolBuilder::to_buffers 0 2000
----> to_list(tmp)
DONE!

also, the buffers dictionary would have two node0-data buffers in this case - one with 200 entries and another with 2000. Would it cause any problems? Shall we go for a unique id?

@jpivarski
Copy link
Member

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.

Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.

All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

@ianna
Copy link
Collaborator Author

ianna commented Jul 13, 2022

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.

Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.

All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

Thanks! I should have mentioned that the example is taken from the main branch. I saw this behaviour in the PR and re-checked that it is the same without it.

@codecov
Copy link

codecov bot commented Jul 14, 2022

Codecov Report

Merging #1542 (c81f7e8) into main (78a6535) will decrease coverage by 0.12%.
The diff coverage is 58.27%.

Impacted Files Coverage Δ
src/awkward/_v2/contents/indexedoptionarray.py 89.14% <54.54%> (-0.70%) ⬇️
src/awkward/_v2/highlevel.py 70.77% <57.60%> (-2.76%) ⬇️
src/awkward/_v2/_connect/numba/arrayview.py 96.76% <100.00%> (ø)
src/awkward/_v2/types/recordtype.py 86.72% <100.00%> (-0.35%) ⬇️
src/awkward/_v2/types/numpytype.py 88.09% <0.00%> (-1.20%) ⬇️
src/awkward/_v2/record.py 76.86% <0.00%> (-0.18%) ⬇️
src/awkward/_v2/contents/content.py 76.61% <0.00%> (ø)
src/awkward/_v2/operations/ak_is_none.py 96.55% <0.00%> (+0.39%) ⬆️
src/awkward/_v2/operations/ak_type.py 53.84% <0.00%> (+2.56%) ⬆️

@ianna
Copy link
Collaborator Author

ianna commented Jul 14, 2022

Being a dict, buffers can't have two buffers with the same name (node0-data), and if they differ in the number of entries, assuming that the small one has a larger number entries than it really does would result in a segfault.
Usually, the buffers are NumPy arrays or Py_Buffers with a correct length (200 or 2000). If that's the case here, it would prevent a segfault—it would be a "buffer is too small," rather than a segfault.
All of the nodes in a Form must have distinct ids from each other. They don't have to be distinct from other Forms, but without uniqueness-per-tree, you'd get these sorts of collisions.

Thanks! I should have mentioned that the example is taken from the main branch. I saw this behaviour in the PR and re-checked that it is the same without it.

This is understood - the second to_buffers is coming from the "form" method - no extra copies are done there.

@ianna ianna marked this pull request as ready for review July 14, 2022 11:25
@ianna ianna requested a review from jpivarski July 14, 2022 11:25
@ianna
Copy link
Collaborator Author

ianna commented Jul 14, 2022

@jpivarski - I'm done with this PR. Please, have a look when you have time. Thanks!

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving this, though I have questions about it. See below—if you're going to keep the idea of having new panels be bigger than old panels (potentially a good idea, but one I hadn't thought of), you'll need a way to pass that parameter down, rather than hard-code it at 1.5. (I have no idea if that's a good value or not.)

These things could be addressed in a new PR, but if you're going to revert the many instances of optionsinitial back to → options, then perhaps it should be in this PR so the main history will have a smaller diff. The new options could be named BuilderOptions, rather than ArrayBuilderOptions, because of the generality.

if (length_[ptr_.size()-1] == reserved_[ptr_.size()-1]) {
add_panel(reserved_[ptr_.size()-1]);
if (ptr_->length() == ptr_->reserved()) {
add_panel((size_t)ceil(ptr_->reserved() * 1.5));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panels grow as they're added? That's not a bad idea: at first, you don't know how big they need to be, but if you've been adding a lot of panels, you can be more confident that large panels are better.

However, if that's the case, then it needs to be possible to parameterize the resize, not just fix it as 1.5. The old algorithm's resize=1.5 came from the fact that it was a replacement algorithm, and there's some mathematical reason why the golden ratio is the best resize factor (for defragmentation). Adding panels and not removing the old ones is a different dynamic—there's less fragmentation, for one thing—so the best factor might not be 1.5. Eventually, you'll want to do a performance test, and that would involve varying these two parameters, initial and resize.

Those are the two parameters that the old ArrayBuilderOptions carried, and this PR converts a lot of options (which can be extended by modifying ArrayBuilderOptions) into initial (which can't be extended). If we're already wanting to reintroduce a factor, are you sure you want to do this?

On the other hand, I don't yet know that the optimal resize isn't 1.0. Maybe you won't be needing it. But having an ArrayBuilderOptions, or at least a "BuilderOptions", could be more future-proof.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the resize parameter should not be set by a user. Also, perhaps, it should not be a fixed number. I agree, we should profile this first. GrowableBuffer can have a "growable strategy" hint, for example. I'll merge it as is - this will help @ManasviGoyal to rebase her PR.

@ianna ianna merged commit 0f1b35c into main Jul 15, 2022
@ianna ianna deleted the ianna/array-builder-v2 branch July 15, 2022 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants