ARROW-1712: [C++] Add method to BinaryBuilder to reserve space for value data #1481
Syncing from original
When building BinaryArrays with a known size using Resize and Reserve methods, space is also reserved for value_data_builder_ to prevent internal reallocation
Update BinaryBuilder::Resize(int64_t capacity) in builder.cc
Please let me know if I should instead create a new method and allocate the value data there, rather than doing it directly inside the Resize method.
@xuepanchen thank you for your contribution. We need to add a new method to […]. To give an example, suppose that we anticipate building an array with 1000 elements, each of which has an expected size of around 100 bytes. You would want to write something like:
(@xhochy do you have an opinion on what to call this?) Please also add a method to return the capacity of the internal […]
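A self-contained sketch of the usage pattern described in this comment, using a toy stand-in for the builder. ToyBinaryBuilder and all of its members are hypothetical and only model the idea of reserving element slots and value bytes separately; they are not Arrow's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a binary builder: one offset per element plus a
// contiguous byte buffer for the values. Not Arrow's actual class.
class ToyBinaryBuilder {
 public:
  // Make room for `elements` additional offsets.
  void Reserve(int64_t elements) {
    offsets_.reserve(offsets_.size() + static_cast<std::size_t>(elements));
  }
  // Make room for `bytes` additional bytes of value data.
  void ReserveData(int64_t bytes) {
    data_.reserve(data_.size() + static_cast<std::size_t>(bytes));
  }
  void Append(const std::string& value) {
    offsets_.push_back(static_cast<int32_t>(data_.size()));
    data_.insert(data_.end(), value.begin(), value.end());
  }
  int64_t value_data_length() const { return static_cast<int64_t>(data_.size()); }
  int64_t value_data_capacity() const {
    return static_cast<int64_t>(data_.capacity());
  }

 private:
  std::vector<int32_t> offsets_;
  std::vector<uint8_t> data_;
};

// Build 1000 values of 100 bytes each after reserving up front. Returns true
// if the value buffer capacity never changed while appending.
bool BuildWithoutReallocation() {
  ToyBinaryBuilder builder;
  builder.Reserve(1000);            // 1000 elements
  builder.ReserveData(1000 * 100);  // about 100 bytes per element
  const int64_t reserved = builder.value_data_capacity();
  const std::string value(100, 'x');
  for (int i = 0; i < 1000; ++i) builder.Append(value);
  return builder.value_data_capacity() == reserved &&
         builder.value_data_length() == 1000 * 100;
}
```

The two reservations are sized independently: the first by element count, the second by the expected total number of value bytes.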
@xuepanchen note that each time you push any commits to GitHub on an open PR, it creates CI builds in our Travis CI and Appveyor queues, so small incremental pushes can impact other developers who are waiting on their builds to run. In general, it's a good practice to wait to open a PR on a WIP patch until you're ready to validate a completed patch and/or need code review. If you enable Travis CI and Appveyor on your fork of Arrow, you can see CI builds on your branches without having to open a PR to the Arrow repo. e.g. https://travis-ci.org/wesm/arrow/branches
@wesm thank you for the reminder. Will pay more attention next time.
cpp/src/arrow/builder.cc
Outdated

Status BinaryBuilder::ReserveData(int64_t capacity) {
  DCHECK_LT(capacity, std::numeric_limits<int32_t>::max());
  return value_data_builder_.Resize(capacity * sizeof(int64_t));
Why do we multiply here with int64_t? I would expect that ReserveData(x) will lead to value_data_capacity() = x.
I think this should be Resize(capacity + value_data_builder_.length()). The sizeof(int64_t) looks like a copy-paste error from the implementation of BinaryBuilder::Resize above (and that should be sizeof(int32_t) there, so we should fix that).
We should check that extra_capacity + length does not exceed INT32_MAX but probably return Status::Invalid since overflowing a BinaryBuilder is likely to happen somewhat more regularly.
I note also that BufferBuilder and TypedBufferBuilder<T> don't have a shrink_to_fit option in their Resize method; that would be good to add to avoid unnecessary reallocations.
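A minimal sketch of the bounds check suggested here, assuming an additive reservation on top of the bytes already held. ToyStatus and CheckReserveData are hypothetical names for illustration; Arrow's real Status class and the final ReserveData implementation may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical minimal status type; Arrow's real Status class is richer.
struct ToyStatus {
  bool ok;
  std::string message;
};

// Validate reserving `extra` more bytes on top of `length` bytes already held,
// given that offsets are signed 32-bit. Assumes 0 <= length <= INT32_MAX, so
// the subtraction below cannot overflow.
ToyStatus CheckReserveData(int64_t length, int64_t extra) {
  const int64_t kMax = std::numeric_limits<int32_t>::max();
  if (extra > kMax - length) {
    // Report the problem to the caller instead of aborting via a debug check.
    return {false, "Cannot reserve capacity larger than 2^31 - 1 for binary data"};
  }
  return {true, ""};
}
```

Returning an error status rather than asserting lets callers handle an oversized builder as an ordinary, recoverable condition.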
Replace parameter name from "bytes" to "capacity" to avoid confusion.
Add TestCapacityReserve to test space reservation for BinaryBuilder
data_capacity_ represents the indicated capacity of value_data_builder_ and is always less than or equal to the actual capacity of the underlying value_data_builder_ (data_capacity_ <= value_data_builder_.capacity()). When we call ReserveData(capacity), the new capacity is max(data_capacity_, data length + capacity). data_capacity_ is set to this new capacity, but the underlying buffer size is set to BitUtil::RoundUpToMultipleOf64(new capacity) so that the buffer capacity is a multiple of 64 bytes, as defined in Layout.md. That is why data_capacity_ is needed: it records the indicated capacity of the BinaryBuilder, just as ArrayBuilder::capacity_ records the indicated capacity of ArrayBuilder.
A safety check is added in BinaryBuilder::Append() to update data_capacity_ if the data length exceeds it. data_capacity_ is otherwise only updated in ReserveData(), so if users append more data than they reserved, the data length can exceed data_capacity_ (while still satisfying data length <= actual capacity of the underlying value_data_builder_). In that case data_capacity_ is set equal to the data length to avoid confusion.
Update ReserveData(int64_t) method for BinaryBuilder
Update ReserveData method based on feedback and add test case for BinaryBuilder.
Syncing from original
cpp/src/arrow/array-test.cc
Outdated
ASSERT_EQ(builder_->length(), length);
ASSERT_EQ(builder_->capacity(), BitUtil::NextPower2(capacity));
ASSERT_EQ(builder_->value_data_length(), data_length);
ASSERT_EQ(builder_->value_data_capacity(), capacity);
Why is this not a power of 2?
cpp/src/arrow/array-test.cc
Outdated
if (data_length <= capacity) {
  ASSERT_EQ(builder_->value_data_capacity(), capacity);
} else {
  ASSERT_EQ(builder_->value_data_capacity(), data_length);
It's not clear to me why these assertions would hold true:
ASSERT_EQ(builder_->capacity(), BitUtil::NextPower2(capacity));
ASSERT_EQ(builder_->value_data_capacity(), capacity);
I would think that value_data_capacity() is always the power of 2 greater than or equal to the amount of data appended so far, i.e. ASSERT_EQ(BitUtil::NextPower2(data_length), builder_->value_data_capacity())
Can you make the strings you are appending much larger, at least 10 characters each?
@wesm value_data_capacity() is actually always a multiple of 64 greater than or equal to the amount of data appended so far because the underlying buffer size is set to ensure that the capacity of the buffer is a multiple of 64 bytes as defined in Layout.md, i.e.
ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(data_length), builder_->value_data_capacity())
So if you call ReserveData(capacity) at the very beginning, then we have
ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(capacity), builder_->value_data_capacity())
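The rounding rule in this reply can be sketched as follows. RoundUpTo64 and CapacityAfterReserve are hypothetical helpers that model the behavior described in the thread (capacity never shrinks and is padded to a multiple of 64 bytes); they are not Arrow's BitUtil implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round a non-negative byte count up to the next multiple of 64.
constexpr int64_t RoundUpTo64(int64_t n) { return (n + 63) / 64 * 64; }

// Modeled capacity after reserving `extra` bytes on top of `length` bytes of
// data: the padded request, unless the buffer is already larger than that.
inline int64_t CapacityAfterReserve(int64_t current_capacity, int64_t length,
                                    int64_t extra) {
  return std::max(current_capacity, RoundUpTo64(length + extra));
}

static_assert(RoundUpTo64(0) == 0, "zero stays zero");
static_assert(RoundUpTo64(64) == 64, "exact multiples are unchanged");
```

Under this model, reserving 100 bytes on an empty buffer yields an expected capacity of 128, which is why a test would compare against the rounded value rather than the requested one.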
cpp/src/arrow/array-test.cc
Outdated
ASSERT_OK(builder_->Reserve(capacity));
ASSERT_OK(builder_->ReserveData(capacity));

ASSERT_EQ(builder_->length(), length);
In googletest, the 1st parameter passed to ASSERT_EQ should be the expected result, so flip the argument order here and below
cpp/src/arrow/array-test.cc
Outdated
int64_t capacity = N;

ASSERT_OK(builder_->Reserve(capacity));
ASSERT_OK(builder_->ReserveData(capacity));
These reservations should be viewed as different, because the size of the offsets buffer and the data buffer grow at different rates. The way this unit test should read is:
- Call ReserveData with enough space for some large-ish amount of data (say 4K bytes or so)
- Append <= N bytes incrementally
- Check that the capacity remains invariant at the end (i.e. the initial ReserveData made sure that no additional reallocations took place)
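The shape of the test outlined above can be sketched without Arrow, using a std::string as a stand-in for the value data buffer. AppendsStayWithinReservation is a hypothetical helper; the real test would call the builder's ReserveData and value_data_capacity instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reserve `reserved_bytes` up front, append small pieces until the next one
// would no longer fit, then report whether the buffer was ever reallocated.
bool AppendsStayWithinReservation(std::size_t reserved_bytes) {
  std::string buffer;
  buffer.reserve(reserved_bytes);
  const std::size_t capacity_before = buffer.capacity();
  const char* data_before = buffer.data();

  const std::vector<std::string> pieces = {"aaaaa", "bbbbbbbbbb",
                                           "ccccccccccccccc", "dddddddddd"};
  std::size_t i = 0;
  while (buffer.size() + pieces[i % pieces.size()].size() <= reserved_bytes) {
    buffer += pieces[i % pieces.size()];
    ++i;
  }
  // Same capacity and same storage address: no reallocation took place.
  return buffer.capacity() == capacity_before && buffer.data() == data_before;
}
```

Checking the storage address as well as the capacity guards against a reallocation that happens to land on the same capacity.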
cpp/src/arrow/builder.cc
Outdated
@@ -1208,7 +1208,7 @@ ArrayBuilder* ListBuilder::value_builder() const {
 // String and binary

 BinaryBuilder::BinaryBuilder(const std::shared_ptr<DataType>& type, MemoryPool* pool)
-    : ArrayBuilder(type, pool), offsets_builder_(pool), value_data_builder_(pool) {}
+    : ArrayBuilder(type, pool), offsets_builder_(pool), value_data_builder_(pool), data_capacity_(0) {}
I don't think we need this extra member; we can just use value_data_capacity() wherever data_capacity_ is currently being used
cpp/src/arrow/builder.cc
Outdated
  return Status::Invalid("Cannot reserve capacity larger than 2^31 - 1 in length for binary data");
}

RETURN_NOT_OK(value_data_builder_.Resize(value_data_length() + capacity));
If you rebase on master, you can use value_data_builder_.Reserve(capacity) here
cpp/src/arrow/builder.cc
Outdated
}

RETURN_NOT_OK(value_data_builder_.Resize(value_data_length() + capacity));
data_capacity_ = value_data_length() + capacity;
Not needed
cpp/src/arrow/builder.cc
Outdated
@@ -1241,6 +1253,9 @@ Status BinaryBuilder::Append(const uint8_t* value, int32_t length) {
   RETURN_NOT_OK(Reserve(1));
   RETURN_NOT_OK(AppendNextOffset());
   RETURN_NOT_OK(value_data_builder_.Append(value, length));
+  if (data_capacity_ < value_data_length()) {
+    data_capacity_ = value_data_length();
+  }
Remove this
cpp/src/arrow/builder.h
Outdated
@@ -682,10 +682,13 @@ class ARROW_EXPORT BinaryBuilder : public ArrayBuilder {

   Status Init(int64_t elements) override;
   Status Resize(int64_t capacity) override;
+  Status ReserveData(int64_t capacity);
Can you add a doxygen comment, since this is a new API?
I note that the Reserve-type methods in this header have different semantics from their STL counterparts. They are reserving additional space rather than absolute space (e.g. std::vector::reserve takes an absolute length as argument)
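To make the distinction concrete, here is a small sketch contrasting the two conventions on a std::vector. ReserveAdditional is a hypothetical helper that mimics the additive semantics described in this comment; it is not part of the STL or of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Additive reservation: ensure room for `additional` more elements beyond
// those already stored. std::vector::reserve itself takes an absolute count.
template <typename T>
void ReserveAdditional(std::vector<T>& v, std::size_t additional) {
  v.reserve(v.size() + additional);
}

// With 100 elements stored, v.reserve(50) requests a total of 50 and is a
// no-op, while ReserveAdditional(v, 50) requests room for 150 in total.
inline std::size_t CapacityAfterAdditive(std::size_t stored, std::size_t additional) {
  std::vector<int> v(stored);
  ReserveAdditional(v, additional);
  return v.capacity();
}
```

Naming the argument after what it counts (for example, additional elements or bytes) helps signal which convention a method follows.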
Can we also call the argument to ReserveData size (or elements) instead of capacity (to avoid confusion about whether we are passing an incremental value vs. absolute)?
Thanks, this is looking pretty close, just a couple of minor comments
cpp/src/arrow/builder.h
Outdated
@@ -682,10 +682,15 @@ class ARROW_EXPORT BinaryBuilder : public ArrayBuilder {

   Status Init(int64_t elements) override;
   Status Resize(int64_t capacity) override;
+  /// Ensures there is enough space for adding the number of value elements
+  /// by checking value buffer capacity and resizing if necessary.
Can you add \brief to the method description? Can we rephrase this to "Ensures there is enough allocated capacity to append the indicated number of bytes to the value data buffer without additional allocations"
cpp/src/arrow/array-test.cc
Outdated
    ASSERT_EQ(length, builder_->value_data_length());
    ASSERT_EQ(BitUtil::RoundUpToMultipleOf64(capacity), builder_->value_data_capacity());
  }
}
Can you add another call to ReserveData here, like builder_->ReserveData(500), to show that the input argument is an incremental amount rather than an absolute amount?
Syncing from original
…nd change arguments for offsets_builder_.Resize()
+1 -- thank you. I will merge once the style issues are fixed and the build is passing. There are a number of linting problems, see https://github.com/apache/arrow/tree/master/cpp#continuous-integration
In the future, it is better to make pull requests from a branch on your fork, so that you can work on multiple changes at once. Here are some links from the pandas project about this
@kou do you know what is causing this error?
cpp/src/arrow/array-test.cc
Outdated
 TEST_F(TestBinaryBuilder, TestCapacityReserve) {
-  vector<string> strings = {"aaaaa", "bbbbbbbbbb", "ccccccccccccccc", "dddddddddddddddddddd", "eeeeeeeeee"};
+  vector<string> strings = {"aaaaa", "bbbbbbbbbb", "ccccccccccccccc", "dddddddddd"};
In the future, you can run make format (which uses clang-format) to fix these long lines without having to make code changes
@wesm I think that it's a network problem: https://travis-ci.org/apache/arrow/jobs/332472808#L552
We will be able to reduce the case by changing the […]
for i in {1..3}; do
  sudo -E apt-get -yq update &>> ~/apt-get-update.log && break
done
@kou I see. These commands are initiated by Travis CI from https://github.com/apache/arrow/blob/master/.travis.yml#L21. We could install our package toolchain outside of Travis CI's built-in commands, if that might help improve the flakiness
Modified BinaryBuilder::Resize(int64_t) so that when building BinaryArrays with a known size, space is also reserved for value_data_builder_ to prevent internal reallocation.