Fix logic in to_arrow for empty list column #16279
An empty list column need not have empty children; it just needs to have zero length. In this case, the offsets array will have zero length, and we need to create a temporary buffer. Now that this branch runs, fix two errors in the construction of the arrow array:
1. The element type, if there are children, should be taken from the child array.
2. If the child arrays are empty, we must make an empty null array rather than passing a null pointer as the values array; otherwise we hit a segfault inside arrow.
LGTM.
Co-authored-by: David Wendt <[email protected]>
/merge
cpp/src/interop/to_arrow.cu
Outdated
// Empty list will have only one value in offset of 4 bytes
auto tmp_offset_buffer = allocate_arrow_buffer(sizeof(int32_t), ar_mr);
memset(tmp_offset_buffer->mutable_data(), 0, sizeof(int32_t));

std::shared_ptr<arrow::Array> data =
  child_arrays.empty() ? std::make_shared<arrow::NullArray>(0) : child_arrays[1];
Shouldn't the type of the empty child array be preserved here, i.e. with arrow::MakeEmptyArray, or does that not matter?
aiui if the child_arrays are empty, then the list column had no children, so there's no information from libcudf on what the element type is. But perhaps @davidwendt to confirm.
Technically an empty list column would normally have a single empty child
cudf/cpp/src/lists/lists_column_factories.cu
Lines 86 to 94 in 04330f2
std::unique_ptr<column> make_empty_lists_column(data_type child_type,
                                                rmm::cuda_stream_view stream,
                                                rmm::device_async_resource_ref mr)
{
  auto offsets = make_empty_column(data_type(type_to_id<size_type>()));
  auto child   = make_empty_column(child_type);
  return make_lists_column(
    0, std::move(offsets), std::move(child), 0, rmm::device_buffer{}, stream, mr);
}
But it may also have no child at all.
I don't know if Arrow specifically cares but would bias towards what the callers of this API need.
Shouldn't the type of the empty child array be preserved here i.e. arrow::MakeEmptyArray or does that not matter?
To follow up, we are preserving the child array if child_arrays is not empty() (we use child_arrays[1], which will have the appropriate type). If child_arrays is empty() then there is no information available.
But I went for the best of both worlds by using MakeEmptyArray rather than hand-constructing the offset buffer.
Description
An empty list column need not have empty children; it just needs to have zero length. In this case, the offsets array will have zero length, and we need to create a temporary buffer.
Now that this branch runs, fix two errors in the construction of the arrow array:
1. The element type, if there are children, should be taken from the child array.
2. If the child arrays are empty, we must make an empty null array rather than passing a null pointer as the values array; otherwise we hit a segfault inside arrow.
The previous fix in #16201 correctly handled the empty children case (except for point two), but not the first case, which we do here.
Since we weren't previously going down this code path (child_arrays was never empty), we never hit the latent segfault from point two.
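For readers following along, here is a minimal sketch of the invariant behind the two points, written with stand-in types rather than the real libcudf or Arrow classes. ListColumnModel, ArrowListModel, to_arrow_model, and num_rows are hypothetical names used only for illustration; this is not the code in to_arrow.cu. It assumes the usual list layout in which a list array with N rows carries N + 1 offsets, so a zero-row list still needs a single zero offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a list column on the libcudf side.
struct ListColumnModel {
  std::vector<std::int32_t> offsets;  // may be empty when the column has zero rows
  bool has_child;                     // false when the list column has no child at all
  std::string child_type;             // element type name, meaningful only if has_child
};

// Hypothetical stand-in for the resulting Arrow list array.
struct ArrowListModel {
  std::vector<std::int32_t> offsets;  // always num_rows + 1 entries
  std::string value_type;             // "null" when no element type is known
};

ArrowListModel to_arrow_model(ListColumnModel const& col)
{
  ArrowListModel out;
  // A zero-row list still needs one offset (a single 0), so supply it
  // when the source offsets are empty.
  out.offsets = col.offsets.empty() ? std::vector<std::int32_t>{0} : col.offsets;
  // Point 1: take the element type from the child when there is one.
  // Point 2: otherwise fall back to an empty "null" values array instead of
  // leaving the values unset.
  out.value_type = col.has_child ? col.child_type : std::string("null");
  return out;
}

// Number of rows implied by the offsets.
std::size_t num_rows(ArrowListModel const& a)
{
  assert(!a.offsets.empty());
  return a.offsets.size() - 1;
}
```

The fallback to a "null" element type mirrors the discussion above: when there is no child, there is no type information to preserve.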
Checklist