Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix logic in to_arrow for empty list column #16279

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jul 15, 2024

Description

An empty list column need not have empty children, it just needs to have zero length. In this case, the offsets array will have zero length, and we need to create a temporary buffer.

Now that this branch runs, fix two errors in the construction of the arrow array:

  1. The element type, if there are children, should be taken from the child array;
  2. If the child arrays are empty, we must make an empty null array, rather than passing a null pointer as the values array, otherwise we hit a segfault inside arrow.

The previous fix in #16201 correctly handled the empty children case (except for point two), but not the first case, which we do here.

Since we we're previously going down this code path (child_arrays was never empty), we never hit the latent segfault from point two.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

An empty list column need not have empty children, it just needs to
have zero length. In this case, the offsets array will have zero
length, and we need to create a temporary buffer.

Now that this branch runs, fix two errors in the construction of the
arrow array:

1. The element type, if there are children, should be taken from the
   child array;
2. If the child arrays are empty, we must make an empty null array,
   rather than passing a null pointer as the values array, otherwise
   we hit a segfault inside arrow.
@wence- wence- requested a review from a team as a code owner July 15, 2024 14:31
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 15, 2024
@wence- wence- added bug Something isn't working non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 15, 2024
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Co-authored-by: David Wendt <[email protected]>
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 15, 2024
cpp/src/interop/to_arrow.cu Outdated Show resolved Hide resolved
cpp/src/interop/to_arrow.cu Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor Author

wence- commented Jul 15, 2024

/merge

// Empty list will have only one value in offset of 4 bytes
auto tmp_offset_buffer = allocate_arrow_buffer(sizeof(int32_t), ar_mr);
memset(tmp_offset_buffer->mutable_data(), 0, sizeof(int32_t));

std::shared_ptr<arrow::Array> data =
child_arrays.empty() ? std::make_shared<arrow::NullArray>(0) : child_arrays[1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the type of the empty child array be preserved here i.e. arrow::MakeEmptyArray or does that not matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aiui if the child_arrays are empty, then the list column had no children, so there's no information from libcudf on what the element type is. But perhaps @davidwendt to confirm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically an empty list column would normally have a single empty child

std::unique_ptr<column> make_empty_lists_column(data_type child_type,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
{
auto offsets = make_empty_column(data_type(type_to_id<size_type>()));
auto child = make_empty_column(child_type);
return make_lists_column(
0, std::move(offsets), std::move(child), 0, rmm::device_buffer{}, stream, mr);
}

But it may also not have any child too.
I don't know if Arrow specifically cares but would bias towards what the callers of this API need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the type of the empty child array be preserved here i.e. arrow::MakeEmptyArray or does that not matter?

To follow up, we are preserving the child array if child_arrays is not empty() (we use child_arrays[1] which will have the appropriate type. If child_arrays is empty() then there is no information available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, I went for best of both worlds by using MakeEmptyArray rather than hand-constructing the offset buffer.

@rapids-bot rapids-bot bot merged commit 669db3e into rapidsai:branch-24.08 Jul 16, 2024
78 checks passed
@wence- wence- deleted the wence/fix/to-arrow-empty-list-redux branch July 16, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants