Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve record batch partitioning #1396

Merged
merged 1 commit into from
May 28, 2023

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented May 27, 2023

Description

Some time ago, the arrow-row crate was published, specifically with use cases like multi-column ordering / sorting in mind. We need to do this whenever we write data to a table to partition batches according to their partition values. In this PR we adopt this. As a drive-by I also updated the imports in the module to import from the respective sub-crates.

Related Issue(s)

Documentation

@@ -410,6 +403,18 @@ pub(crate) fn divide_by_partition_values(
Ok(partitions)
}

fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following why this function is being used here instead of the previous import from arrow::compute as far as I can tell the only difference is this version doesn't take the optional limit parameter?

😕

Copy link
Collaborator Author

@roeap roeap May 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -410,6 +403,18 @@ pub(crate) fn divide_by_partition_values(
Ok(partitions)
}

fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {
/*
* This version uses arrow-row rather then the "regular" compute kernels.
*/
fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {

@roeap roeap merged commit e70504a into delta-io:main May 28, 2023
@roeap roeap deleted the improve-partitioning branch May 28, 2023 05:05
roeap added a commit to roeap/delta-rs that referenced this pull request Jun 2, 2023
# Description

Some time ago, the arrow-row crate was published, specifically with use
cases like multi-column ordering / sorting in mind. We need to do this
whenever we write data to a table to partition batches according to
their partition values. In this PR we adopt this. As a drive-by I also
updated the imports in the module to import from the respective
sub-crates.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants