Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple GC for view array types #5885

Merged
merged 10 commits into from
Jun 14, 2024
Merged

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Jun 13, 2024

Which issue does this PR close?

Closes #5513 .

Rationale for this change

This PR is largely based on excellent work from @ClSlaid in #5720

I simplified the GC process and the test cases, and now they are dead simple (but dumb)!

Note that this pr does not check whether the original array is already compact (and thus does not avoid unnecessary copies). Once we get this in, we can continue the compact check discussion as in #5720

What changes are included in this PR?

Are there any user-facing changes?

ClSlaid and others added 5 commits June 13, 2024 16:52
part 1: implement checker to check if current buffer is compact

Signed-off-by: 蔡略 <[email protected]>
part2: implement actual gc algorithm

Signed-off-by: 蔡略 <[email protected]>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 13, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao -- this looks very close to me. I think this PR needs a few more test cases and then it would be ok to merge

Bonus points if you added a benchmark for gc as well

arrow-array/src/array/byte_view_array.rs Outdated Show resolved Hide resolved
/// This method will compact the data buffers by recreating the view array and only include the data
/// that is pointed to by the views.
///
/// Note that it will copy the array regardless of whether the original array is compact.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

arrow-array/src/array/byte_view_array.rs Outdated Show resolved Hide resolved
let mut builder = GenericByteViewBuilder::<T>::with_capacity(self.len());

for v in self.iter() {
builder.append_option(v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable (and correct) first version

However, it could likely be made faster by special casing inline views -- the code here is going to be doing several checks on length only to copy the bytes back into a u128

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how the checks can be avoided; even if the view is inlined, we need to change the offset and the block id, so we need to copy that u128 from the original view buffer to the new buffer anyway.

Copy link
Contributor

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I was imagining that special casing copying inlined views could potentially be faster than converting to &str and then checking

But I don't think we should make any changes unless we have benchmarks supporting that hypothesis

@XiangpengHao
Copy link
Contributor Author

I overhauled the test case, and it should be more straightforward now. Also added a simple benchmark -- in case we want to make a more performant version :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @XiangpengHao -- I had two minor suggestions but I think they could both be done as follow on PRs as well

}))
}

fn criterion_benchmark(c: &mut Criterion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be interesting to have a "array with no nulls" benchmark as I could imagine special casing that case

arrow-array/src/array/byte_view_array.rs Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thanks again @XiangpengHao

@alamb alamb merged commit 956fe76 into apache:master Jun 14, 2024
27 checks passed
@XiangpengHao XiangpengHao deleted the view-array-gc branch June 14, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add gcgarbage collector support for StringViewArray and BinaryViewArray
3 participants