-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GenericByteViewArray::gc
for compacting StringViewArray and ByteViewArray
#5707
base: main
Are you sure you want to change the base?
Conversation
part 1: implement checker to check if current buffer is compact Signed-off-by: 蔡略 <[email protected]>
part2: implement actual gc algorithm Signed-off-by: 蔡略 <[email protected]>
Signed-off-by: 蔡略 <[email protected]>
04efa78
to
c45acb8
Compare
The naming of making
renaming |
Thank you -- looking forward to this one. |
/// A helper struct that used to check if the array is compact view | ||
/// | ||
/// The checker is lazy and will not check the array until `finish` is called. | ||
/// | ||
/// This is based on the assumption that the array will most likely to be not compact, | ||
/// so it is likely to scan the entire array. | ||
/// | ||
/// Then it is better to do the check at once, rather than doing it for each accumulate operation. | ||
struct CompactChecker { | ||
length: usize, | ||
intervals: BTreeMap<usize, usize>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a better algorithm for this.
It's the most straightforward and simplest way I could come up with, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we pull the "compact checking" algorithm into a new PR to discuss it -- I am not sure about the assumption that StringViewArrays will mostly often be compacted (I would actually expect the opposite)
7e47a1d
to
73ab8a3
Compare
Signed-off-by: 蔡略 <[email protected]>
73ab8a3
to
6d2a453
Compare
} | ||
|
||
let mut new_views = Vec::with_capacity(self.views.len()); | ||
let mut new_bufs: Vec<Vec<u8>> = vec![vec![]; self.buffers.len()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of buffers may shrink after gc
. Every buffer should be filled up to block_size
.
See GenericByteViewBuilder::append_value
.
GenericByteViewArray::gc
for compacting StringViewArray and ByteViewArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contributon @ClSlaid and for the review @RinChanNOWWW
My biggest suggestion is to split the code in this PR into two PRs -- one for the "compact_check" and one for the actual compaction.
I had some suggestions on some additional coverage needed for the gc, but I think it is looking quite close.
Thanks again!
/// A helper struct that used to check if the array is compact view | ||
/// | ||
/// The checker is lazy and will not check the array until `finish` is called. | ||
/// | ||
/// This is based on the assumption that the array will most likely to be not compact, | ||
/// so it is likely to scan the entire array. | ||
/// | ||
/// Then it is better to do the check at once, rather than doing it for each accumulate operation. | ||
struct CompactChecker { | ||
length: usize, | ||
intervals: BTreeMap<usize, usize>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we pull the "compact checking" algorithm into a new PR to discuss it -- I am not sure about the assumption that StringViewArrays will mostly often be compacted (I would actually expect the opposite)
let mut view = ByteView::from(array.views[1]); | ||
view.length = 15; | ||
let new_views = ScalarBuffer::from(vec![array.views[0], view.into()]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dn't undertand what this is doing
I think you can more easily create a stringview with multiple buffers like this:
arrow-rs/arrow-cast/src/pretty.rs
Lines 334 to 341 in 905c46b
// Use a small capacity so we end up with multiple views | |
let mut builder = StringViewBuilder::with_capacity(20); | |
builder.append_value("hello"); | |
builder.append_null(); | |
builder.append_value("longer than 12 bytes"); | |
builder.append_value("another than 12 bytes"); | |
builder.append_null(); | |
builder.append_value("small"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the rest of the code, so I made it out brutally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that StringViewArrays will mostly often be compacted (I would actually expect the opposite)
We assume the same idea, it's likely to be not compacted.
Thanks for your review @alamb, and I'd left this PR to a draft and reopen two PRs for this. |
Which issue does this PR close?
Closes #5513.This PR is being splited into:
Rationale for this change
What changes are included in this PR?
Adding a new
gc
method forStringViewArray
andByteViewArray
.Are there any user-facing changes?
None.