greedily combine chunks before compressing #783
Conversation
With 256 chunks of 256 values each, there's a notable difference between the compressed output of this PR (8.84 MB) and the compressed output of develop (9.12 MB): roughly a 3% reduction. The biggest difference is on dictionary-encoded columns like varbin_col. This particular test is so dominated by incompressible binary data that the ~12 kB savings on varbin_col isn't prominent.
This PR:
develop:
Ok, I think the approach is fine for now (we will have to change it later). I made some code style comments.
let validity = validities.clone().into_iter().collect::<Validity>();
match (dtype.is_nullable(), validity) {
    (true, validity) => Ok(validity),
    (false, Validity::AllValid) => Ok(Validity::NonNullable),
    (false, _) => vortex_bail!(
        "for non-nullable dtype, child validities ought to all be AllValid"
    ),
}
Suggested change:

let validity = validities.clone().into_iter().collect::<Validity>();
match (dtype.is_nullable(), validity) {
    (true, validity) => Ok(validity),
    (false, Validity::AllValid) => Ok(Validity::NonNullable),
    (false, _) => vortex_bail!(
        "for non-nullable dtype, child validities ought to all be AllValid"
    ),
}

becomes:

if self.dtype().is_nullable() {
    validities_to_combine.iter().cloned().collect::<Validity>()
} else {
    NonNullable
}
This can be simplified
I killed combine_validity entirely by using one of your earlier suggestions to go through ChunkedArray. Does that seem reasonable?
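A minimal sketch of what "going through ChunkedArray" could look like at the point where a group of pending chunks gets combined; ChunkedArray::try_new and the conversion steps are assumed from context here and may not match the crate's exact method names:

// Wrap the pending chunks in a ChunkedArray and canonicalize that, so the
// chunked kernel combines the per-chunk validities instead of a hand-rolled
// combine_validity. try_new and the conversions are assumptions, not the diff.
let combined: Array = ChunkedArray::try_new(chunks_to_combine.clone(), self.dtype().clone())?
    .into_canonical()?
    .into();
new_chunks.push(combined);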
}
}

let dtype = self.dtype();
We should inline this; this function is going to be inlined anyway, and having self in front adds extra context.
Done
let mut new_chunks: Vec<Array> = Vec::new();
let mut validities_to_combine: Vec<LogicalValidity> = Vec::new();
let mut chunks_to_combine: Vec<Array> = Vec::new();
you don't need type annotations here
Done.
pub fn rechunk(&self, target_bytesize: usize, target_rowsize: usize) -> VortexResult<Self> {
fn combine_validities(
    dtype: &DType,
    validities: Vec<LogicalValidity>,
I think you can make this simpler by taking &[LogicalValidity] and cloning.
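For illustration, the suggested signature change might look roughly like this (hypothetical, since the function was ultimately removed, as the reply below notes; the body is taken from the snippet quoted above):

fn combine_validities(dtype: &DType, validities: &[LogicalValidity]) -> VortexResult<Validity> {
    // Borrow the validities and clone only while collecting, instead of
    // taking the Vec by value.
    let validity = validities.iter().cloned().collect::<Validity>();
    match (dtype.is_nullable(), validity) {
        (true, validity) => Ok(validity),
        (false, Validity::AllValid) => Ok(Validity::NonNullable),
        (false, _) => vortex_bail!(
            "for non-nullable dtype, child validities ought to all be AllValid"
        ),
    }
}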
See comment below about removing this function.
    && !chunks_to_combine.is_empty()
{
    let canonical = try_canonicalize_chunks(
        chunks_to_combine,
This function should not require a Vec but instead &[Array]. It never uses the ownership for anything.
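A sketch of what the call site could become if the helper borrowed its chunks; the remaining arguments to try_canonicalize_chunks are elided here, and its full parameter list is assumed rather than taken from the diff:

// Borrow the pending chunks instead of moving them, so the Vec can be
// cleared and reused for the next group of chunks.
let canonical = try_canonicalize_chunks(&chunks_to_combine, /* ...other args... */)?;
new_chunks.push(canonical.into());
chunks_to_combine.clear();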
}

if n_bytes > target_bytesize || n_elements > target_rowsize {
    new_chunks.push(chunk.into_canonical()?.into()); // TODO(dk): rechunking maybe shouldn't produce canonical chunks
we should just push the chunk without canonicalizing here
In a follow-up we'll change the logic to not canonicalize the chunks, i.e. stop using try_canonicalize_chunks.
Done.
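For clarity, a sketch of the branch after applying this suggestion (names follow the surrounding diff; flushing any pending chunks_to_combine before the push is elided):

if n_bytes > target_bytesize || n_elements > target_rowsize {
    // An already-oversized chunk is passed through untouched rather than
    // being canonicalized first.
    new_chunks.push(chunk);
} else {
    chunks_to_combine.push(chunk);
}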
Self {
    // Sample length should always be multiple of 1024
    sample_size: 128,
    sample_count: 8,
    max_depth: 3,
    target_chunk_bytesize: 16 * mib,
This used to be called block, so let's keep it as target_block_bytesize.
Done.
Self {
    // Sample length should always be multiple of 1024
    sample_size: 128,
    sample_count: 8,
    max_depth: 3,
    target_chunk_bytesize: 16 * mib,
    target_chunk_rowsize: 64 * kib,
I would make this (target_)?block_size (as it was named before).
and Done.
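Putting both naming suggestions together, the defaults above would presumably end up reading like this (field names inferred from the review comments; values unchanged from the diff):

Self {
    // Sample length should always be multiple of 1024
    sample_size: 128,
    sample_count: 8,
    max_depth: 3,
    target_block_bytesize: 16 * mib,
    target_block_size: 64 * kib,
}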
#[macro_export]
macro_rules! assert_arrays_eq {
    ($expected:expr, $actual:expr) => {
        let expected: Array = $expected.into();
        let actual: Array = $actual.into();
        assert_eq!(expected.dtype(), actual.dtype());

        let expected_contents = (0..expected.len())
            .map(|idx| scalar_at(&expected, idx).map(|x| x.into_value()))
            .collect::<VortexResult<Vec<_>>>()
            .unwrap();
        let actual_contents = (0..actual.len())
            .map(|idx| scalar_at(&actual, idx).map(|x| x.into_value()))
            .collect::<VortexResult<Vec<_>>>()
            .unwrap();

        assert_eq!(expected_contents, actual_contents);
    };
}
bleh, we need an equals for array
that'd be fantastic!
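A rough sketch of the kind of equality helper being wished for here, assembled from the same pieces the macro uses (dtype, len, scalar_at); this is a hypothetical helper, not an existing API:

fn arrays_eq(expected: &Array, actual: &Array) -> VortexResult<bool> {
    // Cheap structural checks first.
    if expected.dtype() != actual.dtype() || expected.len() != actual.len() {
        return Ok(false);
    }
    // Fall back to element-wise comparison via scalar_at.
    for idx in 0..expected.len() {
        if scalar_at(expected, idx)?.into_value() != scalar_at(actual, idx)?.into_value() {
            return Ok(false);
        }
    }
    Ok(true)
}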
feat: before compressing, collapse chunks of a chunked array targeting chunks of 16 MiB or 64Ki rows.
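As a self-contained model of the policy in that summary (not the PR's actual code; it works on plain (bytes, rows) pairs rather than vortex arrays), greedy combining toward byte and row targets looks roughly like this:

/// Greedily group chunk sizes so that each group stays within the byte and
/// row targets; an already-oversized chunk becomes its own group. This models
/// the policy only and is not the PR's implementation.
fn plan_groups(
    chunks: &[(usize, usize)], // (n_bytes, n_rows) for each incoming chunk
    target_bytes: usize,
    target_rows: usize,
) -> Vec<Vec<usize>> {
    let mut groups = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let (mut bytes, mut rows) = (0usize, 0usize);
    for (i, &(b, r)) in chunks.iter().enumerate() {
        if b > target_bytes || r > target_rows {
            // Flush anything pending, then pass the big chunk through alone.
            if !current.is_empty() {
                groups.push(std::mem::take(&mut current));
            }
            bytes = 0;
            rows = 0;
            groups.push(vec![i]);
        } else {
            // Flush first if adding this chunk would overshoot the targets.
            if bytes + b > target_bytes || rows + r > target_rows {
                if !current.is_empty() {
                    groups.push(std::mem::take(&mut current));
                }
                bytes = 0;
                rows = 0;
            }
            current.push(i);
            bytes += b;
            rows += r;
        }
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}

fn main() {
    // 256 chunks of 256 rows at ~1 KiB each, as in the benchmark above: with
    // 16 MiB / 64Ki-row targets they all collapse into one chunk.
    let chunks = vec![(1024usize, 256usize); 256];
    let groups = plan_groups(&chunks, 16 << 20, 64 << 10);
    println!("{} chunks -> {} chunk(s)", chunks.len(), groups.len());
}

With the 256-chunks-of-256-values layout from the benchmark earlier in the thread, this collapses everything into a single chunk, which matches the setup described there.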