
greedily combine chunks before compressing #783

Merged: 11 commits merged into develop from dk/rechunk-before-compressing on Sep 11, 2024

Conversation

@danking (Member) commented on Sep 10, 2024:

feat: before compressing, collapse chunks of a chunked array targeting chunks of 16 MiB or 64Ki rows.
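Roughly, the approach: walk the chunks in order, accumulate consecutive chunks into a group until the running byte or row count would exceed the targets, flush the group as a single chunk, and pass already-over-target chunks through on their own. A minimal sketch of that greedy loop (illustrative only, not the PR's literal code; chunk.nbytes() and the exact arguments of try_canonicalize_chunks are assumptions here):

    // Illustrative sketch of greedy rechunking before compression.
    pub fn rechunk(&self, target_bytesize: usize, target_rowsize: usize) -> VortexResult<Self> {
        let mut new_chunks: Vec<Array> = Vec::new();
        let mut group = Vec::new();
        let (mut group_bytes, mut group_rows) = (0usize, 0usize);

        for chunk in self.chunks() {
            let (bytes, rows) = (chunk.nbytes(), chunk.len());
            let over_target = bytes > target_bytesize || rows > target_rowsize;
            let group_full =
                group_bytes + bytes > target_bytesize || group_rows + rows > target_rowsize;

            if (over_target || group_full) && !group.is_empty() {
                // Flush the accumulated group as a single chunk.
                new_chunks.push(try_canonicalize_chunks(&group, self.dtype())?.into());
                group.clear();
                group_bytes = 0;
                group_rows = 0;
            }

            if over_target {
                // A chunk that already exceeds the targets passes through on its own.
                new_chunks.push(chunk);
            } else {
                group_bytes += bytes;
                group_rows += rows;
                group.push(chunk);
            }
        }

        if !group.is_empty() {
            new_chunks.push(try_canonicalize_chunks(&group, self.dtype())?.into());
        }
        Self::try_new(new_chunks, self.dtype().clone())
    }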

@danking force-pushed the dk/rechunk-before-compressing branch from cf080c4 to baa8830 on September 10, 2024 at 19:45
@danking force-pushed the dk/rechunk-before-compressing branch from 9cd9142 to 5bddf67 on September 10, 2024 at 20:28
@danking (Member, Author) commented on Sep 10, 2024:

With 256 chunks of 256 values each, there's a notable difference between the compressed output of this PR, 8.84 MB, and the compressed output of develop, 9.12 MB: roughly a 3% reduction. The biggest difference is on dictionary-encoded things like varbin_col. This particular test is so dominated by incompressible binary data that the ~12 kB savings on varbin_col isn't prominent.

This PR:

  "prim_col": vortex.chunked(0x0b)(i64, len=65536) nbytes=67.59 kB (0.74%)
  "bool_col": vortex.chunked(0x0b)(bool, len=65536) nbytes=10.25 kB (0.11%)
  "varbin_col": vortex.chunked(0x0b)(utf8, len=65536) nbytes=28.94 kB (0.32%)
  "binary_col": vortex.chunked(0x0b)(binary, len=65536) nbytes=8.95 MB (98.09%)
  "timestamp_col": vortex.chunked(0x0b)(ext(vortex.timestamp, ExtMetadata([2, 0, 0])), len=65536) nbytes=67.59 kB (0.74%)

develop:

  "prim_col": vortex.chunked(0x0b)(i64, len=65536) nbytes=65.55 kB (0.74%)
  "bool_col": vortex.chunked(0x0b)(bool, len=65536) nbytes=8.21 kB (0.09%)
  "varbin_col": vortex.chunked(0x0b)(utf8, len=65536) nbytes=16.44 kB (0.19%)
  "binary_col": vortex.chunked(0x0b)(binary, len=65536) nbytes=8.68 MB (98.24%)
  "timestamp_col": vortex.chunked(0x0b)(ext(vortex.timestamp, ExtMetadata([2, 0, 0])), len=65536) nbytes=65.55 kB (0.74%)

@robert3005 (Member) left a comment:

Ok, I think the approach is fine for now (we will have to change it later). I made some code style comments.

Comment on lines 142 to 149
let validity = validities.clone().into_iter().collect::<Validity>();
match (dtype.is_nullable(), validity) {
(true, validity) => Ok(validity),
(false, Validity::AllValid) => Ok(Validity::NonNullable),
(false, _) => vortex_bail!(
"for non-nullable dtype, child validities ought to all be AllValid"
),
}
robert3005 (Member):

Suggested change:
if self.dtype().is_nullable() {
validities_to_combine.iter().cloned().collect::<Validity>()
} else {
NonNullable
}

This can be simplified

danking (Member, Author):

I killed combine_validity entirely by using one of your earlier suggestions to go through ChunkedArray. Does that seem reasonable?
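For reference, a sketch of the ChunkedArray route (assuming ChunkedArray::try_new plus into_canonical; the PR's exact call sites may differ):

    // Hypothetical sketch: wrap the accumulated chunks in a ChunkedArray and
    // canonicalize it, so the combined validity falls out of canonicalization
    // instead of a hand-rolled combine_validities helper.
    let combined: Array = ChunkedArray::try_new(chunks_to_combine, dtype.clone())?
        .into_canonical()?
        .into();
    new_chunks.push(combined);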

}
}

let dtype = self.dtype();
robert3005 (Member):

We should inline this; this function is going to be inlined anyway, and having self in front adds extra context.

danking (Member, Author):

Done

Comment on lines 154 to 156
let mut new_chunks: Vec<Array> = Vec::new();
let mut validities_to_combine: Vec<LogicalValidity> = Vec::new();
let mut chunks_to_combine: Vec<Array> = Vec::new();
robert3005 (Member):

you don't need type annotations here

danking (Member, Author):

Done.

pub fn rechunk(&self, target_bytesize: usize, target_rowsize: usize) -> VortexResult<Self> {
fn combine_validities(
dtype: &DType,
validities: Vec<LogicalValidity>,
robert3005 (Member):

I think you can make this simpler by taking &[LogicalValidity] and cloning
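Something along these lines (sketch of the suggested signature only; the return type is inferred from the match arms, and the helper was ultimately removed, per the reply below):

    // Borrow the validities instead of taking ownership, cloning only where needed.
    fn combine_validities(dtype: &DType, validities: &[LogicalValidity]) -> VortexResult<Validity> {
        let validity = validities.iter().cloned().collect::<Validity>();
        match (dtype.is_nullable(), validity) {
            (true, validity) => Ok(validity),
            (false, Validity::AllValid) => Ok(Validity::NonNullable),
            (false, _) => vortex_bail!(
                "for non-nullable dtype, child validities ought to all be AllValid"
            ),
        }
    }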

danking (Member, Author):

See comment below about removing this function.

&& !chunks_to_combine.is_empty()
{
let canonical = try_canonicalize_chunks(
chunks_to_combine,
robert3005 (Member):

This function should not require a Vec but instead &[Array]; it never uses the ownership for anything.
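That is, a signature roughly like the following (sketch only; the other parameters, return type, and body are illustrative guesses and are elided here):

    // Borrow the chunks rather than take ownership of a Vec; callers would then
    // pass &chunks_to_combine instead of moving the vector.
    fn try_canonicalize_chunks(chunks: &[Array] /* , ... */) -> VortexResult<Canonical> {
        todo!()
    }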

}

if n_bytes > target_bytesize || n_elements > target_rowsize {
new_chunks.push(chunk.into_canonical()?.into()); // TODO(dk): rechunking maybe shouldn't produce canonical chunks
robert3005 (Member):

we should just push the chunk without canonicalizing here
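i.e. the over-target branch would become something like:

    // Pass the over-target chunk through as-is; the compressor then sees its
    // original encoding instead of a canonicalized copy.
    new_chunks.push(chunk);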

robert3005 (Member):

In a follow-up we'll change the logic to not canonicalize the chunks, i.e. stop using try_canonicalize_chunks.

danking (Member, Author):

Done.

Self {
// Sample length should always be multiple of 1024
sample_size: 128,
sample_count: 8,
max_depth: 3,
target_chunk_bytesize: 16 * mib,
robert3005 (Member):

This used to be called block, so let's keep it as target_block_bytesize.

danking (Member, Author):

Done.

Self {
// Sample length should always be multiple of 1024
sample_size: 128,
sample_count: 8,
max_depth: 3,
target_chunk_bytesize: 16 * mib,
target_chunk_rowsize: 64 * kib,
robert3005 (Member):

I would make this (target_)?block_size (as it was named before).

danking (Member, Author):

and Done.

Comment on lines +1 to +19
#[macro_export]
macro_rules! assert_arrays_eq {
($expected:expr, $actual:expr) => {
let expected: Array = $expected.into();
let actual: Array = $actual.into();
assert_eq!(expected.dtype(), actual.dtype());

let expected_contents = (0..expected.len())
.map(|idx| scalar_at(&expected, idx).map(|x| x.into_value()))
.collect::<VortexResult<Vec<_>>>()
.unwrap();
let actual_contents = (0..actual.len())
.map(|idx| scalar_at(&actual, idx).map(|x| x.into_value()))
.collect::<VortexResult<Vec<_>>>()
.unwrap();

assert_eq!(expected_contents, actual_contents);
};
}
robert3005 (Member):

bleh, we need an equals for array

danking (Member, Author):

that'd be fantastic!
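Until such a helper exists, a rough sketch of what it might look like (hypothetical function built on scalar_at; not an API the crate currently provides):

    // Hypothetical element-wise equality check; mirrors what assert_arrays_eq does.
    fn arrays_equal(expected: &Array, actual: &Array) -> VortexResult<bool> {
        if expected.dtype() != actual.dtype() || expected.len() != actual.len() {
            return Ok(false);
        }
        for idx in 0..expected.len() {
            if scalar_at(expected, idx)?.into_value() != scalar_at(actual, idx)?.into_value() {
                return Ok(false);
            }
        }
        Ok(true)
    }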

@danking merged commit 2e81835 into develop on Sep 11, 2024
4 checks passed
@danking deleted the dk/rechunk-before-compressing branch on September 11, 2024 at 16:27