
greedily combine chunks before compressing #783

Merged: 11 commits merged into develop from dk/rechunk-before-compressing on Sep 11, 2024

Conversation

@danking (Member) commented on Sep 10, 2024:

feat: before compressing, collapse chunks of a chunked array targeting chunks of 16 MiB or 64Ki rows.
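Roughly, the approach: walk the chunks in order, accumulate consecutive chunks into a group until the running byte or row count would exceed the targets, flush the group as a single chunk, and pass already-over-target chunks through on their own. A minimal sketch of that greedy loop (illustrative only, not the PR's literal code; chunk.nbytes() and the exact arguments of try_canonicalize_chunks are assumptions here):

    // Illustrative sketch of greedy rechunking before compression.
    pub fn rechunk(&self, target_bytesize: usize, target_rowsize: usize) -> VortexResult<Self> {
        let mut new_chunks: Vec<Array> = Vec::new();
        let mut group = Vec::new();
        let (mut group_bytes, mut group_rows) = (0usize, 0usize);

        for chunk in self.chunks() {
            let (bytes, rows) = (chunk.nbytes(), chunk.len());
            let over_target = bytes > target_bytesize || rows > target_rowsize;
            let group_full =
                group_bytes + bytes > target_bytesize || group_rows + rows > target_rowsize;

            if (over_target || group_full) && !group.is_empty() {
                // Flush the accumulated group as a single chunk.
                new_chunks.push(try_canonicalize_chunks(&group, self.dtype())?.into());
                group.clear();
                group_bytes = 0;
                group_rows = 0;
            }

            if over_target {
                // A chunk that already exceeds the targets passes through on its own.
                new_chunks.push(chunk);
            } else {
                group_bytes += bytes;
                group_rows += rows;
                group.push(chunk);
            }
        }

        if !group.is_empty() {
            new_chunks.push(try_canonicalize_chunks(&group, self.dtype())?.into());
        }
        Self::try_new(new_chunks, self.dtype().clone())
    }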

@danking force-pushed the dk/rechunk-before-compressing branch from cf080c4 to baa8830 on September 10, 2024 at 19:45
@danking force-pushed the dk/rechunk-before-compressing branch from 9cd9142 to 5bddf67 on September 10, 2024 at 20:28
@danking (Member, Author) commented on Sep 10, 2024:

With 256 chunks of 256 values each, there's a notable difference between the compressed output of this PR, 8.84 MB, and the compressed output of develop, 9.12 MB: roughly a 3% reduction. The biggest difference is on dictionary-encoded things like varbin_col. This particular test is so dominated by incompressible binary data that the ~12 kB savings on varbin_col isn't prominent.

This PR:

  "prim_col": vortex.chunked(0x0b)(i64, len=65536) nbytes=67.59 kB (0.74%)
  "bool_col": vortex.chunked(0x0b)(bool, len=65536) nbytes=10.25 kB (0.11%)
  "varbin_col": vortex.chunked(0x0b)(utf8, len=65536) nbytes=28.94 kB (0.32%)
  "binary_col": vortex.chunked(0x0b)(binary, len=65536) nbytes=8.95 MB (98.09%)
  "timestamp_col": vortex.chunked(0x0b)(ext(vortex.timestamp, ExtMetadata([2, 0, 0])), len=65536) nbytes=67.59 kB (0.74%)

develop:

  "prim_col": vortex.chunked(0x0b)(i64, len=65536) nbytes=65.55 kB (0.74%)
  "bool_col": vortex.chunked(0x0b)(bool, len=65536) nbytes=8.21 kB (0.09%)
  "varbin_col": vortex.chunked(0x0b)(utf8, len=65536) nbytes=16.44 kB (0.19%)
  "binary_col": vortex.chunked(0x0b)(binary, len=65536) nbytes=8.68 MB (98.24%)
  "timestamp_col": vortex.chunked(0x0b)(ext(vortex.timestamp, ExtMetadata([2, 0, 0])), len=65536) nbytes=65.55 kB (0.74%)

@robert3005 (Member) left a comment:

Ok, I think the approach is fine for now (we will have to change it later). I made some code style comments.

Comment on lines 142 to 149
let validity = validities.clone().into_iter().collect::<Validity>();
match (dtype.is_nullable(), validity) {
(true, validity) => Ok(validity),
(false, Validity::AllValid) => Ok(Validity::NonNullable),
(false, _) => vortex_bail!(
"for non-nullable dtype, child validities ought to all be AllValid"
),
}
robert3005 (Member):

Suggested change:
if self.dtype().is_nullable() {
validities_to_combine.iter().cloned().collect::<Validity>()
} else {
NonNullable
}

This can be simplified

danking (Member, Author):

I killed combine_validity entirely by using one of your earlier suggestions to go through ChunkedArray. Does that seem reasonable?
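For reference, a sketch of the ChunkedArray route (assuming ChunkedArray::try_new plus into_canonical; the PR's exact call sites may differ):

    // Hypothetical sketch: wrap the accumulated chunks in a ChunkedArray and
    // canonicalize it, so the combined validity falls out of canonicalization
    // instead of a hand-rolled combine_validities helper.
    let combined: Array = ChunkedArray::try_new(chunks_to_combine, dtype.clone())?
        .into_canonical()?
        .into();
    new_chunks.push(combined);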

}
}

let dtype = self.dtype();
robert3005 (Member):

We should inline this; this function is going to be inlined anyway, and having self in front adds extra context.

danking (Member, Author):

Done

Comment on lines 154 to 156
let mut new_chunks: Vec<Array> = Vec::new();
let mut validities_to_combine: Vec<LogicalValidity> = Vec::new();
let mut chunks_to_combine: Vec<Array> = Vec::new();
robert3005 (Member):

you don't need type annotations here

danking (Member, Author):

Done.

pub fn rechunk(&self, target_bytesize: usize, target_rowsize: usize) -> VortexResult<Self> {
fn combine_validities(
dtype: &DType,
validities: Vec<LogicalValidity>,
robert3005 (Member):

I think you can make this simpler by taking &[LogicalValidity] and cloning
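Something along these lines (sketch of the suggested signature only; the return type is inferred from the match arms, and the helper was ultimately removed, per the reply below):

    // Borrow the validities instead of taking ownership, cloning only where needed.
    fn combine_validities(dtype: &DType, validities: &[LogicalValidity]) -> VortexResult<Validity> {
        let validity = validities.iter().cloned().collect::<Validity>();
        match (dtype.is_nullable(), validity) {
            (true, validity) => Ok(validity),
            (false, Validity::AllValid) => Ok(Validity::NonNullable),
            (false, _) => vortex_bail!(
                "for non-nullable dtype, child validities ought to all be AllValid"
            ),
        }
    }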

danking (Member, Author):

See comment below about removing this function.

&& !chunks_to_combine.is_empty()
{
let canonical = try_canonicalize_chunks(
chunks_to_combine,
robert3005 (Member):

This function should not require a Vec but instead &[Array]; it never uses the ownership for anything.
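That is, a signature roughly like the following (sketch only; the other parameters, return type, and body are illustrative guesses and are elided here):

    // Borrow the chunks rather than take ownership of a Vec; callers would then
    // pass &chunks_to_combine instead of moving the vector.
    fn try_canonicalize_chunks(chunks: &[Array] /* , ... */) -> VortexResult<Canonical> {
        todo!()
    }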

}

if n_bytes > target_bytesize || n_elements > target_rowsize {
new_chunks.push(chunk.into_canonical()?.into()); // TODO(dk): rechunking maybe shouldn't produce canonical chunks
robert3005 (Member):

we should just push the chunk without canonicalizing here
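i.e. the over-target branch would become something like:

    // Pass the over-target chunk through as-is; the compressor then sees its
    // original encoding instead of a canonicalized copy.
    new_chunks.push(chunk);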

robert3005 (Member):

In a follow-up we'll change the logic to not canonicalize the chunks, i.e. stop using try_canonicalize_chunks.

danking (Member, Author):

Done.

Self {
// Sample length should always be multiple of 1024
sample_size: 128,
sample_count: 8,
max_depth: 3,
target_chunk_bytesize: 16 * mib,
robert3005 (Member):

This used to be called block, so let's keep it as target_block_bytesize.

danking (Member, Author):

Done.

Self {
// Sample length should always be multiple of 1024
sample_size: 128,
sample_count: 8,
max_depth: 3,
target_chunk_bytesize: 16 * mib,
target_chunk_rowsize: 64 * kib,
robert3005 (Member):

I would make this (target_)?block_size (as it was named before).

danking (Member, Author):

and Done.

Comment on lines +1 to +19
#[macro_export]
macro_rules! assert_arrays_eq {
($expected:expr, $actual:expr) => {
let expected: Array = $expected.into();
let actual: Array = $actual.into();
assert_eq!(expected.dtype(), actual.dtype());

let expected_contents = (0..expected.len())
.map(|idx| scalar_at(&expected, idx).map(|x| x.into_value()))
.collect::<VortexResult<Vec<_>>>()
.unwrap();
let actual_contents = (0..actual.len())
.map(|idx| scalar_at(&actual, idx).map(|x| x.into_value()))
.collect::<VortexResult<Vec<_>>>()
.unwrap();

assert_eq!(expected_contents, actual_contents);
};
}
robert3005 (Member):

bleh, we need an equals for array

danking (Member, Author):

that'd be fantastic!
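Until such a helper exists, a rough sketch of what it might look like (hypothetical function built on scalar_at; not an API the crate currently provides):

    // Hypothetical element-wise equality check; mirrors what assert_arrays_eq does.
    fn arrays_equal(expected: &Array, actual: &Array) -> VortexResult<bool> {
        if expected.dtype() != actual.dtype() || expected.len() != actual.len() {
            return Ok(false);
        }
        for idx in 0..expected.len() {
            if scalar_at(expected, idx)?.into_value() != scalar_at(actual, idx)?.into_value() {
                return Ok(false);
            }
        }
        Ok(true)
    }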

@danking merged commit 2e81835 into develop on Sep 11, 2024
4 checks passed
@danking deleted the dk/rechunk-before-compressing branch on September 11, 2024 at 16:27