feat: move and update Optimize operation #1154

roeap · 2023-02-17T18:43:36Z

Description

This PR moves the optimize operation into the operations module. As part of this, we also make a few updates to the operation:

adopt IntoFuture pattern
use writer from operations module
replace SerializedFileReader with ParquetRecordBatchStream
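For readers unfamiliar with the IntoFuture pattern mentioned above, here is a minimal sketch of the idea using only the standard library. `OptimizeSketch`, `with_target_size`, `run_sketch`, and the tiny `block_on` helper are hypothetical names made up for illustration; they are not the delta-rs API, and the real builder will differ.

```rust
use std::future::{Future, IntoFuture};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical operation builder; a stand-in, not the delta-rs type.
pub struct OptimizeSketch {
    target_size: u64,
}

impl OptimizeSketch {
    pub fn new() -> Self {
        Self { target_size: 256 * 1024 * 1024 }
    }

    /// Builder-style setter: consumes and returns the builder.
    pub fn with_target_size(mut self, target_size: u64) -> Self {
        self.target_size = target_size;
        self
    }
}

/// Implementing `IntoFuture` lets callers `.await` the builder directly.
impl IntoFuture for OptimizeSketch {
    type Output = Result<u64, String>;
    type IntoFuture = Pin<Box<dyn Future<Output = Self::Output>>>;

    fn into_future(self) -> Self::IntoFuture {
        Box::pin(async move {
            if self.target_size == 0 {
                return Err("target size must be positive".to_string());
            }
            // A real operation would do its work here; the sketch just
            // echoes the configured value.
            Ok(self.target_size)
        })
    }
}

/// Configure the builder, then await it without an explicit `execute()` call.
pub async fn run_sketch() -> Result<u64, String> {
    OptimizeSketch::new().with_target_size(1024).await
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, only suitable for futures that never
/// actually wait (assumption: used here to avoid an async runtime dependency).
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

The appeal of this shape is that options are set with chained builder methods and the operation only runs once it is awaited.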

As part of this we also update datafusion and arrow, which in turn requires updating pyo3. That update requires migrating away from some deprecated features, e.g. how function signatures are annotated. It also leads to a breaking change in the Python functions, specifically the order of arguments in dataset_partitions.

There is one more thing I was considering while doing these updates. Essentially, we may be getting some undesired behaviour: the writer in the operations module already considers file size, so not all data in a bin might actually end up in the same file. There may also be a naïve yet much simpler way to do this now, by just passing all data in a partition through the partition writer. This should yield files of the desired size within the accuracy of the configured row-group size. It may of course lead to a rather small remainder file.
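To make the "simpler way" concrete, here is a rough sketch of how a size-aware writer might roll files when fed all row groups of a partition. `roll_files` is a hypothetical helper, and the rule it uses (close a file as soon as it reaches the target) is an assumption for illustration, not the behaviour of the actual partition writer.

```rust
/// Splits a stream of row-group sizes into output file sizes.
/// Hypothetical helper: a file is closed as soon as it reaches `target_size`,
/// so each full file overshoots the target by less than one row group.
pub fn roll_files(row_groups: &[u64], target_size: u64) -> Vec<u64> {
    let mut files = Vec::new();
    let mut current = 0u64;
    for &size in row_groups {
        current += size;
        if current >= target_size {
            files.push(current);
            current = 0;
        }
    }
    if current > 0 {
        // Whatever is left becomes the remainder file, which may be
        // much smaller than the target.
        files.push(current);
    }
    files
}
```

Under these assumptions, every file except the last lands within one row group of the target, and the last one holds whatever is left over.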

cc @Blajda

Related Issue(s)

part of #1136

Documentation

roeap · 2023-02-17T18:48:05Z

rust/src/writer/json.rs

+            return Err(DeltaWriterError::PartialParquetWrite {
+                sample_error: match &partial_writes[0].1 {
+                    ParquetError::General(msg) => ParquetError::General(msg.to_owned()),
+                    ParquetError::ArrowError(msg) => ParquetError::ArrowError(msg.to_owned()),
+                    ParquetError::EOF(msg) => ParquetError::EOF(msg.to_owned()),
+                    ParquetError::External(err) => ParquetError::General(err.to_string()),
+                    ParquetError::IndexOutOfBound(u, v) => {
+                        ParquetError::IndexOutOfBound(u.to_owned(), v.to_owned())
+                    }
+                    ParquetError::NYI(msg) => ParquetError::NYI(msg.to_owned()),
+                },
+                skipped_values: partial_writes,


I was not too happy doing this, but the parquet errors are no longer cloneable, so to_owned does not do much. My assumption was that we have to keep all the individual errors, since kafka-delta-ingest relies on them? @rtyler
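One way to keep this kind of match contained is to pull it into a small helper. The sketch below uses a stand-in enum, `SketchError`, rather than the real `parquet::errors::ParquetError`; the variant set and the `copy_error` name are assumptions chosen only to show the technique of copying a non-`Clone` error variant by variant.

```rust
use std::error::Error;

/// Stand-in for an error enum that cannot derive `Clone` because one variant
/// wraps a boxed trait object. Hypothetical; not the parquet crate's type.
#[derive(Debug)]
pub enum SketchError {
    General(String),
    Eof(String),
    IndexOutOfBound(usize, usize),
    External(Box<dyn Error + Send + Sync>),
}

/// Builds an owned copy of `err` one variant at a time.
pub fn copy_error(err: &SketchError) -> SketchError {
    match err {
        SketchError::General(msg) => SketchError::General(msg.clone()),
        SketchError::Eof(msg) => SketchError::Eof(msg.clone()),
        SketchError::IndexOutOfBound(u, v) => SketchError::IndexOutOfBound(*u, *v),
        // The boxed source cannot be duplicated, so only its message survives,
        // downgraded to the `General` variant.
        SketchError::External(source) => SketchError::General(source.to_string()),
    }
}
```

The trade-off is the same as in the diff: the external variant loses its original error type and keeps only the message.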

roeap · 2023-02-17T18:49:09Z

rust/tests/command_optimize.rs

    assert!(maybe_metrics.is_err());
    assert_eq!(dt.version(), version + 1);
    Ok(())
 }

+#[ignore = "we do not yet re-try in operations commits."]


I'm finally working up to doing the commit conflict resolution, so this will be re-enabled soonish.

Blajda · 2023-02-18T00:25:17Z

@roeap The optimize changes LGTM.

There is one more thing I was considering while doing these updates. Essentially, we may be getting some undesired behaviour: the writer in the operations module already considers file size, so not all data in a bin might actually end up in the same file. There may also be a naïve yet much simpler way to do this now, by just passing all data in a partition through the partition writer. This should yield files of the desired size within the accuracy of the configured row-group size. It may of course lead to a rather small remainder file.

The bin packing optimization must be idempotent, so I think just passing the entire partition would violate that. Sorting files by their size ensures idempotency. It's good that you kept the check for only one add action, so any undesired behaviour should be caught and reported.
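As an illustration of why sorting matters, here is a small sketch of size-sorted bin packing. It uses first-fit decreasing, which is a technique swapped in for the example and not necessarily what the optimize operation implements; `FileEntry` and `pack_bins` are hypothetical names.

```rust
/// Hypothetical file descriptor: (path, size in bytes).
pub type FileEntry = (String, u64);

/// Groups small files into bins of at most `target_size` bytes using
/// first-fit decreasing. Sorting first makes the result independent of the
/// order in which files were listed.
pub fn pack_bins(mut files: Vec<FileEntry>, target_size: u64) -> Vec<Vec<FileEntry>> {
    // Largest first; ties are broken by path so the order is fully determined.
    files.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));

    let mut bins: Vec<(u64, Vec<FileEntry>)> = Vec::new();
    for file in files {
        if file.1 >= target_size {
            // Already at or above the target size; leave it untouched.
            continue;
        }
        // Find the first bin that still has room for this file.
        let slot = bins.iter().position(|bin| bin.0 + file.1 <= target_size);
        match slot {
            Some(i) => {
                bins[i].0 += file.1;
                bins[i].1.push(file);
            }
            None => bins.push((file.1, vec![file])),
        }
    }

    // A bin holding a single file would only rewrite that file, so drop it.
    bins.into_iter()
        .map(|(_, members)| members)
        .filter(|members| members.len() > 1)
        .collect()
}
```

In this sketch, files at or above the target and single-file bins are skipped, so a table whose files are all already at the target size produces no bins and nothing is rewritten.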

wjones127

Just a few minor suggestions. Overall looks good :)

rust/src/operations/optimize.rs

rust/src/operations/writer.rs


roeap added 4 commits February 17, 2023 05:08

chore: clippy fix

d3f10e6

chore: bump datqafusion and arrow

9fea909

chore!: update pyo3 function signatures

40611aa

feat: move and update optimize command

03ec7bc

roeap requested review from wjones127, fvaleye, rtyler, houqp, xianwill and mosyp as code owners February 17, 2023 18:43

github-actions bot added the binding/python, binding/rust, and rust labels Feb 17, 2023

roeap added 3 commits February 17, 2023 19:52

fix: add missing feature cfg

dbb8be9

fix: remove unwrap

9622ef5

docs: fix optimize documentation

3bdcd13

roeap mentioned this pull request Feb 22, 2023

feat: optimistic transaction protocol #632

Merged

wjones127 requested changes Feb 23, 2023

Apply suggestions from code review

a09252e

Co-authored-by: Will Jones

wjones127 approved these changes Feb 23, 2023

roeap merged commit 1b617e4 into delta-io:main Feb 23, 2023

roeap deleted the optimize-operation branch February 23, 2023 06:05
