RecordBatchTransformer: Handle schema migration and column re-ordering in table scans #602

sdd · 2024-09-04T22:58:24Z

Addresses parts 2 and 3 of #405.

Add support for type promotion and default values to the read pipeline.

When the scan includes fields that have undergone type promotion since some of the underlying Parquet files were written, those fields are promoted to the newer type before being returned to the user. For example, suppose a table contained a field "a" of type Float and some rows were written, and the table's schema was then changed so that field "a" is now a Double. When a table scan is performed, record batches coming from files written prior to the type promotion are dynamically converted so that field "a" is of type Double, matching the record batches returned from files written after the schema change.
When the scan includes fields that have been added since some of the files were written, record batches from those files are dynamically converted, as above, to contain the selected fields that were not present when the file was written. If the table schema specifies an initial-default-value for the field, all rows have that value for the new column. Otherwise, provided the column is not required, the rows have a value of null.
If any fields have been renamed, the record batch schemas for rows written before the rename occurred will be rewritten to contain the new field name.
If projected_field_ids is provided, the columns in the response will be re-ordered to match the order in the projection.
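
As a rough illustration of the four behaviours above, here is a minimal sketch using plain `std` types rather than Arrow arrays. `Column`, `TargetField`, `promote` and `transform` are hypothetical names invented for this example and are not the API added by this PR. The sketch assumes that file columns are matched to the current schema by field ID, which is what lets a rename fall out of taking the name from the target schema.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a column of values (not an Arrow array).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Float(Vec<Option<f32>>),
    Double(Vec<Option<f64>>),
}

/// Hypothetical description of one projected field in the current table schema.
/// For brevity, every target field in this sketch is a Double.
struct TargetField {
    field_id: i32,
    name: &'static str,
    initial_default: Option<f64>,
}

/// Promote a Float column to Double; Double columns pass through unchanged.
fn promote(col: &Column) -> Column {
    match col {
        Column::Float(values) => {
            Column::Double(values.iter().copied().map(|v| v.map(f64::from)).collect())
        }
        other => other.clone(),
    }
}

/// Build the output columns in projection order.
/// `file_columns` maps field ID to the column as it was stored in the data file.
fn transform(
    file_columns: &HashMap<i32, Column>,
    projection: &[TargetField],
    num_rows: usize,
) -> Vec<(&'static str, Column)> {
    projection
        .iter()
        .map(|field| {
            let column = match file_columns.get(&field.field_id) {
                // Present in the file: promote to the newer type if needed.
                Some(col) => promote(col),
                // Added after the file was written: fill every row with the
                // default value, or with nulls when there is no default.
                None => Column::Double(vec![field.initial_default; num_rows]),
            };
            // The name always comes from the current schema, so a renamed
            // field is returned under its new name.
            (field.name, column)
        })
        .collect()
}
```

Because the output is built by iterating over the projection, the result follows the projection's order regardless of the column order in the file, and file columns that are not projected are simply never looked up.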

crates/iceberg/src/arrow/record_batch_evolution_processor.rs

sdd · 2024-09-06T19:05:43Z

I need to go back to the drawing board on this. The current implementation breaks when not all columns in the file are in the list of projected fields.

sdd · 2024-09-09T07:49:15Z

OK, I've addressed the problem with projections other than the equivalent of SELECT *. All existing tests are passing, and the new tests have been extended to cover these cases.

sdd · 2024-09-09T18:29:11Z

@liurenjie1024 and @Xuanwo - ready for review when you get a chance

sdd · 2024-09-23T07:18:38Z

@Xuanwo and @liurenjie1024: PTAL, I've rebased and refactored this to better handle the pass-through case. It would be great to get this merged. Without it, I'm experiencing some annoying issues: when a table has had new columns added while pre-existing data lacks them, and new data has then been written in the new schema, a table scan streams back record batches that can have different schemas within the same stream, which is very difficult to deal with.

Xuanwo

Thanks a lot for working on this!

Xuanwo · 2024-09-23T10:38:24Z

Waiting for @liurenjie1024 to take another look.

liurenjie1024

Thanks @sdd for this high-quality PR, looks great to me! I believe the current approach works well without nested types, and we can add support for nested types later. Just one minor point about the arrow dependency; everything else looks great!

Cargo.toml

crates/iceberg/Cargo.toml

sdd · 2024-10-03T21:40:41Z

Hi @liurenjie1024 - I addressed your comment around not importing the whole of Arrow.

Additionally, I realised that there was an optimisation that could be made in the case that only the column names differed between the source and target schemas, so I've added code for that use case.

Also, the schemas can contain metadata which, where present, would cause the equality comparison to fail, so I changed the equality check to only compare on the aspects of the schema that we're interested in (data type, nullability, and column name).
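
To make the comparison concrete, here is a minimal sketch of a schema check that ignores metadata and also detects the names-only case mentioned above. It uses plain `std` types, not Arrow's `Schema` or `Field`; the `Field` struct, the `SchemaComparison` enum and `compare_schemas` are hypothetical names for this illustration and should not be read as the exact code in the PR.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Float,
    Double,
}

/// Simplified stand-in for a schema field. `metadata` is present only to
/// show that the comparison below never looks at it.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

#[derive(Debug, PartialEq)]
enum SchemaComparison {
    /// Same names, types and nullability: batches can pass through untouched.
    Equivalent,
    /// Only column names differ: swap the schema, keep the column data.
    NameChangesOnly,
    /// Types, nullability or column count differ: columns must be rebuilt.
    Different,
}

fn compare_schemas(source: &[Field], target: &[Field]) -> SchemaComparison {
    if source.len() != target.len() {
        return SchemaComparison::Different;
    }
    let mut names_differ = false;
    for (s, t) in source.iter().zip(target) {
        // Compare only data type and nullability here; metadata is ignored.
        if s.data_type != t.data_type || s.nullable != t.nullable {
            return SchemaComparison::Different;
        }
        if s.name != t.name {
            names_differ = true;
        }
    }
    if names_differ {
        SchemaComparison::NameChangesOnly
    } else {
        SchemaComparison::Equivalent
    }
}
```

In this sketch two schemas whose fields differ only in metadata compare as `Equivalent`, whereas a derived whole-struct equality check would report them as unequal.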

This is now ready for re-review (FAO @Xuanwo for this also).

Thanks! :-)

sdd · 2024-10-11T06:52:21Z

@Xuanwo PTAL - you approved an earlier version but there are some small additional changes since then.

I added:

a performance improvement for a particular scenario;
a change to how schemas are compared for equality for this specific case.

(See here: 0ed5721)

Xuanwo

Thank you @sdd for the great work! Also thanks @liurenjie1024 for the review. Let's move.

sdd marked this pull request as ready for review September 9, 2024 18:03

sdd force-pushed the record-batch-evolution-processor branch from 73c6724 to bf6d2ef September 9, 2024 18:08

sdd mentioned this pull request Sep 12, 2024

fix: reorder record batch #629

Closed

sdd force-pushed the record-batch-evolution-processor branch from 69e2f95 to 36a00fc September 23, 2024 06:08

sdd changed the title ~~Schema Evolution RecordBatch processor~~ RecordBatchTransformer: Handle schema migration and column re-ordering in table scans Sep 23, 2024

sdd force-pushed the record-batch-evolution-processor branch from 53a1267 to ef3a9fc September 23, 2024 07:14

sdd force-pushed the record-batch-evolution-processor branch from ef3a9fc to e0a1ac8 September 23, 2024 08:03

Xuanwo approved these changes Sep 23, 2024

liurenjie1024 reviewed Sep 30, 2024

sdd added 9 commits October 3, 2024 22:22

feat: Add skeleton of RecordBatchEvolutionProcessor

9fe476e

feat: Add initial implementation of RecordBatchEvolutionProcessor

afc86ea

feat: support more column types. Improve error handling. Add more comments

7172cce

feat(wip): adress issues with reordered / skipped fields

3d5d8c3

feat: RecordBatchEvolutionProcessor handles skipped fields in projection

3421fe1

chore: add missing license header

657f58b

chore: remove unneeded comment

81480b9

refactor: rename to RecordBatchTransformer. Improve passthrough handling

0b17465

feat: more performant handling of case where only schema transform is required but columns can remain unmodified

0ed5721

sdd force-pushed the record-batch-evolution-processor branch from e0a1ac8 to 3c9bbd1 October 3, 2024 21:23

refactor: import arrow_cast rather than arrow

61d4bdc

sdd force-pushed the record-batch-evolution-processor branch from 3c9bbd1 to 61d4bdc October 3, 2024 21:25

Xuanwo approved these changes Oct 11, 2024

Xuanwo merged commit 5c1a9e6 into apache:main Oct 11, 2024
17 checks passed
