RecordBatchWriter: MergeSchema not working when the table is partitioned #2355

Open
lasantosr opened this issue Mar 28, 2024 · 4 comments
Labels
binding/rust Issues for the Rust crate bug Something isn't working

Comments

@lasantosr
Contributor

lasantosr commented Mar 28, 2024

Environment

Delta-rs version: 0.17.1

Binding: Rust

Environment:

  • OS: Ubuntu 22.04

Bug

What happened:
When a table is partitioned by any column:

  • If I write a batch with a different schema using WriteMode::Default, the write succeeds but the data is written incomplete, instead of returning an error
  • If I write a batch with a different schema using WriteMode::MergeSchema, the new columns are not written

What you expected to happen:

  • In the first scenario, the write should fail because schema evolution is not allowed
  • In the second scenario, the new columns should be written and the table schema should evolve
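
To make the two expected outcomes concrete, here is a minimal sketch of the decision described above. It is a toy model, not delta-rs code: `Schema`, `Mode`, `WriteError`, and `resolve_schema` are hypothetical names introduced only for illustration, and a schema is reduced to a list of (column name, type name) pairs.

```rust
/// Hypothetical stand-in for a schema: ordered (column name, type name) pairs.
type Schema = Vec<(String, String)>;

/// Toy equivalent of the two write modes discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Default,
    MergeSchema,
}

/// Toy error type; only the mismatch case is modelled.
#[derive(Debug, PartialEq)]
enum WriteError {
    SchemaMismatch { detail: String },
}

/// Returns the schema the table should have after the write, or an error.
fn resolve_schema(table: &Schema, batch: &Schema, mode: Mode) -> Result<Schema, WriteError> {
    // Identical schemas are always accepted unchanged.
    if table == batch {
        return Ok(table.clone());
    }
    match mode {
        // Scenario 1: any difference is rejected when evolution is not allowed.
        Mode::Default => Err(WriteError::SchemaMismatch {
            detail: "batch schema differs from table schema".to_string(),
        }),
        // Scenario 2: unknown columns are appended; conflicting types still fail.
        Mode::MergeSchema => {
            let mut merged = table.clone();
            for (name, ty) in batch {
                let existing = merged
                    .iter()
                    .find(|(n, _)| n == name)
                    .map(|(_, t)| t.clone());
                match existing {
                    Some(t) if &t != ty => {
                        return Err(WriteError::SchemaMismatch {
                            detail: format!("column `{name}` is {t} in the table but {ty} in the batch"),
                        });
                    }
                    Some(_) => {}
                    None => merged.push((name.clone(), ty.clone())),
                }
            }
            Ok(merged)
        }
    }
}
```

With a table schema of `id`, `value`, `modified` and a batch that adds `name`, this sketch returns `SchemaMismatch` in `Mode::Default` and a four-column schema in `Mode::MergeSchema`, which is the behaviour expected regardless of whether the table is partitioned.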

How to reproduce it:
Instead of providing a minimal isolated example, I've altered the existing test:

        #[tokio::test]
        async fn test_write_mismatched_schema() {
            let batch = get_record_batch(None, false);
            let partition_cols = vec!["id".to_owned()];
            let table = create_initialized_table(&partition_cols).await;
            let mut writer = RecordBatchWriter::for_table(&table).unwrap();

            // Write the first batch with the first schema to the table
            writer.write(batch).await.unwrap();
            let adds = writer.flush().await.unwrap();
            assert_eq!(adds.len(), 2);

            // Create a second batch with a different schema
            let second_schema = Arc::new(ArrowSchema::new(vec![
                Field::new("id", DataType::Utf8, true),
                Field::new("value", DataType::Int32, true),
                Field::new("modified", DataType::Utf8, true),
                Field::new("name", DataType::Utf8, true),
            ]));
            let second_batch = RecordBatch::try_new(
                second_schema,
                vec![
                    Arc::new(StringArray::from(vec![Some("A"), Some("B")])),
                    Arc::new(Int32Array::from(vec![Some(1), Some(2)])),
                    Arc::new(StringArray::from(vec![
                        Some("2021-02-02"),
                        Some("2021-02-01"),
                    ])),
                    Arc::new(StringArray::from(vec![Some("will"), Some("robert")])),
                ],
            )
            .unwrap();

            let result = writer.write(second_batch).await;
            assert!(result.is_err());

            match result {
                Err(DeltaTableError::SchemaMismatch { .. }) => {
                    // this is expected
                }
                Err(others) => panic!("Got the wrong error: {others:?}"),
                Ok(_) => panic!("Should not have successfully written"),
            }
        }

More details:

  • The test fails because the result is not an error; the batch was written successfully
    • Also, if the second_schema doesn't have all of the columns from the first one, a different error is returned instead of SchemaMismatch
  • If WriteMode::MergeSchema is used, the new name column is not written and the schema does not evolve
@lasantosr lasantosr added the bug Something isn't working label Mar 28, 2024
@lasantosr
Contributor Author

This issue is related to #1386

@ion-elgreco
Collaborator

@lasantosr any reason you are using the RecordBatchWriter and not the write operation?

cc @rtyler

@ion-elgreco ion-elgreco added the binding/rust Issues for the Rust crate label Mar 28, 2024
@ion-elgreco ion-elgreco changed the title MergeSchema not working when the table is partitioned RecordBatchWriter: MergeSchema not working when the table is partitioned Mar 28, 2024
@lasantosr
Contributor Author

@ion-elgreco not really, I just saw some (maybe old?) example using it, but I can refactor to use the write operation.

It seems the bug will still be there, as it's using the same divide_by_partition_values() function that I suspect is one of the culprits.
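
As an illustration of how a partition-splitting step could lose new columns, here is a toy sketch. It is not the delta-rs `divide_by_partition_values()` implementation, and it does not claim that this is the actual cause: `Batch` and `split_by_partition` are hypothetical names, and the assumption being illustrated is that the split projects each row onto a fixed column list.

```rust
use std::collections::BTreeMap;

/// Toy column-oriented batch: column name -> values (all columns same length).
type Batch = BTreeMap<String, Vec<String>>;

/// Hypothetical helper (not the delta-rs function): split `batch` by the values
/// of `partition_col`, keeping only the columns listed in `keep`.
fn split_by_partition(
    batch: &Batch,
    partition_col: &str,
    keep: &[&str],
) -> BTreeMap<String, Batch> {
    let mut out: BTreeMap<String, Batch> = BTreeMap::new();
    // Without the partition column there is nothing to group by.
    let Some(keys) = batch.get(partition_col) else {
        return out;
    };
    for (row, key) in keys.iter().enumerate() {
        let part = out.entry(key.clone()).or_default();
        // Only columns named in `keep` are copied; anything else is silently dropped.
        for col in keep {
            if let Some(values) = batch.get(*col) {
                part.entry((*col).to_string())
                    .or_default()
                    .push(values[row].clone());
            }
        }
    }
    out
}
```

If `keep` is taken from the table's existing schema (`id`, `value`), a new `name` column in the batch never reaches the per-partition output and no error is raised; if `keep` is taken from the batch's own schema, the column survives. That matches the symptom reported here, under the stated assumption.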

@ion-elgreco
Collaborator

@lasantosr on my side, writing with schema_mode='merge' works with the write operation, even if the table is partitioned
