Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ExprSchema extraction of metadata for Cast expressions. #13305

Merged
merged 3 commits into from
Nov 11, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Nov 8, 2024

Which issue does this PR close?

Part of #12733

Rationale for this change

The field metadata was not being extracted from the ExprSchema for cast expressions. When cast was used as a projection on an aggregate, it would be missing the field metadata and fail the schema-misalignment check (see background epic).

The result was that the logical plan schema did not have the field metadata, whereas the physical plan schema did:

# physical_input_schema
Schema { 
   fields: [
      Field { 
        name: "dist", 
        data_type: Date32, 
        nullable: false, 
        dict_id: 0, 
        dict_is_ordered: false, 
        metadata: {"metadata_key": "ts non-nullable field"} 
      }
    ],
    metadata: {"metadata_key": "the entire schema"} 
}

# physical_input_schema_from_logical
Schema { 
    fields: [
      Field { 
        name: "dist", 
        data_type: Date32, 
        nullable: false, 
        dict_id: 0, 
        dict_is_ordered: false, 
        metadata: {}                     // the difference
      }
    ], 
    metadata: {"metadata_key": "the entire schema"} 
}

However, I'm not 💯 that this is the right fix. Since it triggers another test to fail -- because cast should drop input metadata since the type has changed. What is the proper behavior here? Instead, should we be removing the field metadata from the physical schema? Per the comment below, we agreed to have the cast expr retain the field metadata.

What changes are included in this PR?

commit 1 = test case with reproducer. Demonstrates exactly which scenario fails (vs is ok).
commit 2 = fix, and update test cases

Are these changes tested?

yes

Are there any user-facing changes?

no

@wiedld
Copy link
Contributor Author

wiedld commented Nov 8, 2024

note: This is a draft PR only, since I'm not sure what is the correct behavior. See the PR description, and please advise. 🙏🏼

@jayzhan211
Copy link
Contributor

I'm not sure why metadata is allowed to change for cast 🤔

@jayzhan211
Copy link
Contributor

@sydduckworth Do you remember why cast should drop input metadata since the type has changed

   #[test]
    fn test_expr_metadata() {
        let mut meta = HashMap::new();
        meta.insert("bar".to_string(), "buzz".to_string());
        let expr = col("foo");
        let schema = MockExprSchema::new()
            .with_data_type(DataType::Int32)
            .with_metadata(meta.clone());

        // col and alias should be metadata-preserving
        assert_eq!(meta, expr.metadata(&schema).unwrap());
        assert_eq!(meta, expr.clone().alias("bar").metadata(&schema).unwrap());

        // cast should drop input metadata since the type has changed
        assert_eq!(
            HashMap::new(),
            expr.clone()
                .cast_to(&DataType::Int64, &schema)
                .unwrap()
                .metadata(&schema)
                .unwrap()
        );

        let schema = DFSchema::from_unqualified_fields(
            vec![Field::new("foo", DataType::Int32, true).with_metadata(meta.clone())]
                .into(),
            HashMap::new(),
        )
        .unwrap();

        // verify to_field method populates metadata
        assert_eq!(&meta, expr.to_field(&schema).unwrap().1.metadata());
    }

@alamb
Copy link
Contributor

alamb commented Nov 8, 2024

Maybe it is time to consider changing the check for output schema = input schema to ignore metadata -- I know we have been holding off on that for a long time, but maybe fundamentally the check is not feasible to maintain 🤔

@alamb
Copy link
Contributor

alamb commented Nov 8, 2024

@sydduckworth Do you remember why cast should drop input metadata since the type has changed

For anyone else following along, the original PR that added this test and related code was

@sydduckworth
Copy link
Contributor

@sydduckworth Do you remember why cast should drop input metadata since the type has changed

When I wrote my PR, I had to make a decision about how to handle metadata propagation across expressions.
It seemed logical to me that the Column and Alias expressions should forward the source field's metadata, but my concern for other expressions was that forwarding the metadata could produce unexpected behavior, so I decided they would drop just any metadata in the source fields.

The only one I wasn't sure about was Cast since I could see an argument for it preserving the source metadata, but I eventually just went with the idea that any expression that can change the datatype of the input should reset its metadata and added that test assertion verify the behavior.

Long story short: there's no reason for that assertion other than my arbitrary decision when implementing the feature. To me it seems reasonable to change the behavior of Cast and remove (or invert) that assertion so the logical plan behavior matches the physical plan behavior.

@wiedld wiedld marked this pull request as ready for review November 9, 2024 06:19
@jayzhan211
Copy link
Contributor

@sydduckworth Do you remember why cast should drop input metadata since the type has changed

When I wrote my PR, I had to make a decision about how to handle metadata propagation across expressions. It seemed logical to me that the Column and Alias expressions should forward the source field's metadata, but my concern for other expressions was that forwarding the metadata could produce unexpected behavior, so I decided they would drop just any metadata in the source fields.

The only one I wasn't sure about was Cast since I could see an argument for it preserving the source metadata, but I eventually just went with the idea that any expression that can change the datatype of the input should reset its metadata and added that test assertion verify the behavior.

Long story short: there's no reason for that assertion other than my arbitrary decision when implementing the feature. To me it seems reasonable to change the behavior of Cast and remove (or invert) that assertion so the logical plan behavior matches the physical plan behavior.

Thanks for your reply, I think we can skip the assertion given the change is little. We can consider to ignore the metadata if we meet any case that blocking us to matches the plan checking

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@crepererum crepererum merged commit be19afc into apache:main Nov 11, 2024
26 checks passed
jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request Nov 12, 2024
…3305)

* test(12733): reproducers for schema bugs

* fix(12733): properly extract field metadata from Cast expr

* test(12733): update metadata preservation test, for new contract (a.k.a. cast preserves field metadata)
@alamb alamb deleted the 12733/schema-mismatch branch November 13, 2024 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants