Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename Expr::display_name to Expr::schema_name, make UNNEST naming conform to convention #11797

Merged
merged 36 commits into from
Aug 9, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Aug 4, 2024

Which issue does this PR close?

Part of #11782
Closes #.

Rationale for this change

display_name is used widely but it is actually the different things from Display trait. After taking a look, I think schema_name is a less confusing term, since we usually build the name with display_name for schema / field.

What changes are included in this PR?

schema_name is introduced and we should continue to use the schema_name for schema/field later on.
display_name is removed and we should use Display trait instead

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules substrait labels Aug 4, 2024
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 changed the title Replace display_name with schema_name Replace display_name with schema_name Aug 4, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 4, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review August 5, 2024 00:19
@jayzhan211 jayzhan211 marked this pull request as draft August 5, 2024 00:19
@jayzhan211 jayzhan211 marked this pull request as ready for review August 5, 2024 02:14
@alamb alamb changed the title Replace display_name with schema_name Rename Expr::display_name to Expr::schema_name, make UNNEST naming conform to convention Aug 6, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- this looks like an improvement to me 🙏

I think making a deprecated function would be really nice to help people upgrade

I also think if we are going to be messing around with the API we should consider updating the API to avoid allocations when possible

@@ -1137,7 +1137,7 @@ from arrays_values_without_nulls;
## array_element (aliases: array_extract, list_extract, list_element)

# Testing with empty arguments should result in an error
query error DataFusion error: Error during planning: Error during planning: array_element does not support zero arguments.
query error DataFusion error: Execution error: expect 2 args, got 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old error message was nicer -- maybe we can update the code that generates this message to be nicer too 🤔

datafusion/expr/src/expr.rs Show resolved Hide resolved
/// Those are the main difference
/// 1. Alias, where excludes expression
/// 2. Cast / TryCast, where takes expression only
pub fn schema_name(&self) -> Result<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to introduce a new API anyways, what do you think about making an API that doesn't require allocating a new string and making it infallable. Something like

Suggested change
pub fn schema_name(&self) -> Result<String> {
pub fn schema_name<'a>(&'a self) -> impl Display + 'a {

We could probably then implement a new type thing and mostly use the same code:

// new type wrapper
struct SchemaDisplay<'a>(&'a Expr);

impl <'a> Display for SchemaDisplay<'a> {
  fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        match self {
            // The same as Display
            Expr::Column(_)
            | Expr::Literal(_)
            | Expr::ScalarVariable(..)
            | Expr::Sort(_)
            | Expr::OuterReferenceColumn(..)
            | Expr::Placeholder(_)
            | Expr::Wildcard { .. } => write!(f, "{self}")?,
...
  }
}

return SchemaDisplay(self)

The benefit of doing this would be anywhere it needed to get formatted would not require a copy

println!("This would not allocated a String: {}", expr.schema_name());
// get schema name as a string:
let schema_name = expr.schema_name().to_string();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks cool!

Comment on lines 986 to 990
/// Returns the name for schema / field that is different from Display
/// Most of the expressions are the same as Display
/// Those are the main difference
/// 1. Alias, where excludes expression
/// 2. Cast / TryCast, where takes expression only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a suggesiton for how to improve this documentation:

Suggested change
/// Returns the name for schema / field that is different from Display
/// Most of the expressions are the same as Display
/// Those are the main difference
/// 1. Alias, where excludes expression
/// 2. Cast / TryCast, where takes expression only
/// The name of the column (field) that this `Expr` will produce.
///
/// For example, for a projection (e.g. `SELECT <expr>`) the resulting arrow
/// [`Schema`] will have a field with this name.
///
/// Note that the resulting string is subtlety different than the `Display`
/// representation for certain `Expr`. Some differences:
///
/// 1. [`Expr::Alias`], which shows only the alias itself
/// 2. [`Expr::Cast`] / [`Expr::TryCast`], which only displays the expression
///
/// # Example
/// ```
/// # use datafusion_expr::{col, lit};
/// let expr = col("foo").eq(lit(42));
/// assert_eq!("foo = Int32(42)", expr.schema_name().unwrap());
///
/// let expr = col("foo").alias("bar").eq(lit(11));
/// assert_eq!("bar = Int32(11)", expr.schema_name().unwrap());
/// ```
///
/// [`Schema`]: arrow::datatypes::Schema

@@ -358,8 +358,8 @@ mod tests {

let plan = from_substrait_plan(&ctx, &proto).await?;
let plan_str = format!("{:?}", plan);
assert_eq!(plan_str, "Projection: Decimal128(Some(10000),5,2) * sum(CASE WHEN FILENAME_PLACEHOLDER_1.p_type LIKE CAST(Utf8(\"PROMO%\") AS Utf8) THEN FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount ELSE Decimal128(Some(0),19,0) END) / sum(FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount) AS PROMO_REVENUE\
\n Aggregate: groupBy=[[]], aggr=[[sum(CASE WHEN FILENAME_PLACEHOLDER_1.p_type LIKE CAST(Utf8(\"PROMO%\") AS Utf8) THEN FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount ELSE Decimal128(Some(0),19,0) END), sum(FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount)]]\
assert_eq!(plan_str, "Projection: Decimal128(Some(10000),5,2) * sum(CASE WHEN FILENAME_PLACEHOLDER_1.p_type LIKE Utf8(\"PROMO%\") THEN FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount ELSE Decimal128(Some(0),19,0) END) / sum(FILENAME_PLACEHOLDER_0.l_extendedprice * Int32(1) - FILENAME_PLACEHOLDER_0.l_discount) AS PROMO_REVENUE\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense as long as the cast is still in plans (which it is)

Ok(format!("{}({})", self.name(), names.join(",")))
}

/// Returns the user-defined schema name of the UDF given the arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the user-defined schema name of the UDF given the arguments
/// Returns the name of the column this expression would create
///
/// See [`Expr::schema_name`] for details

Maybe it is also worth doing the impl Display thing here too which would allow avoiding these string allocations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems I need another SchemaFunctionDisplay for replacing String with impl Display in UDFImpl. I would like to hold on to the change for function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense -- I don't think there is any reason that this impl Display thing has to be done -- I was thinking it might be worth exploring if we were already changing the code anyways

@jayzhan211 jayzhan211 marked this pull request as draft August 7, 2024 03:29
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review August 7, 2024 14:33
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thanks @jayzhan211

@@ -1961,6 +2270,7 @@ impl fmt::Display for Expr {
},
Expr::Placeholder(Placeholder { id, .. }) => write!(f, "{id}"),
Expr::Unnest(Unnest { expr }) => {
// TODO: use Display instead of Debug, there is non-unique expression name in projection issue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we file a ticket to track this (maybe others would be interested in helping)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that #11198 is filed already.

@jayzhan211 jayzhan211 merged commit 7c41323 into apache:main Aug 9, 2024
24 checks passed
@jayzhan211 jayzhan211 deleted the expr-name branch August 9, 2024 01:00
@jayzhan211
Copy link
Contributor Author

Thanks @alamb

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 10, 2024
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Sep 10, 2024
emgeee pushed a commit to probably-nothing-labs/datafusion-python that referenced this pull request Sep 10, 2024
emgeee pushed a commit to probably-nothing-labs/datafusion-python that referenced this pull request Sep 10, 2024
timsaucer added a commit to apache/datafusion-python that referenced this pull request Sep 17, 2024
* update dependencies

* update get_logical_plan signature

* remove row_number() function

row_number was converted to a UDF in datafusion v42 apache/datafusion#12030
This specific functionality needs to be added back in.

* remove unneeded dependency

* fix pyo3 warnings

Implicit defaults for trailing optional arguments have been deprecated
in pyo3 v0.22.0 PyO3/pyo3#4078

* update object_store dependency

* change PyExpr -> PySortExpr

* comment out key.extract::<&PyTuple>() condition statement

* change more instances of PyExpr > PySortExpr

* update function signatures to use _bound versions

* remove clone

* Working through some of the sort requirement changes

* remove unused import

* expr.display_name is deprecated, used format!() + schema_name() instead

* expr.canonical_name() is deprecated, use format!() expr instead

* remove comment

* fix tuple extraction in dataframe.__getitem__()

* remove unneeded import

* Add docstring comments to SortExpr python class

* change extract() to downcast()

Co-authored-by: Michael J Ward <[email protected]>

* deprecate Expr::display_name

Ref: apache/datafusion#11797

* fix lint errors

* update datafusion commit hash

* fix type in cargo file for arrow features

* upgrade to datafusion 42

* cleanup

---------

Co-authored-by: Tim Saucer <[email protected]>
Co-authored-by: Michael J Ward <[email protected]>
Co-authored-by: Michael-J-Ward <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants