Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Alias from Expr #1468

Open
Dandandan opened this issue Dec 20, 2021 · 22 comments
Open

Remove Alias from Expr #1468

Dandandan opened this issue Dec 20, 2021 · 22 comments
Labels
api change Changes the API exposed to users of the crate enhancement New feature or request

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Dec 20, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently a named expression / alias can be encoded by wrapping an existing Expr in an Expr::Alias.

However, this has some side effects:

  • We can encode expressions that don't make sense, such as (1 + 1 AS x) AS y or (1 AS X) + (1 AS Y) (in SQL it's disallowed by the parser, but it's just to show it is possible to do this in the DataFrame API or the raw API. In dataframes we could do things like expr.alias("x").alias("y").

  • Code dealing with Expr always have to deal with the Alias case, even when it doesn't care. This could lead to more complex code or even subtle bugs.

Describe the solution you'd like

  • Create a new struct, NamedExpr that is used to refer to a named expression
struct NamedExpr {
    /// Alias or generated name
    name: String,
    /// The expression
    expr: Expr,
}
  • The NamedExpr now can be used inside projections, aggregates, etc.

  • The function alias on DataFrame should return an NamedExpr

  • Add an impl From<Expr> for NamedExpr that generates a name.

  • Some functions can accept Into to keep being ergonomic to use.

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added enhancement New feature or request api change Changes the API exposed to users of the crate design labels Dec 20, 2021
@Dandandan
Copy link
Contributor Author

FYI @alamb @houqp

@alamb
Copy link
Contributor

alamb commented Dec 21, 2021

I think NamedExpr sounds like a great idea, personally. 👍

@xudong963
Copy link
Member

Make sense to me @Dandandan

@houqp
Copy link
Member

houqp commented Dec 24, 2021

It's not entirely clear to me how we are going to propagate it through out the code base. But if we can pull it off, that would make the code more readable and maintainable for sure 👍

@jackwener
Copy link
Member

I think it is valuable, there are some case code just for handling the Alias::Expr, I'm willing to do this task.

I will push a draft PR for this issue this weekend.

@jackwener
Copy link
Member

jackwener commented Apr 9, 2022

@Dandandan . Hi, I'm a little confused about the details.

Specifically how to unify NamedExpr and Expr. I think option is a good way.


Now alias in Expr:

Expr is enum as

  • Alias
  • Column
  • ScalarVariable
  • ....

After use NamedExpr,The structure is like this

  • Alias
  • OtherExpr
    • Column
    • ScalarVariable
    • ....

Can we use this struct to present it?

struct NamedExpr {
    /// Alias or generated name
    name: Option<String>,
    /// The expression
    expr: Expr,
}

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 9, 2022

Hi @jackwener .

The idea is that any expression within DataFusion receives a name, so the nodes in a LogicalPlan use the NamedExpr type. The name in this type is either the alias for the expression if any or the generated name, so it doesn't need to be optional.

@jackwener
Copy link
Member

jackwener commented Apr 10, 2022

😅I found that it isn't easy problem, I need to think more about it. Expr is used in many places.

@jackwener
Copy link
Member

Aliases in SQL is just used in Table and Column in the fromClause and selectClause. So, like Aggregate, aggr_expr is NamedExpr, but group_expr isn't.

@jackwener
Copy link
Member

jackwener commented Apr 10, 2022

In addition, I find some Filter Expr use Alias after rewrite which make logicplan can be connected after rewrited.
It's related with #1319 #1316

It mayn't a good way by using alias. Because we must add alias after write Expr.
BTW, projection is same. we must add alias after rewrite plan
I think it is a future ticket.

@andygrove
Copy link
Member

I plan on working on this since I think it will solve some problems I am running into when working with expressions and schemas.

@andygrove andygrove self-assigned this May 3, 2022
@andygrove andygrove removed their assignment May 5, 2022
@andygrove
Copy link
Member

I spent some time on this and I am no longer sure that this change makes sense. We have a lot of optimizer rules that recurse through expression trees rewriting them and if we move the Alias variant out of the Expr enum it complicates things because we still need to recurse into the expression contained within the alias. It doesn't feel like the effort here justifies the benefit, IMO.

There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example. Maybe we just need some validation rules to ensure plans make sense and better utility methods for dealing with alias in cases where it is not valid?

@alamb alamb removed the design label Jul 17, 2024
@findepi
Copy link
Member

findepi commented Aug 24, 2024

I run into this recently. When constructing VALUES via API, the Alias expression can be thought to allow aliasing the VALUES columns, but this didn't work, alias got (silently) ignored. From such "unpleasant surprise" perspective, I would consider this as a bug.

To me, alias is not an expression at all. It's a feature of a select clause and would be best modeled as such.

There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example.

That's a good point and good example.
To me, in ORDER BY <expr> [ASC/DESC] [NULLS FIRST/LAST], the <expr> part is an expression (any expression), and the other attributes (asc/desc, nulls first/last) are attributes of the sorting. They don't have to be modeled as "an expression".

@alamb
Copy link
Contributor

alamb commented Aug 24, 2024

I run into this recently. When constructing VALUES via API, the Alias expression can be thought to allow aliasing the VALUES columns, but this didn't work, alias got (silently) ignored. From such "unpleasant surprise" perspective, I would consider this as a bug.

I agree it sounds like a bug to me

To me, alias is not an expression at all. It's a feature of a select clause and would be best modeled as such.

Theoretically I think this makes sense

One thing that is different about DataFusion than other systems is that the schemas are based on name (rather than column order) and the names of the columns are derived from the expressions

So the point is that an expression like a + 5 + 1 will result in a column named approximately "a + 5 + 1" -- so when the optimizer passes rewrite expressions they need to preserve the output name, and they do so using Expr::Alias

So in this case

a + 5 + 1

Becomes

a + 6 as "a + 5 + 1"

That rewrite can happen in multiple parts of the plan, not just the select list

There are likely other ways to handle this than Expr::Alias (a separate list on Projection, for example) but I do think alias is used widely

There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example.

That's a good point and good example. To me, in ORDER BY <expr> [ASC/DESC] [NULLS FIRST/LAST], the <expr> part is an expression (any expression), and the other attributes (asc/desc, nulls first/last) are attributes of the sorting. They don't have to be modeled as "an expression".

I think Sort would be an easier thing to remove / fix -- Expr::Sort as an expression is also bad as it means the signatures of fn order_by(...) are in terms of Expr, meaning the compiler can't ensure you are actually passing Expr::Sort when needed

@findepi
Copy link
Member

findepi commented Aug 25, 2024

I think Sort would be an easier thing to remove / fix -- Expr::Sort as an expression is also bad as it means the signatures of fn order_by(...) are in terms of Expr, meaning the compiler can't ensure you are actually passing Expr::Sort when needed

Indeed. There are a few places where sort expression needs to be special cased.

There are likely other ways to handle this than Expr::Alias (a separate list on Projection, for example) but I do think alias is used widely

I see the point. I would be tempted to go into direction where names are explicit and expressions are just expressions.
Ie to model projects as map<String /* name to assign */, Expr /* value to calculate */>

one thing is how easy it is to get there (i get that likely pretty hard per "alias is used widely")
and the other is whether we would want to get there at all (is this the right direction)

@alamb
Copy link
Contributor

alamb commented Aug 25, 2024

one thing is how easy it is to get there (i get that likely pretty hard per "alias is used widely")

I agree

and the other is whether we would want to get there at all (is this the right direction)

I don't have any strong feeling in this regard -- in general Expr::Alias seems to work ok, though there are some places where having to handle it (in filters, etc) have caused some issues in the past (which is why there is Expr::unalias and various other methods)

@findepi
Copy link
Member

findepi commented Aug 25, 2024

Besides fixing some problems (described in this issue + #1468 (comment)), separating "expression" and "named expression" could result in lower mental overhead involved when dealing with the code, given that in many contexts expressions don't have & don't need names (eg VALUES, ORDER BY, function inputs).

@findepi
Copy link
Member

findepi commented Aug 27, 2024

I think Sort would be an easier thing to remove / fix -- Expr::Sort as an expression is also bad as it means the signatures of fn order_by(...) are in terms of Expr, meaning the compiler can't ensure you are actually passing Expr::Sort when needed

filed #12193 for this
and created a PR removing Expr::Sort (#12177)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 27, 2024

I'm confused about the relation in Alias. What is it for? Could we remove it? While digging into #12184, I think we don't need to keep relation in Alias, instead get the relation from expr (Specifically Expr::Column) if needed.

@alamb
Copy link
Contributor

alamb commented Sep 5, 2024

I'm confused about the relation in Alias. What is it for? Could we remove it? While digging into #12184, I think we don't need to keep relation in Alias, instead get the relation from expr (Specifically Expr::Column) if needed.

It may be historical -- Expr::Alias is quite old code I think

@findepi
Copy link
Member

findepi commented Nov 20, 2024

Schema / DFSchema has dual purpose and this is related to Expr::Alias existence and handling.

One necessary purpose is understanding source tables, and identifier resolution in the SQL queries - a necessary job to be done once SQL syntax tree (AST) is produced by the parser.
The other purpose is for column/symbol/variable resolution between the logical plan components. Some engines (like Trino) use separate abstraction for that purpose. Reusing schema here makes it very hard to solve #13476, #6543 (see WIP #13489 (comment))

I know addressing this is hard, but in 5 year from now perspective, would we prefer these issues to be still with us? Or would we prefer to address them sooner or later?

I think we should refactor LogicalPlans not to use aliases and column references at all. This can fall under #12723. We should find a way to do this incrementally, and I believe there is a way to do so.

@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

I know addressing this is hard, but in 5 year from now perspective, would we prefer these issues to be still with us? Or would we prefer to address them sooner or later?

Well I am always for improving the situation!

I think we should refactor LogicalPlans not to use aliases and column references at all. This can fall under #12723. We should find a way to do this incrementally, and I believe there is a way to do so.

I agree -- doing it incrementally is the way to go and the only challenge will be to gather enough people / time to help make it happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants