-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Alias
from Expr
#1468
Comments
I think |
Make sense to me @Dandandan |
It's not entirely clear to me how we are going to propagate it through out the code base. But if we can pull it off, that would make the code more readable and maintainable for sure 👍 |
I think it is valuable, there are some case code just for handling the Alias::Expr, I'm willing to do this task. I will push a draft PR for this issue this weekend. |
@Dandandan . Hi, I'm a little confused about the details. Specifically how to unify Now alias in Expr is enum as
After use
Can we use this struct to present it? struct NamedExpr {
/// Alias or generated name
name: Option<String>,
/// The expression
expr: Expr,
} |
Hi @jackwener . The idea is that any expression within DataFusion receives a name, so the nodes in a LogicalPlan use the |
😅I found that it isn't easy problem, I need to think more about it. |
|
In addition, I find some It mayn't a good way by using alias. Because we must add alias after write |
I plan on working on this since I think it will solve some problems I am running into when working with expressions and schemas. |
I spent some time on this and I am no longer sure that this change makes sense. We have a lot of optimizer rules that recurse through expression trees rewriting them and if we move the There are other |
I run into this recently. When constructing VALUES via API, the Alias expression can be thought to allow aliasing the VALUES columns, but this didn't work, alias got (silently) ignored. From such "unpleasant surprise" perspective, I would consider this as a bug. To me, alias is not an expression at all. It's a feature of a select clause and would be best modeled as such.
That's a good point and good example. |
I agree it sounds like a bug to me
Theoretically I think this makes sense One thing that is different about DataFusion than other systems is that the schemas are based on name (rather than column order) and the names of the columns are derived from the expressions So the point is that an expression like So in this case a + 5 + 1 Becomes a + 6 as "a + 5 + 1" That rewrite can happen in multiple parts of the plan, not just the select list There are likely other ways to handle this than
I think Sort would be an easier thing to remove / fix -- |
Indeed. There are a few places where sort expression needs to be special cased.
I see the point. I would be tempted to go into direction where names are explicit and expressions are just expressions. one thing is how easy it is to get there (i get that likely pretty hard per "alias is used widely") |
I agree
I don't have any strong feeling in this regard -- in general |
Besides fixing some problems (described in this issue + #1468 (comment)), separating "expression" and "named expression" could result in lower mental overhead involved when dealing with the code, given that in many contexts expressions don't have & don't need names (eg VALUES, ORDER BY, function inputs). |
filed #12193 for this |
I'm confused about the |
It may be historical -- |
Schema / DFSchema has dual purpose and this is related to Expr::Alias existence and handling. One necessary purpose is understanding source tables, and identifier resolution in the SQL queries - a necessary job to be done once SQL syntax tree (AST) is produced by the parser. I know addressing this is hard, but in 5 year from now perspective, would we prefer these issues to be still with us? Or would we prefer to address them sooner or later? I think we should refactor LogicalPlans not to use aliases and column references at all. This can fall under #12723. We should find a way to do this incrementally, and I believe there is a way to do so. |
Well I am always for improving the situation!
I agree -- doing it incrementally is the way to go and the only challenge will be to gather enough people / time to help make it happen. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently a named expression / alias can be encoded by wrapping an existing
Expr
in anExpr::Alias
.However, this has some side effects:
We can encode expressions that don't make sense, such as
(1 + 1 AS x) AS y
or(1 AS X) + (1 AS Y)
(in SQL it's disallowed by the parser, but it's just to show it is possible to do this in the DataFrame API or the raw API. In dataframes we could do things likeexpr.alias("x").alias("y")
.Code dealing with
Expr
always have to deal with theAlias
case, even when it doesn't care. This could lead to more complex code or even subtle bugs.Describe the solution you'd like
NamedExpr
that is used to refer to a named expressionThe
NamedExpr
now can be used inside projections, aggregates, etc.The function
alias
onDataFrame
should return anNamedExpr
Add an
impl From<Expr> for NamedExpr
that generates a name.Some functions can accept Into to keep being ergonomic to use.
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: