Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ambiguous reference when aliasing in combination with ORDER BY #8425

Merged
merged 41 commits into from
Dec 6, 2023

Conversation

Asura7969
Copy link
Contributor

Which issue does this PR close?

Closes #8391.

Rationale for this change

❯ CREATE TABLE t (a INTEGER);
0 rows in set. Query took 0.003 seconds.
❯ SELECT t.a AS a FROM t ORDER BY t.a;
Schema error: Schema contains qualified field name t.a and unqualified field name a which would be ambiguous

What changes are included in this PR?

When the field alias is consistent with the original name, cancel the alias

Are these changes tested?

Overwrite existing tests

Are there any user-facing changes?

# Conflicts:
#	datafusion/physical-plan/src/joins/hash_join_utils.rs
# Conflicts:
#	datafusion/physical-plan/src/joins/hash_join_utils.rs
@github-actions github-actions bot added the sql SQL Planner label Dec 5, 2023
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Dec 5, 2023
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Dec 5, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Asura7969 . This is a really nice fix 🕵️ 💯

The code in this PR looks good to me, though I do wonder abut the relation field (see inline comment)

Also, I tested the original query locally and this PR does indeed fix the problem 🎉

DataFusion CLI v33.0.0
❯ CREATE TABLE t (a INTEGER);
0 rows in set. Query took 0.018 seconds.

❯ SELECT t.a AS a FROM t ORDER BY t.a;
0 rows in set. Query took 0.011 seconds.

❯

Can you please add some tests to ensure we don't break it accidentally in the future (the only tests that were changed are related to subqueries and I worry someone will accidentally break remove the fix code and not realize we introduced a regression)?

These should probably be end to end sqllogic tests:
https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest

Perhaps added to https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/select.slt

@@ -384,7 +384,11 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
&[&[plan.schema()]],
&plan.using_columns()?,
)?;
let expr = col.alias(self.normalizer.normalize(alias));
let name = self.normalizer.normalize(alias);
let expr = match &col {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment here about what this is doing? I think it is avoiding adding an alias if the column name is the same.

Do we have to worry about relation names here too (like what if column.relation is non null?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but I didn't expect the scenario where column.relation is inconsistent🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of a query like

EXPLAIN select a as a FROM table1 CROSS JOIN table2 order by a

Where both table1 and table2 have a column named a. I think we should add a test for this case too -- I'll try and do so later today

I would expect the test to fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you expected

DataFusion CLI v33.0.0
❯ CREATE TABLE table1 (a int) as values(1);
0 rows in set. Query took 0.037 seconds.

❯ CREATE TABLE table2 (a int) as values(1);
0 rows in set. Query took 0.012 seconds.

❯ EXPLAIN select a as a FROM table1 CROSS JOIN table2 order by a;
Schema error: Ambiguous reference to unqualified field a

let expected = "Sort: CAST(first_name AS first_name AS Int32) ASC NULLS LAST\
\n Projection: first_name AS first_name\
\n Projection: person.first_name AS first_name\
let expected = "Sort: CAST(person.first_name AS Int32) ASC NULLS LAST\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this certainly looks better

@alamb
Copy link
Contributor

alamb commented Dec 5, 2023

fyi @Jesse-Bakker

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 6, 2023
\n TableScan: person";
quick_test(sql, expected);
}

#[test]
fn test_avoid_add_alias() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure logical consistency

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Asura7969 -- there is just one last thing I worry about with multiple relation names that I think we can resolve with some more tests.

@@ -384,7 +384,11 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
&[&[plan.schema()]],
&plan.using_columns()?,
)?;
let expr = col.alias(self.normalizer.normalize(alias));
let name = self.normalizer.normalize(alias);
let expr = match &col {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of a query like

EXPLAIN select a as a FROM table1 CROSS JOIN table2 order by a

Where both table1 and table2 have a column named a. I think we should add a test for this case too -- I'll try and do so later today

I would expect the test to fail


# ambiguous column references in on join
query error DataFusion error: Schema error: Ambiguous reference to unqualified field a
EXPLAIN select a as a FROM table1 t1 CROSS JOIN table1 t2 order by a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sticking with this @Asura7969 -- looks good to me

@alamb alamb merged commit eb08846 into apache:main Dec 6, 2023
22 checks passed
@Asura7969 Asura7969 deleted the ambiguous_reference branch December 7, 2023 00:44
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
…pache#8425)

* Minor: Improve the document format of JoinHashMap

* ambiguous reference

* ignore aliases

* fix

* fix

* add test

* add test

* add test

* add test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: Ambiguous reference when aliasing in combination with ORDER BY
2 participants