Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duplicate columns cause join WHERE expressions to fail #3072

Closed
ukclivecox opened this issue Oct 18, 2024 · 2 comments · Fixed by #3074
Closed

duplicate columns cause join WHERE expressions to fail #3072

ukclivecox opened this issue Oct 18, 2024 · 2 comments · Fixed by #3074
Assignees
Labels
bug Something isn't working

Comments

@ukclivecox
Copy link

Describe the bug

When there are duplicate columns after a join SQL expressions using table aliases fail and the SQL returns zero rows.

To Reproduce

# test for left join on multiple tables with rows
import daft

df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})

df2 = daft.from_pydict({"idx":[3],"score":[0.1]})

df3 = daft.from_pydict({"idx":[1],"score":[0.1]})

df_sql = daft.sql("select * from df1 left join df2 on (df1.idx=df2.idx) left join df3 on (df1.idx=df3.idx) where df3.score > 0").show()

This produces:
Screenshot_2024-10-18_15-43-28

If one renames one table column so there are no duplicates this works.

# test for left join on multiple tables with rows
import daft

df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})

df2 = daft.from_pydict({"idx":[3],"score1":[0.1]})

df3 = daft.from_pydict({"idx":[1],"score":[0.1]})

df_sql = daft.sql("select * from df1 left join df2 on (df1.idx=df2.idx) left join df3 on (df1.idx=df3.idx) where df3.score > 0").show()

Screenshot_2024-10-18_15-44-36

I assume this has to do with renaming of dataframe columns to prefix with right.?

Expected behavior

Table aliases work as expected.

Component(s)

SQL

Additional context

Testing with nightly build 0.3.8+dev0019.e4c6f3fa that has additions for joins from #3066

@ukclivecox ukclivecox added bug Something isn't working needs triage labels Oct 18, 2024
@universalmind303 universalmind303 self-assigned this Oct 18, 2024
@universalmind303
Copy link
Contributor

removing the sql label as this appears to be the case when joining through the dataframe api as well.

import daft

df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})
df2 = daft.from_pydict({"idx":[3],"score":[0.1]})
df3 = daft.from_pydict({"idx":[1],"score":[0.1]})

df = (df1
    .join(df2, on="idx", how="left")
    .join(df3, on="idx", how="left")
    .where(col('score') > 0)
    .collect()
)
df

╭───────┬───────┬─────────┬─────────────╮
│ idxvalscoreright.score │
│ ------------         │
│ Int64Int64Float64Float64     │
╞═══════╪═══════╪═════════╪═════════════╡
╰───────┴───────┴─────────┴─────────────╯
(No data to display: Materialized dataframe has no rows)

@universalmind303 universalmind303 changed the title SQL: duplicate columns cause join SQL WHERE expressions to fail duplicate columns cause join WHERE expressions to fail Oct 18, 2024
@ukclivecox
Copy link
Author

I would add that one could make the case that the Dataframe API has different semantics and you need to use the right. format for where clauses for duplicate columns. However, for SQL this doesn't make sense so needs to be fixed even if Dataframe API stays the same. But I assume the two ideally would act the same - though with SQL there can be various ways to identify a column in the SQL expression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants