Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Inconsistent behaviour for aggregate expressions after joins #7288

Closed
1 task done
binste opened this issue Oct 3, 2023 · 4 comments
Closed
1 task done

bug: Inconsistent behaviour for aggregate expressions after joins #7288

binste opened this issue Oct 3, 2023 · 4 comments
Labels
bug Incorrect behavior inside of ibis chained joins The bane of Ibis's existence

Comments

@binste
Copy link
Contributor

binste commented Oct 3, 2023

What happened?

It was mentioned before in a few issues such as #5537 that there are some rough edges around joins. One that tripped me up just now happens in aggregated expressions after joins. Taking the example from the join + aggregation section in the docs:

import ibis

t1 = ibis.table(dict(value1="float", key1="string", key2="string"), name="table1")
t2 = ibis.table(dict(value2="float", key3="string", key4="string"), name="table2")

avg_diff = (t1.value1 - t2.value2).mean()
expr = (
    t1.left_join(t2, t1.key1 == t2.key3)
    .group_by(t1.key1)
    .aggregate(avg_diff=avg_diff)
)
ibis.to_sql(expr)

gives me the following as expected

SELECT
  t0.key1,
  AVG(t0.value1 - t1.value2) AS avg_diff
FROM table1 AS t0
LEFT OUTER JOIN table2 AS t1
  ON t0.key1 = t1.key3
GROUP BY
  1

However, when I slightly change this example by renaming t2.key3 to t2.key1:

t2 = ibis.table(dict(value2="float", key1="string", key4="string"), name="table2")

and do exactly the same but now join on key:

avg_diff = (t1.value1 - t2.value2).mean()
expr = (
    t1.left_join(t2, "key1")
    .group_by(t1.key1)
    .aggregate(avg_diff=avg_diff)
)
ibis.to_sql(expr)

I get a cross-join:

SELECT
  t0.key1,
  AVG(t1.value1 - t1.value2) AS avg_diff
FROM (
  SELECT
    t1.value1 AS value1,
    t1.key1 AS key1,
    t1.key2 AS key2,
    t2.value2 AS value2,
    t2.key1 AS key1_right,
    t2.key4 AS key4
  FROM table1 AS t1
  LEFT OUTER JOIN table2 AS t2
    ON t1.key1 = t2.key1
) AS t0, table1 AS t1, table2 AS t1
GROUP BY
  1

I can fix it by using the deferred object in the aggregation expression:

from ibis import deferred as d_

avg_diff = (d_.value1 - d_.value2).mean()
expr = (
    t1.left_join(t2, "key1")
    .group_by(t1.key1)
    .aggregate(avg_diff=avg_diff)
)
ibis.to_sql(expr)

Which removes the cross-join:

SELECT
  t0.key1,
  AVG(t0.value1 - t0.value2) AS avg_diff
FROM (
  SELECT
    t1.value1 AS value1,
    t1.key1 AS key1,
    t1.key2 AS key2,
    t2.value2 AS value2,
    t2.key1 AS key1_right,
    t2.key4 AS key4
  FROM table1 AS t1
  LEFT OUTER JOIN table2 AS t2
    ON t1.key1 = t2.key1
) AS t0
GROUP BY
  1

As mentioned in #5537, maybe it's worth to disable this behaviour completely?

  • Our current mixed support for passing in expressions that don't share a common parent (e.g. t1.join(t2, "id").join(t3, "id").select(t1.a)) can lead to invalid or weird queries. In the past we had talked about fully outlawing these cases and erroring nicely. I think this would be good to do, and would eliminate some of the issues here.

If you would not want to do this, I'd recommend to no longer advertise it in the documentation where it currently states:

You can directly aggregate a join without need for projection, which also allows you to form statistics that reference any of the joined tables.

What version of ibis are you using?

7.0.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@binste binste added the bug Incorrect behavior inside of ibis label Oct 3, 2023
@cpcloud
Copy link
Member

cpcloud commented Oct 3, 2023

Thanks @binste!

Yeah, this is currently a rough edge.

We are pretty deep into exploring ideas to resolve this "once and for all", hopefully in a way that preserves existing column deduplication behavior and generates the right code whether or not deferred is used.

@cpcloud cpcloud added the chained joins The bane of Ibis's existence label Nov 16, 2023
@lostmygithubaccount lostmygithubaccount moved this from backlog to cooking in Ibis planning and roadmap Dec 4, 2023
@lostmygithubaccount lostmygithubaccount moved this from cooking to todo in Ibis planning and roadmap Dec 4, 2023
@lostmygithubaccount lostmygithubaccount moved this from todo to cooking in Ibis planning and roadmap Dec 4, 2023
@cpcloud cpcloud linked a pull request Dec 6, 2023 that will close this issue
@cpcloud
Copy link
Member

cpcloud commented Dec 6, 2023

This will be addressed by #7580.

It'll be a few releases before it lands, but it's well under way and things are looking good!

cpcloud added a commit to kszucs/ibis that referenced this issue Dec 6, 2023
@binste
Copy link
Contributor Author

binste commented Dec 7, 2023

Very exciting, thanks for working on it and for the update!

cpcloud added a commit to kszucs/ibis that referenced this issue Dec 7, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 14, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 14, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 14, 2023
kszucs pushed a commit to kszucs/ibis that referenced this issue Dec 18, 2023
kszucs pushed a commit to kszucs/ibis that referenced this issue Dec 21, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 21, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 22, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 22, 2023
cpcloud added a commit to kszucs/ibis that referenced this issue Dec 23, 2023
@lostmygithubaccount lostmygithubaccount moved this from cooking to backlog in Ibis planning and roadmap Jan 26, 2024
@cpcloud
Copy link
Member

cpcloud commented Jan 27, 2024

Fixed in the-epic-split

@cpcloud cpcloud closed this as completed Jan 27, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis chained joins The bane of Ibis's existence
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants