Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add option to disable FILTER when using distinct #10567

Open
1 task done
Riezebos opened this issue Dec 9, 2024 · 4 comments
Open
1 task done

feat: Add option to disable FILTER when using distinct #10567

Riezebos opened this issue Dec 9, 2024 · 4 comments
Labels
feature Features or general enhancements

Comments

@Riezebos
Copy link
Contributor

Riezebos commented Dec 9, 2024

Is your feature request related to a problem?

.distinct(on=[...]) applies a filter on NULLs by default.

t1 = ibis.memtable(
    {
        "person_id": [1, 1],
        "contacted_at": [ibis.date("2024-01-01").execute(), None],
        "updated_at": [None, ibis.date("2024-01-01").execute()],
    }
)
t1.order_by(_.updated_at.desc()).distinct(on=["person_id"])

Output:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ person_id ┃ contacted_at ┃ updated_at ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64     │ date         │ date       │
├───────────┼──────────────┼────────────┤
│         1 │ 2024-01-01   │ 2024-01-01 │
└───────────┴──────────────┴────────────┘

What is the motivation behind your request?

I have a table with "updates", I want to get the exact first row when ordering by the "updated_at" column. I don't want values from other rows in the result.

Describe the solution you'd like

.distinct(on=[...], filter_nulls=False)

What version of ibis are you running?

9.5.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Riezebos Riezebos added the feature Features or general enhancements label Dec 9, 2024
@IndexSeek
Copy link
Member

Thank you for reporting this and for providing the code to reproduce.

We will need to look deeper at how we can achieve this. The query this produces behind the scenes looks like it is filtering out nulls, as you describe.

SQL
SELECT
  "t1"."person_id",
  FIRST("t1"."contacted_at") FILTER(WHERE
    "t1"."contacted_at" IS NOT NULL) AS "contacted_at",
  FIRST("t1"."updated_at") FILTER(WHERE
    "t1"."updated_at" IS NOT NULL) AS "updated_at"
FROM (
  SELECT
    *
  FROM "ibis_pandas_memtable_yqklspibhbeqzabymmjjnn35bi" AS "t0"
  ORDER BY
    "t0"."updated_at" DESC
) AS "t1"
GROUP BY
  1

In the meantime, would something like this work for you?

t1.filter(
    ibis.row_number().over(group_by=_.person_id, order_by=_.updated_at.desc()) == 0
)

@cpcloud
Copy link
Member

cpcloud commented Dec 20, 2024

There's a secondary problem here which is your assumption that a subquery with an ORDER BY will return rows to the outer in the order of the subquery's ORDER BY.

This is a common pitfall with SQL.

The solution is that you'll have to manually construct the distinct query while specifying an order_by argument to each first call, like so:

In [18]: expr = t1.group_by("person_id").agg(s.across(~s.cols("person_id"), _.first(order_by=lambda t: t.updated_at.desc(), include_null=True)))

In [19]: expr
Out[19]:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ person_id ┃ contacted_at ┃ updated_at ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64     │ date         │ date       │
├───────────┼──────────────┼────────────┤
│         1 │ NULL         │ 2024-01-01 │
└───────────┴──────────────┴────────────┘

In [20]: ibis.to_sql(expr)
Out[20]:
SELECT
  "t0"."person_id",
  FIRST("t0"."contacted_at" ORDER BY "t0"."updated_at" DESC) AS "contacted_at",
  FIRST("t0"."updated_at" ORDER BY "t0"."updated_at" DESC) AS "updated_at"
FROM "ibis_pandas_memtable_3oqi2bamczbh7bxoihwiv6ibwa" AS "t0"
GROUP BY
  1

@Riezebos
Copy link
Contributor Author

Thanks for the solution and the clarification! That is exactly what I needed.

I was trying something similar before, but couldn't find documentation on first, only the first method in selectors: https://ibis-project.org/reference/selectors.html#ibis.selectors.first
Is it this one? https://ibis-project.org/reference/operations#ibis.expr.operations.reductions.First
And does that mean any class in operations.reductions can be used within .aggregate and that any kwargs will be passed to the backend?

I have read quite some ibis docs over the months and have learned a lot, but I don't fully understand this area yet.

Maybe I should change this feature request then. What I would find beneficial is in essence a full equivalent of pandas df.drop_duplicates(). The docs of .distinct say it is similar to it.

It'd be nice if distinct would have an option to accomplish the same as the group by version. E.g. with a syntax like

t.distinct(on="person_id", order_by=[_.updated_at.desc()] ,filter_nulls=False)

@Riezebos
Copy link
Contributor Author

Ah sorry I found the correct docs: https://ibis-project.org/reference/expression-generic#ibis.expr.types.generic.Column.first

I can't find those docs by searching for "first" though, maybe because it has the exact same name as the other method called first?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: backlog
Development

No branches or pull requests

3 participants