feat: Add option to disable FILTER when using `distinct` #10567

Riezebos · 2024-12-09T17:45:08Z

Is your feature request related to a problem?

.distinct(on=[...]) applies a filter on NULLs by default.

t1 = ibis.memtable(
    {
        "person_id": [1, 1],
        "contacted_at": [ibis.date("2024-01-01").execute(), None],
        "updated_at": [None, ibis.date("2024-01-01").execute()],
    }
)
t1.order_by(_.updated_at.desc()).distinct(on=["person_id"])

Output:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ person_id ┃ contacted_at ┃ updated_at ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64     │ date         │ date       │
├───────────┼──────────────┼────────────┤
│         1 │ 2024-01-01   │ 2024-01-01 │
└───────────┴──────────────┴────────────┘

What is the motivation behind your request?

I have a table with "updates", I want to get the exact first row when ordering by the "updated_at" column. I don't want values from other rows in the result.

Describe the solution you'd like

.distinct(on=[...], filter_nulls=False)

What version of ibis are you running?

9.5.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

IndexSeek · 2024-12-13T00:54:28Z

Thank you for reporting this and for providing the code to reproduce.

We will need to look deeper at how we can achieve this. The query this produces behind the scenes looks like it is filtering out nulls, as you describe.

SQL

SELECT
  "t1"."person_id",
  FIRST("t1"."contacted_at") FILTER(WHERE
    "t1"."contacted_at" IS NOT NULL) AS "contacted_at",
  FIRST("t1"."updated_at") FILTER(WHERE
    "t1"."updated_at" IS NOT NULL) AS "updated_at"
FROM (
  SELECT
    *
  FROM "ibis_pandas_memtable_yqklspibhbeqzabymmjjnn35bi" AS "t0"
  ORDER BY
    "t0"."updated_at" DESC
) AS "t1"
GROUP BY
  1

In the meantime, would something like this work for you?

t1.filter(
    ibis.row_number().over(group_by=_.person_id, order_by=_.updated_at.desc()) == 0
)

cpcloud · 2024-12-20T14:09:08Z

There's a secondary problem here which is your assumption that a subquery with an ORDER BY will return rows to the outer in the order of the subquery's ORDER BY.

This is a common pitfall with SQL.

The solution is that you'll have to manually construct the distinct query while specifying an order_by argument to each first call, like so:

In [18]: expr = t1.group_by("person_id").agg(s.across(~s.cols("person_id"), _.first(order_by=lambda t: t.updated_at.desc(), include_null=True)))

In [19]: expr
Out[19]:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ person_id ┃ contacted_at ┃ updated_at ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64     │ date         │ date       │
├───────────┼──────────────┼────────────┤
│         1 │ NULL         │ 2024-01-01 │
└───────────┴──────────────┴────────────┘

In [20]: ibis.to_sql(expr)
Out[20]:
SELECT
  "t0"."person_id",
  FIRST("t0"."contacted_at" ORDER BY "t0"."updated_at" DESC) AS "contacted_at",
  FIRST("t0"."updated_at" ORDER BY "t0"."updated_at" DESC) AS "updated_at"
FROM "ibis_pandas_memtable_3oqi2bamczbh7bxoihwiv6ibwa" AS "t0"
GROUP BY
  1

Riezebos · 2024-12-20T15:31:04Z

Thanks for the solution and the clarification! That is exactly what I needed.

I was trying something similar before, but couldn't find documentation on first, only the first method in selectors: https://ibis-project.org/reference/selectors.html#ibis.selectors.first
Is it this one? https://ibis-project.org/reference/operations#ibis.expr.operations.reductions.First
And does that mean any class in operations.reductions can be used within .aggregate and that any kwargs will be passed to the backend?

I have read quite some ibis docs over the months and have learned a lot, but I don't fully understand this area yet.

Maybe I should change this feature request then. What I would find beneficial is in essence a full equivalent of pandas df.drop_duplicates(). The docs of .distinct say it is similar to it.

It'd be nice if distinct would have an option to accomplish the same as the group by version. E.g. with a syntax like

t.distinct(on="person_id", order_by=[_.updated_at.desc()] ,filter_nulls=False)

Riezebos · 2024-12-20T15:38:25Z

Ah sorry I found the correct docs: https://ibis-project.org/reference/expression-generic#ibis.expr.types.generic.Column.first

I can't find those docs by searching for "first" though, maybe because it has the exact same name as the other method called first?

Riezebos added the feature Features or general enhancements label Dec 9, 2024

github-project-automation bot added this to Ibis planning and roadmap Dec 9, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Dec 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add option to disable FILTER when using `distinct` #10567

feat: Add option to disable FILTER when using `distinct` #10567

Riezebos commented Dec 9, 2024

IndexSeek commented Dec 13, 2024

cpcloud commented Dec 20, 2024

Riezebos commented Dec 20, 2024

Riezebos commented Dec 20, 2024

feat: Add option to disable FILTER when using distinct #10567

feat: Add option to disable FILTER when using distinct #10567

Comments

Riezebos commented Dec 9, 2024

Is your feature request related to a problem?

What is the motivation behind your request?

Describe the solution you'd like

What version of ibis are you running?

What backend(s) are you using, if any?

Code of Conduct

IndexSeek commented Dec 13, 2024

cpcloud commented Dec 20, 2024

Riezebos commented Dec 20, 2024

Riezebos commented Dec 20, 2024

feat: Add option to disable FILTER when using `distinct` #10567

feat: Add option to disable FILTER when using `distinct` #10567