Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): option to include nulls in argmin/argmax #10193

Closed
1 task done
chloeh13q opened this issue Sep 23, 2024 · 3 comments · Fixed by #10227
Closed
1 task done

feat(api): option to include nulls in argmin/argmax #10193

chloeh13q opened this issue Sep 23, 2024 · 3 comments · Fixed by #10227
Assignees
Labels
feature Features or general enhancements
Milestone

Comments

@chloeh13q
Copy link
Contributor

Is your feature request related to a problem?

I have a table like the following

+----+--------------+----------------------+---------------------+
|    |   AIRPORT_ID | AIRPORT_START_DATE   | AIRPORT_THRU_DATE   |
|----+--------------+----------------------+---------------------|
|  0 |        10124 | 1950-01-01 00:00:00  | 2011-06-30 00:00:00 |
|  1 |        10124 | 2011-07-01 00:00:00  | 2012-03-31 00:00:00 |
|  2 |        10124 | 2012-04-01 00:00:00  | 2017-10-31 00:00:00 |
|  3 |        10124 | 2017-11-01 00:00:00  | 2017-11-30 00:00:00 |
|  4 |        10124 | 2017-12-01 00:00:00  | NaT                 |
|  5 |        10145 | 1950-01-01 00:00:00  | 2011-06-30 00:00:00 |
|  6 |        10145 | 2011-07-01 00:00:00  | NaT                 |
+----+--------------+----------------------+---------------------+

pre-sorted by date. I want to find the latest start and thru dates of each airport like

+----+--------------+----------------------+---------------------+
|    |   AIRPORT_ID | AIRPORT_START_DATE   | AIRPORT_THRU_DATE   |
|----+--------------+----------------------+---------------------|
|  0 |        10124 | 2017-12-01 00:00:00  | NaT                 |
|  1 |        10145 | 2011-07-01 00:00:00  | NaT                 |
+----+--------------+----------------------+---------------------+

Using distinct() ignores nulls:

+----+--------------+----------------------+---------------------+
|    |   AIRPORT_ID | AIRPORT_START_DATE   | AIRPORT_THRU_DATE   |
|----+--------------+----------------------+---------------------|
|  0 |        10124 | 2017-12-01 00:00:00  | 2017-11-30 00:00:00 |
|  1 |        10145 | 2011-07-01 00:00:00  | 2011-06-30 00:00:00 |
+----+--------------+----------------------+---------------------+

and aggregate functions like max() and argmax() do not have an option to ignore nulls.

What is the motivation behind your request?

No response

Describe the solution you'd like

To have a parameter ignore_nulls in aggregation functions like argmax().

What version of ibis are you running?

main

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@chloeh13q chloeh13q added the feature Features or general enhancements label Sep 23, 2024
@jcrist jcrist changed the title feat(API): option to include nulls on aggregation feat(api): option to include nulls in argmin/argmax Sep 23, 2024
@jcrist
Copy link
Member

jcrist commented Sep 23, 2024

To answer your original question, this would be most idiomatically expressed as a groupby with last(order_by=..., include_null=True). For example:

In [1]: import ibis

In [2]: ibis.options.interactive = True

In [3]: t = ibis.memtable(
   ...:     {"id": [1, 1, 1, 1, 1, 2, 2],
   ...:      "start": [1, 2, 3, 4, 5, 10, 20],
   ...:      "end": ['a', 'b', 'c', 'd', None, 'e', None]}
   ...: )

In [4]: t.group_by("id").agg(
   ...:     start=t.start.max(),
   ...:     end=t.end.last(order_by="start", include_null=True),
   ...: )
Out[4]: 
┏━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ idstartend    ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ int64int64string │
├───────┼───────┼────────┤
│     220NULL   │
│     15NULL   │
└───────┴───────┴────────┘

This gets the max start value per group, as well as the last end value in the group ordered by start (effectively imitating argmax).


For the actual request of supporting including NULL values in argmin/argmax, this is a bit tricky. No backend supports configuring null behavior succinctly for this call (no RESPECT NULLS option). For backends that include null values (trino, bigquery, snowflake) we can probably rewrite the call to ignore nulls. For backends that ignore null values (clickhouse, duckdb) I don't think we'll be able to do the opposite. Right now the null handling behavior of argmin/argmax is backend dependent and unspecified - we should definitely spend some time standardizing on one or the other behavior.

@jcrist jcrist self-assigned this Sep 23, 2024
@jcrist
Copy link
Member

jcrist commented Sep 23, 2024

Thinking about this more, I think we always want to respect nulls for these functions (and treat the current duckdb/clickhouse behavior as a bug). The point of argmin/argmax is to find the value in one column that corresponds to the min/max value in a different column. Skipping null values for a in a.argmin(b) would mean that agg(min=b.min(), a=a.argmin(b)) could return values from two different rows if a is null, which would be very misleading. Instead I think we should standardize on

a.argmin(b)

  • will skip null values in b (all backends do this, same as b.min())
  • will not skip null values in a (most but not all backends do this), meaning argmin may return a null value but will always return a value associated with the min value of b.

The only todo on our part then is to better document this, add a test, and fix the bad implementations. I have some workarounds in mind for duckdb and clickhouse, which I think are the only offenders here.

@chloeh13q
Copy link
Contributor Author

@jcrist - Hmm okay I see. The reasoning makes sense to me, and I think using last(order_by=..., include_nulls=True) is a reasonable workaround. Agree with better documentation, etc as well. Thanks for looking into the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants