
Add DataFrame.unique_indices #194

Merged: 3 commits merged into data-apis:main on Jul 13, 2023

Conversation

@MarcoGorelli (Contributor) commented on Jul 3, 2023

Analogous to the already-existing Column.unique_indices.
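
For context, here is a minimal sketch of roughly what this adds to spec/API_specification/dataframe_api/dataframe_object.py, modelled on the existing Column.unique_indices; the skip_nulls keyword and the docstring wording are assumptions rather than quotes from the merged spec:

from __future__ import annotations

from typing import Sequence


class DataFrame:
    def unique_indices(self, keys: Sequence[str], *, skip_nulls: bool = True) -> "Column":
        """
        Return indices corresponding to unique combinations of values across `keys`.

        One index is returned per unique combination of values; which row of a
        given combination it points to is implementation-specific, as there is
        no guarantee that the DataFrame is sorted.
        """
        ...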

@rgommers changed the title to Add DataFrame.unique_indices on Jul 4, 2023
@rgommers (Member) left a comment


Thanks Marco. In #151 (comment) I think we wanted to keep this out, and pandas doesn't have it at the dataframe level. Is there a reason you need it now?

Two resolved review threads on spec/API_specification/dataframe_api/dataframe_object.py (outdated)
@MarcoGorelli (Contributor, Author)

> Is there a reason you need it now?

Yeah, it's come up in Plotly, where we need to iterate over groups (with more than one key). We intentionally don't allow iterating over groups, so this is the next-best way to get the same result.

@rgommers (Member)

> Yeah, it's come up in Plotly, where we need to iterate over groups (with more than one key). We intentionally don't allow iterating over groups

This seems to be a slightly more verbose way of achieving the same thing. @kkraus14 I believe you wanted to make it harder to iterate over groups in order to discourage that. So just checking: are you fine with adding this method as a solution for that?

@kkraus14 (Collaborator)

> This seems to be a slightly more verbose way of achieving the same thing. @kkraus14 I believe you wanted to make it harder to iterate over groups in order to discourage that. So just checking: are you fine with adding this method as a solution for that?

I don't believe this function allows iterating over groups? There's no guarantee that the DataFrame is sorted, so the indices can generally point to any row within a group and don't give you the information you need to get to the actual groups of rows. It does allow you to get the key values for each group and iterate over them, if that is what's desired.

In general, outside of grouping, this function is still useful, and there I don't have any opposition to it.
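
A rough pandas analogue of the behaviour Keith describes, for illustration only; the toy data and the use of drop_duplicates are assumptions, not part of the standard:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['x', 'x', 'y', 'y'], 'value': [10, 20, 30, 40]})

# One positional index per unique ('a', 'b') combination. Which row of each
# group it points to is unspecified in general (here drop_duplicates keeps the first).
unique_idx = df.reset_index(drop=True).drop_duplicates(['a', 'b']).index
print(list(unique_idx))  # [0, 2]

# These indices identify one representative row per group, not the full groups,
# but the key values at those rows can be used to reconstruct each group.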

@MarcoGorelli (Contributor, Author)

It allows you to do something like

unique_indices = df.unique_indices(['a', 'b'])
# One representative row per unique ('a', 'b') combination:
unique_values = df.get_rows(unique_indices)
for i in range(unique_values.shape()[0]):
    # Reconstruct the i-th group by masking on its key values:
    mask = (
        (df.get_column_by_name('a') == unique_values.get_column_by_name('a')[i])
        & (df.get_column_by_name('b') == unique_values.get_column_by_name('b')[i])
    )
    sub_df = df.get_rows(mask)
    # do some operation on sub_df

@jorisvandenbossche (Member)

That code example is essentially an inefficient way to mimic looping through groupby groups, so shouldn't we rather allow that?
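
For comparison, here is a pandas-specific illustration of the direct group iteration being referred to; GroupBy iteration is not part of the standard, and the toy data is invented:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'y', 'x'], 'value': [0.1, 0.2, 0.3]})

# pandas lets you iterate over a GroupBy directly: each iteration yields
# the group's key values and the sub-DataFrame for that group.
for (a, b), sub_df in df.groupby(['a', 'b']):
    ...  # do some operation on sub_df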

@MarcoGorelli (Contributor, Author)

Agree, thanks

In #131 Keith brought up:

> This has also been a historical performance footgun in cuDF since you can often run into situations where you have a large number of small groups.

But surely, for that use case, it's even more inefficient to keep having to filter as in the example above?

@kkraus14 (Collaborator)

What happens in the per-group iteration in Plotly? Standard-ish dataframe API usage, or arbitrary Python stuff? Ideally, we could eventually support doing this via some form of user-defined function, which would give dataframe libraries more control over how things happen, but I imagine that is quite a ways away.

Also, what happens after iterating over the groups? If it's arbitrary Python stuff again, maybe this is just where Plotly should pick a dataframe library like pandas, Polars, or Arrow, and we should focus our efforts on an API that can make an interchange object available on a specific device, so that there's a nice way to guarantee being able to get to a specific library when needed?

In general, if someone wants to iterate over groups in Python and there's a large number of small groups, it's going to have huge performance implications regardless, unless the library has some form of optimizer that can fuse things before execution.

@MarcoGorelli (Contributor, Author)

Here you go

https://github.com/plotly/plotly.py/blob/69b4d3e0a099e80bbc8fe6fd5391c8280e3de117/packages/python/plotly/plotly/express/_core.py#L1970-L1994

> If it's arbitrary Python stuff again, maybe this is just where Plotly should pick a dataframe library like pandas, Polars, or Arrow

Sure, but then what would be the point of them using the standard? Currently, they do:

  • if it's pandas, do pandas operations
  • if it's not pandas, convert to pandas, and then do the same pandas operations

The purpose of trying to get them to use the standard is to allow them to write library-agnostic dataframe code. If you're saying they'd need to pick a dataframe library like pandas anyway, then that would be no different from the status quo.

We could say "let's drop plotly", but we said the same about altair. Are we no longer interested in visualisation as a use case?

Alternatively, we could look at what these libraries are doing, and see if there's a way to let them rewrite their existing code in a library-agnostic manner so that they don't have a hard dependency on pandas. That, at least, is what I'm aiming for.

@kkraus14 (Collaborator)

The idea would be that they can do the heavy lifting and 90+% of the work in a library-agnostic way, and then, when they hit the last mile where they need to do something more custom and less performance-sensitive, they can use a clear escape hatch to get to a library of their choice.

For example, for a visualization library you could imagine needing to join a few tables, run some supplemental data-cleaning transformations that could be expensive (e.g. something regex-based), and then aggregate the data in some meaningful way. Once the data is aggregated and the table size is significantly reduced, they then want to iterate over the rows / groups to do something that doesn't fit nicely into any user-defined function we could imagine supporting in the standard (e.g. something where there's state associated with the UDF that requires it to be run serially, as each iteration updates the state), so they get the data into a library of their choice, where they can then do their last-mile custom functionality that is not "standard friendly".
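
A minimal sketch of this escape-hatch pattern, assuming the consumer picks pandas and uses the dataframe interchange protocol via pandas.api.interchange.from_dataframe; the stateful loop and the 'value' column name are invented for illustration:

import pandas as pd


def last_mile(std_df):
    # `std_df` can be any dataframe exposing the interchange protocol
    # (a __dataframe__ method); the standard-friendly heavy lifting
    # (joins, cleaning, aggregation) has already happened upstream.
    pdf = pd.api.interchange.from_dataframe(std_df)

    # Stateful, serial last-mile work that doesn't fit a standard UDF:
    # each iteration updates state carried over from the previous one.
    state = 0.0
    results = []
    for row in pdf.itertuples(index=False):
        state = 0.5 * state + row.value  # 'value' is an assumed column name
        results.append(state)
    return results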

It seems there are two different focuses here:

  • It seems like your focus is being able to remove dependencies, i.e., pandas, from libraries by allowing them to write purely standard code that is library agnostic. Is this accurate?
  • My focus is on enabling tapping into different libraries for their execution backends to take advantage of their performance, scalability, and/or other desirable traits.

@MarcoGorelli (Contributor, Author) commented on Jul 12, 2023

There's no such "heavy lifting" in plotly (nor in any other visualisation library, as far as I'm aware): plotly/plotly.py#3901 (comment). Are you suggesting that we drop them completely as a use case?
If so, then we're quickly running out of use cases.

Regarding focuses: my focus is on both. That's why I'd like libraries like plotly to get all the way there using the standard:

  • Repeatedly converting to/from pandas is memory-expensive: interchanging to pandas is often not zero-copy (e.g. if there are missing data, categoricals, or nested dtypes).
  • By not allowing GroupBy.__iter__, users are forced into writing inefficient code using unique_indices.

@MarcoGorelli (Contributor, Author)

Anyway, I'll use DataFrame.unique_indices for now; we can always make it more efficient / convenient later.

Any objections to getting this in?

@MarcoGorelli merged commit 3111499 into data-apis:main on Jul 13, 2023