
Add DataFrame.unique_indices #194

Merged: 3 commits merged into data-apis:main on Jul 13, 2023

Conversation

@MarcoGorelli (Contributor) commented on Jul 3, 2023

Analogous to the already-existing Column.unique_indices.
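
For context, here is a minimal sketch of roughly what this adds to spec/API_specification/dataframe_api/dataframe_object.py, modelled on the existing Column.unique_indices; the skip_nulls keyword and the docstring wording are assumptions rather than quotes from the merged spec:

from __future__ import annotations

from typing import Sequence


class DataFrame:
    def unique_indices(self, keys: Sequence[str], *, skip_nulls: bool = True) -> "Column":
        """
        Return indices corresponding to unique combinations of values across `keys`.

        One index is returned per unique combination of values; which row of a
        given combination it points to is implementation-specific, as there is
        no guarantee that the DataFrame is sorted.
        """
        ...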

@rgommers changed the title to Add DataFrame.unique_indices on Jul 4, 2023
@rgommers (Member) left a comment


Thanks Marco. In #151 (comment) I think we wanted to keep this out, and pandas doesn't have it at the dataframe level. Is there a reason you need it now?

Two resolved review threads on spec/API_specification/dataframe_api/dataframe_object.py (outdated)
@MarcoGorelli (Contributor, Author)

> Is there a reason you need it now?

Yeah, it's come up in Plotly, where we need to iterate over groups (with more than one key). We intentionally don't allow iterating over groups, so this is the next-best way to get the same result.

@rgommers (Member)

> Yeah, it's come up in Plotly, where we need to iterate over groups (with more than one key). We intentionally don't allow iterating over groups

This seems to be a slightly more verbose way of achieving the same thing. @kkraus14 I believe you wanted to make it harder to iterate over groups in order to discourage that. So just checking: are you fine with adding this method as a solution for that?

@kkraus14 (Collaborator)

> This seems to be a slightly more verbose way of achieving the same thing. @kkraus14 I believe you wanted to make it harder to iterate over groups in order to discourage that. So just checking: are you fine with adding this method as a solution for that?

I don't believe this function allows iterating over groups? There's no guarantee that the DataFrame is sorted, so the indices can generally point to any row within a group and don't give you the information you need to get to the actual groups of rows. It does allow you to get the key values for each group and iterate over them, if that is what's desired.

In general, outside of grouping, this function is still useful, and there I don't have any opposition to it.
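
A rough pandas analogue of the behaviour Keith describes, for illustration only; the toy data and the use of drop_duplicates are assumptions, not part of the standard:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['x', 'x', 'y', 'y'], 'value': [10, 20, 30, 40]})

# One positional index per unique ('a', 'b') combination. Which row of each
# group it points to is unspecified in general (here drop_duplicates keeps the first).
unique_idx = df.reset_index(drop=True).drop_duplicates(['a', 'b']).index
print(list(unique_idx))  # [0, 2]

# These indices identify one representative row per group, not the full groups,
# but the key values at those rows can be used to reconstruct each group.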

@MarcoGorelli (Contributor, Author)

It allows you to do something like

unique_indices = df.unique_indices(['a', 'b'])
# One representative row per unique ('a', 'b') combination:
unique_values = df.get_rows(unique_indices)
for i in range(unique_values.shape()[0]):
    # Reconstruct the i-th group by masking on its key values:
    mask = (
        (df.get_column_by_name('a') == unique_values.get_column_by_name('a')[i])
        & (df.get_column_by_name('b') == unique_values.get_column_by_name('b')[i])
    )
    sub_df = df.get_rows(mask)
    # do some operation on sub_df

@jorisvandenbossche (Member)

That code example is essentially an inefficient way to mimic looping through groupby groups, so shouldn't we rather allow that?
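
For comparison, here is a pandas-specific illustration of the direct group iteration being referred to; GroupBy iteration is not part of the standard, and the toy data is invented:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'y', 'x'], 'value': [0.1, 0.2, 0.3]})

# pandas lets you iterate over a GroupBy directly: each iteration yields
# the group's key values and the sub-DataFrame for that group.
for (a, b), sub_df in df.groupby(['a', 'b']):
    ...  # do some operation on sub_df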

@MarcoGorelli (Contributor, Author)

Agree, thanks

In #131 Keith brought up:

> This has also been a historical performance footgun in cuDF since you can often run into situations where you have a large number of small groups.

But surely, for that use case, it's even more inefficient to keep having to filter as in the example above?

@kkraus14 (Collaborator)

What happens in the per-group iteration in Plotly? Standard-ish dataframe API usage, or arbitrary Python stuff? Ideally, we could eventually support doing this via some form of user-defined function, which would give dataframe libraries more control over how things happen, but I imagine that is quite a ways away.

Also, what happens after iterating over the groups? If it's arbitrary Python stuff again, maybe this is just where Plotly should pick a dataframe library like pandas, Polars, or Arrow, and we should focus our efforts on an API that can make an interchange object available on a specific device, so that there's a nice way to guarantee being able to get to a specific library when needed?

In general, if someone wants to iterate over groups in Python and there's a large number of small groups, it's going to have huge performance implications regardless, unless the library has some form of optimizer that can fuse things before execution.

@MarcoGorelli (Contributor, Author)

Here you go

https://github.com/plotly/plotly.py/blob/69b4d3e0a099e80bbc8fe6fd5391c8280e3de117/packages/python/plotly/plotly/express/_core.py#L1970-L1994

> If it's arbitrary Python stuff again, maybe this is just where Plotly should pick a dataframe library like pandas, Polars, or Arrow

Sure, but then what would be the point of them using the standard? Currently, they do:

  • if it's pandas, do pandas operations
  • if it's not pandas, convert to pandas, and then do the same pandas operations

The purpose of trying to get them to use the standard is to allow them to write library-agnostic dataframe code. If you're saying they'd need to pick a dataframe library like pandas anyway, then that would be no different from the status quo.

We could say "let's drop plotly", but we said the same about altair. Are we no longer interested in visualisation as a use case?

Alternatively, we could look at what these libraries are doing, and see if there's a way to let them rewrite their existing code in a library-agnostic manner so that they don't have a hard dependency on pandas. That, at least, is what I'm aiming for.

@kkraus14 (Collaborator)

The idea would be that they can do the heavy lifting and 90+% of the work in a library-agnostic way, and then, when they hit the last mile where they need to do something more custom and less performance-sensitive, they can use a clear escape hatch to get to a library of their choice.

For example, for a visualization library you could imagine needing to join a few tables, run some supplemental data-cleaning transformations that could be expensive (e.g. something regex-based), and then aggregate the data in some meaningful way. Once the data is aggregated and the table size is significantly reduced, they then want to iterate over the rows / groups to do something that doesn't fit nicely into any user-defined function we could imagine supporting in the standard (e.g. something where there's state associated with the UDF that requires it to be run serially, as each iteration updates the state), so they get the data into a library of their choice, where they can then do their last-mile custom functionality that is not "standard friendly".
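
A minimal sketch of this escape-hatch pattern, assuming the consumer picks pandas and uses the dataframe interchange protocol via pandas.api.interchange.from_dataframe; the stateful loop and the 'value' column name are invented for illustration:

import pandas as pd


def last_mile(std_df):
    # `std_df` can be any dataframe exposing the interchange protocol
    # (a __dataframe__ method); the standard-friendly heavy lifting
    # (joins, cleaning, aggregation) has already happened upstream.
    pdf = pd.api.interchange.from_dataframe(std_df)

    # Stateful, serial last-mile work that doesn't fit a standard UDF:
    # each iteration updates state carried over from the previous one.
    state = 0.0
    results = []
    for row in pdf.itertuples(index=False):
        state = 0.5 * state + row.value  # 'value' is an assumed column name
        results.append(state)
    return results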

It seems there are two different focuses here:

  • It seems like your focus is being able to remove dependencies, i.e., pandas, from libraries by allowing them to write purely standard code that is library agnostic. Is this accurate?
  • My focus is on enabling tapping into different libraries for their execution backends to take advantage of their performance, scalability, and/or other desirable traits.

@MarcoGorelli (Contributor, Author) commented on Jul 12, 2023

There's no such "heavy lifting" in plotly (nor in any other visualisation library, as far as I'm aware): plotly/plotly.py#3901 (comment). Are you suggesting that we drop them completely as a use case?
If so, then we're quickly running out of use cases.

Regarding focuses: my focus is on both. That's why I'd like libraries like plotly to get all the way there using the standard:

  • Repeatedly converting to/from pandas is memory-expensive: interchanging to pandas is often not zero-copy (e.g. if there are missing data, categoricals, or nested dtypes).
  • By not allowing GroupBy.__iter__, users are forced into writing inefficient code using unique_indices.

@MarcoGorelli (Contributor, Author)

Anyway, I'll use DataFrame.unique_indices for now; we can always make it more efficient / convenient later.

Any objections to getting this in?

@MarcoGorelli merged commit 3111499 into data-apis:main on Jul 13, 2023