Add DataFrame.unique_indices
#194
Conversation
Thanks Marco. In #151 (comment) I think we wanted to keep this out, and pandas doesn't have it at the dataframe level. Is there a reason you need it now?
Yeah, it's come up in plotly, where we need to iterate over groups (with more than one key), but we intentionally don't allow iterating over groups, so this is the next thing we can do to get the same result.
This is a slightly more verbose way of achieving the same thing, it seems. @kkraus14 I believe you wanted to make it harder to iterate over groups in order to discourage that. So just checking: are you fine with adding this method as a solution for that?
I don't believe this function allows iterating over groups? There's no guarantee that the DataFrame is sorted, so the indices can generally point to any row in the group, and they don't give you the information you need to get to the actual groups of rows. It does allow you to get one representative row per distinct key combination. In general, outside of grouping, this function is still useful, and I don't have any opposition to it.
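To make that point concrete, here is a small illustration (using pandas as a stand-in, not the standard's API; the column names and data are made up) of why one index per distinct key combination does not by itself recover groups:

```python
import pandas as pd

# Illustration only: pandas stands in for a hypothetical implementation of
# DataFrame.unique_indices(['a', 'b']); data is invented for the example.
df = pd.DataFrame({"a": [1, 2, 1, 2, 1], "b": ["x", "y", "x", "y", "z"]})

# One representative row index per distinct ('a', 'b') combination...
unique_indices = df.drop_duplicates(subset=["a", "b"]).index.to_list()
print(unique_indices)  # [0, 1, 4]

# ...but each index points at a single row, not at a group: nothing here
# tells you that rows 0 and 2 belong to the same ('a', 'b') group.
```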
It allows you to do something like:

```python
unique_indices = df.unique_indices(['a', 'b'])
unique_values = df.get_rows(unique_indices)
for i in range(unique_values.shape()[0]):
    mask = (
        (df.get_column_by_name('a') == unique_values.get_column_by_name('a')[i])
        & (df.get_column_by_name('b') == unique_values.get_column_by_name('b')[i])
    )
    sub_df = df.get_rows(mask)
    # do some operation on sub_df
```
That code example is essentially an inefficient way to mimic looping through groupby groups, so shouldn't we rather allow that?
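For comparison, the direct groupby iteration being mimicked looks like this in pandas (a sketch with made-up data; the standard intentionally has no equivalent today):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "y"], "val": [10, 20, 30]})

# One pass over the data: each distinct ('a', 'b') key yields its sub-frame
# directly, instead of re-filtering the whole frame once per unique row.
for (a, b), sub_df in df.groupby(["a", "b"]):
    print(a, b, sub_df["val"].sum())
```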
Agree, thanks. In #131 Keith brought up:

but surely for that use case it's even more inefficient to keep having to filter as in the example above?
What happens in the per-group iteration in Plotly? Standard-ish dataframe API usage or arbitrary Python stuff? Ideally, we could eventually support doing this via some form of user-defined function, which would give dataframe libraries more control over how things happen, but I imagine that is quite a ways away.

Also, what happens after iterating over the groups? If it's arbitrary Python stuff again, maybe this is just where Plotly should pick a DataFrame library like Pandas or Polars or Arrow etc., and we should focus our efforts towards an API that can get an interchange object available on a specific device, so that there's a nice way to guarantee being able to get to a specific library when needed?

In general, if someone wants to iterate over groups in Python and there's a large number of small groups, it's going to have huge performance implications no matter what, unless the library has some form of optimizer that can fuse things before execution.
Here you go
Sure, but then what would be the point of them using the standard? Currently, they do:

The purpose of trying to get them to use the standard is to allow them to write library-agnostic dataframe code. If you're saying they need to pick a dataframe library like pandas, then that would be no different from the status quo.

We could say "let's drop plotly", but we said the same about altair. Are we no longer interested in visualisation as a use case?

Alternatively, we could look at what these libraries are doing, and see if there's a way to let them rewrite their existing code in a library-agnostic manner so that they don't have a hard dependency on pandas. That, at least, is what I'm aiming for.
The idea would be that they can do the heavy lifting, 90+% of the work, in a library-agnostic way, and then when they hit the last mile, where they need to do something more custom and less performance-sensitive, they can use a clear escape hatch to get to a library of choice.

I.e. for a visualization library you could imagine needing to join a few tables, run some supplemental data-cleaning transformations that could be expensive (e.g. something regex-based), and then aggregate the data in some meaningful way. Once the data is aggregated and the table size is significantly reduced, they then want to iterate over the rows / groups to do something that doesn't nicely fit into any user-defined function we could ever imagine supporting in the standard (e.g. something where there's state associated with the UDF that requires it to be run serially, as each iteration updates the state), so they get the data into a library of choice where they can then do their last-mile custom functionality that is not "standard friendly".

It seems there are two different focuses here:
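The "heavy lifting in the standard, escape hatch for the last mile" split described above might be sketched like this (plain pandas stands in for both halves; the function names and the running-total loop are invented for illustration):

```python
import pandas as pd

def heavy_lifting(df: pd.DataFrame) -> pd.DataFrame:
    # Joins, cleaning, and aggregation: the performance-sensitive bulk of
    # the work, all expressible in library-agnostic (standard-API) terms.
    return df.groupby("key", as_index=False)["value"].sum()

def last_mile(agg: pd.DataFrame) -> list:
    # Serial, stateful iteration over the (now small) aggregated table:
    # exactly the kind of loop a portable UDF API would struggle to cover.
    state, out = 0, []
    for row in agg.itertuples(index=False):
        state += row.value  # each step depends on the previous one
        out.append((row.key, state))
    return out
```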
There's no such "heavy lifting" in plotly (nor in any other visualisation library, as far as I'm aware): plotly/plotly.py#3901 (comment). Are you suggesting that we drop them completely as a use case?

Regarding focuses: my focus is on both. That's why I'd like libraries like plotly to get all the way there using the standard:
Anyway, I'll use this approach. Any objections to getting this in?
analogous to the already-existing Column.unique_indices