Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are we allowing iterating over GroupBy? #131

Closed
MarcoGorelli opened this issue Apr 4, 2023 · 7 comments
Closed

Are we allowing iterating over GroupBy? #131

MarcoGorelli opened this issue Apr 4, 2023 · 7 comments

Comments

@MarcoGorelli
Copy link
Contributor

For example in seaborn we see this:

https://github.com/mwaskom/seaborn/blob/5d9f37159bbd3ac44c8c8a06825583ba25648525/seaborn/_oldcore.py#L1253-L1261

            # Now actually update the matplotlib objects to do the conversion we want
            grouped = self.plot_data[var].groupby(self.converters[var], sort=False)
            for converter, seed_data in grouped:
                if self.var_types[var] == "categorical":
                    if self._var_ordered[var]:
                        order = self.var_levels[var]
                    else:
                        order = None
                    seed_data = categorical_order(seed_data, order)
                converter.update_units(seed_data)

Currently, both pandas and polars allow it: if you iterate of a GroupBy object, you get a tuple where the first element is the key and the second is the subset of the DataFrame corresponding to that key

@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 5, 2023

I'm -1 on including this in a first revision of the standard. This is effectively to allow per group user defined functions and we've generally punted on user defined functions thus far.

This has also been a historical performance footgun in cuDF since you can often run into situations where you have a large number of small groups.

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Apr 5, 2023

ok sounds good

if seaborn wanted to use the standard, they'd have to do a pretty extensive refactor anyway, though that'd probably benefit them in the long-run anyway

@jorisvandenbossche
Copy link
Member

if seaborn wanted to use the standard, they'd have to do a pretty extensive refactor, though that'd probably benefit them in the long-run anyway

How would the example snippet be done without groupby? Getting the unique values, and then iterating over those and in the loop each time filter the data with a mask for equal to that one unique value?

@MarcoGorelli
Copy link
Contributor Author

Getting the unique values, and then iterating over those and in the loop each time filter the data with a mask for equal to that one unique value?

Yeah sounds fine

@MarcoGorelli
Copy link
Contributor Author

Are we OK to just not do anything here? As in, leave it out of the Standard. then, if anyone wants to allow it, they're free to do so, but whatever, it won't be part of the Standard

just like how for the Array API, if people want to allow extra things that aren't in the Standard (like string dtypes in numpy), they're free to do so

@rgommers
Copy link
Member

That seems reasonable to me.

@MarcoGorelli
Copy link
Contributor Author

sure, let's do that then, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants