Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement {Series,DataFrame}GroupBy fillna methods #8869

Merged
merged 27 commits into from
May 10, 2022

Conversation

pavithraes
Copy link
Member

@pavithraes pavithraes commented Mar 31, 2022

TODO before merging:

@pavithraes pavithraes marked this pull request as draft March 31, 2022 15:33
Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pavithraes, this is looking great! The implementation is looking good to me, I think we could get some more coverage of the optional args, but otherwise I think this looks close.

I also notice that there are forwardfill and backfill aliases in pandas, we could think about adding those as well while we're at it.

dask/dataframe/groupby.py Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
@pavithraes
Copy link
Member Author

@ian-r-rose Thanks for the review!

I also notice that there are forwardfill and backfill aliases in pandas, we could think about adding those as well while we're at it.

Looks like pad and backfill has been deprecated in pandas, so I think we don't need to add them?

I also noticed the transform implementation doesn't work for axis=1, it throws: ValueError: transform must return a scalar value for each group. I'm assuming this is because transform applies the function on a column-by-column basis?

Full traceback
ddf.groupby("A").fillna(0, axis=1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    545         try:
--> 546             return cls._AXIS_TO_AXIS_NUMBER[axis]
    547         except KeyError:

KeyError: 1

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
   1316             try:
-> 1317                 path, res = self._choose_path(fast_path, slow_path, group)
   1318             except TypeError:

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _choose_path(self, fast_path, slow_path, group)
   1393         path = slow_path
-> 1394         res = slow_path(group)
   1395 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(group)
   1386             fast_path = lambda group: func(group, *args, **kwargs)
-> 1387             slow_path = lambda group: group.apply(
   1388                 lambda x: func(x, *args, **kwargs), axis=self.axis

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8739         )
-> 8740         return op.apply()
   8741 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
    687 
--> 688         return self.apply_standard()
    689 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
    827                 # ignore SettingWithCopy here in case the user mutates
--> 828                 results[i] = self.f(v)
    829                 if isinstance(results[i], ABCSeries):

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in <lambda>(x)
   1387             slow_path = lambda group: group.apply(
-> 1388                 lambda x: func(x, *args, **kwargs), axis=self.axis
   1389             )

~/Developer/Dask/dask/dask/dataframe/groupby.py in _fillna_groups(self, groups, **kwargs)
   1987     def _fillna_groups(self, groups, **kwargs):
-> 1988         return groups.fillna(**kwargs)
   1989 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    310                 )
--> 311             return func(*args, **kwargs)
    312 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4815     ) -> Series | None:
-> 4816         return super().fillna(
   4817             value=value,

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6320             axis = 0
-> 6321         axis = self._get_axis_number(axis)
   6322 

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    547         except KeyError:
--> 548             raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
    549 

ValueError: No axis named 1 for object type Series

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-4-f9b1a98697a0> in <module>
----> 1 ddf.groupby("A").fillna(0, axis=1)

~/Developer/Dask/dask/dask/dataframe/groupby.py in fillna(self, value, method, limit, axis)
   1993                 "groupby-fillna with value=dict/Series/DataFrame is currently not supported"
   1994             )
-> 1995         meta = self._meta_nonempty.transform(
   1996             self._fillna_groups, value=value, method=method, limit=limit, axis=axis
   1997         )

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
   1355     @Appender(_transform_template)
   1356     def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
-> 1357         return self._transform(
   1358             func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
   1359         )

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in _transform(self, func, engine, engine_kwargs, *args, **kwargs)
   1444 
   1445         if not isinstance(func, str):
-> 1446             return self._transform_general(func, *args, **kwargs)
   1447 
   1448         elif func not in base.transform_kernel_allowlist:

~/mambaforge/envs/dask-dev/lib/python3.9/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
   1320             except ValueError as err:
   1321                 msg = "transform must return a scalar value for each group"
-> 1322                 raise ValueError(msg) from err
   1323 
   1324             if isinstance(res, Series):

ValueError: transform must return a scalar value for each group

@ian-r-rose
Copy link
Collaborator

Looks like pad and backfill has been deprecated in pandas, so I think we don't need to add them?

Sounds good to me, good catch.

I also noticed the transform implementation doesn't work for axis=1

Interesting! Let's take a look in person.

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
@pavithraes pavithraes marked this pull request as ready for review April 7, 2022 14:13
@ian-r-rose ian-r-rose self-requested a review April 7, 2022 16:04
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pavithraes, this is looking close.

I did some experimenting with grouping by multiple columns, and ran into some troubles with transform not liking empty groups very much, but apply did just fine. I wonder if we should just simplify things by using apply everywhere

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
@pavithraes
Copy link
Member Author

I did some experimenting with grouping by multiple columns, and ran into some troubles with transform not liking empty groups very much, but apply did just fine. I wonder if we should just simplify things by using apply everywhere

Thanks for checking! I agree that we can use apply throughout. I'll also include a test for grouping by multiple columns. :)

@github-actions github-actions bot added the documentation Improve or add to documentation label Apr 8, 2022
@ian-r-rose
Copy link
Collaborator

@pavithraes This change in upstream pandas might also affect what you are doing here.

@pavithraes
Copy link
Member Author

This change in upstream pandas might also affect what you are doing here.

Thanks for the note! I just triggered an upstream build to check and it does seem to be affected. :/

@pavithraes
Copy link
Member Author

pavithraes commented Apr 12, 2022

@jrbourbeau Thanks for helping review this!

To solve the upstream build issues, we need to add group_keys=False to the apply calls in this PR (it's False because fillna disregards the columns we grouped-by in the output.)

I think we can't add that right now because Dask's groupby-apply doesn't support the group_keys parameter yet, so we may need to resolve this with the other failing pandas-upstream-tests in #8875?

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased your PR on top of #8961 to see if it still works with the group_keys changes, and it does 🎉 !

The bad news is: with the group_keys changes I think there is one more thing we need to cover. I tried parameterizing your test_fillna over group_keys=[True, False, None]. Everything works on pandas 1.4.x, but on 1.5.x, group_keys=True presents a problem: in that case, the apply returns with extra multiindex columns for your grouped-by items.

I think the fix should be fairly straightforward, however. For the case where self.group_keys is True in the dd.GroupBy object, we can do a final map_partitions to df.droplevel(by). That is to say, we can easily drop the extra index columns at the end when group_keys is True. It's annoying, but doable

@pavithraes
Copy link
Member Author

@ian-r-rose Thanks for the notes! I've pushed the map_partitions fix that you suggested, and it seems to work locally with pandas 1.4.2 and 1.5.0!

@pavithraes pavithraes requested a review from ian-r-rose April 26, 2022 12:27
@pavithraes
Copy link
Member Author

pavithraes commented Apr 28, 2022

I'm not sure why the macos-python-3.8 test is failing with: https://github.com/dask/dask/runs/6209266570?check_suite_focus=true#step:6:21560 :/

Maybe it's related to #8889 (comment)?

Edit: Looks like it was a one-off.

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be slow in reviewing @pavithraes, and thanks for your persistence. This looks great! I have one minor comment that I would consider optional. But from my perspective, this is ready to go into our next release!

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pavithraes!

@pavithraes
Copy link
Member Author

@jrbourbeau I think this is ready to merge! Could you please help take a quick look?

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pavithraes and @ian-r-rose! There's a merge conflict that's popped up (apologies for the delayed response). @pavithraes would you mind fixing that conflict? Otherwise, this looks good to go

@pavithraes
Copy link
Member Author

@jrbourbeau Thanks for taking a look! I've fixed the conflict. :)

@jrbourbeau jrbourbeau merged commit 5fbda77 into dask:main May 10, 2022
@ian-r-rose
Copy link
Collaborator

Thank you @pavithraes!

@pavithraes pavithraes deleted the groupby-fillna branch May 11, 2022 13:01
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataframe documentation Improve or add to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing Built-in methods for SeriesGroupBy ffill
3 participants