Add collect list to dask-cudf groupby aggregations #8045

charlesbluca · 2021-04-23T13:55:08Z

Adds support for cuDF's collect aggregation in dask-cuDF.

charlesbluca · 2021-04-23T13:56:12Z

To illustrate the problem:

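>>> from io import StringIO
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> import cudf
>>> import dask_cudf

>>> data = """a,b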
    1595802,1611:0.92
    1595802,1610:0.07
    1524246,1807:0.92
    1524246,1608:0.07"""

>>> df = pd.read_csv(StringIO(data))
>>> ddf = dd.from_pandas(df, 2)

>>> gdf = cudf.from_pandas(df)
>>> gddf = dask_cudf.from_cudf(gdf, 2)

>>> print(ddf.groupby("a").agg({"b":list}).compute())
                              b
a                              
1595802  [1611:0.92, 1610:0.07]
1524246  [1807:0.92, 1608:0.07]
>>> print(gddf.groupby("a").agg({"b":list}).compute())
                                b
a                                
1524246  [[1807:0.92, 1608:0.07]]
1595802  [[1611:0.92, 1610:0.07]]
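
The extra nesting is what one would expect if each partition first collects its own lists and a second collect is then applied to those partial results. A minimal pure-Python sketch of that effect, assuming two partitions and using hypothetical helper names (`collect_partition`, `combine_naive`) rather than any dask-cuDF internals:

```python
from collections import defaultdict


def collect_partition(rows):
    """Group (key, value) rows from one partition into per-key lists."""
    out = defaultdict(list)
    for key, value in rows:
        out[key].append(value)
    return dict(out)


def combine_naive(partials):
    """Collect the per-partition lists again, producing a list of lists."""
    out = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            # Appending a whole list (not its items) is where nesting appears.
            out[key].append(values)
    return dict(out)


# Hypothetical split of the example rows into two partitions.
partitions = [
    [(1595802, "1611:0.92"), (1595802, "1610:0.07")],
    [(1524246, "1807:0.92"), (1524246, "1608:0.07")],
]
nested = combine_naive([collect_partition(p) for p in partitions])
print(nested[1524246])  # [['1807:0.92', '1608:0.07']]
```

The real reduction in dask-cuDF may be organized differently; the sketch only shows why a second collect step wraps already-collected lists.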

codecov · 2021-04-23T16:31:57Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@167c2b7).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8045   +/-   ##
===============================================
  Coverage                ?   10.64%           
===============================================
  Files                   ?      109           
  Lines                   ?    18653           
  Branches                ?        0           
===============================================
  Hits                    ?     1985           
  Misses                  ?    16668           
  Partials                ?        0

Powered by Codecov. Last update 167c2b7...405b2c7.

rjzamora

This looks great @charlesbluca - Thanks!

My only concern is that you are explicitly setting the index names to None in the test. Is the "collection" result somehow different from other aggregations?

rjzamora · 2021-07-06T12:53:35Z

python/dask_cudf/dask_cudf/groupby.py

+        list: "collect",
+        "list": "collect",


I like this approach.
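
A rough illustration of the idea in this hunk, not the actual dask-cuDF code: a lookup table maps both the builtin `list` and the string `"list"` to the `"collect"` aggregation name, so either spelling ends up on the same code path. The names `AGG_ALIASES` and `redirect_aggs` are hypothetical.

```python
# Hypothetical alias table; only these two entries come from the diff above.
AGG_ALIASES = {
    list: "collect",
    "list": "collect",
}


def redirect_aggs(aggs):
    """Rewrite a {column: agg or [aggs]} spec so aliases use canonical names."""
    redirected = {}
    for col, agg in aggs.items():
        if isinstance(agg, (list, tuple)):
            # A list of aggregations: translate each entry individually.
            redirected[col] = [AGG_ALIASES.get(a, a) for a in agg]
        else:
            # A single aggregation, which may be the builtin `list` itself.
            redirected[col] = AGG_ALIASES.get(agg, agg)
    return redirected


print(redirect_aggs({"b": list}))             # {'b': 'collect'}
print(redirect_aggs({"b": ["sum", "list"]}))  # {'b': ['sum', 'collect']}
```

Unknown aggregations pass through unchanged, which keeps the redirect step independent of whatever validation happens later.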

rjzamora · 2021-07-06T12:54:50Z

python/dask_cudf/dask_cudf/groupby.py

@@ -478,6 +513,9 @@ def _finalize_gb_agg(
                gb.drop(columns=[sum_name], inplace=True)
            if "count" not in agg_list:
                gb.drop(columns=[count_name], inplace=True)
+        if "collect" in agg_list:
+            collect_name = _make_name(col, "collect", sep=sep)
+            gb[collect_name] = gb[collect_name].list.concat()


Worth the wait - Thanks for this concat method @shwina :)
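
For readers unfamiliar with the list concatenation step, the finalize hunk above flattens each group's list of lists by one level. A pure-Python emulation of that effect, assuming the nested shape from the problem report; this does not call cuDF, and `flatten_one_level` is a hypothetical stand-in:

```python
from itertools import chain


def flatten_one_level(nested):
    """Concatenate the inner lists of one row, removing one nesting level."""
    return list(chain.from_iterable(nested))


collected = {
    1524246: [["1807:0.92", "1608:0.07"]],
    # Hypothetical case: one group whose rows were split across two partitions.
    1595802: [["1611:0.92"], ["1610:0.07"]],
}
flat = {key: flatten_one_level(value) for key, value in collected.items()}
print(flat[1595802])  # ['1611:0.92', '1610:0.07']
```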

rjzamora · 2021-07-06T13:08:12Z

python/dask_cudf/dask_cudf/tests/test_groupby.py

+@pytest.mark.parametrize(
+    "func",
+    [
+        lambda df: df.groupby("x").agg({"y": "collect"}),
+        pytest.param(
+            lambda df: df.groupby("x").y.agg("collect"), marks=pytest.mark.skip
+        ),
+    ],
+)


Is there any reason to define func this way? Am I misunderstanding, or will the second of the two cases always be skipped?

This param skip, and the index nulling, I lifted from another dask-cudf groupby test:

cudf/python/dask_cudf/dask_cudf/tests/test_groupby.py

Lines 47 to 77 in 5b8895d

@pytest.mark.parametrize(
    "func",
    [
        lambda df: df.groupby("x").agg({"y": "max"}),
        pytest.param(
            lambda df: df.groupby("x").y.agg(["sum", "max"]),
            marks=pytest.mark.skip,
        ),
    ],
)
def test_groupby_agg(func):
    pdf = pd.DataFrame(
        {
            "x": np.random.randint(0, 5, size=10000),
            "y": np.random.normal(size=10000),
        }
    )

    gdf = cudf.DataFrame.from_pandas(pdf)

    ddf = dask_cudf.from_cudf(gdf, npartitions=5)

    a = func(gdf).to_pandas()
    b = func(ddf).compute().to_pandas()

    a.index.name = None
    a.name = None
    b.index.name = None
    b.name = None

    dd.assert_eq(a, b)

I can see what happens when we don't set the index to None here, but when that test isn't skipped we get an AssertionError on the types:

__________________________________________ test_groupby_collect[<lambda>1] ___________________________________________

func = <function <lambda> at 0x7f3c804a0d30>

    @pytest.mark.parametrize(
        "func",
        [
            lambda df: df.groupby("x").agg({"y": "collect"}),
            lambda df: df.groupby("x").y.agg("collect"),
        ],
    )
    def test_groupby_collect(func):
        pdf = pd.DataFrame(
            {
                "x": np.random.randint(0, 5, size=10000),
                "y": np.random.normal(size=10000),
            }
        )

        gdf = cudf.DataFrame.from_pandas(pdf)

        ddf = dask_cudf.from_cudf(gdf, npartitions=5)

        a = func(gdf).to_pandas()
        b = func(ddf).compute().to_pandas()

        a.index.name = None
        a.name = None
        b.index.name = None
        b.name = None

>       dd.assert_eq(a, b)

dask_cudf/tests/test_groupby.py:155:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../compose/etc/conda/cuda_11.2.72/envs/rapids/lib/python3.8/site-packages/dask/dataframe/utils.py:559: in assert_eq
    tm.assert_series_equal(a, b, check_names=check_names, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

left = 0    [-0.02279966962796973, -0.2268040371246616, 0....
1    [1.0547561143327269, 0.07632651478542447, -0.0...
2    [1.... [0.35305396010499146, 2.022936601816015, -0.02...
4    [0.7639835327097312, 0.9458744987601149, 0.370...
dtype: object
right =                                                    y
0  [-0.02279966962796973, -0.2268040371246616, 0....
1  [1.054756...796, 1.498...
3  [0.35305396010499146, 2.022936601816015, -0.02...
4  [0.7639835327097312, 0.9458744987601149, 0.370...
cls = <class 'pandas.core.series.Series'>

    def _check_isinstance(left, right, cls):
        """
        Helper method for our assert_* methods that ensures that
        the two objects being compared have the right type before
        proceeding with the comparison.

        Parameters
        ----------
        left : The first object being compared.
        right : The second object being compared.
        cls : The class type to check against.

        Raises
        ------
        AssertionError : Either `left` or `right` is not an instance of `cls`.
        """
        cls_name = cls.__name__

        if not isinstance(left, cls):
            raise AssertionError(
                f"{cls_name} Expected type {cls}, found {type(left)} instead"
            )
        if not isinstance(right, cls):
>           raise AssertionError(
                f"{cls_name} Expected type {cls}, found {type(right)} instead"
            )
E           AssertionError: Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead

It looks like we are creating a dataframe here when we should be making a series.

Ah - It seems like this was already a problem before this PR. In that case, it is probably okay to fix that in a follow-up PR.

Sure, do you know if there is an open issue for this problem? If not, I can open one so we can keep track of the follow-up fix.

rjzamora

Thanks again @charlesbluca - Note that I filed #8655 to make sure we address the Series/DataFrame inconsistency discussed here.

charlesbluca · 2021-07-06T14:32:02Z

Thanks for opening the issue @rjzamora!

charlesbluca · 2021-07-06T16:20:39Z

@gpucibot merge

charlesbluca added 2 commits April 16, 2021 08:15

Add naive list collect agg to dask-cudf

df93b7f

Merge remote-tracking branch 'upstream/branch-0.20' into dask-collect…

63e4d62

…-list

github-actions bot added the Python Affects Python cuDF API. label Apr 23, 2021

Add collect support for dask-cudf series

b0218b4

charlesbluca mentioned this pull request Apr 23, 2021

Redirect callable aggregations to their named equivalent in dask-cuDF #8048

Merged

kkraus14 added bug Something isn't working dask Dask issue non-breaking Non-breaking change labels Apr 27, 2021

kkraus14 reviewed Apr 27, 2021

charlesbluca added 4 commits May 3, 2021 10:31

Merge remote-tracking branch 'upstream/branch-0.20' into dask-collect…

b3fab63

…-list

Redirect callable list agg to 'collect'

6dee893

Merge remote-tracking branch 'upstream/branch-21.06' into dask-collec…

f16ec9d

…t-list

Remove meta try/except block

a9dc883

charlesbluca changed the base branch from branch-21.06 to branch-21.08 May 27, 2021 00:48

charlesbluca added the 0 - Blocked Cannot progress due to external reasons label Jun 16, 2021

charlesbluca linked an issue Jun 16, 2021 that may be closed by this pull request

[BUG] Groupby collect list fails with Dask #7812

Closed

Merge remote-tracking branch 'upstream/branch-21.08' into dask-collec…

b768d3f

…t-list

charlesbluca added 2 - In Progress Currently a work in progress and removed 0 - Blocked Cannot progress due to external reasons labels Jul 2, 2021

Flatten nested groupby collect result

1c12d9c

charlesbluca marked this pull request as ready for review July 2, 2021 20:26

charlesbluca requested a review from a team as a code owner July 2, 2021 20:26

charlesbluca added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 2, 2021

rjzamora reviewed Jul 6, 2021

Remove index nulling from test

405b2c7

rjzamora mentioned this pull request Jul 6, 2021

[BUG] Dask-cuDF Series groupby aggregation returns DataFrame #8655

Closed

rjzamora approved these changes Jul 6, 2021

rapids-bot bot merged commit c54346e into rapidsai:branch-21.08 Jul 6, 2021

charlesbluca deleted the dask-collect-list branch August 3, 2021 17:52
