Disallow groupby aggs for StructColumns #8499

Merged: 2 commits, Jun 14, 2021


charlesbluca
Member

Closes #8474

We were mistakenly using the _STRING_AGGS set of allowed aggregations for struct dtypes in groupby.pyx. That let groupby aggregations run on StructColumns and return incorrect results in which the struct field names are lost; in @ayushdg's example:

import cudf

df = cudf.DataFrame(
    {
        'a': ['aa', 'aa', 'cc'],
        'd': [{"b": '1', "c": "one"}, {"b": '2', "c": "two"}, {"b": '3', "c": "one"}]
    }
)

df
	a	d
0	aa	{'b': '1', 'c': 'one'}
1	aa	{'b': '2', 'c': 'two'}
2	cc	{'b': '3', 'c': 'one'}

df.groupby('a').collect()

	d
a	
aa	[{'0': '1', '1': 'one'}, {'0': '2', '1': 'two'}]
cc	[{'0': '3', '1': 'one'}]

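This change fixes that: with the allowed aggregations for struct dtypes corrected, groupby aggregations on StructColumns should now be rejected instead of returning results with mangled field names.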
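To illustrate the kind of check involved, here is a minimal pure-Python sketch of a per-dtype allow-list for groupby aggregations. It is a simplified model, not the code in groupby.pyx: the names ALLOWED_AGGS, valid_columns, and check_groupby_agg are hypothetical, the contents of the sets are assumptions, and DataError here is a local stand-in for the pandas exception.

```python
class DataError(Exception):
    """Local stand-in for the error raised when nothing can be aggregated."""


# Hypothetical allow-lists keyed by a simplified dtype kind.
# The real sets live in groupby.pyx and may contain different entries.
ALLOWED_AGGS = {
    "string": {"count", "size", "min", "max", "nunique", "collect"},
    "struct": set(),  # assumed post-fix state: no aggregation is allowed
}


def valid_columns(column_kinds, agg, allowed=ALLOWED_AGGS):
    """Return the names of columns whose dtype kind permits `agg`."""
    return [
        name
        for name, kind in column_kinds.items()
        if agg in allowed.get(kind, set())
    ]


def check_groupby_agg(column_kinds, agg, allowed=ALLOWED_AGGS):
    """Return the columns to aggregate, or raise if none qualify."""
    columns = valid_columns(column_kinds, agg, allowed)
    if not columns:
        raise DataError(f"aggregation {agg!r} is unsupported for all columns")
    return columns


# The bug, in this model: struct columns were looked up in the string set.
BUGGY_ALLOWED_AGGS = dict(ALLOWED_AGGS, struct=ALLOWED_AGGS["string"])
```

In this model, check_groupby_agg({"d": "struct"}, "collect", BUGGY_ALLOWED_AGGS) lets the struct column through, while the same call with the default table raises DataError.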

@charlesbluca added the bug, 3 - Ready for Review, and non-breaking labels Jun 11, 2021
@charlesbluca requested a review from a team as a code owner June 11, 2021 17:44
@github-actions bot added the Python label Jun 11, 2021

codecov bot commented Jun 11, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@0a4e8a1).
The diff coverage is n/a.

❗ Current head e31f9a4 differs from pull request most recent head bfa53fb. Consider uploading reports for the commit bfa53fb to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8499   +/-   ##
===============================================
  Coverage                ?   82.91%           
===============================================
  Files                   ?      110           
  Lines                   ?    18094           
  Branches                ?        0           
===============================================
  Hits                    ?    15002           
  Misses                  ?     3092           
  Partials                ?        0           


Contributor

@marlenezw left a comment

Nice catch Charles, this looks good to me.

Member

@quasiben left a comment

Thanks @charlesbluca

@quasiben
Member

@gpucibot merge

@quasiben
Member

@charlesbluca do you want to add a test for this as well?

@charlesbluca
Member Author

charlesbluca commented Jun 14, 2021

Sure! I'll make sure we catch a pandas.core.base.DataError in the case of struct column groupbys.
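A test along those lines might look roughly like the sketch below. It is not the test that was added to cuDF: the helper assert_groupby_agg_rejected and the function struct_groupby_collect_should_raise are hypothetical names, and the import location of DataError is taken from the comment above and is assumed to match the installed pandas version. The second function is deliberately left without a test_ prefix so it is not collected on machines without cuDF and a GPU; in a real test suite it would carry that prefix.

```python
def assert_groupby_agg_rejected(df, key, agg, expected_error):
    """Raise AssertionError unless df.groupby(key).<agg>() raises expected_error."""
    try:
        getattr(df.groupby(key), agg)()
    except expected_error:
        return  # the aggregation was rejected, as intended
    raise AssertionError(
        f"groupby({key!r}).{agg}() did not raise {expected_error.__name__}"
    )


def struct_groupby_collect_should_raise():
    # Imports are local because cudf needs a GPU to import and run.
    import cudf
    # Location as named in the discussion; assumed valid for the pandas in use.
    from pandas.core.base import DataError

    # Same frame as in the PR description: "d" becomes a struct column.
    df = cudf.DataFrame(
        {
            "a": ["aa", "aa", "cc"],
            "d": [
                {"b": "1", "c": "one"},
                {"b": "2", "c": "two"},
                {"b": "3", "c": "one"},
            ],
        }
    )
    assert_groupby_agg_rejected(df, "a", "collect", DataError)
```

Keeping the raise-or-fail logic in a small helper means it can be exercised with any object that exposes groupby, independently of cuDF.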

@rapids-bot merged commit a0b792f into branch-21.08 Jun 14, 2021
@charlesbluca deleted the charlesbluca-patch-1 branch June 14, 2021 18:02
Development

Successfully merging this pull request may close these issues.

[BUG] Groupby collect on struct columns losing field name