[FEA] Adding support for categorical column indexes #8743

Open
charlesbluca opened this issue Jul 14, 2021 · 1 comment
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@charlesbluca
Member

Is your feature request related to a problem? Please describe.
Categorical column indexes exist in a weird place of quasi-support in cuDF: while it is possible to set a dataframe's column index to a pd.CategoricalIndex without any error or warning, the categorical index cannot be recovered through df.columns, which returns a plain object-dtype Index instead. This contrasts with the behavior of Pandas:

import cudf
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
pdf.columns = pdf.columns.astype("category")

gdf = cudf.from_pandas(pdf)

print(pdf.columns)  # CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')
print(gdf.columns)  # Index(['a', 'b'], dtype='object')

This means that while there are user-facing issues which come as a result of using cuDF's "categorical" column indexes (such as #7365), the ability to test for them is limited in that we cannot do the standard comparison to Pandas dataframes here:

from cudf.testing._utils import assert_eq

assert_eq(pdf, gdf)  # AssertionError: DataFrame.columns are different

Describe the solution you'd like
After chatting with @shwina, it seems the ideal solution would be to use the individual categorical scalars, instead of their string names, as data when constructing the ColumnAccessor in the columns setter method. However, this isn't possible, as neither Pandas nor cuDF offers categorical scalars.

An alternative would be to add a boolean attribute, on either the dataframe or the ColumnAccessor, indicating whether the column index is categorical; ColumnAccessor.to_pandas_index() could then use it to reconstruct the index with categories when needed. This would come with its own consequences, specifically either:

  • a relatively niche param/attribute of ColumnAccessor that is only used for dataframes
  • an attribute of dataframes that now must be explicitly copied from one to another in the case of copies
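
To make the first option concrete, here is a minimal sketch of the flag idea. ToyColumnAccessor is a hypothetical stand-in written for illustration, not cuDF's actual ColumnAccessor, and categorical_columns is an assumed attribute name. It uses only Pandas so it can run without a GPU:

```python
import pandas as pd


class ToyColumnAccessor:
    """Hypothetical stand-in for cuDF's ColumnAccessor; not the real class."""

    def __init__(self, data, categorical_columns=False):
        # Column labels are stored as plain dict keys (strings here),
        # so any categorical dtype on the original index is lost.
        self._data = dict(data)
        # Assumed flag: remembers that the labels came from a CategoricalIndex.
        self.categorical_columns = categorical_columns

    @classmethod
    def from_pandas_index(cls, columns, values):
        # Record whether the incoming column index was categorical.
        is_categorical = isinstance(columns, pd.CategoricalIndex)
        return cls(zip(columns, values), categorical_columns=is_categorical)

    def to_pandas_index(self):
        labels = list(self._data)
        if self.categorical_columns:
            # Rebuild a CategoricalIndex; categories are inferred from labels.
            return pd.CategoricalIndex(labels)
        return pd.Index(labels)
```

Note that a boolean alone only lets the categories be inferred from the labels when the index is rebuilt, so a custom category order or ordered=True would probably not round-trip; storing the original CategoricalDtype instead of a flag might be needed for that.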

Describe alternatives you've considered
A possible alternative that @shwina and I explored, but were unable to get working, is to pass specific kwargs to assert_eq so that it checks only the column index names and not the index type. We tried different combinations of check_categorical=False, check_column_type=False, etc., but could not get a passing test when comparing these indexes.
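
For reference, a test-side workaround along the same lines could normalize the column labels before comparing. This is only a sketch under assumptions: assert_eq_ignoring_column_index_type is a hypothetical helper (not part of cudf.testing), and a plain Pandas dataframe stands in for gdf.to_pandas() so the example runs without a GPU:

```python
import pandas as pd


def assert_eq_ignoring_column_index_type(left, right):
    """Hypothetical helper: compare frames after casting column labels to object."""
    left = left.copy()
    right = right.copy()
    # astype(object) turns a CategoricalIndex into a plain Index of labels,
    # so only the label values (and names) are compared, not the index type.
    left.columns = left.columns.astype(object)
    right.columns = right.columns.astype(object)
    pd.testing.assert_frame_equal(left, right)


pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
pdf.columns = pdf.columns.astype("category")

# Stand-in for gdf.to_pandas(): same data, but a plain column index.
roundtripped = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

assert_eq_ignoring_column_index_type(pdf, roundtripped)
```

This hides the index type difference rather than fixing it, so it would not replace actual support for categorical column indexes.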

Additional context
This issue came up while working on #8560, where added test cases would require this feature and needed to be xfailed.

@charlesbluca charlesbluca added feature request New feature or request Python Affects Python cuDF API. labels Jul 14, 2021
@charlesbluca charlesbluca changed the title [BUG] Adding support for categorical column indexes [FEA] Adding support for categorical column indexes Jul 14, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
