Is your feature request related to a problem? Please describe.
Categorical column indexes exist in a weird place of quasi-support in cuDF: it is possible to set a dataframe's column index to a pd.CategoricalIndex without any error or warning, but the index cannot then be recreated with df.columns, which contrasts with the behavior of Pandas.
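For reference, the Pandas side of that comparison can be sketched as below. The frame contents and column labels are made up for illustration. The cuDF counterpart is left out because it needs a GPU to run; per the description above, df.columns there does not come back as a categorical index.

```python
import pandas as pd

# Illustrative data only; any frame with string column labels would do.
pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
pdf.columns = pd.CategoricalIndex(["a", "b"])

# pandas keeps the index class that was assigned to .columns
columns_type = type(pdf.columns).__name__
print(columns_type)
```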
This means that while there are user-facing issues which come as a result of using cuDF's "categorical" column indexes (such as #7365), the ability to test for them is limited in that we cannot do the standard comparison to Pandas dataframes here:
    from cudf.testing._utils import assert_eq
    assert_eq(pdf, gdf)  # AssertionError: DataFrame.columns are different
Describe the solution you'd like
After chatting with @shwina, it seems the ideal solution would be to use the individual categorical scalars, rather than their string names, as data when constructing the ColumnAccessor in the columns setter method. However, this isn't possible, as neither Pandas nor cuDF offers categorical scalars.
An alternative would be a boolean attribute, on either the dataframe or the ColumnAccessor, recording whether the column index is categorical; ColumnAccessor.to_pandas_index() could then use it to properly reconstruct the index with categories if needed. This would come with its own consequences, specifically either:
- a relatively niche param/attribute of ColumnAccessor that is only used for dataframes, or
- an attribute of dataframes that must now be explicitly copied from one to another whenever a dataframe is copied
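To make the boolean-attribute idea concrete, here is a rough sketch in plain Pandas. ToyColumnAccessor, its is_categorical flag, and from_pandas_index are hypothetical names invented for this illustration; this is not cuDF's ColumnAccessor, only a stand-in showing where such a flag would be set, copied, and consumed.

```python
import pandas as pd


class ToyColumnAccessor:
    """Hypothetical stand-in for illustration, NOT cuDF's ColumnAccessor."""

    def __init__(self, names, is_categorical=False):
        self.names = list(names)              # labels stored as plain values
        self.is_categorical = is_categorical  # assumed flag from the proposal

    @classmethod
    def from_pandas_index(cls, index):
        # Record whether the incoming column index was categorical.
        return cls(index, is_categorical=isinstance(index, pd.CategoricalIndex))

    def copy(self):
        # The flag has to be carried along explicitly on every copy.
        return ToyColumnAccessor(self.names, self.is_categorical)

    def to_pandas_index(self):
        # Rebuild a categorical index only when the flag says so.
        if self.is_categorical:
            return pd.CategoricalIndex(self.names)
        return pd.Index(self.names)


accessor = ToyColumnAccessor.from_pandas_index(pd.CategoricalIndex(["a", "b"]))
rebuilt = accessor.copy().to_pandas_index()
print(type(rebuilt).__name__)
```

One caveat with this sketch: a boolean alone does not record category order or unused categories, so a fuller version would probably need to keep those as well.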
Describe alternatives you've considered
A possible alternative that @shwina and I explored, but were unable to get working, is to pass specific kwargs to assert_eq so that it checks only the column index names and not the index type. Despite trying different combinations of check_categorical=False, check_column_type=False, etc., we were unable to get a passing test when comparing these indexes.
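The intent of that comparison (labels only, index class ignored) can be sketched with a small helper. assert_column_labels_equal is a hypothetical name, not part of cuDF or Pandas, and both frames below are Pandas frames standing in for pdf and gdf so the snippet runs without a GPU.

```python
import pandas as pd


def assert_column_labels_equal(left, right):
    """Hypothetical helper: compare column labels, ignoring the index class."""
    # Converting to plain lists drops the CategoricalIndex wrapper on both sides.
    pd.testing.assert_index_equal(
        pd.Index(list(left.columns)), pd.Index(list(right.columns))
    )


# Stand-ins for pdf and gdf: same labels, different column index classes.
categorical = pd.DataFrame([[1, 2]], columns=pd.CategoricalIndex(["a", "b"]))
plain = pd.DataFrame([[1, 2]], columns=["a", "b"])

assert_column_labels_equal(categorical, plain)  # labels match, so no error
```

This only sidesteps the type check in tests; it would not exercise the categorical round trip itself.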
Additional context
This issue came up while working on #8560, where added test cases would require this feature and needed to be xfailed.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.