Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove partial support for duplicate MultIindex names unless they are all None #10500

Open
vyasr opened this issue Mar 23, 2022 · 10 comments
Open
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.

Comments

@vyasr
Copy link
Contributor

vyasr commented Mar 23, 2022

Currently our MultiIndex class supports duplicate names, while DataFrames do not. The MultiIndex support is buggy, however, and we are frequently finding new edge cases that break it. Since pandas DataFrames do support duplicate names and we explicitly choose not to, I think it makes sense to do the same for MultiIndex. It improves our internal consistency and helps us write much more robust code. Making this change would probably fix a number of currently unknown/hidden bugs.

The major caveat here is that we do need to support MultiIndexes where all the names are None. However, handling this case would potentially be much simpler since we could use a sentinel or another class attribute to track whether names are meaningful or not. Default names could be integers, and any setting of names would require setting all column names to unique values.

An aside: If we ever did want to support duplicate names properly, it would involve a refactoring at the level of ColumnAccessor, which currently uses a dictionary as the underlying data structure to map names to columns. We would then need to update all of our functions that rely on _from_data to populate a new object that could support duplicate names rather than a dictionary. This is a substantial undertaking and out of scope for this issue.

@vyasr vyasr added code quality Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function breaking Breaking change labels Mar 23, 2022
@vyasr vyasr added this to the CuDF Python Refactoring milestone Mar 23, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Mar 23, 2022

CC @shwina @galipremsagar

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 25, 2022

The alternative resolution to this issue would be if we changed the ColumnAccessor to use a list of columns and a list of names instead of maintaining a dictionary of name:column mappings. If we made that change, then both DataFrame and MultiIndex could support duplicate names with only a little bit of additional logic to ensure that accessor methods behave in the expected way when duplicate names exist. However, any such changes to the ColumnAccessor will be motivated by more pressing concerns than supporting duplicate names (overall package performance, stability, and robustness), so we shouldn't rush to any solutions just to solve the duplicate names problem.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Copy link
Contributor Author

vyasr commented May 17, 2024

@mroeschke do you know if there are any plans to change the duplicate names support in pandas? There are a lot of ways in which it's kind of broken to allow this since basic operations stop working if there are duplicate names, so this seems like an API improvement that we could suggest in pandas itself (excepting the MultiIndex all None case; we may still need to support that since a MultiIndex without names is probably the most common case).

@mroeschke
Copy link
Contributor

We could propose disallowing duplicate names, but I doubt there would be much appetite to disallow them.

I don't recall seeing many bug reports over the years because a MultiIndex had duplicate names as the names are essentially metadata (carried around as a immutable list) and not used in any meaningful way in MultiIndex operations.

@vyasr
Copy link
Contributor Author

vyasr commented May 20, 2024

Sorry, I should clarify. I wasn't only thinking about MultiIndex objects, but also DataFrame objects. For example, pandas lets you do this:

In [1]: import pandas as pd       
                                                   
In [2]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
                                                   
In [3]: df.columns = ["a", "a", "b"]
                                                                                                                                                                                                            In [4]: df     
Out[4]:                     
   a  a  b                 
0  1  4  7                                         
1  2  5  8          
2  3  6  9   

That certainly has impacts on various downstream operations and leads to odd-looking failures, e.g.

In [5]: df.groupby("a").sum()
...
ValueError: Grouper for 'a' not 1-dimensional

@mroeschke
Copy link
Contributor

Ah I see. I think this would be tough sell too since a lot of APIs were developed overtime to handle duplicate columns (I suspect the main motivation was to "gracefully" support IO usecases (CSVs) with duplicate headers).

There has been an ask to make column labels unique by default pandas-dev/pandas#53217, but also a larger discussion at one point to make handling duplicate columns consistent pandas-dev/pandas#47718 so I think there's greater appetite at the behavior consistency of duplicate labels rather than disallowing them

@vyasr
Copy link
Contributor Author

vyasr commented May 21, 2024

OK got it. That is very helpful context, thanks! If that is the case and there is real interest in this in pandas, then we may have to rethink cudf's plans around duplicate names and issues like #13273

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
Status: Todo
Development

No branches or pull requests

3 participants