Remove partial support for duplicate MultiIndex names unless they are all None #10500
Comments
The alternative resolution to this issue would be if we changed the ColumnAccessor to use a list of columns and a list of names instead of maintaining a dictionary of name:column mappings. If we made that change, then both DataFrame and MultiIndex could support duplicate names with only a little bit of additional logic to ensure that accessor methods behave in the expected way when duplicate names exist. However, any such changes to the ColumnAccessor will be motivated by more pressing concerns than supporting duplicate names (overall package performance, stability, and robustness), so we shouldn't rush to any solutions just to solve the duplicate names problem.
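To make the list-based idea concrete, here is a minimal pure-Python sketch. The class `ListColumnAccessor` and its methods `positions` and `select` are hypothetical names chosen for illustration; this is not cudf's ColumnAccessor API, and plain lists stand in for real columns.

```python
class ListColumnAccessor:
    """Hypothetical accessor: parallel lists of names and columns, no dict."""

    def __init__(self, names, columns):
        names = list(names)
        columns = list(columns)
        if len(names) != len(columns):
            raise ValueError("names and columns must have the same length")
        self.names = names
        self.columns = columns

    def positions(self, name):
        """Return every position whose label equals `name`."""
        return [i for i, n in enumerate(self.names) if n == name]

    def select(self, name):
        """Return all columns labeled `name`; raise KeyError if none match."""
        pos = self.positions(name)
        if not pos:
            raise KeyError(name)
        return [self.columns[i] for i in pos]


acc = ListColumnAccessor(["a", "b", "a"], [[1, 2], [3, 4], [5, 6]])
both_a = acc.select("a")  # both columns labeled "a" are retained
only_b = acc.select("b")
```

The trade-off in this sketch is that a lookup by name becomes a linear scan and may return more than one column, which is the "additional logic" every accessor method would have to account for.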
@mroeschke do you know if there are any plans to change the duplicate names support in pandas? There are a lot of ways in which it's kind of broken to allow this since basic operations stop working if there are duplicate names, so this seems like an API improvement that we could suggest in pandas itself (excepting the MultiIndex all None case; we may still need to support that since a MultiIndex without names is probably the most common case).
We could propose disallowing duplicate names, but I doubt there would be much appetite to disallow them. I don't recall seeing many bug reports over the years because a MultiIndex had duplicate names, as the names are essentially metadata (carried around as an immutable …
Sorry, I should clarify. I wasn't only thinking about MultiIndex objects, but also DataFrame objects. For example, pandas lets you do this:
That certainly has impacts on various downstream operations and leads to odd-looking failures, e.g.
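A minimal sketch of the kind of behavior under discussion, assuming pandas is installed. It is an illustration only and not necessarily the snippet from the comment; the column labels and values are made up.

```python
import pandas as pd

# pandas accepts repeated column labels at construction time.
df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# The column index reports that its labels are not unique.
unique = df.columns.is_unique

# Selecting a duplicated label returns a DataFrame with both columns,
# whereas selecting a unique label returns a Series.
selected = df["a"]
single = df["b"]
```

Code written on the assumption that `df[name]` always yields a Series can then fail far from the place where the duplicate label was introduced.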
Ah I see. I think this would be a tough sell too, since a lot of APIs were developed over time to handle duplicate columns (I suspect the main motivation was to "gracefully" support IO use cases (CSVs) with duplicate headers). There has been an ask to make column labels unique by default (pandas-dev/pandas#53217), but also a larger discussion at one point about making the handling of duplicate columns consistent (pandas-dev/pandas#47718), so I think there's greater appetite for consistent behavior around duplicate labels than for disallowing them.
OK got it. That is very helpful context, thanks! If that is the case and there is real interest in this in pandas, then we may have to rethink cudf's plans around duplicate names and issues like #13273
Currently our MultiIndex class supports duplicate names, while DataFrames do not. The MultiIndex support is buggy, however, and we are frequently finding new edge cases that break it. Since pandas DataFrames do support duplicate names and we explicitly choose not to, I think it makes sense to do the same for MultiIndex. It improves our internal consistency and helps us write much more robust code. Making this change would probably fix a number of currently unknown/hidden bugs.
The major caveat here is that we do need to support MultiIndexes where all the names are None. However, handling this case would potentially be much simpler since we could use a sentinel or another class attribute to track whether names are meaningful or not. Default names could be integers, and any setting of names would require setting all column names to unique values.
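One way the proposed rule could look, as a rough sketch: the function `normalize_level_names` and its return shape are hypothetical and not part of cudf. It treats all-None names as "not meaningful" and substitutes integer keys, and otherwise requires every name to be unique.

```python
def normalize_level_names(names):
    """Return (keys, names_are_meaningful) for a list of level names.

    Hypothetical helper: all-None names fall back to integer keys;
    any other input must consist of unique, hashable labels.
    """
    names = list(names)
    if all(n is None for n in names):
        # No user-provided names: use positions as internal keys.
        return list(range(len(names))), False
    if len(set(names)) != len(names):
        raise ValueError(f"level names must be unique, got {names!r}")
    return names, True


unnamed = normalize_level_names([None, None])
named = normalize_level_names(["x", "y"])
```

Under this sketch a partially named input such as `["x", None, None]` is rejected, because the two `None` entries count as duplicates once any name has been set.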
An aside: If we ever did want to support duplicate names properly, it would involve a refactoring at the level of ColumnAccessor, which currently uses a dictionary as the underlying data structure to map names to columns. We would then need to update all of our functions that rely on _from_data to populate a new object that could support duplicate names rather than a dictionary. This is a substantial undertaking and out of scope for this issue.
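The underlying limitation can be seen with a plain Python dict, independent of cudf. The names and values below are made up for illustration.

```python
names = ["a", "b", "a"]
columns = [[1, 2], [3, 4], [5, 6]]

# Building a name -> column mapping silently collapses the duplicate key:
# the later "a" overwrites the earlier one, so one column is lost.
as_dict = dict(zip(names, columns))
surviving = len(as_dict)
```

Any structure keyed on names alone has this property, which is why supporting duplicates would require replacing the mapping itself and not just patching individual methods.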