Remove partial support for duplicate MultIindex names unless they are all None #10500

vyasr · 2022-03-23T19:26:15Z

Currently our MultiIndex class supports duplicate names, while DataFrames do not. The MultiIndex support is buggy, however, and we are frequently finding new edge cases that break it. Since pandas DataFrames do support duplicate names and we explicitly choose not to, I think it makes sense to do the same for MultiIndex. It improves our internal consistency and helps us write much more robust code. Making this change would probably fix a number of currently unknown/hidden bugs.

The major caveat here is that we do need to support MultiIndexes where all the names are None. However, handling this case would potentially be much simpler since we could use a sentinel or another class attribute to track whether names are meaningful or not. Default names could be integers, and any setting of names would require setting all column names to unique values.

An aside: If we ever did want to support duplicate names properly, it would involve a refactoring at the level of ColumnAccessor, which currently uses a dictionary as the underlying data structure to map names to columns. We would then need to update all of our functions that rely on _from_data to populate a new object that could support duplicate names rather than a dictionary. This is a substantial undertaking and out of scope for this issue.

The text was updated successfully, but these errors were encountered:

vyasr · 2022-03-23T19:26:23Z

CC @shwina @galipremsagar

github-actions · 2022-04-22T20:03:02Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr · 2022-04-25T14:34:09Z

The alternative resolution to this issue would be if we changed the ColumnAccessor to use a list of columns and a list of names instead of maintaining a dictionary of name:column mappings. If we made that change, then both DataFrame and MultiIndex could support duplicate names with only a little bit of additional logic to ensure that accessor methods behave in the expected way when duplicate names exist. However, any such changes to the ColumnAccessor will be motivated by more pressing concerns than supporting duplicate names (overall package performance, stability, and robustness), so we shouldn't rush to any solutions just to solve the duplicate names problem.

github-actions · 2022-05-25T15:12:43Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-08-23T16:03:18Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr · 2024-05-17T14:02:52Z

@mroeschke do you know if there are any plans to change the duplicate names support in pandas? There are a lot of ways in which it's kind of broken to allow this since basic operations stop working if there are duplicate names, so this seems like an API improvement that we could suggest in pandas itself (excepting the MultiIndex all None case; we may still need to support that since a MultiIndex without names is probably the most common case).

mroeschke · 2024-05-17T16:19:58Z

We could propose disallowing duplicate names, but I doubt there would be much appetite to disallow them.

I don't recall seeing many bug reports over the years because a MultiIndex had duplicate names as the names are essentially metadata (carried around as a immutable list) and not used in any meaningful way in MultiIndex operations.

vyasr · 2024-05-20T23:03:17Z

Sorry, I should clarify. I wasn't only thinking about MultiIndex objects, but also DataFrame objects. For example, pandas lets you do this:

In [1]: import pandas as pd       
                                                   
In [2]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
                                                   
In [3]: df.columns = ["a", "a", "b"]
                                                                                                                                                                                                            In [4]: df     
Out[4]:                     
   a  a  b                 
0  1  4  7                                         
1  2  5  8          
2  3  6  9

That certainly has impacts on various downstream operations and leads to odd-looking failures, e.g.

In [5]: df.groupby("a").sum()
...
ValueError: Grouper for 'a' not 1-dimensional

mroeschke · 2024-05-21T00:01:40Z

Ah I see. I think this would be tough sell too since a lot of APIs were developed overtime to handle duplicate columns (I suspect the main motivation was to "gracefully" support IO usecases (CSVs) with duplicate headers).

There has been an ask to make column labels unique by default pandas-dev/pandas#53217, but also a larger discussion at one point to make handling duplicate columns consistent pandas-dev/pandas#47718 so I think there's greater appetite at the behavior consistency of duplicate labels rather than disallowing them

vyasr · 2024-05-21T00:15:11Z

OK got it. That is very helpful context, thanks! If that is the case and there is real interest in this in pandas, then we may have to rethink cudf's plans around duplicate names and issues like #13273

vyasr added code quality Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function breaking Breaking change labels Mar 23, 2022

vyasr added this to the CuDF Python Refactoring milestone Mar 23, 2022

github-actions bot added the inactive-30d label Apr 22, 2022

github-actions bot removed the inactive-30d label Apr 25, 2022

github-actions bot added the inactive-30d label May 25, 2022

github-actions bot added the inactive-90d label Aug 23, 2022

GregoryKimball removed the inactive-90d label Apr 3, 2023

vyasr removed code quality labels Feb 23, 2024

mroeschke mentioned this issue Aug 12, 2024

[FEA] Support duplicate column labels in cudf.DataFrame #16533

Open

vyasr added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove partial support for duplicate MultIindex names unless they are all None #10500

Remove partial support for duplicate MultIindex names unless they are all None #10500

vyasr commented Mar 23, 2022

vyasr commented Mar 23, 2022

github-actions bot commented Apr 22, 2022

vyasr commented Apr 25, 2022

github-actions bot commented May 25, 2022

github-actions bot commented Aug 23, 2022

vyasr commented May 17, 2024

mroeschke commented May 17, 2024

vyasr commented May 20, 2024

mroeschke commented May 21, 2024

vyasr commented May 21, 2024

Remove partial support for duplicate MultIindex names unless they are all None #10500

Remove partial support for duplicate MultIindex names unless they are all None #10500

Comments

vyasr commented Mar 23, 2022

vyasr commented Mar 23, 2022

github-actions bot commented Apr 22, 2022

vyasr commented Apr 25, 2022

github-actions bot commented May 25, 2022

github-actions bot commented Aug 23, 2022

vyasr commented May 17, 2024

mroeschke commented May 17, 2024

vyasr commented May 20, 2024

mroeschke commented May 21, 2024

vyasr commented May 21, 2024