Pandas inconsistently handles identically named columns in csv export and merging #3468
Comments
@jreback actually brought this up a few days ago, and I talked him down because
Name mangling
My guess is that the column name mangling was put there before dupe columns
As it turns out
It's very easy to undo this behavior and it only breaks one test, the one
Merge behavior
The merge behavior is naturally more subtle. Though the error message is unclear
What I will concede, is that:
Is really terrible usability, because once a frame with dupe cols is created the user can't
If we can fix that somehow, I think that should be a reasonable compromise. @jreback, what do you think?
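For readers unfamiliar with the term, "name mangling" here means renaming repeated column labels so that every label is unique. The following is a minimal sketch of that idea in plain Python, not the pandas implementation: `mangle_names` is a hypothetical helper, and the `.1`, `.2` suffix style is assumed to mirror what `read_csv` produces.

```python
from collections import Counter


def mangle_names(names):
    """Return a copy of `names` in which repeated labels get a numeric suffix.

    Hypothetical helper for illustration only. The first occurrence of a
    label is kept as-is; later occurrences become "label.1", "label.2", ...
    """
    seen = Counter()
    mangled = []
    for name in names:
        count = seen[name]  # how many times this label appeared before
        mangled.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return mangled


example = mangle_names(["a", "b", "a", "a"])
print(example)
```

This sketch does not guard against a file that already contains a literal `a.1` column, which is one reason the real behavior is more involved.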
The mangling code sits in a place that's reused by several io paths,
Do we really need an option to make read_csv name-mangle the columns? Yeah, we want to be backwards compatible and all, but really, is there even one person in the world who has a CSV that has duplicate columns and does NOT want pandas to create a dataframe with the exact same column names as in the file?
Some users will have existing code that depends on this behavior, when users
What the default behavior should be (opt-in to new, or activate "legacy" mode) is a bike shed topic
#3511 merged, unmangled will become the default in 0.12.
@y-p I think this can be closed? (or wait to reverse the mangle default in 0.13?)
definitely, you nailed this one long ago.
Using pandas 0.10.1
Pandas allows creating a dataframe with two columns with the same name. (I disagree that it should be allowed, but it is allowed, so OK). However, it doesn't handle that correctly in several cases.
Pandas ought to either completely disallow duplicate named columns or handle them everywhere. But it shouldn't handle them in some cases but not others.
Problem 1: Round-trip to a CSV
Dump the dataframe to a CSV and then read it back. Even though duplicate columns are supposed to be legal, Pandas won't allow that in the CSV import/export.
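The original snippet for this problem did not survive in the page text, so here is a rough reconstruction of the round trip being described. The frame contents and variable names are made up for illustration, and what comes back from `read_csv` (identical labels, mangled labels, or an error) depends on the pandas version, so the sketch prints the labels at each stage instead of assuming a result.

```python
import io

import pandas as pd

# Illustrative frame with two columns that share the label "a".
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
header = buf.getvalue().splitlines()[0]

# Read the same text back.
buf.seek(0)
roundtrip = pd.read_csv(buf)

print("original columns:", list(df.columns))
print("csv header line: ", header)
print("reread columns:  ", list(roundtrip.columns))
```

Comparing the first and last printed lists shows directly whether the duplicate label survived the round trip on the installed version.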
Problem 2: Merging to a dataframe with dup columns does not work
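The merge example is also missing from the page text. A sketch along the following lines, with made-up frames and a hypothetical `try_merge` helper, may help reproduce the report. It does not assume whether the merge succeeds or raises, since that has varied across pandas versions; it only records which of the two happened, next to a control merge without duplicate labels.

```python
import pandas as pd


def try_merge(left, right, on):
    """Attempt a merge and return (result, error); exactly one is None.

    Hypothetical helper: the broad `except` is deliberate, because the
    exception type raised for duplicate labels is version-dependent.
    """
    try:
        return pd.merge(left, right, on=on), None
    except Exception as exc:
        return None, exc


right = pd.DataFrame({"key": [1, 2], "b": [10, 20]})

# Control case: unique column labels on both sides.
clean = pd.DataFrame([[1, 5], [2, 6]], columns=["key", "a"])
control, control_err = try_merge(clean, right, on="key")

# Case from the report: the left frame has two columns labelled "a".
dupes = pd.DataFrame([[1, 5, 7], [2, 6, 8]], columns=["key", "a", "a"])
merged, merge_err = try_merge(dupes, right, on="key")

print("control columns:", None if control is None else list(control.columns))
print("dupes result:   ", None if merged is None else list(merged.columns))
print("dupes error:    ", repr(merge_err))
```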
I'd be ok with almost any solution (disallowing duplicate named columns, giving you a warning when you do it, handling it correctly in the merge and csv read, etc). But it should be consistently one way or the other.