Fix repeated mangled names in read_csv with duplicate column names #8645

karthikeyann · 2021-07-02T14:51:23Z

Fixes a bug in read_csv where the mangled names generated for duplicate columns could themselves be repeated.
The result did not match pandas behavior.

csv file:

A,A,A.1,A,A.2,A,A.4,A,A
1,2,3,4.0,a,a,a.4,a,a
2,4,6,8.0,b,b,b.4,b,a
3,6,2,6.0,c,c,c.4,c,c

Column names in the file and the expected mangled names:

original  A  A    A.1    A    A.2    A    A.4  A      A
expected  A  A.1  A.1.1  A.2  A.2.1  A.3  A.4  A.4.1  A.5

Pandas:

In [1]: import pandas as pd
In [2]: pd.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c

cudf: (21.08 nightly docker)

In [1]: import cudf
In [2]: cudf.__version__
Out[2]: '21.08.00a+238.gfba09e66d8'
In [3]: cudf.read_csv("test.csv")
Out[3]: 
   A  A.1 A.2 A.3 A.4 A.5
0  1    3   a   a   a   a
1  2    6   b   b   b   a
2  3    2   c   c   c   c

With this PR, cudf produces the same column names as pandas:

In [2]: cudf.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
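
The expected names can be reproduced with a small counting scheme. The sketch below is illustrative only: `mangle_duplicate_names` is a hypothetical helper, not the libcudf code changed in this PR and not a claim about pandas internals. It assumes that a candidate name is re-suffixed until it has not been used by any earlier column.

```python
from collections import defaultdict


def mangle_duplicate_names(names):
    """Return a copy of `names` made unique with ".<n>" suffixes.

    Hypothetical helper for illustration; not the libcudf implementation.
    """
    counts = defaultdict(int)  # times each name has been seen or emitted so far
    result = []
    for name in names:
        seen = counts[name]
        # Keep appending ".<n>" until the candidate is unused, so a mangled
        # name can never repeat a name emitted for an earlier column.
        while seen > 0:
            counts[name] = seen + 1
            name = f"{name}.{seen}"
            seen = counts[name]
        counts[name] = seen + 1
        result.append(name)
    return result


header = ["A", "A", "A.1", "A", "A.2", "A", "A.4", "A", "A"]
print(mangle_duplicate_names(header))
# ['A', 'A.1', 'A.1.1', 'A.2', 'A.2.1', 'A.3', 'A.4', 'A.4.1', 'A.5']
```

The inner loop is what separates this from a single-pass rename: the second "A" becomes "A.1", so the literal "A.1" in the third position has to move on to "A.1.1" instead of colliding with it.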

Related info (Spark):
Spark duplicate column naming:
https://issues.apache.org/jira/browse/SPARK-16896
apache/spark#14745
The cudf Spark addon doesn't use libcudf names, so this PR does not affect it.

codecov · 2021-07-02T17:01:42Z

Codecov Report

Merging #8645 (b9a4c9e) into branch-21.08 (fba09e6) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head b9a4c9e differs from pull request most recent head 81494a9. Consider uploading reports for the commit 81494a9 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.08    #8645      +/-   ##
================================================
+ Coverage         10.60%   10.61%   +0.01%     
================================================
  Files               109      109              
  Lines             18280    18645     +365     
================================================
+ Hits               1938     1980      +42     
- Misses            16342    16665     +323

Impacted Files	Coverage Δ
python/cudf/cudf/io/hdf.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/orc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/abc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/api/types.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/dlpack.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/feather.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
... and 44 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fba09e6...81494a9.

vuule

Just remembered - please add a test case that exercises the behavior change.

vuule · 2021-07-06T06:14:07Z

rerun tests

karthikeyann · 2021-07-06T12:40:40Z

@gpucibot merge

fix repeated names for mangle duplicate names

4cb82a7

karthikeyann requested a review from a team as a code owner July 2, 2021 14:51

karthikeyann requested review from hyperbolic2346 and mythrocks July 2, 2021 14:51

github-actions bot added the libcudf label Jul 2, 2021

karthikeyann added the 3 - Ready for Review, bug, cuIO, and non-breaking labels and removed the libcudf label Jul 2, 2021

vuule approved these changes Jul 3, 2021

elstehle approved these changes Jul 5, 2021

vuule requested changes Jul 5, 2021

add unit test for repeated column names

81494a9

karthikeyann requested a review from a team as a code owner July 5, 2021 08:44

karthikeyann requested review from rgsl888prabhu, skirui-source and vuule July 5, 2021 08:44

vuule approved these changes Jul 5, 2021

vuule added the 5 - Ready to Merge and 4 - Needs cuDF (Python) Reviewer labels and removed the 3 - Ready for Review and 5 - Ready to Merge labels Jul 5, 2021

rgsl888prabhu approved these changes Jul 5, 2021

hyperbolic2346 approved these changes Jul 6, 2021

vuule added the 5 - Ready to Merge label and removed the 4 - Needs cuDF (Python) Reviewer label Jul 6, 2021

ajschmidt8 changed the title ~~Fix repeated mangled names in read_csv with duplicate column names~~ Fix repeated mangled names in read_csv with duplicate column names Jul 6, 2021

rapids-bot bot merged commit d77ba82 into rapidsai:branch-21.08 Jul 6, 2021
