[BUG] Dask cuDF csv reader can incorrectly read rows when usecols is passed #9387
Comments
From a quick triage, this appears to be a bug in dask cudf when using `usecols`:
import dask_cudf
data = dask_cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
delimiter='\t',
usecols=['Gene_Symbol', 'pXC50']
)
print(data['Gene_Symbol'].compute().value_counts())
data = dask_cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
delimiter='\t',
# usecols=['Gene_Symbol', 'pXC50']
)
print(data['Gene_Symbol'].compute().value_counts())
ALPI 9218
IDH1 6552
CFTR 6502
CHRM1 5885
NFE2L2 5716
...
CHEMBL1518141 1
CHEMBL2022514 1
21775681 1
CHEMBL2349003 1
9880373 1
Name: Gene_Symbol, Length: 1404649, dtype: int32
ALPI 657351
IDH1 466456
CFTR 456865
CHRM1 420325
NFE2L2 409692
...
DPEP1 20
GZMB 20
CDK18 20
SBK1 20
PCSK6 20
Name: Gene_Symbol, Length: 1331, dtype: int32
import cudf
gdf = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
delimiter='\t',
usecols=['Gene_Symbol', 'pXC50']
)
gdf['Gene_Symbol'].value_counts()
ALPI 657351
IDH1 466456
CFTR 456865
CHRM1 420325
NFE2L2 409692
...
DPEP1 20
GZMB 20
CDK18 20
SBK1 20
PCSK6 20
Name: Gene_Symbol, Length: 1331, dtype: int32
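One difference between the two readers is that cudf parses the file in a single pass, while dask_cudf reads it in byte ranges, one per partition. The sketch below shows how such ranges could be computed, assuming fixed-size chunks of 256 MiB (268435456 bytes). The helper name `byte_ranges` and the 1 GiB file size are hypothetical and chosen only for illustration; this is not dask_cudf's actual partitioning code.

```python
def byte_ranges(file_size: int, chunk_size: int = 268435456):
    """Return (offset, size) pairs that cover a file of `file_size` bytes.

    Hypothetical helper: it only illustrates the arithmetic behind
    fixed-size byte-range chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        size = min(chunk_size, file_size - offset)
        ranges.append((offset, size))
        offset += size
    return ranges


# Assumed file size of exactly 1 GiB, for illustration only.
ranges = byte_ranges(1024 ** 3)
print(ranges[2])  # third chunk: offset 512 MiB, size 256 MiB
```

Under these assumptions the third chunk is `(536870912, 268435456)`, the same `byte_range` value used in the minimal repro in this thread. Only the chunk at offset 0 contains the header row, so every later chunk has to be told the column names explicitly.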
I was able to narrow down the issue to the libcudf layer. Here is a minimal repro:
>>> import cudf
>>> df1 = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv', delimiter='\t', byte_range=(536870912, 268435456), header=None)
>>> df1
0 1 2 3 4 ... 8 9 10 11 12
0 ARNGBSRPBQXTBD-UHFFFAOYNA-N 2920708 599 N <NA> ... BCL2L2 387 InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)... S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3... <NA>
1 ARNGBSRPBQXTBD-UHFFFAOYNA-N 2920708 5999 N <NA> ... RGS4 3736 InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)... S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3... <NA>
2 ARNGBSRPBQXTBD-UHFFFAOYNA-N 2920708 60482 N <NA> ... SLC5A7 10913 InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)... S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3... <NA>
3 ARNGBSRPBQXTBD-UHFFFAOYNA-N 2920708 60489 N <NA> ... APOBEC3G un61 InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)... S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3... <NA>
4 ARNGBSRPBQXTBD-UHFFFAOYNA-N 2920708 6311 N <NA> ... ATXN2 3910 InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)... S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3... <NA>
... ... ... ... .. ... ... ... ... ... ... ...
996980 BAKNXGNQZCUCSC-UHFFFAOYNA-N 3147040 836 N <NA> ... CASP3 557 InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,... FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2 <NA>
996981 BAKNXGNQZCUCSC-UHFFFAOYNA-N 3147040 839 N <NA> ... CASP6 559 InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,... FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2 <NA>
996982 BAKNXGNQZCUCSC-UHFFFAOYNA-N 3147040 8484 N <NA> ... GALR3 5061 InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,... FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2 <NA>
996983 BAKNXGNQZCUCSC-UHFFFAOYNA-N 3147040 84867 N <NA> ... PTPN5 12632 InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,... FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2 <NA>
996984 BAKNXGNQZCUCSC-UHFFFAOYNA-N 3147040 8698 N <NA> ... S1PR4 5211 InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,... FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2 <NA>
[996985 rows x 13 columns]
>>> df1['8'] # Ignore the `'8'` label: header names cannot be inferred when a byte_range is provided, so the columns are labeled by position.
0 BCL2L2
1 RGS4
2 SLC5A7
3 APOBEC3G
4 ATXN2
...
996980 CASP3
996981 CASP6
996982 GALR3
996983 PTPN5
996984 S1PR4
Name: 8, Length: 996985, dtype: object
>>> df2 = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv', delimiter='\t', byte_range=(536870912, 268435456), usecols=['Gene_Symbol', 'pXC50'], header=None, names=cudf.Index(['pXC50', 'Gene_Symbol'], dtype='object').to_pandas())
>>> df2['Gene_Symbol']
0 2920708
1 2920708
2 2920708
3 2920708
4 2920708
...
996980 3147040
996981 3147040
996982 3147040
996983 3147040
996984 3147040
Name: Gene_Symbol, Length: 996985, dtype: object
In |
I wonder if the issue exists because the order of names in |
Changing the order doesn't fix the data either, i.e., the data doesn't match either of the columns.
Pretty sure names of all columns should be passed via |
Discussed offline with @vuule and discovered this is purely a |
Fixes: #9387

This PR fixes `usecols` parameter usage in `dask_cudf.read_csv`. When a CSV file is read in byte ranges, the CSV reader must be passed the complete list of column names in the `names` parameter, while `usecols` must still be passed so that only the requested columns are returned.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9618
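The behavior described in the PR can be illustrated without a GPU. The following is a minimal stdlib sketch, not cuDF's implementation: it assumes that names are matched to columns by position in a header-less chunk, which is consistent with the repro output where `Gene_Symbol` returned values such as 2920708. The function `read_chunk`, the four-column layout, and the `pXC50` values are hypothetical toy data.

```python
import csv
import io

# Hypothetical four-column layout; the real file has 13 columns.
ALL_NAMES = ["Key", "CID", "Gene_Symbol", "pXC50"]

# A header-less chunk, as seen by any byte range that does not start at offset 0.
CHUNK = "KEY-A\t2920708\tBCL2L2\t4.9\nKEY-B\t3147040\tCASP3\t5.1\n"


def read_chunk(chunk, names, usecols):
    """Label columns by position with `names`, then keep only `usecols`."""
    out = {col: [] for col in usecols}
    for row in csv.reader(io.StringIO(chunk), delimiter="\t"):
        # zip pairs names with fields positionally; extra fields are dropped.
        labelled = dict(zip(names, row))
        for col in usecols:
            out[col].append(labelled[col])
    return out


wanted = ["Gene_Symbol", "pXC50"]

# Buggy pattern: only the selected names are passed, so they label columns 0 and 1.
wrong = read_chunk(CHUNK, names=["pXC50", "Gene_Symbol"], usecols=wanted)

# Fixed pattern: all names are passed and usecols selects the subset.
right = read_chunk(CHUNK, names=ALL_NAMES, usecols=wanted)

print(wrong["Gene_Symbol"])
print(right["Gene_Symbol"])
```

In the buggy call, `Gene_Symbol` ends up labeling the second column of the chunk, so it returns the numeric IDs. Passing the full name list keeps each label attached to the correct position before the subset is selected.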
The `isin` function in dask_cudf returns a different result from the one obtained with cudf, dask, and pandas.
Steps/Code to reproduce bug
Data file is at https://zenodo.org/record/2543724/files/pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv.xz?download=1
Expected behavior
Expected the number to match.
Environment overview (please complete the following information)
Environment details