[BUG] Dask cuDF csv reader can incorrectly read rows when usecols is passed #9387

Closed
rilango opened this issue Oct 6, 2021 · 6 comments · Fixed by #9618
Labels: bug, cuIO, dask, libcudf, Python

rilango commented Oct 6, 2021

The `isin` function in dask_cudf returns a different result from cudf, dask, and pandas on the same data.

Steps/Code to reproduce bug
Data file is at https://zenodo.org/record/2543724/files/pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv.xz?download=1

  • The following code was used to filter with dask_cudf:
import dask_cudf

gene_list = ["ABL1", "ACHE", "ADAM17", "ADORA2A", "ADORA2B", "ADORA3", "ADRA1A", "ADRA1D",
             "ADRB1", "ADRB2", "ADRB3", "AKT1", "AKT2", "ALK", "ALOX5", "AR", "AURKA", 
             "AURKB", "BACE1", "CA1", "CA12", "CA2", "CA9", "CASP1", "CCKBR", "CCR2", 
             "CCR5", "CDK1", "CDK2", "CHEK1", "CHRM1", "CHRM2", "CHRM3", "CHRNA7", "CLK4", 
             "CNR1", "CNR2", "CRHR1", "CSF1R", "CTSK", "CTSS", "CYP19A1", "DHFR", "DPP4", 
             "DRD1", "DRD3", "DRD4", "DYRK1A", "EDNRA", "EGFR", "EPHX2", "ERBB2", "ESR1", 
             "ESR2", "F10", "F2", "FAAH", "FGFR1", "FLT1", "FLT3", "GHSR", "GNRHR", "GRM5", 
             "GSK3A", "GSK3B", "HDAC1", "HPGD", "HRH3", "HSD11B1", "HSP90AA1", "HTR2A", 
             "HTR2C", "HTR6", "HTR7", "IGF1R", "INSR", "ITK", "JAK2", "JAK3", "KCNH2", 
             "KDR", "KIT", "LCK", "MAOB", "MAPK14", "MAPK8", "MAPK9", "MAPKAPK2", "MC4R", 
             "MCHR1", "MET", "MMP1", "MMP13", "MMP2", "MMP3", "MMP9", "MTOR", "NPY5R", 
             "NR3C1", "NTRK1", "OPRD1", "OPRK1", "OPRL1", "OPRM1", "P2RX7", "PARP1", "PDE5A", 
             "PDGFRB", "PGR", "PIK3CA", "PIM1", "PIM2", "PLK1", "PPARA", "PPARD", "PPARG", 
             "PRKACA", "PRKCD", "PTGDR2", "PTGS2", "PTPN1", "REN", "ROCK1", "ROCK2", "S1PR1", 
             "SCN9A", "SIGMAR1", "SLC6A2", "SLC6A3", "SRC", "TACR1", "TRPV1", "VDR"]

data = dask_cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
                          delimiter='\t',
                          usecols=['Gene_Symbol', 'pXC50'])

test = data['Gene_Symbol'].isin(gene_list)
test = test.compute()
test.value_counts()

Result
False 70743570
True 106593
Name: Gene_Symbol, dtype: int32

  • Below is the result returned by cudf, which matches pandas and dask:

False 63240187
True 7609976
Name: Gene_Symbol, dtype: int64

Expected behavior
Expected the counts from dask_cudf to match those from cudf, pandas, and dask.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda

Environment details

 ***git***
 Not inside a git repository
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=20.04
 DISTRIB_CODENAME=focal
 DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
 NAME="Ubuntu"
 VERSION="20.04.3 LTS (Focal Fossa)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 20.04.3 LTS"
 VERSION_ID="20.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=focal
 UBUNTU_CODENAME=focal
 Linux rilango-dt1 5.11.0-37-generic #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Wed Oct  6 09:41:10 2021
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA RTX A6000    Off  | 00000000:67:00.0 Off |                  Off |
 | 35%   63C    P2    83W / 300W |   8545MiB / 48685MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  NVIDIA RTX A6000    Off  | 00000000:68:00.0 Off |                  Off |
 | 30%   45C    P8    17W / 300W |   1052MiB / 48682MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A      1209      G   /usr/lib/xorg/Xorg                  4MiB |
 |    0   N/A  N/A    198201      C   /usr/NX/bin/nxnode.bin            292MiB |
 |    0   N/A  N/A    200398      C   ...al/lib/vmd/vmd_LINUXAMD64      943MiB |
 |    0   N/A  N/A    700608      C   ...s/rapids-21.08/bin/python     7301MiB |
 |    1   N/A  N/A      1209      G   /usr/lib/xorg/Xorg                 83MiB |
 |    1   N/A  N/A      1389      G   /usr/bin/gnome-shell               31MiB |
 |    1   N/A  N/A    200398      C   ...al/lib/vmd/vmd_LINUXAMD64      933MiB |
 +-----------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                    x86_64
 CPU op-mode(s):                  32-bit, 64-bit
 Byte Order:                      Little Endian
 Address sizes:                   46 bits physical, 48 bits virtual
 CPU(s):                          20
 On-line CPU(s) list:             0-19
 Thread(s) per core:              2
 Core(s) per socket:              10
 Socket(s):                       1
 NUMA node(s):                    1
 Vendor ID:                       GenuineIntel
 CPU family:                      6
 Model:                           85
 Model name:                      Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
 Stepping:                        4
 CPU MHz:                         3300.000
 CPU max MHz:                     4200.0000
 CPU min MHz:                     1200.0000
 BogoMIPS:                        6599.98
 Virtualization:                  VT-x
 L1d cache:                       320 KiB
 L1i cache:                       320 KiB
 L2 cache:                        10 MiB
 L3 cache:                        16.5 MiB
 NUMA node0 CPU(s):               0-19
 Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
 Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
 Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
 Vulnerability Meltdown:          Mitigation; PTI
 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
 Vulnerability Srbds:             Not affected
 Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
 
 ***CMake***
 /usr/bin/cmake
 cmake version 3.16.3
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 
 ***Python***
 
 ***Environment Variables***
 PATH                            : /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/ngc
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    :
 PYTHON_PATH                     :
 
 conda not found
 ***pip packages***
 /usr/bin/pip
 Package                       Version
 ----------------------------- --------------------
 absl-py                       0.14.0
 aiohttp                       3.7.4.post0
 alabaster                     0.7.12
 alembic                       1.4.1
 antlr4-python3-runtime        4.8
 appdirs                       1.4.3
 apturl                        0.5.2
 argon2-cffi                   21.1.0
 async-timeout                 3.0.1
 attrs                         21.2.0
 Babel                         2.9.1
 backcall                      0.1.0
 bcrypt                        3.1.7
 bleach                        4.1.0
 blinker                       1.4
 boto3                         1.18.52
 botocore                      1.21.52
 bravado                       11.0.3
 bravado-core                  5.17.0
 Brlapi                        0.7.0
 build                         0.7.0
 cachetools                    4.2.4
 certifi                       2019.11.28
 cffi                          1.14.6
 cfgv                          3.3.1
 chardet                       3.0.4
 check-manifest                0.47
 Click                         7.0
 cloudpickle                   2.0.0
 codecov                       2.1.12
 colorama                      0.4.3
 comet-ml                      3.17.0
 command-not-found             0.3
 configobj                     5.0.6
 configparser                  5.0.2
 coverage                      5.5
 cryptography                  2.8
 cupshelpers                   1.0
 cycler                        0.10.0
 databricks-cli                0.15.0
 dbus-python                   1.2.16
 debugpy                       1.4.3
 decorator                     4.4.2
 defer                         1.0.6
 defusedxml                    0.7.1
 distlib                       0.3.0
 distro                        1.4.0
 distro-info                   0.23ubuntu1
 docker                        5.0.2
 docker-pycreds                0.4.0
 docstring-parser              0.11
 docutils                      0.17.1
 dulwich                       0.20.25
 duplicity                     0.8.12.0
 entrypoints                   0.3
 everett                       2.0.1
 fasteners                     0.14.1
 filelock                      3.0.12
 Flask                         2.0.1
 flatbuffers                   2.0
 fsspec                        2021.9.0
 future                        0.18.2
 gcsfs                         2021.9.0
 gitdb                         4.0.7
 GitPython                     3.1.24
 google-auth                   1.35.0
 google-auth-oauthlib          0.4.6
 greenlet                      1.1.2
 grpcio                        1.41.0
 gunicorn                      20.1.0
 gym                           0.20.0
 horovod                       0.22.1
 httplib2                      0.14.0
 hydra-core                    1.1.1
 identify                      2.2.15
 idna                          2.8
 imageio                       2.9.0
 imagesize                     1.2.0
 importlib-metadata            4.8.1
 importlib-resources           5.2.2
 iniconfig                     1.1.1
 ipykernel                     6.4.1
 ipyparallel                   7.1.0
 ipython                       7.13.0
 ipython-genutils              0.2.0
 ipywidgets                    7.6.5
 itsdangerous                  2.0.1
 jedi                          0.15.2
 Jinja2                        3.0.1
 jmespath                      0.10.0
 joblib                        1.0.1
 jsonargparse                  3.19.3
 jsonref                       0.2
 jsonschema                    4.0.1
 jupyter-client                7.0.5
 jupyter-core                  4.8.1
 jupyterlab-pygments           0.1.2
 jupyterlab-widgets            1.0.2
 keyring                       18.0.1
 kiwisolver                    1.3.2
 language-selector             0.1
 launchpadlib                  1.10.13
 lazr.restfulclient            0.14.2
 lazr.uri                      1.0.3
 lockfile                      0.12.2
 louis                         3.12.0
 macaroonbakery                1.3.1
 Mako                          1.1.0
 Markdown                      3.3.4
 MarkupSafe                    2.0.1
 matplotlib                    3.4.3
 matplotlib-inline             0.1.3
 mistune                       0.8.4
 mlflow                        1.20.2
 monotonic                     1.5
 more-itertools                4.2.0
 msgpack                       1.0.2
 multidict                     5.1.0
 mypy                          0.910
 mypy-extensions               0.4.3
 nbclient                      0.5.4
 nbconvert                     6.2.0
 nbformat                      5.1.3
 neptune-client                0.12.0
 nest-asyncio                  1.5.1
 netifaces                     0.10.4
 networkx                      2.6.3
 nltk                          3.6.3
 nodeenv                       1.6.0
 nose                          1.3.7
 notebook                      6.4.4
 numpy                         1.21.2
 nvidia-ml-py3                 7.352.0
 oauthlib                      3.1.0
 olefile                       0.46
 omegaconf                     2.1.1
 onnx                          1.10.1
 onnxruntime                   1.9.0
 packaging                     21.0
 pandas                        1.3.3
 pandocfilters                 1.5.0
 paramiko                      2.6.0
 parso                         0.5.2
 pathtools                     0.1.2
 pep517                        0.11.0
 pexpect                       4.6.0
 pickleshare                   0.7.5
 Pillow                        7.0.0
 pip                           20.0.2
 pkginfo                       1.7.1
 pluggy                        1.0.0
 pre-commit                    2.15.0
 prometheus-client             0.11.0
 prometheus-flask-exporter     0.18.2
 promise                       2.3
 prompt-toolkit                2.0.10
 protobuf                      3.6.1
 psutil                        5.8.0
 ptyprocess                    0.7.0
 py                            1.10.0
 pyasn1                        0.4.8
 pyasn1-modules                0.2.8
 pycairo                       1.16.2
 pycparser                     2.20
 pycups                        1.9.73
 pyDeprecate                   0.3.1
 Pygments                      2.3.1
 PyGObject                     3.36.0
 PyJWT                         1.7.1
 pymacaroons                   0.13.0
 PyNaCl                        1.3.0
 pyparsing                     2.4.7
 pyRFC3339                     1.1
 pyrsistent                    0.18.0
 pytest                        6.2.5
 python-apt                    2.0.0+ubuntu0.20.4.6
 python-dateutil               2.7.3
 python-debian                 0.1.36ubuntu1
 python-editor                 1.0.4
 pytorch-lightning             1.4.9
 pytz                          2019.3
 PyWavelets                    1.1.1
 pyxdg                         0.26
 PyYAML                        5.3.1
 pyzmq                         22.3.0
 qtconsole                     5.1.1
 QtPy                          1.11.2
 querystring-parser            1.2.4
 readme-renderer               30.0
 regex                         2021.9.24
 reportlab                     3.5.34
 requests                      2.22.0
 requests-oauthlib             1.3.0
 requests-toolbelt             0.9.1
 requests-unixsocket           0.2.0
 rfc3986                       1.5.0
 rsa                           4.7.2
 s3transfer                    0.5.0
 scikit-image                  0.18.3
 scikit-learn                  1.0
 scipy                         1.7.1
 screen-resolution-extra       0.0.0
 SecretStorage                 2.3.1
 semantic-version              2.8.5
 Send2Trash                    1.8.0
 sentry-sdk                    1.4.3
 setuptools                    45.2.0
 shortuuid                     1.0.1
 simplejson                    3.16.0
 six                           1.14.0
 smmap                         4.0.0
 snowballstemmer               2.1.0
 Sphinx                        4.2.0
 sphinxcontrib-applehelp       1.0.2
 sphinxcontrib-devhelp         1.0.2
 sphinxcontrib-htmlhelp        2.0.0
 sphinxcontrib-jsmath          1.0.1
 sphinxcontrib-qthelp          1.0.3
 sphinxcontrib-serializinghtml 1.1.5
 SQLAlchemy                    1.4.25
 sqlparse                      0.4.2
 ssh-import-id                 5.10
 subprocess32                  3.5.4
 swagger-spec-validator        2.7.3
 systemd-python                234
 tabulate                      0.8.9
 tensorboard                   2.6.0
 tensorboard-data-server       0.6.1
 tensorboard-plugin-wit        1.8.0
 termcolor                     1.1.0
 terminado                     0.12.1
 test-tube                     0.7.5
 testpath                      0.5.0
 threadpoolctl                 2.2.0
 tifffile                      2021.8.30
 toml                          0.10.2
 tomli                         1.2.1
 torch                         1.9.1
 torchmetrics                  0.5.1
 torchtext                     0.10.1
 torchvision                   0.10.1
 tornado                       6.1
 tqdm                          4.62.3
 traitlets                     4.3.3
 twine                         3.2.0
 typing-extensions             3.10.0.2
 ubuntu-advantage-tools        27.2
 ubuntu-drivers-common         0.0.0
 ufw                           0.36
 unattended-upgrades           0.1
 urllib3                       1.25.8
 usb-creator                   0.3.7
 virtualenv                    20.0.17
 wadllib                       1.3.3
 wandb                         0.12.3
 wcwidth                       0.1.8
 webencodings                  0.5.1
 websocket-client              1.2.1
 Werkzeug                      2.0.1
 wheel                         0.34.2
 widgetsnbextension            3.5.1
 wrapt                         1.12.1
 wurlitzer                     3.0.2
 xkit                          0.0.0
 yarl                          1.6.3
 yaspin                        2.1.0
 zipp                          3.6.0

rilango added the Needs Triage and bug labels Oct 6, 2021
beckernick (Member) commented

From a quick triage, this appears to be a bug in dask cudf when using usecols.

import dask_cudf

# With usecols: the Gene_Symbol counts come out wrong (first output below).
data = dask_cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
                          delimiter='\t',
                          usecols=['Gene_Symbol', 'pXC50']
                         )
print(data['Gene_Symbol'].compute().value_counts())

# Without usecols: the counts match cudf (second output below).
data = dask_cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
                          delimiter='\t',
                          # usecols=['Gene_Symbol', 'pXC50']
                         )
print(data['Gene_Symbol'].compute().value_counts())

ALPI             9218
IDH1             6552
CFTR             6502
CHRM1            5885
NFE2L2           5716
                 ... 
CHEMBL1518141       1
CHEMBL2022514       1
21775681            1
CHEMBL2349003       1
9880373             1
Name: Gene_Symbol, Length: 1404649, dtype: int32
ALPI      657351
IDH1      466456
CFTR      456865
CHRM1     420325
NFE2L2    409692
           ...  
DPEP1         20
GZMB          20
CDK18         20
SBK1          20
PCSK6         20
Name: Gene_Symbol, Length: 1331, dtype: int32

import cudf

gdf = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv',
                    delimiter='\t',
                    usecols=['Gene_Symbol', 'pXC50']
                   )
gdf['Gene_Symbol'].value_counts()

ALPI      657351
IDH1      466456
CFTR      456865
CHRM1     420325
NFE2L2    409692
           ...  
DPEP1         20
GZMB          20
CDK18         20
SBK1          20
PCSK6         20
Name: Gene_Symbol, Length: 1331, dtype: int32

beckernick changed the title from "[BUG]" to "[BUG] Dask cuDF csv reader can incorrectly read rows when usecols is passed" Oct 6, 2021
beckernick added the Python and dask labels and removed the Needs Triage label Oct 6, 2021
galipremsagar self-assigned this Nov 3, 2021

galipremsagar commented Nov 3, 2021

I was able to narrow the issue down to the libcudf layer. Here is a minimal repro:

>>> import cudf
>>> df1 = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv', delimiter='\t', byte_range=(536870912, 268435456), header=None)
>>> df1
                                  0        1      2  3     4  ...         8      9                                                 10                                                 11    12
0       ARNGBSRPBQXTBD-UHFFFAOYNA-N  2920708    599  N  <NA>  ...    BCL2L2    387  InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)...  S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3...  <NA>
1       ARNGBSRPBQXTBD-UHFFFAOYNA-N  2920708   5999  N  <NA>  ...      RGS4   3736  InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)...  S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3...  <NA>
2       ARNGBSRPBQXTBD-UHFFFAOYNA-N  2920708  60482  N  <NA>  ...    SLC5A7  10913  InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)...  S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3...  <NA>
3       ARNGBSRPBQXTBD-UHFFFAOYNA-N  2920708  60489  N  <NA>  ...  APOBEC3G   un61  InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)...  S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3...  <NA>
4       ARNGBSRPBQXTBD-UHFFFAOYNA-N  2920708   6311  N  <NA>  ...     ATXN2   3910  InChI=1/C24H20N2O4S/c27-23-20-12-6-7-13-21(20)...  S(=O)(=O)(N1C(C=2C(CC1)=CC=CC2)CN3C(=O)C=4C(C3...  <NA>
...                             ...      ...    ... ..   ...  ...       ...    ...                                                ...                                                ...   ...
996980  BAKNXGNQZCUCSC-UHFFFAOYNA-N  3147040    836  N  <NA>  ...     CASP3    557  InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,...          FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2  <NA>
996981  BAKNXGNQZCUCSC-UHFFFAOYNA-N  3147040    839  N  <NA>  ...     CASP6    559  InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,...          FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2  <NA>
996982  BAKNXGNQZCUCSC-UHFFFAOYNA-N  3147040   8484  N  <NA>  ...     GALR3   5061  InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,...          FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2  <NA>
996983  BAKNXGNQZCUCSC-UHFFFAOYNA-N  3147040  84867  N  <NA>  ...     PTPN5  12632  InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,...          FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2  <NA>
996984  BAKNXGNQZCUCSC-UHFFFAOYNA-N  3147040   8698  N  <NA>  ...     S1PR4   5211  InChI=1/C13H15F3N2O3/c1-2-3-5-9-8-12(20,13(14,...          FC(F)(F)C1(O)N(N=C(C1)CCCC)C(=O)C=2OC=CC2  <NA>

[996985 rows x 13 columns]
>>> df1['8']           # Ignore the `'8'`, this is because we cannot infer header names while providing a byte_range. Hence the Index. 
0           BCL2L2
1             RGS4
2           SLC5A7
3         APOBEC3G
4            ATXN2
            ...   
996980       CASP3
996981       CASP6
996982       GALR3
996983       PTPN5
996984       S1PR4
Name: 8, Length: 996985, dtype: object
>>> df2 = cudf.read_csv('pubchem.chembl.dataset4publication_inchi_smiles_v2.tsv', delimiter='\t', byte_range=(536870912, 268435456), usecols=['Gene_Symbol', 'pXC50'], header=None, names=cudf.Index(['pXC50', 'Gene_Symbol'], dtype='object').to_pandas())
>>> df2['Gene_Symbol']
0         2920708
1         2920708
2         2920708
3         2920708
4         2920708
           ...   
996980    3147040
996981    3147040
996982    3147040
996983    3147040
996984    3147040
Name: Gene_Symbol, Length: 996985, dtype: object

The values of df1['8'] and df2['Gene_Symbol'] should be the same, but here they are not. As a result, mismatched partitions get appended to one another in dask_cudf, which produces an incorrect dataset in the end.
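
One way to picture the mismatch, as a minimal pure-Python sketch rather than cuDF's actual code path: a byte-range chunk has no header row, so whatever names are supplied get bound to fields by position. The helper `label_row` and the shortened four-field row are hypothetical; only the field values are taken from the output above.

```python
# Hypothetical illustration; this is not cuDF code.
# A headerless row, shortened to four tab-separated fields from the repro output.
row = "ARNGBSRPBQXTBD-UHFFFAOYNA-N\t2920708\t599\tBCL2L2"


def label_row(line, names):
    """Bind names to fields by position, as a headerless read would."""
    fields = line.split("\t")
    return dict(zip(names, fields))


# Only the two wanted names are supplied, so they land on fields 0 and 1.
partial = label_row(row, ["pXC50", "Gene_Symbol"])
print(partial["Gene_Symbol"])  # 2920708
```

This matches what df2 shows: `Gene_Symbol` ends up holding the second field of each row instead of the gene name.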

galipremsagar added the cuIO and libcudf labels Nov 3, 2021

vuule commented Nov 3, 2021

I wonder if the issue exists because the order of names in usecols and names is different. Will look into this once I'm done with ORC+decimal128.

vuule self-assigned this Nov 3, 2021
galipremsagar (Contributor) commented

> I wonder if the issue exists because the order of names in usecols and names is different.

Changing the order doesn't seem to fix the data either, i.e., the data doesn't match either of the columns.


vuule commented Nov 4, 2021

Pretty sure names of all columns should be passed via names. Not sure how the reader is expected to infer that Gene_Symbol is the 8th column from the given input parameters.
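
To make the point concrete, here is a small sketch. The helper `resolve_usecols` and the placeholder names `col_a` and `col_b` are assumptions for illustration, not part of cuDF; only `Gene_Symbol` and `pXC50` come from this issue. Selecting a column by name can only find the right position when the name list covers every column in file order.

```python
def resolve_usecols(names, usecols):
    """Map each requested column name to its position in the name list."""
    positions = {name: i for i, name in enumerate(names)}
    missing = [c for c in usecols if c not in positions]
    if missing:
        raise KeyError(f"columns not found in names: {missing}")
    return [positions[c] for c in usecols]


# Hypothetical full header, in file order.
all_names = ["col_a", "col_b", "Gene_Symbol", "pXC50"]
print(resolve_usecols(all_names, ["Gene_Symbol", "pXC50"]))  # [2, 3]

# With only the two requested names, the lookup can only yield positions 0 and 1.
print(resolve_usecols(["pXC50", "Gene_Symbol"], ["Gene_Symbol", "pXC50"]))  # [1, 0]
```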

@galipremsagar
Copy link
Contributor

Discussed offline with @vuule and discovered this is purely a dask_cudf issue. I have identified the root cause, opening a fix.

rapids-bot closed this as completed in #9618 Nov 5, 2021
rapids-bot pushed a commit that referenced this issue Nov 5, 2021
Fixes: #9387 

This PR fixes `usecols` parameter usage in `dask_cudf.read_csv`. When the CSV is read using byte ranges, the CSV reader has to be passed the complete column names in the `names` param, and `usecols` should still be passed so that only the needed columns are returned.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9618
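
The idea described in the commit message can be sketched in plain Python. This is not the code from #9618: `read_partition`, the sample data, and the 16-byte chunk size are all assumptions for illustration, and the sketch ignores quoted fields that contain newlines. The header is read once, every partition receives the complete `names`, and `usecols` is applied only after the fields have been labeled.

```python
def read_partition(data, start, size, names, usecols, delimiter="\t"):
    """Read the rows that begin in data[start:start+size]; keep only usecols.

    `names` must list every column in file order.
    """
    end = start + size
    pos = start
    if start == 0:
        # The first partition owns the header line; skip it.
        pos = data.index(b"\n") + 1
    elif data[start - 1:start] != b"\n":
        # Landed mid-row: that row belongs to an earlier partition.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    rows = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        fields = data[pos:stop].decode().split(delimiter)
        record = dict(zip(names, fields))             # label with the full header
        rows.append({c: record[c] for c in usecols})  # then select columns
        pos = stop + 1
    return rows


# Hypothetical sample file: header plus three rows.
data = b"a\tGene_Symbol\tpXC50\n1\tABL1\t5.0\n2\tACHE\t6.1\n3\tEGFR\t7.2\n"
names = data.split(b"\n", 1)[0].decode().split("\t")  # full header, read once
usecols = ["Gene_Symbol", "pXC50"]
chunk = 16
parts = [read_partition(data, s, chunk, names, usecols)
         for s in range(0, len(data), chunk)]
genes = [r["Gene_Symbol"] for part in parts for r in part]
print(genes)  # ['ABL1', 'ACHE', 'EGFR']
```

Each row is read by exactly one partition (the one in which it starts), and every partition labels fields the same way, so concatenating the partitions gives consistent columns.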