[BUG] cudf incorrect when merging on both index level and column when specifying left_on and right_on #11550

Closed
eriknw opened this issue Aug 17, 2022 · 4 comments · Fixed by #12271
Labels: 0 - Backlog, bug, Python

eriknw commented Aug 17, 2022

Describe the bug
dask_cudf does not give the expected result when merging two dataframes on an index level and a column at the same time. pandas, dask.dataframe, and cudf all agree with each other; only dask_cudf differs.

Steps/Code to reproduce bug
In the merge below, note that for on=["a", "b"], "a" is the index and "b" is a column.

import cudf
import dask_cudf

df = cudf.DataFrame({'a': [1, 2, 1, 2], 'b': [2, 3, 3, 4]}).set_index('a')
df2 = cudf.DataFrame({'a': [1, 2, 1, 3], 'b': [2, 30, 3, 4]}).set_index('a')
df2['c'] = 10
expected = df2.merge(df, on=["a", "b"], how="outer")
expected = expected.sort_values(list(expected.columns))

ddf = dask_cudf.from_cudf(df, npartitions=2)
ddf2 = dask_cudf.from_cudf(df2, npartitions=2)
result = ddf2.merge(ddf, on=["a", "b"], how="outer").compute()
result = result.sort_values(list(result.columns))

cudf.testing.assert_frame_equal(result, expected)  # raises

The two results are:

>>> result  # from dask_cudf
   b_x     c   b_y
a
1    2    10     2
1    3    10     3
2    3  <NA>     3
3    4    10  <NA>
2    4  <NA>     4
2   30    10  <NA>

>>> expected  # from cudf
    b     c
a
1   2    10
1   3    10
2   3  <NA>
3   4    10
2   4  <NA>
2  30    10

Expected behavior
The same code as above works as expected when using pandas and dask.dataframe instead:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [2, 3, 3, 4]}).set_index('a')
df2 = pd.DataFrame({'a': [1, 2, 1, 3], 'b': [2, 30, 3, 4]}).set_index('a')
df2['c'] = 10
expected = df2.merge(df, on=["a", "b"], how="outer")
expected = expected.sort_values(list(expected.columns))

ddf = dd.from_pandas(df, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
result = ddf2.merge(ddf, on=["a", "b"], how="outer").compute()
result = result.sort_values(list(result.columns))

pd.testing.assert_frame_equal(result, expected)
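
For reference, the merge on an index level plus a column can also be written as a plain column merge by resetting the index first. The sketch below uses pandas only and checks that both spellings give the same rows. `merge_via_columns` and `normalized` are hypothetical helper names introduced here for illustration. Whether the same pattern avoids the problem in dask_cudf is an assumption and was not verified on a GPU.

```python
import pandas as pd

# Same frames as in the report: "a" is the index, "b" is a column.
df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 3, 3, 4]}).set_index("a")
df2 = pd.DataFrame({"a": [1, 2, 1, 3], "b": [2, 30, 3, 4]}).set_index("a")
df2["c"] = 10


def merge_via_columns(left, right, on, index_name, how="outer"):
    """Hypothetical helper: move the index into a column, merge on plain
    columns, then restore the index."""
    merged = left.reset_index().merge(right.reset_index(), on=on, how=how)
    return merged.set_index(index_name)


def normalized(frame):
    """Flatten the index and sort rows so two results can be compared."""
    flat = frame.reset_index()
    return flat.sort_values(list(flat.columns)).reset_index(drop=True)


# Merge directly on the index level "a" and the column "b".
direct = df2.merge(df, on=["a", "b"], how="outer")
# Merge on columns only, then put "a" back as the index.
workaround = merge_via_columns(df2, df, on=["a", "b"], index_name="a")

# Compare values only; dtype details are not the point of this check.
pd.testing.assert_frame_equal(
    normalized(direct), normalized(workaround), check_dtype=False
)
```

If this equivalence also holds for dask_cudf, calling `reset_index()` on both inputs before the merge might serve as a stopgap until the bug is fixed.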

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [conda]

Environment details

 ***git***
 commit 65a782112f4b76941483adf17f9a30a6824f6164 (HEAD -> branch-22.10, origin/branch-22.10, origin/HEAD)
 Author: Mike Wilson <[email protected]>
 Date:   Tue Aug 16 17:40:29 2022 -0400

 Removing unnecessary asserts in parquet tests (#11544)

 As noticed in review of #11524 there are unnecessary asserts in the parquet tests. This removes those.

 closes #11541

 Authors:
 - Mike Wilson (https://github.com/hyperbolic2346)

 Approvers:
 - Vukasin Milovanovic (https://github.com/vuule)
 - Nghia Truong (https://github.com/ttnghia)

 URL: https://github.com/rapidsai/cudf/pull/11544
 ***git submodules***

 ***OS Information***
 DGX_NAME="DGX Server"
 DGX_PRETTY_NAME="NVIDIA DGX Server"
 DGX_SWBUILD_DATE="2020-03-04"
 DGX_SWBUILD_VERSION="4.4.0"
 DGX_COMMIT_ID="ee09ebc"
 DGX_PLATFORM="DGX Server for DGX-1"
 DGX_SERIAL_NUMBER="QTFCOU8220028"

 DGX_R418_REPO_ENABLED=20220727-142458

 DGX_OTA_VERSION="4.13.0"
 DGX_OTA_DATE="Wed Jul 27 14:38:05 PDT 2022"
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
 NAME="Ubuntu"
 VERSION="18.04.6 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.6 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux dgx12 4.15.0-189-generic #200-Ubuntu SMP Wed Jun 22 19:53:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

 ***GPU Information***
 Tue Aug 16 20:13:59 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
 | N/A   31C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
 | N/A   29C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
 | N/A   28C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
 | N/A   28C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
 | N/A   30C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
 | N/A   30C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
 | N/A   32C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
 | N/A   28C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+

 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |  No running processes found                                                 |
 +-----------------------------------------------------------------------------+

 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              80
 On-line CPU(s) list: 0-79
 Thread(s) per core:  2
 Core(s) per socket:  20
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               79
 Model name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
 Stepping:            1
 CPU MHz:             2851.616
 CPU max MHz:         3600.0000
 CPU min MHz:         1200.0000
 BogoMIPS:            4389.85
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            256K
 L3 cache:            51200K
 NUMA node0 CPU(s):   0-19,40-59
 NUMA node1 CPU(s):   20-39,60-79
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

 ***CMake***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev9/bin/cmake
 cmake version 3.24.0

 CMake suite maintained and supported by Kitware (kitware.com/cmake).

 ***g++***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev9/bin/g++
 g++ (conda-forge gcc 10.4.0-16) 10.4.0
 Copyright (C) 2020 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


 ***nvcc***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev9/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2021 NVIDIA Corporation
 Built on Thu_Nov_18_09:45:30_PST_2021
 Cuda compilation tools, release 11.5, V11.5.119
 Build cuda_11.5.r11.5/compiler.30672275_0

 ***Python***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev9/bin/python
 Python 3.9.13

 ***Environment Variables***
 PATH                            : /home/nfs/erwelch/miniconda3/envs/cugraph_dev9/bin:/home/nfs/erwelch/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/nfs/erwelch/miniconda3/envs/cugraph_dev9
 PYTHON_PATH                     :

 ***conda packages***
 /home/nfs/erwelch/miniconda3/condabin/conda
 # packages in environment at /home/nfs/erwelch/miniconda3/envs/cugraph_dev9:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                       2_gnu    conda-forge
 abseil-cpp                20211102.0           h93e1e8c_2    conda-forge
 alabaster                 0.7.12                     py_0    conda-forge
 argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
 argon2-cffi-bindings      21.2.0           py39hb9d737c_2    conda-forge
 arrow-cpp                 9.0.0           py39h2531139_1_cpu    conda-forge
 asttokens                 2.0.8              pyhd8ed1ab_0    conda-forge
 asvdb                     0.4.2               g90e8f2c_40    rapidsai
 attrs                     22.1.0             pyh71513ae_1    conda-forge
 aws-c-cal                 0.5.11               h95a6274_0    conda-forge
 aws-c-common              0.6.2                h7f98852_0    conda-forge
 aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
 aws-c-io                  0.10.5               hfb6a706_0    conda-forge
 aws-checksums             0.1.11               ha31a3da_7    conda-forge
 aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
 babel                     2.10.3             pyhd8ed1ab_0    conda-forge
 backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
 backports                 1.0                        py_2    conda-forge
 backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
 beautifulsoup4            4.11.1             pyha770c72_0    conda-forge
 binutils                  2.36.1               hdd6e379_2    conda-forge
 binutils_impl_linux-64    2.36.1               h193b22a_2    conda-forge
 binutils_linux-64         2.36                hf3e587d_10    conda-forge
 bleach                    5.0.1              pyhd8ed1ab_0    conda-forge
 bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
 boost                     1.78.0           py39hac2352c_0    conda-forge
 boost-cpp                 1.78.0               h75c5d50_1    conda-forge
 boto3                     1.24.52            pyhd8ed1ab_0    conda-forge
 botocore                  1.27.52            pyhd8ed1ab_0    conda-forge
 brotlipy                  0.7.0           py39hb9d737c_1004    conda-forge
 bzip2                     1.0.8                h7f98852_4    conda-forge
 c-ares                    1.18.1               h7f98852_0    conda-forge
 c-compiler                1.4.2                h166bdaf_0    conda-forge
 ca-certificates           2022.6.15            ha878542_0    conda-forge
 cachetools                5.2.0              pyhd8ed1ab_0    conda-forge
 certifi                   2022.6.15        py39hf3d152e_0    conda-forge
 cffi                      1.15.1           py39he91dace_0    conda-forge
 charset-normalizer        2.1.0              pyhd8ed1ab_0    conda-forge
 clang                     11.1.0               ha770c72_1    conda-forge
 clang-11                  11.1.0          default_ha53f305_1    conda-forge
 clang-tools               11.1.0          default_ha53f305_1    conda-forge
 clangxx                   11.1.0          default_ha53f305_1    conda-forge
 click                     8.1.3            py39hf3d152e_0    conda-forge
 cloudpickle               2.1.0              pyhd8ed1ab_0    conda-forge
 cmake                     3.24.0               h5432695_0    conda-forge
 colorama                  0.4.5              pyhd8ed1ab_0    conda-forge
 commonmark                0.9.1                      py_0    conda-forge
 coverage                  6.4.4            py39hb9d737c_0    conda-forge
 cryptography              37.0.4           py39hd97740a_0    conda-forge
 cuda-python               11.7.0           py39h3fd9d12_0    nvidia
 cudatoolkit               11.5.1               hcf5317a_9    nvidia
 cudf                      22.10.00a220816 cuda_11_py39_g0c4b319855_131    rapidsai-nightly
 cupy                      10.6.0           py39hc3c280e_0    conda-forge
 cxx-compiler              1.4.2                h924138e_0    conda-forge
 cython                    0.29.32          py39h5a03fae_0    conda-forge
 cytoolz                   0.12.0           py39hb9d737c_0    conda-forge
 dask                      2022.8.0           pyhd8ed1ab_1    conda-forge
 dask-core                 2022.8.0           pyhd8ed1ab_0    conda-forge
 dask-cuda                 22.10.00a220816 py39_g7245104_14    rapidsai-nightly
 dask-cudf                 22.10.00a220816 cuda_11_py39_g0c4b319855_131    rapidsai-nightly
 debugpy                   1.6.3            py39h5a03fae_0    conda-forge
 decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
 defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
 distributed               2022.8.0           pyhd8ed1ab_1    conda-forge
 distro                    1.6.0              pyhd8ed1ab_0    conda-forge
 dlpack                    0.5                  h9c3ff4c_0    conda-forge
 docutils                  0.19             py39hf3d152e_0    conda-forge
 doxygen                   1.9.3                h583eb01_1    conda-forge
 entrypoints               0.4                pyhd8ed1ab_0    conda-forge
 executing                 0.10.0             pyhd8ed1ab_0    conda-forge
 expat                     2.4.8                h27087fc_0    conda-forge
 fastavro                  1.6.0            py39hb9d737c_0    conda-forge
 fastrlock                 0.8              py39h5a03fae_2    conda-forge
 flake8                    5.0.4              pyhd8ed1ab_0    conda-forge
 flit-core                 3.7.1              pyhd8ed1ab_0    conda-forge
 freetype                  2.12.1               hca18f0e_0    conda-forge
 fsspec                    2022.7.1           pyhd8ed1ab_0    conda-forge
 future                    0.18.2           py39hf3d152e_5    conda-forge
 gcc                       10.4.0              hb92f740_10    conda-forge
 gcc_impl_linux-64         10.4.0              h7ee1905_16    conda-forge
 gcc_linux-64              10.4.0              h9215b83_10    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 gh                        2.14.4               ha8f183a_0    conda-forge
 glog                      0.6.0                h6f12383_0    conda-forge
 gmock                     1.10.0               h4bd325d_7    conda-forge
 grpc-cpp                  1.46.4               h6fc47f4_3    conda-forge
 gtest                     1.10.0               h4bd325d_7    conda-forge
 gxx                       10.4.0              hb92f740_10    conda-forge
 gxx_impl_linux-64         10.4.0              h7ee1905_16    conda-forge
 gxx_linux-64              10.4.0              h6e491c6_10    conda-forge
 heapdict                  1.0.1                      py_0    conda-forge
 icu                       70.1                 h27087fc_0    conda-forge
 idna                      3.3                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
 importlib-metadata        4.11.4           py39hf3d152e_0    conda-forge
 importlib_resources       5.9.0              pyhd8ed1ab_0    conda-forge
 iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
 ipykernel                 6.15.1             pyh210e3f2_0    conda-forge
 ipython                   8.4.0            py39hf3d152e_0    conda-forge
 ipython_genutils          0.2.0                      py_1    conda-forge
 jedi                      0.18.1             pyhd8ed1ab_2    conda-forge
 jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joblib                    1.1.0              pyhd8ed1ab_0    conda-forge
 jpeg                      9e                   h166bdaf_2    conda-forge
 jsonschema                4.9.1              pyhd8ed1ab_0    conda-forge
 jupyter_client            7.3.4              pyhd8ed1ab_0    conda-forge
 jupyter_core              4.11.1           py39hf3d152e_0    conda-forge
 jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
 kernel-headers_linux-64   2.6.32              he073ed8_15    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.19.3               h3790be6_0    conda-forge
 lcms2                     2.12                 hddcbb42_0    conda-forge
 ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
 lerc                      4.0.0                h27087fc_0    conda-forge
 libabseil                 20211102.0      cxx17_h48a1fff_2    conda-forge
 libblas                   3.9.0           16_linux64_openblas    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
 libbrotlidec              1.0.9                h166bdaf_7    conda-forge
 libbrotlienc              1.0.9                h166bdaf_7    conda-forge
 libcblas                  3.9.0           16_linux64_openblas    conda-forge
 libclang-cpp11.1          11.1.0          default_ha53f305_1    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcudf                   22.10.00a220816 cuda11_g65a782112f_135    rapidsai-nightly
 libcugraphops             22.10.00a220810 cuda11_g0438dcaf_19    rapidsai-nightly
 libcurl                   7.83.1               h7bff187_0    conda-forge
 libcusolver               11.4.0.1                      0    nvidia
 libcusparse               11.7.4.91                     0    nvidia
 libdeflate                1.13                 h166bdaf_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 h516909a_1    conda-forge
 libevent                  2.1.10               h9b69904_4    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-devel_linux-64     10.4.0              h74af60c_16    conda-forge
 libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
 libgfortran-ng            12.1.0              h69a702a_16    conda-forge
 libgfortran5              12.1.0              hdcd56e2_16    conda-forge
 libgomp                   12.1.0              h8d9b700_16    conda-forge
 libgoogle-cloud           1.40.2               hefc27d0_0    conda-forge
 libiconv                  1.16                 h516909a_0    conda-forge
 liblapack                 3.9.0           16_linux64_openblas    conda-forge
 libllvm11                 11.1.0               hf817b99_3    conda-forge
 libnghttp2                1.47.0               hdcd2b5c_1    conda-forge
 libnsl                    2.0.0                h7f98852_0    conda-forge
 libopenblas               0.3.21          pthreads_h78a6416_1    conda-forge
 libpng                    1.6.37               h753d276_4    conda-forge
 libprotobuf               3.20.1               h6239696_1    conda-forge
 libraft-headers           22.10.00a220816 cuda11_gc9cce720_26    rapidsai-nightly
 librmm                    22.10.00a220816 cuda11_gadcfb934_9    rapidsai-nightly
 libsanitizer              10.4.0              hde28e3b_16    conda-forge
 libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libsqlite                 3.39.2               h753d276_1    conda-forge
 libssh2                   1.10.0               haa6b8db_3    conda-forge
 libstdcxx-devel_linux-64  10.4.0              h74af60c_16    conda-forge
 libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
 libthrift                 0.16.0               h519c5ea_1    conda-forge
 libtiff                   4.4.0                h0e0dad5_3    conda-forge
 libutf8proc               2.7.0                h7f98852_0    conda-forge
 libuuid                   2.32.1            h7f98852_1000    conda-forge
 libuv                     1.44.2               h166bdaf_0    conda-forge
 libwebp-base              1.2.4                h166bdaf_0    conda-forge
 libxcb                    1.13              h7f98852_1004    conda-forge
 libxml2                   2.9.14               h22db469_4    conda-forge
 libxslt                   1.1.35               h8affb1d_0    conda-forge
 libzlib                   1.2.12               h166bdaf_2    conda-forge
 llvmlite                  0.38.1           py39h7d9a04d_0    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
 lxml                      4.9.1            py39hb9d737c_0    conda-forge
 lz4                       4.0.0            py39h029007f_2    conda-forge
 lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
 make                      4.3                  hd18ef5c_1    conda-forge
 markdown                  3.4.1              pyhd8ed1ab_0    conda-forge
 markupsafe                2.1.1            py39hb9d737c_1    conda-forge
 matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
 mccabe                    0.7.0              pyhd8ed1ab_0    conda-forge
 mistune                   0.8.4           py39h3811e60_1005    conda-forge
 msgpack-python            1.0.4            py39hf939315_0    conda-forge
 nbclient                  0.6.6              pyhd8ed1ab_0    conda-forge
 nbconvert                 6.5.3              pyhd8ed1ab_0    conda-forge
 nbconvert-core            6.5.3              pyhd8ed1ab_0    conda-forge
 nbconvert-pandoc          6.5.3              pyhd8ed1ab_0    conda-forge
 nbformat                  5.4.0              pyhd8ed1ab_0    conda-forge
 nbsphinx                  0.8.9              pyhd8ed1ab_0    conda-forge
 nccl                      2.13.4.1             h0800d71_0    conda-forge
 ncurses                   6.3                  h27087fc_1    conda-forge
 nest-asyncio              1.5.5              pyhd8ed1ab_0    conda-forge
 networkx                  2.8.5              pyhd8ed1ab_0    conda-forge
 notebook                  6.4.12             pyha770c72_0    conda-forge
 numba                     0.55.2           py39h66db6d7_0    conda-forge
 numpy                     1.22.4           py39hc58783e_0    conda-forge
 numpydoc                  1.4.0              pyhd8ed1ab_1    conda-forge
 nvcc_linux-64             10.1                hcaf9a05_10
 nvtx                      0.2.3            py39h3811e60_1    conda-forge
 openjpeg                  2.5.0                h7d73246_1    conda-forge
 openssl                   1.1.1q               h166bdaf_0    conda-forge
 orc                       1.7.5                h6c59b99_0    conda-forge
 packaging                 21.3               pyhd8ed1ab_0    conda-forge
 pandas                    1.4.3            py39h1832856_0    conda-forge
 pandoc                    2.19                 ha770c72_0    conda-forge
 pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
 parso                     0.8.3              pyhd8ed1ab_0    conda-forge
 partd                     1.3.0              pyhd8ed1ab_0    conda-forge
 pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    9.2.0            py39hd5dbb17_2    conda-forge
 pip                       22.2.2             pyhd8ed1ab_0    conda-forge
 pkgutil-resolve-name      1.3.10             pyhd8ed1ab_0    conda-forge
 pluggy                    1.0.0            py39hf3d152e_3    conda-forge
 prometheus_client         0.14.1             pyhd8ed1ab_0    conda-forge
 prompt-toolkit            3.0.30             pyha770c72_0    conda-forge
 protobuf                  3.20.1           py39h5a03fae_0    conda-forge
 psutil                    5.9.1            py39hb9d737c_0    conda-forge
 pthread-stubs             0.4               h36c2ea0_1001    conda-forge
 ptxcompiler               0.2.0            py39h107f55c_0    rapidsai
 ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
 pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
 py                        1.11.0             pyh6c4a22f_0    conda-forge
 py-cpuinfo                8.0.0              pyhd8ed1ab_0    conda-forge
 pyarrow                   9.0.0           py39h58137f1_1_cpu    conda-forge
 pycodestyle               2.9.1              pyhd8ed1ab_0    conda-forge
 pycparser                 2.21               pyhd8ed1ab_0    conda-forge
 pydata-sphinx-theme       0.9.0              pyhd8ed1ab_1    conda-forge
 pyflakes                  2.5.0              pyhd8ed1ab_0    conda-forge
 pygal                     2.4.0                      py_0    conda-forge
 pygments                  2.13.0             pyhd8ed1ab_0    conda-forge
 pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
 pyopenssl                 22.0.0             pyhd8ed1ab_0    conda-forge
 pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
 pyraft                    22.10.00a220816 cuda11_py39_gc9cce720_26    rapidsai-nightly
 pyrsistent                0.18.1           py39hb9d737c_1    conda-forge
 pysocks                   1.7.1            py39hf3d152e_5    conda-forge
 pytest                    7.1.2            py39hf3d152e_0    conda-forge
 pytest-benchmark          3.2.3              pyh9f0ad1d_0    conda-forge
 pytest-cov                3.0.0              pyhd8ed1ab_0    conda-forge
 python                    3.9.13          h9a8a25e_0_cpython    conda-forge
 python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
 python-fastjsonschema     2.16.1             pyhd8ed1ab_0    conda-forge
 python_abi                3.9                      2_cp39    conda-forge
 pytz                      2022.2.1           pyhd8ed1ab_0    conda-forge
 pyyaml                    6.0              py39hb9d737c_4    conda-forge
 pyzmq                     23.2.1           py39headdf64_0    conda-forge
 rapids-pytest-benchmark   0.0.14                     py_0    rapidsai
 re2                       2022.06.01           h27087fc_0    conda-forge
 readline                  8.1.2                h0f457ee_0    conda-forge
 recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
 requests                  2.28.1             pyhd8ed1ab_0    conda-forge
 rhash                     1.4.3                h166bdaf_0    conda-forge
 rmm                       22.10.00a220816 cuda11_py39_gadcfb934_9    rapidsai-nightly
 s2n                       1.0.10               h9b69904_0    conda-forge
 s3transfer                0.6.0              pyhd8ed1ab_0    conda-forge
 scikit-build              0.15.0             pyhb871ab6_0    conda-forge
 scikit-learn              1.1.2            py39he5e8d7e_0    conda-forge
 scipy                     1.9.0            py39h8ba3f38_0    conda-forge
 send2trash                1.8.0              pyhd8ed1ab_0    conda-forge
 setuptools                65.0.1           py39hf3d152e_0    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 snappy                    1.1.9                hbd366e4_1    conda-forge
 snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
 sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
 soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
 spdlog                    1.8.5                h4bd325d_1    conda-forge
 sphinx                    5.1.1              pyhd8ed1ab_1    conda-forge
 sphinx-copybutton         0.5.0              pyhd8ed1ab_0    conda-forge
 sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
 sphinxcontrib-applehelp   1.0.2                      py_0    conda-forge
 sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
 sphinxcontrib-htmlhelp    2.0.0              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
 sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
 sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
 sphinxcontrib-websupport  1.2.4              pyhd8ed1ab_1    conda-forge
 sqlite                    3.39.2               h4ff8645_1    conda-forge
 stack_data                0.4.0              pyhd8ed1ab_0    conda-forge
 sysroot_linux-64          2.12                he073ed8_15    conda-forge
 tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
 terminado                 0.15.0           py39hf3d152e_0    conda-forge
 threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
 tinycss2                  1.1.1              pyhd8ed1ab_0    conda-forge
 tk                        8.6.12               h27826a3_0    conda-forge
 toml                      0.10.2             pyhd8ed1ab_0    conda-forge
 tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
 toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
 tornado                   6.1              py39hb9d737c_3    conda-forge
 traitlets                 5.3.0              pyhd8ed1ab_0    conda-forge
 typing_extensions         4.3.0              pyha770c72_0    conda-forge
 tzdata                    2022c                h191b570_0    conda-forge
 ucx                       1.13.0               h538f049_0    conda-forge
 ucx-proc                  1.0.0                       gpu    rapidsai
 ucx-py                    0.28.00a220810  py39_gf585c50_16    rapidsai-nightly
 urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
 wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
 webencodings              0.5.1                      py_1    conda-forge
 wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
 xorg-libxau               1.0.9                h7f98852_0    conda-forge
 xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
 xz                        5.2.6                h166bdaf_0    conda-forge
 yaml                      0.2.5                h7f98852_2    conda-forge
 zeromq                    4.3.4                h9c3ff4c_1    conda-forge
 zict                      2.2.0              pyhd8ed1ab_0    conda-forge
 zipp                      3.8.1              pyhd8ed1ab_0    conda-forge
 zlib                      1.2.12               h166bdaf_2    conda-forge
 zstd                      1.5.2                h8a70e8d_4    conda-forge


Additional context
Encountered this when implementing functionality for MG PropertyGraph: rapidsai/cugraph#2523

@eriknw eriknw added Needs Triage Need team to review and classify bug Something isn't working labels Aug 17, 2022
eriknw added a commit to eriknw/cugraph that referenced this issue Sep 14, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@eriknw
Contributor Author

eriknw commented Sep 20, 2022

Note that merging on a combination of index levels and columns was added in pandas 0.23:

https://pandas.pydata.org/docs/whatsnew/v0.23.0.html#whatsnew-0230-enhancements-merge-on-columns-and-levels

and was added to Dask here:
dask/dask#2950
dask/dask#2960

This is an important optimization for dask.dataframe and dask_cudf because it avoids full shuffles and because they don't allow a MultiIndex.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment dask Dask issue and removed Needs Triage Need team to review and classify inactive-30d labels Oct 21, 2022
@wence-
Contributor

wence- commented Nov 29, 2022

This is a bug in cudf's treatment of merge when both left_on and right_on are provided:

import cudf

df = cudf.DataFrame({'a': [1, 2, 1, 2], 'b': [2, 3, 3, 4]}).set_index('a')
df2 = cudf.DataFrame({'a': [1, 2, 1, 3], 'b': [2, 30, 3, 4]}).set_index('a')

df2['c'] = 10
expected = df2.merge(df, on=["a", "b"], how="outer")

#    b     c
# a          
# 1   2    10
# 1   3    10
# 2  30    10
# 3   4    10
# 2   3  <NA>
# 2   4  <NA>

got = df2.merge(df, left_on=["a", "b"], right_on=["a", "b"], how="outer")

#    b_x     c   b_y
# a                 
# 1    2    10     2
# 1    3    10     3
# 2   30    10  <NA>
# 3    4    10  <NA>
# 2    3  <NA>     3
# 2    4  <NA>     4

I think this is because the Merge object incorrectly determines that there are no key columns with matching names if any of the key columns are indices.

This patch might be right:

diff --git a/python/cudf/cudf/core/join/join.py b/python/cudf/cudf/core/join/join.py
index 0e5ac8dc02..18f02170bc 100644
--- a/python/cudf/cudf/core/join/join.py
+++ b/python/cudf/cudf/core/join/join.py
@@ -147,12 +147,13 @@ class Merge:
         self._key_columns_with_same_name = (
             set(_coerce_to_tuple(on))
             if on
-            else set()
-            if (self._using_left_index or self._using_right_index)
             else {
                 lkey.name
                 for lkey, rkey in zip(self._left_keys, self._right_keys)
                 if lkey.name == rkey.name
+                and not (
+                    isinstance(lkey, _IndexIndexer) or isinstance(rkey, _IndexIndexer)
+                )
             }
         )
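
To make the suspected cause concrete, here is a minimal pure-Python sketch of the two rules for computing the set of same-named key columns. It is not cudf's actual code: `Key`, `same_name_keys_before`, and `same_name_keys_after` are hypothetical stand-ins, and `is_index` plays the role of the `isinstance(..., _IndexIndexer)` check in the patch. It also assumes that `_using_left_index` / `_using_right_index` mean "at least one key on that side is an index".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Hypothetical stand-in for a join-key indexer (not cudf's real class)."""

    name: str
    is_index: bool  # True if the key refers to an index level, not a column


def same_name_keys_before(left_keys, right_keys):
    # Old rule (as assumed here): if any key on either side is an index,
    # report that no key columns share a name.
    using_left_index = any(k.is_index for k in left_keys)
    using_right_index = any(k.is_index for k in right_keys)
    if using_left_index or using_right_index:
        return set()
    return {
        lkey.name
        for lkey, rkey in zip(left_keys, right_keys)
        if lkey.name == rkey.name
    }


def same_name_keys_after(left_keys, right_keys):
    # Patched rule: skip only the key pairs that involve an index;
    # ordinary columns with matching names are still recorded.
    return {
        lkey.name
        for lkey, rkey in zip(left_keys, right_keys)
        if lkey.name == rkey.name and not (lkey.is_index or rkey.is_index)
    }


# Mixed keys as in the reproducer: "a" is the index, "b" is a column.
left = [Key("a", is_index=True), Key("b", is_index=False)]
right = [Key("a", is_index=True), Key("b", is_index=False)]

before = same_name_keys_before(left, right)
after = same_name_keys_after(left, right)
print(before, after)
```

With these mixed keys the old rule returns an empty set, which would be consistent with "b" being treated as two unrelated columns and coming out as b_x and b_y. The patched rule still reports "b" as a shared key column.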
 

@wence- wence- self-assigned this Nov 29, 2022
@wence- wence- added Python Affects Python cuDF API. and removed dask Dask issue labels Nov 29, 2022
@wence- wence- changed the title [BUG] dask_cudf incorrect when merging on both index level and column [BUG] cudf incorrect when merging on both index level and column when specifying left_on and right_on Nov 29, 2022
wence- added a commit to wence-/cudf that referenced this issue Nov 30, 2022
Previously, if any of the join keys were indices, we assumed that they
all were, and provided an empty set of key columns with matching names
in the left and right dataframe. This does the wrong thing for mixed
join keys (on a combination of index and normal columns), producing
more output columns than is correct. To avoid this, only skip matching
key names if they name indices.

Closes rapidsai#11550.
rapids-bot bot pushed a commit that referenced this issue Nov 30, 2022
Closes #11550.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12271