[BUG] wheel tests do not fail when raft-dask wheel has unsatisfiable dependency requirements #2348

Closed
jameslamb opened this issue May 31, 2024 · 1 comment · Fixed by #2349
Labels: bug, ci

@jameslamb (Member)

Describe the bug

We recently observed a situation where raft-dask nightly wheels were being published with duplicated dependencies:

  • pylibraft-cu12==24.8.*,>=0.0.0a0 AND pylibraft==24.8.*,>=0.0.0a0
  • ucx-py-cu12==0.39.* AND ucx-py==0.39.*

The unsuffixed ones are a mistake, fixed in #2347. However... that was only caught by cugraph's CI (build link).

It should have been caught in raft's CI, probably by this step:

python -m pip install "raft_dask-${RAPIDS_PY_CUDA_SUFFIX}[test]>=0.0.0a0" --find-links dist/

Steps/Code to reproduce bug

Tried to reproduce a very recent CI build that passed despite using wheels affected by the issue fixed in #2347 (build link).

Ran a container mimicking what was used in that CI run.

docker run \
    --rm \
    --env NVIDIA_VISIBLE_DEVICES \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/raft" \
    --env RAPIDS_REF_NAME=pull-request/2343 \
    -it rapidsai/citestwheel:cuda12.2.2-ubuntu22.04-py3.9 \
    bash

Then ran the following code mirroring ci/test_wheel_raft_dask.sh (code link), with a bit of extra debugging added.

setup mimicking what happens in CI (click me)

Checked whether there was any extra pip configuration set up in the image.

pip config list

Just one setting: an extra index URL.

# :env:.extra-index-url='https://pypi.anaconda.org/rapidsai-wheels-nightly/simple'

Checked the version of pip.

pip --version
# 23.0.1

Installed pkginfo to inspect the wheels.

pip install pkginfo

Downloaded wheels from the same CI run and put them in separate directories.

mkdir -p ./dist
RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"

# git ref (entered in interactive prompt): 04186e4
RAPIDS_PY_WHEEL_NAME="raft_dask_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./dist
RAPIDS_PY_WHEEL_NAME="pylibraft_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./local-pylibraft-dep

Inspected them to confirm that:

  • both wheels' name fields have the -cu12 suffix
  • the raft_dask wheel depends on both pylibraft-cu12 and pylibraft

Both turned out to be true.

# raft-dask
pkginfo \
    --field=name \
    --field=version \
    --field=requires_dist \
    ./dist/raft_dask_cu12-*cp39*.whl

# name: raft-dask-cu12
# version: 24.8.0a20
# requires_dist: ['dask-cuda==24.8.*,>=0.0.0a0', 'distributed-ucxx-cu12==0.39.*', 'joblib>=0.11', 'numba>=0.57', 'numpy<2.0a0,>=1.23', 'pylibraft-cu12==24.8.*,>=0.0.0a0', 'pylibraft==24.8.*,>=0.0.0a0', 'rapids-dask-dependency==24.8.*,>=0.0.0a0', 'ucx-py-cu12==0.39.*', 'ucx-py==0.39.*', 'pytest-cov; extra == "test"', 'pytest==7.*; extra == "test"']

# pylibraft
pkginfo \
    --field=name \
    --field=version \
    --field=requires_dist \
    ./local-pylibraft-dep/pylibraft_cu12-*cp39*.whl
# name: pylibraft-cu12
# version: 24.8.0a20
# requires_dist: ['cuda-python<13.0a0,>=12.0', 'numpy<2.0a0,>=1.23', 'rmm-cu12==24.8.*,>=0.0.0a0', 'cupy-cuda12x>=12.0.0; extra == "test"', 'pytest-cov; extra == "test"', 'pytest==7.*; extra == "test"', 'scikit-learn; extra == "test"', 'scipy; extra == "test"']
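
As a quick sanity check (an extra step beyond what the CI script does), the duplicated suffixed/unsuffixed pins can be pulled straight out of the raft-dask metadata above:

```shell
# list only the pylibraft / ucx-py pins from the raft-dask wheel's requires_dist
pkginfo --field=requires_dist ./dist/raft_dask_cu12-*cp39*.whl \
    | grep -oE "(pylibraft|ucx-py)(-cu12)?==[^,']*"

# expected, based on the requires_dist shown above:
# pylibraft-cu12==24.8.*
# pylibraft==24.8.*
# ucx-py-cu12==0.39.*
# ucx-py==0.39.*
```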

Installed the pylibraft wheel, just as the test script does.

python -m pip -v install --no-deps ./local-pylibraft-dep/pylibraft*.whl

That worked as expected.

Processing /local-pylibraft-dep/pylibraft_cu12-24.8.0a20-cp39-cp39-manylinux_2_28_x86_64.whl
Installing collected packages: pylibraft-cu12
Successfully installed pylibraft-cu12-24.8.0a20

With that set up (a raft_dask-cu12 wheel in ./dist and pylibraft-cu12 already installed), I ran the following:

python -m pip -v install "raft_dask-${RAPIDS_PY_CUDA_SUFFIX}[test]>=0.0.0a0" --find-links dist/

Just like we observed in CI:

  • this succeeds
  • pylibraft (unsuffixed) is not mentioned in the logs, but all other dependencies are
  • there are no warnings or errors
Successfully installed MarkupSafe-2.1.5 click-8.1.7 cloudpickle-3.0.0 coverage-7.5.3 cuda-python-12.5.0 dask-2024.5.1 dask-cuda-24.8.0a0 dask-expr-1.1.1 distributed-2024.5.1 distributed-ucxx-cu12-0.39.0a0 exceptiongroup-1.2.1 fsspec-2024.5.0 importlib-metadata-7.1.0 iniconfig-2.0.0 jinja2-3.1.4 joblib-1.4.2 libucx-cu12-1.15.0.post1 llvmlite-0.42.0 locket-1.0.0 msgpack-1.0.8 numba-0.59.1 numpy-1.26.4 packaging-24.0 pandas-2.2.2 partd-1.4.2 pluggy-1.5.0 psutil-5.9.8 pyarrow-16.1.0 pynvml-11.4.1 pytest-7.4.4 pytest-cov-5.0.0 python-dateutil-2.9.0.post0 pytz-2024.1 pyyaml-6.0.1 raft_dask-cu12-24.8.0a18 rapids-dask-dependency-24.8.0a4 rmm-cu12-24.8.0a6 six-1.16.0 sortedcontainers-2.4.0 tblib-3.0.0 tomli-2.0.1 toolz-0.12.1 tornado-6.4 tzdata-2024.1 ucx-py-cu12-0.39.0a0 ucxx-cu12-0.39.0a0 urllib3-2.2.1 zict-3.0.0 zipp-3.19
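
One more detail worth noting (a quick check added on top of the script): the version that actually got installed is older than the wheel sitting in ./dist, confirming that pip fell back to the nightly index.

```shell
# the wheel built in this CI run is 24.8.0a20 (see the pkginfo output above)...
ls ./dist/
# raft_dask_cu12-24.8.0a20-*.whl

# ...but pip quietly resolved an older nightly from the index instead
pip show raft-dask-cu12 | grep -E "^(Name|Version)"
# Name: raft-dask-cu12
# Version: 24.8.0a18
```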

HOWEVER... this alternative form fails in the expected way.

python -m pip -v install ./dist/*.whl
ERROR: Could not find a version that satisfies the requirement ucx-py==0.39.* (from raft-dask-cu12) (from versions: 0.0.1.post1)
ERROR: No matching distribution found for ucx-py==0.39.*

Expected behavior

I expected CI to fail because the constraints pylibraft==24.8.* and ucx-py==0.39.* are not satisfiable (those packages do not exist).
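
Either broken pin can also be checked in isolation; a minimal sketch using the container's pip 23.0.1 (which supports --dry-run), showing the ucx-py pin failing the same way as above:

```shell
# resolve without installing; only ucx-py 0.0.1.post1 exists on PyPI, so this
# constraint cannot be satisfied
python -m pip install --dry-run "ucx-py==0.39.*"
# ERROR: Could not find a version that satisfies the requirement ucx-py==0.39.*
```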

Environment details (please complete the following information):

nvidia-smi (click me)
Fri May 31 12:06:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0              55W / 300W |    341MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Additional context

The particular unsatisfiable dependency issue here was likely introduced by the recent changes adding rapids-build-backend (#2331, for rapidsai/build-planning#31). But in theory this gap could just as easily hide any other dependency mistake, like a typo of the form joblibbbbb.

I am actively investigating this (along with @bdice and @nv-rliu). Just posting for documentation purposes.

@nv-rliu commented May 31, 2024

Wow, well written! Thanks for documenting this bug

rapids-bot closed this as completed in #2349 on Jun 3, 2024.
rapids-bot pushed a commit that referenced this issue on Jun 3, 2024:
… run, fix wheel dependencies (#2349)

Fixes #2348 

#2331 introduced `rapids-build-backend` (https://github.com/rapidsai/rapids-build-backend) as the build backend for `pylibraft`, `raft-dask`, and `raft-ann-bench`.

That library handles automatically modifying a wheel's dependencies based on the target CUDA version. Unfortunately, we missed a few cases in #2331, and as a result the last few days of nightly `raft-dask` wheels had the following issues:

* depending on `pylibraft`
  - *(does not exist, it's called `pylibraft-cu12`)*
* depending on `ucx-py==0.39.*`
  - *(does not exist, it's called `ucx-py-cu12`)*
* depending on `distributed-ucxx-cu11==0.39.*` instead of `distributed-ucxx-cu11==0.39.*,>=0.0.0a0`
   - *(without that alpha specifier, `pip install --pre` is required to install pre-release versions)*
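
A minimal sketch of that last point, reusing the package name from the bullet above:

```shell
# without a pre-release clause, pip only considers final releases for this
# requirement, so it cannot pick up a nightly like 0.39.0a0 unless --pre is passed
pip install "distributed-ucxx-cu11==0.39.*"

# the ,>=0.0.0a0 clause is itself a pre-release specifier, so pip will consider
# pre-release versions for this requirement without needing --pre
pip install "distributed-ucxx-cu11==0.39.*,>=0.0.0a0"
```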

This wasn't caught in `raft`'s CI, but it was caught in downstream CI (e.g. `cuml` and `cugraph`), with errors like this:

```text
ERROR: ResolutionImpossible:
  
The conflict is caused by:
    raft-dask-cu12 24.8.0a20 depends on pylibraft==24.8.* and >=0.0.0a0
    raft-dask-cu12 24.8.0a19 depends on pylibraft==24.8.* and >=0.0.0a0
```

([example cugraph build](https://github.com/rapidsai/cugraph/actions/runs/9315062495/job/25656684762?pr=4454#step:7:1811))

This PR:

* fixes those dependency issues
* modifies `raft`'s CI so that similar issues would be caught here in the future, before publishing wheels

## Notes for Reviewers

### What was the root cause of CI missing this, and how does this PR fix it?

The `raft-dask` test CI jobs use this pattern to install the `raft-dask` wheel built earlier in the CI pipeline.

```shell
pip install "raft_dask-cu12[test]>=0.0.0a0" --find-links dist/
```

As described in the `pip` docs ([link](https://pip.pypa.io/en/stable/cli/pip_install/#finding-packages)), `--find-links` just adds a directory to the list of other places `pip` searches for packages. Because the wheel there had unsatisfiable constraints (e.g. `pylibraft==24.8.*` does not exist anywhere), `pip install` silently disregarded that locally-downloaded `raft_dask` wheel and backtracked (i.e. downloaded older and older wheels from https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/) until it found one that wasn't problematic.

This PR ensures that won't happen by telling `pip` to install **exactly that locally-downloaded file**, like this:

```shell
pip install "$(echo ./dist/raft_dask_cu12*.whl)[test]"
```

If that file is uninstallable, `pip install` fails and you find out via a CI failure.
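
As an aside, a rougher way to surface the same failure locally (a sketch, not what this PR does) is to cut off index access entirely:

```shell
# --no-index restricts resolution to the --find-links directory, so pip has no
# older nightlies to backtrack to and must fail on the broken wheel
pip install "raft_dask-cu12[test]>=0.0.0a0" --no-index --find-links dist/
```

The trade-off is that with `--no-index` every dependency (including the `[test]` extras) would also have to be present locally, which is presumably why the PR instead installs the exact wheel path and still lets `pip` resolve dependencies from the configured indexes.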

### How I tested this

Initially pushed a commit with just the changes to the test script. Saw the `wheel-tests-raft-dask` CI jobs fail in the expected way, instead of silently falling back to an older wheel and passing 🎉 .

```text
ERROR: Could not find a version that satisfies the requirement ucx-py-cu12==0.39.* (from raft-dask-cu12[test]) (from versions: 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.0, 0.37.0, 0.38.0a4, 0.38.0a5, 0.38.0a6, 0.39.0a0)
ERROR: No matching distribution found for ucx-py-cu12==0.39.*
```

([build link](https://github.com/rapidsai/raft/actions/runs/9323598882/job/25668146747?pr=2349))

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #2349