fix: update/pin dependencies to get ONNX runtime working again #107

Merged · 6 commits into main from set-onnx-version · Aug 5, 2024

Conversation

@tjohnson31415 (Member) commented on Jul 31, 2024

Motivation

Internal regression tests are failing when using the ONNX Runtime, with an error indicating a dependency issue between ONNX Runtime and cuDNN:

```
Shard 0: 2024-07-31 19:38:04.423164988 [E:onnxruntime:Default, provider_bridge_ort.cc:1745 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9: cannot open shared object file: No such file or directory
```

I found that ORT 1.18.1 started to build against cuDNN 9 (noted in the [release notes](https://github.com/Microsoft/onnxruntime/releases/tag/v1.18.1)). However, PyTorch does not use cuDNN 9 until 2.4.0, so I pinned onnxruntime to 1.18.0. In updating poetry.lock, I let other deps update as well, but ran into other compatibility issues and had to pin transformers and optimum as well to get internal tests passing.
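
As a quick local check of the mismatch (a sketch, assuming both torch and onnxruntime are importable in the same environment), the installed versions and ONNX Runtime's available execution providers can be printed:

```python
# Sanity check for the torch/onnxruntime cuDNN mismatch described above.
import torch
import onnxruntime

print("torch:", torch.__version__)
# The cuDNN version torch loads, e.g. 8902 for cuDNN 8.9.2
print("torch cuDNN:", torch.backends.cudnn.version())
print("onnxruntime:", onnxruntime.__version__)
print("providers:", onnxruntime.get_available_providers())
# Note: the libcudnn.so.9 failure above surfaces when a session is
# created with CUDAExecutionProvider and the provider library loads.
```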

Modifications

  • pin the onnxruntime version to 1.18.0
  • pin transformers to 4.40.2 (and remove the separate `pip install` for it)
  • pin optimum to 1.20
  • run `poetry update` to refresh poetry.lock (a sketch of the resulting pins is below)
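
For illustration only, the pins might look roughly like this in pyproject.toml; the actual constraint operators and dependency grouping in this repo's file may differ:

```toml
# Hypothetical sketch of the pinned entries; not the literal diff.
[tool.poetry.dependencies]
onnxruntime = "1.18.0"
transformers = "4.40.2"
optimum = "1.20"
```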

Result

`DEPLOYMENT_FRAMEWORK=hf_optimum_ort` will start working again and internal tests will pass.

@tjohnson31415 changed the title from "fix:" to "fix: onnxruntime usage has broken dependency on cudnn" on Jul 31, 2024
@tjohnson31415 changed the title from "fix: onnxruntime usage has broken dependency on cudnn" to "fix: onnxruntime is broken due to dependency on cudnn" on Jul 31, 2024
In 1.18.1, the runtime packages are built against cuDNN 9. PyTorch does not use cuDNN 9 until 2.4.0, so we hold back onnxruntime for now.

Signed-off-by: Travis Johnson <[email protected]>
@tjohnson31415 changed the title from "fix: onnxruntime is broken due to dependency on cudnn" to "fix: update/pin dependencies to get ONNX runtime working again" on Aug 1, 2024
@tjohnson31415 merged commit 015070b into main on Aug 5, 2024
7 checks passed
@tjohnson31415 deleted the set-onnx-version branch on August 5, 2024 17:24
dtrifiro pushed a commit to dtrifiro/text-generation-inference that referenced this pull request on Sep 13, 2024
dtrifiro pushed a commit to dtrifiro/text-generation-inference that referenced this pull request on Sep 13, 2024
dtrifiro pushed a commit to opendatahub-io/text-generation-inference that referenced this pull request on Sep 16, 2024
dtrifiro pushed a commit to opendatahub-io/text-generation-inference that referenced this pull request on Sep 16, 2024
dtrifiro pushed a commit to opendatahub-io/text-generation-inference that referenced this pull request on Sep 17, 2024
dtrifiro pushed a commit to opendatahub-io/text-generation-inference that referenced this pull request on Sep 17, 2024