Remove dnf update from docker build scripts #17551
Conversation
There are some suggestions in the Files Changed section with regard to scripting.
Let me try to fix them.
5dbacec to 986757e
This one cannot be cleanly applied to the release branch. I will create a new PR that directly targets that branch and keep this one open.
The error in "Orttraining Linux Lazy Tensor CI Pipeline" is not related to this change.
### Description
1. Delete Prefast tasks (#17522)
2. Disable yum update (#17551)
3. Avoid calling patchelf (#17365 and #17562), so that we can validate the fixes above

The main problem I'm trying to solve is that our GPU package depends on both CUDA 11.x and CUDA 12.x. However, it's not easy to see this because ldd doesn't work with the shared libraries we generate (see issue #9754). So the patchelf changes are useful for validating that disabling "yum update" was successful. As you can see, we call "yum update" from multiple places, and without some kind of validation it's hard to say whether I have covered all of them. The Prefast change is needed because I'm going to update the VM images in the next few weeks, in case we need to publish a patch release after that.

### Motivation and Context
Without this fix we will mix CUDA 11.x and CUDA 12.x, and the package will crash every time we use TensorRT.
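Since ldd cannot resolve the generated shared libraries, one way to check which CUDA runtime a library actually links against is to read its dynamic section directly. This is only a sketch for illustration; the `.so` path below is a placeholder, not necessarily the real artifact name.

```bash
# Sketch: list the DT_NEEDED entries of a built library and look for CUDA/TRT
# runtime names (e.g. libcudart.so.11.0 vs libcudart.so.12).
# The library path is a placeholder.
readelf -d ./libonnxruntime_providers_tensorrt.so | grep NEEDED
```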
### Description
1. Remove 'dnf update' from docker build scripts, because it upgrades the TRT packages from CUDA 11.x to CUDA 12.x.

To reproduce the problem, run the following commands in a CentOS/UBI 8 CUDA 11.x docker image such as nvidia/cuda:11.8.0-cudnn8-devel-ubi8:

```
export v=8.6.1.6-1.cuda11.8
dnf install -y libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v} libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
dnf update -y
```

The last command produces output like the following:

```
========================================================================================================================
 Package                              Architecture    Version                 Repository    Size
========================================================================================================================
Upgrading:
 libnvinfer-devel                     x86_64          8.6.1.6-1.cuda12.0      cuda          542 M
 libnvinfer-headers-devel             x86_64          8.6.1.6-1.cuda12.0      cuda          118 k
 libnvinfer-headers-plugin-devel      x86_64          8.6.1.6-1.cuda12.0      cuda           14 k
 libnvinfer-plugin-devel              x86_64          8.6.1.6-1.cuda12.0      cuda           13 M
 libnvinfer-plugin8                   x86_64          8.6.1.6-1.cuda12.0      cuda           13 M
 libnvinfer-vc-plugin-devel           x86_64          8.6.1.6-1.cuda12.0      cuda          107 k
 libnvinfer-vc-plugin8                x86_64          8.6.1.6-1.cuda12.0      cuda          251 k
 libnvinfer8                          x86_64          8.6.1.6-1.cuda12.0      cuda          543 M
 libnvonnxparsers-devel               x86_64          8.6.1.6-1.cuda12.0      cuda          467 k
 libnvonnxparsers8                    x86_64          8.6.1.6-1.cuda12.0      cuda          757 k
 libnvparsers-devel                   x86_64          8.6.1.6-1.cuda12.0      cuda          2.0 M
 libnvparsers8                        x86_64          8.6.1.6-1.cuda12.0      cuda          854 k
Installing dependencies:
 cuda-toolkit-12-0-config-common      noarch          12.0.146-1              cuda          7.7 k
 cuda-toolkit-12-config-common        noarch          12.2.140-1              cuda          7.9 k
 libcublas-12-0                       x86_64          12.0.2.224-1            cuda          361 M
 libcublas-devel-12-0                 x86_64          12.0.2.224-1            cuda          397 M

Transaction Summary
========================================================================================================================
```

As you can see from the output, these are CUDA 12 packages.

The problem could also be solved by locking the package versions with the "dnf versionlock" command right after installing the CUDA/TRT packages (a minimal sketch follows this description). However, going forward, for better reproducibility, I suggest pinning dnf package versions manually in the installation scripts, as we already do for TRT:

```bash
v="8.6.1.6-1.cuda11.8" &&\
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo &&\
yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v}\
  libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
```

When we need to upgrade a package because of a security alert or for some other reason, we manually change the version string instead of relying on "dnf update". Though this approach takes more effort, it makes our pipelines more stable.

2. Move the Python tests into Docker.

### Motivation and Context
Right now the nightly GPU package mixes CUDA 11.x and CUDA 12.x, and the resulting package is unusable (it crashes every time).
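For reference, here is a minimal sketch of the "dnf versionlock" alternative mentioned in the description. It assumes the versionlock plugin is available to install (the package name `python3-dnf-plugin-versionlock` and the package globs are illustrative and would need to be checked against the actual base image):

```bash
# Sketch: lock the freshly installed TRT packages so that a later "dnf update"
# cannot replace them with the CUDA 12.x builds.
# Assumes the versionlock plugin package name below matches the base image.
dnf install -y python3-dnf-plugin-versionlock
dnf versionlock add 'libnvinfer*' 'libnvonnxparsers*' 'libnvparsers*'
dnf versionlock list   # confirm the locks were recorded
```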