RHOAIENG-10783: fix(rocm): remove more files that instructlab also removes #653

Closed
wants to merge 1 commit into from

Conversation

jiridanek
Member

@jiridanek jiridanek commented Jul 31, 2024

  • RHOAIENG-10783 Optimize ROCm images to reduce the size so they can be built on OpenShift CI

Description

It turns out we can remove the llvm installation if we do it forcibly, without removing dependencies. Additionally, there are gfx files for all supported cards; since we support fewer cards, we can remove many of them.

See https://github.com/tiran/instructlab-containers/blob/main/containers/rocm/Containerfile.c9s#L47
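
As a rough illustration of the gfx-file pruning described above, here is a minimal Python sketch. It assumes that card-specific files carry an architecture token such as gfx90a in their names; the file names and the `split_by_target` helper are hypothetical and are not taken from this PR or from the linked Containerfile.

```python
import re

# Matches an architecture token such as gfx90a or gfx1030 inside a file name.
GFX_TOKEN = re.compile(r"gfx[0-9a-f]+")


def split_by_target(filenames, keep_archs):
    """Return (keep, remove); files built for other archs land in remove."""
    keep, remove = [], []
    for name in filenames:
        match = GFX_TOKEN.search(name)
        # Files without an arch token are not card-specific, so keep them.
        if match is None or match.group(0) in keep_archs:
            keep.append(name)
        else:
            remove.append(name)
    return keep, remove


# Hypothetical file names, for illustration only.
files = [
    "TensileLibrary_gfx803.co",
    "TensileLibrary_gfx90a.co",
    "Kernels_gfx1010.hsaco",
    "Kernels_gfx942.hsaco",
    "librocblas.so",
]
keep, remove = split_by_target(files, {"gfx90a", "gfx942"})
print(remove)  # ['TensileLibrary_gfx803.co', 'Kernels_gfx1010.hsaco']
```

Listing the removable files before deleting anything makes it easier to review what would disappear from the image.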

How Has This Been Tested?

before

rocm-ubi9-python-3.9-main_37df13c29fdde3fa4f8e455d30b5bf39d80f6dfb  1f64a39a93d7  7 days ago   27.8 GB

after

rocm-ubi9-python-3.9-99b24d8574bffe38c4c480007a0cf69f9eeb48ce  c7dfe8bee592  3 minutes ago   23 GB
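
For reference, the saving implied by the two listings can be worked out with a few lines of Python. The sizes are the ones reported above; the variable names are only illustrative.

```python
before_gb = 27.8  # image size before the change, from the listing above
after_gb = 23.0   # image size after the change

saved_gb = before_gb - after_gb
saved_pct = 100 * saved_gb / before_gb
print(f"saved {saved_gb:.1f} GB ({saved_pct:.1f}%)")  # saved 4.8 GB (17.3%)
```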

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@jiridanek jiridanek requested review from harshad16 and removed request for caponetto and paulovmr July 31, 2024 16:09
Contributor

openshift-ci bot commented Jul 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jiridanek. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jstourac
Member

Looks good, but we should really ensure that nothing removed here is needed in the final image for any use case we should support.

@@ -18,6 +18,9 @@ WORKDIR /opt/app-root/bin
ARG ROCM_VERSION=6.1
ARG AMDGPU_VERSION=6.1

# default: same targets and ROCm version as https://github.com/tiran/instructlab-containers/blob/main/containers/rocm/Containerfile.c9s#L47
ARG AMDGPU_TARGETS=gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx942;gfx1030;gfx1100
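
To make the structure of this default easier to see, here is a small Python sketch that splits the semicolon-separated list into architectures and their optional feature suffixes (such as xnack- or xnack+). The `parse_targets` helper is hypothetical and only illustrates how the value could be read; it is not part of the Containerfile.

```python
DEFAULT_TARGETS = (
    "gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;"
    "gfx90a:xnack+;gfx942;gfx1030;gfx1100"
)


def parse_targets(value):
    """Split a semicolon-separated target list into (arch, features) pairs."""
    parsed = []
    for entry in value.split(";"):
        # Everything after the first colon is a feature flag such as xnack-.
        arch, *features = entry.split(":")
        parsed.append((arch, tuple(features)))
    return parsed


targets = parse_targets(DEFAULT_TARGETS)
archs = sorted({arch for arch, _ in targets})
print(len(targets), len(archs))  # 8 7
```

The list has eight entries but only seven distinct architectures, because gfx90a appears once per xnack setting.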
Member

Can we get some official set of AMD GPUs we should/need to support? I mean - are we sure that this default is what we want?

Member Author

Looks good, but we should really ensure that nothing removed here is needed in the final image for any use case we should support.

It's much better to miss something and add it later than to try to remove things afterwards, when we actually have users, imo.

Member

The plan is to support AMD MI210 and MI300, and later on MI250,
based on the strategy in https://issues.redhat.com/browse/RHOAISTRAT-135.

I agree, we can add the details if the opportunity presents itself, and try this out for now.
However, what are these gfx files? Are we aware of what functionality this removes?
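
One way to check the planned cards against the default target list is sketched below in Python. The card-to-architecture mapping is an assumption that is not stated anywhere in this thread (MI210 and MI250 as gfx90a, MI300 as gfx942) and should be confirmed against AMD's documentation before relying on it.

```python
# Assumed mapping from card to LLVM target; not stated in this thread.
CARD_TO_ARCH = {"MI210": "gfx90a", "MI250": "gfx90a", "MI300": "gfx942"}

# Distinct architectures in the default AMDGPU_TARGETS value from the diff.
default_archs = {
    "gfx900", "gfx906", "gfx908", "gfx90a", "gfx942", "gfx1030", "gfx1100",
}

needed = set(CARD_TO_ARCH.values())
missing = needed - default_archs        # planned cards without a target
extra = sorted(default_archs - needed)  # targets no planned card requires

print(missing)  # set()
print(extra)    # ['gfx1030', 'gfx1100', 'gfx900', 'gfx906', 'gfx908']
```

Under that assumption the default covers every planned card, and five of its architectures are not needed by any of them.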

@jiridanek jiridanek changed the title RHOAIENG-9853: fix(rocm): remove more files that instructlab also removes RHOAIENG-10783: fix(rocm): remove more files that instructlab also removes Aug 2, 2024
…oves

Turns out we can remove the llvm installation if we do it forcibly, without removing dependencies.
Additionally, there are gfx files for all supported cards; since we support fewer cards, we can remove many of them.

See https://github.com/tiran/instructlab-containers/blob/main/containers/rocm/Containerfile.c9s#L47
@jiridanek
Member Author

/test ci/prow/images

Contributor

openshift-ci bot commented Aug 7, 2024

@jiridanek: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test anaconda-ubi8-e2e-tests
  • /test codeserver-notebook-e2e-tests
  • /test habana-notebooks-e2e-tests
  • /test images
  • /test intel-notebooks-e2e-tests
  • /test jupyter-datascience-anaconda-python-3-8-pr-image-mirror
  • /test notebook-base-c9s-python-3-9-pr-image-mirror
  • /test notebook-base-ubi8-python-3-8-pr-image-mirror
  • /test notebook-base-ubi9-python-3-9-pr-image-mirror
  • /test notebook-codeserver-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-c9s-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-ds-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-jupyter-ds-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-minimal-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-rstudio-c9s-python-3-9-pr-image-mirror
  • /test notebook-cuda-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-ubi9-python-3-9-pr-image-mirror
  • /test notebook-habana-1-10-0-ubi8-python-3-8-pr-image-mirror
  • /test notebook-habana-1-13-0-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-datascience-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-datascience-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-ml-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-pyt-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-minimal-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-trustyai-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rstudio-c9s-python-3-9-pr-image-mirror
  • /test notebooks-ubi8-e2e-tests
  • /test notebooks-ubi9-e2e-tests
  • /test rocm-notebooks-e2e-tests
  • /test rocm-runtimes-ubi9-e2e-tests
  • /test rstudio-notebook-e2e-tests
  • /test runtime-cuda-tensorflow-ubi8-python-3-8-pr-image-mirror
  • /test runtime-cuda-tensorflow-ubi9-python-3-9-pr-image-mirror
  • /test runtime-datascience-ubi8-python-3-8-pr-image-mirror
  • /test runtime-datascience-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-ml-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-pyt-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-tf-ubi9-python-3-9-pr-image-mirror
  • /test runtime-minimal-ubi8-python-3-8-pr-image-mirror
  • /test runtime-minimal-ubi9-python-3-9-pr-image-mirror
  • /test runtime-pytorch-ubi8-python-3-8-pr-image-mirror
  • /test runtime-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test runtime-rocm-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test runtime-rocm-tensorflow-ubi9-python-3-9-pr-image-mirror
  • /test runtimes-ubi8-e2e-tests
  • /test runtimes-ubi9-e2e-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-opendatahub-io-notebooks-main-images
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-rocm-notebooks-e2e-tests

In response to this:

/test ci/prow/images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jiridanek
Member Author

/test images
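
The earlier attempt failed because it passed the status context name (ci/prow/images) rather than the job name. Judging only from the bot output in this thread, the job name appears to be the context with the ci/prow/ prefix dropped; the `rerun_command` helper below is a hypothetical Python sketch of that conversion, not part of Prow.

```python
PREFIX = "ci/prow/"


def rerun_command(context):
    """Turn a context such as ci/prow/images into the command '/test images'."""
    # str.removeprefix needs Python 3.9+; names without the prefix pass through.
    return f"/test {context.removeprefix(PREFIX)}"


print(rerun_command("ci/prow/images"))  # /test images
```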

@jiridanek
Member Author

@harshad16 The gfx files are card-specific compiled code. The filenames end in .bc, .co, and .hsaco.

There are at least 3 types of device-side (aka GPU) binary files:

  • .bc for bitcode,
  • .hsaco for HSA code object and
  • .co for code object.
    (source)

NVCC and HIP-Clang target different architectures and use different code object formats: NVCC is cubin or ptx files, while the HIP-Clang path is the hsaco format. (source)
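
To get a feel for how many files of each kind an image contains, the three extensions above can be tallied with a short Python sketch. The paths are hypothetical, and `count_device_binaries` is an illustrative helper rather than anything shipped with ROCm.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extension-to-kind mapping, taken from the list quoted above.
KINDS = {".bc": "bitcode", ".hsaco": "HSA code object", ".co": "code object"}


def count_device_binaries(paths):
    """Count device-side binaries by kind, ignoring every other file."""
    counts = Counter()
    for path in paths:
        kind = KINDS.get(PurePosixPath(path).suffix)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical paths, for illustration only.
paths = [
    "lib/ocml.bc",
    "lib/Kernels_gfx942.hsaco",
    "lib/TensileLibrary_gfx90a.co",
    "lib/librocblas.so",
]
counts = count_device_binaries(paths)
print(counts["bitcode"], counts["HSA code object"], counts["code object"])  # 1 1 1
```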

@jiridanek
Member Author

/retest all

Contributor

openshift-ci bot commented Aug 16, 2024

@jiridanek: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test anaconda-ubi8-e2e-tests
  • /test codeserver-notebook-e2e-tests
  • /test habana-notebooks-e2e-tests
  • /test images
  • /test intel-notebooks-e2e-tests
  • /test jupyter-datascience-anaconda-python-3-8-pr-image-mirror
  • /test notebook-base-c9s-python-3-9-pr-image-mirror
  • /test notebook-base-ubi8-python-3-8-pr-image-mirror
  • /test notebook-base-ubi9-python-3-9-pr-image-mirror
  • /test notebook-codeserver-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-c9s-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-ds-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-jupyter-ds-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-minimal-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-cuda-rstudio-c9s-python-3-9-pr-image-mirror
  • /test notebook-cuda-ubi8-python-3-8-pr-image-mirror
  • /test notebook-cuda-ubi9-python-3-9-pr-image-mirror
  • /test notebook-habana-1-10-0-ubi8-python-3-8-pr-image-mirror
  • /test notebook-habana-1-13-0-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-datascience-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-datascience-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-ml-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-pyt-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-intel-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-minimal-ubi8-python-3-8-pr-image-mirror
  • /test notebook-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test notebook-jupyter-trustyai-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rocm-ubi9-python-3-9-pr-image-mirror
  • /test notebook-rstudio-c9s-python-3-9-pr-image-mirror
  • /test notebooks-ubi8-e2e-tests
  • /test notebooks-ubi9-e2e-tests
  • /test rocm-notebooks-e2e-tests
  • /test rocm-runtimes-ubi9-e2e-tests
  • /test rstudio-notebook-e2e-tests
  • /test runtime-cuda-tensorflow-ubi8-python-3-8-pr-image-mirror
  • /test runtime-cuda-tensorflow-ubi9-python-3-9-pr-image-mirror
  • /test runtime-datascience-ubi8-python-3-8-pr-image-mirror
  • /test runtime-datascience-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-ml-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-pyt-ubi9-python-3-9-pr-image-mirror
  • /test runtime-intel-tf-ubi9-python-3-9-pr-image-mirror
  • /test runtime-minimal-ubi8-python-3-8-pr-image-mirror
  • /test runtime-minimal-ubi9-python-3-9-pr-image-mirror
  • /test runtime-pytorch-ubi8-python-3-8-pr-image-mirror
  • /test runtime-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test runtime-rocm-pytorch-ubi9-python-3-9-pr-image-mirror
  • /test runtime-rocm-tensorflow-ubi9-python-3-9-pr-image-mirror
  • /test runtimes-ubi8-e2e-tests
  • /test runtimes-ubi9-e2e-tests

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-opendatahub-io-notebooks-main-images
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-jupyter-minimal-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-jupyter-tf-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-notebook-rocm-ubi9-python-3-9-pr-image-mirror
  • pull-ci-opendatahub-io-notebooks-main-rocm-notebooks-e2e-tests

In response to this:

/retest all


@jiridanek
Member Author

/test all

Contributor

openshift-ci bot commented Aug 29, 2024

@jiridanek: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/notebook-cuda-jupyter-tf-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test notebook-cuda-jupyter-tf-ubi9-python-3-11-pr-image-mirror |
| ci/prow/notebook-cuda-jupyter-ds-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test notebook-cuda-jupyter-ds-ubi9-python-3-11-pr-image-mirror |
| ci/prow/runtime-rocm-pytorch-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test runtime-rocm-pytorch-ubi9-python-3-11-pr-image-mirror |
| ci/prow/notebook-rocm-jupyter-min-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test notebook-rocm-jupyter-min-ubi9-python-3-11-pr-image-mirror |
| ci/prow/notebook-rocm-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test notebook-rocm-ubi9-python-3-11-pr-image-mirror |
| ci/prow/runtime-rocm-tensorflow-ubi9-python-3-11-pr-image-mirror | 07d48b1 | link | true | /test runtime-rocm-tensorflow-ubi9-python-3-11-pr-image-mirror |
| ci/prow/runtimes-ubi9-e2e-tests | 07d48b1 | link | true | /test runtimes-ubi9-e2e-tests |
| ci/prow/notebooks-ubi9-e2e-tests | 07d48b1 | link | true | /test notebooks-ubi9-e2e-tests |
| ci/prow/rstudio-notebook-e2e-tests | 07d48b1 | link | true | /test rstudio-notebook-e2e-tests |
| ci/prow/intel-notebooks-e2e-tests | 07d48b1 | link | true | /test intel-notebooks-e2e-tests |
| ci/prow/codeserver-notebook-e2e-tests | 07d48b1 | link | true | /test codeserver-notebook-e2e-tests |
| ci/prow/rocm-runtimes-ubi9-e2e-tests | 07d48b1 | link | true | /test rocm-runtimes-ubi9-e2e-tests |
| ci/prow/rocm-notebooks-e2e-tests | 07d48b1 | link | true | /test rocm-notebooks-e2e-tests |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@harshad16
Member

@jiridanek, with #673 getting merged, do we want to invest time in testing the additional changes made here?

@jiridanek
Member Author

not planned for now

@jiridanek jiridanek closed this Sep 19, 2024