Update the ROCm versions installed from 6.0 to 6.1 to make it in sync with what we use in Dockerfile #656

Merged
merged 2 commits into from
Aug 2, 2024


jstourac
Member

@jstourac jstourac commented Aug 1, 2024

  • This was created in response to this comment. We should keep the ROCm version in sync everywhere we install it, to ensure compatibility.
  • Also addresses this comment

https://issues.redhat.com/browse/RHOAIENG-10824
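A minimal sketch of how the sync this PR aims for could be checked automatically. It is not part of this PR: the `ROCM_VERSION` variable name, the file names, and the file contents below are assumptions made for illustration, not the repository's actual layout.

```python
import re

# Hypothetical pattern: matches assignments such as ROCM_VERSION=6.1
# or ROCM_VERSION: "6.1". The variable name is an assumption.
ROCM_VERSION_RE = re.compile(r"ROCM_VERSION\s*[=:]\s*[\"']?(\d+\.\d+(?:\.\d+)?)")


def find_rocm_versions(sources):
    """Map each source name to the set of ROCm versions found in its text."""
    return {
        name: set(ROCM_VERSION_RE.findall(text))
        for name, text in sources.items()
    }


def versions_in_sync(sources):
    """Return True when all sources together mention at most one ROCm version."""
    seen = set()
    for versions in find_rocm_versions(sources).values():
        seen |= versions
    return len(seen) <= 1


if __name__ == "__main__":
    # Illustrative contents only; the real files may look different.
    example = {
        "Dockerfile": "ARG ROCM_VERSION=6.1\n",
        "ci-workflow.yaml": "env:\n  ROCM_VERSION: \"6.0\"\n",
    }
    print(find_rocm_versions(example))
    print("in sync:", versions_in_sync(example))
```

A check like this could run in CI and fail when two places disagree, which is the drift between 6.0 and 6.1 that this PR fixes by hand.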

How Has This Been Tested?

Not tested yet.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@jiridanek
Member

Nice, but the problem is that I probably cannot test a build of this, because openshift-ci will not build the huge PyTorch image ;(

@jiridanek
Member

So, I'll test the GitHub Actions image!

@jiridanek
Member

Here's my build of this: https://github.com/jiridanek/notebooks/actions/runs/10196455848. I'm going to try out:

  • ghcr.io/jiridanek/notebooks/workbench-images:rocm-jupyter-pytorch-ubi9-python-3.9-syncRocm_67c524cf08791fb84c46992acc411ffd2ea09869
  • ghcr.io/jiridanek/notebooks/workbench-images:rocm-jupyter-tensorflow-ubi9-python-3.9-syncRocm_67c524cf08791fb84c46992acc411ffd2ea09869

@jiridanek
Member

https://rocm.docs.amd.com/projects/install-on-linux/en/develop/install/3rd-party/pytorch-install.html#running-a-basic-pytorch-example

(app-root) bash-5.1# python
Python 3.9.18 (main, Jul  3 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_
torch.cuda.is_available(                 torch.cuda.is_bf16_supported(            torch.cuda.is_current_stream_capturing(  torch.cuda.is_initialized(               
>>> torch.cuda.is_available()
True

(app-root) bash-5.1# git clone https://github.com/pytorch/examples.git
Cloning into 'examples'...
remote: Enumerating objects: 4296, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 4296 (delta 2), reused 6 (delta 0), pack-reused 4283
Receiving objects: 100% (4296/4296), 41.36 MiB | 25.18 MiB/s, done.
Resolving deltas: 100% (2151/2151), done.
(app-root) bash-5.1# cd examples/mnist
(app-root) bash-5.1# python3 main.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
...
Train Epoch: 14 [57600/60000 (96%)]	Loss: 0.001088
Train Epoch: 14 [58240/60000 (97%)]	Loss: 0.039564
Train Epoch: 14 [58880/60000 (98%)]	Loss: 0.003215
Train Epoch: 14 [59520/60000 (99%)]	Loss: 0.002248

Test set: Average loss: 0.0263, Accuracy: 9920/10000 (99%)
(app-root) bash-5.1# git clone https://github.com/pytorch/examples.git
fatal: destination path 'examples' already exists and is not an empty directory.
(app-root) bash-5.1# cd examples/imagenet
(app-root) bash-5.1# python3 main.py
/opt/app-root/src/examples/imagenet/main.py:110: UserWarning: nccl backend >=2.5 requires GPU count>1, see https://github.com/NVIDIA/nccl/issues/103 perhaps use 'gloo'
  warnings.warn("nccl backend >=2.5 requires GPU count>1, see https://github.com/NVIDIA/nccl/issues/103 perhaps use 'gloo'")
=> creating model 'resnet18'
Traceback (most recent call last):
  File "/opt/app-root/src/examples/imagenet/main.py", line 514, in <module>
    main()
  File "/opt/app-root/src/examples/imagenet/main.py", line 123, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/opt/app-root/src/examples/imagenet/main.py", line 239, in main_worker
    train_dataset = datasets.ImageFolder(
  File "/opt/app-root/lib64/python3.9/site-packages/torchvision/datasets/folder.py", line 328, in __init__
    super().__init__(
  File "/opt/app-root/lib64/python3.9/site-packages/torchvision/datasets/folder.py", line 149, in __init__
    classes, class_to_idx = self.find_classes(self.root)
  File "/opt/app-root/lib64/python3.9/site-packages/torchvision/datasets/folder.py", line 234, in find_classes
    return find_classes(directory)
  File "/opt/app-root/lib64/python3.9/site-packages/torchvision/datasets/folder.py", line 41, in find_classes
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
FileNotFoundError: [Errno 2] No such file or directory: 'imagenet/train'
(app-root) bash-5.1# 
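The traceback above ends in `os.scandir(directory)` failing on `imagenet/train`, because the example was started without a dataset directory in place. A minimal pre-flight sketch follows. Only `imagenet/train` appears in the traceback; the `val` split and the helper name `missing_imagefolder_splits` are assumptions for illustration.

```python
import os


def missing_imagefolder_splits(root, splits=("train", "val")):
    """Return the split directories under root that lack class subdirectories.

    Mirrors the check in the traceback: ImageFolder lists the subdirectories
    of each split and treats every one as a class.
    """
    missing = []
    for split in splits:
        path = os.path.join(root, split)
        if not os.path.isdir(path):
            missing.append(path)
            continue
        with os.scandir(path) as entries:
            has_class_dir = any(entry.is_dir() for entry in entries)
        if not has_class_dir:
            missing.append(path)
    return missing


if __name__ == "__main__":
    problems = missing_imagefolder_splits("imagenet")
    if problems:
        print("dataset not ready, missing or empty:", ", ".join(problems))
    else:
        print("dataset layout looks usable")
```

Running this before `python3 main.py` would report the missing directory up front instead of failing after the model has been created.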

@jiridanek
Member

[jdanek@nvd-srv-05 ~]$ podman run --entrypoint /bin/bash --device=/dev/kfd --device=/dev/dri --ipc=host  --rm -it ghcr.io/jiridanek/notebooks/workbench-images:rocm-jupyter-tensorflow-ubi9-python-3.9-syncRocm_67c524cf08791fb84c46992acc411ffd2ea09869

(app-root) bash-5.1$ python
Python 3.9.18 (main, Jul  3 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>  import tensorflow as tf
  File "<stdin>", line 1
    import tensorflow as tf
IndentationError: unexpected indent
>>> import tensorflow as tf
2024-08-01 16:46:02.802388: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2024-08-01 16:46:02.962263: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> 

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html#running-a-basic-tensorflow-example

2024-08-01 16:47:13.853661: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
2024-08-01 16:47:14.014130: I external/local_xla/xla/service/service.cc:168] XLA service 0x7faaa03f8d40 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2024-08-01 16:47:14.014178: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): AMD Instinct MI210, AMDGPU ISA version: gfx90a:sramecc+:xnack-
2024-08-01 16:47:14.019864: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1722530834.134721     490 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1848/1875 [============================>.] - ETA: 0s - loss: 0.2958 - accuracy: 0.9152
2024-08-01 16:47:17.922692: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
1875/1875 [==============================] - 4s 2ms/step - loss: 0.2935 - accuracy: 0.9158
2024-08-01 16:47:17.925299: I 
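The `podman run` invocation at the top of this comment passes the AMD GPU devices into the container. A small sketch that assembles the same flags for an arbitrary image, so the test can be repeated on other builds; the function name `rocm_podman_command` is hypothetical, while the flags are the ones used above.

```python
import shlex


def rocm_podman_command(image, entrypoint="/bin/bash"):
    """Build the argv for an interactive container with AMD GPU access."""
    return [
        "podman", "run",
        "--entrypoint", entrypoint,
        "--device=/dev/kfd",   # ROCm compute interface
        "--device=/dev/dri",   # GPU render nodes
        "--ipc=host",
        "--rm", "-it",
        image,
    ]


if __name__ == "__main__":
    image = (
        "ghcr.io/jiridanek/notebooks/workbench-images:"
        "rocm-jupyter-tensorflow-ubi9-python-3.9-"
        "syncRocm_67c524cf08791fb84c46992acc411ffd2ea09869"
    )
    # Print instead of executing, so the sketch is safe to run anywhere.
    print(shlex.join(rocm_podman_command(image)))
```

The command is printed rather than executed, so it can be reviewed or pasted on a host that actually has the GPU.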

@jiridanek
Member

Eh, I don't want to make the imagenet example work. From the dataset extraction script:

#!/bin/bash
#
# script to extract ImageNet dataset
# ILSVRC2012_img_train.tar (about 138 GB)
# ILSVRC2012_img_val.tar (about 6.3 GB)
# make sure ILSVRC2012_img_train.tar & ILSVRC2012_img_val.tar in your current directory

it's huge!

So, torch and tensorflow both work, so test passed and /lgtm
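The two manual checks in this thread (`torch.cuda.is_available()` and `tf.config.list_physical_devices()`) could be folded into one smoke test along these lines. This is a sketch, not something that was run for this PR; the `CHECKS` table and `gpu_visible` helper are hypothetical names.

```python
import importlib

# One GPU-visibility check per framework, using the same calls as the
# interactive sessions in this thread.
CHECKS = {
    # ROCm builds of PyTorch report the GPU through the torch.cuda API.
    "torch": lambda m: bool(m.cuda.is_available()),
    "tensorflow": lambda m: len(m.config.list_physical_devices("GPU")) > 0,
}


def gpu_visible(framework, checks=CHECKS):
    """Return True/False for GPU visibility, or None if the framework is not installed."""
    check = checks[framework]
    try:
        module = importlib.import_module(framework)
    except ImportError:
        return None
    return check(module)


if __name__ == "__main__":
    for name in CHECKS:
        print(name, gpu_visible(name))
```

Inside the images tested above, both entries would be expected to print `True`; a `None` means the framework is not installed in that image at all.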

Member

@harshad16 harshad16 left a comment


/lgtm
/approve

thanks for the quick work 💯

Contributor

openshift-ci bot commented Aug 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harshad16


@jstourac
Member Author

jstourac commented Aug 2, 2024

/override ci/prow/images
/override ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror
/override ci/prow/notebook-rocm-ubi9-python-3-9-pr-image-mirror
/override ci/prow/rocm-notebooks-e2e-tests
/override ci/prow/rocm-runtimes-ubi9-e2e-tests
/override ci/prow/runtime-rocm-pytorch-ubi9-python-3-9-pr-image-mirror

Contributor

openshift-ci bot commented Aug 2, 2024

@jstourac: Overrode contexts on behalf of jstourac: ci/prow/images, ci/prow/notebook-rocm-jupyter-pyt-ubi9-python-3-9-pr-image-mirror, ci/prow/notebook-rocm-ubi9-python-3-9-pr-image-mirror, ci/prow/rocm-notebooks-e2e-tests, ci/prow/rocm-runtimes-ubi9-e2e-tests, ci/prow/runtime-rocm-pytorch-ubi9-python-3-9-pr-image-mirror


@jstourac
Member Author

jstourac commented Aug 2, 2024

Thank you all for the reviews! And also for all the testing, Jiri! I overrode the failing tests and created a tracking issue for this: https://issues.redhat.com/browse/RHOAIENG-10824.

@openshift-merge-bot openshift-merge-bot bot merged commit 2695aaf into opendatahub-io:main Aug 2, 2024
16 checks passed
@jstourac jstourac deleted the syncRocm branch August 2, 2024 07:55
caponetto added a commit to caponetto/opendatahub-io-notebooks that referenced this pull request Aug 5, 2024
caponetto added a commit to caponetto/opendatahub-io-notebooks that referenced this pull request Aug 6, 2024
caponetto added a commit to caponetto/opendatahub-io-notebooks that referenced this pull request Aug 7, 2024
caponetto added a commit to caponetto/opendatahub-io-notebooks that referenced this pull request Aug 8, 2024
openshift-merge-bot bot pushed a commit that referenced this pull request Aug 16, 2024
* Add images based on python 3.11

* Apply #656 to Python 3.11 images

* Fix expected TF version on the test file

* Fix labels for Python 3.11

* Apply #652 to Python 3.11 images

* Update lock to fix debugpy package version

* Apply #635 to Python 3.11 images

* Replace 3-9 -> 3-11 leftovers

* Fix runtime rocm image name according to openshift/release

* Apply #667 to Python 3.11 images

* Adapt test code for Python 3.11 images
5 participants