add rhoai with gpu on rosa hcp (#568)
* add rhoai with gpu on rosa hcp

Signed-off-by: Paul Czarkowski <[email protected]>

* fix version

Signed-off-by: Paul Czarkowski <[email protected]>

* adding stable diffusion etc

* updating main page with new title

* removing redundant console looks etc

* fixing minor formatting

* fixing formatting again

---------

Signed-off-by: Paul Czarkowski <[email protected]>
Co-authored-by: diana-sari <[email protected]>
paulczar and diana-sari authored Aug 26, 2024
1 parent 7b0ea16 commit 1333fbb
Showing 8 changed files with 301 additions and 6 deletions.
14 changes: 8 additions & 6 deletions content/_index.md
@@ -111,6 +111,7 @@ description: "Step-by-step tutorials from Red Hat experts to help you get the mo
### Red Hat OpenShift AI (RHOAI)

* [Running and Deploying LLMs using Red Hat OpenShift AI on ROSA cluster and Storing the Model in Amazon S3 Bucket](/experts/rhoai/rosa-s3)
* [Creating Images using Stable Diffusion on Red Hat OpenShift AI on ROSA cluster with GPU enabled](/experts/rhoai/rosa-gpu)

### Data Science on Jupyter Notebook on OpenShift

@@ -171,6 +172,13 @@ description: "Step-by-step tutorials from Red Hat experts to help you get the mo

## Miscellaneous

* [Demonstrating GitOps - ArgoCD](/experts/redhat/gitops/)
* [Migrate Kubernetes Applications with Konveyor Crane](/experts/redhat/crane/)
* [Red Hat Cost Management for Cloud Services](/experts/misc/cost-management/)
* [Deploy OpenShift Advanced Data Protection on a ROSA STS cluster](/experts/misc/oadp/rosa-sts/)
* [Azure DevOps with Managed OpenShift](/experts/misc/azure-dev-ops-with-managed-openshift/)
* [Configuring OpenShift Dev Spaces to serve Custom Domains](/experts/misc/devspaces-custom-domain)

### Applications

* [Deploying Astronomer to OpenShift](/experts/aro/astronomer)
@@ -179,12 +187,6 @@ description: "Step-by-step tutorials from Red Hat experts to help you get the mo
### Ingress

* [Configure a custom ingress TLS profile for ROSA/OSD](/experts/misc/tls-cipher-customization)
* [Demonstrating GitOps - ArgoCD](/experts/redhat/gitops/)
* [Migrate Kubernetes Applications with Konveyor Crane](/experts/redhat/crane/)
* [Red Hat Cost Management for Cloud Services](/experts/misc/cost-management/)
* [Deploy OpenShift Advanced Data Protection on a ROSA STS cluster](/experts/misc/oadp/rosa-sts/)
* [Azure DevOps with Managed OpenShift](/experts/misc/azure-dev-ops-with-managed-openshift/)
* [Configuring OpenShift Dev Spaces to serve Custom Domains](/experts/misc/devspaces-custom-domain)

### Fixes / Workarounds

Binary file added content/rhoai/rosa-gpu/images/cat.png
Binary file added content/rhoai/rosa-gpu/images/catdog.png
Binary file added content/rhoai/rosa-gpu/images/catdog1.png
Binary file added content/rhoai/rosa-gpu/images/dog.png
Binary file added content/rhoai/rosa-gpu/images/dogdog.png
Binary file added content/rhoai/rosa-gpu/images/nvidiagpu-accl.png
293 changes: 293 additions & 0 deletions content/rhoai/rosa-gpu/index.md
@@ -0,0 +1,293 @@
---
date: '2024-08-27'
title: Creating Images using Stable Diffusion on Red Hat OpenShift AI on ROSA cluster with GPU enabled
tags: ["ROSA", "HCP", "RHOAI", "Jupyter", "GPU", "Stable Diffusion"]
authors:
- Diana Sari
- Paul Czarkowski
---

## 1. Introduction

[Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion) is an AI model that generates images from text descriptions. It uses a diffusion process to iteratively denoise random Gaussian noise into a coherent image. This tutorial shows how to create images with the Stable Diffusion model on [Red Hat OpenShift AI (RHOAI)](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai), formerly called Red Hat OpenShift Data Science (RHODS), which is our OpenShift platform for managing the lifecycle of AI/ML projects. RHOAI runs here on a [Red Hat OpenShift Service on AWS (ROSA)](https://www.redhat.com/en/technologies/cloud-computing/openshift/aws) cluster, which is our managed OpenShift service on AWS, with NVIDIA GPUs enabled.

Note that this guide requires a ROSA cluster with GPU enabled. The first half of this tutorial installs the OpenShift Service Mesh operator, then installs the RHOAI operator and creates a DataScienceCluster instance. In the second half, we'll run the Stable Diffusion model in an RHOAI Jupyter notebook to create cat and dog images. This tutorial uses RHOAI operator version 2.12.0; because RHOAI is under active development, some features and parts of the GUI may change over time.

*Disclaimer: When using Stable Diffusion or other open-source image generation models, please be aware that while these tools include certain content filters and safety features, these are not foolproof. Therefore, it is your responsibility to use this tool in a safe manner, ensure the prompts you input are appropriate, and verify that the generated images are suitable for your intended audience. Neither the author of this tutorial nor the infrastructure providers can be held responsible for any inappropriate or unwanted results you may generate. By proceeding with this tutorial, you acknowledge that you understand the potential risks and agree to use the tool responsibly. Remember that the output of AI image generation models can sometimes be unpredictable and thus it is important to review all the generated images before sharing or using them in any context.*


## 2. Prerequisites

### 2.1 Tools

* [OpenShift CLI](https://docs.openshift.com/container-platform/4.14/cli_reference/openshift_cli/getting-started-cli.html)
* [ROSA CLI](https://docs.openshift.com/rosa/rosa_install_access_delete_clusters/rosa_getting_started_iam/rosa-installing-rosa.html)

### 2.2 Environment

1. You will need a ROSA cluster (classic or HCP). If you don't have one, you can follow the [ROSA guide](/experts/rosa/terraform/hcp) to create an HCP ROSA cluster.
   - I ran this tutorial on an HCP ROSA 4.16.8 cluster with `m5.4xlarge` nodes, for a total of 48 vCPUs and ~185Gi memory.
   - Please be sure that you have cluster admin access to the cluster.

1. You will need a GPU-enabled machine pool in your ROSA cluster. If you don't have one, you can follow the [Adding GPUs to a ROSA cluster](/experts/rosa/gpu) guide to add GPUs to your cluster.
   - I ran this tutorial using `g5.4xlarge` nodes with autoscaling enabled up to 4 nodes.


## 3. Setting up RHOAI

### 3.1 Installing OpenShift Service Mesh Operator

1. Deploy the Operator

```bash
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```
### 3.2 Installing RHOAI Operator and DataScienceCluster Instance

1. Create a project for the RHOAI operator:

```bash
oc new-project redhat-ods-operator
```

1. Deploy the OpenShift AI Operator:

```bash
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: redhat-ods-operator
  namespace: redhat-ods-operator
spec:
  upgradeStrategy: Default
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: fast
  installPlanApproval: Automatic
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```

1. Wait until the operator is installed:

```bash
oc wait --for=jsonpath='{.status.replicas}'=1 deployment \
  -n redhat-ods-operator rhods-operator
```

If you're on Linux and see an error message like *Error from server (NotFound): deployments.apps "rhods-operator" not found*, wait a couple of minutes and rerun the command.

1. Create a DataScienceCluster:

```bash
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    trustyai: {}
    ray:
      managementState: Managed
    kueue:
      managementState: Managed
    workbenches:
      managementState: Managed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
EOF
```

1. Wait for the DataScienceCluster to be ready:

```bash
oc wait --for=jsonpath='{.status.phase}'=Ready datasciencecluster \
  default-dsc
```

1. Finally, get the hostname of the OpenShift AI console with the following command, then open it in your web browser and log in:

```bash
oc -n redhat-ods-applications get route rhods-dashboard -o jsonpath='{.spec.host}'
```
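
The earlier advice to wait a couple of minutes and rerun `oc wait` when the deployment is not found yet can also be scripted. The following is a minimal Python sketch, not part of the original guide: `wait_with_retry`, `OC_WAIT`, and the default attempt count and delay are hypothetical choices. It reruns the same `oc wait` command shown earlier until it exits successfully.

```python
import subprocess
import time

# Same command as in the guide, written as an argument list (no shell quoting needed).
OC_WAIT = [
    "oc", "wait", "--for=jsonpath={.status.replicas}=1", "deployment",
    "-n", "redhat-ods-operator", "rhods-operator",
]


def wait_with_retry(command=OC_WAIT, attempts=10, delay_seconds=30,
                    run=subprocess.run, sleep=time.sleep):
    """Rerun `command` until it exits with code 0.

    Returns True on success and False if every attempt failed.
    `run` and `sleep` are injectable so the helper can be exercised without a cluster.
    """
    for attempt in range(1, attempts + 1):
        if run(command).returncode == 0:
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the operator time to create the deployment
    return False


# Example (requires a logged-in `oc` session):
# if not wait_with_retry():
#     raise SystemExit("rhods-operator deployment did not become ready")
```

Injecting `run` and `sleep` is only a convenience for trying the helper out without a cluster; with the defaults it calls `oc` directly.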

## 4. Deploying Stable Diffusion Model

In this tutorial, we'll use the [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) model from Stability AI to generate images from text prompts. We'll generate three images from prompts about cats and dogs, using 50 inference steps and a guidance scale of 7.5. The images are then displayed vertically using matplotlib, each titled with its corresponding prompt.

Now that the environment is ready, go to the RHOAI dashboard. In the navigation pane on the left-hand side, select **Applications** and click **Enabled**, where you can launch a Jupyter notebook. For more details on the console, see the third section of our other guide [here](https://cloud.redhat.com/experts/misc/rhoai-s3/).

Click **Launch application** and then select the **TensorFlow 2024.1** notebook image. You can leave the container size at **Small**. Then select **NVIDIA GPU** as the accelerator from the dropdown.

![NVIDIA-GPU](images/nvidiagpu-accl.png)
<br />

Click the **Start** server button and wait until the notebook is ready, then click **Open in new tab**. Once you're in the Jupyter notebook, click the **Python 3.9** notebook button at the top and run the following script in a single cell.

```python
# install the necessary dependencies and libraries
!pip install --upgrade diffusers transformers torch accelerate matplotlib datasets torchvision

import gc

import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline


# clean up memory and reset CUDA cache
def cleanup_memory():
    gc.collect()
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()


# load the Stable Diffusion model
def load_model(model_id):
    # use half precision on the GPU; fall back to float32 if only a CPU is available
    use_cuda = torch.cuda.is_available()
    dtype = torch.float16 if use_cuda else torch.float32
    pipeline = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipeline = pipeline.to("cuda" if use_cuda else "cpu")
    return pipeline


# generate the images
def generate_images(pipeline, prompts, num_images_per_prompt=1, num_inference_steps=50, guidance_scale=7.5):
    images = []
    for prompt in prompts:
        batch = pipeline(
            prompt,
            num_images_per_prompt=num_images_per_prompt,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            output_type="pil"
        )
        images.extend(batch.images)
        # free GPU memory before moving on to the next prompt
        cleanup_memory()
    return images


# display the images, one per row, titled with the prompt that produced them
def display_images(images, prompts):
    rows = len(images)
    fig, axs = plt.subplots(rows, 1, figsize=(15, 5*rows))
    if rows == 1:
        axs = [axs]
    for img, ax, prompt in zip(images, axs, prompts):
        ax.imshow(img)
        ax.set_title(prompt, fontsize=10)
        ax.axis('off')
    plt.tight_layout()
    plt.show()


# execute the script
if __name__ == "__main__":
    try:
        pipeline = load_model('stabilityai/stable-diffusion-2-1')
        prompts = [
            "A cute cat",
            "A cute dog",
            "A cute cat and a cute dog sit next to each other"
        ]
        num_images_per_prompt = 1
        generated_images = generate_images(pipeline, prompts, num_images_per_prompt, num_inference_steps=50, guidance_scale=7.5)
        # repeat each prompt so that every generated image gets its own title
        titles = [prompt for prompt in prompts for _ in range(num_images_per_prompt)]
        display_images(generated_images, titles)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        cleanup_memory()
```

Here are some images from my run (note that the results vary from run to run):

![cat](images/cat.png)
<br />
![dog](images/dog.png)
<br />
![catdog](images/catdog.png)
<br />
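
Because the images differ on every run, it can be handy to fix the random seed when you want to reproduce a result. The following is a minimal sketch under stated assumptions: `plan_seeds` and `generate_reproducible` are hypothetical helper names, the base seed `1234` is arbitrary, and `pipeline` is assumed to be a loaded `StableDiffusionPipeline`. It relies on the `generator` argument that Diffusers pipelines accept.

```python
def plan_seeds(prompts, base_seed=1234):
    """Assign a fixed, distinct seed to each prompt."""
    return {prompt: base_seed + offset for offset, prompt in enumerate(prompts)}


def generate_reproducible(pipeline, prompts, base_seed=1234, **pipeline_kwargs):
    """Generate one image per prompt using a seeded torch.Generator."""
    import torch  # imported lazily so plan_seeds works without PyTorch installed

    device = "cuda" if torch.cuda.is_available() else "cpu"
    images = {}
    for prompt, seed in plan_seeds(prompts, base_seed).items():
        # a fresh generator per prompt keeps each image independent of the others
        generator = torch.Generator(device=device).manual_seed(seed)
        images[prompt] = pipeline(prompt, generator=generator, **pipeline_kwargs).images[0]
    return images


# Example, in the same notebook as the main script:
# images = generate_reproducible(pipeline, ["A cute cat", "A cute dog"],
#                                num_inference_steps=50, guidance_scale=7.5)
```

Rerunning with the same seed, prompt, and parameters on the same hardware and library versions should give the same image; changing the base seed gives a new variation.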

Note that these prompts, e.g. “A cute cat”, “A cute dog”, and “A cute cat and a cute dog sit next to each other”, are just examples; you can change them to your liking by editing the `prompts` list at the bottom of the script.

If the kernel hangs or something similar happens, restart or refresh the RHOAI dashboard and relaunch the notebook. Alternatively, if you are using an HCP cluster, you might also want to add more nodes to the machine pool.

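
When the kernel hangs or generation is unexpectedly slow, it may help to first confirm that the notebook can see the GPU at all. The following is a small sketch, with `describe_accelerator` being a hypothetical helper name, that reports what PyTorch detects in the current kernel:

```python
def describe_accelerator():
    """Return a one-line summary of the GPU visible to PyTorch, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this kernel"
    if not torch.cuda.is_available():
        return "No CUDA device visible; check the accelerator selected for the notebook"
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB total memory"


print(describe_accelerator())
```

If no CUDA device is reported, the script above falls back to the CPU, which is much slower.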

You may also see the following warning messages, which are informational and generally harmless:

- *The cache for model files in Transformers v4.22.0 has been updated...*: This is an informational message that can be safely ignored once the cache migration is complete.
- *Unable to register cuDNN/cuFFT/cuBLAS factory...*: These messages indicate that these CUDA libraries are being initialized multiple times.
- *This TensorFlow binary is optimized to use available CPU instructions...*: This is also an informational message: the TensorFlow installation is working but could potentially be optimized further.
- *TF-TRT Warning: Could not find TensorRT*: This warning indicates that TensorRT is not available, which might affect performance but not functionality.

## 5. Future research

This is a simple tutorial that guides you through the environment setup on an existing ROSA cluster, followed by a simple deployment that generates images with the Stable Diffusion model. If you get unsatisfactory results, i.e. inaccurate images, there are many ways to improve them, such as adjusting the parameters and using more specific prompts.

In one of my runs, the model generated an inaccurate image for the third prompt (a cat and a dog), as shown below.

![dogdog](images/dogdog.png)
<br />

So I increased `num_inference_steps` from 50 to 75 and `guidance_scale` from 7.5 to 8.5, and changed the last prompt to “A cute cat and a cute dog sitting next to each other, both faces and bodies are in the same image and background”. This gave me the following image (note that results may vary).

![catdog1](images/catdog1.png)
<br />

Increasing `num_inference_steps` gives the model more iterations to refine the image, raising `guidance_scale` can produce images that match the prompt more closely, and more detailed prompts can guide the model better.

However, please note that even with these optimizations, generating images with multiple specific elements can be tricky due to the inherent nature of generative models. You might still need to run the code multiple times to get the desired results.

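
To make the effect of `guidance_scale` more concrete, here is a toy sketch of classifier-free guidance, the technique that Stable Diffusion pipelines commonly use to combine an unconditional noise prediction with a prompt-conditioned one. The function name `apply_guidance` and the two-element "predictions" are made up for illustration; a real pipeline does this on large tensors at every denoising step.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine predictions as: uncond + scale * (text - uncond)."""
    return [
        uncond + guidance_scale * (text - uncond)
        for uncond, text in zip(noise_uncond, noise_text)
    ]


# Toy predictions: the prompt only changes the first component.
noise_uncond = [0.0, 1.0]
noise_text = [1.0, 1.0]

for scale in (1.0, 7.5, 8.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
# 1.0 [1.0, 1.0]
# 7.5 [7.5, 1.0]
# 8.5 [8.5, 1.0]
```

A scale of 1.0 reproduces the prompt-conditioned prediction unchanged, while larger values push the result further in the direction the prompt suggests. This is why a higher `guidance_scale` tends to follow the prompt more closely, usually at some cost in variety.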
There are many other ways to improve the accuracy that I won't cover in this blog, such as using [negative prompts](https://medium.com/stablediffusion/100-negative-prompts-everyone-are-using-c71d0ba33980) to exclude what you don’t want to see in the image, fine-tuning the model, using another [model](https://huggingface.co/models?other=stable-diffusion), increasing the batch size, etc. These are all potential topics for future research.
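
As a starting point for the negative-prompt idea, here is a minimal sketch that passes a `negative_prompt` to the pipeline call. The helper name `generate_with_negative_prompt` and the example negative prompt are assumptions for illustration, and `pipeline` is assumed to be a loaded `StableDiffusionPipeline`; the defaults reuse the adjusted values from this section.

```python
def generate_with_negative_prompt(pipeline, prompt, negative_prompt,
                                  num_inference_steps=75, guidance_scale=8.5):
    """Generate images for `prompt` while steering away from `negative_prompt`."""
    result = pipeline(
        prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    )
    return result.images


# Example, in the same notebook as the main script:
# images = generate_with_negative_prompt(
#     pipeline,
#     "A cute cat and a cute dog sitting next to each other",
#     negative_prompt="blurry, cropped, extra limbs",  # illustrative wording only
# )
```

Which negative prompts work well depends on the model and the subject, so treat the wording above as a placeholder to experiment with.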
