This repository has been archived by the owner on Jan 31, 2024. It is now read-only.
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: Ray Cluster and Operator Deployment
1 parent 0a80db1, commit 1646b00
Showing 11 changed files with 4,634 additions and 0 deletions.
@@ -0,0 +1,70 @@
# Deploying Ray with Open Data Hub

_WIP docs_

Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub (ODH) on OpenShift. The Ray operator and the other components are based on the [Ray 1.13.0 Kubernetes documentation](https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html).

## Components of the Ray deployment

1. [Ray operator](./operator/ray-operator-deployment.yaml): The operator processes `RayCluster` resources and schedules Ray head and worker pods based on their requirements.
2. [Ray custom resource definition](./operator/ray-custom-resources.yaml): Defines the `RayCluster` custom resource (CR), which describes the desired state of a Ray cluster.
3. [Ray cluster](./cluster/ray-cluster.yaml): Defines an example instance of a Ray cluster.

## Deploy the RayCluster components

Prerequisites for installing a RayCluster with ODH:

* Cluster admin access
* An ODH deployment
* [Kustomize](https://kustomize.io/)

### Install Ray

We will use [Kustomize](https://kustomize.io/) to deploy everything needed to run Ray with Open Data Hub.

#### Install the operator and custom resource definition

First, use the `oc kustomize` command to generate a YAML file containing all the manifests for the operator and the `RayCluster` custom resource definition, then `oc apply` that file to deploy the operator to your cluster:

```bash
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml
$ oc apply -f operator_deployment.yaml
```

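If you want to script these two steps (for example from a notebook or a CI job), they can be wrapped with Python's standard `subprocess` module. This is a minimal sketch that assumes `oc` is on your `PATH` and already logged in to the target cluster; the helper names `kustomize_and_apply_commands` and `deploy` are hypothetical, not part of this repository.

```python
import subprocess
from pathlib import Path
from typing import List


def kustomize_and_apply_commands(kustomize_dir: str, output_file: str) -> List[List[str]]:
    """Return the two oc commands from the step above as argument lists (hypothetical helper)."""
    return [
        ["oc", "kustomize", kustomize_dir],   # render the kustomization to stdout
        ["oc", "apply", "-f", output_file],   # apply the rendered manifest file
    ]


def deploy(kustomize_dir: str, output_file: str) -> None:
    """Render kustomize_dir into output_file, then apply it.

    Assumes `oc` is installed and logged in. check=True makes a failing
    command raise subprocess.CalledProcessError instead of continuing.
    """
    render_cmd, apply_cmd = kustomize_and_apply_commands(kustomize_dir, output_file)
    rendered = subprocess.run(render_cmd, capture_output=True, text=True, check=True)
    Path(output_file).write_text(rendered.stdout)  # equivalent of the `>` redirect
    subprocess.run(apply_cmd, check=True)
```

Calling `deploy("deploy/odh-ray-nbc/operator", "operator_deployment.yaml")` mirrors the two commands above, and the same helper works for the `deploy/odh-ray-nbc/cluster` directory. Writing the rendered manifest to a file first keeps a copy of exactly what was applied; if you don't need that copy, `oc apply -k <dir>` should apply a kustomization in one step.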
#### Confirm the operator is running

```
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h
```

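To run this check from a script rather than by eye, you can parse the `oc get pods` output. The sketch below assumes the default table layout shown above (a header row, then whitespace-separated NAME, READY, STATUS, ... columns) and matches pods by name prefix; `pod_is_running` is a hypothetical helper, and `oc get pods -o json` would be a sturdier input for anything beyond a quick check.

```python
def pod_is_running(oc_get_pods_output: str, name_prefix: str) -> bool:
    """Return True if a pod whose name starts with name_prefix is Running
    and has all of its containers ready (READY such as "1/1").

    Assumes the default `oc get pods` table layout.
    """
    lines = oc_get_pods_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 3:
            continue
        name, ready, status = columns[0], columns[1], columns[2]
        ready_count, _, total_count = ready.partition("/")
        if (
            name.startswith(name_prefix)
            and status == "Running"
            and ready_count == total_count
        ):
            return True
    return False


sample = """\
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h
"""
print(pod_is_running(sample, "ray-operator-"))  # True
```

The same check applies to the Ray head pod once the cluster is created; pass its name prefix instead.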
#### Create a Ray cluster

Generate the manifest for the example Ray cluster and apply it the same way:

```bash
$ oc kustomize deploy/odh-ray-nbc/cluster > cluster_deployment.yaml
$ oc apply -f cluster_deployment.yaml
```

#### Confirm the cluster is running

```
$ oc get pods
NAME                     READY   STATUS    RESTARTS   AGE
ray-cluster-head-2f866   1/1     Running   0          36m
```

Once the cluster is running, you can connect to it with the Ray Client from a Python script or Jupyter notebook by passing a `ray://` address to `ray.init()`:

```python
import ray

# Replace <Ray_Cluster_Service_Name> with the name of the Ray head Service created
# by the operator; 10001 is the Ray Client port exposed by the head pod.
ray.init("ray://<Ray_Cluster_Service_Name>:10001")
```

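In the example cluster, the `odh-ray-cluster-service` label on the `RayCluster` in `cluster/ray-cluster.yaml` records the name of the head Service the operator creates (`ray-cluster-example-ray-head`). The sketch below turns that label into the Ray Client address, assuming you have already loaded the RayCluster object as a dictionary (for example by parsing the JSON from `oc get raycluster ray-cluster-example -o json`). `ray_client_address` is a hypothetical helper; the short Service name only resolves from inside the same namespace.

```python
import json

RAY_CLIENT_PORT = 10001  # containerPort commented "Used by Ray Client" in ray-cluster.yaml
SERVICE_LABEL = "odh-ray-cluster-service"


def ray_client_address(raycluster: dict, port: int = RAY_CLIENT_PORT) -> str:
    """Build the ray:// address for ray.init() from the RayCluster's label.

    Raises KeyError if the label is missing.
    """
    service_name = raycluster["metadata"]["labels"][SERVICE_LABEL]
    return f"ray://{service_name}:{port}"


# Trimmed-down stand-in for the RayCluster object (only the fields used here).
example = json.loads("""
{
  "kind": "RayCluster",
  "metadata": {
    "name": "ray-cluster-example",
    "labels": {"odh-ray-cluster-service": "ray-cluster-example-ray-head"}
  }
}
""")
print(ray_client_address(example))  # ray://ray-cluster-example-ray-head:10001
```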
That's it!
@@ -0,0 +1,5 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
  - ray-cluster.yaml
@@ -0,0 +1,115 @@
kind: RayCluster
apiVersion: cluster.ray.io/v1
metadata:
  name: 'ray-cluster-example'
  labels:
    # Records the name of the head Service that the Ray operator creates, so clients can look it up
    odh-ray-cluster-service: 'ray-cluster-example-ray-head'
spec:
  # we can parameterize this when we fix the JH launcher json/jinja bug
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: 'ray-cluster-example-head-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/thoth-station/ray-ml-worker:v0.2.1
          # Do not change this command - it keeps the pod alive until it is explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265 # Used by Ray Dashboard
          - containerPort: 8000 # Used by Ray Serve
          env:
            # defining HOME is part of a workaround for:
            # https://github.com/ray-project/ray/issues/14155
          - name: HOME
            value: '/home'
          # Mount the dshm volume, which allocates shared memory for Ray's plasma object store
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 1G
              ephemeral-storage: 1Gi
            limits:
              cpu: 1000m
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 1G
              nvidia.com/gpu: 1
  - name: worker-nodes
    # we can parameterize this when we fix the JH launcher json/jinja bug
    minWorkers: 0
    maxWorkers: 3
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: 'ray-cluster-example-worker-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/thoth-station/ray-ml-worker:v0.2.1
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          env:
          - name: HOME
            value: '/home'
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 1G
            limits:
              cpu: 1000m
              memory: 1G
              nvidia.com/gpu: 1
  # Commands to start Ray on the head node. You don't need to change this.
  # Note that --dashboard-host is set to 0.0.0.0 so that Kubernetes can port-forward to the dashboard.
  headStartRayCommands:
    - cd /home/ray; pipenv run ray stop
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --no-monitor --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - cd /home/ray; pipenv run ray stop
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
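Two comments in the cluster definition above describe simple arithmetic: `upscalingSpeed` scales the cluster up in chunks of `upscaling_speed * currently_running_nodes`, and a pod's memory limit is split 10% for Redis, 30% for the shared-memory object store, and the rest for application memory. The sketch below works both through with the example's values. It is a toy model of what the comments state, not Ray's autoscaler or memory-detection code, which may apply further limits and rounding; the function names, the one-node minimum per step, and the rounding down are assumptions, and `1G` is read as 10^9 bytes (the Kubernetes decimal suffix).

```python
import math


def scale_up_steps(running_nodes: int, target_nodes: int, upscaling_speed: float) -> list:
    """Toy model of the upscalingSpeed comment: each step adds at most
    upscaling_speed * running_nodes new nodes (assumed: at least one, rounded
    down) until target_nodes are running. Returns the node count after each step.
    """
    if upscaling_speed <= 0:
        raise ValueError("upscalingSpeed must be > 0")
    counts = []
    while running_nodes < target_nodes:
        chunk = max(1, math.floor(upscaling_speed * running_nodes))
        running_nodes = min(target_nodes, running_nodes + chunk)
        counts.append(running_nodes)
    return counts


def memory_split(limit_bytes: int) -> dict:
    """Split a pod memory limit as the comment describes:
    10% Redis, 30% object store, the remainder for the application."""
    redis = limit_bytes * 10 // 100
    object_store = limit_bytes * 30 // 100
    return {
        "redis": redis,
        "object_store": object_store,
        "application": limit_bytes - redis - object_store,
    }


# Head node alone, scaling to the head plus maxWorkers (3) = 4 nodes at upscalingSpeed 1.0.
print(scale_up_steps(running_nodes=1, target_nodes=4, upscaling_speed=1.0))  # [2, 4]

# `memory: 1G` in the pod limits, i.e. 10**9 bytes.
print(memory_split(10**9))
# {'redis': 100000000, 'object_store': 300000000, 'application': 600000000}
```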
@@ -0,0 +1,9 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
  - ray-operator-serviceaccount.yaml
  - ray-operator-role.yaml
  - ray-operator-rolebinding.yaml
  - ray-operator-deployment.yaml
  - ray-custom-resources.yaml