feat: Ray Cluster and Operator Deployment
MichaelClifford committed Aug 26, 2022
1 parent 0a80db1 commit 1646b00
Showing 11 changed files with 4,634 additions and 0 deletions.
70 changes: 70 additions & 0 deletions ray/README.md
@@ -0,0 +1,70 @@
# Deploying Ray with Open Data Hub

_WIP Docs: _

Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub on OpenShift. The Ray operator and other components are based on https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html.

## Components of the Ray deployment

1. [Ray operator](./operator/ray-operator-deployment.yaml): The operator processes RayCluster resources and schedules Ray head and worker pods based on the cluster's requirements.
2. [Ray CR](./operator/ray-custom-resources.yaml): The RayCluster Custom Resource (CR) describes the desired state of the Ray cluster.
3. [Ray Cluster](./cluster/ray-cluster.yaml): Defines an example Ray cluster instance.


## Deploy the RayCluster Components

Prerequisites for installing a RayCluster with ODH:

* Cluster admin access
* An ODH deployment
* [Kustomize](https://kustomize.io/)

### Install Ray

We will use [Kustomize](https://kustomize.io/) to deploy everything we need to use Ray with Open Data Hub.

#### Install the operator and custom resource

First, use the `oc kustomize` command to generate a YAML file containing all the requirements for the operator and the "raycluster" custom resource, then `oc apply` that YAML to deploy the operator to your cluster.

```bash
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml
```
```bash
$ oc apply -f operator_deployment.yaml
```

#### Confirm the operator is running

```
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h
```

#### Create a Ray cluster

```bash
$ oc kustomize deploy/odh-ray-nbc/cluster > cluster_deployment.yaml
```
```bash
$ oc apply -f cluster_deployment.yaml
```

#### Confirm the cluster is running
```
$ oc get pods
NAME                     READY   STATUS    RESTARTS   AGE
ray-cluster-head-2f866   1/1     Running   0          36m
```

Once the cluster is running, you can connect to it and use Ray from a Python script or Jupyter notebook with `ray.init("ray://<Ray_Cluster_Service_Name>:10001")`, where `<Ray_Cluster_Service_Name>` is the name of the head service created by the Ray operator.
```python
import ray

ray.init("ray://<Ray_Cluster_Service_Name>:10001")
```
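
As a quick sanity check that the client connection works, you can run a trivial remote task on the cluster. The sketch below assumes the code runs in the same OpenShift project as the example cluster, so the head service resolves by its short name (`ray-cluster-example-ray-head`, matching the `odh-ray-cluster-service` label in [ray-cluster.yaml](./cluster/ray-cluster.yaml)); adjust the address for your own deployment.

```python
import ray

# Connect to the example cluster's Ray Client server (port 10001).
# Replace the service name if your cluster or namespace differs.
ray.init("ray://ray-cluster-example-ray-head:10001")

@ray.remote
def square(x):
    return x * x

# These tasks execute on the Ray cluster, not in the local process.
print(ray.get([square.remote(i) for i in range(5)]))  # [0, 1, 4, 9, 16]

ray.shutdown()
```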

That's it!
5 changes: 5 additions & 0 deletions ray/cluster/kustomization.yaml
@@ -0,0 +1,5 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-cluster.yaml
115 changes: 115 additions & 0 deletions ray/cluster/ray-cluster.yaml
@@ -0,0 +1,115 @@
kind: RayCluster
apiVersion: cluster.ray.io/v1
metadata:
  name: 'ray-cluster-example'
  labels:
    # used to look up the name of the service that the Ray operator creates
    odh-ray-cluster-service: 'ray-cluster-example-ray-head'
spec:
  # we can parameterize this when we fix the JH launcher json/jinja bug
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: head-node
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: 'ray-cluster-example-head-'
        spec:
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: quay.io/thoth-station/ray-ml-worker:v0.2.1
              # Do not change this command - it keeps the pod alive until it is explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ['trap : TERM INT; sleep infinity & wait;']
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve
              env:
                # defining HOME is part of a workaround for:
                # https://github.com/ray-project/ray/issues/14155
                - name: HOME
                  value: '/home'
              # This volume mount provides the shared memory that Ray uses for plasma.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 1000m
                  memory: 1G
                  ephemeral-storage: 1Gi
                limits:
                  cpu: 1000m
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 1G
                  nvidia.com/gpu: 1
    - name: worker-nodes
      # we can parameterize this when we fix the JH launcher json/jinja bug
      minWorkers: 0
      maxWorkers: 3
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # Automatically generates a name for the pod with this prefix.
          generateName: 'ray-cluster-example-worker-'
        spec:
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: quay.io/thoth-station/ray-ml-worker:v0.2.1
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: HOME
                  value: '/home'
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 1000m
                  memory: 1G
                limits:
                  cpu: 1000m
                  memory: 1G
                  nvidia.com/gpu: 1
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - cd /home/ray; pipenv run ray stop
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --no-monitor --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - cd /home/ray; pipenv run ray stop
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
9 changes: 9 additions & 0 deletions ray/operator/kustomization.yaml
@@ -0,0 +1,9 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-operator-serviceaccount.yaml
- ray-operator-role.yaml
- ray-operator-rolebinding.yaml
- ray-operator-deployment.yaml
- ray-custom-resources.yaml
