This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

feat: Ray Cluster and Operator Deployment #638

Merged 2 commits on Oct 14, 2022
9 changes: 9 additions & 0 deletions ray/OWNERS
@@ -0,0 +1,9 @@
approvers:
- michaelclifford
- rimolive
- HumairAK

reviewers:
- HumairAK
- michaelclifford
- rimolive

Review comment (Contributor): there's an additional space before rimolive

68 changes: 68 additions & 0 deletions ray/README.md
@@ -0,0 +1,68 @@
# Deploying Ray with Open Data Hub

Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub on OpenShift. The Ray operator and other components are based on https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html

## Components of the Ray deployment

1. [Ray operator](./operator/base/ray-operator-deployment.yaml): The operator processes RayCluster resources and schedules Ray head and worker pods based on their requirements. The image Containerfile can be found [here](https://github.com/thoth-station/ray-operator/blob/master/Containerfile).
2. [Ray CR](./operator/base/ray-custom-resources.yaml): The RayCluster Custom Resource (CR) describes the desired state of a Ray cluster.
3. [Ray Cluster](./cluster/base/ray-cluster.yaml): Defines an instance of an example Ray cluster. The image Containerfile can be found [here](https://github.com/thoth-station/ray-ml-worker/blob/master/Containerfile).


## Deploy the RayCluster Components:

Prerequisites to install a RayCluster with ODH:

* Cluster admin access
* An ODH deployment
* [Kustomize](https://kustomize.io/)

### Install Ray

We will use [Kustomize](https://kustomize.io/) to deploy everything we need to use Ray with Open Data Hub.

#### Install the operator and custom resource

First, use the `oc kustomize` command to generate a YAML file containing all the requirements for the operator and the RayCluster custom resource, then `oc apply` that file to deploy the operator to your cluster.

```bash
$ oc kustomize ray/operator/base > operator_deployment.yaml
```
```bash
$ oc apply -f operator_deployment.yaml -n <your-namespace>
```

#### Confirm the operator is running

```
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h

```
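
Optionally, you can also verify that the RayCluster custom resource definition is present (a quick sanity check; the CRD itself is defined in `ray-custom-resources.yaml`):

```bash
$ oc get crd | grep -i raycluster
```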

#### Create a Ray cluster


```bash
$ oc kustomize ray/cluster/base > cluster_deployment.yaml
```
```bash
$ oc apply -f cluster_deployment.yaml -n <your-namespace>
```

#### Confirm the cluster is running
```
$ oc get pods
NAME                     READY   STATUS    RESTARTS   AGE
ray-cluster-head-2f866   1/1     Running   0          36m

```
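
The operator also creates a head service for the cluster (named `ray-cluster-example-ray-head` for this example, as recorded in the `odh-ray-cluster-service` label of `ray-cluster.yaml`). You can look it up with:

```bash
$ oc get svc | grep ray-head
```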

Once the cluster is running, you should be able to connect to it and use Ray from a Python script or Jupyter notebook by calling `ray.init("ray://<Ray_Cluster_Service_Name>:10001")`, where `<Ray_Cluster_Service_Name>` is the head service name (e.g. `ray-cluster-example-ray-head`).
```python
import ray
ray.init("ray://<Ray_Cluster_Service_Name>:10001")
```
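
Once connected, submitted tasks run on the cluster's pods rather than locally. Here is a minimal sketch (the `square` task is only an illustrative example, not part of these manifests):

```python
import ray

# Connect through the Ray Client port (10001) exposed by the head service.
ray.init("ray://<Ray_Cluster_Service_Name>:10001")

# A trivial remote task; it executes on the cluster, not on the local machine.
@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```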

That's it!
7 changes: 7 additions & 0 deletions ray/cluster/base/kustomization.yaml
@@ -0,0 +1,7 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-cluster.yaml
- ../prometheus

Review comment (Contributor): remove newline

Review comment (Contributor): sorry -- it's the opposite, newline missing

119 changes: 119 additions & 0 deletions ray/cluster/base/ray-cluster.yaml
@@ -0,0 +1,119 @@
kind: RayCluster
apiVersion: cluster.ray.io/v1
metadata:
  name: 'ray-cluster-example'
  labels:
    # allows me to return name of service that Ray operator creates
    odh-ray-cluster-service: 'ray-cluster-example-ray-head'
spec:
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: 'ray-cluster-example-head-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1
          # Do not change this command - it keeps the pod alive until it is explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265 # Used by Ray Dashboard
          - containerPort: 8000 # Used by Ray Serve
          - containerPort: 8080 # Used for metrics
          # This volume allocates shared memory for Ray to use for plasma
          env:
          # defining HOME is part of a workaround for:
          # https://github.com/ray-project/ray/issues/14155
          - name: HOME
            value: '/home'
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2
              memory: 2G
              ephemeral-storage: 1Gi
            limits:
              cpu: 2
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 2G
  - name: worker-nodes
    # we can parameterize this when we fix the JH launcher json/jinja bug

Review comment (Member): Given ODH's focus on NBC, is this still valid?

    minWorkers: 1
    maxWorkers: 3
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: 'ray-cluster-example-worker-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          env:
          - name: HOME
            value: '/home'
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2
              memory: 2G
            limits:
              cpu: 2
              memory: 2G

  headServicePorts:
  - name: metrics
    port: 8080
    targetPort: 8080
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
  - cd /home/ray; pipenv run ray stop
  - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --metrics-export-port=8080 --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  # ($RAY_HEAD_IP is expected to be injected into worker pods by the Ray operator.)
  workerStartRayCommands:
  - cd /home/ray; pipenv run ray stop
  - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

15 changes: 15 additions & 0 deletions ray/cluster/prometheus/kustomization.yaml
@@ -0,0 +1,15 @@
# Copyright 2021 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
resources:
- monitor.yaml
29 changes: 29 additions & 0 deletions ray/cluster/prometheus/monitor.yaml
@@ -0,0 +1,29 @@
# Copyright 2021 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Prometheus Monitor Service (Metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: ray-monitor
  name: ray-monitor
spec:
  endpoints:
  - interval: 30s
    port: metrics
    scheme: http
  selector:
    matchLabels:
      app: ray-monitor

9 changes: 9 additions & 0 deletions ray/operator/base/kustomization.yaml
@@ -0,0 +1,9 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-operator-serviceaccount.yaml
- ray-operator-role.yaml
- ray-operator-rolebinding.yaml
- ray-operator-deployment.yaml
- ray-custom-resources.yaml