This repository has been archived by the owner on Jan 31, 2024. It is now read-only.
feat: Ray Cluster and Operator Deployment #638
Merged
openshift-merge-robot
merged 2 commits into
opendatahub-io:master
from
MichaelClifford:ray
Oct 14, 2022
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
approvers: | ||
- michaelclifford | ||
- rimolive | ||
- HumairAK | ||
|
||
reviewers: | ||
- HumairAK | ||
- michaelclifford | ||
- rimolive | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
# Deploying Ray with Open Data Hub | ||
|
||
Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub on OpenShift. The Ray operator and other components are based on the Ray 1.13.0 Kubernetes documentation: https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html | ||
|
||
## Components of the Ray deployment | ||
|
||
1. [Ray operator](./operator/base/ray-operator-deployment.yaml): The operator processes RayCluster resources and schedules Ray head and worker pods based on their requirements. The image's Containerfile can be found [here](https://github.com/thoth-station/ray-operator/blob/master/Containerfile). | ||
2. [Ray CR](./operator/base/ray-custom-resources.yaml): The RayCluster Custom Resource (CR) describes the desired state of a Ray cluster. | ||
3. [Ray Cluster](./cluster/base/ray-cluster.yaml): Defines an instance of an example Ray cluster. The image's Containerfile can be found [here](https://github.com/thoth-station/ray-ml-worker/blob/master/Containerfile). | ||
|
||
|
||
## Deploy the RayCluster Components | ||
|
||
Prerequisites for installing RayCluster with ODH: | ||
|
||
* Cluster admin access | ||
* An ODH deployment | ||
* [Kustomize](https://kustomize.io/) | ||
|
||
### Install Ray | ||
|
||
We will use [Kustomize](https://kustomize.io/) to deploy everything we need to use Ray with Open Data Hub. | ||
|
||
#### Install the operator and custom resource | ||
|
||
First use the `oc kustomize` command to generate a yaml containing all the requirements for the operator and the "raycluster" custom resource, then `oc apply` that yaml to deploy the operator to your cluster. | ||
|
||
```bash | ||
$ oc kustomize ray/operator/base > operator_deployment.yaml | ||
``` | ||
```bash | ||
$ oc apply -f operator_deployment.yaml -n <your-namespace> | ||
``` | ||
|
||
#### Confirm the operator is running | ||
|
||
``` | ||
$ oc get pods | ||
NAME READY STATUS RESTARTS AGE | ||
ray-operator-867bc855b7-2tzxs 1/1 Running 0 4d19h | ||
|
||
``` | ||
|
||
#### Create a ray cluster | ||
|
||
|
||
```bash | ||
$ oc kustomize ray/cluster/base > cluster_deployment.yaml | ||
``` | ||
```bash | ||
$ oc apply -f cluster_deployment.yaml -n <your-namespace> | ||
``` | ||
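The render-then-apply pattern used for both the operator and the cluster can be scripted. The following is a minimal sketch, not part of this PR: it assumes `oc` is on your `PATH` and that you are already logged in to the cluster. The helper names `kustomize_cmd`, `apply_cmd`, and `deploy` are hypothetical; only the `oc kustomize <dir>` and `oc apply -f <file> -n <namespace>` invocations come from the README.

```python
import subprocess
from pathlib import Path
from typing import List


def kustomize_cmd(kustomize_dir: str) -> List[str]:
    """Argv that renders a kustomization directory to YAML on stdout."""
    return ["oc", "kustomize", kustomize_dir]


def apply_cmd(manifest: str, namespace: str) -> List[str]:
    """Argv that applies a rendered manifest file to a namespace."""
    return ["oc", "apply", "-f", manifest, "-n", namespace]


def deploy(kustomize_dir: str, manifest: str, namespace: str) -> None:
    """Render the kustomization into `manifest`, then apply it."""
    rendered = subprocess.run(
        kustomize_cmd(kustomize_dir),
        check=True,           # raise if `oc kustomize` fails
        capture_output=True,
        text=True,
    ).stdout
    Path(manifest).write_text(rendered)
    subprocess.run(apply_cmd(manifest, namespace), check=True)


# Example calls (not executed here; "my-namespace" is a placeholder):
# deploy("ray/operator/base", "operator_deployment.yaml", "my-namespace")
# deploy("ray/cluster/base", "cluster_deployment.yaml", "my-namespace")
```

Keeping the rendered file on disk, as the README does, makes it easy to inspect exactly what was applied.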
|
||
#### Confirm the cluster is running | ||
``` | ||
$ oc get pods | ||
NAME READY STATUS RESTARTS AGE | ||
ray-cluster-head-2f866 1/1 Running 0 36m | ||
|
||
``` | ||
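Both confirmation steps above amount to reading the `READY` and `STATUS` columns of `oc get pods`. A small sketch of that check follows; the function names are hypothetical and the sample text simply combines the two outputs shown in the README. It relies only on the first three columns of the default table, so for anything beyond a quick check, structured output such as `oc get pods -o json` is likely a more robust input.

```python
from typing import Dict, List


def parse_pods(output: str) -> List[Dict[str, str]]:
    """Parse the default `oc get pods` table into one dict per pod."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def all_ready(pods: List[Dict[str, str]], name_prefix: str) -> bool:
    """True if at least one pod matches the prefix and every match is
    Running with all of its containers ready (e.g. READY == "1/1")."""
    matching = [p for p in pods if p["NAME"].startswith(name_prefix)]

    def ready(pod: Dict[str, str]) -> bool:
        have, want = pod["READY"].split("/")
        return pod["STATUS"] == "Running" and have == want

    return bool(matching) and all(ready(p) for p in matching)


# Sample text assembled from the two outputs shown in the README.
SAMPLE = """\
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h
ray-cluster-head-2f866          1/1     Running   0          36m
"""

pods = parse_pods(SAMPLE)
print(all_ready(pods, "ray-operator"))
print(all_ready(pods, "ray-cluster-head"))
```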
|
||
Once the cluster is running, you should be able to connect to it from a Python script or Jupyter notebook by using `ray.init("ray://<Ray_Cluster_Service_Name>:10001")`. | ||
```python | ||
import ray | ||
ray.init("ray://<Ray_Cluster_Service_Name>:10001") | ||
``` | ||
|
||
That's it! |
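The Ray Client address passed to `ray.init` has to be a string. The sketch below builds it from a service name; the helper `ray_client_address` is hypothetical, and the service name `ray-cluster-example-ray-head` is an assumption taken from the `odh-ray-cluster-service` label in `ray-cluster.yaml`, so check the actual name with `oc get svc`. The `<service>.<namespace>.svc` form is the standard Kubernetes DNS name for reaching a Service from another namespace.

```python
from typing import Optional


def ray_client_address(
    service_name: str, namespace: Optional[str] = None, port: int = 10001
) -> str:
    """Build a Ray Client URL for the head-node Service.

    10001 is the Ray Client port exposed by the head pod in ray-cluster.yaml.
    """
    host = service_name if namespace is None else f"{service_name}.{namespace}.svc"
    return f"ray://{host}:{port}"


# Assumed service name; confirm it in your cluster before use.
address = ray_client_address("ray-cluster-example-ray-head")
print(address)

# With Ray installed and network access to the Service:
# import ray
# ray.init(address)
```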
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
--- | ||
kind: Kustomization | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
resources: | ||
- ray-cluster.yaml | ||
- ../prometheus | ||
|
||
Review comment: remove newline
Reply: sorry -- it's the opposite, newline missing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
kind: RayCluster | ||
apiVersion: cluster.ray.io/v1 | ||
metadata: | ||
  name: 'ray-cluster-example' | ||
  labels: | ||
    # allows me to return name of service that Ray operator creates | ||
    odh-ray-cluster-service: 'ray-cluster-example-ray-head' | ||
spec: | ||
  maxWorkers: 3 | ||
  # The autoscaler will scale up the cluster faster with higher upscaling speed. | ||
  # E.g., if the task requires adding more nodes then autoscaler will gradually | ||
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes. | ||
  # This number should be > 0. | ||
  upscalingSpeed: 1.0 | ||
  # If a node is idle for this many minutes, it will be removed. | ||
  idleTimeoutMinutes: 5 | ||
  # Specify the pod type for the ray head node (as configured below). | ||
  headPodType: head-node | ||
  # Specify the allowed pod types for this ray cluster and the resources they provide. | ||
  podTypes: | ||
    - name: head-node | ||
      podConfig: | ||
        apiVersion: v1 | ||
        kind: Pod | ||
        metadata: | ||
          generateName: 'ray-cluster-example-head-' | ||
        spec: | ||
          restartPolicy: Never | ||
          volumes: | ||
            - name: dshm | ||
              emptyDir: | ||
                medium: Memory | ||
          containers: | ||
            - name: ray-node | ||
              imagePullPolicy: Always | ||
              image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1 | ||
              # Do not change this command - it keeps the pod alive until it is explicitly killed. | ||
              command: ["/bin/bash", "-c", "--"] | ||
              args: ['trap : TERM INT; sleep infinity & wait;'] | ||
              ports: | ||
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0. | ||
                - containerPort: 10001 # Used by Ray Client | ||
                - containerPort: 8265 # Used by Ray Dashboard | ||
                - containerPort: 8000 # Used by Ray Serve | ||
                - containerPort: 8080 # Used for metrics | ||
              # This volume allocates shared memory for Ray to use for plasma | ||
              env: | ||
                # defining HOME is part of a workaround for: | ||
                # https://github.com/ray-project/ray/issues/14155 | ||
                - name: HOME | ||
                  value: '/home' | ||
              volumeMounts: | ||
                - mountPath: /dev/shm | ||
                  name: dshm | ||
              resources: | ||
                requests: | ||
                  cpu: 2 | ||
                  memory: 2G | ||
                  ephemeral-storage: 1Gi | ||
                limits: | ||
                  cpu: 2 | ||
                  # The maximum memory that this pod is allowed to use. The | ||
                  # limit will be detected by ray and split to use 10% for | ||
                  # redis, 30% for the shared memory object store, and the | ||
                  # rest for application memory. If this limit is not set and | ||
                  # the object store size is not set manually, ray will | ||
                  # allocate a very large object store in each pod that may | ||
                  # cause problems for other pods. | ||
                  memory: 2G | ||
    - name: worker-nodes | ||
      # we can parameterize this when we fix the JH launcher json/jinja bug | ||
Review comment: Given ODH focus on NBC is this still valid?
||
      minWorkers: 1 | ||
      maxWorkers: 3 | ||
      podConfig: | ||
        apiVersion: v1 | ||
        kind: Pod | ||
        metadata: | ||
          # Automatically generates a name for the pod with this prefix. | ||
          generateName: 'ray-cluster-example-worker-' | ||
        spec: | ||
          restartPolicy: Never | ||
          volumes: | ||
            - name: dshm | ||
              emptyDir: | ||
                medium: Memory | ||
          containers: | ||
            - name: ray-node | ||
              imagePullPolicy: Always | ||
              image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1 | ||
              command: ["/bin/bash", "-c", "--"] | ||
              args: ["trap : TERM INT; sleep infinity & wait;"] | ||
              env: | ||
                - name: HOME | ||
                  value: '/home' | ||
              volumeMounts: | ||
                - mountPath: /dev/shm | ||
                  name: dshm | ||
              resources: | ||
                requests: | ||
                  cpu: 2 | ||
                  memory: 2G | ||
                limits: | ||
                  cpu: 2 | ||
                  memory: 2G | ||
|
||
  headServicePorts: | ||
    - name: metrics | ||
      port: 8080 | ||
      targetPort: 8080 | ||
  # Commands to start Ray on the head node. You don't need to change this. | ||
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward. | ||
  headStartRayCommands: | ||
    - cd /home/ray; pipenv run ray stop | ||
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --metrics-export-port=8080 --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0 | ||
  # Commands to start Ray on worker nodes. You don't need to change this. | ||
  workerStartRayCommands: | ||
    - cd /home/ray; pipenv run ray stop | ||
    - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 | ||
|
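Two comments in this manifest describe arithmetic that is easy to check by hand: the upscaling chunk size and the split of the pod memory limit. The sketch below is a literal reading of those comments only; the function names are hypothetical, and the real Ray autoscaler and memory detection may apply additional rules that the comments do not mention. Note that Kubernetes reads `2G` as 2 × 10^9 bytes (decimal), unlike `2Gi`.

```python
import math
from typing import Dict


def next_scale_up_chunk(
    running_nodes: int, upscaling_speed: float, max_workers: int, current_workers: int
) -> int:
    """Nodes added in one step: upscaling_speed * currently_running_nodes,
    capped by the remaining headroom under maxWorkers (the cap is an
    assumption of this sketch, taken from the maxWorkers field)."""
    if upscaling_speed <= 0:
        raise ValueError("upscalingSpeed should be > 0")
    chunk = math.ceil(upscaling_speed * running_nodes)
    headroom = max(0, max_workers - current_workers)
    return min(chunk, headroom)


def split_memory(limit_bytes: int) -> Dict[str, int]:
    """Split a memory limit as the comment describes: 10% redis,
    30% shared-memory object store, the rest for the application."""
    redis = limit_bytes * 10 // 100
    object_store = limit_bytes * 30 // 100
    return {
        "redis": redis,
        "object_store": object_store,
        "application": limit_bytes - redis - object_store,
    }


# Head node plus one worker running, upscalingSpeed 1.0, maxWorkers 3.
print(next_scale_up_chunk(running_nodes=2, upscaling_speed=1.0,
                          max_workers=3, current_workers=1))
# The 2G limit used for both pod types in this manifest.
print(split_memory(2_000_000_000))
```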
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Copyright 2021 IBM Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
resources: | ||
- monitor.yaml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# Copyright 2021 IBM Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# Prometheus Monitor Service (Metrics) | ||
apiVersion: monitoring.coreos.com/v1 | ||
kind: ServiceMonitor | ||
metadata: | ||
  labels: | ||
    k8s-app: ray-monitor | ||
  name: ray-monitor | ||
spec: | ||
  endpoints: | ||
    - interval: 30s | ||
      port: metrics | ||
      scheme: http | ||
  selector: | ||
    matchLabels: | ||
      app: ray-monitor | ||
|
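The `selector.matchLabels` block decides which Services this ServiceMonitor scrapes: a Service is selected only if it carries every listed key with the same value, and `port: metrics` then refers to a named port on that Service. A minimal sketch of the matching rule follows; the `matches` helper and the example label sets are illustrative, not taken from the operator.

```python
from typing import Dict


def matches(match_labels: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Kubernetes matchLabels semantics: every selector key must be present
    on the object with an equal value. Extra labels on the object are fine."""
    return all(labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "ray-monitor"}

# A Service labelled app=ray-monitor is selected, whatever else it carries.
print(matches(selector, {"app": "ray-monitor", "cluster": "ray-cluster-example"}))
# A Service without that label is not scraped.
print(matches(selector, {"odh-ray-cluster-service": "ray-cluster-example-ray-head"}))
```

If no metrics show up in Prometheus, checking the head Service's labels against this selector is a reasonable first step.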
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
kind: Kustomization | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
resources: | ||
- ray-operator-serviceaccount.yaml | ||
- ray-operator-role.yaml | ||
- ray-operator-rolebinding.yaml | ||
- ray-operator-deployment.yaml | ||
- ray-custom-resources.yaml |
Review comment: there's an additional space before `rimolive`