feat: Ray Cluster and Operator Deployment

opendatahub-io · Aug 26, 2022 · 1646b00
1 parent 0a80db1
commit 1646b00

Showing 11 changed files with 4,634 additions and 0 deletions.
diff --git a/ray/README.md b/ray/README.md
@@ -0,0 +1,70 @@
+# Deploying Ray with Open Data Hub
+
+_WIP Docs_
+
+Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub on OpenShift. The Ray operator and other components are based on the [Ray 1.13.0 Kubernetes documentation](https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html).
+
+## Components of the Ray deployment
+
+1. [Ray operator](./operator/ray-operator-deployment.yaml): The operator processes RayCluster resources and schedules Ray head and worker pods based on their requirements.
+2. [Ray CR](./operator/ray-custom-resources.yaml): The RayCluster custom resource (CR) describes the desired state of a Ray cluster.
+3. [Ray Cluster](./cluster/ray-cluster.yaml): Defines an example RayCluster instance.
+
+
+## Deploy the RayCluster components
+
+Prerequisites for installing RayCluster with ODH:
+
+* Cluster admin access
+* An ODH deployment
+* [Kustomize](https://kustomize.io/) 
+
+### Install Ray
+
+We will use [Kustomize](https://kustomize.io/) to deploy everything we need to use Ray with Open Data Hub. 
+
+#### Install the operator and custom resource 
+
+First, use the `oc kustomize` command to generate a YAML file containing everything the operator and the `RayCluster` custom resource require, then use `oc apply` to deploy that file to your cluster.
+
+```bash
+$ oc kustomize ray/operator > operator_deployment.yaml
+```
+```bash
+$ oc apply -f operator_deployment.yaml
+```
+
+#### Confirm the operator is running 
+
+```
+$ oc get pods 
+NAME                               READY   STATUS    RESTARTS        AGE
+ray-operator-867bc855b7-2tzxs      1/1     Running   0               4d19h
+
+```
+
+#### Create a Ray cluster
+
+Generate the manifest for the example cluster and apply it in the same way:
+```bash
+$ oc kustomize ray/cluster > cluster_deployment.yaml
+```
+```bash
+$ oc apply -f cluster_deployment.yaml
+```
+
+#### Confirm the cluster is running 
+```
+$ oc get pods 
+NAME                               READY   STATUS    RESTARTS        AGE
+ray-cluster-head-2f866             1/1     Running   0               36m
+
+```
+
+Once the cluster is running, you can connect to it from a Python script or Jupyter notebook by calling `ray.init("ray://<Ray_Cluster_Service_Name>:10001")`. The address must be passed as a string.
+```python
+import ray
+ray.init("ray://<Ray_Cluster_Service_Name>:10001")  # 10001 is the Ray Client port on the head node
+```
+
+That's it! 
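
A note on the connection step in the README above: the sketch below is one possible way to build the Ray Client address and run a trivial remote task once the head pod is up. The helper names `ray_client_address` and `connect_and_square` are hypothetical and not part of this commit. The service name `ray-cluster-example-ray-head` is taken from the `odh-ray-cluster-service` label in `ray/cluster/ray-cluster.yaml`, and the sketch assumes the client's `ray` package is compatible with the version in the cluster image.

```python
def ray_client_address(service_name: str, port: int = 10001) -> str:
    """Build a Ray Client URL for a head-node service (hypothetical helper)."""
    if not service_name:
        raise ValueError("service_name must be non-empty")
    return f"ray://{service_name}:{port}"


def connect_and_square(service_name: str, values):
    """Connect through Ray Client, square each value remotely, then disconnect."""
    # Imported lazily so ray_client_address() works without Ray installed.
    import ray

    ray.init(address=ray_client_address(service_name))

    @ray.remote
    def square(x):
        return x * x

    try:
        # Submit one task per value and block until all results arrive.
        return ray.get([square.remote(v) for v in values])
    finally:
        ray.shutdown()


if __name__ == "__main__":
    # Assumed service name, taken from the label in ray-cluster.yaml.
    print(ray_client_address("ray-cluster-example-ray-head"))
```

Calling `connect_and_square("ray-cluster-example-ray-head", [1, 2, 3])` from a notebook in the same namespace should schedule the tasks on the cluster, provided the service resolves and port 10001 is reachable.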
diff --git a/ray/cluster/kustomization.yaml b/ray/cluster/kustomization.yaml
@@ -0,0 +1,5 @@
+---
+kind: Kustomization
+apiVersion: kustomize.config.k8s.io/v1beta1
+resources:
+  - ray-cluster.yaml
diff --git a/ray/cluster/ray-cluster.yaml b/ray/cluster/ray-cluster.yaml
@@ -0,0 +1,115 @@
+kind: RayCluster
+apiVersion: cluster.ray.io/v1
+metadata:
+  name: 'ray-cluster-example'
+  labels:
+      # Name of the head service that the Ray operator creates, stored as a label so it can be looked up
+      odh-ray-cluster-service: 'ray-cluster-example-ray-head'
+spec:
+  # we can parameterize this when we fix the JH launcher json/jinja bug
+  maxWorkers: 3
+  # The autoscaler will scale up the cluster faster with higher upscaling speed.
+  # E.g., if the task requires adding more nodes then autoscaler will gradually
+  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
+  # This number should be > 0.
+  upscalingSpeed: 1.0
+  # If a node is idle for this many minutes, it will be removed.
+  idleTimeoutMinutes: 5
+  # Specify the pod type for the ray head node (as configured below).
+  headPodType: head-node
+  # Specify the allowed pod types for this ray cluster and the resources they provide.
+  podTypes:
+  - name: head-node
+    podConfig:
+      apiVersion: v1
+      kind: Pod
+      metadata:
+        generateName: 'ray-cluster-example-head-'
+      spec:
+        restartPolicy: Never
+        volumes:
+        - name: dshm
+          emptyDir:
+            medium: Memory
+        containers:
+        - name: ray-node
+          imagePullPolicy: Always
+          image: quay.io/thoth-station/ray-ml-worker:v0.2.1
+          # Do not change this command - it keeps the pod alive until it is explicitly killed.
+          command: ["/bin/bash", "-c", "--"]
+          args: ['trap : TERM INT; sleep infinity & wait;']
+          ports:
+          - containerPort: 6379  # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
+          - containerPort: 10001  # Used by Ray Client
+          - containerPort: 8265  # Used by Ray Dashboard
+          - containerPort: 8000 # Used by Ray Serve
+          env:
+          # defining HOME is part of a workaround for:
+          # https://github.com/ray-project/ray/issues/14155
+          - name: HOME
+            value: '/home'
+          # Mounts the in-memory dshm volume at /dev/shm so Ray can use it as shared memory for plasma
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          resources:
+            requests:
+              cpu: 1000m
+              memory: 1G
+              ephemeral-storage: 1Gi
+            limits:
+              cpu: 1000m
+              # The maximum memory that this pod is allowed to use. The
+              # limit will be detected by ray and split to use 10% for
+              # redis, 30% for the shared memory object store, and the
+              # rest for application memory. If this limit is not set and
+              # the object store size is not set manually, ray will
+              # allocate a very large object store in each pod that may
+              # cause problems for other pods.
+              memory: 1G
+              nvidia.com/gpu: 1
+  - name: worker-nodes
+    # we can parameterize this when we fix the JH launcher json/jinja bug
+    minWorkers: 0
+    maxWorkers: 3
+    podConfig:
+      apiVersion: v1
+      kind: Pod
+      metadata:
+        # Automatically generates a name for the pod with this prefix.
+        generateName: 'ray-cluster-example-worker-'
+      spec:
+        restartPolicy: Never
+        volumes:
+        - name: dshm
+          emptyDir:
+            medium: Memory
+        containers:
+        - name: ray-node
+          imagePullPolicy: Always
+          image: quay.io/thoth-station/ray-ml-worker:v0.2.1
+          command: ["/bin/bash", "-c", "--"]
+          args: ["trap : TERM INT; sleep infinity & wait;"]
+          env:
+          - name: HOME
+            value: '/home'
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          resources:
+            requests:
+              cpu: 1000m
+              memory: 1G
+            limits:
+              cpu: 1000m
+              memory: 1G
+              nvidia.com/gpu: 1
+  # Commands to start Ray on the head node. You don't need to change this.
+  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
+  headStartRayCommands:
+      - cd /home/ray; pipenv run ray stop
+      - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --no-monitor --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0
+  # Commands to start Ray on worker nodes. You don't need to change this.
+  workerStartRayCommands:
+      - cd /home/ray; pipenv run ray stop
+      - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
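
The memory comment in the head pod spec above can be checked with a little arithmetic. The sketch below applies the 10% / 30% / remainder split described in that comment to the `memory: 1G` limit. Both `parse_quantity` and `split_memory_limit` are hypothetical helpers for illustration, not Ray or Kubernetes code, and the percentages come from the comment rather than from a verified Ray release. Kubernetes reads `1G` as 10^9 bytes, whereas `1Gi` would be 2^30 bytes.

```python
# Byte multipliers for a few Kubernetes quantity suffixes (decimal vs binary).
SUFFIXES = {"G": 10**9, "Gi": 2**30, "M": 10**6, "Mi": 2**20}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '1G' or '1Gi' to bytes (integer values only)."""
    # Check longer suffixes first so 'Gi' is never mistaken for 'G'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * SUFFIXES[suffix]
    return int(quantity)  # plain byte count


def split_memory_limit(limit_bytes: int) -> dict:
    """Split a pod memory limit as the manifest comment describes."""
    redis = limit_bytes * 10 // 100
    object_store = limit_bytes * 30 // 100
    return {
        "redis": redis,
        "object_store": object_store,
        # Whatever is left is available to application code.
        "application": limit_bytes - redis - object_store,
    }


if __name__ == "__main__":
    for name, size in split_memory_limit(parse_quantity("1G")).items():
        print(f"{name}: {size / 10**6:.0f} MB")
```

With the 1G limit used here, that leaves roughly 300 MB for the object store on each pod, which is worth keeping in mind before placing large objects in it.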
diff --git a/ray/operator/kustomization.yaml b/ray/operator/kustomization.yaml
@@ -0,0 +1,9 @@
+---
+kind: Kustomization
+apiVersion: kustomize.config.k8s.io/v1beta1
+resources:
+  - ray-operator-serviceaccount.yaml
+  - ray-operator-role.yaml
+  - ray-operator-rolebinding.yaml
+  - ray-operator-deployment.yaml
+  - ray-custom-resources.yaml