This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

feat: Ray Cluster and Operator Deployment #638

Merged 2 commits on Oct 14, 2022
9 changes: 9 additions & 0 deletions ray/OWNERS
@@ -0,0 +1,9 @@
approvers:
- michaelclifford
- rimolive
- HumairAK

reviewers:
- HumairAK
- michaelclifford
- rimolive

Review comment (Contributor): there's an additional space before rimolive

68 changes: 68 additions & 0 deletions ray/README.md
@@ -0,0 +1,68 @@
# Deploying Ray with Open Data Hub

Integration of [Ray](https://docs.ray.io/en/latest/index.html) with Open Data Hub on OpenShift. The Ray operator and other components are based on https://docs.ray.io/en/releases-1.13.0/cluster/kubernetes.html

## Components of the Ray deployment

1. [Ray operator](./operator/base/ray-operator-deployment.yaml): The operator processes RayCluster resources and schedules Ray head and worker pods based on their requirements. The image Containerfile can be found [here](https://github.com/thoth-station/ray-operator/blob/master/Containerfile).
2. [Ray CR](./operator/base/ray-custom-resources.yaml): The RayCluster Custom Resource (CR) describes the desired state of a Ray cluster.
3. [Ray Cluster](./cluster/base/ray-cluster.yaml): Defines an instance of an example Ray cluster. The image Containerfile can be found [here](https://github.com/thoth-station/ray-ml-worker/blob/master/Containerfile).


## Deploy the RayCluster Components:

Prerequisites to install a RayCluster with ODH:

* Cluster admin access
* An ODH deployment
* [Kustomize](https://kustomize.io/)

### Install Ray

We will use [Kustomize](https://kustomize.io/) to deploy everything we need to use Ray with Open Data Hub.

#### Install the operator and custom resource

First, use the `oc kustomize` command to generate a YAML file containing all the requirements for the operator and the RayCluster custom resource, then `oc apply` that file to deploy the operator to your cluster.

```bash
$ oc kustomize ray/operator/base > operator_deployment.yaml
```
```bash
$ oc apply -f operator_deployment.yaml -n <your-namespace>
```

#### Confirm the operator is running

```
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
ray-operator-867bc855b7-2tzxs   1/1     Running   0          4d19h

```
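
Optionally, you can also verify that the RayCluster custom resource definition is present (a quick sanity check; the CRD itself is defined in `ray-custom-resources.yaml`):

```bash
$ oc get crd | grep -i raycluster
```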

#### Create a Ray cluster


```bash
$ oc kustomize ray/cluster/base > cluster_deployment.yaml
```
```bash
$ oc apply -f cluster_deployment.yaml -n <your-namespace>
```

#### Confirm the cluster is running
```
$ oc get pods
NAME                     READY   STATUS    RESTARTS   AGE
ray-cluster-head-2f866   1/1     Running   0          36m

```
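
The operator also creates a head service for the cluster (named `ray-cluster-example-ray-head` for this example, as recorded in the `odh-ray-cluster-service` label of `ray-cluster.yaml`). You can look it up with:

```bash
$ oc get svc | grep ray-head
```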

Once the cluster is running, you should be able to connect to it and use Ray from a Python script or Jupyter notebook by calling `ray.init("ray://<Ray_Cluster_Service_Name>:10001")`, where `<Ray_Cluster_Service_Name>` is the head service name (e.g. `ray-cluster-example-ray-head`).
```python
import ray
ray.init("ray://<Ray_Cluster_Service_Name>:10001")
```
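
Once connected, submitted tasks run on the cluster's pods rather than locally. Here is a minimal sketch (the `square` task is only an illustrative example, not part of these manifests):

```python
import ray

# Connect through the Ray Client port (10001) exposed by the head service.
ray.init("ray://<Ray_Cluster_Service_Name>:10001")

# A trivial remote task; it executes on the cluster, not on the local machine.
@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```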

That's it!
7 changes: 7 additions & 0 deletions ray/cluster/base/kustomization.yaml
@@ -0,0 +1,7 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-cluster.yaml
- ../prometheus

Review comment (Contributor): remove newline

Review comment (Contributor): sorry -- it's the opposite, newline missing

119 changes: 119 additions & 0 deletions ray/cluster/base/ray-cluster.yaml
@@ -0,0 +1,119 @@
kind: RayCluster
apiVersion: cluster.ray.io/v1
metadata:
  name: 'ray-cluster-example'
  labels:
    # allows me to return name of service that Ray operator creates
    odh-ray-cluster-service: 'ray-cluster-example-ray-head'
spec:
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: 'ray-cluster-example-head-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1
          # Do not change this command - it keeps the pod alive until it is explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265 # Used by Ray Dashboard
          - containerPort: 8000 # Used by Ray Serve
          - containerPort: 8080 # Used for metrics
          # This volume allocates shared memory for Ray to use for plasma
          env:
          # defining HOME is part of a workaround for:
          # https://github.com/ray-project/ray/issues/14155
          - name: HOME
            value: '/home'
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2
              memory: 2G
              ephemeral-storage: 1Gi
            limits:
              cpu: 2
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 2G
  - name: worker-nodes
    # we can parameterize this when we fix the JH launcher json/jinja bug

Review comment (Member): Given ODH's focus on NBC, is this still valid?

    minWorkers: 1
    maxWorkers: 3
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: 'ray-cluster-example-worker-'
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: quay.io/opendatahub-contrib/ray-ml-worker:v0.2.1
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          env:
          - name: HOME
            value: '/home'
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2
              memory: 2G
            limits:
              cpu: 2
              memory: 2G

  headServicePorts:
  - name: metrics
    port: 8080
    targetPort: 8080
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
  - cd /home/ray; pipenv run ray stop
  - ulimit -n 65536; cd /home/ray; pipenv run ray start --head --metrics-export-port=8080 --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  # ($RAY_HEAD_IP is expected to be injected into worker pods by the Ray operator.)
  workerStartRayCommands:
  - cd /home/ray; pipenv run ray stop
  - ulimit -n 65536; cd /home/ray; pipenv run ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

15 changes: 15 additions & 0 deletions ray/cluster/prometheus/kustomization.yaml
@@ -0,0 +1,15 @@
# Copyright 2021 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
resources:
- monitor.yaml
29 changes: 29 additions & 0 deletions ray/cluster/prometheus/monitor.yaml
@@ -0,0 +1,29 @@
# Copyright 2021 IBM Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Prometheus Monitor Service (Metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: ray-monitor
  name: ray-monitor
spec:
  endpoints:
  - interval: 30s
    port: metrics
    scheme: http
  selector:
    matchLabels:
      app: ray-monitor

9 changes: 9 additions & 0 deletions ray/operator/base/kustomization.yaml
@@ -0,0 +1,9 @@
---
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-operator-serviceaccount.yaml
- ray-operator-role.yaml
- ray-operator-rolebinding.yaml
- ray-operator-deployment.yaml
- ray-custom-resources.yaml