feat: Ray Cluster and Operator Deployment #638
Conversation
Hi @MichaelClifford. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@MichaelClifford This is going to require supporting CI tests that will run within our test framework. See the tests README.
Under what SIG/WG are we proposing this component? Or is it going to be something like a tier 2 component for the time being?
@anishasthana based on the conversation at the ML Ops SIG meeting earlier this week, this will exist initially as a Tier 2 component.
@LaVLaS I've updated the PR to add a couple of tests. Let me know if it needs anything else 😄
/retest
Force-pushed 8fdb338 to 6286da0
@MichaelClifford We are in the process of re-organizing components into tiers. I know there have been discussions about incubating Ray.io as a Tier 1 component, which means it will need a full test suite for verification.
You already have some tests to verify that it was deployed successfully; you'll need to add some tests to verify functionality so that any reviewer can determine whether future changes actually work.
Force-pushed 58f79ef to f455888
Force-pushed 5b689d2 to c53a6fc
@LaVLaS I've added an additional test that checks functionality. It connects to the example Ray cluster, runs a small job, then confirms the job ran on both the head and worker nodes. Thanks for your guidance on this 😃 Let me know if we need more tests here.
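The functionality check described above could be sketched roughly as follows. This is a hypothetical illustration rather than the actual test in the PR: the `ray-head` service name and client port are assumptions, and the helper simply checks that the collected node IPs span at least two nodes (head plus worker).

```python
# Hypothetical sketch of the functionality test: run a small Ray job and
# confirm it executed on at least two distinct nodes (head and worker).

def spans_multiple_nodes(node_ids):
    """True if the collected node identifiers cover two or more nodes."""
    return len(set(node_ids)) >= 2

def check_functionality():
    # Requires a live cluster; the service name and port are assumptions.
    import socket
    import ray

    ray.init(address="ray://ray-head:10001")  # hypothetical head service

    @ray.remote
    def node_ip():
        # Runs on whichever node Ray schedules the task.
        return socket.gethostbyname(socket.gethostname())

    ips = ray.get([node_ip.remote() for _ in range(20)])
    ray.shutdown()
    assert spans_multiple_nodes(ips), "job did not run on both head and worker"
```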
A couple of nits plus things to follow up on eventually. I'll try testing this PR by the end of the week.
ray/cluster/base/kustomization.yaml (outdated)
```yaml
kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
  - ray-cluster.yaml
```
Missing newline at end of file.
```yaml
    # cause problems for other pods.
    memory: 2G
- name: worker-nodes
  # we can parameterize this when we fix the JH launcher json/jinja bug
```
Given ODH's focus on NBC, is this still valid?
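If the worker count does need to be parameterized independently of the launcher, one option is a kustomize overlay patch. A minimal sketch follows; the overlay layout, cluster name, and field paths (which assume the legacy `cluster.ray.io/v1` RayCluster schema used by this operator) are assumptions, not part of this PR:

```yaml
# Hypothetical overlay kustomization.yaml: adjust the worker group's size
# without editing the base ray-cluster.yaml. Field paths assume the
# cluster.ray.io/v1 RayCluster schema; verify against the actual CR.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
patchesJson6902:
  - target:
      group: cluster.ray.io
      version: v1
      kind: RayCluster
      name: example-cluster   # assumed cluster name
    patch: |-
      - op: replace
        path: /spec/podTypes/1/minWorkers
        value: 3
```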
tests/basictests/ray.sh (outdated)
```bash
check_cluster
check_functionality

os::test::junit::declare_suite_end
```
Missing newline at end of file.
```yaml
spec:
  containers:
    - name: ray-odh-tests
      image: quay.io/michaelclifford/ray-odh-tests:latest
```
Ideally we'd have this image in the ODH org. Where's the Dockerfile for this image defined?
The Dockerfile is currently here: https://github.com/redhat-et/ray-on-the-edge-demo/tree/main/tests
Should these files be included in this repo under something like `ray/tests/`?
What is the process to get images into the ODH org?
```yaml
containers:
  - name: ray-operator
    imagePullPolicy: Always
    image: 'quay.io/thoth-station/ray-operator:v0.1.3'
```
Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?
ray/cluster/base/ray-cluster.yaml (outdated)
```yaml
containers:
  - name: ray-node
    imagePullPolicy: Always
    image: quay.io/thoth-station/ray-ml-worker:v0.2.1
```
Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?
Actually, we need metrics and monitoring defined (if available). I'd take a look at the ModelMesh component for a good example:
https://github.com/opendatahub-io/odh-manifests/tree/master/model-mesh/prometheus
https://github.com/opendatahub-io/odh-manifests/blob/master/tests/basictests/modelmesh.sh#L96
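For reference, a Prometheus `ServiceMonitor` for Ray modeled on the ModelMesh example might look roughly like the sketch below. The selector labels, port name, and scrape interval are assumptions that would need to match the actual Ray head service:

```yaml
# Hypothetical ServiceMonitor sketch modeled on model-mesh/prometheus.
# The selector labels and port name must match the real Ray head service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-monitor
  labels:
    k8s-app: ray-monitor
spec:
  endpoints:
    - port: metrics        # assumed port name on the head service
      interval: 30s
      scheme: http
  selector:
    matchLabels:
      app: ray-monitor     # assumed service label
```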
Force-pushed c53a6fc to 47c89e1
@MichaelClifford Adding a comment based on the offline conversation with @opendatahub-io/sig-mlops: we will merge this here but include it in the migration to the opendatahub-io-contrib (Tier 2) org, since @opendatahub-io/sig-mlops does not have Ray.io on the immediate roadmap for MLOps integration.
Force-pushed 47c89e1 to 5698b17
ray/cluster/prometheus/monitor.yaml (outdated)
```yaml
labels:
  k8s-app: ray-monitor
name: ray-monitor
namespace: open-data-hub
```
I don't think this namespace is valid or needed.
ray/cluster/prometheus/monitor.yaml (outdated)
```yaml
scheme: http
selector:
  matchLabels:
    app: ray-monitor
```
Missing newline :-)
remove newline
sorry -- it's the opposite, newline missing
```yaml
resources:
  - ray-cluster.yaml
  - ../prometheus
```
remove newline
sorry -- it's the opposite, newline missing
```yaml
reviewers:
  - HumairAK
  - michaelclifford
  -  rimolive
```
there's an additional space before rimolive
ray/README.md (outdated)
```markdown
1. [Ray operator](./operator/ray-operator-deployment.yaml): The operator will process RayCluster resources and schedule ray head and worker pods based on requirements.
2. [Ray CR](./operator/ray-custom-resources.yaml): RayCluster Custom Resource (CR) describes the desired state of ray cluster.
3. [Ray Cluster](./cluster/ray-cluster.yaml): Defines an instance of an example Ray Cluster
```
I don't think you can create relative links to .yamls, can you? It doesn't work when I click it in your branch; I think you'll have to link the resulting URL path.
I think you can. I just need to update the links, these are based on an older file structure.
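As the reply above suggests, GitHub does render repo-relative links to arbitrary files, including `.yaml`; the links just have to be correct relative to the README's location. A sketch of the corrected form (the paths here are illustrative and must match the final file layout):

```markdown
<!-- Relative to ray/README.md; paths are illustrative. -->
1. [Ray operator](./operator/base/ray-operator-deployment.yaml)
2. [Ray Cluster](./cluster/base/ray-cluster.yaml)
```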
ray/README.md (outdated)
```markdown
# Deploying Ray with Open Data Hub

_WIP Docs: _
```
is this supposed to be here? seems unnecessary
ray/README.md (outdated)
First use the `oc kustomize` command to generate a yaml containing all the requirements for the operator and the "raycluster" custom resource, then `oc apply` that yaml to deploy the operator to your cluster.
```bash
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml
```
`deploy/odh-ray-nbc/operator`: is this an outdated path? The current path to the operator seems to be `ray/operator/base`.
ray/README.md
Outdated
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml | ||
``` | ||
```bash | ||
$ oc apply -f operator_deployment.yaml |
As this is a namespace-scoped deployment, I'd specify `-n <your-namespace>`.
I'd reflect the `-n <your-namespace>` in the following commands as well.
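Applied to the README's commands, the namespace flag would look roughly like this; the namespace is a placeholder and the kustomize paths assume the corrected `ray/...` layout discussed above:

```bash
# Placeholders throughout: substitute your own namespace; the paths assume
# the corrected ray/operator/base and ray/cluster/base layout.
$ oc kustomize ray/operator/base > operator_deployment.yaml
$ oc apply -f operator_deployment.yaml -n <your-namespace>
$ oc kustomize ray/cluster/base > cluster_deployment.yaml
$ oc apply -f cluster_deployment.yaml -n <your-namespace>
```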
ray/README.md
Outdated
|
||
|
||
```bash | ||
$ oc kustomize deploy/odh-ray-nbc/cluster > cluster_deployment.yaml |
Same as before: the path `deploy/odh-ray-nbc/cluster` seems like a typo; it looks like the cluster is at `ray/cluster/base`.
The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make a user trusted is to add them as a member of the opendatahub-io org. You can then trigger verification by writing `/verify-owners` in a comment.
/retest /approve /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LaVLaS. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Description
This PR is a follow-up to #573. It includes the manifests needed to deploy Ray on OpenShift but removes any dependencies on JupyterHub from the previous PR.
Merge criteria: it can be tested by following the instructions in the README included in the PR.