This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

feat: Ray Cluster and Operator Deployment #638

Merged · 2 commits merged into opendatahub-io:master on Oct 14, 2022

Conversation

MichaelClifford (Contributor)

Description

This PR is a follow-up to #573. It includes the manifests needed to deploy Ray on OpenShift, but removes any dependencies on JupyterHub from the previous PR.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

It can be tested by following the instructions in the README included in the PR.


openshift-ci bot commented Aug 10, 2022

Hi @MichaelClifford. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

LaVLaS (Contributor) commented Aug 11, 2022

@MichaelClifford This is going to require supporting CI tests that will run within our test framework. See the test README.

anishasthana (Member) left a comment:

Under what SIG/WG are we proposing this component? Or is it going to be something like a tier 2 component for the time being?

MichaelClifford (Contributor, Author)

@anishasthana based on the conversation at the ML Ops SIG meeting earlier this week, this will exist initially as a tier 2 component.

MichaelClifford (Contributor, Author)

@LaVLaS I've updated the PR to add a couple of tests. Let me know if it needs anything else 😄

MichaelClifford (Contributor, Author)

/retest

LaVLaS (Contributor) commented Aug 30, 2022

> @LaVLaS I've updated the PR to add a couple of tests. Let me know if it needs anything else 😄

@MichaelClifford We are in the process of re-organizing components into tiers. I know there have been discussions about incubating Ray.io as a Tier 1 component, which means it will need a full test suite to verify:

  1. It deploys successfully.
  2. Its functionality: tests to ensure that a successful deployment actually works.

You already have some tests to verify that it was deployed successfully; you'll need to add some tests to verify functionality so that any reviewer can determine whether future changes actually work.
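
For illustration, the first kind of check usually boils down to something like the following (a rough sketch using plain `oc` commands; the deployment name, pod label, and namespace are placeholders, and the repo's real tests go through its own bash test framework, see the test README):

```bash
# Sketch of a deployment check; names and labels below are illustrative placeholders.
# 1. The Ray operator deployment rolled out.
oc rollout status deployment/ray-operator -n <your-namespace> --timeout=300s

# 2. The example RayCluster resource exists and its pods became Ready.
oc get raycluster -n <your-namespace>
oc wait --for=condition=Ready pod -l <ray-cluster-pod-label> -n <your-namespace> --timeout=300s
```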

MichaelClifford (Contributor, Author)

@LaVLaS I've added an additional test that checks for functionality. It connects to the example ray cluster, runs a small job, then confirms the job ran on both the head and worker nodes.

Thanks for your guidance on this 😃 Let me know if we need more tests here.
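
For reference, a check of that kind might look roughly like the following Python sketch (this is not the actual test code in the PR; the Ray client address, service name, and task count are assumptions):

```python
# Rough sketch of a functionality check: connect to the example cluster, run a
# small job, and confirm work landed on more than one node (head + worker).
# The head service name and client port are assumptions, not taken from this PR.
import ray

ray.init(address="ray://example-cluster-ray-head:10001")

@ray.remote
def node_ip() -> str:
    # Each task reports the IP of the node it executed on.
    return ray.util.get_node_ip_address()

# Fan out enough small tasks that both the head and the worker should pick some up.
ips = set(ray.get([node_ip.remote() for _ in range(20)]))
assert len(ips) >= 2, f"expected tasks on both head and worker nodes, saw only {ips}"
print(f"job ran on {len(ips)} nodes: {sorted(ips)}")
ray.shutdown()
```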

anishasthana (Member) left a comment:

Couple of nits + things to follow up on eventually. I'll try testing this PR by the end of the week

kind: Kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
resources:
- ray-cluster.yaml
Member:

Missing new line

# cause problems for other pods.
memory: 2G
- name: worker-nodes
# we can parameterize this when we fix the JH launcher json/jinja bug
Member:

Given ODH's focus on NBC, is this still valid?

check_cluster
check_functionality

os::test::junit::declare_suite_end
Member:

Missing new line

spec:
containers:
- name: ray-odh-tests
image: quay.io/michaelclifford/ray-odh-tests:latest
Member:

Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?

MichaelClifford (Contributor, Author):

Dockerfile is currently here: https://github.com/redhat-et/ray-on-the-edge-demo/tree/main/tests

Should these files be included in this repo under something like ray/tests/?

What is the process to get images into the ODH org?

containers:
- name: ray-operator
imagePullPolicy: Always
image: 'quay.io/thoth-station/ray-operator:v0.1.3'
Member:

Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?

containers:
- name: ray-node
imagePullPolicy: Always
image: quay.io/thoth-station/ray-ml-worker:v0.2.1
Member:

Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?

containers:
- name: ray-node
imagePullPolicy: Always
image: quay.io/thoth-station/ray-ml-worker:v0.2.1
Member:

Ideally we'd have this image in the ODH org. Where's the dockerfile for this image defined?

anishasthana (Member) left a comment:

Actually.... we need metrics and monitoring defined (if available).
I'd take a look at the ModelMesh component for a good example.

https://github.com/opendatahub-io/odh-manifests/tree/master/model-mesh/prometheus
https://github.com/opendatahub-io/odh-manifests/blob/master/tests/basictests/modelmesh.sh#L96
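
For context, the ModelMesh example linked above centers on a Prometheus ServiceMonitor; a rough sketch adapted to Ray might look like this (the name, labels, and metrics port here are illustrative placeholders, not taken from this PR, and would need to match the Ray head service):

```yaml
# Illustrative ServiceMonitor for scraping Ray metrics; labels, port name, and
# selector are placeholders chosen for the sketch.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-monitor
  labels:
    k8s-app: ray-monitor
spec:
  endpoints:
    - port: metrics
      scheme: http
      interval: 30s
  selector:
    matchLabels:
      app: ray-monitor
```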

LaVLaS (Contributor) commented Sep 22, 2022

@MichaelClifford Adding this comment based on the offline conversation with @opendatahub-io/sig-mlops: we will merge this here, but include it in the migration to the opendatahub-io-contrib (Tier 2) org, since @opendatahub-io/sig-mlops does not have ray.io on the immediate roadmap for MLOps integration.

labels:
k8s-app: ray-monitor
name: ray-monitor
namespace: open-data-hub
Member:

I don't think this namespace is valid or needed.

scheme: http
selector:
matchLabels:
app: ray-monitor
Member:

Missing newline :-)

scheme: http
selector:
matchLabels:
app: ray-monitor
Contributor:

remove newline

Contributor:

sorry -- it's the opposite, newline missing

resources:
- ray-cluster.yaml
- ../prometheus

Contributor:

remove newline

Contributor:

sorry -- it's the opposite, newline missing

reviewers:
- HumairAK
- michaelclifford
- rimolive
Contributor:

there's an additional space before rimolive

ray/README.md Outdated
Comment on lines 9 to 11
1. [Ray operator](./operator/ray-operator-deployment.yaml): The operator will process RayCluster resources and schedule ray head and worker pods based on requirements.
2. [Ray CR](./operator/ray-custom-resources.yaml): RayCluster Custom Resource (CR) describes the desired state of ray cluster.
3. [Ray Cluster](./cluster/ray-cluster.yaml): Defines an instance of an example Ray Cluster
HumairAK (Contributor), Oct 5, 2022:

I don't think you can create relative links to .yamls, can you? It doesn't work when I click it in your branch; I think you'll have to link the resulting URL path.

MichaelClifford (Contributor, Author):

I think you can. I just need to update the links, these are based on an older file structure.

ray/README.md Outdated
@@ -0,0 +1,70 @@
# Deploying Ray with Open Data Hub

_WIP Docs: _
Contributor:

is this supposed to be here? seems unnecessary

ray/README.md Outdated
First use the `oc kustomize` command to generate a yaml containing all the requirements for the operator and the "raycluster" custom resource, then `oc apply` that yaml to deploy the operator to your cluster.

```bash
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml
Contributor:

deploy/odh-ray-nbc/operator
Is this an outdated path? The current path to the operator seems to be ray/operator/base.

ray/README.md Outdated
$ oc kustomize deploy/odh-ray-nbc/operator > operator_deployment.yaml
```
```bash
$ oc apply -f operator_deployment.yaml
Contributor:

As this is a namespace-scoped deployment, I'd specify -n <your-namespace>.

Contributor:

I'd reflect the -n <your-namespace> in the following commands as well

ray/README.md Outdated


```bash
$ oc kustomize deploy/odh-ray-nbc/cluster > cluster_deployment.yaml
Contributor:

Same as before: the path deploy/odh-ray-nbc/cluster seems like a typo; it looks like the cluster is at ray/cluster/base.
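
Taken together, the suggested fixes would make the README commands look roughly like this (a sketch assuming the final paths are ray/operator/base and ray/cluster/base and that everything is applied into a user-chosen namespace; the merged README may differ):

```bash
# Deploy the operator, then the example cluster, into an explicit namespace.
# Paths and the namespace placeholder follow the review suggestions above.
oc kustomize ray/operator/base > operator_deployment.yaml
oc apply -f operator_deployment.yaml -n <your-namespace>

oc kustomize ray/cluster/base > cluster_deployment.yaml
oc apply -f cluster_deployment.yaml -n <your-namespace>
```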


openshift-ci bot commented Oct 13, 2022

The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make the user trusted is to add them as members of the opendatahub-io org. You can then trigger verification by writing /verify-owners in a comment.

  • HumairAK
    • User is not a member of the org. User is not a collaborator. Satisfy at least one of these conditions to make the user trusted.
    • ray/OWNERS
  • michaelclifford
    • User is not a member of the org. User is not a collaborator. Satisfy at least one of these conditions to make the user trusted.
    • ray/OWNERS

LaVLaS (Contributor) commented Oct 14, 2022

/retest

/approve

/lgtm


openshift-ci bot commented Oct 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LaVLaS

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot merged commit 438b933 into opendatahub-io:master on Oct 14, 2022.