Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ray creation #751

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Bobbins228
Copy link
Contributor

@Bobbins228 Bobbins228 commented Oct 31, 2024

Issue link

Closes: RHOAIENG-10385 and RHOAIENG-8846

What changes have been made

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Checkout this PR's branch
  • Run poetry build - install if needed (pip install poetry)
  • Run pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
  • Restart your notebook kernel

Testing

All ClusterConfiguration parameters must be tested with the new cluster creation method.

Keep a special eye out for the following as they were the most complex to implement:

  • worker_extended_resource_requests
  • head_extended_resource_requests
  • extended_resource_mapping
  • The notebook annotations.

Recommendation

Have 2 separate virtual envs 1 with main SDK and 1 with this PR's SDK and compare created Ray Clusters.
Small things like blank image pull secrets and some example metadata labels are removed from every generated ray cluster so they wont be an exact match. The important thing to look out for is if the configurations match 👍

Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.

Test the new and improved get_cluster() function.

NOTE: You can compare the original & retrieved clusters by setting write_to_file=True on ClusterConfiguration and get_cluster()

NOTE 2: get_cluster() will also retrieve the mtls/oauth containers as well. This has no impact on the ability to create the cluster after deleting it through get_cluster() -> cluster.down() -> cluster.up()

  • Create a Ray Cluster
  • Restart your notebook kernel
  • Run cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, write_to_file=True)
  • Run through the basic cluster. methods
  • Run cluster.down() then cluster.up()
  • Ensure the cluster is created successfully
  • Repeat steps for AppWrapper (Remember to enable AppWrappers in the CFO configmap!)

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@Bobbins228 Bobbins228 added test-guided-notebooks Run PR check to verify Guided notebooks test-ui-notebooks Run PR check to verify UI notebooks labels Oct 31, 2024
@Bobbins228 Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2024
Copy link

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 95.39474% with 14 lines in your changes missing coverage. Please review.

Project coverage is 92.91%. Comparing base (63ee4ae) to head (5f5c2ab).

Files with missing lines Patch % Lines
src/codeflare_sdk/ray/cluster/cluster.py 80.30% 13 Missing ⚠️
src/codeflare_sdk/ray/cluster/build_ray_cluster.py 99.40% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
- Coverage   94.12%   92.91%   -1.21%     
==========================================
  Files          36       36              
  Lines        2417     2400      -17     
==========================================
- Hits         2275     2230      -45     
- Misses        142      170      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Bobbins228 Bobbins228 force-pushed the refactor-ray-creation branch 5 times, most recently from 4564e31 to 202d75e Compare November 4, 2024 13:25
@Bobbins228 Bobbins228 removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2024
Copy link
Collaborator

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR, awesome work Mark! :)

I left some minor nitpicks. I'll now give this a run in a cluster.

src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/build_ray_cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/test_status.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed when comparing the YAMLs between this PR and main branch, on this PR, the RayCluster yaml generated contains explicitly:
imagePullPolicy: Always

In main, this is unset. By default, if the image tag is :latest the imagePullPolicy is set to Always, otherwise, the default is ifNotPresent. ifNotPresent may be preferred here to only pull the image if it's not cached or doesn't already exist on the node.

Copy link
Collaborator

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • YAMLs look as expected when comparing with main and this PR.
    I tested the following scenarios:
    • LocalQueue set, no image (defaults to py3.9 image on OpenShift).
    • LocalQueue not set, no image (defaults to py3.9 image on OpenShift).
    • Python3.11 environment: the image changes to Ray image for py3.11.
    • Testing most parameters: Set envs, AppWrapper true, no LQ, set custom image, and set gpus.
    • Testing most parameters: Set envs, AppWrapper true, set LQ, set custom image, and set gpus.
  • AppWrappers and RayClusters work as expected.
  • get_cluster() works well!

/lgtm thanks! Great work!

@ChristianZaccaria ChristianZaccaria added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2024
@Bobbins228 Bobbins228 force-pushed the refactor-ray-creation branch 2 times, most recently from f1ed63c to 9595458 Compare November 6, 2024 16:10
@ChristianZaccaria ChristianZaccaria added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2024
@Bobbins228
Copy link
Contributor Author

/retest

Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2024
Copy link
Contributor

openshift-ci bot commented Nov 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. test-guided-notebooks Run PR check to verify Guided notebooks test-ui-notebooks Run PR check to verify UI notebooks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants