[Core feature] Add support KubeRay 1.0 #4244

Open
2 tasks done
pingsutw opened this issue Oct 16, 2023 · 3 comments
Labels
enhancement New feature or request flytepropeller

Comments

@pingsutw
Member

Motivation: Why do you think this is important?

KubeRay v0.3 through v0.6 are not stable and have many bugs and performance issues. People have run into many problems when using the KubeRay operator.

KubeRay issues:

  • 0.3.0: KubeRay sometimes creates two RayJobs.
  • 0.4.0: KubeRay fails to run a RayJob when multiple RayJobs (2+) run at the same time.
  • 0.5.0: KubeRay adds an init container, but this init container has no default request and limit, so it can't run in the project-domain namespace.
  • 0.5.1: Same issue as 0.5.0.
  • 0.5.2: RayJobs run successfully, but the problem from 0.3.0 remains.
  • 0.6.0: KubeRay added a k8s Job to the RayJob, used to submit the Ray remote task, but it also has no default request and limit.
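The 0.5.x and 0.6.0 items both come down to an injected container that has no resource requests or limits. One plausible reading is that the project-domain namespace enforces a ResourceQuota, in which case Kubernetes rejects pods whose containers omit requests or limits for the quota'd resources. The sketch below shows one way to spot such containers in a pod spec expressed as a plain dictionary. The function name, the container names, and the images are hypothetical and are not taken from KubeRay or Flyte.

```python
from typing import Any


def containers_missing_resources(pod_spec: dict[str, Any]) -> list[str]:
    """Return the names of containers that lack resource requests or limits.

    Checks both init containers and regular containers, in that order.
    """
    missing = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            resources = container.get("resources") or {}
            if not resources.get("requests") or not resources.get("limits"):
                missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical pod spec: the injected init container carries no resources,
# while the worker container declares both requests and limits.
example_pod_spec = {
    "initContainers": [
        {"name": "injected-init", "image": "example/ray:latest"},
    ],
    "containers": [
        {
            "name": "ray-worker",
            "image": "example/ray:latest",
            "resources": {
                "requests": {"cpu": "1", "memory": "2Gi"},
                "limits": {"cpu": "1", "memory": "2Gi"},
            },
        },
    ],
}

print(containers_missing_resources(example_pod_spec))
```

Running a check like this against the pod spec the operator generates would show whether an injected container is the one missing requests and limits.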

We should

  • Make sure Flyte can run a Ray task when using KubeRay 1.0
  • Upgrade the KubeRay client version in the Ray plugin

Goal: What should the final outcome look like, ideally?

Flyte can run Ray tasks without any problems when using KubeRay 1.0.

Describe alternatives you've considered

NA

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@pingsutw added the enhancement, untriaged, and flytepropeller labels and removed the untriaged label on Oct 16, 2023
@mmphsb

mmphsb commented Oct 26, 2023

I am running Flyte 1.9.1 on an on-premise k8s cluster (microk8s), deployed with Helm in flyte-binary mode. I have had problems with the Ray integration when using kuberay-operator versions 0.6.0 and 0.5.2. I was not able to debug the problem: the RayCluster and RayJob were simply not created at all. The Flyte task was executed inside a pod with a container deployed from the default image. There were no logs in the Flyte pods, nothing logged by kuberay-operator, no k8s events, etc.

While looking for ways to debug it (also on Flyte Slack), I noticed that kuberay version v1.0.0-rc.1 is available. I upgraded, and the integration now just works. I have no idea what was wrong, but for what it's worth, I can report no issues with the current version of the operator. Please note that my use case is not very complicated: I am running a PoC with some basic ML model training. I can run specific tests if that helps; please let me know.

@mmphsb

mmphsb commented Feb 1, 2024

FYI, after updating to Flyte 1.10.6, I could no longer run Ray jobs using KubeRay 1.0.0-rc.0 due to:

Workflow[redacted] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [ray]: unknown job deployment status: WaitForK8sJob

Checked with flytekit and flytekitplugins-ray versions 1.9.1 and 1.10.3.

It looks like it was caused by #4389 and that it is getting fixed with #4656.
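The error above suggests the plugin met a job deployment status it had no case for, and that this was treated as a hard failure until retries ran out. As a rough illustration only (this is not the Flyte plugin's code, which is not shown in this thread), here is a minimal sketch of a status lookup that can either fail or degrade gracefully on unrecognized values. Only `WaitForK8sJob` comes from the error message; the other status names, the `Phase` enum, and `phase_for_status` are assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical task phases, used for illustration only."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    UNKNOWN = "unknown"


# Only "WaitForK8sJob" appears in the error above; "Running" and "Complete"
# are assumed status names added to make the example complete.
STATUS_TO_PHASE = {
    "WaitForK8sJob": Phase.QUEUED,
    "Running": Phase.RUNNING,
    "Complete": Phase.SUCCEEDED,
}


def phase_for_status(status: str, strict: bool = False) -> Phase:
    """Map a job deployment status string to a phase.

    With strict=True an unrecognized status raises ValueError, which mirrors
    the failure in the error message. With strict=False it maps to
    Phase.UNKNOWN so the caller can keep polling instead of failing.
    """
    try:
        return STATUS_TO_PHASE[status]
    except KeyError:
        if strict:
            raise ValueError(f"unknown job deployment status: {status}") from None
        return Phase.UNKNOWN


print(phase_for_status("WaitForK8sJob"))
print(phase_for_status("SomeFutureStatus"))
```

The trade-off is that a tolerant default keeps workflows alive across operator upgrades that add new statuses, while a strict default surfaces the version mismatch immediately.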

@peterghaddad
Contributor

I think this is now complete!
