[Bug] [RayJob] First job pod sometimes fails to connect to Ray cluster #1381
Hi @architkulkarni, @kevin85421, is anyone working on this? If not, I would like to work on it.

Investigation

Given the message from the attached logs.zip, it seems that the cause was that the dashboard server had not started yet.

Solutions

I believe this issue can be addressed by sending an HTTP request to the dashboard server at line 209 of the section below (kuberay/ray-operator/controllers/ray/rayjob_controller.go, lines 203 to 215 at 36f32ed).
If that request fails, we then requeue the reconciliation request, either by taking the same action as the existing error path:

```go
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusInitializing, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
```

or by setting the job deployment status to a new waiting state:

```go
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusWaitForDashboard, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
```

Would these be preferable solutions? By doing so, we can make sure that the dashboard server is ready before creating the k8s Job instance.
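To make the proposal concrete, here is a minimal sketch of such a readiness check. The dashboardReady helper name and the /api/version path are illustrative assumptions, not the actual KubeRay implementation:

```go
package sketch

import (
	"context"
	"net/http"
)

// dashboardReady sketches the proposed probe: send an HTTP GET to the
// dashboard and only proceed with creating the submitter Job when it responds.
// The "/api/version" path is an assumption for illustration.
func dashboardReady(ctx context.Context, dashboardURL string) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+dashboardURL+"/api/version", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```

If the check returns false, the reconciler would requeue with RayJobDefaultRequeueDuration, as in the two snippets above.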
That sounds reasonable to me, and it would be great if you could submit a PR. @kevin85421 what do you think of the approach? Originally we wanted to get rid of the dashboard HTTP client for job submission, but here the dashboard HTTP client is just being used to check that the dashboard is ready, which seems fine.
It is OK as a workaround because we want to implement this fix before the release of KubeRay 1.0.0, but I am not a fan of this solution. In Kubernetes' convention, "ready" signifies that the resource is ready to serve traffic. Unfortunately, the "ready" state in RayCluster doesn't accurately reflect this, necessitating the use of an HTTP client to check the head Pod's status. For a long-term solution, I am contemplating revising the definition of "ready" in RayCluster by updating certain functions and probes.

In addition, we may consider creating the Kubernetes Job only after the RayCluster is ready. On the other hand, if we can create the Job earlier, it can start some processes earlier, such as pulling images; hence, we may also need to consider adding a waiting mechanism to the Job. I will sync with @rueian offline to discuss possible solutions.
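As a rough illustration of that long-term direction (making the head Pod's "Ready" condition reflect dashboard availability), here is a sketch of an HTTP readinessProbe on the head container. The path and timing values are assumptions for illustration, not KubeRay's current configuration; 8265 is only Ray's default dashboard port:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headDashboardReadinessProbe sketches a readinessProbe that targets the
// dashboard port, so that "Ready" on the head Pod would imply the dashboard
// is actually serving traffic.
func headDashboardReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/",                  // illustrative; a dedicated health endpoint would be preferable
				Port: intstr.FromInt(8265), // Ray's default dashboard port
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       5,
		FailureThreshold:    3,
	}
}
```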
Agree. I believe it would be nice if the "ready" condition could be customized for different situations. For example, RayServe may only require a ready head node, while a RayJob may prefer to wait until all workers are ready. Another possible solution would be simply adding a retry mechanism to the […]. Besides, by adding retries to the […]
Great points, thanks for the discussion. By the way, any ideas why the job submitter pod gets retried here? The pod has […]
We can have two levels of retry: […]
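For reference on the Job-level retry discussed above: a Kubernetes Job recreates failed pods up to its backoffLimit even when the pod's restartPolicy is Never. Below is a minimal sketch of a submitter-style Job; all names, the image, and the values are illustrative assumptions, not KubeRay's actual submitter spec:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildSubmitterJob sketches the first level of retry: the Job controller
// creates a new submitter pod on failure, up to BackoffLimit attempts.
func buildSubmitterJob() *batchv1.Job {
	backoffLimit := int32(2) // illustrative value
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "rayjob-sample-submitter"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // the pod itself is never restarted in place
					Containers: []corev1.Container{{
						Name:    "ray-job-submitter",
						Image:   "rayproject/ray:2.7.0", // illustrative image
						Command: []string{"ray", "job", "submit", "--address", "http://rayjob-sample-head-svc:8265", "--", "echo", "hello"},
					}},
				},
			},
		},
	}
}
```

The second level of retry would then live in the RayJob controller itself, for example requeuing the reconciliation until the dashboard responds, as proposed earlier in the thread.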
Referenced in:
* [Bug][RayJob] Check dashboard readiness before creating job pod (ray-project#1381) (ray-project#1429)
* [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (ray-project#1381)
Close with #1733
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Sometimes after submitting a RayJob, we see that the first job submission pod has failed. In the logs for the errored pod, we see that it failed to connect to the Ray cluster (see logs.zip below).
The pod gets retried, so the RayJob itself eventually succeeds, but this is still unexpected because the RayJob controller is supposed to wait for the cluster to be ready before submitting the job.
Reproduction script
kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
Anything else
logs.zip
It only happens sometimes. (5-20% of the time?)
Are you willing to submit a PR?