[Bug] [RayJob] First job pod sometimes fails to connect to Ray cluster #1381
Hi @architkulkarni, @kevin85421, is anyone working on this? If not, I would like to work on it.

Investigation

Given the message from the attached logs.zip, it seems that the cause was that the dashboard server had not started yet.

Solutions

I believe this issue can be addressed by sending an HTTP request to the dashboard server at line 209 of the section below (kuberay/ray-operator/controllers/ray/rayjob_controller.go, lines 203 to 215 at 36f32ed).
If that request fails, we then requeue the reconciliation request, either by taking the same action as the existing error path:

```go
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusInitializing, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
```

or by setting the job deployment status to a new waiting state:

```go
err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusWaitForDashboard, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
```

Would these be preferable solutions? By doing so, we can make sure that the dashboard server is ready before creating the k8s Job instance.
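To make the proposal concrete, here is a minimal sketch of such a readiness check. The dashboardReady helper name and the /api/version path are illustrative assumptions, not the actual KubeRay implementation:

```go
package sketch

import (
	"context"
	"net/http"
)

// dashboardReady sketches the proposed probe: send an HTTP GET to the
// dashboard and only proceed with creating the submitter Job when it responds.
// The "/api/version" path is an assumption for illustration.
func dashboardReady(ctx context.Context, dashboardURL string) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+dashboardURL+"/api/version", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```

If the check returns false, the reconciler would requeue with RayJobDefaultRequeueDuration, as in the two snippets above.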
That sounds reasonable to me, and it would be great if you could submit a PR. @kevin85421 what do you think of the approach? Originally we wanted to get rid of the dashboard HTTP client for job submission, but here the dashboard HTTP client is just being used to check that the dashboard is ready, which seems fine.
It is OK as a workaround because we want to implement this fix before the release of KubeRay 1.0.0, but I am not a fan of this solution. In Kubernetes' convention, "ready" signifies that the resource is ready to serve traffic. Unfortunately, the "ready" state in RayCluster doesn't accurately reflect this, necessitating the use of an HTTP client to check the head Pod's status. For a long-term solution, I am contemplating revising the definition of "ready" in RayCluster by updating certain functions and probes.

In addition, we may consider creating the Kubernetes Job only after the RayCluster is ready. On the other hand, if we can create the Job earlier, it can start some processes earlier, such as pulling images; hence, we may also need to consider adding a waiting mechanism to the Job. I will sync with @rueian offline to discuss possible solutions.
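As a rough illustration of that long-term direction (making the head Pod's "Ready" condition reflect dashboard availability), here is a sketch of an HTTP readinessProbe on the head container. The path and timing values are assumptions for illustration, not KubeRay's current configuration; 8265 is only Ray's default dashboard port:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headDashboardReadinessProbe sketches a readinessProbe that targets the
// dashboard port, so that "Ready" on the head Pod would imply the dashboard
// is actually serving traffic.
func headDashboardReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/",                  // illustrative; a dedicated health endpoint would be preferable
				Port: intstr.FromInt(8265), // Ray's default dashboard port
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       5,
		FailureThreshold:    3,
	}
}
```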
Agree. I believe it would be nice if the "ready" condition could be customized for different situations. For example, RayServe may only require a ready head node, while a RayJob may prefer to wait until all workers are ready. Another possible solution would be simply adding a retry mechanism to the […]. Besides, by adding retries to the […]
Great points, thanks for the discussion. By the way, any ideas why the job submitter pod gets retried here? The pod has […]
We can have two levels of retry: […]
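For reference on the Job-level retry discussed above: a Kubernetes Job recreates failed pods up to its backoffLimit even when the pod's restartPolicy is Never. Below is a minimal sketch of a submitter-style Job; all names, the image, and the values are illustrative assumptions, not KubeRay's actual submitter spec:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildSubmitterJob sketches the first level of retry: the Job controller
// creates a new submitter pod on failure, up to BackoffLimit attempts.
func buildSubmitterJob() *batchv1.Job {
	backoffLimit := int32(2) // illustrative value
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "rayjob-sample-submitter"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // the pod itself is never restarted in place
					Containers: []corev1.Container{{
						Name:    "ray-job-submitter",
						Image:   "rayproject/ray:2.7.0", // illustrative image
						Command: []string{"ray", "job", "submit", "--address", "http://rayjob-sample-head-svc:8265", "--", "echo", "hello"},
					}},
				},
			},
		},
	}
}
```

The second level of retry would then live in the RayJob controller itself, for example requeuing the reconciliation until the dashboard responds, as proposed earlier in the thread.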
Referenced in:
* [Bug][RayJob] Check dashboard readiness before creating job pod (ray-project#1381) (ray-project#1429)
* [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (ray-project#1381)
Close with #1733
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Sometimes after submitting a RayJob, we see that the first job submission pod has failed. In the logs for the errored pod, we see that it failed to connect to the Ray cluster (see logs.zip below).
The pod gets retried, so the RayJob itself eventually succeeds, but this is still unexpected because the RayJob controller is supposed to wait for the cluster to be ready before submitting the job.
Reproduction script
kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
Anything else
logs.zip
It only happens sometimes. (5-20% of the time?)
Are you willing to submit a PR?