
[Bug] Ray cluster terminates more worker pods than the amount of replica scale down requested #1936

Open
1 of 2 tasks
vicentefb opened this issue Feb 21, 2024 · 4 comments
Labels
bug (Something isn't working), core, core-kuberay, P1 (Issue that should be fixed within a few weeks), stability (Pertains to basic infrastructure stability)

Comments


vicentefb commented Feb 21, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator, Others

What happened + What you expected to happen

I created a kind cluster, installed the KubeRay operator, and then created a RayCluster. enableInTreeAutoscaling is not enabled. The cluster comes up successfully, but if I then manually scale down the number of worker replicas, the operator terminates more workers than requested and afterwards creates and initializes replacements for the workers that should not have been terminated (possibly indicating some kind of race condition). This only happens on scale-down. For example, if I reduce the number of worker replicas from 6 to 5, I would expect to see only one pod terminated.

Reproduction script

Created a kind cluster using kindest/node:v1.27.3

$ kind create cluster
$ kind --version
kind version 0.20.0

Installed the KubeRay operator v1.1.0-alpha.0

/kuberay/ray-operator$ IMG=kuberay/operator:v1.1.0-alpha.0 make docker-image
/kuberay/ray-operator$ kind load docker-image kuberay/operator:v1.1.0-alpha.0

/kuberay/ray-operator$ helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=v1.1.0-alpha.0 ../helm-chart/kuberay-operator

NAME: kuberay-operator
LAST DEPLOYED: Wed Feb 21 19:03:02 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Create a RayCluster resource with 7 worker replicas

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-raycluster
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.9.2
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 10
    minReplicas: 1
    rayStartParams: {}
    replicas: 7
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.9.2
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
$ kubectl create -f rayCluster.yaml

NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running   0          99m
default              test-raycluster-head-qnb4j                   1/1     Running   0          53s
default              test-raycluster-worker-workergroup-57z42     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-hbx8w     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-j7fd5     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-pw4cq     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-zg8l7     1/1     Running   0          41s

Reduce the number of worker replicas by 1 (from 7 to 6)

$ kubectl edit raycluster
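
An equivalent non-interactive edit would be a JSON patch against the replicas field (this command is illustrative and assumes the worker group is the first entry in workerGroupSpecs):

$ kubectl patch raycluster test-raycluster --type='json' \
    -p='[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 6}]'

Shortly after the edit, the pod list looks like this: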

NAMESPACE            NAME                                         READY   STATUS        RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running       0          100m
default              test-raycluster-head-qnb4j                   1/1     Running       0          99s
default              test-raycluster-worker-workergroup-57z42     1/1     Running       0          86s
default              test-raycluster-worker-workergroup-hbx8w     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-j7fd5     1/1     Terminating   0          86s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-pw4cq     1/1     Terminating   0          86s
default              test-raycluster-worker-workergroup-zg8l7     1/1     Terminating   0          87s

It terminates more worker pods than needed, and after some time it recreates the pods that should not have been terminated:

NAMESPACE            NAME                                         READY   STATUS     RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running    0          101m
default              test-raycluster-head-qnb4j                   1/1     Running    0          2m39s
default              test-raycluster-worker-workergroup-2xkvw     1/1     Running    0          35s
default              test-raycluster-worker-workergroup-7sms6     0/1     Init:0/1   0          4s
default              test-raycluster-worker-workergroup-8blvt     0/1     Init:0/1   0          4s
default              test-raycluster-worker-workergroup-br2p9     1/1     Running    0          34s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running    0          2m27s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running    0          2m27s

Anything else

This occurs every time the number of worker replicas is scaled down. There is no clear pattern in how many pods are terminated when the replicas are modified.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@vicentefb vicentefb added the bug and triage labels on Feb 21, 2024
@andrewsykim (Collaborator)

/cc

@andrewsykim (Collaborator)

@kevin85421 have you seen this before?

@kevin85421 (Member)

This may be related to #715. KubeRay sends a request to the K8s API server to delete a Pod. In the next reconciliation, the informer cache still hasn't received the notification that the Pod has already been deleted. Hence, KubeRay attempts to delete the Pod again.
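
For illustration, here is a minimal, self-contained sketch of how a stale cached pod list can inflate the scale-down diff. The pod names, replica counts, and structure are hypothetical and not KubeRay's actual reconciler code; it only uses the k8s.io/api types.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	now := metav1.Now()

	// Reconcile N already issued a Delete for worker-a: the API server marked it
	// Terminating, but the informer cache read by reconcile N+1 still lists it.
	cachedPods := []corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-a", DeletionTimestamp: &now}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-b"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-c"}},
	}
	desiredReplicas := 2

	// A naive diff counts the Terminating pod as still alive, so one more pod is
	// deleted on top of worker-a, leaving 1 worker instead of the desired 2.
	diff := len(cachedPods) - desiredReplicas
	fmt.Println("pods to delete this reconcile:", diff)
}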

@kevin85421 kevin85421 added the stability and P1 labels and removed the triage label on Mar 4, 2024
@harjas27

The diff (the number of pods to delete) is calculated from the list of worker pods returned by the K8s API server. That list can still contain a pod that was deleted in the previous reconciliation, which leads to the deletion of another, essentially random pod.
The example in the issue description shows pods in Terminating status, which means they were not yet fully deleted at the time of the next reconciliation.
Removing the pods that already have a DeletionTimestamp set from the list before calculating the diff can prevent this:
https://github.com/kubernetes/kubernetes/blob/v1.2.0/pkg/kubectl/resource_printer.go#L588C9-L590
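
A minimal sketch of that filtering idea, assuming the diff is computed from a plain pod list; the helper name and surrounding code are illustrative, not KubeRay's actual implementation.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// activePods keeps only pods without a DeletionTimestamp, i.e. pods that are
// not already in the process of being deleted.
func activePods(pods []corev1.Pod) []corev1.Pod {
	out := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if p.DeletionTimestamp == nil {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	now := metav1.Now()
	listed := []corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-a", DeletionTimestamp: &now}}, // already Terminating
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-b"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-c"}},
	}
	desired := 2

	// Computing the diff against only the active pods yields 0 here, so no extra
	// pod is deleted while worker-a finishes terminating.
	diff := len(activePods(listed)) - desired
	fmt.Println("pods to delete this reconcile:", diff)
}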
