
[Bug] Ray cluster terminates more worker pods than the amount of replica scale down requested #1936

Open
1 of 2 tasks
vicentefb opened this issue Feb 21, 2024 · 4 comments
Labels
bug (Something isn't working), core, core-kuberay, P1 (Issue that should be fixed within a few weeks), stability (Pertains to basic infrastructure stability)

Comments


vicentefb commented Feb 21, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator, Others

What happened + What you expected to happen

I created a kind cluster, installed the KubeRay operator, and then created a RayCluster. enableInTreeAutoscaling is not enabled. The cluster comes up successfully, but if I then manually scale down the number of worker replicas, the operator terminates more workers than requested and afterwards creates and initializes replacements for the workers that should not have been terminated (possibly indicating some kind of race condition). This only happens on scale-down. For example, if I reduce the number of worker replicas from 6 to 5, I would expect to see only one pod terminated.

Reproduction script

Created a kind cluster using kindest/node:v1.27.3

$ kind create cluster
$ kind --version
kind version 0.20.0

Installed the KubeRay operator v1.1.0-alpha.0

/kuberay/ray-operator$ IMG=kuberay/operator:v1.1.0-alpha.0 make docker-image
/kuberay/ray-operator$ kind load docker-image kuberay/operator:v1.1.0-alpha.0

/kuberay/ray-operator$ helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=v1.1.0-alpha.0 ../helm-chart/kuberay-operator

NAME: kuberay-operator
LAST DEPLOYED: Wed Feb 21 19:03:02 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Create a RayCluster resource with 7 worker replicas

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-raycluster
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.9.2
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 10
    minReplicas: 1
    rayStartParams: {}
    replicas: 7
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.9.2
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
$ kubectl create -f rayCluster.yaml

NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running   0          99m
default              test-raycluster-head-qnb4j                   1/1     Running   0          53s
default              test-raycluster-worker-workergroup-57z42     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-hbx8w     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-j7fd5     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running   0          41s
default              test-raycluster-worker-workergroup-pw4cq     1/1     Running   0          40s
default              test-raycluster-worker-workergroup-zg8l7     1/1     Running   0          41s

Reduce the number of worker replicas by 1 (from 7 to 6)

$ kubectl edit raycluster
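
An equivalent non-interactive edit would be a JSON patch against the replicas field (this command is illustrative and assumes the worker group is the first entry in workerGroupSpecs):

$ kubectl patch raycluster test-raycluster --type='json' \
    -p='[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 6}]'

Shortly after the edit, the pod list looks like this: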

NAMESPACE            NAME                                         READY   STATUS        RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running       0          100m
default              test-raycluster-head-qnb4j                   1/1     Running       0          99s
default              test-raycluster-worker-workergroup-57z42     1/1     Running       0          86s
default              test-raycluster-worker-workergroup-hbx8w     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-j7fd5     1/1     Terminating   0          86s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running       0          87s
default              test-raycluster-worker-workergroup-pw4cq     1/1     Terminating   0          86s
default              test-raycluster-worker-workergroup-zg8l7     1/1     Terminating   0          87s

It terminates more worker pods than needed, and after some time it recreates the pods that should not have been terminated:

NAMESPACE            NAME                                         READY   STATUS     RESTARTS   AGE
default              kuberay-operator-bcd86b7fb-q9sbh             1/1     Running    0          101m
default              test-raycluster-head-qnb4j                   1/1     Running    0          2m39s
default              test-raycluster-worker-workergroup-2xkvw     1/1     Running    0          35s
default              test-raycluster-worker-workergroup-7sms6     0/1     Init:0/1   0          4s
default              test-raycluster-worker-workergroup-8blvt     0/1     Init:0/1   0          4s
default              test-raycluster-worker-workergroup-br2p9     1/1     Running    0          34s
default              test-raycluster-worker-workergroup-kqc5z     1/1     Running    0          2m27s
default              test-raycluster-worker-workergroup-lhzhw     1/1     Running    0          2m27s

Anything else

This occurs every time the number of worker replicas is scaled down. There is no clear pattern in how many pods are terminated when the replicas are modified.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@vicentefb vicentefb added the bug and triage labels on Feb 21, 2024
@andrewsykim (Collaborator)

/cc

@andrewsykim (Collaborator)

@kevin85421 have you seen this before?

@kevin85421 (Member)

This may be related to #715. KubeRay sends a request to the K8s API server to delete a Pod. In the next reconciliation, the informer cache still hasn't received the notification that the Pod has already been deleted. Hence, KubeRay attempts to delete the Pod again.
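
For illustration, here is a minimal, self-contained sketch of how a stale cached pod list can inflate the scale-down diff. The pod names, replica counts, and structure are hypothetical and not KubeRay's actual reconciler code; it only uses the k8s.io/api types.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	now := metav1.Now()

	// Reconcile N already issued a Delete for worker-a: the API server marked it
	// Terminating, but the informer cache read by reconcile N+1 still lists it.
	cachedPods := []corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-a", DeletionTimestamp: &now}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-b"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-c"}},
	}
	desiredReplicas := 2

	// A naive diff counts the Terminating pod as still alive, so one more pod is
	// deleted on top of worker-a, leaving 1 worker instead of the desired 2.
	diff := len(cachedPods) - desiredReplicas
	fmt.Println("pods to delete this reconcile:", diff)
}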

@kevin85421 kevin85421 added the stability and P1 labels and removed the triage label on Mar 4, 2024
@harjas27

The diff (the number of pods to delete) is calculated from the list of worker pods returned by the K8s API server. That list can still contain a pod that was deleted in the previous reconciliation, which leads to the deletion of another, essentially random pod.
The example in the issue description shows pods in Terminating status, which means they were not yet fully deleted at the time of the next reconciliation.
Removing the pods that already have a DeletionTimestamp set from the list before calculating the diff can prevent this:
https://github.com/kubernetes/kubernetes/blob/v1.2.0/pkg/kubectl/resource_printer.go#L588C9-L590
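
A minimal sketch of that filtering idea, assuming the diff is computed from a plain pod list; the helper name and surrounding code are illustrative, not KubeRay's actual implementation.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// activePods keeps only pods without a DeletionTimestamp, i.e. pods that are
// not already in the process of being deleted.
func activePods(pods []corev1.Pod) []corev1.Pod {
	out := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if p.DeletionTimestamp == nil {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	now := metav1.Now()
	listed := []corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-a", DeletionTimestamp: &now}}, // already Terminating
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-b"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "worker-c"}},
	}
	desired := 2

	// Computing the diff against only the active pods yields 0 here, so no extra
	// pod is deleted while worker-a finishes terminating.
	diff := len(activePods(listed)) - desired
	fmt.Println("pods to delete this reconcile:", diff)
}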
