[Bug] Ray cluster terminates more worker pods than the amount of replica scale down requested #1936
Labels: bug, core, core-kuberay, P1, stability
Search before asking
KubeRay Component
ray-operator, Others
What happened + What you expected to happen
I created a kind cluster, installed the KubeRay operator, and then created a RayCluster. I am not enabling enableInTreeAutoscaling. The cluster comes up successfully, but if I then manually scale down the number of worker replicas, it terminates more workers than it should and afterwards recreates and reinitializes the workers that should not have been terminated (possibly indicating some kind of race condition). This only happens when I scale down. If I reduce the number of worker replicas, for example from 6 to 5, I expect to see only one pod being terminated.
Reproduction script
Created a kind cluster using kindest/node:v1.27.3
Installed the KubeRay operator v1.1.0-alpha
Created a RayCluster resource with 7 worker replicas running
Reduced the number of worker replicas by 1 (from 7 to 6), for example with the command sketched below
More worker pods than necessary are terminated, and after some time the pods that should not have been terminated are recreated
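For reference, a minimal sketch of the scale-down step, assuming a RayCluster named raycluster-sample with a single worker group (both the name and the worker-group index are placeholders for whatever your manifest uses):

```shell
# Hypothetical example: scale the first (and only) worker group from 7 to 6 replicas.
# "raycluster-sample" and the worker-group index 0 are assumptions; substitute your own.
kubectl patch raycluster raycluster-sample --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 6}]'
```

Editing replicas directly in the manifest and re-running kubectl apply reproduces the same behavior.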
Anything else
This occurs every time I scale down the number of worker replicas. There is no clear pattern in how many pods get terminated when the replicas are modified. The extra terminations and subsequent recreations can be seen with a simple pod watch, sketched below.
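One way to observe the behavior during the scale-down, assuming the standard ray.io/cluster label that the operator sets on worker pods and the same placeholder cluster name as above:

```shell
# Watch the cluster's pods while reducing replicas by one: with this bug, several
# workers enter Terminating at once and replacements are created shortly afterwards.
kubectl get pods -l ray.io/cluster=raycluster-sample --watch
```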
Are you willing to submit a PR?