[Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec #2468
Conversation
type UpgradeStrategy string

const (
	BlueGreenUpgrade UpgradeStrategy = "BlueGreenUpgrade"
)
Thinking of alternatives out loud:
- RollingUpdate / Recreate
- CreateThenDelete / DeleteThenCreate
- CreateFirst / DeleteFirst
Or perhaps we can follow upgrade strategy types similar to StatefulSet: RollingUpdate / OnDelete https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies
I can see "RollingUpdate" being misleading because we only ever update 1 new RayCluster at a time, so it's not really a rolling update
I think we can use "OnDelete" since it fits the "delete first" / "delete then create" behavior. For the zero-downtime case, maybe we can name it something along the lines of "PrepareUpdateThenSwap" to describe that the updated cluster is getting ready?
`OnDelete` implies the user has to manually delete the cluster first. This is a new behavior, which is maybe okay? If we are still automatically deleting the cluster, then it should just be `Delete`, `DeleteFirst`, `Replace`, something like that.
Ahh I see, so like upgrade OnDelete. If that's the case, my vote would be for "Replace" or "DeleteFirst".
Thinking about this more, I think having a separate strategy for OnDelete and DeleteFirst would be valuable. So in total we could support 3 upgrade strategies (sketched below):
- spec.upgradeStrategy=RollingUpdate -- same behavior as ENABLE_ZERO_DOWNTIME=true
- spec.upgradeStrategy=DeleteFirst -- same behavior as ENABLE_ZERO_DOWNTIME=false
- spec.upgradeStrategy=OnDelete -- new behavior, only upgrade if the user manually deletes the RayCluster
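Roughly, as API constants this could look like the sketch below. The names just mirror the bullets above and the type name `RayServiceUpgradeStrategy` is assumed; none of this is the PR's final API.

```go
// Candidate upgrade strategies, as discussed above (naming sketch only).
type RayServiceUpgradeStrategy string

const (
	// RollingUpdate: create the new RayCluster first, switch traffic over, then
	// delete the old one (same behavior as ENABLE_ZERO_DOWNTIME=true).
	RollingUpdate RayServiceUpgradeStrategy = "RollingUpdate"
	// DeleteFirst: delete the existing RayCluster before creating the new one
	// (same behavior as ENABLE_ZERO_DOWNTIME=false).
	DeleteFirst RayServiceUpgradeStrategy = "DeleteFirst"
	// OnDelete: only upgrade after the user manually deletes the RayCluster.
	OnDelete RayServiceUpgradeStrategy = "OnDelete"
)
```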
> OnDelete implies the user has to manually delete the cluster first. This is a new behavior which is maybe okay?

I think the behavior of `OnDelete` is fine, but it may require additional code changes, such as updating the CR status. This is beyond the scope of this PR. Maybe we can open an issue to track the progress, and we can revisit whether we need this behavior after the refactoring of RayService.
For the `ENABLE_ZERO_DOWNTIME=false` case, how about naming it `None`, `Disabled`, or `NoUpgrade`?
The current behavior of `ENABLE_ZERO_DOWNTIME=false` is to ignore all spec changes and do nothing.
if zeroDowntimeEnvVar != "" {
	enableZeroDowntime = strings.ToLower(zeroDowntimeEnvVar) != "false"
} else {
	enableZeroDowntime = rayServiceSpecUpgradeStrategy == rayv1.NewCluster
}
I think this may cause issues when upgrading the CRD. Can you test the case that:
- Install KubeRay v1.2.2
- Create a RayService
- Upgrade KubeRay to this PR (without upgrading CRD)
- Try to trigger zero-downtime upgrade.
I expect that the zero-downtime upgrade will not be triggered. The reason is that the KubeRay v1.2.2 CRD doesn't have the field `rayServiceInstance.Spec.RayServiceUpgradeStrategy`, so the zero value of `string` will be used, and `rayServiceSpecUpgradeStrategy == rayv1.NewCluster` will be false.
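To make the pitfall concrete, here is a tiny standalone sketch of the zero-value behavior; the type and constant names are stand-ins for the rayv1 ones.

```go
package main

import "fmt"

// Stand-ins for the rayv1 strategy type and constant under discussion.
type RayServiceUpgradeStrategy string

const NewCluster RayServiceUpgradeStrategy = "NewCluster"

func main() {
	// With a KubeRay v1.2.2 CRD the new field does not exist, so after decoding
	// the CR the Go struct holds the zero value "" rather than "NewCluster".
	var strategyFromOldCRD RayServiceUpgradeStrategy // ""

	// A strict equality check then silently disables the zero-downtime upgrade.
	enableZeroDowntime := strategyFromOldCRD == NewCluster
	fmt.Println(enableZeroDowntime) // prints "false"
}
```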
If my above statement is correct (I'm not 100% sure), can you:
- Add a check in `validateRayServiceSpec` to make sure the value of `rayServiceInstance.Spec.RayServiceUpgradeStrategy` is valid (for now, `NewCluster`, `None`, and the zero value of string are valid) -- a sketch of such a check follows below.
  `func validateRayServiceSpec(rayService *rayv1.RayService) error {`
- Handle the case where `RayServiceUpgradeStrategy` is an empty string.
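A rough sketch of what such a check might look like; the constants stand in for `rayv1.NewCluster` / `rayv1.None`, and the helper name and error wording are illustrative, not the PR's actual code.

```go
import "fmt"

// Hypothetical stand-ins for the rayv1.NewCluster / rayv1.None constants.
const (
	NewCluster = "NewCluster"
	None       = "None"
)

// Sketch of the extra check requested for validateRayServiceSpec: only the
// known strategies or the empty string (old CRD / field left unset) pass.
func validateUpgradeStrategy(strategy string) error {
	switch strategy {
	case NewCluster, None, "":
		return nil
	default:
		return fmt.Errorf("unsupported upgradeStrategy %q; expected %q, %q, or unset", strategy, NewCluster, None)
	}
}
```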
In addition, would you mind adding some comments to summarize the mechanism to control zero-downtime upgrade?
Your statement is correct; with the current logic, the zero-downtime upgrade was not triggered. I changed the logic within the RayService controller to default zero-downtime to true. Let me know if this is sufficient.
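For illustration, the defaulting being described might look roughly like the sketch below, keeping the env-var-first structure of the diff earlier in this thread. The function name and exact placement in the controller are assumptions, and the precedence between the env var and the CR field is still being discussed further down.

```go
import (
	"os"
	"strings"
)

// Sketch: ENABLE_ZERO_DOWNTIME wins when set, otherwise the CR field decides,
// and if both are unset zero-downtime upgrades default to enabled.
func zeroDowntimeEnabledEnvFirst(upgradeStrategy string) bool {
	if v := os.Getenv("ENABLE_ZERO_DOWNTIME"); v != "" {
		return strings.ToLower(v) != "false"
	}
	if upgradeStrategy != "" {
		return upgradeStrategy == "NewCluster"
	}
	return true // default when neither the env var nor the CR field is set
}
```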
@@ -57,6 +66,9 @@ type RayServiceSpec struct {
	DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
	// ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
	ServeService *corev1.Service `json:"serveService,omitempty"`
	// UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`
	// +kubebuilder:default:=NewCluster
	UpgradeStrategy RayServiceUpgradeStrategy `json:"upgradeStrategy,omitempty"`
What will happen if you specify a value for `UpgradeStrategy` other than `NewCluster` or `None`? If OpenAPI doesn't complain, we should add a check in `validateRayServiceSpec` to make sure the value is valid.
https://github.com/ray-project/kuberay/pull/2468/files#r1823902608
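One option for making OpenAPI reject other values is a kubebuilder enum marker on the field, along the lines of the sketch below; whether to rely on that, on the check in `validateRayServiceSpec`, or both is the open question here.

```go
// Sketch: restrict accepted values at the CRD/OpenAPI level, in addition to
// (or instead of) the runtime check in validateRayServiceSpec.
type RayServiceSpec struct {
	// ... other fields ...

	// +kubebuilder:validation:Enum=NewCluster;None
	UpgradeStrategy RayServiceUpgradeStrategy `json:"upgradeStrategy,omitempty"`
}
```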
Added validation to validateRayServiceSpec
LGTM @kevin85421 can you take another look?
@@ -815,6 +823,14 @@ func TestReconcileRayCluster(t *testing.T) {
		updateKubeRayVersion: true,
		kubeRayVersion:       "new-version",
	},
	// Test 7: Zero downtime upgrade is enabled, but is enabled through the RayServiceSpec
Can we add more tests? We have 4 combinations of `enableZeroDowntime` (true, false) and `rayServiceUpgradeStrategy` (NewCluster, None).
Added the following combinations:
- env var: false, spec: true
- env var: true, spec: false
- env var: false, spec: unset
- env var: false, spec: false
On second thought, I believe there are 6 cases based on `enableZeroDowntime` (true, false) and `rayServiceUpgradeStrategy` (NewCluster, None, "").
I believe all 6 of the combinations are covered. The "" case should be covered by the original tests, and the ("NewCluster", "None") x (true, false) cases are covered by the new tests that were added.
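For concreteness, the full matrix could be laid out as table-driven cases roughly like this. The field names and expected outcomes here are illustrative only, and the expectations assume the CR-level setting wins when it is explicitly set.

```go
// Illustrative matrix of the 6 combinations: ENABLE_ZERO_DOWNTIME (true/false)
// x spec upgradeStrategy (NewCluster/None/unset). Expected outcomes assume the
// CR-level field takes precedence when set.
testCases := []struct {
	name                      string
	enableZeroDowntime        bool   // value implied by the ENABLE_ZERO_DOWNTIME env var
	rayServiceUpgradeStrategy string // "NewCluster", "None", or "" (unset)
	expectNewClusterCreated   bool
}{
	{"env true, spec NewCluster", true, "NewCluster", true},
	{"env true, spec None", true, "None", false},
	{"env true, spec unset", true, "", true},
	{"env false, spec NewCluster", false, "NewCluster", true},
	{"env false, spec None", false, "None", false},
	{"env false, spec unset", false, "", false},
}
```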
@@ -57,6 +66,9 @@ type RayServiceSpec struct {
	DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
	// ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
	ServeService *corev1.Service `json:"serveService,omitempty"`
	// UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`
	// +kubebuilder:default:=NewCluster
Defaulting `upgradeStrategy` to `NewCluster` seems to be a breaking change. For example, if users set `ENABLE_ZERO_DOWNTIME` to `false` and don't specify `upgradeStrategy`, zero-downtime upgrades were disabled in KubeRay v1.2.2 but are enabled with this PR.
That is what I'm seeing as well. Should we not have a default value for `UpgradeStrategy` then?

cc @andrewsykim What do you think?
> Should we not have a default value for `UpgradeStrategy` then?

Having no default value sounds reasonable if there isn't a better solution.
Removed the defaulting.
There still needs to be some defaulting to define the default upgrade strategy, though? Why not have `ENABLE_ZERO_DOWNTIME` take precedence to preserve backwards compatibility?
`ENABLE_ZERO_DOWNTIME` is an operator-level configuration, and `UpgradeStrategy` is a CR-level configuration. My current reasoning for why the CR-level config should take precedence is as follows:
- The KubeRay operator is managed by an ML infra team.
- Custom resources are managed by application teams (e.g., recommendation systems, LLM, etc.).

The ML infra team will set a default configuration, i.e., `ENABLE_ZERO_DOWNTIME` in this case, for KubeRay that can work for most use cases. However, in certain scenarios, some application teams may have rare use cases that require customizing their CRs. In these cases, they might want to set a different upgrade strategy for their RayService.

If the operator-level configuration takes precedence over the CR-level configuration, application teams would need to request changes to the KubeRay operator's configuration from the infra team. This could make it harder for multiple teams to share a single KubeRay operator.

Does this make sense?
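Under that reasoning, the resolution order would flip relative to the earlier env-var-first sketch; roughly like this (a sketch only, not the PR's final code, with assumed names):

```go
import (
	"os"
	"strings"
)

// Sketch of the precedence argued for above: an explicitly set CR-level
// upgradeStrategy wins; the operator-level ENABLE_ZERO_DOWNTIME env var only
// supplies the cluster-wide default when the CR leaves the field unset.
func zeroDowntimeEnabledCRFirst(upgradeStrategy string) bool {
	if upgradeStrategy != "" {
		return upgradeStrategy == "NewCluster"
	}
	if v := os.Getenv("ENABLE_ZERO_DOWNTIME"); v != "" {
		return strings.ToLower(v) != "false"
	}
	return true // zero-downtime upgrades are enabled when nothing is configured
}
```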
@@ -824,6 +871,9 @@ func TestReconcileRayCluster(t *testing.T) {
	if !tc.enableZeroDowntime {
		os.Setenv(ENABLE_ZERO_DOWNTIME, "false")
	}
	if tc.rayServiceUpgradeStrategy != "" {
		rayService.Spec.UpgradeStrategy = tc.rayServiceUpgradeStrategy
Can we update the deepcopy (`service := rayService.DeepCopy()`) instead?
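For example, something along these lines (a sketch reusing the variable names from the diff above):

```go
// Mutate the per-test deep copy rather than the shared rayService fixture, so
// one case's UpgradeStrategy cannot leak into the next test case.
service := rayService.DeepCopy()
if tc.rayServiceUpgradeStrategy != "" {
	service.Spec.UpgradeStrategy = tc.rayServiceUpgradeStrategy
}
```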
Why are these changes needed?
To allow disabling zero-downtime upgrades for a specific RayService.

Related issue number
Part of #2397