
[Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec #2468

Open — wants to merge 1 commit into base: master

Conversation

@chiayi (Contributor) commented Oct 23, 2024:

Why are these changes needed?

To allow disabling zero-downtime upgrades for a specific RayService.

Related issue number

Part of #2397

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

type UpgradeStrategy string

const (
    BlueGreenUpgrade UpgradeStrategy = "BlueGreenUpgrade"
Collaborator:

Thinking of alternatives out loud:

  1. RollingUpdate / Recreate
  2. CreateThenDelete / DeleteThenCreate
  3. CreateFirst / DeleteFirst

Collaborator:

Or perhaps we can follow upgrade strategy types similar to StatefulSet: RollingUpdate / OnDelete https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies

Collaborator:

I can see "RollingUpdate" being misleading, because we only ever update one new RayCluster at a time, so it's not really a rolling update.

Contributor Author:

I think we can use "OnDelete", as it fits both "delete first" and "delete then create". For the zero-downtime strategy, maybe we can name it something along the lines of "PrepareUpdateThenSwap" to describe that the updated cluster is getting ready?

Collaborator:

OnDelete implies the user has to manually delete the cluster first. This is a new behavior, which is maybe okay? If we are still automatically deleting the cluster, then it should be just Delete, DeleteFirst, Replace, something like that.

Contributor Author:

Ahh I see, so like upgrade OnDelete. If that's the case, my vote would be for "Replace" or "DeleteFirst".

Collaborator:

Thinking about this more, I think having a separate strategy for OnDelete and DeleteFirst would be valuable. So in total we could support 3 upgrade strategies

  1. spec.upgradeStrategy=RollingUpdate -- same behavior as ENABLE_ZERO_DOWNTIME=true
  2. spec.upgradeStrategy=DeleteFirst -- same behavior as ENABLE_ZERO_DOWNTIME=false
  3. spec.upgradeStrategy=OnDelete -- new behavior, only upgrade if user manually deletes the RayCluster

Member:

> OnDelete implies the user has to manually delete the cluster first. This is a new behavior, which is maybe okay?

I think the behavior of OnDelete is fine, but it may require additional code changes, such as updating the CR status. This is beyond the scope of this PR. Maybe we can open an issue to track the progress, and we can revisit whether we need this behavior after the refactoring of RayService.

Member:

For the ENABLE_ZERO_DOWNTIME=false case, how about naming it None, Disabled, or NoUpgrade?

@MortalHappiness (Member) commented Oct 25, 2024:

The current behavior of ENABLE_ZERO_DOWNTIME=false is to ignore all spec changes and do nothing.

@chiayi force-pushed the rayservice-upgrade branch 2 times, most recently from b1c2096 to 179207c (October 29, 2024 21:02)
ray-operator/apis/ray/v1/rayservice_types.go (review thread resolved)
if zeroDowntimeEnvVar != "" {
    enableZeroDowntime = strings.ToLower(zeroDowntimeEnvVar) != "false"
} else {
    enableZeroDowntime = rayServiceSpecUpgradeStrategy == rayv1.NewCluster
}
Member:

I think this may cause issues when upgrading the CRD. Can you test the case that:

  1. Install KubeRay v1.2.2
  2. Create a RayService
  3. Upgrade KubeRay to this PR (without upgrading CRD)
  4. Try to trigger zero-downtime upgrade.

I expect that the zero-downtime upgrade will not be triggered. The reason is that the KubeRay v1.2.2's CRD doesn't have the field rayServiceInstance.Spec.RayServiceUpgradeStrategy, so the zero value of string will be used, so rayServiceSpecUpgradeStrategy == rayv1.NewCluster will be false.

Member:

If my above statement is correct (I'm not 100% sure), can you:

  1. Add a check in validateRayServiceSpec to make sure the value of rayServiceInstance.Spec.RayServiceUpgradeStrategy is valid (for now, NewCluster, None, and the zero value of string are valid):

     func validateRayServiceSpec(rayService *rayv1.RayService) error {

  2. Handle the case where RayServiceUpgradeStrategy is an empty string.
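The requested check might look something like the following sketch. This is a hypothetical illustration, not the PR's actual code; the function name validateUpgradeStrategy and the standalone type definitions are assumptions for the sake of a self-contained example, while the accepted values (NewCluster, None, and the zero value of string) follow the comment above.

```go
package main

import "fmt"

// RayServiceUpgradeStrategy and its values follow the names used in this PR.
type RayServiceUpgradeStrategy string

const (
	NewCluster RayServiceUpgradeStrategy = "NewCluster"
	None       RayServiceUpgradeStrategy = "None"
)

// validateUpgradeStrategy accepts only NewCluster, None, and the zero value
// of string (so CRs created under an older CRD without the field stay valid).
func validateUpgradeStrategy(s RayServiceUpgradeStrategy) error {
	switch s {
	case NewCluster, None, "":
		return nil
	}
	return fmt.Errorf("invalid upgradeStrategy %q: supported values are NewCluster and None", string(s))
}

func main() {
	// An unsupported value, such as the earlier "BlueGreenUpgrade" naming,
	// would be rejected with a descriptive error.
	fmt.Println(validateUpgradeStrategy("BlueGreenUpgrade"))
}
```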

Member:

In addition, would you mind adding some comments to summarize the mechanism to control zero-downtime upgrade?

Contributor Author:

Your statement is correct: with the current logic, the zero-downtime upgrade was not triggered.

I changed the logic within the RayService controller to default zero-downtime to true; let me know if this is sufficient.
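A minimal sketch of that defaulting, based on the diff quoted earlier in this thread (env var checked first, spec consulted otherwise) plus the fix of treating an unset spec as enabled. The helper name zeroDowntimeEnabled and the standalone type declarations are assumptions for illustration, not KubeRay's exact code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// RayServiceUpgradeStrategy mirrors the CR-level field discussed in this PR.
type RayServiceUpgradeStrategy string

const (
	NewCluster RayServiceUpgradeStrategy = "NewCluster"
	None       RayServiceUpgradeStrategy = "None"
)

// zeroDowntimeEnabled: a non-empty ENABLE_ZERO_DOWNTIME env var decides;
// otherwise the spec decides, and an unset spec ("") defaults to enabled,
// matching KubeRay's pre-existing behavior (and the CRD-upgrade case where
// the field does not exist yet).
func zeroDowntimeEnabled(spec RayServiceUpgradeStrategy) bool {
	if v := os.Getenv("ENABLE_ZERO_DOWNTIME"); v != "" {
		return strings.ToLower(v) != "false"
	}
	return spec != None
}

func main() {
	fmt.Println(zeroDowntimeEnabled(None))       // spec explicitly disables
	fmt.Println(zeroDowntimeEnabled(NewCluster)) // spec explicitly enables
	fmt.Println(zeroDowntimeEnabled(""))         // unset spec: enabled by default
}
```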

@kevin85421 self-assigned this Oct 31, 2024
@chiayi force-pushed the rayservice-upgrade branch 2 times, most recently from c2567a1 to 818e12b (November 4, 2024 19:36)
@chiayi marked this pull request as ready for review November 4, 2024 19:36
@@ -57,6 +66,9 @@ type RayServiceSpec struct {
    DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
    // ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
    ServeService *corev1.Service `json:"serveService,omitempty"`
    // UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`.
    // +kubebuilder:default:=NewCluster
    UpgradeStrategy RayServiceUpgradeStrategy `json:"upgradeStrategy,omitempty"`
@kevin85421 (Member) commented Nov 5, 2024:

What will happen if a value other than NewCluster or None is specified for UpgradeStrategy? If OpenAPI doesn't complain, we should add a check in validateRayServiceSpec to make sure the value is valid.

https://github.com/ray-project/kuberay/pull/2468/files#r1823902608

Contributor Author:

Added validation to validateRayServiceSpec

@andrewsykim (Collaborator) left a comment:

LGTM @kevin85421 can you take another look?

ray-operator/controllers/ray/rayservice_controller.go (two outdated review threads, resolved)
@@ -815,6 +823,14 @@ func TestReconcileRayCluster(t *testing.T) {
        updateKubeRayVersion: true,
        kubeRayVersion:       "new-version",
    },
    // Test 7: Zero-downtime upgrade is enabled, but through the RayServiceSpec
Member:

Can we add more tests? We have 4 combinations of enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None).

@chiayi (Contributor Author) commented Nov 7, 2024:

Added the following combinations:

  • env var: false, spec: true
  • env var: true, spec: false
  • env var: false, spec: unset
  • env var: false, spec: false

Member:

On second thought, I believe there are 6 cases based on enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None, "").

Contributor Author:

I believe all 6 combinations are covered. The "" case should be covered by the original tests, and the ("NewCluster", "None") × (true, false) cases are covered by the new tests that were added.
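The six combinations discussed above can be enumerated side by side. This sketch is hypothetical: it takes the env var value as an explicit parameter (rather than reading the process environment), and it assumes the precedence shown in the diff quoted earlier in this thread, where a set env var wins and an unset spec defaults to enabled; the later discussion about CR-level precedence could change this ordering.

```go
package main

import (
	"fmt"
	"strings"
)

type RayServiceUpgradeStrategy string

const (
	NewCluster RayServiceUpgradeStrategy = "NewCluster"
	None       RayServiceUpgradeStrategy = "None"
)

// zeroDowntimeEnabled takes the env var value explicitly so the test-style
// combinations can be enumerated without mutating the environment.
func zeroDowntimeEnabled(envVar string, spec RayServiceUpgradeStrategy) bool {
	if envVar != "" {
		return strings.ToLower(envVar) != "false"
	}
	return spec != None
}

func main() {
	// Enumerate (true, false) x (NewCluster, None, "") = 6 cases.
	for _, envVar := range []string{"true", "false"} {
		for _, spec := range []RayServiceUpgradeStrategy{NewCluster, None, ""} {
			fmt.Printf("ENABLE_ZERO_DOWNTIME=%s spec=%q -> zero downtime: %v\n",
				envVar, spec, zeroDowntimeEnabled(envVar, spec))
		}
	}
}
```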

@chiayi force-pushed the rayservice-upgrade branch 4 times, most recently from 932941b to a5cdab5 (November 7, 2024 21:14)
@chiayi force-pushed the rayservice-upgrade branch 3 times, most recently from 48d17a6 to 8a44e43 (November 7, 2024 22:02)
@@ -57,6 +66,9 @@ type RayServiceSpec struct {
    DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
    // ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
    ServeService *corev1.Service `json:"serveService,omitempty"`
    // UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`.
    // +kubebuilder:default:=NewCluster
Member:

Defaulting upgradeStrategy to NewCluster seems to be a breaking change. For example, if users set ENABLE_ZERO_DOWNTIME to false and don't specify upgradeStrategy, zero-downtime upgrades were disabled in KubeRay v1.2.2 but are enabled with this PR.

Contributor Author:

That is what I'm seeing as well. Should we not have a default value for UpgradeStrategy then?
cc @andrewsykim What do you think?

Member:

> Should we not have a default value for UpgradeStrategy then?

Having no default value sounds reasonable if there isn’t a better solution.

Contributor Author:

Removed the defaulting.

Collaborator:

There needs to be defaulting to define the default upgrade strategy though? Why not have ENABLE_ZERO_DOWNTIME take precedence to preserve backwards compatibility?

Member:

ENABLE_ZERO_DOWNTIME is an operator-level configuration, and UpgradeStrategy is a CR-level configuration. My current reasoning for why the operator-level config should take precedence is as follows:

  1. KubeRay operator is managed by a ML infra team.
  2. Custom resources are managed by application teams (e.g., recommendation systems, LLM, etc.).

The ML infra team will set a default configuration, i.e., ENABLE_ZERO_DOWNTIME in this case, for KubeRay that can work for most use cases.

However, in certain scenarios, some application teams may have rare use cases that require customizing their CRs. In these cases, they might want to set a different upgrade strategy for their RayService.

If the operator-level configuration takes precedence over the CR-level configuration, application teams would need to request changes to the KubeRay operator's configuration from the infra team. This could prevent these teams from sharing the KubeRay operator with other teams.

Does this make sense?

@chiayi force-pushed the rayservice-upgrade branch 2 times, most recently from 59a99c2 to 623360e (November 8, 2024 00:58)
@@ -824,6 +871,9 @@ func TestReconcileRayCluster(t *testing.T) {
    if !tc.enableZeroDowntime {
        os.Setenv(ENABLE_ZERO_DOWNTIME, "false")
    }
    if tc.rayServiceUpgradeStrategy != "" {
        rayService.Spec.UpgradeStrategy = tc.rayServiceUpgradeStrategy
Member:

Can we update the deep copy (service := rayService.DeepCopy()) instead?

4 participants