
Investigate the effect of using RollingUpdate strategy for workspaces deployments #21974

Closed
3 tasks
l0rd opened this issue Jan 31, 2023 · 12 comments
Labels: area/che-operator · area/devworkspace-operator · kind/task · status/analyzing

Comments

@l0rd
Contributor

l0rd commented Jan 31, 2023

Is your task related to a problem? Please describe

Currently, workspace deployments use the Recreate deployment strategy. From the docs (in Che's case there is only one Pod):

All existing Pods are killed before new ones are created

The rationale behind the decision to use Recreate is that re-attaching the PV may fail if the new Pod is created while the existing one still exists.

On the other hand, if RollingUpdate were used instead of Recreate, the client side of the IDE would never lose the connection with the backend. Developers would not notice that the Pod has been restarted, unless a command (build / debug / terminal) was running during the update.
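
For context, the difference between the two is just the Deployment's spec.strategy stanza. Below is a minimal sketch of what a RollingUpdate variant could look like for a single-replica workspace Deployment; the name, label, and container are illustrative and not necessarily what DWO generates today:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workspace-deployment           # illustrative name
spec:
  replicas: 1
  strategy:
    # Current DWO behaviour is `type: Recreate`, which kills the old Pod first.
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                    # keep the old Pod until the new one is Ready
      maxSurge: 1                          # allow one extra Pod during the rollout
  selector:
    matchLabels:
      app: my-workspace                    # illustrative label
  template:
    metadata:
      labels:
        app: my-workspace
    spec:
      containers:
        - name: dev
          image: quay.io/devfile/universal-developer-image:latest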

Describe the solution you'd like

It's time to re-evaluate whether the PV re-attach would fail with RollingUpdate. We should test and compare workspace updates with both strategies.

We should test with workspaces using different storage strategies (ephemeral, per workspace, per user) and with VS Code as an editor (testing with JetBrains Gateway would be a plus but not required).

Based on the results we should:

  • decide if we want to stay with Recreate or not
  • document the effect of a cluster update when nodes get evicted and workspace Pods are disrupted
  • decide if we should make improvements on the IDE side to make the experience smoother (i.e. warn the user that the workspace is going to be restarted on another node and reconnect automatically)

Describe alternatives you've considered

We could have a DWO / CheCluster option to set a PodDisruptionBudget on workspace Pods with maxUnavailable=0. That would avoid workspace Pod disruption during a cluster update (the update is put on hold until the Pod is stopped).
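
A minimal sketch of what that alternative could look like; the PDB name and selector label are assumptions for illustration, not an existing DWO / CheCluster API:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-workspace-pdb                   # hypothetical name
spec:
  maxUnavailable: 0                        # never voluntarily evict the workspace Pod
  selector:
    matchLabels:
      app: my-workspace                    # assumed workspace Pod label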

Additional information

In any case, in our experience cluster updates are scheduled during non-working hours (e.g. weekends), so Pod disruption is not a problem because workspaces get idled in the meantime.

@l0rd added the kind/task label on Jan 31, 2023
@che-bot added the status/need-triage label on Jan 31, 2023
@AObuchow added the status/analyzing and area/devworkspace-operator labels and removed the status/need-triage label on Jan 31, 2023
@AObuchow

@l0rd this is a pretty interesting topic. I'd be open to investigating this further if the priority is high enough. However, if this is something you plan to look into yourself, that's fine with me :)

@l0rd
Contributor Author

l0rd commented Feb 1, 2023

@AObuchow this is something that we need to discuss during the next sprint planning.

@AObuchow

AObuchow commented Mar 1, 2023

I've been testing the RollingUpdate deployment strategy, with the maximum number of unavailable workspace pods set to 0, the maximum surge set to 1, and Che Code as the editor.

To test how the editor responds to the updated deployment, I edited the devworkspace and requested more memory (several times for each storage strategy). I would then try switching editor files from the file explorer.
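
(For anyone reproducing this: one way to trigger such a rollout is a JSON patch on the DevWorkspace. The workspace name, namespace variable, and component index below are examples and may differ in your setup.)

# Bump the memory limit of the first component to force a Deployment rollout
kubectl patch dw code-latest-2 -n $NAMESPACE --type json -p \
  '[{"op": "replace", "path": "/spec/template/components/0/container/memoryLimit", "value": "3Gi"}]'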

In my experience, the behaviour seems to be the same for the ephemeral, per-user and per-workspace storage strategies. Che Code would prompt the user to refresh the page with the following pop-up:

[Screenshot: Che Code pop-up asking the user to reload the window ("Cannot reconnect. Please reload the window.")]

In some instances, this prompt wouldn't appear and, when changing editor files, the file would fail to load; manually refreshing the browser window fixed the issue. I haven't been able to figure out the exact steps to reproduce this, though it happened only infrequently.

For the per-user and per-workspace storage strategies, the provisioned PVC remained in the bound state.

One thing I observed was that the workspace status (i.e. kubectl get dw -n $NAMESPACE) would report certain phases and status messages multiple times when the deployment was updated (e.g. Starting/Networking ready -> Running -> Starting/Waiting for editor to start -> Starting/Waiting for workspace deployment -> Starting/Waiting for editor to start -> Running). This is likely just a bug in my patch that would have to be addressed if this becomes a full-fledged feature.

kubectl get dw -n $NAMESPACE -w
NAME            DEVWORKSPACE ID             PHASE      INFO
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2   workspace658fd06d97874b1e   Starting   Networking ready
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for workspace deployment
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for editor to start
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for workspace deployment
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for editor to start
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io

If someone wants to test using workspaces with the RollingUpdate deployment strategy, I've uploaded an image of DWO that supports this feature at quay.io/aobuchow/devworkspace-controller:rolling-update.

In order to use the feature, a new deploymentStrategy field has been added to the DWOC to toggle between the Recreate (default) and RollingUpdate deployment strategy:

kind: DevWorkspaceOperatorConfig
apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.39.117.nip.io
    defaultRoutingClass: basic
  workspace:
+    deploymentStrategy: RollingUpdate
    imagePullPolicy: Always

@l0rd
Contributor Author

l0rd commented Mar 1, 2023

I think we should open 2 new issues about the message "Cannot reconnect. Please reload the window.":

  1. it should be improved in the case of the workspace being restarted:
    • with a better explanation ("The workspace has been restarted" instead of "Cannot reconnect")
    • split into 2 steps ("The workspace is restarting. Please wait." and then "The workspace has been restarted. Please reload the window.")
  2. Investigate why sometimes it doesn't show up.

cc @azatsarynnyy

@l0rd
Contributor Author

l0rd commented Mar 3, 2023

@amisevsk @AObuchow considering the investigation results, I would move forward with devfile/devworkspace-operator#1057 but set RollingUpdate as the default strategy. Are you OK with that?

@amisevsk
Contributor

amisevsk commented Mar 3, 2023

Just to confirm, do we want the default to be rolling or recreate? IMO we should stick with the safer option (recreate).

@AObuchow

AObuchow commented Mar 3, 2023

IMO we should stick with the safer option (recreate).

I'm also of the same opinion.

Additionally, perhaps I should also add a CheCluster CR field to configure the workspace deployment strategy so that users don't have to modify the DWOC themselves. Perhaps if feedback on the RollingUpdate strategy is positive (or neutral), we should change it to the default in a future release?

@amisevsk
Contributor

amisevsk commented Mar 3, 2023

The risk with defaulting to rollingUpdate is that workspace rollouts will fail randomly (and silently), depending on cluster load.

To reproduce this case, you can start a workspace that requests more than half the memory available on any node, e.g. the one below (tested on a cluster where worker nodes have 16 GB of memory):

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: code-latest
spec:
  started: true
  template:
    components:
      - name: dev
        container:
          image: quay.io/devfile/universal-developer-image:latest
          memoryRequest: 8Gi
          memoryLimit: 8Gi
  contributions:
    - name: che-code
      uri: https://eclipse-che.github.io/che-plugin-registry/main/v3/plugins/che-incubator/che-code/latest/devfile.yaml
      components:
        - name: che-code-runtime-description
          container:
            env:
              - name: CODE_HOST
                value: 0.0.0.0

  • If you trigger a rollout (e.g. update memoryRequest and memoryLimit to 9Gi), the new pod is created but is stuck in a Pending state with an unschedulable condition:

    0/6 nodes are available: 3 node(s) had untolerated taint
    {node-role.kubernetes.io/master: }, 4 Insufficient memory, 4 node(s) had
    volume node affinity conflict. preemption: 0/6 nodes are available: 1 No
    preemption victims found for incoming pod, 5 Preemption is not helpful
    for scheduling.

  • If you instead switch the workspace's storage-type to ephemeral so that no PVC is used, the rollout succeeds with no downtime on the pod.

In the former case, the workspace is still "running" as the original pod is not terminated, but any changes to the workspace are not reflected in the workspace pods.
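
One way to make that failure visible rather than silent is to check the rollout status of the workspace Deployment and the events on the Pending Pod. This is a sketch using standard kubectl; the Deployment name below is an example taken from the earlier test output and will differ per workspace:

# Times out because the new Pod stays Pending/unschedulable while the old one keeps serving
kubectl rollout status deployment/workspace658fd06d97874b1e -n $NAMESPACE --timeout=60s
# Inspect the Pending Pod's events to see the scheduling failure reason
kubectl get pods -n $NAMESPACE
kubectl describe pod <pending-workspace-pod> -n $NAMESPACE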

@l0rd
Contributor Author

l0rd commented Mar 6, 2023

OK, let's make it configurable via the CheCluster CR, but with Recreate as the default:

spec:
  devEnvironments:
    deploymentStrategy: 'Recreate'

@AObuchow

Update on this: I've changed the DWO-side PR from Draft to ready for review.

I still need to make the Che-Operator PR, however.

@AObuchow added the area/che-operator label on Mar 27, 2023
@AObuchow

This feature is now implemented on both the DWO and Che-Operator sides. You can now configure the workspace deployment strategy through the CheCluster spec.devEnvironments.deploymentStrategy field, which can be set to Recreate (the default) or RollingUpdate.
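
For reference, switching an existing installation over can be done with a patch like the following; it assumes the CheCluster is named eclipse-che in the eclipse-che namespace, so adjust to your install:

kubectl patch checluster eclipse-che -n eclipse-che --type merge -p \
  '{"spec": {"devEnvironments": {"deploymentStrategy": "RollingUpdate"}}}'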

Leaving this issue open for the time being since there might be more work to do on the Che Code side, and experimentation with the IDEA editor should probably be done:

I think we should open 2 new issues about the message "Cannot reconnect. Please reload the window.":

1. it should be improved in the case of the workspace being restarted:
   
   * with a better explanation ("The workspace has been restarted" instead of "Cannot reconnect")
   * split into 2 steps ("The workspace is restarting. Please wait." and then "The workspace has been restarted. Please reload the window.")

2. Investigate why sometimes it doesn't show up.

@l0rd
Contributor Author

l0rd commented Apr 28, 2023

I am closing this issue as using RollingUpdate is now possible. I will open a separate issue to handle the eviction / restart better on the VS Code and IntelliJ side.
