
Investigate the effect of using RollingUpdate strategy for workspaces deployments #21974

Closed
3 tasks
l0rd opened this issue Jan 31, 2023 · 12 comments
Labels: area/che-operator · area/devworkspace-operator · kind/task · status/analyzing

Comments

@l0rd
Contributor

l0rd commented Jan 31, 2023

Is your task related to a problem? Please describe

Currently, workspace deployments use the Recreate deployment strategy. From the docs (in Che's case there is only one Pod):

All existing Pods are killed before new ones are created

The rationale behind the decision to use Recreate is that re-attaching the PV may fail if the new Pod is created while the existing one still exists.

On the other hand, if RollingUpdate were used instead of Recreate, the client side of the IDE would never lose the connection with the backend. Developers would not notice that the Pod has been restarted, unless a command (build / debug / terminal) was running during the update.
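
For context, the difference between the two is just the Deployment's spec.strategy stanza. Below is a minimal sketch of what a RollingUpdate variant could look like for a single-replica workspace Deployment; the name, label, and container are illustrative and not necessarily what DWO generates today:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workspace-deployment           # illustrative name
spec:
  replicas: 1
  strategy:
    # Current DWO behaviour is `type: Recreate`, which kills the old Pod first.
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                    # keep the old Pod until the new one is Ready
      maxSurge: 1                          # allow one extra Pod during the rollout
  selector:
    matchLabels:
      app: my-workspace                    # illustrative label
  template:
    metadata:
      labels:
        app: my-workspace
    spec:
      containers:
        - name: dev
          image: quay.io/devfile/universal-developer-image:latest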

Describe the solution you'd like

It's time to re-evaluate whether the PV re-attach would fail with RollingUpdate. We should test and compare workspace updates with both strategies.

We should test with workspaces using different storage strategies (ephemeral, per workspace, per user) and with VS Code as an editor (testing with JetBrains Gateway would be a plus but not required).

Based on the results we should:

  • decide if we want to stay with Recreate or not
  • document the effect of a cluster update when nodes get evicted and workspace Pods are disrupted
  • decide if we should make improvements on the IDE side to make the experience smoother (i.e. warn the user that the workspace is going to be restarted on another node and reconnect automatically)

Describe alternatives you've considered

We could have a DWO / CheCluster option to set a PodDisruptionBudget on workspace Pods with maxUnavailable=0. That would avoid workspace Pod disruption during a cluster update (the update is put on hold until the Pod is stopped).
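
A minimal sketch of what that alternative could look like; the PDB name and selector label are assumptions for illustration, not an existing DWO / CheCluster API:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-workspace-pdb                   # hypothetical name
spec:
  maxUnavailable: 0                        # never voluntarily evict the workspace Pod
  selector:
    matchLabels:
      app: my-workspace                    # assumed workspace Pod label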

Additional information

In any case, in our experience cluster updates are scheduled during non-working hours (e.g. weekends), so Pod disruption is not a problem because workspaces get idled in the meantime.

@l0rd added the kind/task label on Jan 31, 2023
@che-bot added the status/need-triage label on Jan 31, 2023
@AObuchow added the status/analyzing and area/devworkspace-operator labels and removed the status/need-triage label on Jan 31, 2023
@AObuchow

@l0rd this is a pretty interesting topic. I'd be open to investigating this further if the priority is high enough. However, if this is something you plan to look into yourself, that's fine with me :)

@l0rd
Contributor Author

l0rd commented Feb 1, 2023

@AObuchow this is something that we need to discuss during the next sprint planning.

@AObuchow

AObuchow commented Mar 1, 2023

I've been testing the RollingUpdate deployment strategy, with the maximum number of unavailable workspace pods set to 0, the maximum surge set to 1, and Che Code as the editor.

To test how the editor responds to the updated deployment, I edited the devworkspace and requested more memory (several times for each storage strategy). I would then try switching editor files from the file explorer.
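
(For anyone reproducing this: one way to trigger such a rollout is a JSON patch on the DevWorkspace. The workspace name, namespace variable, and component index below are examples and may differ in your setup.)

# Bump the memory limit of the first component to force a Deployment rollout
kubectl patch dw code-latest-2 -n $NAMESPACE --type json -p \
  '[{"op": "replace", "path": "/spec/template/components/0/container/memoryLimit", "value": "3Gi"}]'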

In my experience, the behaviour seems to be the same for the ephemeral, per-user and per-workspace storage strategies. Che Code would prompt the user to refresh the page with the following pop-up:

[Screenshot: Che Code pop-up asking the user to reload the window ("Cannot reconnect. Please reload the window.")]

In some instances, this prompt wouldn't appear and, when changing editor files, the file would fail to load; manually refreshing the browser window fixed the issue. I haven't been able to figure out the exact steps to reproduce this, though it happened only infrequently.

For the per-user and per-workspace storage strategies, the provisioned PVC remained in the bound state.

One thing I observed was that the workspace status (i.e. kubectl get dw -n $NAMESPACE) would report certain phases and status messages multiple times when the deployment was updated (e.g. Starting/Networking ready -> Running -> Starting/Waiting for editor to start -> Starting/Waiting for workspace deployment -> Starting/Waiting for editor to start -> Running). This is likely just a bug in my patch that would have to be addressed if this becomes a full-fledged feature.

kubectl get dw -n $NAMESPACE -w
NAME            DEVWORKSPACE ID             PHASE      INFO
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2   workspace658fd06d97874b1e   Starting   Networking ready
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for workspace deployment
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for editor to start
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for workspace deployment
code-latest-2   workspace658fd06d97874b1e   Starting   Waiting for editor to start
code-latest-2   workspace658fd06d97874b1e   Running    https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io

If someone wants to test using workspaces with the RollingUpdate deployment strategy, I've uploaded an image of DWO that supports this feature at quay.io/aobuchow/devworkspace-controller:rolling-update.

In order to use the feature, a new deploymentStrategy field has been added to the DWOC to toggle between the Recreate (default) and RollingUpdate deployment strategy:

kind: DevWorkspaceOperatorConfig
apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.39.117.nip.io
    defaultRoutingClass: basic
  workspace:
+    deploymentStrategy: RollingUpdate
    imagePullPolicy: Always

@l0rd
Contributor Author

l0rd commented Mar 1, 2023

I think we should open 2 new issues about the message "Cannot reconnect. Please reload the window.":

  1. it should be improved in the case of the workspace being restarted:
    • with a better explanation ("The workspace has been restarted" instead of "Cannot reconnect")
    • split into 2 steps ("The workspace is restarting. Please wait." and then "The workspace has been restarted. Please reload the window.")
  2. Investigate why sometimes it doesn't show up.

cc @azatsarynnyy

@l0rd
Contributor Author

l0rd commented Mar 3, 2023

@amisevsk @AObuchow considering the investigation results, I would move forward with devfile/devworkspace-operator#1057 but set RollingUpdate as the default strategy. Are you OK with that?

@amisevsk
Contributor

amisevsk commented Mar 3, 2023

Just to confirm, do we want the default to be rolling or recreate? IMO we should stick with the safer option (recreate).

@AObuchow

AObuchow commented Mar 3, 2023

IMO we should stick with the safer option (recreate).

I'm also of the same opinion.

Additionally, perhaps I should also add a CheCluster CR field to configure the workspace deployment strategy so that users don't have to modify the DWOC themselves. Perhaps if feedback on the RollingUpdate strategy is positive (or neutral), we should change it to the default in a future release?

@amisevsk
Contributor

amisevsk commented Mar 3, 2023

The risk with defaulting to rollingUpdate is that workspace rollouts will fail randomly (and silently), depending on cluster load.

To reproduce this case, you can start a workspace that requests more than half the memory available on any node, e.g. the one below (tested on a cluster where worker nodes have 16 GB of memory):

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: code-latest
spec:
  started: true
  template:
    components:
      - name: dev
        container:
          image: quay.io/devfile/universal-developer-image:latest
          memoryRequest: 8Gi
          memoryLimit: 8Gi
  contributions:
    - name: che-code
      uri: https://eclipse-che.github.io/che-plugin-registry/main/v3/plugins/che-incubator/che-code/latest/devfile.yaml
      components:
        - name: che-code-runtime-description
          container:
            env:
              - name: CODE_HOST
                value: 0.0.0.0

  • If you trigger a rollout (e.g. update memoryRequest and memoryLimit to 9Gi), the new pod is created but is stuck in a Pending state with an unschedulable condition:

    0/6 nodes are available: 3 node(s) had untolerated taint
    {node-role.kubernetes.io/master: }, 4 Insufficient memory, 4 node(s) had
    volume node affinity conflict. preemption: 0/6 nodes are available: 1 No
    preemption victims found for incoming pod, 5 Preemption is not helpful
    for scheduling.

  • If you instead switch the workspace's storage-type to ephemeral so that no PVC is used, the rollout succeeds with no downtime on the pod.

In the former case, the workspace is still "running" as the original pod is not terminated, but any changes to the workspace are not reflected in the workspace pods.
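
One way to make that failure visible rather than silent is to check the rollout status of the workspace Deployment and the events on the Pending Pod. This is a sketch using standard kubectl; the Deployment name below is an example taken from the earlier test output and will differ per workspace:

# Times out because the new Pod stays Pending/unschedulable while the old one keeps serving
kubectl rollout status deployment/workspace658fd06d97874b1e -n $NAMESPACE --timeout=60s
# Inspect the Pending Pod's events to see the scheduling failure reason
kubectl get pods -n $NAMESPACE
kubectl describe pod <pending-workspace-pod> -n $NAMESPACE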

@l0rd
Contributor Author

l0rd commented Mar 6, 2023

OK, let's make it configurable via the CheCluster CR, but with Recreate as the default:

spec:
  devEnvironments:
    deploymentStrategy: 'Recreate'

@AObuchow

Update on this: I've changed the DWO-side PR from Draft to ready for review.

I still need to make the Che-Operator PR, however.

@AObuchow added the area/che-operator label on Mar 27, 2023
@AObuchow

This feature is now implemented on both the DWO and Che-Operator sides. You can now configure the workspace deployment strategy through the CheCluster spec.devEnvironments.deploymentStrategy field, which can be set to Recreate (the default) or RollingUpdate.
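
For reference, switching an existing installation over can be done with a patch like the following; it assumes the CheCluster is named eclipse-che in the eclipse-che namespace, so adjust to your install:

kubectl patch checluster eclipse-che -n eclipse-che --type merge -p \
  '{"spec": {"devEnvironments": {"deploymentStrategy": "RollingUpdate"}}}'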

Leaving this issue open for the time being since there might be more work to do on the Che Code side, and experimentation with the IDEA editor should probably be done:

I think we should open 2 new issues about the message "Cannot reconnect. Please reload the window.":

1. it should be improved in the case of the workspace being restarted:
   
   * with a better explanation ("The workspace has been restarted" instead of "Cannot reconnect")
   * split into 2 steps ("The workspace is restarting. Please wait." and then "The workspace has been restarted. Please reload the window.")

2. Investigate why sometimes it doesn't show up.

@l0rd
Contributor Author

l0rd commented Apr 28, 2023

I am closing this issue as using RollingUpdate is now possible. I will open a separate issue to handle the eviction / restart better on the VS Code and IntelliJ side.
