Investigate the effect of using RollingUpdate strategy for workspaces deployments #21974
Comments
@l0rd this is a pretty interesting topic. I'd be open to investigating this further if the priority is high enough. However, if this is something you plan to look into yourself, that's fine with me :)
@AObuchow this is something that we need to discuss during the next sprint planning.
I've been testing the RollingUpdate strategy using the DWO image referenced at the end of this comment. To test how the editor responds to the updated deployment, I edited the devworkspace and requested more memory (several times for each storage strategy); the kind of edit involved is sketched after the status output below. I would then try switching editor files from the file explorer.

In my experience, the behaviour seems to be the same for the ephemeral, per-user and per-workspace storage strategies: Che Code would prompt the user to refresh the page with a pop-up.

In some instances this prompt wouldn't appear and, when changing editor files, the file would fail to load; manually refreshing the browser window fixed the issue. I haven't been able to figure out the exact steps to reproduce this, though it happened only infrequently.

For the per-user and per-workspace storage strategies, the provisioned PVC remained in the Bound state.

One thing I observed was that the workspace status (i.e. kubectl get dw -n $NAMESPACE -w) would cycle between Running and Starting while the deployment was updated:
NAME DEVWORKSPACE ID PHASE INFO
code-latest-2 workspace658fd06d97874b1e Running https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2 workspace658fd06d97874b1e Starting Networking ready
code-latest-2 workspace658fd06d97874b1e Starting Waiting for workspace deployment
code-latest-2 workspace658fd06d97874b1e Running https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
code-latest-2 workspace658fd06d97874b1e Starting Waiting for editor to start
code-latest-2 workspace658fd06d97874b1e Starting Waiting for workspace deployment
code-latest-2 workspace658fd06d97874b1e Starting Waiting for editor to start
code-latest-2 workspace658fd06d97874b1e Running https://workspace658fd06d97874b1e-che-code-3100.192.168.39.117.nip.io
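For reference, the devworkspace edits that triggered these rollouts were of roughly this shape (a sketch only: the component name matches the DevWorkspace example further down in this thread, and the memory values are illustrative), applied with kubectl edit dw <name> -n $NAMESPACE:

spec:
  template:
    components:
      - name: dev
        container:
          # Bumping the memory request/limit changes the Pod template,
          # which makes the Deployment roll out a new Pod.
          memoryRequest: 3Gi
          memoryLimit: 3Gi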
If someone wants to test workspaces with the RollingUpdate deployment strategy, I've uploaded an image of DWO that supports this feature. In order to use the feature, a new deploymentStrategy field needs to be set in the DevWorkspaceOperatorConfig:

kind: DevWorkspaceOperatorConfig
apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.39.117.nip.io
    defaultRoutingClass: basic
  workspace:
    deploymentStrategy: RollingUpdate   # the new field
    imagePullPolicy: Always
I think we should open 2 new issues about the message "Cannot reconnect. Please reload the window.":
@amisevsk @AObuchow considering the investigation results, I would move forward with devfile/devworkspace-operator#1057 but setting
Just to confirm, do we want the default to be rolling or recreate? IMO we should stick with the safer option (recreate).
I'm of the same opinion. Additionally, I could add a CheCluster CR field to configure the workspace deployment strategy so that users don't have to modify the DWOC themselves. If feedback on the RollingUpdate strategy is positive (or neutral), perhaps we should change it to the default in a future release?
The risk with defaulting to RollingUpdate is that workspace rollouts will fail randomly (and silently), depending on cluster load: with RollingUpdate the old and the new Pod must be schedulable at the same time, so the replacement Pod can stay Pending while the original keeps running. To reproduce this case, you can start a workspace that requests more than half the memory available on any node, e.g. the one below (tested on a cluster where worker nodes have 16GB of memory):

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: code-latest
spec:
  started: true
  template:
    components:
      - name: dev
        container:
          image: quay.io/devfile/universal-developer-image:latest
          memoryRequest: 8Gi
          memoryLimit: 8Gi
  contributions:
    - name: che-code
      uri: https://eclipse-che.github.io/che-plugin-registry/main/v3/plugins/che-incubator/che-code/latest/devfile.yaml
      components:
        - name: che-code-runtime-description
          container:
            env:
              - name: CODE_HOST
                value: 0.0.0.0
In that case the workspace is still "running", as the original Pod is not terminated, but any changes to the workspace are not reflected in the workspace Pods.
Ok, let's make it configurable via the CheCluster spec:

devEnvironments:
  deploymentStrategy: 'Recreate'
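For completeness, a minimal sketch of what that could look like on a full CheCluster resource (the metadata values here are illustrative; only the devEnvironments.deploymentStrategy field is the one discussed in this issue):

apiVersion: org.eclipse.che/v2
kind: CheCluster
metadata:
  name: eclipse-che        # illustrative
  namespace: eclipse-che   # illustrative
spec:
  devEnvironments:
    # 'Recreate' keeps the current behaviour; 'RollingUpdate' opts into the new strategy
    deploymentStrategy: Recreate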
Update on this: I've changed the DWO-side PR from Draft status to ready for review. I still need to make the che-operator PR, however.
This feature is now implemented on both the DWO and che-operator side. You can now configure the workspace deployment strategy through the CheCluster CR (spec.devEnvironments.deploymentStrategy). I'm leaving this issue open for the time being since there might be more work to do on the Che Code side, and experimentation with the IDEA editor should probably be done.
I am closing this issue as using RollingUpdate is now possible. I will open a separate issue to handle the eviction / restart better on the VS Code and IntelliJ side.
Is your task related to a problem? Please describe
Currently workspace deployments use the Recreate deployment strategy. From the doc (in the Che case there is only one Pod):

The rationale behind the decision to use Recreate is that re-attaching the PV may fail if the new Pod is created while the existing one still exists.

On the other hand, if RollingUpdate was used instead of Recreate, the client side of the IDE would never lose its connection with the backend. Developers would not notice that the Pod has been restarted, except if a command (build / debug / terminal) was running during the update.
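For context, on the underlying apps/v1 Deployment the two strategies are expressed as follows. This is a generic Kubernetes sketch; the rolling-update percentages shown are the Kubernetes defaults, not values chosen by DWO:

# Recreate: the existing Pod is terminated before the new one is created,
# so the PV is detached before the replacement needs it.
spec:
  strategy:
    type: Recreate

# RollingUpdate: the new Pod is created first and the old one is removed
# only once the new one is ready, so for a short time the old and the new
# Pod run side by side (which is also when PV re-attach could fail).
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Kubernetes default
      maxUnavailable: 25%  # Kubernetes default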
Describe the solution you'd like

It's time to re-evaluate whether the PV re-attach would fail using RollingUpdate. We should test and compare workspace updates with both strategies, using workspaces with different storage strategies (ephemeral, per-workspace, per-user) and with VS Code as the editor (testing with JetBrains Gateway would be a plus but not required).
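One way to pin a workspace to a given storage strategy for these tests is a per-workspace attribute; this is a sketch based on the controller.devfile.io/storage-type attribute exposed by DWO, and the exact attribute name and values should be double-checked against the DWO documentation:

spec:
  template:
    attributes:
      # one of: common (per-user), per-workspace, ephemeral
      controller.devfile.io/storage-type: per-workspace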
Based on the results we should decide whether to keep Recreate or not.

Describe alternatives you've considered
We could have a DWO / CheCluster option to set a PodDisruptionBudget with maxUnavailable=0 for workspace Pods. That would avoid workspace Pod disruption during a cluster update (the update is put on hold until the Pod gets stopped).
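A minimal sketch of such a PodDisruptionBudget, assuming one PDB per workspace and that workspace Pods carry the controller.devfile.io/devworkspace_id label (the exact selector would need to be confirmed against the labels DWO actually sets):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workspace658fd06d97874b1e-pdb   # illustrative name
spec:
  # No workspace Pod may be evicted voluntarily (e.g. by a node drain);
  # the drain waits until the workspace Pod is stopped or idled.
  maxUnavailable: 0
  selector:
    matchLabels:
      controller.devfile.io/devworkspace_id: workspace658fd06d97874b1e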
Additional information

In any case, from our experience, cluster updates are scheduled during non-working hours (during weekends), making Pod disruption not a problem as workspaces get idled in the meantime.