
[ws-manager] gracefully shuts down workspace, leaving behind bound PVCs, avoiding backup of user data, after unknown event #14266

Closed
Tracked by #7901
kylos101 opened this issue Oct 28, 2022 · 8 comments
Labels
type: bug Something isn't working

Comments

@kylos101
Contributor

kylos101 commented Oct 28, 2022

Bug description

We don't experience data loss, but the pod stops gracefully without completing a backup: when the user starts the workspace again, they would not have their data...even though we still have it in a PV.

I tried deleting the us72 cluster, but could not because there were two dangling PVCs:

gitpod /workspace/gitpod (main) $ kubectl get pvc
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS             AGE
ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef   Bound    pvc-35a16057-d21c-44e2-9a76-a36f44fb1866   30Gi       RWO            csi-gce-pd-g1-standard   47h
ws-eb6cb985-86f3-435b-9def-d820d2b9060a   Bound    pvc-50708c34-a721-4f3d-855e-f74c94e2c034   30Gi       RWO            csi-gce-pd-g1-standard   45h
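
As an aside, a quick way to confirm that a bound PVC is dangling (no pod consumes it) is to check its Used By field. A minimal sketch, assuming a reasonably recent kubectl (the field name can vary across versions):

kubectl describe pvc ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef | grep "Used By"
# Used By:  <none>   <- bound volume, but no pod mounts it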

For the first PVC, given the workspace logs and this workspace trace:

  1. StartWorkspace is logged for the workspace
  2. It cannot be scheduled (waiting for scale-up)
  3. StartWorkspace is logged again after 7 minutes (still not scheduled to a node)
    1. Which lines up with the startWorkspace error we see at 7m in the traces
    2. We poll for seven minutes to see whether the pending pod should be recreated and startWorkspace called again
    3. We force delete the original pod and try starting again using the original context 🤯
  4. The CSI provisioner is started for this workspace
  5. Ring 0 stops; we must have landed on node workspace-ws-us72-standard-pvc-pool-2dvw
  6. The workspace cannot connect to ws-daemon
  7. The workspace fails to start; the volume snapshot is empty
  8. "workspace fails to start" is logged again

Steps to reproduce

This could be because of:

  1. Node scale-up that is too slow (given the logs, perhaps).
  2. Another possibility is that we restarted ws-manager "during a key moment" and the snapshot did not proceed. We restarted ws-manager a couple of times this week.
  3. The 1h backup timeout, a byproduct of when we persisted workspace content on node storage and backed up to GCS.

So, either:

  1. Create an ephemeral cluster
  2. Start many workspaces with loadgen to force node scale-up; if scale-up takes >7 minutes, you'll hit the code path that was involved for these two workspaces
  3. Stop the workspaces
  4. Check whether they backed up or left PVCs behind (see the sketch after this list).
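
A minimal sketch of that check, assuming the workspace namespace is the current context and that the external-snapshotter CRDs are installed (so backups surface as VolumeSnapshot objects):

# After the workspaces stop, leftover ws-* claims indicate a missed backup.
kubectl get pvc
# Each stopped workspace should have produced a ready-to-use snapshot.
kubectl get volumesnapshot \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.readyToUse}{"\n"}{end}'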

or maybe

Stop a bunch of workspaces and, while they're stopping, stop ws-manager at different points (before, during, and after the snapshot); a sketch follows.
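
A minimal sketch of taking ws-manager down mid-stop (the gitpod namespace and the ws-manager deployment name are assumptions here):

# Begin stopping workspaces, then briefly remove ws-manager...
kubectl -n gitpod scale deployment ws-manager --replicas=0
# ...vary the timing: before, during, and after the snapshot...
kubectl -n gitpod scale deployment ws-manager --replicas=1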

or maybe

let a workspace hit the 1h backup timeout while stopping

Workspace affected

gitpodio-templatetypesc-qxnleu3pzu4

Expected behavior

There are a few things:

  1. We should try to back up for longer than 1h, so that we do not have to manually snapshot PVCs before we delete workspace clusters (see the sketch after this list).
  2. We should have a metric that tracks how long a PVC has been bound without a pod, and trigger an alert when one or more exists for too long.
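
For item 1, the manual snapshot could look something like this sketch (the snapshot name and VolumeSnapshotClass are assumptions; the PVC name is from the output above):

cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ws-ccd64f44-manual                # hypothetical name
spec:
  volumeSnapshotClassName: csi-gce-pd     # assumption: match the CSI driver in use
  source:
    persistentVolumeClaimName: ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef
EOF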

Questions:

  1. The affected workspace was gracefully Stopped (not Failed or Stopping), which means the user could have restarted their workspace without their files being restored. That would be very confusing, because their uncommitted files would be missing. Is this expected?

Example repository

No response

Anything else?

We currently stop trying to back up after a 1h timeout. This was a design decision for object-storage-based backups, and should be revisited as part of PVC.

@kylos101 kylos101 added the type: bug Something isn't working label Oct 28, 2022
@kylos101 kylos101 changed the title [ws-manager] leaves behind bound PVCs, avoiding backup of user data [ws-manager] leaves behind bound PVCs, avoiding backup of user data, after long node scale-up Oct 28, 2022
@sagor999
Contributor

As you mentioned, this could be related to the several ws-manager restarts in that cluster due to issues with it.

@kylos101
Contributor Author

Thanks @sagor999! I updated the description's steps to reproduce to consider that. I also updated "Anything else?" to note that the backup timeout is still 1h with PVC, and that we should reconsider it. For example, it could also explain why the PVCs were left dangling - I'm not sure whether we log that the 1h timeout was hit. Can you check?

@kylos101 kylos101 changed the title [ws-manager] leaves behind bound PVCs, avoiding backup of user data, after long node scale-up [ws-manager] leaves behind bound PVCs, avoiding backup of user data, after unknown event Oct 28, 2022
@kylos101 kylos101 changed the title [ws-manager] leaves behind bound PVCs, avoiding backup of user data, after unknown event [ws-manager] gracefully shuts down workspace, leaving behind bound PVCs, avoiding backup of user data, after unknown event Oct 28, 2022
@kylos101
Contributor Author

kylos101 commented Oct 30, 2022

@jenting can you focus on this issue? It is new (created on Friday), and it seems like it'll impact our ability to upload user data in a timely manner, potentially causing an odd experience for workspace restarts (when the snapshot has not been taken yet).

@jenting
Contributor

jenting commented Oct 31, 2022

A similar issue to #13282.

> We force delete the original pod and try starting again using the original context
> We don't experience data loss, but the pod stops gracefully without completing a backup: when the user starts the workspace again, they would not have their data...even though we still have it in a PV.

So, we recreate the start workspace request, and the 2nd workspace pod fails to start, correct?

@kylos101
Contributor Author

> So, we recreate the start workspace request, and the 2nd workspace pod fails to start, correct?

@jenting good observation, although I see the same failures logged here too. Does a PVC already exist when we force delete the pod? Would it make sense to also force delete the PVC and create a new one as part of the 2nd startWorkspace request? A sketch of what that could look like is below.
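
For the record, a minimal sketch of force deleting a PVC (the PVC name is from the output above; whether ws-manager should ever do this is exactly the open question):

# A bound PVC carries the kubernetes.io/pvc-protection finalizer, so a plain
# delete blocks while a pod still references it; clearing the finalizer forces it.
kubectl delete pvc ws-eb6cb985-86f3-435b-9def-d820d2b9060a --wait=false
kubectl patch pvc ws-eb6cb985-86f3-435b-9def-d820d2b9060a \
  --type=merge -p '{"metadata":{"finalizers":null}}'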

@kylos101
Contributor Author

@jenting another consideration for this issue is that the startWorkspace request could be for an existing workspace rather than a new one. In the case of an existing workspace, what is the user experience like? For example, I assume the 2nd workspace never starts. But if I try restarting the stopped workspace, is my data restored from the PVC snapshot?

@jenting
Contributor

jenting commented Nov 1, 2022

Closing this issue because it's a duplicate of #13282.

@jenting jenting closed this as not planned (duplicate) Nov 1, 2022
@jenting jenting moved this to Awaiting Deployment in 🌌 Workspace Team Nov 1, 2022