
[ws-manager] cannot restart stopped workspace - no backup found is hidden #14451

Closed · Tracked by #7901
kylos101 opened this issue Nov 4, 2022 · 8 comments
Labels: component: ws-manager, meta: stale, type: bug

Comments


kylos101 commented Nov 4, 2022

Bug description

I stopped a workspace by going over the ephemeral storage limit, then tried to restart the workspace, but cannot.

The error I see is:

```
cannot pull image: rpc error: code = NotFound desc = failed to pull and unpack image "reg.ws-ephemeral-101.gitpod.io:20000/remote/542ea066-93df-46b7-aa2c-a1c38a664aed:latest": failed to resolve reference "reg.ws-ephemeral-101.gitpod.io:20000/remote/542ea066-93df-46b7-aa2c-a1c38a664aed:latest": reg.ws-ephemeral-101.gitpod.io:20000/remote/542ea066-93df-46b7-aa2c-a1c38a664aed:latest: not found
```

Here are logs for my workspace (there were five instances). Here is a related trace with errors.

It looks like ws-manager is treating this as an unknown phase?

Here is a screenshot of what I see as a user:
[image]

Steps to reproduce

Not sure; I'm unable to recreate it.

The command I used to exceed ephemeral storage is:

```sh
fallocate -l 25G /var/tmp/test1
```

I tried in a new workspace, but was unsuccessful. A sketch of the eviction mechanism is below.
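For context, a minimal sketch of the mechanism: when a container writes past its ephemeral-storage limit, the kubelet evicts the pod. The pod spec, names, and sizes below are illustrative, not the actual workspace pod.

```sh
# Hypothetical pod that exceeds its ephemeral-storage limit; the kubelet
# should evict it (reason "Evicted") shortly after the write completes.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-repro
spec:
  restartPolicy: Never
  containers:
  - name: filler
    image: busybox
    # Write 2GiB into the writable layer, past the 1GiB limit below.
    command: ["sh", "-c", "dd if=/dev/zero of=/var/tmp/test1 bs=1M count=2048 && sleep 3600"]
    resources:
      limits:
        ephemeral-storage: "1Gi"
EOF
# Watch for the eviction event:
kubectl get events --field-selector reason=Evicted --watch
```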

Workspace affected

https://gitpodio-gitpod-no9dms43jkb.ws-ephemeral-101.gitpod.io/

Expected behavior

The real reason I cannot start the workspace is no backup found; shouldn't I see this as a user?

Also, I should be able to start my workspace.

Example repository

n/a

Anything else?

In this case, I was using PVC; the first workspace instance was ws-3ce2b1f6-be70-47d4-a778-77b08ed8d2fe. Here are related details for my VolumeSnapshots (VS).

```
gitpod /workspace/gitpod/dev/loadgen (aledbf/regfac) $ kubectl get pvc
No resources found in default namespace.
gitpod /workspace/gitpod/dev/loadgen (aledbf/regfac) $ kubectl get pv
No resources found
gitpod /workspace/gitpod/dev/loadgen (aledbf/regfac) $ kubectl get vs
NAME                                   READYTOUSE   SOURCEPVC                                 SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
04f18b9c-7ed4-404b-8367-ed8df802a937   true         ws-04f18b9c-7ed4-404b-8367-ed8df802a937                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-f098346e-f3e8-41e8-96d6-5068d98e79ca   7h10m          7h10m
3ce2b1f6-be70-47d4-a778-77b08ed8d2fe   true         ws-3ce2b1f6-be70-47d4-a778-77b08ed8d2fe                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-c610e0c6-fa33-4ffe-90ba-c11bd734302e   94m            94m
50ad7fd1-d4cf-4efc-a812-b5bf0f8c8082   true         ws-50ad7fd1-d4cf-4efc-a812-b5bf0f8c8082                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-2594c5ce-4e1f-4d56-859f-3b00fcdb8f21   7h33m          7h33m
5125eab3-11bd-45d6-bfb3-9bd6f6e13eed   true         ws-5125eab3-11bd-45d6-bfb3-9bd6f6e13eed                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-04a7eac7-23e0-4bcd-b9c2-ac7f43179d4c   3h26m          3h26m
858e439c-a24c-4140-a835-8f1a58fafbae   true         ws-858e439c-a24c-4140-a835-8f1a58fafbae                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-ed97f61a-dcb5-40f7-8682-2412d3c9b0a4   119m           119m
a0db60bf-83f7-4dc0-a9c5-421147b20634   true         ws-a0db60bf-83f7-4dc0-a9c5-421147b20634                           30Gi          csi-gce-pd-snapshot-class-g1-standard   snapcontent-57eff9bd-ad55-4d64-8f42-f429bc56a874   6h23m          6h23m
c1dfc1ba-faf3-4a32-8ef8-6d081f17b995   true         ws-c1dfc1ba-faf3-4a32-8ef8-6d081f17b995                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-1e4eca0d-1a3e-40f1-b73b-ab157494e4e7   6h42m          6h42m
e4cc9412-56fe-4f63-b127-009034cea379   true         ws-e4cc9412-56fe-4f63-b127-009034cea379                           50Gi          csi-gce-pd-snapshot-class-g1-large      snapcontent-8639f188-03dc-43df-bf74-12bbac6e2511   119m           119m
```

```
gitpod /workspace/gitpod/dev/loadgen (aledbf/regfac) $ kubectl describe vs 3ce2b1f6-be70-47d4-a778-77b08ed8d2fe
Name:         3ce2b1f6-be70-47d4-a778-77b08ed8d2fe
Namespace:    default
Labels:       app=gitpod
              component=workspace
              gitpod.io/networkpolicy=default
              gitpod.io/pvcFeature=true
              gitpod.io/workspaceClass=g1-large-pvc
              gpwsman=true
              headless=false
              metaID=gitpodio-gitpod-no9dms43jkb
              owner=8df3495b-685d-46e0-9820-009cc3b4afd8
              project=ef9d3f23-e5ab-4086-82f4-91813524bc3e
              team=c5895528-23ac-4ebd-9d8b-464228d5755f
              workspaceID=3ce2b1f6-be70-47d4-a778-77b08ed8d2fe
              workspaceType=regular
Annotations:  gitpod/id: 3ce2b1f6-be70-47d4-a778-77b08ed8d2fe
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2022-11-04T20:39:15Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generation:  1
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection":
          v:"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection":
    Manager:      snapshot-controller
    Operation:    Update
    Time:         2022-11-04T20:39:15Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:gitpod/id:
        f:labels:
          .:
          f:app:
          f:component:
          f:gitpod.io/networkpolicy:
          f:gitpod.io/pvcFeature:
          f:gitpod.io/workspaceClass:
          f:gpwsman:
          f:headless:
          f:metaID:
          f:owner:
          f:project:
          f:team:
          f:workspaceID:
          f:workspaceType:
      f:spec:
        .:
        f:source:
          .:
          f:persistentVolumeClaimName:
        f:volumeSnapshotClassName:
    Manager:      ws-manager
    Operation:    Update
    Time:         2022-11-04T20:39:15Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:boundVolumeSnapshotContentName:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
    Manager:         snapshot-controller
    Operation:       Update
    Subresource:     status
    Time:            2022-11-04T20:39:22Z
  Resource Version:  153901
  UID:               c610e0c6-fa33-4ffe-90ba-c11bd734302e
Spec:
  Source:
    Persistent Volume Claim Name:  ws-3ce2b1f6-be70-47d4-a778-77b08ed8d2fe
  Volume Snapshot Class Name:      csi-gce-pd-snapshot-class-g1-large
Status:
  Bound Volume Snapshot Content Name:  snapcontent-c610e0c6-fa33-4ffe-90ba-c11bd734302e
  Creation Time:                       2022-11-04T20:39:17Z
  Ready To Use:                        true
  Restore Size:                        50Gi
Events:                                <none>
```

Is it normal to leave behind so many VolumeSnapshots in the cluster? Should we create a separate issue to GC them?


kylos101 commented Nov 4, 2022

@sagor999 I still have this ephemeral cluster and can recreate the issue, is it something you want to peek at with me? 🙏


sagor999 commented Nov 4, 2022

> The real reason I cannot start the workspace is no backup found; shouldn't I see this as a user?

I think we have an issue for this already, where sometimes(?) errors from the image build are not propagating into the workspace error. But yes, that real reason should have been shown; it is a bug.

> Also, I should be able to start my workspace!

Was there a previous instance of this workspace that was healthy? Or was this the first instance of the workspace, which also failed?
The reason you get no backup found is a disconnect between webapp (it thinks there should be a backup) and workspace (the backup was never completed).

> Is it normal to leave behind so many VolumeSnapshots in the cluster?

It depends. Webapp should GC them. If they are all related to the same workspace, they should be GC'd faster (I don't remember the exact time it takes right now); otherwise, the last VS for a workspace will stay alive for 28(?) days. A quick way to inspect snapshot age is below.
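For anyone checking GC behavior by hand, this one-liner lists snapshots oldest-first (standard VolumeSnapshot API fields only; nothing Gitpod-specific assumed):

```sh
# Show each VolumeSnapshot with its source PVC and creation time,
# oldest first, to spot anything the GC should have collected already.
kubectl get volumesnapshot \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,PVC:.spec.source.persistentVolumeClaimName,CREATED:.metadata.creationTimestamp
```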

> It looks like ws-manager is treating this as an unknown phase?

This is probably the root of the problem here.

Please schedule this in, and @jenting or myself will take a look.

kylos101 moved this to Scheduled in 🌌 Workspace Team Nov 4, 2022

kylos101 commented Nov 4, 2022

> Was there a previous instance of this workspace that was healthy?

Yes, the original (3ce2b1f6-be70-47d4-a778-77b08ed8d2fe) was healthy until it hit the ephemeral storage pod limit and was stopped by Kubernetes. I searched for a snapshot in GCP for the original instance ID, but was not able to find one 😢; a rough reconstruction of that search is below.
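Roughly how I searched, reconstructed after the fact (the UID filter is illustrative; the authoritative link is the snapshotHandle on the VolumeSnapshotContent):

```sh
# Resolve the GCP resource backing the in-cluster snapshot...
kubectl get volumesnapshotcontent snapcontent-c610e0c6-fa33-4ffe-90ba-c11bd734302e \
  -o jsonpath='{.status.snapshotHandle}'
# ...then look for it in GCP, e.g. by the UID embedded in the name:
gcloud compute snapshots list --filter="name~c610e0c6"
```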

> Webapp should GC them.

Really? Even though your average workspace cluster is ~7 days old, instead of 28? Or do these get created in "current" workspace clusters when WebApp tries to delete PVC snapshots that were created on "older" clusters?


jenting commented Nov 5, 2022

> Is it normal to leave behind so many VolumeSnapshots in the cluster? Should we create a separate issue to GC them?

- If it's a regular workspace, we could GC the older VolumeSnapshots for this workspace, leaving only the newest one.
- If it's a prebuild workspace, we should not GC the VolumeSnapshot, because we support opening a workspace from an older prebuild.

Note that our VolumeSnapshotClass deletion policy is Delete, so directly deleting the in-cluster VolumeSnapshot object will also remove the GCP snapshots/images.
From my perspective, we should consider adding a finalizer to the VolumeSnapshot to prevent a case where we delete the in-cluster VolumeSnapshot by accident. Alternatively, we could change the VolumeSnapshotClass deletion policy to Retain; deleting the GCP snapshots/images would then have to call the gcloud compute snapshots/images delete command instead (see the sketch below).
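A minimal sketch of the Retain alternative, assuming the GKE PD CSI driver visible in the outputs above; the class definition is illustrative, not our actual installer config:

```sh
# With deletionPolicy: Retain, deleting the in-cluster VolumeSnapshot and
# VolumeSnapshotContent leaves the GCP snapshot intact. Note this only
# affects newly created snapshots; existing VolumeSnapshotContents keep
# the policy they were created with.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-gce-pd-snapshot-class-g1-large
driver: pd.csi.storage.gke.io
deletionPolicy: Retain
EOF
# Cleanup would then need an explicit gcloud call, using the GCP snapshot
# name taken from the VolumeSnapshotContent's snapshotHandle:
gcloud compute snapshots delete SNAPSHOT_NAME --quiet
```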


kylos101 commented Nov 5, 2022

> From my perspective, we should consider adding a finalizer to the VolumeSnapshot to prevent a case where we delete the in-cluster VolumeSnapshot by accident. Alternatively, we could change the VolumeSnapshotClass deletion policy to Retain; deleting the GCP snapshots/images would then have to call the gcloud compute snapshots/images delete command instead.

@jenting that is a separate enhancement, I think. I created this issue because after stopping my workspace, I was unable to restart it. The most recent restart suggested there was a missing backup. I'm not sure what caused it to go missing.


sagor999 commented Nov 7, 2022

> Note that our VolumeSnapshotClass deletion policy is Delete, so directly deleting the in-cluster VolumeSnapshot object will also remove the GCP snapshots/images.

That is by design, so that when we GC snapshots they are automatically removed from GCP. Otherwise, as you mentioned, we would need to run the gcloud utility directly to remove them.
Though adding a finalizer might be a good idea to prevent accidental deletion by an admin; a sketch follows.
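Something like the following; the finalizer key is made up for illustration, not an existing Gitpod finalizer:

```sh
# Append a custom finalizer so the VolumeSnapshot cannot be fully deleted
# until an operator deliberately removes the finalizer again. JSON Patch
# "add" with path "/metadata/finalizers/-" appends to the existing array.
kubectl patch volumesnapshot 3ce2b1f6-be70-47d4-a778-77b08ed8d2fe --type=json \
  -p '[{"op": "add", "path": "/metadata/finalizers/-", "value": "gitpod.io/snapshot-protection"}]'
```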


kylos101 commented Nov 9, 2022

Removed from scheduled groundwork for now; let's treat this as a Day 2 item, pending how things go with PVC running at 10%.


stale bot commented Feb 19, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the meta: stale label Feb 19, 2023
stale bot closed this as completed Mar 19, 2023
github-project-automation bot moved this to Awaiting Deployment in 🌌 Workspace Team Mar 19, 2023
kylos101 closed this as not planned Mar 29, 2023