
[PVC] loadgen testing Pod can't mount Volume #13353

Closed
Tracked by #7901
jenting opened this issue Sep 27, 2022 · 56 comments · Fixed by #13594 or #13760
Labels: blocked, meta: stale, team: workspace, type: bug

Comments

@jenting
Contributor

jenting commented Sep 27, 2022

Bug description

Running the loadgen test with the PVC feature flag enabled, some of the pods can't mount their volume.
We ran the loadgen test twice:

  • 1st round: 1 of 100 workspaces couldn't come up
  • 2nd round: 16 of 100 workspaces couldn't come up

The pod event log

  Warning  FailedMount             21m (x25 over 56m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c") is attached: failed to find and re-link disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c with udevadm after retrying for 3s: failed to trigger udevadm fix of non existent disk for "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c": udevadm --trigger requested to fix disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c but no such disk was found
  Warning  FailedMount             5m (x22 over 54m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[vol-this-workspace daemon-mount]: timed out waiting for the condition
  Warning  FailedMount             54s (x5 over 23m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[daemon-mount vol-this-workspace]: timed out waiting for the condition

Steps to reproduce

https://gitpod.slack.com/archives/C02F19UUW6S/p1664263343721969?thread_ts=1664240249.861729&cid=C02F19UUW6S

Related upstream issue kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608

Workspace affected

No response

Expected behavior

When you start 300 workspaces with the g1-standard workspace class, they should all start and stop, with some landing on new nodes because of autoscaling. Then, when you start another 300 on the same nodes (no autoscaling needed because the nodes already exist), they should also all start and stop gracefully.

Example repository

No response

Anything else?

#7901

@jenting jenting added the type: bug label Sep 27, 2022
@jenting jenting added the team: workspace label Sep 27, 2022
@kylos101
Contributor

kylos101 commented Oct 2, 2022

👋 @jenting do we have a workaround in the meantime?

@jenting
Contributor Author

jenting commented Oct 3, 2022

👋 @jenting do we have a workaround in the meantime?

Nope. I tried five times, but I couldn't reproduce it.

@kylos101
Contributor

kylos101 commented Oct 3, 2022

@jenting how did you try to reproduce? I assume you tried starting 100 workspaces. Did you try any other permutations? For example, 200 or 400 workspaces?

@sagor999
Contributor

sagor999 commented Oct 3, 2022

I've looked at the upstream issue for this, and it appears to be an intermittent error that the Google folks cannot repro, and hence they are not able to pinpoint where the problem might be.

It appears one workaround is to delete the pod and re-create it, which I believe we can do as part of the startWorkspace call (we used to do something similar for another k8s bug involving memory OOM). So I think we could apply the same workaround here.
It also seems to be a GCP-related bug (??), though more research would be needed to see whether other PD drivers have a similar issue.

@sagor999
Contributor

sagor999 commented Oct 3, 2022

@jenting do you by any chance remember what the status of the pod was when this happened? Was it still pending? Or was it already in the running phase?

We can do an easy workaround implementation here, but I need a full YAML capture of the pod definition when this happens to implement it properly. Would you be able to try to repro it again?
The code can be added here:

switch pod.Status.Phase {

If we can add it there, then we will automatically delete the pod and re-create it, hopefully getting around this issue. 🤔
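
For illustration, a minimal sketch of such a check (not the actual ws-manager code; isUnrecoverableMountFailure, the event lookup, and the re-create wiring are hypothetical):

package manager

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isUnrecoverableMountFailure reports whether the pod has a FailedMount event that
// matches the "failed to find and re-link disk ... with udevadm" symptom from this issue.
func isUnrecoverableMountFailure(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod.Name,
	})
	if err != nil {
		return false, err
	}
	for _, ev := range events.Items {
		if ev.Reason == "FailedMount" && strings.Contains(ev.Message, "failed to find and re-link disk") {
			return true, nil
		}
	}
	return false, nil
}

// maybeDeleteStuckPendingPod deletes a pending pod that hit the mount failure so that
// the normal startWorkspace path can re-create it (the re-create part is omitted here).
func maybeDeleteStuckPendingPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	switch pod.Status.Phase {
	case corev1.PodPending:
		stuck, err := isUnrecoverableMountFailure(ctx, client, pod)
		if err != nil || !stuck {
			return err
		}
		return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	return nil
}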

@sagor999
Contributor

sagor999 commented Oct 3, 2022

On the other hand, if the pod is already considered running, then we would need to implement that workaround in a different place... 🤔 which will make it a bit more cumbersome. But bottom line, I think we can have a workaround for this problem very soon and unblock the PVC rollout (again).

@jenting
Contributor Author

jenting commented Oct 3, 2022

@sagor999
If I remember correctly, two cases:

  • Pod still pending: it stayed in the pending state for at least 30 mins. The PV status was healthy in the GCP console, and the CSI mount point existed on the host node. (This happened when Toru ran the loadgen.)
  • Pod running: describing the pod showed the error message, but after a few seconds the pod was able to mount the volume and was running. (This happened when I tried to reproduce the issue.)

So, I plan to launch an ephemeral cluster to reproduce this issue today.

@sagor999
Contributor

sagor999 commented Oct 3, 2022

Aha, good. Since it was pending, it will be much easier to implement a workaround.
Thank you for trying to repro it again 🙏
If you are able to repro it, grab the pod definition in YAML format; that way it will be easier to make a proper fix (to know which exact field to check for that specific error).

@sagor999
Contributor

sagor999 commented Oct 3, 2022

If you are not able to repro, let me know, and I will make a blind fix. 😅 🙈 👓

@jenting jenting self-assigned this Oct 4, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Oct 4, 2022
@jenting
Contributor Author

jenting commented Oct 4, 2022

I tried to reproduce this issue the whole day (I ran the test 10 times with 100 regular workspaces using PVC), but no luck; I couldn't reproduce it.

@kylos101
Contributor

kylos101 commented Oct 4, 2022

@jenting @sagor999

  • What was the peak number of running workspaces that you achieved at one time when trying to recreate?
  • Did you try starting 100, leaving them running, then starting another 100 while stopping the initial 100 at the same time?

In other words, test with more than 100 workspaces that are both starting and stopping. This way you have PVC creation and snapshotting happening at the same time, at a higher volume.

Why do I recommend a volume higher than 100? Consider this: running a test with volume and activity similar to EU may help uncover other hot spots. It'll be costly in terms of nodes, but it is important to know we can reach that volume in non-prod before trying it in prod.

@sagor999 sagor999 self-assigned this Oct 4, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 4, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Oct 5, 2022
@jenting jenting moved this from In Validation to Done in 🌌 Workspace Team Oct 7, 2022
@jenting
Contributor Author

jenting commented Oct 11, 2022

@jenting jenting reopened this Oct 11, 2022
@jenting jenting moved this from Done to Scheduled in 🌌 Workspace Team Oct 11, 2022
@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Oct 11, 2022
@utam0k
Contributor

utam0k commented Nov 8, 2022

I'll have to give it a try, but what about hooking the mount with seccomp notify? Wouldn't that device then be visible from the host?

func (h *InWorkspaceHandler) Mount(req *libseccomp.ScmpNotifReq) (val uint64, errno int32, flags uint32) {

@jenting
Contributor Author

jenting commented Nov 8, 2022

@utam0k We did try running the container with the command /.supervisor/workspacekit fsprep, which does not go through ring0/ring1/ring2. The idea is to not rely on ws-daemon but to use workspacekit to format the mounted block device.
However, it does not work. Related code

@Furisto
Member

Furisto commented Nov 8, 2022

Why does it need to be workspacekit that runs the fsprep? You could implement the operation in IWS and then call it from workspacekit.

@kylos101
Contributor

kylos101 commented Nov 8, 2022

Why does it need to be workspacekit that runs the fsprep? You could implement the operation in IWS and then call it from workspacekit.

@Furisto using IWS may put us back into the original issue, where a process above/outside the workspace container is involved with attach/detach of the disk.

See this comment for background.

@sagor999
Contributor

sagor999 commented Nov 8, 2022

I think there is a workaround for the permission issue.
We can run the mounting of the device in an init container that runs as a privileged container; each container has its own security context. I will implement that later today.
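
For reference, a hedged sketch of what such an init container could look like in the pod spec ws-manager builds (k8s.io/api types; the image, command, device path, and volume names are placeholders, not the real values):

package manager

import (
	corev1 "k8s.io/api/core/v1"
)

// fsprepInitContainer sketches a privileged init container that formats the PVC's raw
// block device and mounts it, while the workspace container itself stays unprivileged
// (each container has its own security context).
func fsprepInitContainer() corev1.Container {
	privileged := true
	return corev1.Container{
		Name:    "fsprep",
		Image:   "example.com/fsprep:placeholder", // hypothetical image
		Command: []string{"sh", "-c", "mkfs.ext4 -F /dev/workspace && mount /dev/workspace /workspace"},
		SecurityContext: &corev1.SecurityContext{
			Privileged: &privileged,
		},
		// The PVC is exposed to this container as a raw block device...
		VolumeDevices: []corev1.VolumeDevice{
			{Name: "vol-this-workspace", DevicePath: "/dev/workspace"},
		},
		// ...and the formatted filesystem is mounted at the workspace path.
		VolumeMounts: []corev1.VolumeMount{
			{Name: "shared-workspace-dir", MountPath: "/workspace"},
		},
	}
}

As the next comment notes, the catch is that a mount made inside the init container does not survive into the workspace container.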

@sagor999
Contributor

sagor999 commented Nov 8, 2022

The init container idea did not work. The mount does not persist between the init container and the regular container. :(
Or at least I do not know how to make it persist. The mount itself works in the init container (since I can run it as a root privileged container); I just do not know how to make it persist between containers, if that is even possible.

@sagor999 sagor999 removed their assignment Nov 9, 2022
@sagor999
Contributor

sagor999 commented Nov 9, 2022

I am out of ideas, so for now removing myself as assignee.

@utam0k
Contributor

utam0k commented Nov 9, 2022

@utam0k We did try running the container with the command /.supervisor/workspacekit fsprep, which does not go through ring0/ring1/ring2. The idea is to not rely on ws-daemon but to use workspacekit to format the mounted block device. However, it does not work. Related code

@jenting
First, from a security point of view, we don't allow running the workspace pod as host root. It is too dangerous.

Do you know which system call permissions are missing? Could it be a system call that is prohibited by seccomp?
https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/seccomp-profile-installer/main.go

@utam0k
Contributor

utam0k commented Nov 9, 2022

@Furisto using IWS may put us back into the original issue, where a process above/outside the workspace container is involved with attach/detach of the disk.

The IWS server has nsinsider, which creates a temporary process and puts it into an arbitrary Linux mount namespace (and other namespaces), e.g. the workspace's. Could this be used?
Sample code:

err = nsi.Nsinsider(wbs.Session.InstanceID, int(containerPID), func(c *exec.Cmd) {
	c.Args = append(c.Args, "prepare-dev", "--uid", strconv.Itoa(wsinit.GitpodUID), "--gid", strconv.Itoa(wsinit.GitpodGID))
})

nsinsider is basically just a temporary process; it terminates once its work is done.

cfg := NsinsiderOpts{
	MountNS: true,
}
for _, o := range opts {
	o(&cfg)
}
base, err := os.Executable()
if err != nil {
	return err
}

// collect the namespace files of the target process that nsinsider should enter
type mnt struct {
	Env    string
	Source string
	Flags  int
}
var nss []mnt
if cfg.MountNS {
	tpid := targetPid
	if cfg.MountNSPid != 0 {
		tpid = cfg.MountNSPid
	}
	nss = append(nss,
		mnt{"_LIBNSENTER_ROOTFD", fmt.Sprintf("/proc/%d/root", tpid), unix.O_PATH},
		mnt{"_LIBNSENTER_CWDFD", fmt.Sprintf("/proc/%d/cwd", tpid), unix.O_PATH},
		mnt{"_LIBNSENTER_MNTNSFD", fmt.Sprintf("/proc/%d/ns/mnt", tpid), os.O_RDONLY},
	)
}
if cfg.PidNS {
	nss = append(nss, mnt{"_LIBNSENTER_PIDNSFD", fmt.Sprintf("/proc/%d/ns/pid", targetPid), os.O_RDONLY})
}
if cfg.NetNS {
	nss = append(nss, mnt{"_LIBNSENTER_NETNSFD", fmt.Sprintf("/proc/%d/ns/net", targetPid), os.O_RDONLY})
}

// pass the namespace files to nsinsider as extra FDs, referenced via the _LIBNSENTER_* env vars
stdioFdCount := 3
cmd := exec.Command(filepath.Join(filepath.Dir(base), "nsinsider"))
mod(cmd)
cmd.Env = append(cmd.Env, "_LIBNSENTER_INIT=1", "GITPOD_INSTANCE_ID="+instanceID)
for _, ns := range nss {
	f, err := os.OpenFile(ns.Source, ns.Flags, 0)
	if err != nil {
		return fmt.Errorf("cannot open %s: %w", ns.Source, err)
	}
	defer f.Close()
	cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%d", ns.Env, stdioFdCount+len(cmd.ExtraFiles)))
	cmd.ExtraFiles = append(cmd.ExtraFiles, f)
}

// run nsinsider and forward its buffered log output
var cmdOut bytes.Buffer
cmd.Stdout = &cmdOut
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
err = cmd.Run()
log.FromBuffer(&cmdOut, log.WithFields(log.OWI("", "", instanceID)))
if err != nil {
	out, oErr := cmd.CombinedOutput()
	if oErr != nil {
		return fmt.Errorf("run nsinsider (%v) \n%v\n output error: %v",
			cmd.Args,
			err,
			oErr,
		)
	}
	return fmt.Errorf("run nsinsider (%v) failed: %q\n%v",
		cmd.Args,
		string(out),
		err,
	)
}
return nil
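
For the block-device case being discussed, a hypothetical call along the same lines could look like this (the import path, the fsprep subcommand, and its flags are assumptions mirroring the prepare-dev example above):

package iws

import (
	"os/exec"

	nsi "github.com/gitpod-io/gitpod/ws-daemon/pkg/nsinsider" // assumed import path
)

// formatAndMountBlockDevice asks nsinsider to enter the workspace container's mount
// namespace and run a (hypothetical) fsprep subcommand that formats the attached block
// device and mounts it at /workspace.
func formatAndMountBlockDevice(instanceID string, containerPID int) error {
	return nsi.Nsinsider(instanceID, containerPID, func(c *exec.Cmd) {
		c.Args = append(c.Args, "fsprep", "--device", "/dev/sdX", "--target", "/workspace") // placeholder device path
	})
}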

@jenting
Contributor Author

jenting commented Nov 9, 2022

The IWS server has nsinsider, which creates a temporary process and puts it into an arbitrary Linux mount namespace (and other namespaces), e.g. the workspace's. Could this be used?

Thank you @utam0k. It works as expected; reference branch.
The remaining problem is that the UID/GID of some files/folders is incorrect.

But we could rerun the loadgen first to see whether switching the PVC to a block device fixes this issue.
I am also worried that using IWS to mount/format the disk does not fix this issue, because of the concern in #13353 (comment).
Or perhaps we need to unmount the disk when ring2 or workspacekit stops 🤔.

cc @sagor999

@utam0k
Contributor

utam0k commented Nov 9, 2022

@jenting I'm glad to hear that 😍 Is there any other help you need now?

@jenting
Contributor Author

jenting commented Nov 10, 2022

I tried another approach: using an init container to mount the block device and format it.
This way, the PVC feature no longer relies on IWS. Here is the branch.

@kylos101
Contributor

Awesome, nice work @sagor999 and @jenting !!! 🚀

@jenting for Friday, can you:

  1. test that you can restart workspaces?
  2. assert integration tests pass?
  3. see if you can recreate the issue with loadgen?

Good luck! 🍀

@jenting
Contributor Author

jenting commented Nov 11, 2022

test that you can restart workspaces?

No. 😢

The workspace pod must execute fsfreeze --freeze /workspace on the block device mount point /workspace before requesting Kubernetes to create the PVC snapshot. Otherwise, the workspace content does not exist in the PVC snapshot (VolumeSnapshot).

Given the above, IWS needs to nsinsider into the workspace pod to execute fsfreeze --freeze /workspace before the workspace disposal process. That makes the PVC feature depend on ws-daemon, which does not fit our PVC design.
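
For illustration, a hedged sketch of that IWS-driven step (using util-linux nsenter here for brevity, where the real code would go through nsinsider; the helper name is made up):

package iws

import (
	"fmt"
	"os/exec"
)

// freezeWorkspaceFS enters the workspace container's mount namespace and freezes the
// filesystem mounted at /workspace so the PVC snapshot captures a consistent state.
// The matching "fsfreeze --unfreeze /workspace" after the snapshot is omitted.
func freezeWorkspaceFS(containerPID int) error {
	cmd := exec.Command("nsenter",
		"-t", fmt.Sprint(containerPID), // target the workspace container's PID
		"-m", // join its mount namespace
		"fsfreeze", "--freeze", "/workspace")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("fsfreeze failed: %v: %s", err, out)
	}
	return nil
}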

assert integration tests pass?

No. 😢 I ran into the incorrect file/folder UID/GID issue.

see if you can recreate the issue with loadgen?

I haven't run the loadgen because, with the current code, the workspace can't restart.

@sagor999 @WVerlaek Is it possible to run a hook when metadata.deletionTimestamp is set? Then we wouldn't have to rely on IWS to execute the fsfreeze --freeze /workspace command; instead, the pod could execute it by itself. As far as I know, there is a preStop hook, but it is triggered before the workspace pod is terminated, so it does not fit this use case.

@WVerlaek
Member

@jenting a bit of a hack, but would it be possible for the preStop hook to send a SIGTERM to the workspace entrypoint and wait until it exits before running fsfreeze? I'm not sure if k8s will let the preStop hook continue to run after the container itself exits, though.
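
For what it's worth, a hedged sketch of that hack as a preStop hook on the workspace container (k8s.io/api types; assumes the entrypoint is PID 1 in the container and that the hook keeps running long enough, which is exactly the open question):

package manager

import (
	corev1 "k8s.io/api/core/v1"
)

// preStopFreezeHook signals the workspace entrypoint, waits for it to exit, then
// freezes /workspace so the snapshot taken afterwards is consistent.
func preStopFreezeHook() *corev1.Lifecycle {
	return &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"sh", "-c",
					"kill -TERM 1 && " + // SIGTERM to the entrypoint
						"while kill -0 1 2>/dev/null; do sleep 1; done && " + // wait until it exits
						"fsfreeze --freeze /workspace"},
			},
		},
	}
}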

@kylos101
Contributor

kylos101 commented Nov 11, 2022

Related: kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608 (comment)

@kylos101
Contributor

@sagor999 @jenting this comment is super interesting.

Specifically:

To reiterate, you're only hitting this problem if you're detaching disks (either manually, or because of nodes getting into an unusually overloaded state) while there is active IO on the device.

For each approach we've tried, how do we know processes aren't using the disk/device at workspace stop, before we detach the PVC?
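
One way to answer that, as a sketch (not something we currently do): before detaching, walk /proc on the node and check whether any process still holds a file descriptor under the mount point:

package manager

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// openFDsUnder lists processes that still hold file descriptors under mountPoint
// (e.g. /workspace). It reads /proc/<pid>/fd symlinks, so it has to run on the node
// (or in the right PID namespace) with enough privileges.
func openFDsUnder(mountPoint string) ([]int, error) {
	procs, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, p := range procs {
		var pid int
		if _, err := fmt.Sscanf(p.Name(), "%d", &pid); err != nil {
			continue // not a process directory
		}
		fds, err := os.ReadDir(filepath.Join("/proc", p.Name(), "fd"))
		if err != nil {
			continue // process exited or no permission
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join("/proc", p.Name(), "fd", fd.Name()))
			if err == nil && strings.HasPrefix(target, mountPoint) {
				pids = append(pids, pid)
				break
			}
		}
	}
	return pids, nil
}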

@sagor999
Contributor

We only detach the PVC when the container is stopped.
So there is probably some zombie/ghost process that might still be doing IO after the container stopped. 🤔

@kylos101
Contributor

We only detach the PVC when the container is stopped. So there is probably some zombie/ghost process that might still be doing IO after the container stopped. 🤔

👍 Hmm, I wonder if this is no longer an issue now that an init container is mounting and formatting the block device. Any ideas on how to resolve the UID/GID issue, @sagor999?

@jenting
Contributor Author

jenting commented Nov 15, 2022

👍 Hmm, I wonder if this is no longer an issue now that an init container is mounting and formatting the block device. Any ideas on how to resolve the UID/GID issue, @sagor999?

@kylos101
The UID/GID issue is fine. The big problem/challenge is that ws-manager has to ask the workspace pod to execute fsfreeze before the workspace disposal process, which couples ws-manager to ws-daemon again.

@jenting
Contributor Author

jenting commented Nov 16, 2022

The frequency of occurrence on gen75; see the log.

@sagor999
Contributor

Removed from workspace project for now as we are rethinking our PVC approach.

@jenting jenting removed their assignment Nov 24, 2022
@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

1 similar comment
@stale

stale bot commented Jun 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale label Jun 17, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 22, 2024