
[PVC] loadgen testing Pod can't mount Volume #13353

Closed
Tracked by #7901
jenting opened this issue Sep 27, 2022 · 56 comments · Fixed by #13594 or #13760
Labels: blocked, meta: stale, team: workspace, type: bug

Comments

@jenting
Contributor

jenting commented Sep 27, 2022

Bug description

Running the loadgen test with the PVC feature flag enabled, some of the pods can't mount their volume.
We ran the loadgen test twice:

  • 1st round: 1 of 100 workspaces couldn't come up
  • 2nd round: 16 of 100 workspaces couldn't come up

The pod event log

  Warning  FailedMount             21m (x25 over 56m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c") is attached: failed to find and re-link disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c with udevadm after retrying for 3s: failed to trigger udevadm fix of non existent disk for "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c": udevadm --trigger requested to fix disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c but no such disk was found
  Warning  FailedMount             5m (x22 over 54m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[vol-this-workspace daemon-mount]: timed out waiting for the condition
  Warning  FailedMount             54s (x5 over 23m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[daemon-mount vol-this-workspace]: timed out waiting for the condition

Steps to reproduce

https://gitpod.slack.com/archives/C02F19UUW6S/p1664263343721969?thread_ts=1664240249.861729&cid=C02F19UUW6S

Related upstream issue kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608

Workspace affected

No response

Expected behavior

When you start 300 workspaces with the g1-standard workspace class, they should all start and stop, with some landing on new nodes because of autoscaling. Then, when you start another 300 on the same nodes (no autoscaling needed because the nodes already exist), they should also all start and stop gracefully.

Example repository

No response

Anything else?

#7901

@jenting jenting added the type: bug label Sep 27, 2022
@jenting jenting added the team: workspace label Sep 27, 2022
@kylos101
Contributor

kylos101 commented Oct 2, 2022

👋 @jenting do we have a workaround in the meantime?

@jenting
Contributor Author

jenting commented Oct 3, 2022

👋 @jenting do we have a workaround in the meantime?

Nope. I tried five times, but I couldn't reproduce it.

@kylos101
Contributor

kylos101 commented Oct 3, 2022

@jenting how did you try to reproduce? I assume you tried starting 100 workspaces. Did you try any other permutations? For example, 200 or 400 workspaces?

@sagor999
Contributor

sagor999 commented Oct 3, 2022

I've looked at the upstream issue for this, and it appears to be an intermittent error that the Google folks cannot repro, and hence they are not able to pinpoint where the problem might be.

It appears one workaround is to delete the pod and re-create it, which I believe we can do as part of the startWorkspace call (we used to do something similar for another k8s bug involving memory OOM). So I think we could apply the same workaround here.
It also seems to be a GCP-related bug (??), though more research would be needed to see whether other PD drivers have a similar issue.

@sagor999
Contributor

sagor999 commented Oct 3, 2022

@jenting do you by any chance remember what the status of the pod was when this happened? Was it still pending? Or was it already in the running phase?

We can do an easy workaround implementation here, but I need a full YAML capture of the pod definition when this happens to implement it properly. Would you be able to try to repro it again?
The code can be added here:

switch pod.Status.Phase {

If we can add it there, then we will automatically delete the pod and re-create it, hopefully getting around this issue. 🤔
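
For illustration, a minimal sketch of such a check (not the actual ws-manager code; isUnrecoverableMountFailure, the event lookup, and the re-create wiring are hypothetical):

package manager

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isUnrecoverableMountFailure reports whether the pod has a FailedMount event that
// matches the "failed to find and re-link disk ... with udevadm" symptom from this issue.
func isUnrecoverableMountFailure(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod.Name,
	})
	if err != nil {
		return false, err
	}
	for _, ev := range events.Items {
		if ev.Reason == "FailedMount" && strings.Contains(ev.Message, "failed to find and re-link disk") {
			return true, nil
		}
	}
	return false, nil
}

// maybeDeleteStuckPendingPod deletes a pending pod that hit the mount failure so that
// the normal startWorkspace path can re-create it (the re-create part is omitted here).
func maybeDeleteStuckPendingPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	switch pod.Status.Phase {
	case corev1.PodPending:
		stuck, err := isUnrecoverableMountFailure(ctx, client, pod)
		if err != nil || !stuck {
			return err
		}
		return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	return nil
}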

@sagor999
Contributor

sagor999 commented Oct 3, 2022

On the other hand, if the pod is already considered running, then we would need to implement that workaround in a different place... 🤔 which will make it a bit more cumbersome. But bottom line, I think we can have a workaround for this problem very soon and unblock the PVC rollout (again).

@jenting
Contributor Author

jenting commented Oct 3, 2022

@sagor999
If I remember correctly, two cases:

  • Pod still pending: it stayed in the pending state for at least 30 mins. The PV status was healthy in the GCP console, and the CSI mount point existed on the host node. (This happened when Toru ran the loadgen.)
  • Pod running: describing the pod showed the error message, but after a few seconds the pod was able to mount the volume and was running. (This happened when I tried to reproduce the issue.)

So, I plan to launch an ephemeral cluster to reproduce this issue today.

@sagor999
Contributor

sagor999 commented Oct 3, 2022

Aha, good. Since it was pending, it will be much easier to implement a workaround.
Thank you for trying to repro it again 🙏
If you are able to repro it, grab the pod definition in YAML format; that way it will be easier to make a proper fix (to know which exact field to check for that specific error).

@sagor999
Contributor

sagor999 commented Oct 3, 2022

If you are not able to repro, let me know, and I will make a blind fix. 😅 🙈 👓

@jenting jenting self-assigned this Oct 4, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Oct 4, 2022
@jenting
Contributor Author

jenting commented Oct 4, 2022

I tried to reproduce this issue the whole day (I ran the test 10 times with 100 regular workspaces using PVC), but no luck; I couldn't reproduce it.

@kylos101
Contributor

kylos101 commented Oct 4, 2022

@jenting @sagor999

  • What was the peak number of running workspaces that you achieved at one time when trying to recreate?
  • Did you try starting 100, leaving them running, then starting another 100 while stopping the initial 100 at the same time?

In other words, test with more than 100 workspaces that are both starting and stopping. This way you have PVC creation and snapshotting happening at the same time, at a higher volume.

Why do I recommend a volume higher than 100? Consider this: running a test with volume and activity similar to EU may help uncover other hot spots. It'll be costly in terms of nodes, but it is important to know we can reach that volume in non-prod before trying it in prod.

@sagor999 sagor999 self-assigned this Oct 4, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 4, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Oct 5, 2022
@jenting jenting moved this from In Validation to Done in 🌌 Workspace Team Oct 7, 2022
@jenting
Contributor Author

jenting commented Oct 11, 2022

@jenting jenting reopened this Oct 11, 2022
@jenting jenting moved this from Done to Scheduled in 🌌 Workspace Team Oct 11, 2022
@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Oct 11, 2022
@utam0k
Contributor

utam0k commented Nov 8, 2022

I'll have to give it a try, but what about hooking the mount with seccomp notify? Wouldn't that device then be visible from the host?

func (h *InWorkspaceHandler) Mount(req *libseccomp.ScmpNotifReq) (val uint64, errno int32, flags uint32) {

@jenting
Contributor Author

jenting commented Nov 8, 2022

@utam0k We did try running the container with the command /.supervisor/workspacekit fsprep, which does not go through ring0/ring1/ring2. The idea is to not rely on ws-daemon but to use workspacekit to format the mounted block device.
However, it does not work. Related code

@Furisto
Member

Furisto commented Nov 8, 2022

Why does it need to be workspacekit that runs the fsprep? You could implement the operation in IWS and then call it from workspacekit.

@kylos101
Contributor

kylos101 commented Nov 8, 2022

Why does it need to be workspacekit that runs the fsprep? You could implement the operation in IWS and then call it from workspacekit.

@Furisto using IWS may put us back into the original issue, where a process above/outside the workspace container is involved with attach/detach of the disk.

See this comment for background.

@sagor999
Contributor

sagor999 commented Nov 8, 2022

I think there is a workaround for the permission issue.
We can run the mounting of the device in an init container that runs as a privileged container; each container has its own security context. I will implement that later today.
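
For reference, a hedged sketch of what such an init container could look like in the pod spec ws-manager builds (k8s.io/api types; the image, command, device path, and volume names are placeholders, not the real values):

package manager

import (
	corev1 "k8s.io/api/core/v1"
)

// fsprepInitContainer sketches a privileged init container that formats the PVC's raw
// block device and mounts it, while the workspace container itself stays unprivileged
// (each container has its own security context).
func fsprepInitContainer() corev1.Container {
	privileged := true
	return corev1.Container{
		Name:    "fsprep",
		Image:   "example.com/fsprep:placeholder", // hypothetical image
		Command: []string{"sh", "-c", "mkfs.ext4 -F /dev/workspace && mount /dev/workspace /workspace"},
		SecurityContext: &corev1.SecurityContext{
			Privileged: &privileged,
		},
		// The PVC is exposed to this container as a raw block device...
		VolumeDevices: []corev1.VolumeDevice{
			{Name: "vol-this-workspace", DevicePath: "/dev/workspace"},
		},
		// ...and the formatted filesystem is mounted at the workspace path.
		VolumeMounts: []corev1.VolumeMount{
			{Name: "shared-workspace-dir", MountPath: "/workspace"},
		},
	}
}

As the next comment notes, the catch is that a mount made inside the init container does not survive into the workspace container.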

@sagor999
Contributor

sagor999 commented Nov 8, 2022

The init container idea did not work. The mount does not persist between the init container and the regular container. :(
Or at least I do not know how to make it persist. The mount itself works in the init container (since I can run it as a root privileged container); I just do not know how to make it persist between containers, if that is even possible.

@sagor999 sagor999 removed their assignment Nov 9, 2022
@sagor999
Contributor

sagor999 commented Nov 9, 2022

I am out of ideas, so for now removing myself as assignee.

@utam0k
Contributor

utam0k commented Nov 9, 2022

@utam0k We did try running the container with the command /.supervisor/workspacekit fsprep, which does not go through ring0/ring1/ring2. The idea is to not rely on ws-daemon but to use workspacekit to format the mounted block device. However, it does not work. Related code

@jenting
First, from a security point of view, we don't allow running the workspace pod as host root. It is too dangerous.

Do you know which system call permissions are missing? Could it be a system call that is prohibited by seccomp?
https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/seccomp-profile-installer/main.go

@utam0k
Contributor

utam0k commented Nov 9, 2022

@Furisto using IWS may put us back into the original issue, where a process above/outside the workspace container is involved with attach/detach of the disk.

The IWS server has nsinsider, which creates a temporary process and puts it into an arbitrary Linux mount namespace (and other namespaces), e.g. the workspace's. Could this be used?
Sample code:

err = nsi.Nsinsider(wbs.Session.InstanceID, int(containerPID), func(c *exec.Cmd) {
	c.Args = append(c.Args, "prepare-dev", "--uid", strconv.Itoa(wsinit.GitpodUID), "--gid", strconv.Itoa(wsinit.GitpodGID))
})

nsinsider is basically just a temporary process; it terminates once its work is done.

cfg := NsinsiderOpts{
	MountNS: true,
}
for _, o := range opts {
	o(&cfg)
}
base, err := os.Executable()
if err != nil {
	return err
}

// collect the namespace files of the target process that nsinsider should enter
type mnt struct {
	Env    string
	Source string
	Flags  int
}
var nss []mnt
if cfg.MountNS {
	tpid := targetPid
	if cfg.MountNSPid != 0 {
		tpid = cfg.MountNSPid
	}
	nss = append(nss,
		mnt{"_LIBNSENTER_ROOTFD", fmt.Sprintf("/proc/%d/root", tpid), unix.O_PATH},
		mnt{"_LIBNSENTER_CWDFD", fmt.Sprintf("/proc/%d/cwd", tpid), unix.O_PATH},
		mnt{"_LIBNSENTER_MNTNSFD", fmt.Sprintf("/proc/%d/ns/mnt", tpid), os.O_RDONLY},
	)
}
if cfg.PidNS {
	nss = append(nss, mnt{"_LIBNSENTER_PIDNSFD", fmt.Sprintf("/proc/%d/ns/pid", targetPid), os.O_RDONLY})
}
if cfg.NetNS {
	nss = append(nss, mnt{"_LIBNSENTER_NETNSFD", fmt.Sprintf("/proc/%d/ns/net", targetPid), os.O_RDONLY})
}

// pass the namespace files to nsinsider as extra FDs, referenced via the _LIBNSENTER_* env vars
stdioFdCount := 3
cmd := exec.Command(filepath.Join(filepath.Dir(base), "nsinsider"))
mod(cmd)
cmd.Env = append(cmd.Env, "_LIBNSENTER_INIT=1", "GITPOD_INSTANCE_ID="+instanceID)
for _, ns := range nss {
	f, err := os.OpenFile(ns.Source, ns.Flags, 0)
	if err != nil {
		return fmt.Errorf("cannot open %s: %w", ns.Source, err)
	}
	defer f.Close()
	cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%d", ns.Env, stdioFdCount+len(cmd.ExtraFiles)))
	cmd.ExtraFiles = append(cmd.ExtraFiles, f)
}

// run nsinsider and forward its buffered log output
var cmdOut bytes.Buffer
cmd.Stdout = &cmdOut
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
err = cmd.Run()
log.FromBuffer(&cmdOut, log.WithFields(log.OWI("", "", instanceID)))
if err != nil {
	out, oErr := cmd.CombinedOutput()
	if oErr != nil {
		return fmt.Errorf("run nsinsider (%v) \n%v\n output error: %v",
			cmd.Args,
			err,
			oErr,
		)
	}
	return fmt.Errorf("run nsinsider (%v) failed: %q\n%v",
		cmd.Args,
		string(out),
		err,
	)
}
return nil
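
For the block-device case being discussed, a hypothetical call along the same lines could look like this (the import path, the fsprep subcommand, and its flags are assumptions mirroring the prepare-dev example above):

package iws

import (
	"os/exec"

	nsi "github.com/gitpod-io/gitpod/ws-daemon/pkg/nsinsider" // assumed import path
)

// formatAndMountBlockDevice asks nsinsider to enter the workspace container's mount
// namespace and run a (hypothetical) fsprep subcommand that formats the attached block
// device and mounts it at /workspace.
func formatAndMountBlockDevice(instanceID string, containerPID int) error {
	return nsi.Nsinsider(instanceID, containerPID, func(c *exec.Cmd) {
		c.Args = append(c.Args, "fsprep", "--device", "/dev/sdX", "--target", "/workspace") // placeholder device path
	})
}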

@jenting
Contributor Author

jenting commented Nov 9, 2022

The IWS server has nsinsider, which creates a temporary process and puts it into an arbitrary Linux mount namespace (and other namespaces), e.g. the workspace's. Could this be used?

Thank you @utam0k. It works as expected; reference branch.
The remaining problem is that the UID/GID of some files/folders is incorrect.

But we could rerun the loadgen first to see whether switching the PVC to a block device fixes this issue.
I am also worried that using IWS to mount/format the disk does not fix this issue, because of the concern in #13353 (comment).
Or perhaps we need to unmount the disk when ring2 or workspacekit stops 🤔.

cc @sagor999

@utam0k
Contributor

utam0k commented Nov 9, 2022

@jenting I'm glad to hear that 😍 Is there any other help you need now?

@jenting
Contributor Author

jenting commented Nov 10, 2022

I tried another approach: using an init container to mount the block device and format it.
This way, the PVC feature no longer relies on IWS. Here is the branch.

@kylos101
Contributor

Awesome, nice work @sagor999 and @jenting !!! 🚀

@jenting for Friday, can you:

  1. test that you can restart workspaces?
  2. assert integration tests pass?
  3. see if you can recreate the issue with loadgen?

Good luck! 🍀

@jenting
Contributor Author

jenting commented Nov 11, 2022

test that you can restart workspaces?

No. 😢

The workspace pod must execute fsfreeze --freeze /workspace on the block device mount point /workspace before requesting Kubernetes to create the PVC snapshot. Otherwise, the workspace content does not exist in the PVC snapshot (VolumeSnapshot).

Given the above, IWS needs to nsinsider into the workspace pod to execute fsfreeze --freeze /workspace before the workspace disposal process. That makes the PVC feature depend on ws-daemon, which does not fit our PVC design.
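
For illustration, a hedged sketch of that IWS-driven step (using util-linux nsenter here for brevity, where the real code would go through nsinsider; the helper name is made up):

package iws

import (
	"fmt"
	"os/exec"
)

// freezeWorkspaceFS enters the workspace container's mount namespace and freezes the
// filesystem mounted at /workspace so the PVC snapshot captures a consistent state.
// The matching "fsfreeze --unfreeze /workspace" after the snapshot is omitted.
func freezeWorkspaceFS(containerPID int) error {
	cmd := exec.Command("nsenter",
		"-t", fmt.Sprint(containerPID), // target the workspace container's PID
		"-m", // join its mount namespace
		"fsfreeze", "--freeze", "/workspace")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("fsfreeze failed: %v: %s", err, out)
	}
	return nil
}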

assert integration tests pass?

No. 😢 I ran into the incorrect file/folder UID/GID issue.

see if you can recreate the issue with loadgen?

I haven't run the loadgen because, with the current code, the workspace can't restart.

@sagor999 @WVerlaek Is it possible to run a hook when metadata.deletionTimestamp is set? Then we wouldn't have to rely on IWS to execute the fsfreeze --freeze /workspace command; instead, the pod could execute it by itself. As far as I know, there is a preStop hook, but it is triggered before the workspace pod is terminated, so it does not fit this use case.

@WVerlaek
Member

@jenting a bit of a hack, but would it be possible for the preStop hook to send a SIGTERM to the workspace entrypoint and wait until it exits before running fsfreeze? I'm not sure if k8s will let the preStop hook continue to run after the container itself exits, though.
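
For what it's worth, a hedged sketch of that hack as a preStop hook on the workspace container (k8s.io/api types; assumes the entrypoint is PID 1 in the container and that the hook keeps running long enough, which is exactly the open question):

package manager

import (
	corev1 "k8s.io/api/core/v1"
)

// preStopFreezeHook signals the workspace entrypoint, waits for it to exit, then
// freezes /workspace so the snapshot taken afterwards is consistent.
func preStopFreezeHook() *corev1.Lifecycle {
	return &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"sh", "-c",
					"kill -TERM 1 && " + // SIGTERM to the entrypoint
						"while kill -0 1 2>/dev/null; do sleep 1; done && " + // wait until it exits
						"fsfreeze --freeze /workspace"},
			},
		},
	}
}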

@kylos101
Contributor

kylos101 commented Nov 11, 2022

Related: kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608 (comment)

@kylos101
Contributor

@sagor999 @jenting this comment is super interesting.

Specifically:

To reiterate, you're only hitting this problem if you're detaching disks (either manually, or because of nodes getting into an unusually overloaded state) while there is active IO on the device.

For each approach we've tried, how do we know processes aren't using the disk/device at workspace stop, before we detach the PVC?
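
One way to answer that, as a sketch (not something we currently do): before detaching, walk /proc on the node and check whether any process still holds a file descriptor under the mount point:

package manager

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// openFDsUnder lists processes that still hold file descriptors under mountPoint
// (e.g. /workspace). It reads /proc/<pid>/fd symlinks, so it has to run on the node
// (or in the right PID namespace) with enough privileges.
func openFDsUnder(mountPoint string) ([]int, error) {
	procs, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, p := range procs {
		var pid int
		if _, err := fmt.Sscanf(p.Name(), "%d", &pid); err != nil {
			continue // not a process directory
		}
		fds, err := os.ReadDir(filepath.Join("/proc", p.Name(), "fd"))
		if err != nil {
			continue // process exited or no permission
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join("/proc", p.Name(), "fd", fd.Name()))
			if err == nil && strings.HasPrefix(target, mountPoint) {
				pids = append(pids, pid)
				break
			}
		}
	}
	return pids, nil
}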

@sagor999
Contributor

We only detach the PVC when the container is stopped.
So there is probably some zombie/ghost process that might still be doing IO after the container stopped. 🤔

@kylos101
Contributor

We only detach the PVC when the container is stopped. So there is probably some zombie/ghost process that might still be doing IO after the container stopped. 🤔

👍 Hmm, I wonder if this is no longer an issue now that an init container is mounting and formatting the block device. Any ideas on how to resolve the UID/GID issue, @sagor999?

@jenting
Contributor Author

jenting commented Nov 15, 2022

👍 Hmm, I wonder if this is no longer an issue now that an init container is mounting and formatting the block device. Any ideas on how to resolve the UID/GID issue, @sagor999?

@kylos101
The UID/GID issue is fine. The big problem/challenge is that ws-manager has to ask the workspace pod to execute fsfreeze before the workspace disposal process, which couples ws-manager to ws-daemon again.

@jenting
Contributor Author

jenting commented Nov 16, 2022

The frequency of occurrence on gen75; see the log.

@sagor999
Contributor

Removed from workspace project for now as we are rethinking our PVC approach.

@jenting jenting removed their assignment Nov 24, 2022
@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

1 similar comment
@stale

stale bot commented Jun 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale label Jun 17, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 22, 2024