Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filesystem is not resized when restoring from snapshot/cloning to larger size than origin #972

Merged
merged 3 commits into from
Oct 25, 2022

Conversation

RomanBednar
Copy link
Contributor

@RomanBednar RomanBednar commented May 3, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Volume filesystem size is not expanded during clone if pvc data source is larger than original volume.
Similarly when restoring a snapshot to a pvc with larger size the filesystem is not expanded either.

This patch uses mount-utils to handle volume size check and resizing.

Which issue(s) this PR fixes:

Fixes #784

Special notes for your reviewer:
Here is how I verified the patch manually.

BEFORE PATCH:

1.Create pvc with 1Gi/pod

2.Create clone pvc with 5Gi/pod

$ oc get pvc/clone-of-pvc-1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
    volume.kubernetes.io/selected-node: ci-ln-469g9qt-72292-bp8k2-worker-a-jmrcv
    volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
  creationTimestamp: "2022-04-11T06:30:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: clone-of-pvc-1
  namespace: test
  resourceVersion: "40853"
  uid: 4d439336-35b1-4836-8ece-2c21ce36d1b2
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: null
    kind: PersistentVolumeClaim
    name: pvc2
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard-csi
  volumeMode: Filesystem
  volumeName: pvc-4d439336-35b1-4836-8ece-2c21ce36d1b2
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  phase: Bound
  
3. Check PVCs

$ oc get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
clone-of-pvc-1   Bound    pvc-4d439336-35b1-4836-8ece-2c21ce36d1b2   5Gi        RWO            standard-csi   59m
pvc2             Bound    pvc-9508e898-d958-412b-b2c8-4523f87d7b25   1Gi        RWO            standard-csi   69m

4. The filesystem was not resized in the pod (5Gi expected)

oc exec mypodclone -t -i -- df -h /tmp
Filesystem                Size      Used Available Use% Mounted on
/dev/sdc                975.9M      2.5M    957.4M   0% /tmp

AFTER PATCH:

1. Create pvc with 1Gi/pod

2. Create snapshot of the pvc

3. Create clone of the pvc with 5Gi/pod

4. Restore the snapshot to a new pvc with 5Gi/pod

5. Create a pod that mounts both clone and restored snapshot pvc

6. Check that the filesystem was resized correctly in the pod

$ oc get pvc
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-1                    Bound    pvc-71b5bb22-2ca7-4150-af44-b5b0b48111a7   1Gi        RWO            standard-csi   15m
pvc-1-clone              Bound    pvc-e6ebb004-baba-446b-b5f5-90bac00240c9   5Gi        RWO            standard-csi   14m
pvc-1-snapshot-restore   Bound    pvc-6d325c59-c6e1-428b-b83d-958c78c71964   5Gi        RWO            standard-csi   13m

$ oc exec task-pv-pod-2 -t -i -- df -h | grep -E "clone|restore"
/dev/sdd        4.9G  4.0M  4.9G   1% /tmp/clone
/dev/sdc        4.9G  4.0M  4.9G   1% /tmp/restore

Does this PR introduce a user-facing change?:

Filesystem is now expanded correctly when restoring from a snapshot or cloning a volume with larger size than the origin volume.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 3, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @RomanBednar!

It looks like this is your first PR to kubernetes-sigs/gcp-compute-persistent-disk-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gcp-compute-persistent-disk-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 3, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @RomanBednar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from jingxu97 and mattcary May 3, 2022 08:09
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 3, 2022
@RomanBednar
Copy link
Contributor Author

@mattcary Hi, I've noticed there are tests for NodeExpandVolume that are commented out with a note about fakeexec being too brittle:

// TODO: This test is too brittle due to the fakeexec package not being

But, by adding mount-utils resizer to NodeStageVolume I broke the existing unit tests for this function because there are actual command executions in the resize flow and so I used the fakeexec - just to have some working example. Better approach might be using a mock interface instead. What do you think?

@mattcary
Copy link
Contributor

mattcary commented May 3, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 3, 2022
@mattcary
Copy link
Contributor

mattcary commented May 3, 2022

I dunno, creating a mock that would actually catch errors or regressions is very hard. Maybe just stick with fakeexec for now and we'll see how it does.

This should be exercised in e2e tests as it's part of the csi spec (but probably isn't?)

@RomanBednar
Copy link
Contributor Author

I dunno, creating a mock that would actually catch errors or regressions is very hard. Maybe just stick with fakeexec for now and we'll see how it does.

This should be exercised in e2e tests as it's part of the csi spec (but probably isn't?)

We've investigated the tests with @jsafrane and we currently see two issues:

  1. GCP driver sanity tests use fakes (mounter+cloud provider)
    This causes the resize code to fail on command executions because it can't read real size of the device.
    However, fakeexec with predefined command sequence (like I used in unit tests) is probably not an option
    for sanity tests because it shares the fake mounter across many other tests.

    https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/test/sanity/sanity_test.go#L62

  2. CSI e2e tests currently don't excercise resize code for NodeStageVolume
    Update of the tests is planned in scope of the snaphot restore/resize effort. It is mentioned here as the last
    item on the checklist: Filesystem resize is required after snapshot restore / volume clone kubernetes/kubernetes#94929

I'll look into it and update the CSI tests accordingly once we resolve the sanity test issue. Another catch is that adding csi e2e tests will take quite some time before we can use it (until next Kubernetes release at best). @mattcary What would you suggest to fix the sanity tests? And do we want to wait with this patch until e2e tests are available?

@mattcary
Copy link
Contributor

Thanks for laying out the state of things so clearly!

Would it be easy to add a gcp-persistent-disk-csi-driver/test/e2e test? (confusingly this is not the k/k e2e test)

It should be about as straightfoward as adding a sanity check, except it uses a real cloud provider etc so no faking is necessary.

@jsafrane
Copy link
Contributor

jsafrane commented May 11, 2022

@mattcary, while it would be possible to add a new test to test/e2e, it's not easy to fix e2e/sanity - all NodeStage calls there will fail, because the fake mounter / exec does not return a valid device size when asked. To me it looks like fixing that would take a huge amount of work with a doubtful result.

Out of curiosity, is there any reason why don't you run pull-gcp-compute-persistent-disk-csi-driver-sanity tests with a real cloud + mounter + exec?
In addition, running k/k tests would be beneficial too, you are able to install the driver and run the tests in k/k CI, why not here?

@mattcary
Copy link
Contributor

@mattcary, while it would be possible to add a new test to test/e2e, it's not easy to fix e2e/sanity - all NodeStage calls there will fail, because the fake mounter / exec does not return a valid device size when asked. To me it looks like fixing that would take a huge amount of work with a doubtful result.

I agree. Sorry for not making that more clear---I don't think we should try to fix sanity, I think we should just put it in an e2e test. That would be skipping resize tests in sanity (the other tests will continue to work, or have those been broken too?)

I'm not actually familiar with the sanity tests, I've never had a reason to look at them. If we do run them in a cloud context, would they substantially duplicate the e2e tests? I'll have to look into them. I suppose our e2e tests hit GCE specific features that aren't covered by sanity, and sanity does more thorough generic CSI testing. If my supposition is true I think it would probably be a good idea to run them with real resources. That's been reliable in the e2e tests, and I've never found fakes a very convincing test strategy.

Out of curiosity, is there any reason why don't you run pull-gcp-compute-persistent-disk-csi-driver-sanity tests with a real cloud + mounter + exec? In addition, running k/k tests would be beneficial too, you are able to install the driver and run the tests in k/k CI, why not here?

test/k8s-integration runs the k/k tests. There's real value in splitting those out from e2e tests as they take a lot longer to run and are way way flakier. Just bringing up a cluster consistently across k8s minor versions, different versions of kubetest, etc, is a lot of toil.

@jsafrane
Copy link
Contributor

jsafrane commented May 12, 2022

That would be skipping resize tests in sanity (the other tests will continue to work, or have those been broken too?)

Unfortunately, all sanity tests that call NodeStage will fail, because NodeStage is not able to read device size with the fake exec.

If we do run them in a cloud context, would they substantially duplicate the e2e tests?

There definitely is some overlap in your custom e2e tests, kubernetes-integration and sanity test. The sanity tests add some checks that cannot be easily tested in Kubernetes, like that the driver returns the right error code when a volume does not exist, staging path does not exist and so on. If you want to use sanity tests, then you should run them against a real CSI driver in a real cloud. On the other hand, you don't need Kubernetes to run the tests (unless the driver itself needs to talk to Kubernetes). You need a VM in GCE, run the CSI driver in the VM (either in a container or directly on the host, it does not really matter) and poke the driver's CSI socket with the sanity tests. It will provision real volumes, attach them to the VM and mount them.

@mattcary
Copy link
Contributor

See #991 and kubernetes/test-infra#26354, we're making the sanity tests optional while we sort this out.

I think that unblocks this PR? (as soon as the test-infra change goes in)

@jsafrane
Copy link
Contributor

/refresh
/retest

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@mattcary
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@RomanBednar RomanBednar force-pushed the snapshot-expansion branch 2 times, most recently from b925ef8 to 0cc341d Compare September 30, 2022 15:37
NodeStageVolume() which is called few lines above in
NodeStageExt4Volume() can now resize filesystem if it's needed so this
check is no longer valid.
@mattcary
Copy link
Contributor

Thanks for the review @mauriciopoppe !

/lgtm
/approve

(and of course thanks for the PR @RomanBednar :-)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattcary, RomanBednar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2022
@k8s-ci-robot k8s-ci-robot merged commit b4f6d95 into kubernetes-sigs:master Oct 25, 2022
RomanBednar added a commit to RomanBednar/release that referenced this pull request Feb 27, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
openshift-merge-robot pushed a commit to openshift/release that referenced this pull request Mar 17, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
bmanzari pushed a commit to bmanzari/release that referenced this pull request Mar 30, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restoring from a snapshot to a larger size
6 participants