
rbd: do not execute rbd sparsify when volume is in use #3985

Merged
merged 1 commit on Jul 11, 2023

Conversation

Rakshith-R
Contributor

Describe what this PR does

This commit makes sure sparsify() is not run when the rbd image is in use.
Running rbd sparsify too frequently while a workload is doing I/O is not desirable.
When an image is in use, fstrim is run instead; sparsify runs only when the image is not mapped.

Too frequent rbd sparsify calls have been seen to cause a mounted pod to get stuck.
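
For illustration, the behaviour described above roughly corresponds to a guard like the sketch below (a minimal Go sketch, not the actual ceph-csi code; rbdVolume, volumeInUse and sparsifyImage are assumed names standing in for the real helpers):

// Minimal sketch of the flow described above: skip sparsify when the
// image is in use and leave space reclamation to fstrim on the node.
// The rbdVolume type and the volumeInUse/sparsifyImage helpers are
// illustrative assumptions, not the real ceph-csi code.
package reclaimspace

import (
	"context"
	"log"
)

type rbdVolume struct {
	VolID string
}

// volumeInUse would check whether any client still has the RBD image
// open (for example by listing watchers); stubbed out here.
func volumeInUse(ctx context.Context, vol *rbdVolume) (bool, error) { return false, nil }

// sparsifyImage would perform the actual librbd sparsify operation; stubbed out here.
func sparsifyImage(ctx context.Context, vol *rbdVolume) error { return nil }

func controllerReclaimSpace(ctx context.Context, vol *rbdVolume) error {
	inUse, err := volumeInUse(ctx, vol)
	if err != nil {
		return err
	}
	if inUse {
		// Mirrors the log line shown below: fall back to fstrim on the
		// node instead of sparsifying behind an active workload.
		log.Printf("volume with ID %q is in use, skipping sparsify operation", vol.VolID)
		return nil
	}
	return sparsifyImage(ctx, vol)
}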

Logs when PVC is not mounted to a Pod:

I0711 07:05:45.339456       1 utils.go:195] ID: 21 GRPC call: /reclaimspace.ReclaimSpaceController/ControllerReclaimSpace
I0711 07:05:45.339635       1 utils.go:206] ID: 21 GRPC request: {"parameters":{"clusterID":"rook-ceph","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-9890e517-0f58-4b63-b7b6-5d2b6379ab1b","journalPool":"replicapool","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1689058250740-1371-rook-ceph.rbd.csi.ceph.com"},"secrets":"***stripped***","volume_id":"0001-0009-rook-ceph-0000000000000001-9890e517-0f58-4b63-b7b6-5d2b6379ab1b"}
I0711 07:05:45.348794       1 omap.go:88] ID: 21 got omap values: (pool="replicapool", namespace="", name="csi.volume.9890e517-0f58-4b63-b7b6-5d2b6379ab1b"): map[csi.imageid:138a8a79d1db csi.imagename:csi-vol-9890e517-0f58-4b63-b7b6-5d2b6379ab1b csi.volname:pvc-9949853a-3899-4cde-b2b1-8ee8fe908708 csi.volume.owner:rook-ceph]
I0711 07:05:46.034081       1 utils.go:212] ID: 21 GRPC response: {}

Logs when PVC is mounted to a Pod:

I0711 07:05:09.953239       1 utils.go:195] ID: 20 GRPC call: /reclaimspace.ReclaimSpaceController/ControllerReclaimSpace
I0711 07:05:09.956137       1 utils.go:206] ID: 20 GRPC request: {"parameters":{"clusterID":"rook-ceph","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-9890e517-0f58-4b63-b7b6-5d2b6379ab1b","journalPool":"replicapool","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1689058250740-1371-rook-ceph.rbd.csi.ceph.com"},"secrets":"***stripped***","volume_id":"0001-0009-rook-ceph-0000000000000001-9890e517-0f58-4b63-b7b6-5d2b6379ab1b"}
I0711 07:05:09.992341       1 omap.go:88] ID: 20 got omap values: (pool="replicapool", namespace="", name="csi.volume.9890e517-0f58-4b63-b7b6-5d2b6379ab1b"): map[csi.imageid:138a8a79d1db csi.imagename:csi-vol-9890e517-0f58-4b63-b7b6-5d2b6379ab1b csi.volname:pvc-9949853a-3899-4cde-b2b1-8ee8fe908708 csi.volume.owner:rook-ceph]
I0711 07:05:10.051898       1 reclaimspace.go:76] ID: 20 volume with ID "0001-0009-rook-ceph-0000000000000001-9890e517-0f58-4b63-b7b6-5d2b6379ab1b" is in use, skipping sparsify operation
I0711 07:05:10.052064       1 utils.go:212] ID: 20 GRPC response: {}

cc @idryomov

Available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@Rakshith-R Rakshith-R added component/rbd Issues related to RBD backport-to-release-v3.9 Label to backport from devel to release-v3.9 branch labels Jul 11, 2023
@Rakshith-R Rakshith-R requested review from nixpanic, Madhu-1 and a team July 11, 2023 07:20
@nixpanic
Member

Should this also be backported to 3.8?

@Rakshith-R Rakshith-R added the backport-to-release-v3.8 backport to release 3.8 branch label Jul 11, 2023
@Rakshith-R
Contributor Author

> Should this also be backported to 3.8?

👍 added label

@Rakshith-R Rakshith-R requested a review from a team July 11, 2023 09:45
@riya-singhal31 riya-singhal31 (Contributor) left a comment

LGTM

@nixpanic
Member

@Mergifyio queue

@mergify
Contributor

mergify bot commented Jul 11, 2023

queue

✅ The pull request has been merged automatically at 98fdadf

This commit makes sure sparsify() is not run when the rbd
image is in use.
Running rbd sparsify too frequently while a workload is
doing I/O is not desirable.
When an image is in use, fstrim is run instead; sparsify
runs only when the image is not mapped.

Signed-off-by: Rakshith R <[email protected]>
@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Jul 11, 2023
@ceph-csi-bot (Collaborator) posted the test-trigger commands:

/test ci/centos/k8s-e2e-external-storage/1.25
/test ci/centos/k8s-e2e-external-storage/1.26
/test ci/centos/k8s-e2e-external-storage/1.27
/test ci/centos/mini-e2e-helm/k8s-1.25
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e-helm/k8s-1.27
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.27
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Jul 11, 2023
@@ -23,7 +23,18 @@ import (
// Sparsify checks the size of the objects in the RBD image and calls
// rbd_sparify() to free zero-filled blocks and reduce the storage consumption
// of the image.
// This function will return ErrImageInUse if the image is in use, since
// sparsifying an image on which i/o is in progress is not optimal.
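
As a rough illustration of the documented behaviour (not the actual implementation; the hasWatchers and doSparsify helpers below are assumed stand-ins), a Sparsify method honouring ErrImageInUse could look like this:

// Sketch only: shows the documented contract of Sparsify returning
// ErrImageInUse when the image still has watchers.
package rbd

import (
	"context"
	"errors"
)

// ErrImageInUse is returned when sparsify is requested for an image
// that is still open by some client (i.e. mapped somewhere).
var ErrImageInUse = errors.New("rbd image is in use")

type rbdImage struct{} // connection details elided in this sketch

// hasWatchers would ask the cluster whether any client has the image
// open, e.g. by listing watchers on the image header; stubbed here.
func (ri *rbdImage) hasWatchers(ctx context.Context) (bool, error) { return false, nil }

// doSparsify would perform the actual librbd sparsify call; stubbed here.
func (ri *rbdImage) doSparsify(ctx context.Context) error { return nil }

// Sparsify frees zero-filled blocks, but refuses to run while the
// image is in use, as described in the doc comment above.
func (ri *rbdImage) Sparsify(ctx context.Context) error {
	inUse, err := ri.hasWatchers(ctx)
	if err != nil {
		return err
	}
	if inUse {
		return ErrImageInUse
	}
	return ri.doSparsify(ctx)
}
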
Contributor

I don't want to sound like a broken record, but I have to ask again: do you have any data that would support automating rbd sparsify runs at all? In particular:

  • Have you done any research into how likely they are to be encountered?
  • Do you have ceph df outputs captured before and after rbd sparsify is run for typical customer scenarios?

If the answer is no, I don't understand why rbd sparsify is being conditioned on isInUse() instead of just being dropped entirely (to be resurrected as an optional step when the relevant CSI addon spec allows taking options from the user). Sparsifying the image the way Ceph CSI does it is likely not optimal in general, not just when there is competing I/O.

Contributor

On one heavily used cluster, I suspect I've seen 33% to 100% extra usage without sparsify. I haven't had the chance to run it offline and haven't had the guts to run it online. This makes it sound like it was good for me to be hesitant.

Contributor

Yeah -- I would suggest starting with fstrim (assuming you aren't using raw block PVs).
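
For context, the fstrim path on a mounted volume amounts to something like the sketch below (illustrative only; the invocation is the standard fstrim command, but the helper name and mount path are hypothetical, not the ceph-csi node-plugin code):

// Sketch: discard unused filesystem blocks on a mounted volume,
// which is the fallback used while the RBD image is in use.
package main

import (
	"fmt"
	"os/exec"
)

// runFstrim runs `fstrim -v` against a filesystem mount point.
func runFstrim(mountPath string) error {
	out, err := exec.Command("fstrim", "-v", mountPath).CombinedOutput()
	if err != nil {
		return fmt.Errorf("fstrim %s failed: %w: %s", mountPath, err, out)
	}
	fmt.Printf("fstrim: %s", out)
	return nil
}

func main() {
	// Example invocation against a hypothetical kubelet mount path.
	_ = runFstrim("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/pvc-.../mount")
}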
