rbd: do not execute rbd sparsify when volume is in use #3985
Conversation
Should this also be backported to 3.8?
LGTM
@Mergifyio queue
✅ The pull request has been merged automatically at 98fdadf
This commit makes sure sparsify() is not run when the rbd image is in use. Running rbd sparsify while a workload is doing I/O, or running it too frequently, is not desirable. When an image is in use, fstrim is run instead; sparsify is run only when the image is not mapped. Signed-off-by: Rakshith R <[email protected]>
/test ci/centos/k8s-e2e-external-storage/1.25
/test ci/centos/k8s-e2e-external-storage/1.26
/test ci/centos/k8s-e2e-external-storage/1.27
/test ci/centos/mini-e2e-helm/k8s-1.25
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e-helm/k8s-1.27
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.27
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd
@@ -23,7 +23,18 @@ import (
// Sparsify checks the size of the objects in the RBD image and calls
// rbd_sparify() to free zero-filled blocks and reduce the storage consumption
// of the image.
// This function will return ErrImageInUse if the image is in use, since
// sparsifying an image on which i/o is in progress is not optimal.
I don't want to sound like a broken record, but I have to ask again: do you have any data that would support automating `rbd sparsify` runs at all? In particular:
- Have you done any research into how likely they are to be encountered?
- Do you have `ceph df` outputs captured before and after `rbd sparsify` is run for typical customer scenarios?

If the answer is no, I don't understand why `rbd sparsify` is being conditioned on `isInUse()` instead of just being dropped entirely (to be resurrected as an optional step when the relevant CSI addon spec allows taking options from the user). Sparsifying the image the way Ceph CSI does it is likely not optimal in general, not just when there is competing I/O.
On one heavily used cluster, I suspect I've seen 33% to 100% extra usage without sparsify. I hadn't had the chance to run it offline and haven't had the guts to run it online. This makes it sound like it was good for me to be hesitant.
Yeah -- I would suggest starting with `fstrim` (assuming you aren't using raw block PVs).
Describe what this PR does
This commit makes sure sparsify() is not run when the rbd image is in use.
Running rbd sparsify while a workload is doing I/O, or running it too frequently, is not desirable.
When an image is in use, fstrim is run instead; sparsify is run only when the image is not mapped.
Frequent rbd sparsify calls have been observed to cause mounted pods to get stuck.
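The fallback flow described above (sparsify when the image is idle, fstrim when it is mapped) can be sketched as follows. This is an illustrative sketch, not ceph-csi's actual API: `reclaimSpace`, both callbacks, and the `ErrImageInUse` sentinel are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrImageInUse is the assumed sentinel returned by the sparsify step when
// the rbd image is mapped somewhere.
var ErrImageInUse = errors.New("rbd image is in use")

// reclaimSpace tries sparsify first; if the image turns out to be in use,
// it falls back to an fstrim-style trim of the mounted filesystem, which
// is safe while I/O is in progress. It returns which operation ran.
func reclaimSpace(sparsify, fstrim func() error) (string, error) {
	err := sparsify()
	switch {
	case err == nil:
		return "sparsify", nil
	case errors.Is(err, ErrImageInUse):
		// Image is mapped: trim the filesystem instead of sparsifying.
		if terr := fstrim(); terr != nil {
			return "", fmt.Errorf("fstrim fallback failed: %w", terr)
		}
		return "fstrim", nil
	default:
		return "", err
	}
}
```

Keeping the two operations behind one entry point means the reclaim schedule stays the same; only the mechanism changes based on whether the image is mapped.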
Logs when PVC is not mounted to a Pod:
Logs when PVC is mounted to a Pod:
cc @idryomov
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
`/retest ci/centos/<job-name>`: retest the `<job-name>` after unrelated failure (please report the failure too!)