
rbd: add support for adjusting the number of RBD mirror image watchers #4312

Closed · togoschi opened this issue Dec 11, 2023 · 13 comments · Fixed by #4566

Assignees: Madhu-1
Labels: component/rbd (Issues related to RBD), wontfix (This will not be worked on)

Comments

@togoschi

Since ceph-csi release v3.3.1, the RBD volume attachment workflow allows for 2 watchers on the image when mirroring primary is true (#1993).

If you operate your Ceph cluster environment with more than one rbd-mirror daemon (e.g. for high availability, and/or because the cluster is stretched over more than one datacenter), you may want to adjust the number of image watchers that are checked on the image.

It's required to run at least 2 rbd-mirror daemons if you want to ensure non-disruptive RBD mirroring. This is usually not a problem for the "in use" logic of mirrored images, because normally only 1 rbd-mirror client watches the image. But we observed volume attachment issues with 2 rbd-mirror clients when a mirror daemon leader becomes temporarily unresponsive (for example due to heavy load in the peering cluster). Thanks to redundancy this is not a serious issue for the RBD mirroring itself, but it is one for the volume attachment of re-scheduled stateful pods.

In addition, in some cases you may even want to run more than 2 rbd-mirror daemons to balance the leaders over multiple RBD pools. Such a setup is not possible with the current implementation because, with 3 daemons for example, you have 2 watchers on every image (under normal operation).

The current implementation tolerates only one rbd-mirror watcher if mirroring primary is true (https://github.com/ceph/ceph-csi/blob/v3.10.0/internal/rbd/rbd_util.go#L553-L559). I think it would be very helpful to make the number of tolerated rbd-mirror watchers adjustable for more complex Ceph environments.
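For context, a simplified go-ceph sketch of the kind of check those lines perform; the function name and structure here are illustrative, not the actual ceph-csi code:

```go
package watcher

import (
	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// imageInUse approximates the v3.10.0 behaviour described above: one watcher
// is expected from opening the image here, plus exactly one rbd-mirror
// watcher when the image is a mirrored primary. Any additional watcher makes
// the image count as "in use" and blocks the attachment.
func imageInUse(ioctx *rados.IOContext, name string) (bool, error) {
	image, err := rbd.OpenImage(ioctx, name, rbd.NoSnapshot)
	if err != nil {
		return false, err
	}
	defer image.Close()

	watchers, err := image.ListWatchers()
	if err != nil {
		return false, err
	}

	mirrorInfo, err := image.GetMirrorImageInfo()
	if err != nil {
		return false, err
	}

	expected := 1 // the watch added by opening the image in this function
	if mirrorInfo.State == rbd.MirrorImageEnabled && mirrorInfo.Primary {
		expected++ // hardcoded allowance for a single rbd-mirror watcher
	}

	return len(watchers) > expected, nil
}
```

With more than one rbd-mirror daemon watching the primary, len(watchers) exceeds the expected count even when no node has the volume attached, which is exactly the attachment failure described above.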

nixpanic added the component/rbd (Issues related to RBD) label Dec 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions bot added the wontfix (This will not be worked on) label Jan 11, 2024
@simon-wessel

Support for this would be great. The hardcoded value does not seem to be a clean solution.

@Madhu-1 Madhu-1 removed the wontfix This will not be worked on label Jan 15, 2024
@Madhu-1 (Collaborator) commented Jan 15, 2024

@simon-wessel @togoschi sure, will make it configurable

Madhu-1 added this to the release-v3.11.0 milestone Jan 15, 2024
@togoschi (Author)

@Madhu-1 thanks for taking care of it. A viable approach could be to include the ability to configure an IP address array of all RBD mirror daemons.

@Madhu-1 (Collaborator) commented Jan 15, 2024

> @Madhu-1 thanks for taking care of it. A viable approach could be to include the ability to configure an IP address array of all RBD mirror daemons.

@togoschi I was thinking of adding a new key like mirrorDaemonCount (something like that) where the user can specify the count of daemons running, next to the existing "radosNamespace": "<rados-namespace>" entry in the cluster configuration. I am not yet sure that including IP addresses would have any benefit, since in the container world the IPs always change dynamically. Feel free to send a patch with your suggestion, as the community is always a priority :)
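For illustration, such a key could sit next to radosNamespace in the per-cluster entry of the csi-config-map. Both the name mirrorDaemonCount and its placement below are hypothetical, taken from the suggestion above rather than from a merged schema:

```json
[
  {
    "clusterID": "<cluster-id>",
    "monitors": ["<monitor-address>:6789"],
    "rbd": {
      "radosNamespace": "<rados-namespace>",
      "mirrorDaemonCount": 2
    }
  }
]
```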

@togoschi (Author)

@Madhu-1 of course you're completely right; setting the number of mirror daemons is the better approach.

Madhu-1 self-assigned this Jan 16, 2024

github-actions bot added the wontfix label Feb 15, 2024
Rakshith-R removed the wontfix label Feb 16, 2024

github-actions bot added the wontfix label Mar 17, 2024
@simon-wessel

@Madhu-1 Is this still on the roadmap?

@Madhu-1 (Collaborator) commented Mar 18, 2024

> @Madhu-1 Is this still on the roadmap?

@simon-wessel will try to fix it in the coming week

github-actions bot removed the wontfix label Mar 18, 2024

github-actions bot added the wontfix label Apr 17, 2024
@simon-wessel

Looking forward to the fix 🙂

Madhu-1 added a commit to Madhu-1/ceph-csi that referenced this issue Apr 18, 2024
Currently we assume that only one rbd-mirror daemon is
running on the Ceph cluster, but that is not true in many
cases; there can be more than one. This PR makes this a
configurable parameter.

fixes: ceph#4312

Signed-off-by: Madhu Rajanna <[email protected]>
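In terms of the sketch earlier in this thread, the change the commit describes boils down to replacing the hardcoded increment with the configured daemon count. A minimal sketch, assuming the value is called mirrorDaemonCount and comes from the per-cluster configuration (not the literal merged code):

```go
package watcher

import "github.com/ceph/go-ceph/rbd"

// expectedWatchers returns how many watchers are expected on an image that is
// not attached anywhere: our own watch from opening it, plus one per
// rbd-mirror daemon when the image is a mirrored primary. mirrorDaemonCount
// is a hypothetical name for the configured number of rbd-mirror daemons.
func expectedWatchers(mirrorInfo *rbd.MirrorImageInfo, mirrorDaemonCount int) int {
	expected := 1 // the watch added by opening the image ourselves
	if mirrorInfo != nil && mirrorInfo.State == rbd.MirrorImageEnabled && mirrorInfo.Primary {
		// assume each running rbd-mirror daemon may hold a watch on the primary
		expected += mirrorDaemonCount
	}
	return expected
}
```

An image would then be treated as "in use" only when len(watchers) exceeds this expected count, so running 2 or 3 rbd-mirror daemons no longer blocks the attachment of re-scheduled pods.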
@Madhu-1 (Collaborator) commented Apr 18, 2024

@simon-wessel Sorry, I wasn't able to work on it earlier. I have created a PR to make it configurable.

mergify bot closed this as completed in #4566 Apr 22, 2024
mergify bot pushed a commit that referenced this issue Apr 22, 2024