resource usage with forks #921

Closed
kfox1111 opened this issue Apr 7, 2020 · 8 comments

Labels: enhancement (New feature or request), wontfix (This will not be worked on)

kfox1111 (Contributor) commented Apr 7, 2020

In the csi-rbdplugin container in the provisioner pod, we see large numbers of processes (rados and ceph) forking off during provisioning. This makes it difficult to set resource limits on the pods to guarantee availability. We need some mechanism to guarantee that only a certain number of children will be running at a time, so we can control the resource usage.
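
For context, a minimal sketch of the kind of per-container limit this is about (field values are illustrative only, not taken from the actual ceph-csi manifests):

```yaml
# Sketch only: per-container requests/limits on csi-rbdplugin in the
# provisioner pod. Values are illustrative; with an unbounded number of
# forked rados/ceph children, any limit picked here risks the container
# being OOM-killed or CPU-throttled.
containers:
  - name: csi-rbdplugin
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```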

nixpanic (Member) commented Apr 8, 2020

When we move to go-ceph for provisioning, the number of execs of external commands (like rados and rbd) will be minimized, if not removed completely. This avoids not only the exec of the specific command, but also the spawning of threads that a Ceph client does when connecting to a Ceph cluster. With go-ceph we will re-use existing connections (for volume management, not I/O), so resource consumption will change drastically when running many operations on PV(C)s.

(see also #449 and related issues and PRs)

nixpanic added the enhancement (New feature or request) label Apr 8, 2020
nixpanic self-assigned this Apr 8, 2020
kfox1111 (Contributor, Author) commented May 1, 2020

We're still running into this issue hard. https://github.com/ceph/ceph-csi/projects/3 makes it seem like the go-ceph based fix is still a long way out. How do we fix this in the meantime?

Can we limit the number of gRPC calls it handles at a time? That may help.

kfox1111 (Contributor, Author) commented May 1, 2020

Looks like the provisioner supports a flag:
--worker-threads : Number of simultaneously running ControllerCreateVolume and ControllerDeleteVolume operations. Default value is 100.

This default is way too high when every operation forks external processes, but setting it lower should fix the problem.
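
For illustration, a minimal sketch of passing the flag to the external-provisioner sidecar in the provisioner Deployment (container name, image tag and the other args are placeholders, not the exact ceph-csi manifest):

```yaml
# Sketch only: capping concurrent provisioning operations on the
# external-provisioner sidecar. Image tag and surrounding args are
# placeholders; --worker-threads is the flag under discussion.
containers:
  - name: csi-provisioner
    image: quay.io/k8scsi/csi-provisioner:v1.6.0   # placeholder tag
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"   # limit simultaneous Create/DeleteVolume operations
```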

Currently the helm chart does not allow this to be set. Can we please add support for that?
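
Purely as a hypothetical sketch of what chart support could look like (the values key below is invented for illustration and does not exist in the chart):

```yaml
# Hypothetical values.yaml snippet; the key name is invented for illustration
# and would need to match whatever the chart maintainers decide on.
provisioner:
  workerThreads: 4   # rendered as --worker-threads=4 on the csi-provisioner sidecar
```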

Thanks!

kfox1111 (Contributor, Author) commented May 1, 2020

I can confirm that on my cluster, once I patched in --worker-threads=4, the cluster became reliable.

kfox1111 (Contributor, Author) commented May 4, 2020

There may be other --worker-threads options in other sidecars. We should set them all for better reliability.
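
A hedged sketch of what that could look like on the other sidecars (container names are placeholders; the exact flag spelling should be verified against each sidecar's own --help, since they are not all identical):

```yaml
# Sketch only: similar concurrency caps on other sidecar containers.
# Verify the exact flag name per sidecar before relying on this.
containers:
  - name: csi-snapshotter
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"
  - name: csi-attacher
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"
```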

nixpanic (Member) commented:

There are also PRs #1033 and #1034 that add --worker-threads. In one of the PRs there is a discussion/question about the ideal value for this, and how testing was done.

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label Oct 4, 2020
stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed Oct 12, 2020