resource usage with forks #921

Closed
kfox1111 opened this issue Apr 7, 2020 · 8 comments

Labels: enhancement (New feature or request), wontfix (This will not be worked on)

kfox1111 (Contributor) commented Apr 7, 2020

In the csi-rbdplugin container in the provisioner pod, we see large numbers of processes (rados and ceph) forking off during provisioning. This makes it difficult to set resource limits on the pods to guarantee availability. We need some mechanism to guarantee that only a certain number of children will be running at a time, so we can control the resource usage.
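
For context, a minimal sketch of the kind of per-container limit this is about (field values are illustrative only, not taken from the actual ceph-csi manifests):

```yaml
# Sketch only: per-container requests/limits on csi-rbdplugin in the
# provisioner pod. Values are illustrative; with an unbounded number of
# forked rados/ceph children, any limit picked here risks the container
# being OOM-killed or CPU-throttled.
containers:
  - name: csi-rbdplugin
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```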

nixpanic (Member) commented Apr 8, 2020

When we move to go-ceph for provisioning, the number of execs of external commands (like rados and rbd) will be minimized, if not removed completely. This avoids not only the exec of the specific command, but also the spawning of threads that a Ceph client does when connecting to a Ceph cluster. With go-ceph we will re-use existing connections (for volume management, not I/O), so resource consumption will change drastically when running many operations on PV(C)s.

(see also #449 and related issues and PRs)

nixpanic added the enhancement (New feature or request) label Apr 8, 2020
nixpanic self-assigned this Apr 8, 2020
kfox1111 (Contributor, Author) commented May 1, 2020

We're still running into this issue hard. https://github.com/ceph/ceph-csi/projects/3 makes it seem like the go-ceph based fix is still a long way out. How do we fix this in the meantime?

Can we limit the number of gRPC calls it handles at a time? That may help.

kfox1111 (Contributor, Author) commented May 1, 2020

Looks like the provisioner supports a flag:
--worker-threads : Number of simultaneously running ControllerCreateVolume and ControllerDeleteVolume operations. Default value is 100.

This default is way too high when every operation forks external processes, but setting it lower should fix the problem.
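
For illustration, a minimal sketch of passing the flag to the external-provisioner sidecar in the provisioner Deployment (container name, image tag and the other args are placeholders, not the exact ceph-csi manifest):

```yaml
# Sketch only: capping concurrent provisioning operations on the
# external-provisioner sidecar. Image tag and surrounding args are
# placeholders; --worker-threads is the flag under discussion.
containers:
  - name: csi-provisioner
    image: quay.io/k8scsi/csi-provisioner:v1.6.0   # placeholder tag
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"   # limit simultaneous Create/DeleteVolume operations
```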

Currently the helm chart does not allow this to be set. Can we please add support for that?
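
Purely as a hypothetical sketch of what chart support could look like (the values key below is invented for illustration and does not exist in the chart):

```yaml
# Hypothetical values.yaml snippet; the key name is invented for illustration
# and would need to match whatever the chart maintainers decide on.
provisioner:
  workerThreads: 4   # rendered as --worker-threads=4 on the csi-provisioner sidecar
```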

Thanks!

kfox1111 (Contributor, Author) commented May 1, 2020

I can confirm that on my cluster, once I patched in --worker-threads=4, the cluster became reliable.

kfox1111 (Contributor, Author) commented May 4, 2020

There may be other --worker-threads options in other sidecars. We should set them all for better reliability.
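
A hedged sketch of what that could look like on the other sidecars (container names are placeholders; the exact flag spelling should be verified against each sidecar's own --help, since they are not all identical):

```yaml
# Sketch only: similar concurrency caps on other sidecar containers.
# Verify the exact flag name per sidecar before relying on this.
containers:
  - name: csi-snapshotter
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"
  - name: csi-attacher
    args:
      - "--csi-address=$(ADDRESS)"
      - "--worker-threads=4"
```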

nixpanic (Member) commented:

There are also PRs #1033 and #1034 that add --worker-threads. In one of the PRs there is a discussion/question about the ideal value for this, and how testing was done.

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label Oct 4, 2020
stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed Oct 12, 2020