
Can the CSI RBD plugin catch up in terms of performance to native CLI calls? #449

Closed
ShyamsundarR opened this issue Jun 27, 2019 · 42 comments
Assignees
Labels
component/rbd (Issues related to RBD), enhancement (New feature or request), Priority-0 (highest priority issue), wontfix (This will not be worked on)

Comments

@ShyamsundarR
Contributor

As part of the discussion in PR #443, it was noted by @dillaman that native CLI calls outperform the plugin by a large factor.

As part of that discussion I noted that the plugin does more work than just an image create, and that there would be Kubernetes factors to consider.

As a result, this issue is opened to help with the analysis of native calls versus calls made by the plugin, to understand and possibly improve the performance of the plugin.

Various experiments may need to be conducted to arrive at an answer, and this issue can hopefully help track the progress.

@ShyamsundarR
Contributor Author

First test: a comparison of the numbers generated using PR #443 (provided in this comment) against the script below, which issues the same nature and order of calls as the plugin.

Script to mimic what the plugin does as of commit hash 59d3365d3b13f37a9b914d59d931c71269a21eec on master: raw-rbd-perf.sh.txt

Note: The script may be a bit rough but works. Execute a create like so: # time ./raw-rbd-perf.sh create 20, follow with a delete like so: # time ./raw-rbd-perf.sh delete 20, and repeat for multiple runs.

Note: The following numbers were generated on the same setup as the numbers in this comment. Further, the script was executed from within the csi-rbdplugin container in the csi-rbdplugin-provisioner-0 pod, to keep conditions as close to the plugin's as possible.

Script times:
Executed 20 creates using the script, Run1 real 0m32.030s Run2 real 0m38.861s
Executed 20 deletes using the script, Run1 real 0m29.651s Run2 real 0m28.819s

Further, the script was modified to not run the creates/deletes in parallel and instead perform them serially (just for a baseline); the following are the results:
Executed 20 serial creates using the script, Run1 real 2m0.101s Run2 real 2m48.814s
Executed 20 serial deletes using the script, Run1 real 4m25.355s Run2 real 3m56.247s

Observations:

  • Delete using the plugin was taking an average of 32 seconds, and using the native CLI/script takes about 29 seconds, so we are very close to what we can achieve for delete
  • Create shows a gap: the plugin takes about 55 seconds for 20 PVC creates while the native CLI/script takes 35 seconds. This possibly needs further analysis. The first order of analysis would be the average time per create by the plugin versus per create by the script, to remove any Kubernetes overheads

@dillaman

RBD deletes are highly dependent on the image size, since they require issuing RADOS deletes for every possible backing object (regardless of whether or not the objects actually exist). For example, with the default 4 MiB object size a 1 GiB image has 256 possible backing objects while a 10 GiB image has 2560. Therefore, a 1 GiB image deletion could be an order of magnitude faster than a 10 GiB image.

Longer term, that is why I am proposing moving long-running operations to a new MGR call so that the CSI can "fire and forget".

In terms of creation, I think if you re-implemented your script using the rados/rbd Python bindings, you would see a huge drop in time. This just brings us back to the previous talks about eventually replacing all CLI calls with golang API binding calls. For larger clusters, just the bootstrap time required for each CLI call (connect to MON, exchange keys, pull maps, etc.) can be the vast majority of the runtime.
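To illustrate the per-call bootstrap cost being discussed, here is a minimal Go sketch that shells out to the rbd CLI once per image, the way the script and the current plugin do. The pool name, monitor address and key file path are placeholders, not values from the tests above.

```go
// Hedged sketch: time N serial `rbd create` CLI calls. Every invocation spawns a
// new process and a new cluster session (MON connect, auth, map fetch) before it
// does any useful work, which is the overhead API bindings can avoid.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	const (
		pool     = "replicapool"        // placeholder pool name
		monitors = "192.168.121.5:6789" // placeholder monitor address
		keyfile  = "/etc/ceph/keyfile"  // placeholder key file path
		count    = 20
	)

	start := time.Now()
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("%s/test-img-%d", pool, i)
		cmd := exec.Command("rbd", "create", name, "--size", "1G",
			"--id", "admin", "-m", monitors, "--keyfile", keyfile)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Fatalf("rbd create %s failed: %v: %s", name, err, out)
		}
	}
	fmt.Printf("%d CLI creates took %s\n", count, time.Since(start))
}
```

Each loop iteration pays the full client bootstrap cost again, which is why per-call time grows with cluster size even though the image create itself is cheap.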

@ShyamsundarR
Contributor Author

With a golang program ceph-golib.go.txt that uses go-ceph, on a Vagrant-based kube+Rook (Ceph) setup on a laptop, the following measurements were taken:

NOTE: timing is coarse as it is based on date output before and after executing the command/script
NOTE: only serial operations were performed (i.e., create/delete one at a time), as I am now going to add goroutine parallelism to this

Test details: 3 runs per test, each run is an iteration of 25 creates or deletes (IOW 25 RBD images created and deleted in the end, including their associated RADOS OMap updates)

go-ceph based times for 3 runs:
Example command: ceph-golib -key=<key> -monitors=<mons> -operation delete -count 25
Create: 5/5/5 (seconds)
Average: 5 seconds
Delete: 26/29/29 (seconds)
Average: 28 seconds

Using the script as provided earlier, invoking the Ceph CLIs:
Example command: raw-rbd-perf.sh create 25 2>&1 > /dev/null
Create: 30/29/33 (seconds)
Average: 31 seconds
Delete: 57/59/56 (seconds)
Average: 57 seconds
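For context, a minimal sketch of the kind of go-ceph flow a test program like ceph-golib.go can use: one connection is opened up front, then images are created serially along with an OMap update. This is not the attached program itself; the pool, OMap object and key names are placeholders.

```go
// Hedged sketch: serial go-ceph image creates plus an OMap bookkeeping update,
// all over a single cluster connection, so per-operation cost excludes bootstrap.
package main

import (
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	// Monitors and keyring come from /etc/ceph/ceph.conf here; alternatively use
	// conn.SetConfigOption("mon_host", ...) and conn.SetConfigOption("key", ...).
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext("replicapool") // placeholder pool
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	const count = 25
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("csi-test-%d", i)
		// 1 GiB image, order 22 (4 MiB objects), no extra features.
		if _, err := rbd.Create(ioctx, name, 1<<30, 22); err != nil {
			log.Fatalf("create %s: %v", name, err)
		}
		// Mimic the plugin's RADOS OMap bookkeeping with a single key update.
		omap := map[string][]byte{"csi.volume." + name: []byte(name)}
		if err := ioctx.SetOmap("csi.volumes.default", omap); err != nil {
			log.Fatalf("omap update for %s: %v", name, err)
		}
	}
	// Deletes would use rbd.GetImage(ioctx, name).Remove() plus removal of the
	// matching OMap key.
	fmt.Printf("created %d images\n", count)
}
```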

@ShyamsundarR
Contributor Author

On the same setup as the above comment, here are some more times based on parallel invocation of creates and deletes from the golang ceph-golib.go version of the program.

NOTE: All tests are run for 2 iterations for 25 objects (PVCs/creates/deletes) and average per run is mentioned below,

PVC creates/deletes in parallel when using the CSI drivers (which invoke the various CLIs):
34 seconds for 25 creates in parallel
44 seconds for 25 deletes in parallel

golib program in parallel:
Create: 25 images : 2 seconds
Delete: 25 images : 10 seconds

Script in parallel:
create: 25 images : 24 seconds
delete: 25 images : 37 seconds


NOTE: Further tests would be based on testing in a real cluster, to understand the contribution of PV create/delete/attach times and whether improving this aspect would yield the best value for the effort.
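A rough sketch of the parallel variant used for numbers like the above: the same single connection and IOContext are shared by one goroutine per image create. Names and the pool are placeholders, and error handling is reduced to logging for brevity.

```go
// Hedged sketch: parallel go-ceph image creates over one shared connection. The
// comments in this thread describe reusing a single connection and IOContext; if
// concurrent use of one IOContext is a concern for a given librados version, an
// IOContext per goroutine over the same connection is a small change.
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext("replicapool") // placeholder pool
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	const count = 25
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("csi-test-%d", i)
			if _, err := rbd.Create(ioctx, name, 1<<30, 22); err != nil {
				log.Printf("create %s: %v", name, err)
			}
		}(i)
	}
	wg.Wait()
	fmt.Printf("issued %d parallel creates\n", count)
}
```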

@Madhu-1
Collaborator

Madhu-1 commented Aug 1, 2019

I have modified the ceph-csi code to use the go-ceph library.

Tested in kube; I was able to create and delete 100 RBD PVCs in less than 2 minutes:
100 PVC creation took 58 seconds
100 PVC deletion took 60 seconds

Note: These are parallel PVC operations.

@humblec @ShyamsundarR @dillaman FYI

@ShyamsundarR
Contributor Author

> I have modified the ceph-csi code to use the go-ceph library.
>
> Tested in kube; I was able to create and delete 100 RBD PVCs in less than 2 minutes:
> 100 PVC creation took 58 seconds
> 100 PVC deletion took 60 seconds
>
> Note: These are parallel PVC operations.
>
> @humblec @ShyamsundarR @dillaman FYI

@Madhu-1 is it an improvement over running it without the ceph-go library on the same setup? Currently, with the data above, I am not able to figure out if it is an improvement or the numbers are the same (or worse), as our setups would be different.

@Madhu-1
Collaborator

Madhu-1 commented Aug 2, 2019

Using the cephcsi:canary image, tested in kube:
100 PVC creation took 60 seconds
100 PVC deletion took 62 seconds

I don't see any performance improvement; we need to do the same testing on other setups to check whether we gain any performance improvement from using the go-ceph library.

cirros-create-with-pvc-without-io (copy).txt
attaching the script used for testing. @nehaberry Thanks for providing the script.

@ShyamsundarR
Contributor Author

Update:
Went through the code changes used for the go-ceph bindings in CSI and, between @Madhu-1 and me, revisited them to use a single connection and IOContext (sample code here). Briefly tested this on the setup used to get the numbers in this comment; the volume create time reported is on average 18 seconds (down from 34 seconds prior to this) for 25 parallel creates.

Need to do some further benchmarking on a non-laptop setup and also using gRPC to provide data on how this can help CSI operations when operating on parallel requests.

@humblec
Collaborator

humblec commented Aug 7, 2019

> Update:
> Went through the code changes used for the go-ceph bindings in CSI and, between @Madhu-1 and me, revisited them to use a single connection and IOContext (sample code here). Briefly tested this on the setup used to get the numbers in this comment; the volume create time reported is on average 18 seconds (down from 34 seconds prior to this) for 25 parallel creates.
>
> Need to do some further benchmarking on a non-laptop setup and also using gRPC to provide data on how this can help CSI operations when operating on parallel requests.

@Madhu-1 @ShyamsundarR if we are seeing a performance improvement on parallel requests, and if we think the go-ceph client is stable enough for at least the happy path of create and delete volume, let's bring it into the repo. I am fine with the same.

@ShyamsundarR
Contributor Author

ShyamsundarR commented Aug 8, 2019

> Need to do some further benchmarking on a non-laptop setup and also using gRPC to provide data on how this can help CSI operations when operating on parallel requests.

Completed the test and here are the results:

Legend:

<key>: <value>
Operation: time in seconds per operation
NOTE: For the node plugins, operation count and time are collected from 3 nodes, and hence represented as total-time/#calls-on-node + ... + ...; the average is that total divided by 3

1) Test results with existing code and added Prometheus gRPC metrics:

Actions in parallel: n = 25

NodePublishVolume: 1.27/10 + 0.76/7 + 0.73/8 = 0.11
NodeStageVolume: 18.59/10 + 25.20/7 + 26.43/8 = 2.9
NodeUnpublishVolume: 0.26/10 + 0.18/7 + 0.24/8 = 0.03
NodeUnstageVolume: 1.16/10 + 0.88/7 + 1.05/8 = 0.12
CreateVolume: 262.25/25 = 10.5
DeleteVolume: 424.11/25 = 17

Actions in parallel: n = 1
CreateVolume: 1.225128874

Actions in parallel: n = 10
CreateVolume: 53.89/10 = 5.4

NOTE: to test improvements, CreateVolume was picked as it shows increasing latency as the number of parallel requests goes up (DeleteVolume shows the same trend)

2) Test results using this code, which uses the go-ceph bindings for just the happy path in CreateVolume:

Actions in parallel: n = 25
CreateVolume: 31.60/25 = 1.26

So, with the go-ceph bindings, when dealing with parallel requests there is a big improvement in per-RPC response time. Based on this, we should pursue getting this change into the code base.

@Madhu-1 you wanted to further understand how this impacts overall PVC creation time, how it would benefit from the change, and whether throttling and other mechanisms will trip this up. Let me know how you want to proceed with testing this, and any help required towards it.
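For reference, per-RPC handling times like the ones above can be collected with the go-grpc-prometheus interceptors; the sketch below shows the general pattern only, it is not the exact instrumentation used for these numbers, and the socket path and metrics port are placeholders.

```go
// Hedged sketch: wire Prometheus gRPC metrics into a gRPC server so average
// seconds per CreateVolume, NodeStageVolume, etc. can be read from /metrics.
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Record handling-time histograms, not just request counters.
	grpc_prometheus.EnableHandlingTimeHistogram()

	srv := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// The CSI controller/node services would be registered on srv here.
	grpc_prometheus.Register(srv)

	// Expose the metrics for scraping on a placeholder port.
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":8080", nil)

	lis, err := net.Listen("unix", "/tmp/csi.sock") // placeholder CSI socket
	if err != nil {
		panic(err)
	}
	srv.Serve(lis)
}
```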

@Madhu-1
Collaborator

Madhu-1 commented Aug 12, 2019

> Need to do some further benchmarking on a non-laptop setup and also using gRPC to provide data on how this can help CSI operations when operating on parallel requests.
>
> Completed the test and here are the results:
>
> Legend:
>
> <key>: <value>
> Operation: time in seconds per operation
> NOTE: For the node plugins, operation count and time are collected from 3 nodes, and hence represented as total-time/#calls-on-node + ... + ...; the average is that total divided by 3
>
> 1) Test results with existing code and added Prometheus gRPC metrics:
>
> Actions in parallel: n = 25
>
> NodePublishVolume: 1.27/10 + 0.76/7 + 0.73/8 = 0.11
> NodeStageVolume: 18.59/10 + 25.20/7 + 26.43/8 = 2.9
> NodeUnpublishVolume: 0.26/10 + 0.18/7 + 0.24/8 = 0.03
> NodeUnstageVolume: 1.16/10 + 0.88/7 + 1.05/8 = 0.12
> CreateVolume: 262.25/25 = 10.5
> DeleteVolume: 424.11/25 = 17
>
> Actions in parallel: n = 1
> CreateVolume: 1.225128874
>
> Actions in parallel: n = 10
> CreateVolume: 53.89/10 = 5.4
>
> NOTE: to test improvements, CreateVolume was picked as it shows increasing latency as the number of parallel requests goes up (DeleteVolume shows the same trend)
>
> 2) Test results using this code, which uses the go-ceph bindings for just the happy path in CreateVolume:

Could not test with the above-mentioned code; facing an issue in PVC create:

I0812 04:03:10.152532       1 utils.go:110] GRPC request: {"capacity_range":{"required_bytes":2147483648},"name":"pvc-a0ae0ee6-4ba0-4ecf-a04d-0922543271e7","parameters":{"clusterID":"abcd","imageFeatures":"layering","imageFormat":"2","pool":"replicapool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["discard"]}},"access_mode":{"mode":1}}]}
I0812 04:03:10.156928       1 rbd_util.go:482] setting disableInUseChecks on rbd volume to: false
I0812 04:03:10.240855       1 cephcmds.go:158] GetOMapValueGo: error (rados: No such file or directory)
E0812 04:03:10.247148       1 utils.go:113] GRPC error: rpc error: code = Internal desc = rados: No such file or directory
I0812 04:03:10.973198       1 utils.go:109] GRPC call: /csi.v1.Controller/CreateVolume
I0812 04:03:10.973344       1 utils.go:110] GRPC request: {"capacity_range":{"required_bytes":2147483648},"name":"pvc-bd5d0796-741e-417d-9a0a-13b13182a30f","parameters":{"clusterID":"abcd","imageFeatures":"layering","imageFormat":"2","pool":"replicapool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["discard"]}},"access_mode":{"mode":1}}]}
I0812 04:03:10.987897       1 rbd_util.go:482] setting disableInUseChecks on rbd volume to: false
I0812 04:03:11.173766       1 cephcmds.go:158] GetOMapValueGo: error (rados: No such file or directory)
E0812 04:03:11.178812       1 utils.go:113] GRPC error: rpc error: code = Internal desc = rados: No such file or directory

> Actions in parallel: n = 25
> CreateVolume: 31.60/25 = 1.26
>
> So, with the go-ceph bindings, when dealing with parallel requests there is a big improvement in per-RPC response time. Based on this, we should pursue getting this change into the code base.
>
> @Madhu-1 you wanted to further understand how this impacts overall PVC creation time, how it would benefit from the change, and whether throttling and other mechanisms will trip this up. Let me know how you want to proceed with testing this, and any help required towards it.

@Madhu-1
Collaborator

Madhu-1 commented Aug 12, 2019

@ShyamsundarR I have created a new image as per our discussion on reusing the connection; I don't see any major overall performance improvement.

On 100 PVC creations, it took around 57 seconds to bind all PVCs.

@ShyamsundarR
Contributor Author

> Could not test with the above-mentioned code; facing an issue in PVC create

The initial csi.volumes.default OMap needs to be created by hand, as error detection is not added there; once that is done, the existing code works as expected.

> @ShyamsundarR I have created a new image as per our discussion on reusing the connection; I don't see any major overall performance improvement.
> On 100 PVC creations, it took around 57 seconds to bind all PVCs.

Can you please share the code and the test? Also, did you measure gRPC metrics for the same tests, pre- and post-change?
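For completeness, a hedged sketch of creating the initial csi.volumes.default object by hand with go-ceph; the replicapool pool name is taken from the logs above, and this is not what ceph-csi itself writes, just enough to let OMap lookups find the object.

```go
// Sketch only: create the (empty) csi.volumes.default object so that subsequent
// OMap lookups do not fail with "rados: No such file or directory" for the whole
// object. Error handling is minimal.
package main

import (
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil { // monitors + keyring from ceph.conf
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext("replicapool")
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	// Writing zero bytes is enough to create the object that the OMap keys hang off.
	if err := ioctx.WriteFull("csi.volumes.default", []byte{}); err != nil {
		log.Fatal(err)
	}
}
```

The CLI equivalent would be something along the lines of rados -p replicapool create csi.volumes.default.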

@mykaul
Contributor

mykaul commented Sep 3, 2019

@Madhu-1, @ShyamsundarR - any updates?

@ShyamsundarR
Contributor Author

@mykaul It was decided to pick up this work after the 1.2.0 release, for the following reasons:

  • go-ceph does not yet have a stable release branch for use
  • There is at least one feature still needed in go-ceph
  • Finally, we need to settle on the gains when operations are performed in parallel as per commentary in this PR

@nixpanic was interested in taking this forward; he may have further comments to provide.

humblec added the release-2.0.0 (v2.0.0 release), Priority-0 (highest priority issue) and enhancement (New feature or request) labels on Oct 4, 2019
@humblec
Collaborator

humblec commented Oct 4, 2019

@nixpanic assigning this task to you as per the discussion. This is a very high priority item for v2.0.0, so we need immediate attention and priority here. Please let me know if you need any help on this.

@humblec
Collaborator

humblec commented Oct 4, 2019

#557

nixpanic self-assigned this on Oct 9, 2019
@humblec
Collaborator

humblec commented Nov 25, 2019

@nixpanic can you please update this issue with the PR or WIP patch?

@nixpanic
Member

nixpanic commented Dec 3, 2019

@humblec
Collaborator

humblec commented Dec 12, 2019

@nixpanic as the PR (ceph/go-ceph#111) is merged now, maybe we could make some more progress here :). Thanks!

@nixpanic
Member

Older versions of Ceph's librbd do not support fetching the watchers on an image. That was introduced with Mimic (not Nautilus, as mentioned in earlier comments). The rbd command in Luminous can get the watchers, but it uses some internal Ceph functions that are not exposed in the public libraries.

So, the question is: does Ceph-CSI want to keep support for older Ceph versions (Luminous), or can we move on to a newer version as the minimal dependency?

There is a way to implement a fallback in ceph-csi: for functionality missing from the libraries, it can still call the rbd executable instead.
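A sketch of that fallback idea, assuming a go-ceph release that exposes OpenImageReadOnly and Image.ListWatchers; the helper name and the JSON handling of the rbd status output are illustrative, not ceph-csi code.

```go
// Sketch only: prefer the librbd binding for listing watchers, and fall back to
// the rbd executable where the binding (or the cluster) does not support it.
package watchers

import (
	"encoding/json"
	"os/exec"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// imageHasWatchers reports whether pool/name currently has any watchers.
func imageHasWatchers(ioctx *rados.IOContext, pool, name string) (bool, error) {
	// First try the go-ceph binding (requires a release exposing ListWatchers).
	if img, err := rbd.OpenImageReadOnly(ioctx, name, rbd.NoSnapshot); err == nil {
		defer img.Close()
		if watchers, werr := img.ListWatchers(); werr == nil {
			return len(watchers) > 0, nil
		}
	}

	// Fallback: shell out to the rbd CLI and parse its JSON status output.
	out, err := exec.Command("rbd", "status", pool+"/"+name, "--format", "json").Output()
	if err != nil {
		return false, err
	}
	var status struct {
		Watchers []struct {
			Address string `json:"address"`
		} `json:"watchers"`
	}
	if err := json.Unmarshal(out, &status); err != nil {
		return false, err
	}
	return len(status.Watchers) > 0, nil
}
```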

nixpanic added a commit to nixpanic/ceph-csi that referenced this issue Feb 20, 2020
This is the initial step for improving performance during provisioning
of CSI volumes backed by RBD.

While creating a volume, an existing connection to the Ceph cluster is
used from the ConnPool. This should speed up the creation of a batch of
volumes significantly.

Updates: ceph#449
Signed-off-by: Niels de Vos <[email protected]>
@mykaul
Contributor

mykaul commented Feb 25, 2020

> Older versions of Ceph's librbd do not support fetching the watchers on an image. That was introduced with Mimic (not Nautilus, as mentioned in earlier comments). The rbd command in Luminous can get the watchers, but it uses some internal Ceph functions that are not exposed in the public libraries.
>
> So, the question is: does Ceph-CSI want to keep support for older Ceph versions (Luminous), or can we move on to a newer version as the minimal dependency?
>
> There is a way to implement a fallback in ceph-csi: for functionality missing from the libraries, it can still call the rbd executable instead.

I think we should move on.

@Madhu-1
Collaborator

Madhu-1 commented Feb 25, 2020

@nixpanic as discussed, we want to support Mimic+ versions in ceph-csi; the ceph-csi support matrix is here: https://github.com/ceph/ceph-csi#ceph-csi-features-and-available-versions

mergify bot pushed a commit that referenced this issue Mar 11, 2020
This is the initial step for improving performance during provisioning
of CSI volumes backed by RBD.

While creating a volume, an existing connection to the Ceph cluster is
used from the ConnPool. This should speed up the creation of a batch of
volumes significantly.

Updates: #449
Signed-off-by: Niels de Vos <[email protected]>
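As a rough illustration of the ConnPool idea described in the commit message above (not the actual ceph-csi implementation), a reference-counted pool keyed by monitors and user might look like this:

```go
// Sketch only: reuse connections to the same cluster instead of re-establishing
// one for every CSI request; entries are reference counted and shut down once
// the last user releases them.
package connpool

import (
	"sync"

	"github.com/ceph/go-ceph/rados"
)

type pooledConn struct {
	conn *rados.Conn
	refs int
}

type connPool struct {
	mu    sync.Mutex
	conns map[string]*pooledConn
}

func newConnPool() *connPool {
	return &connPool{conns: make(map[string]*pooledConn)}
}

// Get returns an existing connection for monitors/user, or dials a new one.
func (p *connPool) Get(monitors, user, key string) (*rados.Conn, error) {
	id := monitors + "|" + user
	p.mu.Lock()
	defer p.mu.Unlock()

	if pc, ok := p.conns[id]; ok {
		pc.refs++
		return pc.conn, nil
	}

	conn, err := rados.NewConnWithUser(user)
	if err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("mon_host", monitors); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("key", key); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	p.conns[id] = &pooledConn{conn: conn, refs: 1}
	return conn, nil
}

// Put drops a reference and shuts the connection down once it is unused.
func (p *connPool) Put(monitors, user string) {
	id := monitors + "|" + user
	p.mu.Lock()
	defer p.mu.Unlock()
	if pc, ok := p.conns[id]; ok {
		pc.refs--
		if pc.refs == 0 {
			pc.conn.Shutdown()
			delete(p.conns, id)
		}
	}
}
```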
@nixpanic
Member

nixpanic commented Apr 8, 2020

There are many things that need to be done for this issue. I have created https://github.com/ceph/ceph-csi/projects/3 so that tracking the dependencies is a little easier. Don't hesitate to add more issues/PRs/cards to the project.

nixpanic added the component/rbd (Issues related to RBD) label on Apr 17, 2020
@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Oct 4, 2020
@stale

stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed on Oct 12, 2020