Concurrency issue when provisioning multiple encrypted CephFS instances #4592
Comments
I'll keep digging into the issue; I can also upload the full logs here if anyone is interested!
It appears to me that this would be ideally addressed within the fscrypt module. I'll open an issue there and see what their response is. Explanation below; the other option I can currently think of would be to limit the concurrency.

The problem seems to arise from the fact that when a new mount is created, the UpdateMountInfo function in the fscrypt module is called. However, this function recreates the mountsByDevice map each time it is called, so the memory references of the mount objects get replaced. This results in a mismatch in policy.apply, because we store the old reference in the fscryptContext (`%!w(*actions.ErrDifferentFilesystem=&{0xc00157b130 0xc00208a140})`). So when the lookup from the map is performed and compared to the context in policy.apply, the memory addresses don't match, even though the device number and path are the same in the new mount object. I tested that if we keep the old references in the map and only add new ones, the issue is resolved, so let's hope that approach is acceptable for fscrypt.
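To illustrate the pointer-identity problem, here is a minimal, self-contained Go sketch. The names (`Mount`, `mountsByDevice`, `rebuildAll`) are hypothetical stand-ins, not the actual fscrypt code — the real type lives in `github.com/google/fscrypt/filesystem`:

```go
package main

import "fmt"

// Mount is a hypothetical stand-in for fscrypt's filesystem.Mount.
type Mount struct {
	Path   string
	Device uint64
}

var mountsByDevice = map[uint64]*Mount{}

// rebuildAll mimics the problematic behavior: every update allocates
// fresh Mount values, so pointers held by long-lived callers go stale.
func rebuildAll(mounts []Mount) {
	mountsByDevice = map[uint64]*Mount{}
	for _, m := range mounts {
		m := m
		mountsByDevice[m.Device] = &m
	}
}

// updateKeepingExisting mimics the fix: reuse the existing *Mount for a
// known device so previously handed-out references stay valid.
func updateKeepingExisting(mounts []Mount) {
	for _, m := range mounts {
		if old, ok := mountsByDevice[m.Device]; ok {
			*old = m // update in place, pointer identity preserved
			continue
		}
		m := m
		mountsByDevice[m.Device] = &m
	}
}

func main() {
	rebuildAll([]Mount{{Path: "/mnt/a", Device: 1}})
	held := mountsByDevice[1] // reference stored in a long-lived context

	rebuildAll([]Mount{{Path: "/mnt/a", Device: 1}, {Path: "/mnt/b", Device: 2}})
	fmt.Println(held == mountsByDevice[1]) // false: identity check now fails

	updateKeepingExisting([]Mount{{Path: "/mnt/a", Device: 1}})
	held = mountsByDevice[1]
	updateKeepingExisting([]Mount{{Path: "/mnt/a", Device: 1}, {Path: "/mnt/b", Device: 2}})
	fmt.Println(held == mountsByDevice[1]) // true: reference survives the update
}
```

The second `Println` prints `true` because the existing entry is mutated in place rather than replaced, which is the same idea as keeping the old references in `mountsByDevice` and only adding new ones.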
Great find, thanks for identifying the issue! Let us know if you need assistance with a fix for fscrypt.
Thanks for the offer, but the change was small so I got it! @nixpanic, should we keep this issue open to upgrade the fscrypt version once it is available?
If you want to keep this open until Ceph-CSI has rebased the package, that is fine. I'll leave it up to you what you want to do.
The fscrypt PR is merged, but when discussing with the maintainers they mentioned it might be months until the next release.
@NymanRobin we can update go.mod to point to the exact commit that we require until we get the next release.
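For reference, a Go dependency can be pinned to an exact commit with `go get`, which rewrites go.mod to a pseudo-version. The commit hash and resulting version below are placeholders, not the actual fscrypt fix:

```sh
# Pin the dependency to a specific commit; Go resolves it to a pseudo-version.
# <commit-sha> is a placeholder for the merged fscrypt fix.
go get github.com/google/fscrypt@<commit-sha>

# go.mod then ends up with something of this shape:
#   require github.com/google/fscrypt v0.3.5-0.20240101000000-abcdef123456
```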
@Madhu-1 Thanks for the information! I opened a PR for this; let me know if it looks okay to you.
Yes, we can do that if someone is blocked because of this fix.
Awesome, yes that would help us get further with this feature! I see the PR cannot be merged because of this error:
Thanks for the help getting the changes in, all!
@NymanRobin not yet, we plan to do it in the next 1 or 2 weeks.
When creating multiple pods with separate PVCs, ceph-csi might fail to provision some of the encrypted CephFS instances.
The error from the pod logs is the following:
From this it seems that setting up the encryption fails in ceph-csi. This can be seen from the cephfs-csi plugin logs, which makes me suspect a concurrency issue:
Environment
This problem is reproducible in the rook development environment from master, and kernel support can be set up, for example, by following the instructions that were provided in PR #3460.
The following bash script can be executed and the problem should appear; at least for me it happens every time without exception, the only variance being how many pods fail to deploy.
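The script itself was not preserved in this copy of the issue. A minimal sketch of the kind of reproduction it describes might look like the following — the storage class name `rook-cephfs-encrypted`, the pod count, and the busybox image are all assumptions, not the original script:

```bash
#!/usr/bin/env bash
# Hypothetical reproduction sketch: create N PVCs against an encrypted
# CephFS storage class, plus one pod per PVC, all in quick succession.
set -euo pipefail

COUNT=10
for i in $(seq 1 "$COUNT"); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: enc-pvc-$i
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs-encrypted  # assumed storage class name
---
apiVersion: v1
kind: Pod
metadata:
  name: enc-pod-$i
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: enc-pvc-$i
EOF
done

# Watch for pods stuck in ContainerCreating due to failed provisioning.
kubectl get pods -w
```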
I will try to debug this further, but if anyone has any ideas or pointers regarding it, I am all ears. Also, if anyone has seen this by chance and has some ideas, thanks!