-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert journal package omap to use go ceph #1071
Convert journal package omap to use go ceph #1071
Conversation
180f674
to
e92d6df
Compare
60e1941
to
78cedb4
Compare
7c4b402
to
0ba9363
Compare
Can someone who is more familiar with these e2e tests help me out? I see that the test fails and times out waiting for a condition, but the output is not clear what condition. If it refers to the lines right above, I don't quite see how this pr would interact with that (node labeling). I've (lightly) tested this by hand now for rbd and its working locally for me. I'd love a hand getting this PR to pass the tests. |
@ShyamsundarR are missing any RBAC for external-provisioner?
@phlogistonjohn this is the issue in cephcsi side, in DeleteVolume request, as CSI needs to be idempotent we need to consider the case as a success if some resource is not present in the backend https://travis-ci.org/github/ceph/ceph-csi/jobs/695718476#L13231 |
Looks like the rbd-external-provisioner also need #1142 |
hmm.. Its already present in the template... |
In the job logs for non-helm based test, the above errors are not present. Checked the helm scripts to ensure that topology was set to true, as that would add the required roles from the cluster role template. Currently unable to figure out what else is missing, to cause the above errors to appear. |
Run: With helm Run: without helm NOTE: Debugging the non-helm setup, to avoid the "node get" noise.
First error is here: @phlogistonjohn Have to recheck the code but, newrbdpool holds the omap for the image, and the CSI omap should be in the default pool (replicapool) created. The code seems to be attempting a delete of the omap in the wrong pool? |
Thanks for the feedback everyone, it appears to be that the rados command line did not return an error in the case of the omap not existings when removing keys. I've added a change to mimic that behavior. This should be ready for review now. |
4281a94
to
d71166f
Compare
@nixpanic @humblec @ShyamsundarR @Madhu-1 PTAL. I have rebased again and now all the checks are happy now (previously, the CI had passed but mergify was stuck). |
switch err { | ||
case nil: | ||
case rados.ErrNotFound: | ||
klog.Errorf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we log which key not found? it would be useful for debugging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can log the keys requested, but logging keys-not-found doesn't make sense for this case because it is only returned if the oid is not found. The only reason it returns ErrKeyNotFound is for backwards compat. with the previous implementation.
So I'm not sure it would really be helpful.
@@ -173,6 +177,7 @@ func NewCSISnapshotJournal(suffix string) *Config { | |||
cephSnapSourceKey: "csi.source", | |||
namespace: "", | |||
encryptKMSKey: "csi.volume.encryptKMS", | |||
commonPrefix: "csi.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this cause any breakage once user upgrades cephcsi, just want to make sure everything works in an upgrade scenario?
the upgrade is not covered in E2E :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine. It's just a constant that "rolls up" the current practice of all constants that start with "csi." (and I checked, they all do AFAICT).
If this were to change then we'd have to modify it with either additional constants or something, but I don't see a problem here.
As a further (future) optimization, you could eliminate the journal lock and use a RADOS class call to execute the |
d71166f
to
14ebcca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit, good to go
14ebcca
to
9949a2e
Compare
Travis failed in setup (not my code). If someone doesn't poke it I'll push a dummy change later (nothing to rebase on atm). |
4633f09
to
e41662f
Compare
/test ci/centos/containerized-tests |
@Mergifyio rebase |
These types have private fields but we need to construct them outside of the util package. Add New* methods for both. Signed-off-by: John Mulligan <[email protected]>
These new omap manipulation functions (get/set/remove) are roughly equivalent to the previous command-line based approach but rely on direct api calls to ceph. Signed-off-by: John Mulligan <[email protected]>
Convert the business-logic of the journal to use the new go-ceph based omap manipulation functions. Signed-off-by: John Mulligan <[email protected]>
Taking this appraoch means that any function that must get more than one key's value from the same oid can be more efficient by calling out to ceph only once. To be cautious and avoid missing things we always request ceph return more keys than we actually expect to be set on the oid. If there are unexpected keys there, we will not miss the keys we want if we first hit an unexpected key if we were to limit ourselves to iterating only over the number of keys we're expecting to be on the object. Signed-off-by: John Mulligan <[email protected]>
For any function that removes more than one key on a single oid removing them as a batch will be more efficient. Signed-off-by: John Mulligan <[email protected]>
For any function that sets more than one key on a single oid setting them as a batch will be more efficient. Signed-off-by: John Mulligan <[email protected]>
Signed-off-by: John Mulligan <[email protected]>
The previous function used to remove omap keys apparently did not return errors when removing omap keys from a missing omap (oid). Mimic that behavior when using the api. Signed-off-by: John Mulligan <[email protected]>
A number of exported functions in errors.go were missing doc comments. Add them. Signed-off-by: John Mulligan <[email protected]>
Command
|
14a7e78
to
054f1bd
Compare
Travis status appears borked (again) - all tasks report success but travis ci failed to report back to the PR and it still shows pending to me. |
Yeah, restarted it again |
Describe what this PR does
Change the way the journal package interacts with omaps by switching from cli based functions to the api bindings of go-ceph. The first few patches basically do one-to-one switch overs of the calls used in voljournal.go. Subsequent patches change the internal interfaces to use the natural batching that the api calls give us, allowing the setting, fetching, or removal of multiple keys in one api call.
Is there anything that requires special attention
Do you have any questions? n/a
Is the change backward compatible?
In theory it should behave exactly as before.
Are there concerns around backward compatibility?
Now that CI is passing, there shouldn't be, but any missed changes can be handled as fixes.
Related issues
Fixes: #434
Future concerns
This series does not handle changing everything in journal to go-ceph. There's more to be done here.
I did not clean up any functions that are now unused in util. I plan to do that in a follow up pr.