[v0.0.5] [nvmeof restart] Failed to operate write op for oid nvmeof.state during service restart #290
Comments
The error comes from Ceph. From the file
The error we get is
Looking at the log we can see that indeed there was an error adding the subsystem to OMAP:
So the subsystem was added to SPDK, which is why it shows up in get_subsystems, but it was not added to the OMAP file.
@rahullepakshi the GRPC lock is taken when a GRPC function is called. So, when create_subsystem() is called, we take the lock, perform the subsystem creation logic, and release the lock. That logic first calls SPDK to create the subsystem and, if that succeeds, adds the subsystem to the OMAP. So, on an update we don't hold the lock for the entire duration of the update: each individual GRPC function called by the update takes the lock and then releases it.
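To make the flow described above concrete, here is a minimal sketch of it. This is not the gateway's actual code: `FakeGateway`, `OmapWriteError`, `_spdk_create` and `_omap_add` are hypothetical stand-ins, and SPDK and the OMAP are modelled as plain in-memory containers. It only shows how a failed OMAP write after a successful SPDK call could leave the two out of sync, and why a retry would then report "already exists".

```python
import threading


class OmapWriteError(Exception):
    """Hypothetical stand-in for a failed write to the OMAP state object."""


class FakeGateway:
    """Toy model: SPDK and the OMAP are just in-memory containers here."""

    def __init__(self, omap_healthy=True):
        self.rpc_lock = threading.Lock()   # taken per GRPC call
        self.spdk_subsystems = set()       # what get_subsystems would list
        self.omap_state = {}               # what survives a restart
        self.omap_healthy = omap_healthy

    def _spdk_create(self, nqn):
        if nqn in self.spdk_subsystems:
            raise ValueError(f"{nqn} already exists")
        self.spdk_subsystems.add(nqn)

    def _omap_add(self, nqn):
        if not self.omap_healthy:
            raise OmapWriteError("failed to operate write op for oid nvmeof.state")
        self.omap_state[nqn] = {"nqn": nqn}

    def create_subsystem(self, nqn):
        # The lock covers this one call only, not a whole update sequence.
        with self.rpc_lock:
            self._spdk_create(nqn)   # step 1: create in SPDK
            self._omap_add(nqn)      # step 2: record in the OMAP
        return True
```

With `omap_healthy=False`, the first `create_subsystem()` call raises `OmapWriteError` after step 1 has already succeeded, so the NQN is present in `spdk_subsystems` but missing from `omap_state`. A second call with the same NQN then fails in step 1 with "already exists".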
Thanks @gbregman. So, would it be best either to not allow any such operations until the GW is "UP" and the OMAP is intact and "healthy" enough to accept writes (or any operations, for that matter), or to acquire the lock for both the GRPC call and the OMAP update so that everything stays intact? WDYT?
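As a rough illustration of the first option (rejecting operations until the gateway is ready), something like the sketch below could work. All names here (`ReadinessGate`, `GatewayNotReady`, `guard`) are made up for the example and are not part of the gateway code. How "ready" would actually be determined is left open.

```python
import functools
import threading


class GatewayNotReady(Exception):
    """Hypothetical error returned while the gateway is still starting up."""


class ReadinessGate:
    """Rejects guarded calls until mark_ready() has been called."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        # Would be called once startup is done and the OMAP accepts writes.
        self._ready.set()

    def mark_not_ready(self):
        # Would be called when a restart begins.
        self._ready.clear()

    def guard(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not self._ready.is_set():
                raise GatewayNotReady("gateway is restarting, try again later")
            return func(*args, **kwargs)
        return wrapper
```

A handler wrapped with `gate.guard(...)` would fail fast during the restart window instead of partially applying a change.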
@rahullepakshi we have a separate issue for that. We should deal with it soon. See issue #56.
@rahullepakshi the code has changed completely since the creation of this issue. Could you see if it's still relevant? Either reproduce it with the current code or close it?
Not seeing this issue anymore, closing it @gbregman |
Tracks https://tracker.ceph.com/issues/63317
Main issue is below
#258 prevents corruption on any grpc calls, and I am not seeing any corruption. However, if a request to add a GW entity is issued during the restart phase, it can end up in SPDK without reaching the OMAP. In my case, create_subsystem was reissued several times after the command failed. It was finally acknowledged and subsystem cnode2 was created, but the OMAP was not updated with it.
The user tries to create a subsystem during an nvmeof service restart, and a subsequent attempt succeeds.
Somewhere between the 1st and 2nd requests to create the subsystem, subsystem nqn.2016-06.io.spdk:cnode2 is created, since journalctl says it already exists. Check for "already exists" at http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_upstream.log
get_subsystems output -
But the OMAP does not have its entry - http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_restart_omap.log
After restarting the GW again, subsystem nqn.2016-06.io.spdk:cnode2 is no longer displayed, although it was displayed earlier.
get_subsystems output,