Nexus SEC operations panic when CockroachDB is unavailable #2416
Nexus SEC operations currently panic if they fail:

omicron/nexus/src/db/sec_store.rs
Lines 63 to 65 in cb3d713

They should use a retry loop instead. I think this should be pretty straightforward as long as we're willing to block saga progress on CockroachDB coming back (which it seems like we should).

Comments
Saw this on a March 28th iteration of dogfood.

Adding to what @smklein said -- we have two cores. [Stack traces for the first and second core truncated.]
The second one (pid 5125) is unrelated to this issue -- I've filed #5361 for that.
We observed during a mupdate on Friday that Nexus panicked because of network flakiness. This is an attempt to address that by adding a retry loop. Fixes #2416.
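For illustration, a retry loop along these lines might look like the sketch below. This is a minimal standalone version, assuming tokio and a made-up `DbError` type; the names here are illustrative, not omicron's actual API:

```rust
use std::time::Duration;

/// Hypothetical error type for a CockroachDB write; stands in for
/// omicron's real database error type.
#[derive(Debug)]
struct DbError(String);

/// Retry a database write until it succeeds, backing off exponentially
/// up to a cap. This blocks saga progress until CockroachDB is back,
/// rather than panicking on the first failure.
async fn write_with_retry<F, Fut, T>(mut op: F) -> T
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, DbError>>,
{
    let mut delay = Duration::from_millis(100);
    let max_delay = Duration::from_secs(30);
    loop {
        match op().await {
            Ok(value) => return value,
            Err(err) => {
                // Log and retry; transient CRDB unavailability should
                // not take down the whole Nexus process.
                eprintln!("database write failed, retrying: {err:?}");
                tokio::time::sleep(delay).await;
                delay = std::cmp::min(delay * 2, max_delay);
            }
        }
    }
}
```

The key design choice is that the loop never gives up: saga progress simply blocks until CockroachDB comes back, which matches the proposal in the issue body above.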
I believe there still exists an unwrap here, which can cause Nexus to panic if CRDB is unavailable:

omicron/nexus/db-queries/src/db/sec_store.rs
Lines 127 to 142 in 031b5ec
This was the root cause of #6090.
Ah yeah, you're right -- shouldn't have closed this. Sorry!
No worries, it was easy to miss. Do you wanna take fixing this one, or should I?
I'll pick it up, thanks!
@sunshowers see also #6090 (comment)
@davepacheco thanks -- the first one can be done easily enough I hope, but does the second one need optimistic concurrency/a generation number? If so, then we should just implement that.
I think it already has that, just using a generation number that's made up? The "adopt_generation" was intended to be bumped whenever a takeover happens, but we haven't implemented that. I'm not sure it's worth implementing a case we can't have in production and so can't test. What would we do if the OCC update fails for some reason?
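For context, a generation-guarded (OCC) update of the kind being discussed might look like the following sketch. The table and column names are made up around the `adopt_generation` column mentioned above; this is not omicron's actual schema or query code (it assumes tokio-postgres with its uuid feature):

```rust
use tokio_postgres::Client;
use uuid::Uuid;

/// Sketch of an OCC-guarded saga state update: the UPDATE only applies
/// if `adopt_generation` still has the value we read earlier. Table and
/// column names here are illustrative, not omicron's actual schema.
async fn update_saga_state_occ(
    client: &Client,
    saga_id: Uuid,
    new_state: &str,
    expected_generation: i64,
) -> Result<bool, tokio_postgres::Error> {
    let rows = client
        .execute(
            "UPDATE saga \
             SET saga_state = $1 \
             WHERE id = $2 AND adopt_generation = $3",
            &[&new_state, &saga_id, &expected_generation],
        )
        .await?;
    // Zero rows updated means another SEC adopted the saga (the
    // generation was bumped) and our update lost the race.
    Ok(rows == 1)
}
```

As the comment above notes, the hard part is not the query but deciding what the caller should do when this returns `false`.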
…potent (#6113)

See discussion in #2416 and #6090 (comment). A summary of the changes here:

1. Made `saga_create_event` idempotent. Previously, creating another event that duplicated the one which already existed would fail -- now it succeeds. These events are meant to be an append-only idempotent log, so this is okay (see the sketch below). Also added a test for this.
2. `saga_update_state` was already idempotent -- added a test which made sure of this. Also added a comment about how idempotence may not be enough in the future.
3. Added a retry loop around saga state updates, similar to the one around recording events.
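An idempotent append of the kind described in item 1 is commonly implemented with a conditional insert. A minimal sketch, again with illustrative names rather than omicron's actual schema, assuming a duplicate event is byte-identical to the original:

```rust
use tokio_postgres::Client;
use uuid::Uuid;

/// Sketch of an idempotent event insert: re-recording an event that is
/// already present succeeds silently instead of failing on the unique
/// constraint. Names are illustrative, not omicron's actual schema.
async fn record_saga_event(
    client: &Client,
    saga_id: Uuid,
    node_id: i64,
    event_type: &str,
) -> Result<(), tokio_postgres::Error> {
    client
        .execute(
            "INSERT INTO saga_node_event (saga_id, node_id, event_type) \
             VALUES ($1, $2, $3) \
             ON CONFLICT DO NOTHING",
            &[&saga_id, &node_id, &event_type],
        )
        .await?;
    Ok(())
}
```

Combined with the retry loop, this makes replays safe: a write that succeeded just before a connection error can be retried without failing on the unique constraint.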
Grepped through omicron for "2416" and didn't see any other results, so I think this is done for real now.
Related: oxidecomputer/steno#302. I hit this because I had broken things in my PR so that this retry loop kept hitting a permanent error. [Error output truncated.] During that time, I couldn't list sagas. My guess is that other sagas couldn't complete, either. I think Nexus is doing nothing wrong and that Steno ought to behave better here, but I figured I'd mention it so folks were aware.
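One way to keep such a retry loop from spinning forever is to classify errors before retrying: retry transient connection failures, and surface everything else to the caller. A sketch under the same assumptions as the earlier one, with made-up error variants:

```rust
use std::time::Duration;

/// Illustrative database error type; the variants are made up for this
/// sketch and do not correspond to omicron's real error enum.
#[derive(Debug)]
enum DbError {
    /// The database was unreachable; worth retrying.
    ConnectionFailed(String),
    /// The statement itself was rejected; retrying cannot help.
    QueryRejected(String),
}

impl DbError {
    fn is_transient(&self) -> bool {
        matches!(self, DbError::ConnectionFailed(_))
    }
}

/// Like the retry loop above, but gives up on permanent errors instead
/// of retrying them forever.
async fn write_with_bounded_retry<F, Fut, T>(mut op: F) -> Result<T, DbError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, DbError>>,
{
    let mut delay = Duration::from_millis(100);
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if err.is_transient() => {
                eprintln!("transient database error, retrying: {err:?}");
                tokio::time::sleep(delay).await;
                delay = std::cmp::min(delay * 2, Duration::from_secs(30));
            }
            // A permanent error: return it so the caller (e.g. Steno)
            // can fail the saga rather than stalling indefinitely.
            Err(err) => return Err(err),
        }
    }
}
```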