
Nexus SEC operations panic when CockroachDB is unavailable #2416

Closed
davepacheco opened this issue Feb 23, 2023 · 13 comments · Fixed by #5390

@davepacheco
Collaborator

Nexus SEC operations currently panic if they fail:

// TODO-robustness This should be wrapped with a retry loop rather than
// unwrapping the result.
self.datastore.saga_create_event(&our_event).await.unwrap();

They should use a retry loop instead. I think this should be pretty straightforward as long as we're willing to block saga progress on CockroachDB coming back (which it seems like we should).
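
For illustration, a minimal sketch of what such a retry loop could look like (a hypothetical helper, not the actual omicron code; assumes tokio is available):

```rust
use std::future::Future;
use std::time::Duration;

/// Retry `op` until it succeeds, backing off exponentially up to `max_delay`.
/// Sketch only: a real implementation would also want structured logging and
/// a way to distinguish transient from permanent errors.
async fn retry_until_ok<T, E, F, Fut>(mut op: F, max_delay: Duration) -> T
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut delay = Duration::from_millis(100);
    loop {
        match op().await {
            Ok(value) => return value,
            Err(error) => {
                // e.g. CockroachDB is unavailable: wait and try again
                // instead of panicking via unwrap().
                eprintln!("saga event write failed (will retry): {error:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

At the call site this would replace the `unwrap()`, e.g. something like `retry_until_ok(|| self.datastore.saga_create_event(&our_event), Duration::from_secs(30)).await;`, assuming the write is safe to re-issue (idempotent) when we don't know whether a previous attempt landed.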

@smklein
Collaborator

smklein commented Mar 30, 2024

Saw this on a March 28th iteration of dogfood

@sunshowers
Contributor

sunshowers commented Mar 30, 2024

Adding to what @smklein said -- we have two cores at /staff/dock/rack2/mupdate-20240329/cores/sled-08.

With mdb <core>, then $c:

core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.10458.1711663318:

Loading modules: [ libumem.so.1 libnvpair.so.1 libc.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
remaining frames
_ZN5steno3sec3Sec12executor_log28_$u7b$$u7b$closure$u7d$$u7d$17hc49d437d71bb12e7E+0x53()
_ZN117_$LT$futures_util..stream..futures_unordered..FuturesUnordered$LT$Fut$GT$$u20$as$u20$futures_core..stream..Stream$GT$9poll_next17hfd98f22f6167186cE+0x1ac()
_ZN88_$LT$tokio..future..poll_fn..PollFn$LT$F$GT$$u20$as$u20$core..future..future..Future$GT$4poll17h87f05f880b8d9347E+0x28c()
_ZN5steno3sec3sec28_$u7b$$u7b$closure$u7d$$u7d$17he80dac5e7ee8228cE.llvm.5074909589756572123+0x3b9()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17hb16ad800f6b83c3eE+0x49()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h4c987821e435a118E+0x48()
_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x146()
_ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe()
_ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193()
_ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97()
_ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4()
_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f()
_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75()
_ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29()
libc.so.1`_thrp_setup+0x77(fffff5ffeddcda40)
libc.so.1`_lwp_start()

And core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.5125.1711667683:

libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4f48a()
0x399e8a2()
_ZN98_$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$9from_iter17h1c0a0b91677a258cE+0xc4()
_ZN159_$LT$omicron_nexus..app..background..sync_switch_configuration..SwitchPortSettingsManager$u20$as$u20$omicron_nexus..app..background..common..BackgroundTask$GT$8activate28_$u7b$$u7b$closure$u7d$$u7d$17h003d11b741f553f2E+0x26f91()
remaining frames
_ZN13omicron_nexus3app10background6common8TaskExec8activate28_$u7b$$u7b$closure$u7d$$u7d$17h60138c0e224896edE+0x233()
_ZN13omicron_nexus3app10background6common8TaskExec3run28_$u7b$$u7b$closure$u7d$$u7d$17hb86814c79d6f1a8fE.llvm.10434374178345103703+0x467()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17hdf5fcd0b9767bf4aE+0x83()
_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x190()
_ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe()
_ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193()
_ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b()
_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73()
_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97()
_ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4()
_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f()
_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75()
_ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29()
libc.so.1`_thrp_setup+0x77(fffff5ffedddda40)
libc.so.1`_lwp_start()

@sunshowers
Contributor

The first one (pid 10458) seems to be

self.datastore.saga_create_event(&our_event).await.unwrap();

The second one (pid 5125) is from somewhere in the `SwitchPortSettingsManager` background task, judging by its stack trace above.

@sunshowers
Contributor

sunshowers commented Mar 30, 2024

The second pid 5125 one is unrelated to this issue -- have filed #5361 for that.

sunshowers added a commit that referenced this issue Apr 4, 2024
We observed during a mupdate on Friday that Nexus panicked because of
network flakiness. This is an attempt to address that by adding a retry
loop.

Fixes #2416.
@smklein reopened this Jul 15, 2024
@smklein
Collaborator

smklein commented Jul 15, 2024

I believe there is still an unwrap here, which can cause Nexus to panic if CRDB is unavailable:

async fn saga_update(&self, id: SagaId, update: steno::SagaCachedState) {
    // TODO-robustness We should track the current generation of the saga
    // and use it. We'll know this either from when it was created or when
    // it was recovered.
    info!(&self.log, "updating state";
        "saga_id" => id.to_string(),
        "new_state" => update.to_string()
    );
    // TODO-robustness This should be wrapped with a retry loop rather than
    // unwrapping the result. See omicron#2416.
    self.datastore
        .saga_update_state(id, update, self.sec_id, Generation::new())
        .await
        .unwrap();
}

This was the root cause of #6090

@sunshowers
Contributor

Ah yeah you're right, shouldn't have closed this. Sorry!

@smklein
Collaborator

smklein commented Jul 15, 2024

> Ah yeah you're right, shouldn't have closed this. Sorry!

No worries, it was easy to miss. Do you wanna take fixing this one, or should I?

@sunshowers
Contributor

I'll pick it up, thanks!

@davepacheco
Collaborator Author

@sunshowers see also #6090 (comment)

@sunshowers
Contributor

@davepacheco thanks -- the first one can be done easily enough I hope, but does the second one need optimistic concurrency/a generation number? If so, then we should just implement that.

@davepacheco
Collaborator Author

I think it already has that, just using a generation number that's made up? The "adopt_generation" was intended to be bumped whenever a takeover happens, but we haven't implemented that.

I'm not sure it's worth implementing a case we can't have in production and so can't test. What would we do if the OCC update fails for some reason?
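
For reference, the generation-based optimistic concurrency being discussed boils down to a conditional update: the write only applies if the stored current_sec and adopt_generation still match what the caller expects. A minimal in-memory sketch with hypothetical types (not the omicron schema or datastore API):

```rust
#[derive(Debug)]
struct SagaRow {
    current_sec: String,
    adopt_generation: u64,
    state: String,
}

/// Apply a state update only if the caller's view of saga ownership is
/// still current; otherwise fail with a "preconditions not met" error.
fn saga_update_state_occ(
    row: &mut SagaRow,
    expected_sec: &str,
    expected_gen: u64,
    new_state: &str,
) -> Result<(), String> {
    if row.current_sec != expected_sec || row.adopt_generation != expected_gen {
        // Another SEC has adopted the saga (and bumped adopt_generation),
        // so this update must fail rather than clobber the new owner's state.
        return Err(format!(
            "preconditions not met: expected sec={expected_sec} gen={expected_gen}, \
             found sec={} gen={}",
            row.current_sec, row.adopt_generation
        ));
    }
    row.state = new_state.to_string();
    Ok(())
}
```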

sunshowers added a commit that referenced this issue Jul 18, 2024
…potent (#6113)

See discussion in #2416
and #6090 (comment).

A summary of the changes here:

1. Made `saga_create_event` idempotent. Previously, creating another
event that duplicated the one which already existed would fail -- now it
succeeds. These events are meant to be an append-only idempotent log, so
this is okay. Also added a test for this.
2. `saga_update_state` was already idempotent -- added a test which made
sure of this. Also added a comment about how idempotence may not be
enough in the future.
3. Added a retry loop around saga state updates, similar to the one
around recording events.
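
For illustration, a minimal sketch of the idempotent-append behavior described in item 1 above, using a hypothetical in-memory stand-in rather than the real CockroachDB-backed datastore:

```rust
use std::collections::HashMap;

/// Hypothetical key: (saga id, event index). The real table keys differ.
type EventKey = (u64, u64);

#[derive(Default)]
struct EventLog {
    events: HashMap<EventKey, String>,
}

impl EventLog {
    /// Appending an event identical to one already recorded succeeds
    /// (idempotent), which makes retries safe; only a conflicting event
    /// for the same key is an error.
    fn saga_create_event(&mut self, key: EventKey, data: &str) -> Result<(), String> {
        if let Some(existing) = self.events.get(&key) {
            if existing.as_str() == data {
                // Duplicate of an identical event: succeed without rewriting.
                return Ok(());
            }
            return Err(format!("conflict: {existing:?} vs {data:?}"));
        }
        self.events.insert(key, data.to_string());
        Ok(())
    }
}
```
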
@sunshowers
Contributor

Grepped through omicron for "2416" and didn't see any other results, so I think this is done for real now.

@davepacheco
Collaborator Author

Related: oxidecomputer/steno#302. I hit this because in my PR I had broken things so that this retry loop kept hitting a permanent error:

00:19:42.794Z ERRO c39c0c31-e34e-4b77-9616-2a0bf956f9b6 (ServerContext): client error while updating saga state (likely requires operator intervention), retrying anyway
    call_count = 12
    error = Invalid Request: failed to update saga SagaId(9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2) with state Done: preconditions not met: expected current_sec = c39c0c31-e34e-4b77-9616-2a0bf956f9b6, adopt_generation = Generation(Generation(1)), but found current_sec = Some(c39c0c31-e34e-4b77-9616-2a0bf956f9b6), adopt_generation = Generation(Generation(2)), state = SagaCachedState(Running)
    file = nexus/db-queries/src/db/sec_store.rs:147
    new_state = done
    saga_id = 9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2
    total_duration = 612.261975772s

During that time, I couldn't list sagas. My guess is that other sagas couldn't complete, either. I think Nexus is doing nothing wrong and that Steno ought to behave better here but I figured I'd mention it here so folks were aware.
