Retry progenitor client calls in sagas #3225

Merged: @jmpesp merged 7 commits into oxidecomputer:main from audit_saga_unwinds on Jun 9, 2023

Conversation

@jmpesp (Contributor) commented May 25, 2023

Generalize the retry_until_known_result macro, and wrap progenitor client calls from saga nodes. This will retry in the face of transient errors, and reduce

1) the times that sagas fail due to network weather, and
2) the times that saga unwinds fail for the same reason.
@jmpesp requested a review from @davepacheco on May 25, 2023 21:05
@davepacheco (Collaborator) left a comment

Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)

I'm not familiar with a lot of the specific calls here related to storage and Dendrite. Hopefully they're all idempotent -- if not I guess they were already wrong?

/// they are idempotent, reissue the external call until a known result comes
/// back. Retry if a communication error is seen.
#[macro_export]
macro_rules! retry_until_known_result {
@davepacheco (Collaborator)

Could this be a function that accepts a logger and a closure?

@jmpesp (Contributor, Author)

It could be, but I couldn't get that to work! I imagine someone more experienced with Rust could.

Contributor

I think this works:

/// Retry a progenitor client operation until a known result is returned.
///
/// Saga execution relies on the outcome of an external call being known: since
/// they are idempotent, reissue the external call until a known result comes
/// back. Retry if a communication error is seen.
pub(crate) async fn retry_until_known_result<F, T, E, Fut>(
    log: &slog::Logger,
    mut f: F,
) -> Result<T, progenitor_client::Error<E>>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, progenitor_client::Error<E>>>,
    E: std::fmt::Debug,
{
    use omicron_common::backoff;

    backoff::retry_notify(
        backoff::retry_policy_internal_service(),
        move || {
            let fut = f();
            async move {
                match fut.await {
                    Err(progenitor_client::Error::CommunicationError(e)) => {
                        warn!(
                            log,
                            "saw transient communication error {}, retrying...",
                            e,
                        );

                        Err(backoff::BackoffError::transient(
                            progenitor_client::Error::CommunicationError(e),
                        ))
                    }

                    Err(e) => {
                        warn!(log, "saw permanent error {}, aborting", e);

                        Err(backoff::BackoffError::Permanent(e))
                    }

                    Ok(v) => Ok(v),
                }
            }
        },
        |error: progenitor_client::Error<_>, delay| {
            warn!(
                log,
                "failed external call ({:?}), will retry in {:?}", error, delay,
            );
        },
    )
    .await
}

but it requires some tedious changes to all the call sites; e.g., what was

    retry_until_known_result!(log, {
        client.snapshot(
            &params.disk_id.to_string(),
            &crucible_pantry_client::types::SnapshotRequest {
                snapshot_id: snapshot_id.to_string(),
            },
        )
    })
    .map_err(|e| {
        ActionError::action_failed(Error::internal_error(&e.to_string()))
    })?;

is now

    retry_until_known_result(log, || async {
        client
            .snapshot(
                &params.disk_id.to_string(),
                &crucible_pantry_client::types::SnapshotRequest {
                    snapshot_id: snapshot_id.to_string(),
                },
            )
            .await
    })
    .await
    .map_err(|e| {
        ActionError::action_failed(Error::internal_error(&e.to_string()))
    })?;

That is, at each call site:

a) Start the block with || async
b) Add .await to the call inside the block
c) Add .await to the call of retry_until_known_result

@jmpesp (Contributor, Author)

Cool! Done in b80679e

        backoff::retry_policy_internal_service(),
        || async {
            match ($func).await {
                Err(progenitor_client::Error::CommunicationError(e)) => {
@davepacheco (Collaborator)

I think there are other cases where we'd want to retry, like 503. Take a look at this function:
https://github.com/oxidecomputer/omicron/blob/main/dns-service-client/src/lib.rs#L28

It feels like maybe we want a generic progenitor-client is_retryable() function, and we want to use that here.

I think this is still an improvement as-is. We could defer this but I think we might want a TODO comment here or an issue suggesting that we unify this function and that other is_retryable() since they're separately implementing a very critical, non-trivial policy.
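
For concreteness, a unified predicate along those lines might look roughly like the sketch below. This is only an illustration, not the actual dns-service-client code; treating just 503 and 429 error responses as retryable is an assumption here.

use reqwest::StatusCode;

// Sketch only: a generic retryability predicate over progenitor_client::Error.
fn is_retryable<E>(error: &progenitor_client::Error<E>) -> bool {
    match error {
        // The request never produced a definitive answer, so an idempotent
        // call can safely be reissued.
        progenitor_client::Error::CommunicationError(_) => true,

        // The server answered with an error status; only transient conditions
        // (overload, rate limiting) are worth retrying.
        progenitor_client::Error::ErrorResponse(rv) => {
            let status = rv.status();
            status == StatusCode::SERVICE_UNAVAILABLE
                || status == StatusCode::TOO_MANY_REQUESTS
        }

        // Anything else (bad request, unexpected response payload, ...) is
        // treated as permanent.
        _ => false,
    }
}

retry_until_known_result could then classify errors by calling this one predicate instead of matching on variants inline, which would keep the two policies from drifting apart.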

@jmpesp (Contributor, Author) commented May 26, 2023

Good point about the other cases; I've added a retry on 503 and 429 here.

I need to think a bit more about the other parts.

add comment about requiring that calls are idempotent
@jmpesp (Contributor, Author) commented May 26, 2023

> Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)

Yeah, @ahl and I talked about this as well. It quickly became complex, partially because it may or may not have required pulling in the backoff crate too, though adam can probably speak more about this.

I share the sentiment that the macro is annoying, and the fact that you have to remember to do it is problematic.

> I'm not familiar with a lot of the specific calls here related to storage and Dendrite. Hopefully they're all idempotent -- if not I guess they were already wrong?

It's good to point out that assumption here, yeah: retrying until the call returns a result like this only works if the call is idempotent. I've added this to the macro's comment.

The storage calls are idempotent, but, actually, the dendrite ones are not! So this would have been a bug...

@davepacheco (Collaborator)
> the fact that you have to remember to do it is problematic.

One thing I hope will help here is that I'd like to change Steno's undo action signature to return UndoActionError, which will be an enum with only one variant called something like UndoActionError::NeedsIntervention. That means that if someone were to propagate an error without thinking too hard about it, they'd have to convert it first, and hopefully that conversion would raise an alarm bell.
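
A hypothetical sketch of that shape, purely for illustration (this is not Steno's actual API; the variant and field names are placeholders):

// Hypothetical: the only way for an undo action to fail is to explicitly say
// that an operator has to step in; there is no catch-all variant to casually
// propagate an arbitrary error into.
#[derive(Debug, thiserror::Error)]
pub enum UndoActionError {
    #[error("undo action needs operator intervention: {message}")]
    NeedsIntervention { message: String },
}

// An undo action would then have to return Result<(), UndoActionError>, so
// propagating some other error with `?` no longer compiles without an
// explicit (and hopefully attention-getting) conversion.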

@ahl (Contributor) commented May 26, 2023

> Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)
>
> Yeah, @ahl and I talked about this as well. It quickly became complex, partially because it may or may not have required pulling in the backoff crate too, though adam can probably speak more about this.

I agree that having a retry policy would be a very good thing for the generic client to supply. I tried a few different approaches in progenitor that all quickly became more complicated than anticipated.

For example, one of the types of parameters a generated client might accept is a streaming body. We can't retry a streaming body without buffering it--which we currently don't do. In addition I ran into issues with paginated endpoints that produce streams... streams that are already (apparently) on the edge of what rustc feels it can prove regarding lifetimes. Several different attempts landed me in situations where extensions to the current structure no longer compiled.

These are navigable--and I'd like to! But it felt like there was more urgency around this issue and I wasn't certain of the level of effort required in progenitor.

dpd api server endpoints are not idempotent, so ensure functions are
required when calling them for sagas.
@jmpesp (Contributor, Author) commented May 26, 2023

> The storage calls are idempotent, but, actually, the dendrite ones are not! So this would have been a bug...

I opened https://github.com/oxidecomputer/dendrite/issues/343 and added more ensure functions to the dpd-client crate in 8608dfa.
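
For illustration, the general shape of an "ensure" wrapper is sketched below. The types and method names are stand-ins, not the real dpd-client API, and the sketch is synchronous for brevity where the real client calls are async.

// Stand-in types for the sketch.
#[derive(Clone, Debug, PartialEq)]
struct Route {
    destination: String,
    nexthop: String,
}

#[derive(Debug)]
enum EnsureError {
    Conflict(String),
}

// Assumed primitives that a generated client might expose.
trait RouteApi {
    fn get_route(&self, destination: &str) -> Result<Option<Route>, EnsureError>;
    fn create_route(&self, route: &Route) -> Result<(), EnsureError>;
}

// Idempotent wrapper: calling this any number of times converges on the same
// state, which is what retrying until a known result (and saga replay) needs.
fn route_ensure<C: RouteApi>(client: &C, desired: &Route) -> Result<(), EnsureError> {
    match client.get_route(&desired.destination)? {
        // Already exactly what we want: nothing to do.
        Some(existing) if existing == *desired => Ok(()),

        // Same key, different parameters: surface a real conflict instead of
        // silently clobbering existing state.
        Some(existing) => Err(EnsureError::Conflict(format!(
            "route for {} already exists with nexthop {}",
            existing.destination, existing.nexthop
        ))),

        // Not present yet: create it.
        None => client.create_route(desired),
    }
}

With a wrapper like this, putting the call inside retry_until_known_result is safe even though the underlying create endpoint is not idempotent on its own.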

@jmpesp (Contributor, Author) commented Jun 2, 2023

This is good for a re-review now.

@davepacheco (Collaborator) left a comment

Similar caveat as last time: I looked at the mechanical changes here, but I'm not that familiar with a bunch of these saga actions or the services they're talking to, so I didn't evaluate, e.g., whether we're retrying the right blocks (vs. grouping things together into one block that should be retried, for example).

        })
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
@davepacheco (Collaborator)

It looks like this is no worse than it was before, but it seems like a problem that we don't distinguish between the many kinds of errors that can happen here. So, not a blocker, but should we file an issue?

@jmpesp (Contributor, Author)

good point, opened #3329

@jmpesp merged commit d75bda7 into oxidecomputer:main on Jun 9, 2023
@jmpesp deleted the audit_saga_unwinds branch on June 9, 2023 17:06