Retry progenitor client calls in sagas #3225

Merged: @jmpesp merged 7 commits into oxidecomputer:main from audit_saga_unwinds on Jun 9, 2023

Conversation

@jmpesp (Contributor) commented May 25, 2023

Generalize the retry_until_known_result macro, and wrap progenitor client calls from saga nodes. This will retry in the face of transient errors, and reduce

1) the times that sagas fail due to network weather, and
2) the times that saga unwinds fail for the same reason.
@jmpesp requested a review from @davepacheco on May 25, 2023 21:05
@davepacheco (Collaborator) left a comment

Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)

I'm not familiar with a lot of the specific calls here related to storage and Dendrite. Hopefully they're all idempotent -- if not I guess they were already wrong?

/// they are idempotent, reissue the external call until a known result comes
/// back. Retry if a communication error is seen.
#[macro_export]
macro_rules! retry_until_known_result {
@davepacheco (Collaborator)

Could this be a function that accepts a logger and a closure?

@jmpesp (Contributor, Author)

It could be, but I couldn't get that to work! I imagine someone more experienced with Rust could.

Contributor

I think this works:

/// Retry a progenitor client operation until a known result is returned.
///
/// Saga execution relies on the outcome of an external call being known: since
/// they are idempotent, reissue the external call until a known result comes
/// back. Retry if a communication error is seen.
pub(crate) async fn retry_until_known_result<F, T, E, Fut>(
    log: &slog::Logger,
    mut f: F,
) -> Result<T, progenitor_client::Error<E>>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, progenitor_client::Error<E>>>,
    E: std::fmt::Debug,
{
    use omicron_common::backoff;

    backoff::retry_notify(
        backoff::retry_policy_internal_service(),
        move || {
            let fut = f();
            async move {
                match fut.await {
                    Err(progenitor_client::Error::CommunicationError(e)) => {
                        warn!(
                            log,
                            "saw transient communication error {}, retrying...",
                            e,
                        );

                        Err(backoff::BackoffError::transient(
                            progenitor_client::Error::CommunicationError(e),
                        ))
                    }

                    Err(e) => {
                        warn!(log, "saw permanent error {}, aborting", e);

                        Err(backoff::BackoffError::Permanent(e))
                    }

                    Ok(v) => Ok(v),
                }
            }
        },
        |error: progenitor_client::Error<_>, delay| {
            warn!(
                log,
                "failed external call ({:?}), will retry in {:?}", error, delay,
            );
        },
    )
    .await
}

but it requires some tedious changes to all the call sites; e.g., what was

    retry_until_known_result!(log, {
        client.snapshot(
            &params.disk_id.to_string(),
            &crucible_pantry_client::types::SnapshotRequest {
                snapshot_id: snapshot_id.to_string(),
            },
        )
    })
    .map_err(|e| {
        ActionError::action_failed(Error::internal_error(&e.to_string()))
    })?;

is now

    retry_until_known_result(log, || async {
        client
            .snapshot(
                &params.disk_id.to_string(),
                &crucible_pantry_client::types::SnapshotRequest {
                    snapshot_id: snapshot_id.to_string(),
                },
            )
            .await
    })
    .await
    .map_err(|e| {
        ActionError::action_failed(Error::internal_error(&e.to_string()))
    })?;

That is, at each call site:

a) Start the block with || async
b) Add .await to the call inside the block
c) Add .await to the call of retry_until_known_result

@jmpesp (Contributor, Author)

Cool! Done in b80679e

        backoff::retry_policy_internal_service(),
        || async {
            match ($func).await {
                Err(progenitor_client::Error::CommunicationError(e)) => {
@davepacheco (Collaborator)

I think there are other cases where we'd want to retry, like 503. Take a look at this function:
https://github.com/oxidecomputer/omicron/blob/main/dns-service-client/src/lib.rs#L28

It feels like maybe we want a generic progenitor-client is_retryable() function, and we want to use that here.

I think this is still an improvement as-is. We could defer this but I think we might want a TODO comment here or an issue suggesting that we unify this function and that other is_retryable() since they're separately implementing a very critical, non-trivial policy.
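
For concreteness, a unified predicate along those lines might look roughly like the sketch below. This is only an illustration, not the actual dns-service-client code; treating just 503 and 429 error responses as retryable is an assumption here.

use reqwest::StatusCode;

// Sketch only: a generic retryability predicate over progenitor_client::Error.
fn is_retryable<E>(error: &progenitor_client::Error<E>) -> bool {
    match error {
        // The request never produced a definitive answer, so an idempotent
        // call can safely be reissued.
        progenitor_client::Error::CommunicationError(_) => true,

        // The server answered with an error status; only transient conditions
        // (overload, rate limiting) are worth retrying.
        progenitor_client::Error::ErrorResponse(rv) => {
            let status = rv.status();
            status == StatusCode::SERVICE_UNAVAILABLE
                || status == StatusCode::TOO_MANY_REQUESTS
        }

        // Anything else (bad request, unexpected response payload, ...) is
        // treated as permanent.
        _ => false,
    }
}

retry_until_known_result could then classify errors by calling this one predicate instead of matching on variants inline, which would keep the two policies from drifting apart.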

@jmpesp (Contributor, Author) commented May 26, 2023

Good point about the other cases; I've added a retry on 503 and 429 here.

I need to think a bit more about the other parts.

add comment about requiring that calls are idempotent
@jmpesp (Contributor, Author) commented May 26, 2023

> Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)

Yeah, @ahl and I talked about this as well. It quickly became complex, partially because it may or may not have required pulling in the backoff crate too, though adam can probably speak more about this.

I share the sentiment that the macro is annoying, and the fact that you have to remember to do it is problematic.

> I'm not familiar with a lot of the specific calls here related to storage and Dendrite. Hopefully they're all idempotent -- if not I guess they were already wrong?

It's good to point out that assumption here, yeah: retrying until the call returns a result like this only works if the call is idempotent. I've added this to the macro's comment.

The storage calls are idempotent, but, actually, the dendrite ones are not! So this would have been a bug...

@davepacheco (Collaborator)
> the fact that you have to remember to do it is problematic.

One thing I hope will help here is that I'd like to change Steno's undo action signature to return UndoActionError, which will be an enum with only one variant called something like UndoActionError::NeedsIntervention. That means that if someone were to propagate an error without thinking too hard about it, they'd have to convert it first, and hopefully that conversion would raise an alarm bell.
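
A hypothetical sketch of that shape, purely for illustration (this is not Steno's actual API; the variant and field names are placeholders):

// Hypothetical: the only way for an undo action to fail is to explicitly say
// that an operator has to step in; there is no catch-all variant to casually
// propagate an arbitrary error into.
#[derive(Debug, thiserror::Error)]
pub enum UndoActionError {
    #[error("undo action needs operator intervention: {message}")]
    NeedsIntervention { message: String },
}

// An undo action would then have to return Result<(), UndoActionError>, so
// propagating some other error with `?` no longer compiles without an
// explicit (and hopefully attention-getting) conversion.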

@ahl (Contributor) commented May 26, 2023

> Thanks for doing this. It seems like an improvement, and I could also see us iterating on how to make it easier to do the right thing / harder to do the wrong thing. (Having to use a macro here is kind of annoying. But baking it into the clients seems also potentially fraught. So I don't have a better idea.)
>
> Yeah, @ahl and I talked about this as well. It quickly became complex, partially because it may or may not have required pulling in the backoff crate too, though adam can probably speak more about this.

I agree that having a retry policy would be a very good thing for the generic client to supply. I tried a few different approaches in progenitor that all quickly became more complicated than anticipated.

For example, one of the types of parameters a generated client might accept is a streaming body. We can't retry a streaming body without buffering it--which we currently don't do. In addition I ran into issues with paginated endpoints that produce streams... streams that are already (apparently) on the edge of what rustc feels it can prove regarding lifetimes. Several different attempts landed me in situations where extensions to the current structure no longer compiled.

These are navigable--and I'd like to! But it felt like there was more urgency around this issue and I wasn't certain of the level of effort required in progenitor.

dpd api server endpoints are not idempotent, so ensure functions are
required when calling them for sagas.
@jmpesp (Contributor, Author) commented May 26, 2023

> The storage calls are idempotent, but, actually, the dendrite ones are not! So this would have been a bug...

I opened https://github.com/oxidecomputer/dendrite/issues/343 and added more ensure functions to the dpd-client crate in 8608dfa.
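
For illustration, the general shape of an "ensure" wrapper is sketched below. The types and method names are stand-ins, not the real dpd-client API, and the sketch is synchronous for brevity where the real client calls are async.

// Stand-in types for the sketch.
#[derive(Clone, Debug, PartialEq)]
struct Route {
    destination: String,
    nexthop: String,
}

#[derive(Debug)]
enum EnsureError {
    Conflict(String),
}

// Assumed primitives that a generated client might expose.
trait RouteApi {
    fn get_route(&self, destination: &str) -> Result<Option<Route>, EnsureError>;
    fn create_route(&self, route: &Route) -> Result<(), EnsureError>;
}

// Idempotent wrapper: calling this any number of times converges on the same
// state, which is what retrying until a known result (and saga replay) needs.
fn route_ensure<C: RouteApi>(client: &C, desired: &Route) -> Result<(), EnsureError> {
    match client.get_route(&desired.destination)? {
        // Already exactly what we want: nothing to do.
        Some(existing) if existing == *desired => Ok(()),

        // Same key, different parameters: surface a real conflict instead of
        // silently clobbering existing state.
        Some(existing) => Err(EnsureError::Conflict(format!(
            "route for {} already exists with nexthop {}",
            existing.destination, existing.nexthop
        ))),

        // Not present yet: create it.
        None => client.create_route(desired),
    }
}

With a wrapper like this, putting the call inside retry_until_known_result is safe even though the underlying create endpoint is not idempotent on its own.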

@jmpesp (Contributor, Author) commented Jun 2, 2023

This is good for a re-review now.

@davepacheco (Collaborator) left a comment

Similar caveat as last time: I looked at the mechanical changes here, but I'm not that familiar with a bunch of these saga actions or the services they're talking to, so I didn't evaluate, e.g., whether we're retrying the right blocks (vs. grouping things together into one block that should be retried, for example).

        })
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
@davepacheco (Collaborator)

It looks like this is no worse than it was before, but it seems like a problem that we don't distinguish between the many kinds of errors that can happen here. So, not a blocker, but should we file an issue?

@jmpesp (Contributor, Author)

good point, opened #3329

@jmpesp merged commit d75bda7 into oxidecomputer:main on Jun 9, 2023
@jmpesp deleted the audit_saga_unwinds branch on June 9, 2023 17:06