Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(transport): retry layer #849

Merged
merged 19 commits into from
Jul 5, 2024
Merged

feat(transport): retry layer #849

merged 19 commits into from
Jul 5, 2024

Conversation

yash-atreya
Copy link
Member

@yash-atreya yash-atreya commented Jun 7, 2024

Motivation

Port foundry's layer into the transport crate.
Had quite a few requests for a retry layer from users, and also, this would remove a bunch of code from the Foundry.

Closes #717
Closes #140

Solution

Port from Foundry, still WIP.

Closes #288

PR Checklist

  • Added Tests
  • Added Documentation
  • Breaking changes

return should_retry_json_rpc_error(&resp);
}

// some providers send invalid JSON RPC in the error case (no `id:u64`), but the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how very kind of them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be true ONLY of http, right? as in WS we would REQUIRE the ID in order to determine which response got rate limited?

does that imply that we should resolve this in the HTTP transports by writing the ID into the resp if it's missing?

Copy link
Member

@prestwich prestwich Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i like this idea more now. basically try this as a fallback deser in the http transports, and insert the ID if it works

Copy link
Member

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what we've been in using in foundry

we could rethink some of the decisions here, but overall this has been working well, for example there's an argument for removing the cups budgeting entirely and instead entirely rely on ratelimit error responses

there are a few native retry types in tower like https://tower-rs.github.io/tower/tower/retry/budget/index.html#reexport.TpsBudget

but they don't really fit our use case

fn should_retry_json_rpc_error(error: &ErrorPayload) -> bool {
let ErrorPayload { code, message, .. } = error;
// alchemy throws it this way
if *code == 429 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excuse me this is insane

@prestwich
Copy link
Member

this is a good inclusion. the amount of stupid wrangling makes me sad

crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Outdated Show resolved Hide resolved
return should_retry_json_rpc_error(&resp);
}

// some providers send invalid JSON RPC in the error case (no `id:u64`), but the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be true ONLY of http, right? as in WS we would REQUIRE the ID in order to determine which response got rate limited?

does that imply that we should resolve this in the HTTP transports by writing the ID into the resp if it's missing?

@mattsse mattsse linked an issue Jun 10, 2024 that may be closed by this pull request
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Outdated Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
yash-atreya and others added 2 commits June 11, 2024 07:13
Co-authored-by: Matthias Seitz <[email protected]>
@yash-atreya yash-atreya requested a review from mattsse June 11, 2024 11:15
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Show resolved Hide resolved
crates/transport/src/error.rs Outdated Show resolved Hide resolved
crates/transport/src/layers/retry.rs Show resolved Hide resolved
@yash-atreya yash-atreya mentioned this pull request Jun 12, 2024
3 tasks
Copy link
Member

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the batch retry logic is very hard to follow, not immediately obvious what this does.

I suggest we don't do any of this at first, just to make some progress on the pr and retry the entire call

@@ -110,3 +111,65 @@ impl HttpError {
false
}
}

/// Extension trait to implement methods for [`RpcError<TransportErrorKind, E>`].
pub trait RpcErrorExt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this to implement is_retryable_err and backoff_hint methods specifically on RpcError<TransportErrorKind> and not the RpcError<E, ErrResp> generic type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this is only implemented for RpcError?

could we at least make this pub(crate)?

crates/transport/src/layers/retry.rs Outdated Show resolved Hide resolved
Comment on lines 161 to 180
for r in batch_res {
if r.is_error() {
batch_errs.push(r);
} else {
batch_success.push(r.clone());
// Remove corresponding request from the batch
req.remove_by_id(&r.id);

if req.is_empty() {
let response_packet =
ResponsePacket::from(batch_success.clone());
batch_success.clear();
batch_errs.clear();

return Ok(response_packet);
}
}
}

let e = batch_errs.first().unwrap().payload.as_error().unwrap().clone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do?

batch handling looks very complex,

this also only uses the first batch_err, but there could be many

Comment on lines 192 to 196
if !batch_success.is_empty() {
// Join batch_success and batch_errs
let mut batch_calls = batch_success;
batch_calls.append(&mut batch_errs);
return Ok(ResponsePacket::from(batch_calls));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this mess with the request ordering?

@yash-atreya
Copy link
Member Author

@mattsse ptal.

I've made some changes to maintain the ordering of the responses corresponding to the incoming requests.

Few highlights:

  1. We store initial incoming requests into requets_order. This is used to reference the original order and arrange the responses accordingly.

  2. We collect all responses (regardless of whether it is a success or err) into batch_responses: HashMap<Id, Response> where key is the request Id.

  3. As the requests are retried the new response replaces the old one in the batch_responses HashMap.

  4. Requests are churned/removed from the next retry batch using req.remove_by_id(&id), if a success response is returned or when non-retryable is encountered.

  5. The batch_responses is returned as a ResponsePacket when:

    • No more requests are left to be retried i.e req.is_empty().
    • Max retries available retries have been exhausted.
    • Non retryable errors are encountered.

Copy link
Member

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the batch retry logic is a bit complex and not easy to follow, imo this feature isn't super important. I'd rather make some progress on this pr and exclude this entirely, because this has been open for a while now.

it turns out batch requests don't necessarily have ordering guarantees:

https://docs.alchemy.com/reference/batch-requests#what-is-a-batch-request

so we can actually drastically simplify this, by doing a partition over success/errors and only retry.

but I'd like to do that separately and get this included without batch retry support

@@ -110,3 +111,65 @@ impl HttpError {
false
}
}

/// Extension trait to implement methods for [`RpcError<TransportErrorKind, E>`].
pub trait RpcErrorExt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this is only implemented for RpcError?

could we at least make this pub(crate)?

Comment on lines 158 to 160
if r.is_error() {
batch_responses.insert(r.id.clone(), r.clone());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks redundant?

Copy link
Member

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yash-atreya yash-atreya merged commit 96e3a84 into main Jul 5, 2024
22 checks passed
@yash-atreya yash-atreya deleted the yash/retry-layer branch July 5, 2024 11:19
ben186 pushed a commit to ben186/alloy that referenced this pull request Jul 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants