op-batcher: Randomize ordering of rollupUrl failover #10695

BrianBland · 2024-05-30T23:08:05Z

Description

Updates the ActiveL2RollupProvider to fallback to alternative candidate rollupUrls in a randomized order, instead of always iterating through the list in the same order.
This prevents a subset of bad/misconfigured endpoints from permanently blocking failover behavior, as these endpoints could otherwise consume the entire context timeout on each iteration of ActiveL2RollupProvider.RollupClient()

Using a randomized ordering also allows the rollupUrls set to grow independent of the contextual deadline, as each new iteration attempt has an equal chance of trying new URLs, instead of always checking the next N.

Tests

No new tests added, but the existing ActiveL2Provider tests have been updated to continue to use a deterministic URL ordering, ensuring that these tests can remain easy to maintain.

Additional context

An alternative fix for the above issue is to tune the context timeouts and retry logic in order to better handle unresponsive endpoints. With current defaults, ActiveL2RollupProvider uses a 1 minute networkTimeout, which is guaranteed to be consumed by a single call to DialRollupClientWithTimeout. This means that no other candidate sequencers will be checked, preventing failover if only 2 sequential instances are unhealthy.

Metadata

Fixes #[Link to Issue]

ajsutton · 2024-05-31T01:12:48Z

Personally, I'd probably avoid randomised order - randomness just makes things difficult to test and reason about well generally. Plus we could keep randomly picking bad servers so only have probabilistic liveness - likely good enough to be honest, but not ideal.

Could we just keep track of which server we last tried and begin from there on the next request? That way each attempt will make some progress through the list and eventually find a working server. I think we're trying to do that with the way iteration starts from p.rollupIndex but that only gets updated if we find a working server - we'd need to update it once we decide it is currently pointing to an unavailable server as well.

BrianBland · 2024-05-31T01:53:21Z

Personally, I'd probably avoid randomised order - randomness just makes things difficult to test and reason about well generally. Plus we could keep randomly picking bad servers so only have probabilistic liveness - likely good enough to be honest, but not ideal.

Agreed that there's definitely some negative tradeoffs around randomness.

Could we just keep track of which server we last tried and begin from there on the next request? That way each attempt will make some progress through the list and eventually find a working server. I think we're trying to do that with the way iteration starts from p.rollupIndex but that only gets updated if we find a working server - we'd need to update it once we decide it is currently pointing to an unavailable server as well.

One issue is that while we generally encounter a timeout on the first misbehaving endpoint, we technically iterate through each of the remaining endpoints with an expired context. We may need to update this code to terminate on first detection of a context.DeadlineExceeded

ajsutton · 2024-05-31T02:02:21Z

Yeah good point - we should stop trying once the parent context we're given has expired (but not if the network timeout this code applies is reached which may also come back as a DeadlineExceeded).

BrianBland · 2024-05-31T02:47:42Z

@ajsutton opened the following PR as a replacement based on this discussion: #10697

BrianBland force-pushed the randomize-rollup-failover-ordering branch from ca92b66 to cc37bde Compare May 30, 2024 23:08

Randomize ordering of rollupUrl failover

824c7f9

BrianBland force-pushed the randomize-rollup-failover-ordering branch from cc37bde to 824c7f9 Compare May 30, 2024 23:18

BrianBland changed the title ~~Randomize ordering of rollupUrl failover~~ op-proposer: Randomize ordering of rollupUrl failover May 30, 2024

BrianBland marked this pull request as ready for review May 30, 2024 23:55

BrianBland requested a review from a team as a code owner May 30, 2024 23:55

BrianBland requested a review from ajsutton May 30, 2024 23:55

BrianBland changed the title ~~op-proposer: Randomize ordering of rollupUrl failover~~ op-batcher: Randomize ordering of rollupUrl failover May 31, 2024

BrianBland mentioned this pull request May 31, 2024

op-batcher: Do not retry dial attempts in active l2 providers #10696

Merged

BrianBland mentioned this pull request May 31, 2024

op-batcher: Handle DeadlineExceeded in rollup provider failover #10697

Closed

BrianBland closed this May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

op-batcher: Randomize ordering of rollupUrl failover #10695

op-batcher: Randomize ordering of rollupUrl failover #10695

BrianBland commented May 30, 2024 •

edited

Loading

ajsutton commented May 31, 2024

BrianBland commented May 31, 2024

ajsutton commented May 31, 2024

BrianBland commented May 31, 2024

op-batcher: Randomize ordering of rollupUrl failover #10695

op-batcher: Randomize ordering of rollupUrl failover #10695

Conversation

BrianBland commented May 30, 2024 • edited Loading

ajsutton commented May 31, 2024

BrianBland commented May 31, 2024

ajsutton commented May 31, 2024

BrianBland commented May 31, 2024

BrianBland commented May 30, 2024 •

edited

Loading