Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op-batcher: Randomize ordering of rollupUrl failover #10695

Conversation

BrianBland
Copy link
Contributor

@BrianBland BrianBland commented May 30, 2024

Description

Updates the ActiveL2RollupProvider to fallback to alternative candidate rollupUrls in a randomized order, instead of always iterating through the list in the same order.
This prevents a subset of bad/misconfigured endpoints from permanently blocking failover behavior, as these endpoints could otherwise consume the entire context timeout on each iteration of ActiveL2RollupProvider.RollupClient()

Using a randomized ordering also allows the rollupUrls set to grow independent of the contextual deadline, as each new iteration attempt has an equal chance of trying new URLs, instead of always checking the next N.

Tests

No new tests added, but the existing ActiveL2Provider tests have been updated to continue to use a deterministic URL ordering, ensuring that these tests can remain easy to maintain.

Additional context

An alternative fix for the above issue is to tune the context timeouts and retry logic in order to better handle unresponsive endpoints. With current defaults, ActiveL2RollupProvider uses a 1 minute networkTimeout, which is guaranteed to be consumed by a single call to DialRollupClientWithTimeout. This means that no other candidate sequencers will be checked, preventing failover if only 2 sequential instances are unhealthy.

Metadata

  • Fixes #[Link to Issue]

@BrianBland BrianBland force-pushed the randomize-rollup-failover-ordering branch from ca92b66 to cc37bde Compare May 30, 2024 23:08
@BrianBland BrianBland force-pushed the randomize-rollup-failover-ordering branch from cc37bde to 824c7f9 Compare May 30, 2024 23:18
@BrianBland BrianBland changed the title Randomize ordering of rollupUrl failover op-proposer: Randomize ordering of rollupUrl failover May 30, 2024
@BrianBland BrianBland marked this pull request as ready for review May 30, 2024 23:55
@BrianBland BrianBland requested a review from a team as a code owner May 30, 2024 23:55
@BrianBland BrianBland requested a review from ajsutton May 30, 2024 23:55
@BrianBland BrianBland changed the title op-proposer: Randomize ordering of rollupUrl failover op-batcher: Randomize ordering of rollupUrl failover May 31, 2024
@ajsutton
Copy link
Contributor

Personally, I'd probably avoid randomised order - randomness just makes things difficult to test and reason about well generally. Plus we could keep randomly picking bad servers so only have probabilistic liveness - likely good enough to be honest, but not ideal.

Could we just keep track of which server we last tried and begin from there on the next request? That way each attempt will make some progress through the list and eventually find a working server. I think we're trying to do that with the way iteration starts from p.rollupIndex but that only gets updated if we find a working server - we'd need to update it once we decide it is currently pointing to an unavailable server as well.

@BrianBland
Copy link
Contributor Author

Personally, I'd probably avoid randomised order - randomness just makes things difficult to test and reason about well generally. Plus we could keep randomly picking bad servers so only have probabilistic liveness - likely good enough to be honest, but not ideal.

Agreed that there's definitely some negative tradeoffs around randomness.

Could we just keep track of which server we last tried and begin from there on the next request? That way each attempt will make some progress through the list and eventually find a working server. I think we're trying to do that with the way iteration starts from p.rollupIndex but that only gets updated if we find a working server - we'd need to update it once we decide it is currently pointing to an unavailable server as well.

One issue is that while we generally encounter a timeout on the first misbehaving endpoint, we technically iterate through each of the remaining endpoints with an expired context. We may need to update this code to terminate on first detection of a context.DeadlineExceeded

@ajsutton
Copy link
Contributor

Yeah good point - we should stop trying once the parent context we're given has expired (but not if the network timeout this code applies is reached which may also come back as a DeadlineExceeded).

@BrianBland
Copy link
Contributor Author

@ajsutton opened the following PR as a replacement based on this discussion: #10697

@BrianBland BrianBland closed this May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants