op-batcher: Randomize ordering of rollupUrl failover #10695
Closed
Description
Updates the ActiveL2RollupProvider to fall back to alternative candidate rollupUrls in a randomized order, instead of always iterating through the list in the same order.
This prevents a subset of bad or misconfigured endpoints from permanently blocking failover, as these endpoints could otherwise consume the entire context timeout on each iteration of ActiveL2RollupProvider.RollupClient().
Using a randomized ordering also allows the rollupUrls set to grow independently of the context deadline, as each new iteration has an equal chance of trying new URLs, instead of always checking the next N.
Tests
No new tests added, but the existing ActiveL2Provider tests have been updated to continue to use a deterministic URL ordering, ensuring that these tests remain easy to maintain.
Additional context
An alternative fix for the above issue is to tune the context timeouts and retry logic to better handle unresponsive endpoints. With current defaults, ActiveL2RollupProvider uses a 1-minute networkTimeout, which is guaranteed to be consumed by a single call to DialRollupClientWithTimeout against an unresponsive endpoint. This means that no other candidate sequencers are checked, which prevents failover even if only 2 consecutive instances are unhealthy.
Metadata