
historyarchive: Add round-robin, error-resilience, and back-off to the ArchivePool #5224

Merged
merged 21 commits into stellar:master from pool-hardening
Mar 7, 2024

Conversation

Shaptic (Contributor) commented Feb 27, 2024

What

This adds two main components to the ArchivePool implementation that significantly improve the resilience of the interface:

  • Previously, each method call was simply forwarded to a randomly chosen archive in the pool. Now, the pool (persistently) round-robins through its archives with a constant back-off (250ms per archive) and up to 3 retries, so at most 4 total queries per call in the worst case.

Why

Closes #5167.

Known limitations

For feedback purposes, I haven't actually implemented this for every ArchiveInterface method as I'm seeking feedback prior to writing a bunch of repetitive code I may have to wipe out later.

@Shaptic Shaptic requested review from tamirms, urvisavla, a team and sreuland February 27, 2024 00:13
tamirms (Contributor) commented Feb 27, 2024

The checkpoint change reader retries history archive operations here:

https://github.com/stellar/go/blob/master/ingest/checkpoint_change_reader.go#L132-L142

We should get rid of the retry logic in the above code because there's no use in having retries implemented in multiple layers.

tamirms (Contributor) commented Feb 27, 2024

I think we should start with a simpler implementation (e.g. retry at most n times in a round-robin fashion). Before adding extra complexity in the retry logic, I think we should understand what metrics we're specifically trying to improve. It's hard to judge whether we should implement a more sophisticated retry algorithm unless we know what metrics we're trying to impact.

@mollykarcher mollykarcher added this to the Sprint 44 milestone Feb 27, 2024
Shaptic (Author) commented Feb 27, 2024

We should get rid of the retry logic in the above code because there's no use in having retries implemented in multiple layers.

While I agree in practice, I don't think we should project behavior from one layer to another. For example, a user of the ingest package may choose to only use a single archive and yet still want retries.

Shaptic (Author) commented Feb 27, 2024

I think we should start with a simpler implementation (e.g. retry at most n times in a round-robin fashion).

Yeah, I generally agree, @tamirms. And I don't think we have any metrics that we're trying to fight against, per se: it's more that we wanted the ArchivePool to retry automatically so that we would get resilience "for free" in every place that uses the pool. And I thought that if you're going to add retries, you might as well add back-offs, since we've seen issues with a lack of back-off before in other places.

However, I think keeping it simple is a totally fine call.

tamirms (Contributor) commented Feb 27, 2024

a user of the ingest package may choose to only use a single archive and yet still want retries.

this is a good point. would it make sense to have another archive wrapper which solely encapsulates the retry / backoff logic? then we could enable both the pooling and retry properties by composing the ArchivePool with the RetryArchive wrapper.

And I thought that if you're going to add retries, you might as well add back-offs, since we've seen issues with a lack of back-off before in other places.

I think it's ok to include back-offs but I'm not sure that we should have something sophisticated which tries to be clever about adjusting the backoff based on the estimated reliability of each archive in the pool.

Shaptic (Author) commented Feb 27, 2024

would it make sense to have another archive wrapper which solely encapsulates the retry / backoff logic

It's a good idea, but I think no: retrying across the pool is better than retrying on each individual archive (in the sense that it is likelier to result in success), so we need the pool-level retries first. If we did add a RetryArchive, we'd still need this PR.

adjusting the backoff based on the estimated reliability of each archive in the pool

The back-off is linear in the number of consecutive errors; there's no other adjustment. Before the latest push, we were simply preferring archives with a lower back-off.


@tamirms I dropped the back-off related code in 75e50a1 and just kept a simple round-robin retry approach so that we could evaluate it. Do you think we should keep it simple and just start there?

tamirms (Contributor) commented Feb 28, 2024

retrying across the pool is better than retrying on each individual archive (in the sense that it is likelier to result in success), so we need the pool-level retries first. If we did add a RetryArchive, we'd still need this PR.

if you compose RetryArchive with PoolArchive you would achieve the desired retry behavior.

Assume PoolArchive cycles through different urls for each invocation in a round robin fashion. You can wrap the PoolArchive in a RetryArchive and, when the RetryArchive invokes the underlying PoolArchive again during retries, the PoolArchive's round robin behavior will ensure that a different history archive url is selected during the retried request.
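A minimal sketch of the composition being proposed (hypothetical types; the real ArchiveInterface has many more methods):

```go
package main

import (
	"errors"
	"fmt"
)

// Archive stands in for the ArchiveInterface; one method for brevity.
type Archive interface {
	GetRootHAS() (string, error)
}

// poolArchive cycles through its backends round-robin on every call.
type poolArchive struct {
	backends []Archive
	curr     int
}

func (p *poolArchive) GetRootHAS() (string, error) {
	a := p.backends[p.curr%len(p.backends)]
	p.curr++
	return a.GetRootHAS()
}

// retryArchive wraps any Archive and retries failed calls. Composed
// over a poolArchive, each retry naturally lands on a different URL.
type retryArchive struct {
	inner   Archive
	retries int
}

func (r *retryArchive) GetRootHAS() (string, error) {
	var lastErr error
	for i := 0; i <= r.retries; i++ {
		s, err := r.inner.GetRootHAS()
		if err == nil {
			return s, nil
		}
		lastErr = err
	}
	return "", lastErr
}

type flaky struct{}

func (flaky) GetRootHAS() (string, error) { return "", errors.New("archive down") }

type healthy struct{}

func (healthy) GetRootHAS() (string, error) { return "root-has", nil }

func main() {
	pool := &poolArchive{backends: []Archive{flaky{}, healthy{}}}
	a := &retryArchive{inner: pool, retries: 2}
	s, err := a.GetRootHAS()
	fmt.Println(s, err) // prints "root-has <nil>": the retry falls through to the healthy backend
}
```

The retry wrapper never needs to know about the pool: advancing the round-robin cursor on every call is what guarantees a retried request hits a different archive.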

tamirms (Contributor) commented Feb 28, 2024

The back-off is linear with the number of consecutive errors; there's no other adjustment. Here, we were (i.e. before the latest push) just preferring archives with a lower back-off.

I think having a constant back-off should be simple and effective.

Shaptic (Author) commented Mar 5, 2024

@tamirms I've simplified the design somewhat in line with how we discussed it, but not to the same level of depth. Rather than designing a RetriableArchive (which I think would be better scoped in a separate issue relative to #5167), I just used the backoff.Retry you referenced to back off the round-robin loop as a whole.

This isn't as... equitable? as the previous solution (which distributed back-offs across each individual archive), since it backs off the entire round-robin for a single call, but I think it's still a substantial improvement over what we had before, and so it achieves the goals in #5167.

It also classifies context errors as permanent like you advised. PTA(nother)L before I do it for the remaining methods!

@Shaptic Shaptic requested a review from tamirms March 5, 2024 21:06
tamirms (Contributor) commented Mar 5, 2024

@Shaptic this approach looks good to me!

@Shaptic Shaptic changed the title [wip] historyarchive: Add round-robin, error-resilience, and back-off to the ArchivePool. historyarchive: Add round-robin, error-resilience, and back-off to the ArchivePool Mar 5, 2024
@Shaptic Shaptic marked this pull request as ready for review March 5, 2024 23:53
@Shaptic Shaptic requested review from tamirms and a team March 5, 2024 23:53
MonsieurNicolas (Contributor) commented
Out of curiosity, why is random (assuming a uniform distribution) worse than round-robin? The linked issue seems to imply that either would work as long as failure information is taken into account.

Shaptic (Author) commented Mar 6, 2024

@MonsieurNicolas yeah, you're right that there's nothing inherently better about either, but because we want retries, it's better to have certainty that we won't reuse the same archive again (which is of course doable with random selection too, but is cleaner with round-robin).

tamirms (Contributor) left a comment

🎉

MonsieurNicolas (Contributor) commented

@MonsieurNicolas yeah you're right that there's nothing inherently better about either, but because we want retries then it's better to have certainty that we won't reuse the same archive again (which is of course doable with random selection, too, but is cleaner with round robin).

makes sense. we just need to make sure to shuffle the list on startup so that you don't introduce a bias towards the first archive across deployments/restarts

Shaptic (Author) commented Mar 7, 2024

@MonsieurNicolas yep, great call-out! that's sort of being done with this line:

ap.curr = rand.Intn(len(ap.pool)) // don't necessarily start at zero

which effectively achieves the same thing: a random starting point for the round robin.

Shaptic (Author) commented Mar 7, 2024

Hmm, thinking about that more: a random starting point is definitely not the same level of randomness as a shuffle when it comes to traversal (you could argue it still biases towards groups of archives, since most people will pass them in order). But since the goal is to alleviate concentrating load on a particular group during restarts, the same end goal is achieved either way.

@Shaptic Shaptic merged commit ab3a926 into stellar:master Mar 7, 2024
29 checks passed
@Shaptic Shaptic deleted the pool-hardening branch March 7, 2024 23:28
Merging this pull request closed the linked issue: historyarchive: Harden the ArchivePool to be robust against errors.