
feat: Capture errors thrown within Coordinator #515

Merged: 10 commits merged into main from feat/coordinator-retries on Jan 24, 2024

Conversation

@morgsmccauley (Collaborator) commented Jan 17, 2024

Currently, errors thrown within Coordinator V2 will bubble up to main() and cause the entire application to exit. This PR captures those errors, handling them accordingly.

Errors are handled in one of the following ways:

  1. Exponential retry - data which is 'critical' to the control loop, i.e. the indexer registry and the executor/stream lists, will be retried continuously (see the sketch below), blocking the control loop from further progress, as it would not make sense to continue without this information.
  2. Swallowed - failures in actions such as starting/stopping executors/streams will be logged and swallowed. This is preferable to exponential retries, as individual failures will not block the progress of the control loop, allowing other indexers to still be acted on. Skipping should be fine in this case, as the action will be retried in the next loop.

I expect this behaviour to evolve over time as we learn more about the system; the important thing here is that Coordinator will not crash on errors.
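For illustration, here is a minimal sketch of what such an exponential retry helper could look like, assuming a tokio runtime and the tracing crate for logging. The name exponential_retry matches the usage in the diffs below, but the starting delay, backoff shape, and logging are assumptions rather than the actual implementation:

use std::future::Future;
use std::time::Duration;

// Retries `operation` indefinitely, doubling the delay between attempts.
// Illustrative sketch only; the real helper in this PR may differ.
pub async fn exponential_retry<F, Fut, T, E>(operation: F) -> Result<T, E>
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut delay = Duration::from_millis(500); // assumed starting delay

    loop {
        match operation().await {
            Ok(result) => return Ok(result),
            Err(error) => {
                tracing::warn!("Operation failed, retrying in {:?}: {:?}", delay, error);
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
        }
    }
}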

@morgsmccauley linked an issue on Jan 17, 2024 that may be closed by this pull request
@morgsmccauley force-pushed the feat/coordinator-retries branch 3 times, most recently from 50f43ce to ea3a37c on January 21, 2024 21:06
@morgsmccauley changed the title from "feat/coordinator retries" to "feat: Capture errors thrown within Coordinator" on Jan 21, 2024
pub fn connect(block_streamer_url: String) -> anyhow::Result<Self> {
    let channel = Channel::from_shared(block_streamer_url)
        .context("Block Streamer URL is invalid")?
        .connect_lazy();
Collaborator Author:

Defer connection to when calls are made. This allows us to consolidate retries (in the code below) rather than also having retry logic here.

Collaborator:

I see. So basically when we make the call, the channel connection AND the call success is within the same retry loop?

Collaborator Author:

Yes, that's correct
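To spell that out: because the channel is created with connect_lazy(), the underlying connection is only established when an RPC is actually issued, so wrapping each call in the retry helper covers connection failures and call failures together. A rough sketch using the list() method shown further down in this PR; the RPC method and request type names (list_streams, ListStreamsRequest) are assumptions for illustration:

pub async fn list(&self) -> anyhow::Result<Vec<StreamInfo>> {
    exponential_retry(|| async {
        // The lazy channel (re)connects on demand when this call is made, so
        // connection errors surface here and are retried along with call errors.
        let response = self
            .client
            .clone()
            .list_streams(Request::new(ListStreamsRequest {}))
            .await
            .context("Failed to list streams")?;

        Ok(response.into_inner().streams)
    })
    .await
}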

    synchronise_executors(&indexer_registry, &executors_handler),
    synchronise_block_streams(&indexer_registry, &redis_client, &block_streams_handler),
    async {
        sleep(CONTROL_LOOP_THROTTLE_SECONDS).await;
Collaborator Author:

Forcing a minimum loop duration
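To make the pattern explicit: the synchronisation futures are joined with a sleep, so each control-loop iteration takes at least CONTROL_LOOP_THROTTLE_SECONDS even when the sync steps finish quickly. Roughly (the choice of tokio::join! here is an assumption; only the joined futures are visible in the diff):

loop {
    // ... fetch the indexer registry, etc. (elided) ...

    tokio::join!(
        synchronise_executors(&indexer_registry, &executors_handler),
        synchronise_block_streams(&indexer_registry, &redis_client, &block_streams_handler),
        async {
            // Runs concurrently with the sync steps above, so an iteration
            // never completes faster than CONTROL_LOOP_THROTTLE_SECONDS.
            sleep(CONTROL_LOOP_THROTTLE_SECONDS).await;
        },
    );
}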

exponential_retry(|| async {
    let response = self
        .client
        .clone()
Collaborator Author:

Cloning a gRPC/tonic client is cheap and allows us to avoid holding a mutable reference - https://docs.rs/tonic/latest/tonic/transport/channel/struct.Channel.html#multiplexing-requests
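For context, the RPC methods on a tonic-generated client take &mut self, so without a clone the wrapper method would itself need &mut self. Since a Channel (and the generated client around it) is a cheap handle whose clones multiplex requests over the same underlying HTTP/2 connection, a per-call clone keeps the wrapper at &self. A small illustration based on the stop_stream call in this diff; StopStreamRequest and its field are assumptions:

pub async fn stop(&self, stream_id: String) -> anyhow::Result<()> {
    // Cloning copies the Channel handle, not the connection; all clones
    // share and multiplex the same underlying HTTP/2 connection.
    let mut client = self.client.clone();

    client
        .stop_stream(Request::new(StopStreamRequest { stream_id }))
        .await
        .context("Failed to stop stream")?;

    Ok(())
}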

@morgsmccauley marked this pull request as ready for review January 21, 2024 21:25
@morgsmccauley requested a review from a team as a code owner January 21, 2024 21:25
@morgsmccauley force-pushed the feat/coordinator-retries branch from ea3a37c to 6a7c208 on January 21, 2024 21:25
pub async fn list(&self) -> anyhow::Result<Vec<StreamInfo>> {
    exponential_retry(|| async {
        // ...
        Ok(response.into_inner().streams)
Collaborator:

Does exponential retry have a limit, or does it increase forever? Might be good to have an upper limit so that we don't necessarily have to restart Coordinator too if the error was, say, on the Block Streamer or Runner side.

Collaborator Author:

No, it will just increase forever. Good call, we should cap the delay seconds; I'll do that in a follow-up PR.
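For reference, capping the backoff could be as simple as clamping the doubled delay. Relative to the retry sketch near the top of this PR, that would look roughly like the following, where the 60-second cap is purely illustrative:

const MAX_RETRY_DELAY: Duration = Duration::from_secs(60); // illustrative cap

// Inside the retry loop, clamp the growth instead of doubling unbounded:
tokio::time::sleep(delay).await;
delay = (delay * 2).min(MAX_RETRY_DELAY);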

    .clone()
    .stop_stream(Request::new(request.clone()))
    .await
    .map_err(|e| {
Collaborator:

What happens if a stop fails, is swallowed, and the subsequent start (e.g. an indexer update) is successful? Is there a mechanism in block stream to prevent duplicates? It may be worth skipping with continue if we don't have any duplication mechanisms in place.

Collaborator Author:

Block Streamer service will throw an error if we try to start the same stream for a given indexer

@morgsmccauley merged commit aac2273 into main on Jan 24, 2024
4 checks passed
@morgsmccauley deleted the feat/coordinator-retries branch on January 24, 2024 20:43
@darunrs mentioned this pull request on Feb 1, 2024
Successfully merging this pull request may close these issues: Create Control service