
feat: Capture errors thrown within Coordinator #515

Merged: 10 commits merged into main from feat/coordinator-retries on Jan 24, 2024

Conversation

@morgsmccauley (Collaborator) commented Jan 17, 2024

Currently, errors thrown within Coordinator V2 will bubble up to main() and cause the entire application to exit. This PR captures those errors, handling them accordingly.

Errors are handled in one of the following ways:

  1. Exponential retry - data which is 'critical' to the control loop, i.e. the indexer registry and the executor/stream lists, will be retried continuously (see the sketch below), blocking the control loop from further progress, as it would not make sense to continue without this information.
  2. Swallowed - failures in actions such as starting/stopping executors/streams will be logged and swallowed. This is preferable to exponential retries, as individual failures will not block the progress of the control loop, allowing other indexers to still be acted on. Skipping should be fine in this case, as the action will be retried in the next loop.

I expect this behaviour to evolve over time as we learn more about the system; the important thing here is that Coordinator will not crash on errors.
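For illustration, here is a minimal sketch of what such an exponential retry helper could look like, assuming a tokio runtime and the tracing crate for logging. The name exponential_retry matches the usage in the diffs below, but the starting delay, backoff shape, and logging are assumptions rather than the actual implementation:

use std::future::Future;
use std::time::Duration;

// Retries `operation` indefinitely, doubling the delay between attempts.
// Illustrative sketch only; the real helper in this PR may differ.
pub async fn exponential_retry<F, Fut, T, E>(operation: F) -> Result<T, E>
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut delay = Duration::from_millis(500); // assumed starting delay

    loop {
        match operation().await {
            Ok(result) => return Ok(result),
            Err(error) => {
                tracing::warn!("Operation failed, retrying in {:?}: {:?}", delay, error);
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
        }
    }
}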

@morgsmccauley linked an issue on Jan 17, 2024 that may be closed by this pull request
@morgsmccauley force-pushed the feat/coordinator-retries branch 3 times, most recently from 50f43ce to ea3a37c on January 21, 2024 21:06
@morgsmccauley changed the title from "feat/coordinator retries" to "feat: Capture errors thrown within Coordinator" on Jan 21, 2024
pub fn connect(block_streamer_url: String) -> anyhow::Result<Self> {
    let channel = Channel::from_shared(block_streamer_url)
        .context("Block Streamer URL is invalid")?
        .connect_lazy();
Collaborator Author:

Defer connection to when calls are made. This allows us to consolidate retries (in the code below) rather than also having retry logic here.

Collaborator:

I see. So basically when we make the call, the channel connection AND the call success is within the same retry loop?

Collaborator Author:

Yes, that's correct
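To spell that out: because the channel is created with connect_lazy(), the underlying connection is only established when an RPC is actually issued, so wrapping each call in the retry helper covers connection failures and call failures together. A rough sketch using the list() method shown further down in this PR; the RPC method and request type names (list_streams, ListStreamsRequest) are assumptions for illustration:

pub async fn list(&self) -> anyhow::Result<Vec<StreamInfo>> {
    exponential_retry(|| async {
        // The lazy channel (re)connects on demand when this call is made, so
        // connection errors surface here and are retried along with call errors.
        let response = self
            .client
            .clone()
            .list_streams(Request::new(ListStreamsRequest {}))
            .await
            .context("Failed to list streams")?;

        Ok(response.into_inner().streams)
    })
    .await
}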

    synchronise_executors(&indexer_registry, &executors_handler),
    synchronise_block_streams(&indexer_registry, &redis_client, &block_streams_handler),
    async {
        sleep(CONTROL_LOOP_THROTTLE_SECONDS).await;
Collaborator Author:

Forcing a minimum loop duration
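To make the pattern explicit: the synchronisation futures are joined with a sleep, so each control-loop iteration takes at least CONTROL_LOOP_THROTTLE_SECONDS even when the sync steps finish quickly. Roughly (the choice of tokio::join! here is an assumption; only the joined futures are visible in the diff):

loop {
    // ... fetch the indexer registry, etc. (elided) ...

    tokio::join!(
        synchronise_executors(&indexer_registry, &executors_handler),
        synchronise_block_streams(&indexer_registry, &redis_client, &block_streams_handler),
        async {
            // Runs concurrently with the sync steps above, so an iteration
            // never completes faster than CONTROL_LOOP_THROTTLE_SECONDS.
            sleep(CONTROL_LOOP_THROTTLE_SECONDS).await;
        },
    );
}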

exponential_retry(|| async {
    let response = self
        .client
        .clone()
Collaborator Author:

Cloning a gRPC/tonic client is cheap and allows us to avoid holding a mutable reference - https://docs.rs/tonic/latest/tonic/transport/channel/struct.Channel.html#multiplexing-requests
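For context, the RPC methods on a tonic-generated client take &mut self, so without a clone the wrapper method would itself need &mut self. Since a Channel (and the generated client around it) is a cheap handle whose clones multiplex requests over the same underlying HTTP/2 connection, a per-call clone keeps the wrapper at &self. A small illustration based on the stop_stream call in this diff; StopStreamRequest and its field are assumptions:

pub async fn stop(&self, stream_id: String) -> anyhow::Result<()> {
    // Cloning copies the Channel handle, not the connection; all clones
    // share and multiplex the same underlying HTTP/2 connection.
    let mut client = self.client.clone();

    client
        .stop_stream(Request::new(StopStreamRequest { stream_id }))
        .await
        .context("Failed to stop stream")?;

    Ok(())
}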

@morgsmccauley marked this pull request as ready for review January 21, 2024 21:25
@morgsmccauley requested a review from a team as a code owner January 21, 2024 21:25
@morgsmccauley force-pushed the feat/coordinator-retries branch from ea3a37c to 6a7c208 on January 21, 2024 21:25
pub async fn list(&self) -> anyhow::Result<Vec<StreamInfo>> {
    exponential_retry(|| async {
        // ...
        Ok(response.into_inner().streams)
Collaborator:

Does exponential retry have a limit, or does it increase forever? Might be good to have an upper limit so that we don't necessarily have to restart Coordinator too if the error was, say, on the Block Streamer or Runner side.

Collaborator Author:

No, it will just increase forever. Good call, we should cap the delay seconds; I'll do that in a follow-up PR.
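For reference, capping the backoff could be as simple as clamping the doubled delay. Relative to the retry sketch near the top of this PR, that would look roughly like the following, where the 60-second cap is purely illustrative:

const MAX_RETRY_DELAY: Duration = Duration::from_secs(60); // illustrative cap

// Inside the retry loop, clamp the growth instead of doubling unbounded:
tokio::time::sleep(delay).await;
delay = (delay * 2).min(MAX_RETRY_DELAY);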

    .clone()
    .stop_stream(Request::new(request.clone()))
    .await
    .map_err(|e| {
Collaborator:

What happens if a stop fails, is swallowed, and the subsequent start (e.g. an indexer update) is successful? Is there a mechanism in block stream to prevent duplicates? It may be worth skipping with continue if we don't have any duplication mechanisms in place.

Collaborator Author:

Block Streamer service will throw an error if we try to start the same stream for a given indexer

@morgsmccauley merged commit aac2273 into main on Jan 24, 2024
4 checks passed
@morgsmccauley deleted the feat/coordinator-retries branch on January 24, 2024 20:43
@darunrs mentioned this pull request on Feb 1, 2024
Successfully merging this pull request may close these issues: Create Control service