This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

Report to Station when the node cannot connect to the network #63

Closed
bajtos opened this issue Aug 30, 2022 · 4 comments
Labels
Station Work related to Filecoin Station

Comments

@bajtos
Contributor

bajtos commented Aug 30, 2022

Detect the situation when the L2 node cannot connect to any L1 node and report the problem back to Station. Let's keep the first version simple:

  • When the app starts or whenever the number of L1 connections drops to zero, we start a timer for 3 seconds.

  • If there is no L1 connection after this timeout, then we log an event about the problem. Proposed message:

    fmt.Print("ERROR: Saturn Node is not able to connect to the network\n")

  • In the current backoff-based retry implementation, we give up connecting after several unsuccessful attempts. When this happens, the L2 node should report the problem to Station.

    fmt.Printf("ERROR: Saturn Node was not able to connect to the network after %v attempts, giving up.\n", l.maxReconnectAttempts)

In both cases, it's important to print the message only once per L2 node. We don't want the message to be printed by each L1 client we have, as that would print each message three times.

Related: #62

@bajtos bajtos added the Station Work related to Filecoin Station label Aug 30, 2022
@juliangruber
Contributor

What about instead of adding a timeout we log this for every attempt?

fmt.Print("ERROR: Saturn Node is not able to connect to the network, retrying...\n")

This gives faster but noisier feedback, with the benefit of a simpler implementation.

@bajtos
Contributor Author

bajtos commented Aug 30, 2022

> What about instead of adding a timeout we log this for every attempt?

I have already tried that and was getting three error messages every now and then. Like nothing happens for a second or two, then three messages appear at once, then there is another pause, and then another three messages, and so on.

I am fine with looking for a simpler solution as long as we can report the problem at the L2-node level, not at the level of every L1 client.

In other words, we can rework the part about the 3-second timeout to use the current backoff retry mechanism, but then we need to report only the first error and not the duplicates that follow soon after it.

That would work in the case where we cannot reach the network at all.

However, it would not work if we can reach only some of the L1 nodes. In that case, we don't want to report connection errors mixed with messages like "Saturn Node is online and connected to 1 peer(s)".

@juliangruber
Contributor

Ah gotcha, I thought the timeout was there to reduce log messages on the L2-node level. Whatever is easiest then 👍

@juliangruber
Contributor

Closed by #68, which replaced #67

3 participants