
The F3 <-> EC fork issue #718

Open
Stebalien opened this issue Oct 21, 2024 · 14 comments

@Stebalien
Member

Stebalien commented Oct 21, 2024

There's a theoretical issue where, if F3 takes a long time to finalize a tipset, it might cause long-range forks in EC. We're alleviating this by:

  1. Never trying to finalize the current head (Set the default EC lookback to 4 #716).
  2. Preventing clients (e.g., lotus) from accepting finality certificates that would revert beyond EC finality (Prevent very large reverts based on finality certificates #717).
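A minimal sketch of the guard in mitigation (2), in Python with illustrative names (this is not the actual lotus/go-f3 API): reject any finality certificate whose chain diverges from the local head further back than EC finality.

```python
EC_FINALITY = 900  # epochs; EC treats tipsets older than this as final

def would_revert_too_far(head_epoch: int, fork_point_epoch: int,
                         ec_finality: int = EC_FINALITY) -> bool:
    # A certificate finalizing a chain that diverges from our head at
    # fork_point_epoch forces a revert of (head - fork_point) epochs.
    # Reject it if that revert would reach past EC finality.
    return head_epoch - fork_point_epoch > ec_finality
```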

However, the issue still persists. The core of the issue is that:

  1. F3 gets a chain from EC.
  2. F3 can spend an arbitrary amount of time trying to finalize it.
  3. In the meantime, EC forks away from the head F3 ends up deciding on.

To fix this, we likely need some way for F3 to discard the current proposal (if too old) and get a new one from the client. However, this is tricky to implement in the current GPBFT protocol without breaking the liveness guarantees.
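The "discard if too old" idea could look roughly like the following sketch (illustrative names and structure, assuming the client exposes a way to fetch a fresh chain; not the actual GPBFT implementation):

```python
def refresh_proposal(proposal_epoch: int, head_epoch: int, max_age: int,
                     fetch_new_proposal):
    # If the chain F3 is trying to finalize has fallen too far behind EC's
    # head, discard it and ask the client for a fresh proposal. The hard
    # part (not shown) is doing this without breaking GPBFT's liveness.
    if head_epoch - proposal_epoch > max_age:
        return fetch_new_proposal()
    return proposal_epoch
```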

There are really two parts to this issue:

  1. Reducing the likelihood of long-range (10+ epochs) forks (to avoid breaking client assumptions).
  2. Preventing forks beyond EC finality.

However, the catch is that nobody can emit two decide messages for the same instance without potentially breaking GPBFT. But there are also certain decisions that are simply unacceptable.

@Stebalien
Member Author

The actual solution may have multiple phases:

  1. In phase one, we operate normally and try to finalize any valid chain.
  2. In phase two, we try to avoid phase 1 by somehow skewing towards the heavier chain?
  3. In phase three, we go back to trying to finalize any valid chain. We're accepting the fact that we're likely going to have a long-range fork.

But it looks like any solution will have to involve feedback between GPBFT and EC:

  1. GPBFT needs to know when it's taking too long. In that case, we want to decide on base ASAP so we can get a new proposal.
  2. EC should maybe consider switching chains based on GPBFT. E.g., if we see a quorum of quality messages for some prefix, we may want to eagerly switch to that chain because it'll likely be finalized.
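Point (2) above could be sketched like this (Python, illustrative names, assuming QUALITY power for a prefix is observable; GPBFT's strong quorum is more than 2/3 of power):

```python
def ec_should_switch(quality_power_for_prefix: float, total_power: float,
                     building_on_prefix: bool) -> bool:
    # If a strong quorum (>2/3 of power) sent QUALITY messages for a prefix
    # we are not building on, that prefix will likely be finalized, so EC
    # may want to switch to it eagerly.
    strong_quorum = 2 * total_power / 3
    return quality_power_for_prefix > strong_quorum and not building_on_prefix
```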

@jennijuju
Member

EC should maybe consider switching chains based on GPBFT.

If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are currently finalizing - so shouldn't this be the other way around -> F3 should maybe consider switching chains based on EC?

@jennijuju
Member

What happens today if a node receives a finalized set of blocks that doesn't match the longest EC chain?

@Stebalien
Member Author

If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are currently finalizing - so shouldn't this be the other way around -> F3 should maybe consider switching chains based on EC?

Both.

  1. GPBFT should switch if it's taking too long and trying to finalize something EC isn't building on.
  2. EC should try to build on what GPBFT is likely to finalize to reduce the chances of (1) being an issue (and to increase the chances of building on the right chain).

@Stebalien
Member Author

What happens today if a node receives a finalized set of blocks that doesn't match the longest EC chain?

We switch to the F3 finalized chain no matter what.

@vukolic

vukolic commented Oct 22, 2024

It was by design that in this case EC should win and GPBFT should simply halt. There was a long discussion and we decided to prefer EC availability over GPBFT consistency in this case. So if GPBFT does not finalize in 900 epochs then EC takes over again.

@Stebalien
Member Author

That's the current plan but...

  1. Forks shorter than 900 epochs are still an issue.
  2. If we have a network incident, we don't want to have to worry about breaking F3. This would especially be an issue once we get trustless bridges.

@Stebalien
Member Author

Ideas from a discussion with @anorth:

  1. We can finalize F3 when we switch to the decide phase instead of waiting for the certificate. Implementing this is a bit complex because it relies on the GPBFT state and not just the certificate store, but it shouldn't be too difficult and doesn't affect the protocol.
  2. We can bias base in converge if the fork is too long. This is a protocol change but it should be fine (we'll need to discuss it more). It won't save us if we get stuck in prepare/converge, but it'll reduce the chances of us forking.
  3. As discussed in Set the default EC lookback to 4 #716, we can increase the EC lookback. But, probably by more than 1.
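Idea (2) could be sketched as a value-selection rule in converge (Python, illustrative names; the real proposal would need careful analysis against GPBFT's safety and liveness arguments):

```python
def converge_value(candidate: str, base: str, fork_length: int,
                   max_fork_length: int) -> str:
    # Bias toward base: if finalizing `candidate` would create a fork longer
    # than max_fork_length epochs, propose `base` instead so the instance
    # decides quickly and a fresh proposal can be fetched.
    return base if fork_length > max_fork_length else candidate
```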

Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're underway and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff needed to reach consensus, which could cause us to get stuck in various phases, leading to these long-range forks.

@vukolic

vukolic commented Oct 23, 2024

This is a good point. I suggest the following

If there is a fork of EC while F3 finalizes - reject the F3 finalization (do not apply it to EC) and restart a new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.

Let me know if you get what I am suggesting - and if not, I can describe it in more detail.

@vukolic

vukolic commented Oct 23, 2024

Ideas from a discussion with @anorth:

1. We can finalize F3 when we switch to the decide phase instead of waiting for the certificate. Implementing this is a bit complex because it relies on the GPBFT state and not just the certificate store, but it shouldn't be too difficult and doesn't affect the protocol.

2. We can bias base in converge if the fork is too long. This _is_ a protocol change but it _should_ be fine (we'll need to discuss it more). It won't save us if we get stuck in prepare/converge, but it'll reduce the chances of us forking.

3. As discussed in [Set the default EC lookback to 4 #716](https://github.com/filecoin-project/go-f3/issues/716), we can increase the EC lookback. But, probably by more than 1.

Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're under way and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff to reach consensus which could cause us to get stuck in various phases, leading to these long-range forks.

Please avoid protocol changes. This issue has nothing to do with GPBFT as a protocol and would appear in any finalization protocol; hence the solution should not be sought in changing GPBFT.

@vukolic

vukolic commented Oct 23, 2024

If there is a fork of EC while F3 finalizes - reject the F3 finalization (do not apply it to EC) and restart a new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.

This said, EC must commit not to fork before the last F3 finalization point - this is in the F3 specification. Otherwise there is no point in calling F3 a finalization protocol...

@Kubuxu
Contributor

Kubuxu commented Oct 23, 2024

reject F3 finalization (do not apply it to EC) and restart new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.

How would that look? At that point, F3/GPBFT has produced a finality certificate that Filecoin rejects due to consensus rules, but that rejection is not observable to consumers of only the certificate chain. Thus, it would require not a new instance but a new F3 certificate chain/network.

@hanabi1224
Contributor

hanabi1224 commented Oct 24, 2024

A side note: I ran into an issue where F3 tries to finalize tipsets that are newer than the EC head. To reproduce:

  • run Forest with F3 sidecar
  • sleep the machine
  • wake up after a few hours

This seems to happen when certexchange receives a certificate that contains a newer EC head while the node is still catching up.

@Stebalien
Member Author

That's expected and something that needs to be handled, unfortunately. On the bright side, it makes syncing easier (you now have a guaranteed sync target).
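The "guaranteed sync target" handling could be as simple as this sketch (Python, illustrative names; a real node would also validate the certificate before trusting it):

```python
def sync_target(local_head_epoch: int, cert_head_epoch: int) -> int:
    # A finality certificate ahead of the local head is not an error during
    # catch-up: it gives the syncing node a guaranteed target to sync to.
    return max(local_head_epoch, cert_head_epoch)
```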
