Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminal Total Difficulty calculation may differ between nodes #2603

Closed
ajsutton opened this issue Sep 16, 2021 · 10 comments
Closed

Terminal Total Difficulty calculation may differ between nodes #2603

ajsutton opened this issue Sep 16, 2021 · 10 comments

Comments

@ajsutton
Copy link
Contributor

Currently the merge transition process (https://github.com/ethereum/consensus-specs/blob/dev/specs/merge/fork.md#initializing-transition-store) computes the terminal total difficulty dynamically as part of process_slot into the first slot of MERGE_FORK_EPOCH and only if "the transition store has not already been initialized". The transition total difficulty is then stored outside of the BeaconState so is not fork aware or part of consensus (until the terminal PoW block is actually reached).

There are a couple of ways that nodes may wind up with an incorrect terminal total difficulty using the current spec.

Problem Sources

Checkpoint Sync

If a node uses checkpoint sync with a state between MERGE_FORK_EPOCH and the terminal PoW block being included, it will never initialise the transition store because the state.slot % SLOTS_PER_EPOCH == 0, compute_epoch_at_slot(state.slot) == MERGE_FORK_EPOCH condition will never be met (the state is already past that point). Such a node will then reject any block that includes a non-empty ExecutionPayload and fail to follow the chain when the merge completes.

It has taken quite a bit of time to establish ways for users to easily access a BeaconState to use as part of checkpoint sync, so requiring the BeaconState plus the TTD value to be supplied would be a significant step backwards in usability and the progression away from always replaying every block to sync the beacon chain.

Clients supporting checkpoint sync would need to release an update once the TTD is known to hard code it in the client to resolve this.

Forks

As the transition store is initialised the first time a node processes slots into the merge epoch, it's possible for different nodes to use states from different forks when setting the TTD. This is further complicated because many beacon chain clients pre-process slot and epoch transitions as a performance optimisation.

To give a concrete example, lets say MERGE_FORK_EPOCH=100 and thus slot 3199 is the last slot before the merge fork activates and slot 3200 the first one after it. And also assume epoch 100 is the end of the eth1 voting period (probably not actually true but makes numbers nice and round).

At slot 3198, the Eth1Data voting process is one vote short of agreeing on the new Eth1Data.
At slot 3199, a block is published to a minority of nodes which provides that last required vote.
Nodes that receive block 3199 and pre-process the epoch transition, will now initialize their transition store because they're following a fork where block 3199 is included on the chain.
Then block 3200 is received which uses block 3198 as its parent and it is accepted as the canonical chain, orphaning block 3199. Thus the canonical Eth1Data is the old one because 50% was never reached.

At this point there's no visible consensus split - all nodes will reorg over to block 3200 and use the old Eth1Data as canonical for including deposits etc. However the minority of nodes which did receive block 3199 will have initialised TransitionStore based on the new Eth1Data, calculating a different terminal total difficulty to the other nodes. When the earlier of the terminal difficulties is reach the chain split will be revealed as different nodes will expect a different terminal PoW block.

Another issue that arises from not having the TTD stored outside of the BeaconState and thus not being fork aware, is that clients may separate processing empty slots from applying a block. In that case even receiving an invalid block could still cause the TTD to be set because the process_slots operation was performed independently and wouldn't be rolled back when the block processing fails. There's a large number of variations of this kind of problem including REST APIs that could cause a beacon node to process empty slots and thus trigger initialisation of the transition store.

Fundamentally, introducing side effects to process_slots is a very significant change to the design assumptions for beacon node clients.

Possible Solutions

Store TTD in BeaconState

We could remove the side effects from process_slots and restore the ability to process any block given only the pre-state but adding a field to BeaconState which stores the calculated TTD. This would be set as part of the upgrade_to_merge function. It's still slightly awkward that the total difficulty for the Eth1Data block would need to be known and isn't available from the state but that's readily available from the execution engine and is only required for this special state transition.

The downside is this would be a new field that would then be removed again in the next consensus upgrade.

It would also be effectively impossible to override the TTD via an API as it would require an irregular state transition.

Initialize TransitionStore on Finalization

The "Forks" problems could be solved by only initializing the TransitionStore when the MERGE_FORK_EPOCH is finalized. There is then only one possible fork. Checkpoint sync would remain broken and require clients to ship a release with the TTD hard coded

Hard code TTD

We could scrap calculating the TTD and just hard code it in clients. The key downside of this is that providing sufficient notice for users to upgrade would mean setting a TTD far in the future (see Rationale in #2462).

This could be mitigated by setting expectations that users will need to upgrade their consensus clients twice for the merge. The first upgrade includes sets the MERGE_FORK_EPOCH and the second would be shipped shortly after MERGE_FORK_EPOCH is reached and include the hard coded TTD. This is what users were told to expect for the beacon chain launch as well - that a client update would be released in the week between genesis being set and the chain actually starting (so that if the genesis was calculated incorrectly the update could fix it - in the end clients agreed but a release still went out to embed the genesis state in that week).

@ajsutton
Copy link
Contributor Author

Actually, rather than storing the TTD in the state, we should just add a new field that records the total difficulty of the block referenced in Eth1Data when MERGE_FORK_EPOCH is reached (ie store the input to the computation rather than the output). Then it would still make sense to override the final calculated total difficulty via an API - you'd be essentially replacing the calculation but the input recorded in the state would remain the same.

@mkalinin
Copy link
Collaborator

mkalinin commented Sep 16, 2021

Good catch! 👍

Originally, there was no "the transition store has not already been initialized" requirement in the spec and it was assumed that TransitionStore is re-initialized each time the epoch boundary is crossed. What was missed in the original mechanics is an edge case with processing a block that affects eth1 data poll results. This could happen when either finality is delayed or there are two strong Eth1Data candidates at the end of the voting period.

The moment of calling the engine_terminalTotalDifficultyUpdated to the execution client might also be important and taken into consideration. In one of the corner cases described above (when a block from a fork triggers the Merge fork upgrade function) the correct TTD value could be overwritten by an incorrect one. It could be solved by sending the message after MERGE_FORK_EPOCH is finalized or by hardcoding the TTD value, or by sending this message only when the new fork becomes canonical.

@djrtwo
Copy link
Contributor

djrtwo commented Sep 16, 2021

tbh, I'm more and more in favor of hard-coding TTD again.

This would simplify the mechanism and edgecases.

And because the TTD could ship in consensus and execution clients, this would also prevent users that run pre-merge execution-engine implementations without a beacon node from accidentally following the PoW chain for a long time after the merge, and instead result in a failure that would be noticed and resolved.

@mkalinin
Copy link
Collaborator

mkalinin commented Sep 16, 2021

tbh, I'm more and more in favor of hard-coding TTD again.

This would simplify the mechanism and edgecases.

And because the TTD could ship in consensus and execution clients, this would also prevent users that run pre-merge execution-engine implementations without a beacon node from accidentally following the PoW chain for a long time after the merge, and instead result in a failure that would be noticed and resolved.

I second that. In this case won't a single release for the execution client containing hardcoded TTD and the Merge logic be enough?

@dapplion
Copy link
Collaborator

dapplion commented Sep 16, 2021

@ajsutton I was wondering the same.

If a beacon node process more than one epoch boundary such that epoch == MERGE_FORK_EPOCH, should it compute terminal_total_difficulty again and overwrite the existing value in the store?

If TTD is hard-coded, what happens if TTD is reached before whatever condition is picked to activate the merge fork?

@ajsutton
Copy link
Contributor Author

If a beacon node process more than one epoch boundary such that epoch == MERGE_FORK_EPOCH, should it compute terminal_total_difficulty again and overwrite the existing value in the store?

I think this is fundamentally the problem - it doesn't matter if you take the first or the last value you always have the potential for a chain split. I think using the last value is more vulnerable to attack since it allows someone to just publish their block late to ensure they're what sets the TTD but neither really works. You'd have to wait until the MERGE_FORK_EPOCH is finalized to make it reliable.

If TTD is hard-coded, what happens if TTD is reached before whatever condition is picked to activate the merge fork?

That would go quite badly, likely leading to a chain halt. We'd need to pick a TTD that's far enough in the future that we're confident the merge epoch will be reached first.

@mkalinin
Copy link
Collaborator

mkalinin commented Sep 17, 2021

That would go quite badly, likely leading to a chain halt. We'd need to pick a TTD that's far enough in the future that we're confident the merge epoch will be reached first.

What if we do it as follows:

  1. Release consensus clients with the Merge HF logic and TTD = 2**256-1
  2. Wait for the HF to happen
  3. Release consensus clients with hardcoded TTD and execution clients with the Merge logic including hardcoded TTD
  4. Wait for the transition to happen

At the first glance this may work but I might have missed something. This approach would guarantee that TTD hits after the Merge HF. Even if someone would include a non-empty payload into a beacon block signifying the transition before execution clients get shipped with the Merge logic it will be rejected by the Merge logic on the beacon chain side as TTD won't be met. This check can be done with JSON-RPC API methods and doesn't require Engine API to be implemented allowing for execution clients to stay at their pre-Merge versions while consensus clients are already upgraded.

This approach should not stretch the period of the whole upgrading process as we will likely need to wait a month for the Merge HF, then yet another month (2 weeks) for the transition vs releasing everything at once with TTD hitting in 1.5-2 months with the same intention (a months for the HF and 0.5-1 months for transition).

This approach should save us one extra release of the execution client software. In my mental model updating execution clients is more resourceful than updating consensus clients. I might be wrong though and then this proposal might not make sense.

@mkalinin
Copy link
Collaborator

This check can be done with JSON-RPC API methods and doesn't require Engine API to be implemented allowing for execution clients to stay at their pre-Merge versions while consensus clients are already upgraded.

After this change #2595 this statement became unsound as process_execution_payload happens before the terminal PoW block conditions are verified. Proper handling of this edge case could be done by adding TERMINAL_TOTAL_DIFFICULTY <= 2**256-1 assertion to the process_execution_payload function which should be straightforward.

@dapplion
Copy link
Collaborator

dapplion commented Sep 17, 2021

@mkalinin If TTD is hard-coded, what happens if miners stop mining the Ethereum chain early and that total difficulty is never reached? Since their revenue will drop to 0 at day N, it's not a huge difference from stopping getting revenue at day N-1. I guess this scenario affects all approaches discussed in this PR and current master.

EDIT: Oh already brought up in #2547 👍

@djrtwo
Copy link
Contributor

djrtwo commented Sep 21, 2021

closed via #2605
thanks @ajsutton!

@djrtwo djrtwo closed this as completed Sep 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants