feat: Add GenesisHash to AttestationStatusClient and reject unexpected updates (BFT-496) #167
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What ❔
GenesisHash
toAttestationStatus
and changeAttestationStatusWatch::update
to takeGenesisHash
and return an error if the latest from the client indicates a change in genesis. This stops the runner and thus should stop the node itself when the error bubbles up. On external nodes this is redundant with the routine that monitors the main node API for changes. On the main node I shouldn't happen.next_batch_to_attest() -> Option<BatchNumber>
intoattestation_status() -> Option<AttestationStatus>
onAttestationStatusClient
, so it now includes theGenesisHash
as well.LocalAttestationStatusClient
returns theGenesisHash
it was started withWhy ❔
This came out of a discussion here: matter-labs/zksync-era#2544 (comment)
Notably the
ConsensusDal::attestation_status()
now returns theGenesisHash
, which I thought was only to signal through the API from the main node to the external nodes that a reorg has happened (which they would eventually learn anyway through polling the genesis endpoint, causing them to restart), and they should not attest for batches if they are on a different fork. The comment raised the issue that on the main node I discarded thegenesis
field from the response because I assumed it can only be the same as the one we started theExecutor
with, and second guessing it in theBatchStore
would be awkward: the original genesis wasn't available to check against (it's cached in theBlockStore
) and running a query to fetch it (even if thePersistentBatchStore
supported it) would just compare it to what it already is.The way I handled this mismatch for external nodes was to just stop updating the status by returning
None
and wait for the restart, treating it like any other transient RPC error, it was just logged. It does make sense though to elevate it higher and be able to stop the node if it's in an inconsistent state.Now it's part of the
AttestationStatusWatch
because that is what we consider to be our interface with the node, (and not theAttestationStatusRunner
which is a convenience construct).Reorgs without restart?
If an inconsistency happened on the main node it wouldn't necessarily have to be fatal: if the genesis was allowed to change, and the components such as the
BlockStore
were able to handle it at all, then ostensibly theAttestationStatus*
stuff could recover as well, for example by resetting theBatchVotes
if they see a change in thegenesis
. The problem currently is that they might have already received and discarded votes for the new fork, which would not be gossiped again until clients are reconnected.Apparently we don't plan to handle regenesis on the fly, so
AttestationStatus::genesis
is only present to prevent any attempt at changing it by causing a restart instead.Atomicity
In the first version of this PR the
BatchStore
andPersistentBatchStore
were also changed to have anattestation_status()
method that returned aGenesisHash
along with theBatchNumber
. The only way I could make sense of this was if 1) changes in genesis while running the main node were allowed and 2) they could happen between calls toBlockStore::genesis()
andBatchStore::next_batch_to_attest()
, since those are not atomic methods. We discussed that this will not be supported, theGenesis
cached inBlockStore
will not change. Since we only start theExecutor
and theAttestationStatusRunner
after regenesis inzksync-era
, we don't have to worry about this changing. In that light it would be counter intuitive to return the genesis hash fromBatchStore
.(In the future we could consider using Software Transactional Memory to have an in-memory representation of the state where we can do consistent read and writes between multiple Watch-like references. The DAL is capable of transactional reads, and it would be nice if we could combine reading for example here the genesis and the last/next batch and not have to worry about having to compare hashes, when all of these are local reads.)