
feat: Added a JSON RPC to simulating L1 for consensus attestation (BFT-495) #2480

Merged: 16 commits into main from gprusak-endpoint, Jul 26, 2024

Conversation

pompon0
Contributor

@pompon0 pompon0 commented Jul 24, 2024

It simply checks which L1 batch certificate was stored locally last and returns its number (actually, the next number, to avoid problems with 0).
ENs will query the main node for this information to determine the number of the next batch that they should vote for.
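The core of the endpoint can be sketched as follows. This is a hypothetical helper, not the actual `consensus_dal` code (which, per the review comments, has additional fallbacks), assuming the DAL can report the number of the last locally stored batch certificate:

```rust
/// Sketch of the RPC's core logic: given the number of the last L1 batch
/// certificate stored locally, return the number of the next batch to
/// attest. Returning the *next* number avoids a special case for batch 0.
fn next_batch_to_attest(last_certified_batch: Option<u64>) -> u64 {
    match last_certified_batch {
        Some(n) => n + 1,
        // No certificate stored yet: start from batch 0. The real code
        // falls back to genesis / pruning information here instead.
        None => 0,
    }
}
```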

Resolved review threads on:

- core/lib/dal/src/consensus_dal.rs
- core/lib/types/src/api/en.rs
- core/lib/web3_decl/src/namespaces/en.rs
- core/node/api_server/src/web3/namespaces/en.rs
- core/node/consensus/src/storage/connection.rs
- core/node/consensus/src/tests.rs
@pompon0 pompon0 enabled auto-merge July 25, 2024 13:24
aakoshh
aakoshh previously approved these changes Jul 25, 2024
Contributor

@aakoshh aakoshh left a comment


Awesome, the new comments explaining the fallbacks are very helpful 👍

RomanBrodetski
RomanBrodetski previously approved these changes Jul 26, 2024
Collaborator

@RomanBrodetski RomanBrodetski left a comment


I think one of the unit tests is failing though

@pompon0 pompon0 dismissed stale reviews from RomanBrodetski and aakoshh via b4b05ba July 26, 2024 07:54
@pompon0 pompon0 added this pull request to the merge queue Jul 26, 2024
Merged via the queue into main with commit c6b3adf Jul 26, 2024
47 checks passed
@pompon0 pompon0 deleted the gprusak-endpoint branch July 26, 2024 09:13
@pompon0 pompon0 changed the title feat: Added a JSON RPC to simulating L1 for consensus attestation feat: Added a JSON RPC to simulating L1 for consensus attestation (BFT-495) Jul 26, 2024
aakoshh added a commit to matter-labs/era-consensus that referenced this pull request Jul 31, 2024
## What ❔

Poll the main node for which batch height to vote on.

- [x] Return an `Option` rather than a `Vec` of `BatchQC` from
`BatchVotes::find_quorum`: we decided not to implement voting on
multiple heights for now, which makes it just confusing that a vector is
returned.
- [x] Create an `AttestationStatusClient` trait with a method to poll
the next `BatchNumber` to vote on, which can be injected as a `dyn`
dependency
- [x] Create an `AttestationStatusWatch` that contains the next
`BatchNumber` to vote on; this is updated by a single background task
and listened to by multiple subscribers
- [x] Create an `AttestationStatusRunner` which is supposed to be
started along with the `BatchStore` in `zksync-era` and polls the
client; the point of this is to push out the poll interval configuration
to the edge and make the `AttestationStatusWatch` the integration point
- [x] Change `AttesterRunner` to wait for the `AttestationStatusWatch`
to change, then the payload to appear, then cast the vote; it doesn't
need to worry about reverts because the node will restart if that
happens
- [x] Change `run_batch_qc_finder` to wait for the
`AttestationStatusWatch` to change, then wait for the next available QC,
and save it; this might produce gaps, which is normal on external nodes
but undesirable on the main node. On the main node it only produces gaps
if the attesters vote on higher heights despite the main node telling
them not to; if a majority of them are honest this should not happen,
and if they are not, there is little we can do anyway
- [x] Rename `PersistentBatchStore::earliest_batch_number_to_sign` to
`next_batch_to_attest` in accordance with the new method on
`consensus_dal`.
- [ ] ~~Initialise `BatchNumber::min_batch_number` by asking the main
node where to resume from~~ This initialisation isn't necessary because
1) we decided that we'll only allow 1 vote per attester, so there is no
attack vector of casting votes from batch 0 and 2) `run_batch_qc_finder`
first waits for the API status to appear and then prunes all preceding
data, so an older QC will not be saved. It was also undesirable to have
to initialise from an API that might return nothing (or errors) for an
unknown amount of time.
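The trait and watch from the checklist above can be sketched roughly as follows. This is a minimal stand-in with simplified synchronous signatures; the real types presumably wrap async primitives such as a tokio watch channel, and only the names come from this PR:

```rust
use std::sync::{Condvar, Mutex};

/// Sketch of the `AttestationStatusClient` trait: something that can be
/// polled for the next `BatchNumber` to vote on, injectable as `dyn`.
trait AttestationStatusClient: Send + Sync {
    fn next_batch_to_attest(&self) -> Option<u64>;
}

/// Stand-in for `AttestationStatusWatch`: updated by a single background
/// task (the runner) and listened to by multiple subscribers.
struct AttestationStatusWatch {
    next: Mutex<Option<u64>>,
    changed: Condvar,
}

impl AttestationStatusWatch {
    fn new() -> Self {
        Self { next: Mutex::new(None), changed: Condvar::new() }
    }

    /// Called by the runner when the polled status advances.
    fn update(&self, batch: u64) {
        let mut guard = self.next.lock().unwrap();
        if *guard != Some(batch) {
            *guard = Some(batch);
            self.changed.notify_all();
        }
    }

    /// Called by subscribers (e.g. the `AttesterRunner`): block until a
    /// batch number of at least `min` has been published.
    fn wait_for_at_least(&self, min: u64) -> u64 {
        let mut guard = self.next.lock().unwrap();
        loop {
            if let Some(n) = *guard {
                if n >= min {
                    return n;
                }
            }
            guard = self.changed.wait(guard).unwrap();
        }
    }
}
```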

### Poll interval

The `AttesterStatusRunner` polls the API at fixed intervals without any
exponential backoff. I thought 5s would be a reasonable default value.
With 60s batches this seems fine because the *next* batch to sign will
be signalled as soon as the current batch QC is inserted into the
database, which is well ahead of time of when the next batch and its
commitment will be available. If we think we'll need to catch up with
historical batch signatures it might be a bit slow. We generally still
have to account for gossiping around the votes though, which in a large
enough network could be on the order of several seconds as well.
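The fixed-interval loop described above can be sketched like this. The `fetch` closure stands in for the actual RPC call, and `max_polls` is added only so the sketch terminates; the real runner presumably loops until cancelled:

```rust
use std::time::Duration;

/// Sketch of the runner's fixed-interval poll loop: no exponential
/// backoff, just retry after `interval` (e.g. 5s) while the status is
/// unavailable.
fn poll_next_batch<F>(mut fetch: F, interval: Duration, max_polls: u32) -> Option<u64>
where
    F: FnMut() -> Option<u64>,
{
    for _ in 0..max_polls {
        if let Some(batch) = fetch() {
            return Some(batch);
        }
        std::thread::sleep(interval);
    }
    None
}
```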

### Potential deadlock

There is a pathological sequence of events that could prevent the
feature from starting on a fresh database with no prior batch QCs:
1. Say we start from batch 0, and a QC is gathered, but saving it to the
database (which is an async process) takes a long time on the main node.
2. Say it takes so long that batch 1 and batch 2 are created, but none
of them have a QC yet.
3. Say we have two attesters A and B, and A polled the API when it
showed batch 2, and the main node set its minimum batch number in the
vote registry to batch 2 as well.
4. Now the QC for batch 0 is saved to the database and the API indicates
the next one to vote on is batch 1.
5. Attester B casts its vote on batch 1 but it goes ignored by the main
node vote registry because it's looking for a quorum with at least batch
number 2.
6. Having missed an attestation the main node will never save a QC and
never move on until it's restarted.

Note that even if the registry minimum batch number was adjusted _down_
that might happen _after_ the vote for batch 1 has already been
discarded, and because new votes are only pushed once through gossip
there is no guarantee that it will get it again.
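The discard in step 5 can be illustrated with a toy model of the vote registry. This is a hypothetical struct purely for illustration; the real registry lives in `era-consensus` and tracks per-attester votes and weights:

```rust
/// Toy model of the main node's vote registry for illustrating the
/// deadlock: any vote below the current minimum batch number is dropped.
struct VoteRegistry {
    min_batch: u64,
    votes: Vec<u64>,
}

impl VoteRegistry {
    /// Returns whether the vote was accepted.
    fn insert(&mut self, batch: u64) -> bool {
        if batch < self.min_batch {
            // Attester B's vote for batch 1 ends up here once the minimum
            // has been bumped to 2, and gossip will not push it again.
            return false;
        }
        self.votes.push(batch);
        true
    }
}
```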

The solution is coming in the form of fixing the starting point of
gossip to be where the genesis starts, even if that means filling a
potentially long history of batches in the beginning. See
matter-labs/zksync-era#2539

## Why ❔

We want to collect batch QCs without gaps, and for now put the main node
in charge of what to vote on. The API to serve this information is in
matter-labs/zksync-era#2480 and this PR is the
follow up to make polling that information part of the signing process.
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
🤖 I have created a release *beep* *boop*
---


## [24.13.0](core-v24.12.0...core-v24.13.0) (2024-07-31)


### Features

* Add recovery tests to zk_supervisor
([#2444](#2444))
([0c0d10a](0c0d10a))
* Added a JSON RPC to simulating L1 for consensus attestation
([#2480](#2480))
([c6b3adf](c6b3adf))
* added dropping all attester certificates when doing hard fork
([#2529](#2529))
([5acd686](5acd686))
* **configs:** Do not panic if config is only partially filled
([#2545](#2545))
([db13fe3](db13fe3))
* **eth-sender:** Make eth-sender tests use blob txs + refactor of
eth-sender tests
([#2316](#2316))
([c8c8334](c8c8334))
* Introduce more tracing instrumentation
([#2523](#2523))
([79d407a](79d407a))
* Remove unused VKs, add docs for BWG
([#2468](#2468))
([2fa6bf0](2fa6bf0))
* Revisit base config values
([#2532](#2532))
([3fac8ac](3fac8ac))
* Server 10k gwei limit on gas price and 1M limit on pubdata price
([#2460](#2460))
([be238cc](be238cc))
* **vlog:** Implement otlp guard with force flush on drop
([#2536](#2536))
([c9f76e5](c9f76e5))
* **vlog:** New vlog interface + opentelemtry improvements
([#2472](#2472))
([c0815cd](c0815cd))
* **zk_toolbox:** add test upgrade subcommand to zk_toolbox
([#2515](#2515))
([1a12f5f](1a12f5f))
* **zk_toolbox:** use configs from the main repo
([#2470](#2470))
([4222d13](4222d13))


### Bug Fixes

* **contract verifier:** Fix config values
([#2510](#2510))
([3729468](3729468))
* fixed panic propagation
([#2525](#2525))
([e0fc58b](e0fc58b))
* **proof_data_handler:** Unlock jobs on transient errors
([#2486](#2486))
([7c336b1](7c336b1))
* **prover:** Parallelize circuit metadata uploading for BWG
([#2520](#2520))
([f49720f](f49720f))
* VM performance diff: don't show 0 as N/A
([#2276](#2276))
([2fa2249](2fa2249))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
## What ❔

Injects an `AttestationStatusWatch` into the `Executor`. On the main
node it is backed by an `AttestationStatusRunner` polling
`BatchStore::next_batch_to_attest`; external nodes instead call
`MainNodeApi::fetch_attestation_status`.

TODO: 
- [x] Rebase after #2539
is merged
- [x] Update the `era-consensus` dependency once
matter-labs/era-consensus#161 is merged and
`0.1.0-rc.5` is published
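The two backends behind the shared watch can be sketched as `dyn` implementations of the same client trait. The struct names and fields below are hypothetical stand-ins (the real types hold a store handle and an RPC client respectively):

```rust
/// Sketch of the `dyn` injection: one trait, two backends.
trait AttestationStatusClient: Send + Sync {
    fn next_batch_to_attest(&self) -> Option<u64>;
}

/// Main node: reads its own store (stands in for
/// `BatchStore::next_batch_to_attest`).
struct LocalStoreClient {
    next: u64,
}

impl AttestationStatusClient for LocalStoreClient {
    fn next_batch_to_attest(&self) -> Option<u64> {
        Some(self.next)
    }
}

/// External node: would call `MainNodeApi::fetch_attestation_status`
/// over JSON RPC; modelled here as a cached value that may be absent.
struct MainNodeApiClient {
    last_fetched: Option<u64>,
}

impl AttestationStatusClient for MainNodeApiClient {
    fn next_batch_to_attest(&self) -> Option<u64> {
        self.last_fetched
    }
}
```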

### Test failure - pruning the main node

The following test never finished:
```shell
zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture
```
A little extra logging revealed that the `AttesterStatusRunner` never
gets initialised, because the blocks get pruned before it can read the
result, i.e. `ConsensusDal::batch_of_block(genesis.first_block)`
probably returns `None` because the block is not found, and never will
be because it's pruned.

We discussed in #2480 that
the main node is not supposed to be pruned, which is why the SQL that
looks for what to attest on doesn't look for pruning records any more.
Yet this test prunes the main node, and if I just try to prune the
external node instead it panics because it doesn't have that block yet.

Either the SQL should take pruning into account after all, or we have to
figure out a way to wait in the test until the main node executor is
running, or change the test to prune the external node; I did the
latter.
 
## Why ❔

This coordinates attesters to cast their votes on the batch the main
node tells them to, in order to produce a quorum certificate for all
L1 batches without gaps.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.