Skip to content

Commit

Permalink
document state sync ABCI interface and P2P protocol (#90)
Browse files Browse the repository at this point in the history
The corresponding Tendermint PRs are tendermint/tendermint#4704 and tendermint/tendermint#4705.
  • Loading branch information
erikgrinaker authored May 5, 2020
1 parent f399abd commit 9842b4b
Show file tree
Hide file tree
Showing 4 changed files with 353 additions and 9 deletions.
131 changes: 124 additions & 7 deletions spec/abci/abci.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,17 +5,22 @@
The ABCI message types are defined in a [protobuf
file](https://github.com/tendermint/tendermint/blob/master/abci/types/types.proto).

ABCI methods are split across 3 separate ABCI _connections_:
ABCI methods are split across four separate ABCI _connections_:

- `Consensus Connection`: `InitChain, BeginBlock, DeliverTx, EndBlock, Commit`
- `Mempool Connection`: `CheckTx`
- `Info Connection`: `Info, SetOption, Query`
- Consensus connection: `InitChain`, `BeginBlock`, `DeliverTx`, `EndBlock`, `Commit`
- Mempool connection: `CheckTx`
- Info connection: `Info`, `SetOption`, `Query`
- Snapshot connection: `ListSnapshots`, `LoadSnapshotChunk`, `OfferSnapshot`, `ApplySnapshotChunk`

The `Consensus Connection` is driven by a consensus protocol and is responsible
The consensus connection is driven by a consensus protocol and is responsible
for block execution.
The `Mempool Connection` is for validating new transactions, before they're

The mempool connection is for validating new transactions, before they're
shared or included in a block.
The `Info Connection` is for initialization and for queries from the user.

The info connection is for initialization and for queries from the user.

The snapshot connection is for serving and restoring [state sync snapshots](apps.md#state-sync).

Additionally, there is a `Flush` method that is called on every connection,
and an `Echo` method that is just for debugging.
Expand Down Expand Up @@ -151,6 +156,22 @@ The result is an updated application state.
Cryptographic commitments to the results of DeliverTx, EndBlock, and
Commit are included in the header of the next block.

## State Sync

State sync allows new nodes to rapidly bootstrap by discovering, fetching, and applying
state machine snapshots instead of replaying historical blocks. For more details, see the
[state sync section](apps.md#state-sync).

When a new node is discovering snapshots in the P2P network, existing nodes will call
`ListSnapshots` on the application to retrieve any local state snapshots. The new node will
offer these snapshots to its local application via `OfferSnapshot`.

Once the application accepts a snapshot and begins restoring it, Tendermint will fetch snapshot
chunks from existing nodes via `LoadSnapshotChunk` and apply them sequentially to the local
application with `ApplySnapshotChunk`. When all chunks have been applied, the application
`AppHash` is retrieved via an `Info` query and compared to the blockchain's `AppHash` verified
via light client.

## Messages

### Echo
Expand Down Expand Up @@ -388,6 +409,83 @@ Commit are included in the header of the next block.
other purposes, e.g. auditing, replay of non-persisted heights, light client
verification, and so on.

### ListSnapshots

- **Response**:
- `Snapshots ([]Snapshot)`: List of local state snapshots.
- **Usage**:
- Used during state sync to discover available snapshots on peers.
- See `Snapshot` data type for details.

### LoadSnapshotChunk

- **Request**:
- `Height (uint64)`: The height of the snapshot the chunks belongs to.
- `Format (uint32)`: The application-specific format of the snapshot the chunk belongs to.
- `Chunk (uint32)`: The chunk index, starting from `0` for the initial chunk.
- **Response**:
- `Chunk ([]byte)`: The binary chunk contents, in an arbitray format. Chunk messages cannot be
larger than 16 MB _including metadata_, so 10 MB is a good starting point.
- **Usage**:
- Used during state sync to retrieve snapshot chunks from peers.

### OfferSnapshot

- **Request**:
- `Snapshot (Snapshot)`: The snapshot offered for restoration.
- `AppHash ([]byte)`: The light client-verified app hash for this height, from the blockchain.
- **Response**:
- `Result (Result)`: The result of the snapshot offer.
- `accept`: Snapshot is accepted, start applying chunks.
- `abort`: Abort snapshot restoration, and don't try any other snapshots.
- `reject`: Reject this specific snapshot, try others.
- `reject_format`: Reject all snapshots with this `format`, try others.
- `reject_senders`: Reject all snapshots from all senders of this snapshot, try others.
- **Usage**:
- `OfferSnapshot` is called when bootstrapping a node using state sync. The application may
accept or reject snapshots as appropriate. Upon accepting, Tendermint will retrieve and
apply snapshot chunks via `ApplySnapshotChunk`. The application may also choose to reject a
snapshot in the chunk response, in which case it should be prepared to accept further
`OfferSnapshot` calls.
- Only `AppHash` can be trusted, as it has been verified by the light client. Any other data
can be spoofed by adversaries, so applications should employ additional verification schemes
to avoid denial-of-service attacks. The verified `AppHash` is automatically checked against
the restored application at the end of snapshot restoration.
- For more information, see the `Snapshot` data type or the [state sync section](apps.md#state-sync).

### ApplySnapshotChunk

- **Request**:
- `Index (uint32)`: The chunk index, starting from `0`. Tendermint applies chunks sequentially.
- `Chunk ([]byte)`: The binary chunk contents, as returned by `LoadSnapshotChunk`.
- `Sender (string)`: The P2P ID of the node who sent this chunk.
- **Response**:
- `Result (Result)`: The result of applying this chunk.
- `accept`: The chunk was accepted.
- `abort`: Abort snapshot restoration, and don't try any other snapshots.
- `retry`: Reapply this chunk, combine with `RefetchChunks` and `RejectSenders` as appropriate.
- `retry_snapshot`: Restart this snapshot from `OfferSnapshot`, reusing chunks unless
instructed otherwise.
- `reject_snapshot`: Reject this snapshot, try a different one.
- `RefetchChunks ([]uint32)`: Refetch and reapply the given chunks, regardless of `Result`. Only
the listed chunks will be refetched, and reapplied in sequential order.
- `RejectSenders ([]string)`: Reject the given P2P senders, regardless of `Result`. Any chunks
already applied will not be refetched unless explicitly requested, but queued chunks from these senders will be discarded, and new chunks or other snapshots rejected.
- **Usage**:
- The application can choose to refetch chunks and/or ban P2P peers as appropriate. Tendermint
will not do this unless instructed by the application.
- The application may want to verify each chunk, e.g. by attaching chunk hashes in
`Snapshot.Metadata` and/or incrementally verifying contents against `AppHash`.
- When all chunks have been accepted, Tendermint will make an ABCI `Info` call to verify that
`LastBlockAppHash` and `LastBlockHeight` matches the expected values, and record the
`AppVersion` in the node state. It then switches to fast sync or consensus and joins the
network.
- If Tendermint is unable to retrieve the next chunk after some time (e.g. because no suitable
peers are available), it will reject the snapshot and try a different one via `OfferSnapshot`.
The application should be prepared to reset and accept it or abort as appropriate.

###

## Data Types

### Header
Expand Down Expand Up @@ -540,3 +638,22 @@ Commit are included in the header of the next block.
- `Type (string)`: Type of Merkle proof and how it's encoded.
- `Key ([]byte)`: Key in the Merkle tree that this proof is for.
- `Data ([]byte)`: Encoded Merkle proof for the key.

### Snapshot

- **Fields**:
- `Height (uint64)`: The height at which the snapshot was taken (after commit).
- `Format (uint32)`: An application-specific snapshot format, allowing applications to version
their snapshot data format and make backwards-incompatible changes. Tendermint does not
interpret this.
- `Chunks (uint32)`: The number of chunks in the snapshot. Must be at least 1 (even if empty).
- `Hash (bytes)`: An arbitrary snapshot hash. Must be equal only for identical snapshots across
nodes. Tendermint does not interpret the hash, it only compares them.
- `Metadata (bytes)`: Arbitrary application metadata, for example chunk hashes or other
verification data.

- **Usage**:
- Used for state sync snapshots, see [separate section](apps.md#state-sync) for details.
- A snapshot is considered identical across nodes only if _all_ fields are equal (including
`Metadata`). Chunks may be retrieved from all nodes that have the same snapshot.
- When sent across the network, a snapshot message can be at most 4 MB.
152 changes: 151 additions & 1 deletion spec/abci/apps.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,10 +14,11 @@ Here we cover the following components of ABCI applications:
application state
- [Crash Recovery](#crash-recovery) - handshake protocol to synchronize
Tendermint and the application on startup.
- [State Sync](#state-sync) - rapid bootstrapping of new nodes by restoring state machine snapshots

## State

Since Tendermint maintains three concurrent ABCI connections, it is typical
Since Tendermint maintains four concurrent ABCI connections, it is typical
for an application to maintain a distinct state for each, and for the states to
be synchronized during `Commit`.

Expand Down Expand Up @@ -92,6 +93,11 @@ QueryState should be set to the latest `DeliverTxState` at the end of every `Com
ie. after the full block has been processed and the state committed to disk.
Otherwise it should never be modified.

### Snapshot Connection

The Snapshot Connection is optional, and is only used to serve state sync snapshots for other nodes
and/or restore state sync snapshots to a local node being bootstrapped.

## Transaction Results

`ResponseCheckTx` and `ResponseDeliverTx` contain the same fields.
Expand Down Expand Up @@ -464,3 +470,147 @@ If `appBlockHeight == storeBlockHeight`
update the state using the saved ABCI responses but dont run the block against the real app.
This happens if we crashed after the app finished Commit but before Tendermint saved the state.

## State Sync

A new node joining the network can simply join consensus at the genesis height and replay all
historical blocks until it is caught up. However, for large chains this can take a significant
amount of time, often on the order of weeks or months.

State sync is an alternative mechanism for bootstrapping a new node, where it fetches a snapshot
of the state machine at a given height and restores it. Depending on the application, this can
be several orders of magnitude faster than replaying blocks.

Note that state sync does not currently backfill historical blocks, so the node will have a
truncated block history - users are advised to consider the broader network implications of this in
terms of block availability and auditability. This functionality may be added in the future.

For details on the specific ABCI calls and types, see the [methods and types section](abci.md).

### Taking Snapshots

Applications that want to support state syncing must take state snapshots at regular intervals. How
this is accomplished is entirely up to the application. A snapshot consists of some metadata and
a set of binary chunks in an arbitrary format:

* `Height (uint64)`: The height at which the snapshot is taken. It must be taken after the given
height has been committed, and must not contain data from any later heights.

* `Format (uint32)`: An arbitrary snapshot format identifier. This can be used to version snapshot
formats, e.g. to switch from Protobuf to MessagePack for serialization. The application can use
this when restoring to choose whether to accept or reject a snapshot.

* `Chunks (uint32)`: The number of chunks in the snapshot. Each chunk contains arbitrary binary
data, and should be less than 16 MB; 10 MB is a good starting point.

* `Hash ([]byte)`: An arbitrary hash of the snapshot. This is used to check whether a snapshot is
the same across nodes when downloading chunks.

* `Metadata ([]byte)`: Arbitrary snapshot metadata, e.g. chunk hashes for verification or any other
necessary info.

For a snapshot to be considered the same across nodes, all of these fields must be identical. When
sent across the network, snapshot metadata messages are limited to 4 MB.

When a new node is running state sync and discovering snapshots, Tendermint will query an existing
application via the ABCI `ListSnapshots` method to discover available snapshots, and load binary
snapshot chunks via `LoadSnapshotChunk`. The application is free to choose how to implement this
and which formats to use, but should provide the following guarantees:

* **Consistent:** A snapshot should be taken at a single isolated height, unaffected by
concurrent writes. This can e.g. be accomplished by using a data store that supports ACID
transactions with snapshot isolation.

* **Asynchronous:** Taking a snapshot can be time-consuming, so it should not halt chain progress,
for example by running in a separate thread.

* **Deterministic:** A snapshot taken at the same height in the same format should be identical
(at the byte level) across nodes, including all metadata. This ensures good availability of
chunks, and that they fit together across nodes.

A very basic approach might be to use a datastore with MVCC transactions (such as RocksDB),
start a transaction immediately after block commit, and spawn a new thread which is passed the
transaction handle. This thread can then export all data items, serialize them using e.g.
Protobuf, hash the byte stream, split it into chunks, and store the chunks in the file system
along with some metadata - all while the blockchain is applying new blocks in parallel.

A more advanced approach might include incremental verification of individual chunks against the
chain app hash, parallel or batched exports, compression, and so on.

Old snapshots should be removed after some time - generally only the last two snapshots are needed
(to prevent the last one from being removed while a node is restoring it).

### Bootstrapping a Node

An empty node can be state synced by setting the configuration option `statesync.enabled =
true`. The node also needs the chain genesis file for basic chain info, and configuration for
light client verification of the restored snapshot: a set of Tendermint RPC servers, and a
trusted header hash and corresponding height from a trusted source, via the `statesync`
configuration section.

Once started, the node will connect to the P2P network and begin discovering snapshots. These
will be offered to the local application, and once a snapshot is accepted Tendermint will fetch
and apply the snapshot chunks. After all chunks have been successfully applied, Tendermint verifies
the app's `AppHash` against the chain using the light client, then switches the node to normal
consensus operation.

#### Snapshot Discovery

When the empty node join the P2P network, it asks all peers to report snapshots via the
`ListSnapshots` ABCI call (limited to 10 per node). After some time, the node picks the most
suitable snapshot (generally prioritized by height, format, and number of peers), and offers it
to the application via `OfferSnapshot`. The application can choose a number of responses,
including accepting or rejecting it, rejecting the offered format, rejecting the peer who sent
it, and so on. Tendermint will keep discovering and offering snapshots until one is accepted or
the application aborts.

#### Snapshot Restoration

Once a snapshot has been accepted via `OfferSnapshot`, Tendermint begins downloading chunks from
any peers that have the same snapshot (i.e. that have identical metadata fields). Chunks are
spooled in a temporary directory, and then given to the application in sequential order via
`ApplySnapshotChunk` until all chunks have been accepted.

As with taking snapshots, the method for restoring them is entirely up to the application, but will
generally be the inverse of how they are taken.

During restoration, the application can respond to `ApplySnapshotChunk` with instructions for how
to continue. This will typically be to accept the chunk and await the next one, but it can also
ask for chunks to be refetched (either the current one or any number of previous ones), P2P peers
to be banned, snapshots to be rejected or retried, and a number of other responses - see the ABCI
reference for details.

If Tendermint fails to fetch a chunk after some time, it will reject the snapshot and try a
different one via `OfferSnapshot` - the application can choose whether it wants to support
restarting restoration, or simply abort with an error.

#### Snapshot Verification

Once all chunks have been accepted, Tendermint issues an `Info` ABCI call to retrieve the
`LastBlockAppHash`. This is compared with the trusted app hash from the chain, retrieved and
verified using the light client. Tendermint also checks that `LastBlockHeight` corresponds to the
height of the snapshot.

This verification ensures that an application is valid before joining the network. However, the
snapshot restoration may take a long time to complete, so applications may want to employ additional
verification during the restore to detect failures early. This might e.g. include incremental
verification of each chunk against the app hash (using bundled Merkle proofs), checksums to
protect against data corruption by the disk or network, and so on. However, it is important to
note that the only trusted information available is the app hash, and all other snapshot metadata
can be spoofed by adversaries.

Apps may also want to consider state sync denial-of-service vectors, where adversaries provide
invalid or harmful snapshots to prevent nodes from joining the network. The application can
counteract this by asking Tendermint to ban peers. As a last resort, node operators can use
P2P configuration options to whitelist a set of trusted peers that can provide valid snapshots.

#### Transition to Consensus

Once the snapshot has been restored, Tendermint gathers additional information necessary for
bootstrapping the node (e.g. chain ID, consensus parameters, validator sets, and block headers)
from the genesis file and light client RPC servers. It also fetches and records the `AppVersion`
from the ABCI application.

Once the node is bootstrapped with this information and the restored state machine, it transitions
to fast sync (if enabled) to fetch any remaining blocks up the the chain head, and then
transitions to regular consensus operation. At this point the node operates like any other node,
apart from having a truncated block history at the height of the restored snapshot.
2 changes: 1 addition & 1 deletion spec/abci/client-server.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ it is the standard way to encode integers in Protobuf. It is also generally shor
As noted above, this prefixing does not apply for GRPC.

An ABCI server must also be able to support multiple connections, as
Tendermint uses three connections.
Tendermint uses four connections.

### Async vs Sync

Expand Down
Loading

0 comments on commit 9842b4b

Please sign in to comment.