[Docs] upgrade/chain halt recovery #837
Changes from 15 commits
bd12a8b
4ea092e
341b987
3d86aef
94d8af3
cb073bd
3fe1fea
525eb57
4e8c7dd
9d5696d
dd631e6
9d0bc9c
83a24aa
50cc08e
3bddc15
498d9d8
7ecf403
be35f1a
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
--- | ||
sidebar_position: 7 | ||
title: Chain Halt Recovery | ||
--- | ||
|
||
## Chain Halt Recovery <!-- omit in toc --> | ||
|
||
This document describes how to recover from a chain halt. It assumes the cause of | ||
the chain halt has been identified, and that a new release has been created and verified | ||
to function correctly. | ||
|
||
:::tip | ||
See [Chain Halt Troubleshooting](./chain_halt_troubleshooting.md) for more information on identifying the cause of a chain halt. | ||
::: | ||
|
||
- [Background](#background) | ||
- [Resolving halts during a network upgrade](#resolving-halts-during-a-network-upgrade) | ||
- [Manual binary replacement (preferred)](#manual-binary-replacement-preferred) | ||
- [Rollback, fork and upgrade](#rollback-fork-and-upgrade) | ||
- [Step 5: Data rollback - retrieving snapshot at a specific height](#step-5-data-rollback---retrieving-snapshot-at-a-specific-height) | ||
- [Step 6: Validator Isolation - risk mitigation](#step-6-validator-isolation---risk-mitigation) | ||
|
||
## Background | ||
|
||
Pocket Network is built on top of `cosmos-sdk`, which uses the CometBFT consensus engine. | ||
CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires that **more than** 2/3 of Validators (by voting power) | ||
are online and voting for the same block to reach consensus. To maintain liveness | ||
and avoid a chain halt, we need a supermajority (> 2/3) of Validators to participate | ||
and run the same version of the software. | ||
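
As a rough illustration of the threshold described above, the check below uses integer arithmetic to test whether the online voting power is strictly greater than 2/3 of the total. The helper name `has_quorum` and the sample numbers are assumptions made for this sketch; they are not part of `poktrolld` or CometBFT.

```bash
# Hypothetical helper: succeeds (exit 0) only when the online voting power
# is strictly greater than 2/3 of the total voting power.
has_quorum() {
  quorum_online=$1
  quorum_total=$2
  # Integer-only form of: online / total > 2 / 3
  [ $((3 * quorum_online)) -gt $((2 * quorum_total)) ]
}

has_quorum 67 100 && echo "67/100 online: blocks can be committed"
has_quorum 66 100 || echo "66/100 online: the chain halts"
```

Note that exactly 2/3 (for example 2 out of 3 equally weighted Validators) is not enough, which is why the comparison is strict.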
|
||
## Resolving halts during a network upgrade | ||
|
||
If the halt is caused by a network upgrade, the solution may be as simple as | ||
skipping the upgrade (i.e. `--unsafe-skip-upgrades`) and creating a new (fixed) upgrade. | ||
|
||
Read more about [upgrade contingency plans](../../protocol/upgrades/contigency_plans.md). | ||
|
||
### Manual binary replacement (preferred) | ||
|
||
:::note | ||
|
||
This is the preferred way of resolving consensus-breaking issues. | ||
|
||
::: | ||
|
||
Since the chain is not moving, **it is impossible** to issue an automatic upgrade with an upgrade plan. | ||
|
||
Instead, we need **social consensus** to manually replace the binary and get the chain moving. | ||
|
||
Currently, a manual binary replacement breaks the ability to sync the network from genesis without human interaction, but there are plans to make the process less painful in the future. | ||
Review comment: I don't fully understand what you're trying to say with this sentence. #PUC |
||
|
||
<!-- TODO_IMPROVE(@okdas): add links to Cosmovisor documentation how the new UX can be used to automate syncing from genesis without human input. --> | ||
|
||
### Rollback, fork and upgrade | ||
|
||
|
||
:::info | ||
|
||
These instructions are only relevant to Pocket Network's Shannon release. | ||
|
||
We do not currently use `x/gov` and on-chain voting for upgrades. | ||
|
||
Instead, our DAO votes on upgrades off-chain and the Foundation executes | ||
transactions on their behalf. | ||
|
||
::: | ||
|
||
**Performing a rollback is analogous to forking the network at the older height.** | ||
|
||
This should be avoided unless absolutely necessary. | ||
|
||
However, if necessary, the instructions to follow are: | ||
|
||
1. Prepare & verify a new binary that addresses the consensus-breaking issue. | ||
2. [Create a release](../../protocol/upgrades/release_process.md). | ||
3. [Prepare an upgrade transaction](../../protocol/upgrades/upgrade_procedure.md#writing-an-upgrade-transaction) to the new version. | ||
4. Get the Validator set off the network **3 blocks** prior to the height of the chain halt. For example: | ||
- Assume an issue at height `103` | ||
- Get the validator set at height `100` | ||
- Submit an upgrade transaction at `101` | ||
- Upgrade the chain at height `102` | ||
- Avoid the issue at height `103` | ||
5. Ensure all validators roll back to the same height and use the same snapshot. | ||
- The snapshot should be imported into each Validator's data directory. | ||
- This is necessary to ensure data continuity and prevent forks. | ||
6. Isolate the validator set from full nodes. | ||
- This is necessary to prevent full nodes from gossiping blocks that have been rolled back. | ||
- This may require using a firewall or a private network. | ||
- Validators should only gossip blocks amongst themselves. | ||
7. Start the network and perform the upgrade. For example, reiterating the process above: | ||
- Start all Validators at height `100` | ||
- On block `101`, submit the `MsgSoftwareUpgrade` transaction with a `Plan.height` set to `102`. | ||
- `x/upgrade` will perform the upgrade in the `EndBlocker` of block `102` | ||
- If using `cosmovisor`, the node will wait to replace the binary | ||
8. Wait for the network to reach the height of the previous ledger (`104`+) | ||
9. Allow validators to open their network to full nodes again. | ||
- Note that full nodes will need to perform the rollback or use a snapshot as well. | ||
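
The height arithmetic in the example above can be sketched as a small helper. The function name `plan_rollback` and its output format are made up for illustration; the only thing taken from the steps above is the convention of rolling back 3 blocks before the halt height.

```bash
# Hypothetical helper: derive the rollback plan from the halt height,
# following the "3 blocks prior" convention used in the steps above.
plan_rollback() {
  plan_halt_height=$1
  echo "rollback_height=$((plan_halt_height - 3))"  # all Validators restart here
  echo "submit_height=$((plan_halt_height - 2))"    # upgrade transaction is submitted
  echo "upgrade_height=$((plan_halt_height - 1))"   # Plan.height of the upgrade
}

# For an issue at height 103 this prints heights 100, 101 and 102.
plan_rollback 103
```

Agreeing on these three numbers before anyone touches their node is the part that needs social consensus, since every Validator has to roll back to the same height.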
|
||
#### Step 5: Data rollback - retrieving snapshot at a specific height | ||
|
||
There are two ways to get a snapshot from a prior height: | ||
|
||
1. Use `poktrolld rollback --hard` repeatedly until the command responds with the desired block number. | ||
Review comment: Create a ```bash block so it's easier to copy paste
Reply: Added, but honestly, I prefer one ` here.
Reply: We can change it back if need be. You know how they say "the customer is always right"? I'm the customer of this document. |
||
2. Use a snapshot and start the node with the `--halt-height=100` parameter so it only syncs up to a certain height and then gracefully shuts down. | ||
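
Both options can be wrapped in small shell functions. This is a sketch under a few assumptions: the operator already knows the node's current height, each `rollback --hard` invocation removes one block, and the function names and the `POKTROLLD` variable are hypothetical conveniences rather than anything shipped with the binary. Always confirm the height reported by the command itself before proceeding.

```bash
# Overridable so the loop can be dry-run with a stub instead of the real binary.
POKTROLLD="${POKTROLLD:-poktrolld}"

# Option 1: roll back one block at a time until the target height is reached.
# Assumes each `rollback --hard` invocation removes exactly one block.
rollback_to_height() {
  rb_current=$1
  rb_target=$2
  while [ "$rb_current" -gt "$rb_target" ]; do
    "$POKTROLLD" rollback --hard || return 1
    rb_current=$((rb_current - 1))
  done
}

# Option 2: sync from a snapshot and shut down gracefully at the target height.
start_until_height() {
  "$POKTROLLD" start --halt-height="$1"
}

# Example usage (not executed here):
#   rollback_to_height 103 100
#   start_until_height 100
```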
|
||
|
||
#### Step 6: Validator Isolation - risk mitigation | ||
|
||
- Having at least one node that has knowledge of the forking ledger can jeopardize the whole process. In particular, the following errors are signs of nodes propagating existing blocks: | ||
Review comment: I copy-pasted this here so the section above is easier to read (i.e. less cognitive overhead). Please clean up |
||
- `found conflicting vote from ourselves; did you unsafe_reset a validator?` | ||
- `conflicting votes from validator` |
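
One way to watch for these errors is to scan a node's log output for the two messages. The sketch below assumes the logs are available as a plain text file; the helper name `has_conflicting_votes` and the log path in the usage comment are hypothetical.

```bash
# Hypothetical helper: succeeds (exit 0) when the given log file contains
# either of the conflicting-vote errors listed above.
has_conflicting_votes() {
  grep -F -q \
    -e 'found conflicting vote from ourselves' \
    -e 'conflicting votes from validator' \
    "$1"
}

# Example usage (the path is a placeholder):
#   has_conflicting_votes /path/to/validator.log && echo "isolation is leaking"
```

If the check succeeds on any Validator, it is worth stopping and re-verifying the isolation before continuing with the upgrade.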
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
--- | ||
title: Failed upgrade contingency plan | ||
sidebar_position: 5 | ||
--- | ||
|
||
:::tip | ||
|
||
This documentation covers failed upgrade contingency for `poktroll` - a `cosmos-sdk` based chain. | ||
|
||
While this can be helpful for other blockchain networks, it is not guaranteed to work for other chains. | ||
|
||
::: | ||
|
||
## Contingency plans <!-- omit in toc --> | ||
|
||
There's always a chance the upgrade will fail. | ||
|
||
This document is intended to help you recover without significant downtime. | ||
|
||
- [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached) | ||
- [Option 1: The upgrade height is reached and the migration didn't start](#option-1-the-upgrade-height-is-reached-and-the-migration-didnt-start) | ||
- [Option 2: The migration is stuck](#option-2-the-migration-is-stuck) | ||
- [Option 3: The network is stuck at the future height after the upgrade](#option-3-the-network-is-stuck-at-the-future-height-after-the-upgrade) | ||
|
||
### Option 0: The bug is discovered before the upgrade height is reached | ||
|
||
**Cancel the upgrade plan!!** | ||
|
||
See the instructions on [how to do that here](./upgrade_procedure.md#cancelling-the-upgrade-plan). | ||
|
||
|
||
### Option 1: The upgrade height is reached and the migration didn't start | ||
|
||
If the nodes on the network stopped at the upgrade height and the migration has not | ||
started yet (i.e. there are no logs indicating the upgrade handler and store migrations are being executed), | ||
we must gather social consensus to restart validators with the `--unsafe-skip-upgrades=$upgradeHeightNumber` flag. | ||
|
||
This will skip the upgrade process, allowing the chain to continue and the protocol team to plan another release. | ||
|
||
`--unsafe-skip-upgrades` simply skips the upgrade handler and store migrations. | ||
The chain continues as if the upgrade plan was never set. | ||
The upgrade needs to be fixed, and then a new plan needs to be submitted to the network. | ||
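
For reference, a restart with the skip flag might look like the sketch below. The Cosmos SDK spells the flag `--unsafe-skip-upgrades`; the height `12345` is a placeholder, and `skip_upgrade_cmd` is a hypothetical helper that only prints the command so it can be reviewed before anyone runs it.

```bash
# Assumption: 12345 stands in for the height of the failed upgrade.
upgradeHeightNumber=12345

# Prints the restart command instead of running it.
skip_upgrade_cmd() {
  echo "poktrolld start --unsafe-skip-upgrades=$1"
}

skip_upgrade_cmd "$upgradeHeightNumber"

# When the node runs under cosmovisor, the equivalent would be:
#   cosmovisor run start --unsafe-skip-upgrades=<height>
```

Every Validator has to restart with the same height value; otherwise some nodes would still attempt the upgrade at that height.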
|
||
:::caution | ||
|
||
`--unsafe-skip-upgrades` needs to be documented and added to the scripts so that the next time somebody syncs the network from genesis, the failed upgrade is skipped automatically. | ||
|
||
<!-- TODO_IMPROVE(@okdas): new cosmovisor UX can simplify this --> | ||
|
||
::: | ||
|
||
### Option 2: The migration is stuck | ||
|
||
If the migration is stuck, there's always a chance the state has been mutated for | ||
the upgrade but the migration didn't complete. | ||
|
||
In such a case, we need to: | ||
|
||
- Roll back validators to the backup (a snapshot is taken by `cosmovisor` automatically prior to upgrade, if `UNSAFE_SKIP_BACKUP` is set to `false`). | ||
|
||
- Skip the upgrade handler and store migrations with `--unsafe-skip-upgrades=$upgradeHeightNumber`. | ||
Review comment: Please show full command (or link to script). I don't know what this means
Reply: Just like with
Reply: #PUC
|
||
- Document and add `--unsafe-skip-upgrades=$upgradeHeightNumber` to the scripts so the next time somebody tries to sync the network from genesis, | ||
they will automatically skip the failed upgrade. | ||
- Resolve the issue with an upgrade and schedule another plan. | ||
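
The first step above (rolling back to the backup) could be sketched as follows. This assumes the backup is a full copy of the node's `data` directory and that the node is stopped; the helper name `restore_backup` and both paths are hypothetical, so adjust them to wherever your `cosmovisor` setup stores its backups.

```bash
# Hypothetical helper: swap the node's data directory for a backup copy.
# Stop the node before running this.
restore_backup() {
  restore_home=$1    # node home directory containing `data`
  restore_backup=$2  # backup directory created before the upgrade
  if [ ! -d "$restore_backup" ]; then
    echo "backup not found: $restore_backup" >&2
    return 1
  fi
  # Keep the possibly mutated state for debugging instead of deleting it.
  mv "$restore_home/data" "$restore_home/data.failed-upgrade" || return 1
  cp -R "$restore_backup" "$restore_home/data"
}

# Example usage (both paths are placeholders):
#   restore_backup /path/to/node/home /path/to/data-backup
```

After the restore, the node can be restarted with the skip flag so it does not attempt the failed migration again.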
|
||
### Option 3: The network is stuck at the future height after the upgrade | ||
|
||
This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues. |
Review comment: Please update this section w/ links to the binaries for easier access.
Reply: There are no binaries to point to. Rephrased.