Add rollback support in the event of an incorrect hash #10281
Comments
@robert-zaremba can you coordinate this in context of the upgrade wg? |
My understanding of this issue is to solve the following use case:
Do I have this right, @cmwaters? How important is it? Is it fine to ship it with v0.45?
|
Yup, with one minor clarification: Tendermint has its own state that tracks validator sets, consensus params and the app hash. We aren't actually removing any blocks, only the Tendermint state at that height. When the node restarts it will replay the same transactions in the last block.
Hard for me to judge how important it is because I haven't built an app before, but it was requested and it makes sense as a piece of tooling that is required. I think having it in v0.45 is fine. I don't see why it also couldn't be backported in the v0.44 range (we will be backporting it into v0.34 Tendermint).
Tendermint will have a public function (called I thought about adding a |
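The distinction drawn in the comment above (roll back only the consensus state, keep the block, replay it on restart) can be illustrated with a small toy model. This is a sketch under stated assumptions, not Tendermint's implementation: the types Block, ConsensusState and Node and the functions Rollback, Replay and goodApply are all hypothetical names invented for illustration, and the toy stores one app hash per height rather than modelling how hashes are carried in block headers.

```go
package main

import "fmt"

// Block is a hypothetical, simplified block: a height and its transactions.
type Block struct {
	Height int64
	Txs    []string
}

// ConsensusState is a toy stand-in for the state Tendermint keeps separately
// from the block store (the real one also tracks validator sets and params).
type ConsensusState struct {
	LastBlockHeight int64
	AppHash         string
}

// Node holds the block store plus one saved state snapshot per height.
type Node struct {
	Blocks []Block                  // block store; never touched by Rollback
	States map[int64]ConsensusState // state as it was after each height
	State  ConsensusState           // current state
}

// Rollback reverts only the consensus state by one height and keeps every block.
func (n *Node) Rollback() error {
	prev, ok := n.States[n.State.LastBlockHeight-1]
	if !ok {
		return fmt.Errorf("no saved state for height %d", n.State.LastBlockHeight-1)
	}
	delete(n.States, n.State.LastBlockHeight)
	n.State = prev
	return nil
}

// Replay re-executes stored blocks above the current state height using apply,
// which plays the role of the (now fixed) application.
func (n *Node) Replay(apply func(prevHash string, b Block) string) {
	for _, b := range n.Blocks {
		if b.Height <= n.State.LastBlockHeight {
			continue
		}
		n.State = ConsensusState{LastBlockHeight: b.Height, AppHash: apply(n.State.AppHash, b)}
		n.States[b.Height] = n.State
	}
}

// goodApply is a deterministic toy state transition.
func goodApply(prevHash string, b Block) string {
	return fmt.Sprintf("%s|h%d:%d", prevHash, b.Height, len(b.Txs))
}

func main() {
	n := &Node{
		Blocks: []Block{{1, []string{"a"}}, {2, []string{"b", "c"}}},
		States: map[int64]ConsensusState{
			1: {1, "genesis|h1:1"},
			2: {2, "BAD"}, // incorrect AppHash persisted at height 2
		},
		State: ConsensusState{2, "BAD"},
	}
	if err := n.Rollback(); err != nil {
		panic(err)
	}
	fmt.Println("after rollback:", n.State.LastBlockHeight, "blocks kept:", len(n.Blocks))
	n.Replay(goodApply)
	fmt.Println("after replay:", n.State.LastBlockHeight, n.State.AppHash)
}
```

In this toy, the block at height 2 stays in the store throughout; only the state entry for height 2 is discarded and then rebuilt by replaying the same transactions.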
We hit a non-determinism issue somehow related to an ibcclientupdate tx, leading to an AppHash mismatch. While trying to debug it and get enough info to make a proper bug report, we were hampered by the lack of any tool to retry from one block earlier. We are now syncing 510,000 blocks to see if we can reproduce it, which is not so fun. I can also try to lend a hand with a fix on 0.42 as soon as the Tendermint fix is in 0.34. |
NB: this has been requested by many chains over the last few years. I built a tool back in 2018/2019 that did this, but it was never maintained. AppHash mismatches are currently a pain to debug. |
@cmwaters, I think we need to expose an RPC call as well: when Tendermint is run independently (as a separate process) we should still have a way to send that request. |
This should not be a publicly available rpc endpoint. And does not need to be run remotely. We do need to handle the multi-process use case however. Something like |
Yup completely agree that we don't need to expose this via RPC. The process shouldn't be running when It's feasible to have a single command even for multi-process instances because no processes should be running. All you need is the tendermint |
Just an update on the Tendermint side. We've merged changes to master and backported them to the respective branches. We'll most likely release it in |
Thank you. I am watching this issue. Please do let me know when this is available in 0.34.14 and I will try to make time to start a branch on the cosmos-sdk side (exploratory work first) |
I investigated using the Tendermint rollback command in combination with modified application state to help the Thorchain team recover a halted network. One thing the Tendermint rollback command doesn't do is delete the block that was rolled back. This seems to cause the ABCI replay handshake to fail, because it tries to apply the same block again. |
The idea was intentionally to not delete the block but just Tendermint's own internal state (which tracks the Is there a need to remove the block as well? I would think the application would just skip over any "corrupted" transactions. |
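One way to reason about the replay behaviour discussed in the two comments above is to compare three heights on startup: the application's last committed height, the height of Tendermint's own state, and the height of the block store. The sketch below is a simplified model written for this thread, not Tendermint's actual handshake code; the function replayAction and its result strings are hypothetical, and it does not attempt to reproduce the Thorchain failure itself.

```go
package main

import "fmt"

// replayAction is a hypothetical helper that decides what a restart would need
// to do, given the app height, the consensus state height and the store height.
func replayAction(appHeight, stateHeight, storeHeight int64) string {
	switch {
	case storeHeight < stateHeight || storeHeight > stateHeight+1:
		// In this model the store may be at most one block ahead of the state.
		return "error: block store and state are inconsistent"
	case appHeight > storeHeight:
		return "error: app is ahead of the block store"
	case appHeight == storeHeight && stateHeight == storeHeight:
		return "nothing to replay"
	case appHeight == storeHeight:
		// The store is one block ahead of the state and the app already ran it.
		return "app already executed the last block; only the state must catch up"
	default:
		return fmt.Sprintf("replay blocks %d..%d against the app", appHeight+1, storeHeight)
	}
}

func main() {
	// After a state-only rollback at height 10: state is at 9, the block is kept.
	fmt.Println(replayAction(9, 9, 10))   // app was reverted to 9 as well
	fmt.Println(replayAction(10, 9, 10))  // app was left at 10
	fmt.Println(replayAction(10, 10, 10)) // no rollback happened
}
```

Because the rolled-back block is kept, an application that was also reverted by one height ends up in the first case and is handed the same block again, which is the behaviour both comments describe.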
If there is code that uses this somewhere (even a one-off script), it would be great to link it here as a concrete use case. These use cases usually appear in critical (chain halt) scenarios, so it is best to record them now, prepare the tooling beforehand, and leave minimal work for a crisis. |
Closes: cosmos#10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch
Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch
Closes: #10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad9) # Conflicts: # CHANGELOG.md # go.mod # go.sum Co-authored-by: yihuang <[email protected]>
Closes: cosmos#10281 fix tendermint rollback changelog update tendermint to recent v0.35.x branch (cherry picked from commit 8296ad9) Co-authored-by: yihuang <[email protected]>
Reopening as it seems this isn't fixed yet |
Oh, no. This is super important for recovering chains from some non-determinism issues. I think it is high priority to finally get this fixed. |
There was a small problem with blocksync which persisted the block before validating it and thus could leave it in a state whereby calling As a more general comment, there are really two forms of "rolling back" in Tendermint that I think it's important to distinguish:
I have only implemented the first case and am currently against the second, because creating a new block at the same height means all validators are effectively double signing. I'd prefer that, if there are "bad txs" somewhere, the application be modified so it can handle them correctly (and continue onwards) rather than Tendermint trying to wipe history. |
I think solving (1) well is the main use case too. It should be "easy" to recover from some non-deterministic app. (2) will likely require some dump state and export type thing and I think it is fine to make this a manual issue for now |
it'll be fixed by #11361 |
done in #11361 |
Problem Definition
The original issue can be found here. As a quick summary, in the event of a non-deterministic app hash or when an upgrade fails, Tendermint will have persisted the incorrect AppHash and nodes will be unable to make progress. What needs to happen is that the application should revert back to the previous state, Tendermint should also rollback to the previous state, then upon startup Tendermint can replay the last block and should now have the correct AppHash to continue.
Proposal
Work on the Tendermint side is underway here and will be backported to v0.34.14 when it is merged. It exposes a public function RollbackState which the SDK can use to provide the rollback tooling necessary.
cc @aaronc, @robert-zaremba, @ethanfrey