Cannot make block after skipped x/upgrade #8538
Comments
Well, that's alarming. To begin with, I think we should just remove time-based upgrades. I'm not sure what's happening there, but it seems unreliable and has not been as thoroughly tested.
We also encountered something similar on the Akash network testnet. We did not use the upgrade module and went for a manual export and migration to v0.41. We could not get the chain to start at all, even though we had 73% consensus at one point.
We reverted to the original chain afterwards.
Nice to see this has been reproduced, but I have no idea where it comes from. Are any core devs looking at this? Maybe it is a Tendermint issue? @tessr could someone from your team take a look at these logs and maybe give us a hint about what is off?
Thanks for tagging me on this - it's not obvious to me either whether this is a Tendermint or SDK issue, but I'll ask someone on the team to take a look.
@ethanfrey is there anything in the logs? have you tried running it with
Preferably in pastebin
@melekes I am not sure the machines are still around, but I can see. @kaustubhkapatral can you also try with the debug logs?
Here are the debug logs from our main validator @alexanderbez https://pastebin.com/7Wgy9mh0
No - should we just transfer this issue over?
Feel free to move it over to Tendermint. I was unclear about what was going on here. It seems something similar has happened in other places, so it would be good to figure out what causes the liveness failure.
Turns out you can't move issues across different GitHub orgs, so I've opened a new issue on tendermint/tendermint to point to this.
This log corresponds to the statement below, correct?
And height 445130 is the last committed height of the chain before stalling out?
OK, so from quickly reading through the logs, I see that the validator does try to progress and commit a new height, 445131. But then there is a round regression, and it eventually disconnects from all its peers.
This might be from the WAL replay and due to the way the upgrade-time logic was used. I agree with @aaronc that only height should be used, because basing upgrades on time could cause weird issues in terms of which nodes run which steps at which time, i.e. it's too subjective. I'm not really sure there are any actionable items here for the Tendermint team to focus on @tessr.
Huh? After the precommits are there (from 71% of voting power), how can there be a "round regression"? What does that even mean?
Closing this in favor of #8801.
Summary of Bug
We tried doing x/upgrade on musselnet-2 with --upgrade-time (previously only tested with --upgrade-height) and it failed. (@orkunkl will add another issue documenting this)
We then decided to just skip it, using --unsafe-skip-upgrade on our validators (to whom we had staked almost 75% of the voting power). The nodes then came up and started participating in consensus, but never managed to make a new block, even though we have > 67% voting power in pre-commits.
Version
wasmd 0.14.0 based on cosmos-sdk v0.40.1 (we were attempting to upgrade to wasmd 0.15.0 / cosmos-sdk v0.41.0)
Steps to Reproduce
Not sure. I am dumping some output of the RPC nodes in the hope that it clarifies what is going on. From my understanding, rounds 2 to 31 should have produced a block. (Please rename from .json.txt to .json; GitHub did not allow JSON uploads)
Consensus state:
consensus_state.json.txt
Validators:
validators.json.txt
Validators (previous block):
validators-previous.json.txt
Key example from consensus state (height_vote_set):
75% precommits, but it does not advance?
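For readers checking the attached dumps, the commit rule in question can be sketched as follows. This is a minimal illustration of the general Tendermint requirement (strictly more than 2/3 of total voting power precommitting for the same block), not the Tendermint implementation and not a diagnosis of this stall. The function name and the `(block_id, voting_power)` input shape are hypothetical, chosen only for this example; they are not the format of consensus_state.json.

```python
from collections import defaultdict


def block_with_two_thirds(precommits, total_power):
    """Return the block ID that holds strictly more than 2/3 of total_power.

    precommits: iterable of (block_id, voting_power) pairs, where a
    block_id of None stands for a nil precommit (hypothetical shape).
    Returns None if no single block reaches the threshold.
    """
    power_by_block = defaultdict(int)
    for block_id, power in precommits:
        if block_id is not None:  # nil precommits never commit a block
            power_by_block[block_id] += power

    for block_id, power in power_by_block.items():
        # Integer form of power > (2/3) * total_power, avoiding float rounding.
        if 3 * power > 2 * total_power:
            return block_id
    return None


# 75% of the power on one block passes the threshold.
print(block_with_two_thirds([("A", 75)], 100))
# 75% of the power split across two blocks, or cast as nil, does not.
print(block_with_two_thirds([("A", 40), ("B", 35)], 100))
print(block_with_two_thirds([(None, 75)], 100))
```

Under this rule, a bit array showing 75% of validators having precommitted only implies a commit if those precommits are for the same non-nil block, so the individual votes in height_vote_set are worth comparing as well as the totals.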