[DOC] Document zero time upgrade for nearcore #341
Comments
@ilblackdragon How technically challenging would it be to solve this problem by designing the system to run N validators associated with the same account? This way you'd cover your stated maintenance-failover use case, but also provide permanent redundancy and network resiliency.
How will blocks be produced under this setting?
@bowenwang1996 consider that I have validator_1 and validator_2, both using the same pseudo-key K1, so when both are online, both are producing and validating blocks under K1.
That is not possible without a significant amount of work. In your case, at most one of the nodes will be able to produce blocks. Depending on the topology of the network, I believe it is very likely that neither will be able to produce blocks.
@bowenwang1996 thanks for the feedback. What if we look at this problem in a different way?
I don't understand how that is different. It is not about whether the validator key is the same -- you can use the same validator key for different validators. It is about whether the account is the same. In your case, if validator_1 and validator_2 use the same account, then the problem is the same. If they are different accounts, then I don't know what is solved here.
We still have to wait for one full epoch for the switch, which is more than 12 hours on TestNet and MainNet (more than 3 hours on BetaNet).
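For context, the roughly 12-hour figure is just the epoch length in blocks times the average block time. The concrete numbers below (an epoch length of 43,200 blocks and about one second per block) are assumptions for illustration, not values read from the live networks.

```python
# Back-of-the-envelope check of the epoch wait time quoted above.
# Both constants are assumptions for illustration only.
EPOCH_LENGTH_BLOCKS = 43_200    # assumed MainNet/TestNet epoch length
AVG_BLOCK_TIME_SECONDS = 1.0    # assumed average block time

epoch_hours = EPOCH_LENGTH_BLOCKS * AVG_BLOCK_TIME_SECONDS / 3600
print(f"~{epoch_hours:.0f} hours per epoch")  # -> ~12 hours
```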
The idea is that we don't do hard forks (unless social consensus is required for some reason).
This has been a concern for me the past two weeks as well, since the build time was taking almost an hour on the cloud instance the node is running on (per the recommended hardware config). Further, two weeks ago during the hard fork, the slow rebuild process failed (due to a mistake on my part, I believe) and the time to recovery was extended even longer.

(A question I have regarding the "solution" as given here: once the primary nodes have hard forked, does keeping your node running the old version even matter in terms of uptime, being kicked, etc.? It seems a moot point that the old node is running if it's running the pre-fork version of the code.)

My current solution has been to fork the project, tracking upstream changes, and on each upstream push GitHub's CI/CD automatically builds the binary (which still takes a bit of time, as you can see), but because the build is triggered immediately by the upstream push, the built binary should be available at relative parity with upstream. I planned on testing the implementation this past Tuesday, however the hard fork was canceled; I don't see why it wouldn't work fine. One additional step to add is to automatically copy the binaries to the validator node so that no manual transfer is required and it's just there.

Does this sound like a solid workflow? Would it be helpful to document this process for other validators who might not be doing it currently?
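As an illustration of the auto-build idea described above, here is a minimal local-watcher sketch: poll upstream and rebuild as soon as a new commit lands, so the binary is ready before the upgrade window. The tracked branch, build command, package name, and the assumption that this runs inside a clone of the fork are all hypothetical, not verified nearcore specifics.

```python
# Hypothetical sketch: poll the upstream repository and rebuild the binary as
# soon as a new commit lands on the tracked branch. Run inside a clone of the
# fork; branch, build command, and polling interval are assumptions.
import subprocess
import time

UPSTREAM = "https://github.com/near/nearcore"
BRANCH = "master"                                           # assumed branch to track
BUILD_CMD = ["cargo", "build", "--release", "-p", "neard"]  # assumed build command
POLL_SECONDS = 300


def upstream_head() -> str:
    """Return the commit hash currently at the tip of the tracked branch."""
    out = subprocess.run(
        ["git", "ls-remote", UPSTREAM, BRANCH],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()[0] if out.stdout.strip() else ""


def main() -> None:
    last_built = None
    while True:
        head = upstream_head()
        if head and head != last_built:
            subprocess.run(["git", "pull", UPSTREAM, BRANCH], check=True)
            subprocess.run(BUILD_CMD, check=True)
            # Optionally copy the resulting binary to the validator host here,
            # so no manual transfer is required.
            last_built = head
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```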
@june07 I assume you are talking about the current way of upgrading the network instead of what is proposed in this issue. It seems fine to me. Every validator has their own setup for running the node, so I don't think there is any universal setup that we should specify (in fact, we are much less experienced in this area compared to some of our validators). After we change the way we do network upgrades as specified in this issue, upgrades will happen asynchronously and validators will have more time to test and verify new releases.
Not sure what you mean by "primary nodes". If you are talking about the new node that runs the new binary, then after your new node is synced to the network and starts validating, you can shut down the old node.
Is it possible to have an expanded version of this guide? Especially point 1. Does it mean we have to create a new contract?
There is an issue with account announcement that prevents zero downtime upgrade from working properly. This PR fixes the issue and also adds a test to make sure that the upgrade process described in near/docs#341 works properly.

Test plan:
* pytest `validator_switch_key.py`
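To make the intent of that test concrete, here is a self-contained toy model, not the actual nearcore pytest: the epoch length, version strings, and key strings are invented, and the only point illustrated is that a key registered for the next epoch moves block production to the new node exactly at the boundary, with no gap.

```python
# Toy model of a validator key switch at an epoch boundary. All values are
# invented for illustration; this is not the real nearcore test harness.
from dataclasses import dataclass
from typing import Optional

EPOCH_LENGTH = 10  # blocks per epoch in this toy model


@dataclass
class Node:
    binary_version: str
    validator_key: Optional[str]  # key this node signs blocks with, if any


def producer_for(nodes, epoch_key):
    """Return the node holding the key registered for the current epoch."""
    return next((n for n in nodes if n.validator_key == epoch_key), None)


def test_switch_key_across_epoch_boundary():
    old = Node("1.17.0", validator_key="ed25519:OLD")  # old binary, old key
    new = Node("1.18.0", validator_key="ed25519:NEW")  # new binary, new key

    # A re-stake with the new public key only takes effect at the next epoch,
    # which is what makes the handoff atomic.
    key_for_epoch = {0: "ed25519:OLD", 1: "ed25519:NEW"}

    produced = []
    for height in range(2 * EPOCH_LENGTH):
        epoch = height // EPOCH_LENGTH
        producer = producer_for([old, new], key_for_epoch[epoch])
        assert producer is not None, f"no block producer at height {height}"
        produced.append(producer.binary_version)

    # No gap in block production, and the switch lands exactly on the boundary.
    assert produced[:EPOCH_LENGTH] == ["1.17.0"] * EPOCH_LENGTH
    assert produced[EPOCH_LENGTH:] == ["1.18.0"] * EPOCH_LENGTH
```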
Is your feature request related to a problem? Please describe.
Currently, everyone stops their node, updates the binary, and then restarts it.
This can take from a few minutes to 30–40 minutes in cases when the new binary needs to be built.
Also, after restarting, the node needs to catch up with the network. For validators this leads to liveness issues and a minor denial of service.
Describe the solution you'd like
NEAR has a unique ability to switch between validator nodes atomically at an epoch switch by changing validator keys.
The solution is to:
1. Start a second node running the new binary and let it sync with the network.
2. Switch the validator key to the new node, which takes effect at the next epoch boundary.
3. Shut down the old node once the new node is producing and validating blocks.
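For a concrete picture of what "changing validation keys" means operationally, here is a minimal sketch under stated assumptions: the account name, key strings, data directory, and the `validator_key.json` field layout are all hypothetical. The idea is that the new node runs with its own freshly generated key pair, and the validator account re-stakes with that public key so the switch takes effect at the next epoch boundary.

```python
# Minimal sketch (assumptions throughout): give the new node its own key pair
# and let the re-stake point at the new public key. Paths, account name, key
# strings, and the validator_key.json field names are illustrative only.
import json
from pathlib import Path

NEW_NODE_HOME = Path.home() / ".near-new"  # assumed data dir of the new node
NEW_NODE_HOME.mkdir(parents=True, exist_ok=True)

new_validator_key = {
    "account_id": "example.pool.near",     # hypothetical validator account
    "public_key": "ed25519:NEW_PUBLIC_KEY_PLACEHOLDER",
    "secret_key": "ed25519:NEW_SECRET_KEY_PLACEHOLDER",
}

# The new node signs blocks with this key once the re-stake with the new
# public key takes effect at the next epoch boundary; until then the old
# node keeps validating with the old key, so there is no downtime.
(NEW_NODE_HOME / "validator_key.json").write_text(
    json.dumps(new_validator_key, indent=2)
)
```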