
[DOC] Document zero time upgrade for nearcore #341

Closed
ilblackdragon opened this issue May 16, 2020 · 11 comments · Fixed by #393
Labels
enhancement New feature or request P1 Priority 1

Comments

@ilblackdragon
Member

Is your feature request related to a problem? Please describe.
Currently, everyone stops their node, updates the binary, and then restarts it.
This can take anywhere from a few minutes to 30-40 minutes in cases where the new binary needs to be built.
Also, after the node restarts, it needs to catch up with the network. For validators this leads to liveness issues and a minor denial of service.

Describe the solution you'd like
NEAR has the unique ability to switch between validator nodes atomically at an epoch switch by changing validator keys.

The solution is to (see the sketch after this list):

  1. Start a new node with the new binary, using a new validator key but the same account id
  2. Wait until the new node fully syncs (or time things properly with epoch switches)
  3. Send a re-staking transaction with the new validator key (if running a staking contract, rotate the staking key within that contract)
  4. Wait until the epoch switches and the new validator key becomes active
  5. Stop the old node with the old validator key.
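For steps 2 and 4, the "wait until" checks can be done against the node's RPC instead of by eye. Below is a minimal polling sketch, not an official tool: it assumes the new node exposes the default RPC at localhost:3030, that `/status` reports `sync_info.syncing`, and that the `validators` JSON-RPC method returns `current_validators` entries with `account_id` and `public_key`; the account id and key are placeholders, and field names may differ between nearcore versions.

```python
# Rough sketch: wait for the new node to sync (step 2), then wait for the
# epoch switch that activates the new validator key (step 4) before it is
# safe to stop the old node (step 5). RPC shapes are assumptions; verify
# them against the nearcore version you actually run.
import time
import requests

RPC = "http://localhost:3030"      # new node's RPC endpoint (assumed default port)
ACCOUNT_ID = "example.validator"   # placeholder: your existing validator account id
NEW_KEY = "ed25519:..."            # placeholder: the newly generated validator public key

def is_synced():
    status = requests.get(f"{RPC}/status").json()
    return not status["sync_info"]["syncing"]

def active_key_for(account_id):
    # The 'validators' JSON-RPC method returns the current validator set.
    payload = {"jsonrpc": "2.0", "id": "dontcare", "method": "validators", "params": [None]}
    result = requests.post(RPC, json=payload).json()["result"]
    for v in result["current_validators"]:
        if v["account_id"] == account_id:
            return v["public_key"]
    return None

# Step 2: wait for the new node to finish syncing.
while not is_synced():
    time.sleep(30)
print("New node is synced; send the re-staking transaction now (step 3).")

# Step 4: wait for the epoch switch that makes the new key active.
while active_key_for(ACCOUNT_ID) != NEW_KEY:
    time.sleep(60)
print("New validator key is active; it is safe to stop the old node (step 5).")
```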
@eovchar
Contributor

eovchar commented May 20, 2020

@ilblackdragon How technically challenging would it be to solve this problem by designing for N validators associated with the same account? This way you'd cover the stated maintenance/failover use case, but also provide permanent redundancy and network resiliency.
Currently your implementation only allows a direct association between a node and an account, but I wonder if it's possible to create an abstraction level (for example via a certificate or pseudonode) where multiple different nodes can be bound to the same certificate/key/pseudonode, and that key is the one used for staking, allowing the validator to run the validation service at scale.

@bowenwang1996
Contributor

@abakuz

run N validators associated with the same account

How will blocks be produced under this setting?

@eovchar
Contributor

eovchar commented May 20, 2020

@bowenwang1996 consider that I have validator_1 and validator_2, both using the same pseudokey K1, so when both are online they both produce and validate blocks under K1.
Is that technically possible under the constraints of your protocol architecture?

@bowenwang1996
Contributor

That is not possible without a significant amount of work. In your case, at most one of the nodes will be able to produce blocks. Depending on the topology of the network, I believe it is very likely that neither will be able to produce blocks.

@eovchar
Contributor

eovchar commented May 21, 2020

@bowenwang1996 thanks for the feedback. What if we look at this problem in a different way?
Say validator_1 and validator_2 each continue to have their own node_key and produce blocks as they currently do, but instead of associating your node_key with your stake, you associate your node_key with a pseudokey and then stake using this pseudokey. The same pseudokey is reused on validator_1 and validator_2.

@bowenwang1996
Contributor

I don't understand how that is different. It is not about whether the validator key is the same -- you can use the same validator key for different validators. It is about whether the account is the same. In your case, if validator_1 and validator_2 use the same account then the problem is the same. If they are different accounts then I don't know what is solved here.

@stefanopepe
Contributor

We still have to wait for one full epoch for the switch, which is more than 12 hours on TestNet and MainNet (more than 3 hours on BetaNet).
It works very well with soft forks; I can't see how we can leverage this with a hard fork.

@bowenwang1996
Contributor

I can't see how we can leverage this with a hard fork

The idea is that we don't do hard forks (unless social consensus is required for some reason).

@june07

june07 commented May 28, 2020

This has been a concern for me for the past 2 weeks as well, since the build was taking almost an hour on the cloud instance the node runs on (per the recommended hardware config). Further, 2 weeks ago during the hard fork, the slow rebuild process failed (due to a mistake on my part, I believe) and the time to recovery was extended even longer. (A question I have regarding the "solution" given here: once the primary nodes have hard forked, does keeping your node running the old version even matter in terms of uptime, being kicked, etc.? It seems moot that the old node is running if it's running the pre-fork version of the code.)

My current solution has been to fork the project and track upstream changes; on each upstream push, GitHub's CI/CD automatically builds the binary (which still takes a bit of time, as you can see), but because builds are triggered as soon as upstream pushes land, the built binary should be available at rough parity with upstream.

[screenshot of CI build times]

I planned on testing the implementation this past Tuesday, but the hard fork was canceled. Still, I don't see why it wouldn't work fine. One additional step to add is to automatically copy the binaries to the validator node so that no manual transfer is required and they're just there.
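To make that concrete, here is one way the same idea could be scripted without hosted CI, as a plain polling loop rather than GitHub Actions: watch upstream for new commits, rebuild, and copy the binary over. The GitHub commits API and the near/nearcore repo are real; the checkout path, build command, binary name, and destination host are placeholders to adapt to your own setup.

```python
# Rough sketch: rebuild nearcore whenever upstream master moves, then copy the
# fresh binary to the validator host so no manual transfer is needed later.
# Paths, the build target, and the scp destination are placeholders.
import subprocess
import time
import requests

UPSTREAM = "https://api.github.com/repos/near/nearcore/commits/master"
SRC_DIR = "/opt/nearcore"               # local checkout of your fork (assumed path)
DEST = "validator-host:/home/near/bin/" # hypothetical validator node destination

last_built = None
while True:
    sha = requests.get(UPSTREAM).json()["sha"]
    if sha != last_built:
        subprocess.run(["git", "-C", SRC_DIR, "pull", "--ff-only"], check=True)
        # Adjust to whatever release build target nearcore actually uses.
        subprocess.run(["cargo", "build", "--release", "-p", "neard"], cwd=SRC_DIR, check=True)
        # Push the built binary to the validator node automatically.
        subprocess.run(["scp", f"{SRC_DIR}/target/release/neard", DEST], check=True)
        last_built = sha
    time.sleep(300)  # poll upstream every 5 minutes
```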

Does this sound like a solid workflow?! Would it be helpful to document this process for other validators who might not be doing it currently?

@bowenwang1996
Contributor

bowenwang1996 commented May 28, 2020

@june07 I assume you are talking about the current way of upgrading the network instead of what is proposed in this issue. It seems fine to me. Every validator has their own setup for running the node so I don't think there is any universal setup that we should specify (in fact, we are much less experienced in this area compared to some of our validators).

After we change the way we do network upgrades as specified in this issue, upgrades will happen asynchronously and validators will have more time to test and verify new releases.

once the primary nodes have hard forked, does keeping your node running the old version even matter in terms of uptime

Not sure what you mean by "primary nodes". If you are talking about the new node that runs the new binary, then after your new node has synced to the network and started validating, you can shut down the old node.

@48cfu

48cfu commented May 31, 2020

Is your feature request related to a problem? Please describe.
Currently, everyone stops their node, updates the binary, and then restarts it.
This can take anywhere from a few minutes to 30-40 minutes in cases where the new binary needs to be built.
Also, after the node restarts, it needs to catch up with the network. For validators this leads to liveness issues and a minor denial of service.

Describe the solution you'd like
NEAR has the unique ability to switch between validator nodes atomically at an epoch switch by changing validator keys.

The solution is to:

1. Start a new node with the new binary, using a new validator key but the same account id
2. Wait until the new node fully syncs (or time things properly with epoch switches)
3. Send a re-staking transaction with the new validator key (if running a staking contract, rotate the staking key within that contract)
4. Wait until the epoch switches and the new validator key becomes active
5. Stop the old node with the old validator key.

Is it possible to have an expanded version of this guide, especially point 1? Does it mean we have to create a new contract?
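For what it's worth, the steps above suggest point 1 is not a new contract: it is the same account id paired with a freshly generated validator key (the staking contract, if any, only has its staking key rotated in step 3). A rough sketch of producing such a key file is below; the `validator_key.json` layout and the base58-encoded ed25519 key format are assumptions based on nearcore's key files, so compare against the file your current node already uses.

```python
# Rough sketch: generate a fresh ed25519 validator key for the SAME account id
# and write it in a validator_key.json-style layout. Field names and the
# "seed + public key" secret encoding are assumptions; verify against your node.
import json
import base58                         # pip install base58
from nacl.signing import SigningKey   # pip install pynacl

ACCOUNT_ID = "example.validator"      # placeholder: reuse your existing account id

signing_key = SigningKey.generate()
public = bytes(signing_key.verify_key)
secret = bytes(signing_key) + public  # assumed 64-byte secret encoding

key_file = {
    "account_id": ACCOUNT_ID,
    "public_key": "ed25519:" + base58.b58encode(public).decode(),
    "secret_key": "ed25519:" + base58.b58encode(secret).decode(),
}

with open("validator_key.json", "w") as f:
    json.dump(key_file, f, indent=2)
print("New validator key written; start the new node with this key and the new binary.")
```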

@amgando amgando added the P1 Priority 1 label Jun 1, 2020
bowenwang1996 added a commit to near/nearcore that referenced this issue Jun 15, 2020
There is an issue with account announcement that prevents zero downtime upgrade from working properly. This PR fixes the issue and also adds a test to make sure that the upgrade process described in near/docs#341 works properly.

Test plan
-----------
* pytest `validator_switch_key.py`.

7 participants