
[DOC] Document zero time upgrade for nearcore #341

Closed
ilblackdragon opened this issue May 16, 2020 · 11 comments · Fixed by #393
Labels
enhancement New feature or request P1 Priority 1

Comments

@ilblackdragon
Member

Is your feature request related to a problem? Please describe.
Currently, everyone stops their node, updates the binary, and then restarts it.
This can take anywhere from a few minutes to 30-40 minutes in cases where the new binary needs to be built.
Also, after the node restarts, it needs to catch up with the network. For validators this leads to liveness issues and a minor denial of service.

Describe the solution you'd like
NEAR has the unique ability to switch between validator nodes atomically at an epoch switch by changing validator keys.

The solution is to (see the sketch after this list):

  1. Start a new node with the new binary, using a new validator key but the same account id
  2. Wait until the new node fully syncs (or time things properly with epoch switches)
  3. Send a re-staking transaction with the new validator key (if running a staking contract, rotate the staking key within that contract)
  4. Wait until the epoch switches and the new validator key becomes active
  5. Stop the old node with the old validator key.
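For steps 2 and 4, the "wait until" checks can be done against the node's RPC instead of by eye. Below is a minimal polling sketch, not an official tool: it assumes the new node exposes the default RPC at localhost:3030, that `/status` reports `sync_info.syncing`, and that the `validators` JSON-RPC method returns `current_validators` entries with `account_id` and `public_key`; the account id and key are placeholders, and field names may differ between nearcore versions.

```python
# Rough sketch: wait for the new node to sync (step 2), then wait for the
# epoch switch that activates the new validator key (step 4) before it is
# safe to stop the old node (step 5). RPC shapes are assumptions; verify
# them against the nearcore version you actually run.
import time
import requests

RPC = "http://localhost:3030"      # new node's RPC endpoint (assumed default port)
ACCOUNT_ID = "example.validator"   # placeholder: your existing validator account id
NEW_KEY = "ed25519:..."            # placeholder: the newly generated validator public key

def is_synced():
    status = requests.get(f"{RPC}/status").json()
    return not status["sync_info"]["syncing"]

def active_key_for(account_id):
    # The 'validators' JSON-RPC method returns the current validator set.
    payload = {"jsonrpc": "2.0", "id": "dontcare", "method": "validators", "params": [None]}
    result = requests.post(RPC, json=payload).json()["result"]
    for v in result["current_validators"]:
        if v["account_id"] == account_id:
            return v["public_key"]
    return None

# Step 2: wait for the new node to finish syncing.
while not is_synced():
    time.sleep(30)
print("New node is synced; send the re-staking transaction now (step 3).")

# Step 4: wait for the epoch switch that makes the new key active.
while active_key_for(ACCOUNT_ID) != NEW_KEY:
    time.sleep(60)
print("New validator key is active; it is safe to stop the old node (step 5).")
```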
@eovchar
Contributor

eovchar commented May 20, 2020

@ilblackdragon How technically challenging would it be to solve this problem by designing for N validators associated with the same account? This way you'd cover the stated maintenance/failover use case, but also provide permanent redundancy and network resiliency.
Currently your implementation only allows a direct association between a node and an account, but I wonder if it's possible to create an abstraction level (for example via a certificate or pseudonode) where multiple different nodes can be bound to the same certificate/key/pseudonode, and that key is the one used for staking, allowing the validator to run the validation service at scale.

@bowenwang1996
Contributor

@abakuz

run N validators associated with the same account

How will blocks be produced under this setting?

@eovchar
Contributor

eovchar commented May 20, 2020

@bowenwang1996 consider that I have validator_1 and validator_2, both using the same pseudokey K1, so when both are online they both produce and validate blocks under K1.
Is that technically possible under the constraints of your protocol architecture?

@bowenwang1996
Contributor

That is not possible without a significant amount of work. In your case, at most one of the nodes will be able to produce blocks. Depending on the topology of the network, I believe it is very likely that neither will be able to produce blocks.

@eovchar
Contributor

eovchar commented May 21, 2020

@bowenwang1996 thanks for the feedback. What if we look at this problem in a different way?
Say validator_1 and validator_2 each continue to have their own node_key and produce blocks as they currently do, but instead of associating your node_key with your stake, you associate your node_key with a pseudokey and then stake using this pseudokey. The same pseudokey is reused on validator_1 and validator_2.

@bowenwang1996
Contributor

I don't understand how that is different. It is not about whether the validator key is the same -- you can use the same validator key for different validators. It is about whether the account is the same. In your case, if validator_1 and validator_2 use the same account then the problem is the same. If they are different accounts then I don't know what is solved here.

@stefanopepe
Contributor

We still have to wait for one full epoch for the switch, which is more than 12 hours on TestNet and MainNet (more than 3 hours on BetaNet).
It works very well with soft forks; I can't see how we can leverage this with a hard fork.

@bowenwang1996
Contributor

I can't see how we can leverage this with a hard fork

The idea is that we don't do hard forks (unless social consensus is required for some reason).

@june07

june07 commented May 28, 2020

This has been a concern for me for the past 2 weeks as well, since the build was taking almost an hour on the cloud instance the node runs on (per the recommended hardware config). Further, 2 weeks ago during the hard fork, the slow rebuild process failed (due to a mistake on my part, I believe) and the time to recovery was extended even longer. (A question I have regarding the "solution" given here: once the primary nodes have hard forked, does keeping your node running the old version even matter in terms of uptime, being kicked, etc.? It seems moot that the old node is running if it's running the pre-fork version of the code.)

My current solution has been to fork the project and track upstream changes; on each upstream push, GitHub's CI/CD automatically builds the binary (which still takes a bit of time, as you can see), but because builds are triggered as soon as upstream pushes land, the built binary should be available at rough parity with upstream.

[screenshot of CI build times]

I planned on testing the implementation this past Tuesday, but the hard fork was canceled. Still, I don't see why it wouldn't work fine. One additional step to add is to automatically copy the binaries to the validator node so that no manual transfer is required and they're just there.
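To make that concrete, here is one way the same idea could be scripted without hosted CI, as a plain polling loop rather than GitHub Actions: watch upstream for new commits, rebuild, and copy the binary over. The GitHub commits API and the near/nearcore repo are real; the checkout path, build command, binary name, and destination host are placeholders to adapt to your own setup.

```python
# Rough sketch: rebuild nearcore whenever upstream master moves, then copy the
# fresh binary to the validator host so no manual transfer is needed later.
# Paths, the build target, and the scp destination are placeholders.
import subprocess
import time
import requests

UPSTREAM = "https://api.github.com/repos/near/nearcore/commits/master"
SRC_DIR = "/opt/nearcore"               # local checkout of your fork (assumed path)
DEST = "validator-host:/home/near/bin/" # hypothetical validator node destination

last_built = None
while True:
    sha = requests.get(UPSTREAM).json()["sha"]
    if sha != last_built:
        subprocess.run(["git", "-C", SRC_DIR, "pull", "--ff-only"], check=True)
        # Adjust to whatever release build target nearcore actually uses.
        subprocess.run(["cargo", "build", "--release", "-p", "neard"], cwd=SRC_DIR, check=True)
        # Push the built binary to the validator node automatically.
        subprocess.run(["scp", f"{SRC_DIR}/target/release/neard", DEST], check=True)
        last_built = sha
    time.sleep(300)  # poll upstream every 5 minutes
```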

Does this sound like a solid workflow?! Would it be helpful to document this process for other validators who might not be doing it currently?

@bowenwang1996
Contributor

bowenwang1996 commented May 28, 2020

@june07 I assume you are talking about the current way of upgrading the network instead of what is proposed in this issue. It seems fine to me. Every validator has their own setup for running the node so I don't think there is any universal setup that we should specify (in fact, we are much less experienced in this area compared to some of our validators).

After we change the way we do network upgrades as specified in this issue, upgrades will happen asynchronously and validators will have more time to test and verify new releases.

once the primary nodes have hard forked, does keeping your node running the old version even matter in terms of uptime

Not sure what you mean by "primary nodes". If you are talking about the new node that runs the new binary, then after your new node has synced to the network and started validating, you can shut down the old node.

@48cfu

48cfu commented May 31, 2020

Is your feature request related to a problem? Please describe.
Currently, everyone stops their node, updates the binary, and then restarts it.
This can take anywhere from a few minutes to 30-40 minutes in cases where the new binary needs to be built.
Also, after the node restarts, it needs to catch up with the network. For validators this leads to liveness issues and a minor denial of service.

Describe the solution you'd like
NEAR has the unique ability to switch between validator nodes atomically at an epoch switch by changing validator keys.

The solution is to:

1. Start a new node with the new binary, using a new validator key but the same account id
2. Wait until the new node fully syncs (or time things properly with epoch switches)
3. Send a re-staking transaction with the new validator key (if running a staking contract, rotate the staking key within that contract)
4. Wait until the epoch switches and the new validator key becomes active
5. Stop the old node with the old validator key.

Is it possible to have an expanded version of this guide, especially point 1? Does it mean we have to create a new contract?
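For what it's worth, the steps above suggest point 1 is not a new contract: it is the same account id paired with a freshly generated validator key (the staking contract, if any, only has its staking key rotated in step 3). A rough sketch of producing such a key file is below; the `validator_key.json` layout and the base58-encoded ed25519 key format are assumptions based on nearcore's key files, so compare against the file your current node already uses.

```python
# Rough sketch: generate a fresh ed25519 validator key for the SAME account id
# and write it in a validator_key.json-style layout. Field names and the
# "seed + public key" secret encoding are assumptions; verify against your node.
import json
import base58                         # pip install base58
from nacl.signing import SigningKey   # pip install pynacl

ACCOUNT_ID = "example.validator"      # placeholder: reuse your existing account id

signing_key = SigningKey.generate()
public = bytes(signing_key.verify_key)
secret = bytes(signing_key) + public  # assumed 64-byte secret encoding

key_file = {
    "account_id": ACCOUNT_ID,
    "public_key": "ed25519:" + base58.b58encode(public).decode(),
    "secret_key": "ed25519:" + base58.b58encode(secret).decode(),
}

with open("validator_key.json", "w") as f:
    json.dump(key_file, f, indent=2)
print("New validator key written; start the new node with this key and the new binary.")
```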

@amgando amgando added the P1 Priority 1 label Jun 1, 2020
bowenwang1996 added a commit to near/nearcore that referenced this issue Jun 15, 2020
There is an issue with account announcement that prevents zero downtime upgrade from working properly. This PR fixes the issue and also adds a test to make sure that the upgrade process described in near/docs#341 works properly.

Test plan
-----------
* pytest `validator_switch_key.py`.

7 participants