[TECH-DEBT] - Allow nodes to fully stop, lose all memory state, restart, and properly rejoin the network #2135

elliedavidson · 2023-11-29T19:08:07Z

What is this task and why do we need to work on it?

Nodes currently cannot rejoin the network after they fully shut down and restart. They can't get the configuration file from the orchestrator, identify themselves to the network, or gather enough information to start participating again. Our current "shutdown" tests do not fully shut down nodes; they only pause them, which isn't realistic behavior. This task adds functionality for nodes to optionally read configuration items from disk when they start up. It is needed to support more resilient testnets.

What work will need to be done to complete this task?

Nodes should write their configuration files and other necessary information to disk and optionally read from disk when they start up.

Investigate how the sequencer currently handles this. It's possible the easiest fix is sequencer-side.
Add ability for node to optionally read from a config file at a parametrizable location on disk at startup in the HotShot example code
Add ability for node to write config file to disk at startup to a parametrizable location in the HotShot example code
Test by running a small network of nodes and killing / restarting a range of those nodes. Ensure that the network continues to function.
Discuss integration of changes with sequencer team
Integrate into sequencer code
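
The read/write steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the HotShot example code (which is Rust): `NodeConfig`, its two fields, `save_config`, `load_or_fetch_config`, and the `fetch` callback standing in for the orchestrator request are all hypothetical names chosen for the sketch. The idea is that a node prefers a config already on disk, and otherwise fetches one and persists it so a later restart can reuse it.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class NodeConfig:
    # Hypothetical fields for illustration; a real config carries far more.
    node_index: int
    total_nodes: int


def save_config(config: NodeConfig, path: Path) -> None:
    """Write the config via a temp file so a crash mid-write leaves no partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(asdict(config), tmp)
    # os.replace swaps the finished file into place in a single step.
    os.replace(tmp_name, path)


def load_or_fetch_config(
    path: Optional[Path], fetch: Callable[[], NodeConfig]
) -> NodeConfig:
    """Prefer the on-disk config; otherwise call `fetch` and persist the result."""
    if path is not None and path.exists():
        return NodeConfig(**json.loads(path.read_text()))
    config = fetch()  # stand-in for asking the orchestrator (assumption)
    if path is not None:
        save_config(config, path)
    return config
```

Passing `path=None` keeps the disk step optional, matching the "optionally read from disk" wording in the task description.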

Are there any other details to include?

Ideally we should add the ability to fully shut down nodes to our testing harness, but that is outside the scope of this issue.

What are the acceptance criteria to close this issue?

Sequencer tests pass
HotShot test network with 5 nodes successfully handles shutting down and restarting all 5 nodes at different times (such that only 1 is ever offline at a time). Note that this will be a manual test.
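
Since the restart test is manual, it may help to plan the outages up front and check that no two overlap. The sketch below is only an illustration of the "only 1 is ever offline at a time" criterion: the `Outage` tuple layout, `rolling_schedule`, `max_offline`, and the timing values are assumptions made for the example, not part of any HotShot tooling.

```python
from typing import List, Tuple

# (node_id, stop_time, restart_time) in seconds; values are illustrative only.
Outage = Tuple[int, float, float]


def rolling_schedule(num_nodes: int, downtime: float, gap: float) -> List[Outage]:
    """Take each node offline in turn, leaving `gap` seconds with every node up."""
    schedule = []
    start = 0.0
    for node_id in range(num_nodes):
        schedule.append((node_id, start, start + downtime))
        start += downtime + gap
    return schedule


def max_offline(schedule: List[Outage]) -> int:
    """Return the largest number of nodes offline at any instant."""
    events = []
    for _, stop, restart in schedule:
        events.append((stop, 1))      # node goes offline
        events.append((restart, -1))  # node comes back
    # At equal timestamps, process restarts (-1) before stops (+1) so that
    # back-to-back outages are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    offline = worst = 0
    for _, delta in events:
        offline += delta
        worst = max(worst, offline)
    return worst


plan = rolling_schedule(num_nodes=5, downtime=30.0, gap=10.0)
assert max_offline(plan) == 1  # the acceptance criterion for this plan
```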

rob-maron · 2023-12-08T01:13:04Z

The status on this one is that I am running into an issue with the webserver locally, where it is not pulling down the most current proposal. The config changes are done, but catchup is intermittently working.

rob-maron · 2023-12-08T01:13:31Z

We may be able to merge in #2168, keep this issue open, and then address that part in a separate PR

rob-maron · 2023-12-11T14:58:35Z

Fixed an issue with catchup in #2192. There's still a problem: rounds in which the catchup node is leader still time out, even after it has caught up. This may be a voting issue; looking into it.

rob-maron · 2023-12-12T17:02:02Z

The last step we're waiting on is for it to pass the sequencer tests.

elliedavidson added tech-debt sprint6 labels Nov 29, 2023

elliedavidson added this to the Sprint 6 milestone Nov 29, 2023

elliedavidson self-assigned this Nov 29, 2023

rob-maron self-assigned this Dec 5, 2023

rob-maron mentioned this issue Dec 5, 2023

[Tech Debt] Allow nodes to rejoin by saving index and config #2168

Merged

rob-maron closed this as completed in #2168 Dec 9, 2023

rob-maron reopened this Dec 9, 2023

rob-maron mentioned this issue Dec 12, 2023

[Stability] Fix node catchup/rejoin regression #2198

Merged

rob-maron closed this as completed in #2198 Dec 12, 2023

rob-maron reopened this Dec 12, 2023

jbearer closed this as completed Dec 12, 2023
