[TECH-DEBT] - Allow nodes to fully stop, lose all memory state, restart, and properly rejoin the network #2135

Closed
7 of 8 tasks
elliedavidson opened this issue Nov 29, 2023 · 4 comments · Fixed by #2168 or #2198

@elliedavidson
Member

elliedavidson commented Nov 29, 2023

What is this task and why do we need to work on it?

Nodes currently cannot rejoin the network after they fully shut down and restart. They can't properly get the configuration file from the orchestrator, identify themselves to the network, or get enough information to start participating in the network again. Our current "shutdown" tests do not fully shut down nodes; they only pause them, which isn't realistic behavior. This task adds functionality for nodes to optionally read configuration items from disk when they start up. It is needed to support more resilient testnets.

What work will need to be done to complete this task?

Nodes should write their configuration files and other necessary information to disk and optionally read from disk when they start up.

  • Investigate how the sequencer currently handles this. It's possible the easiest fix is sequencer-side.
  • In the HotShot example code, add the ability for a node to optionally read its config file from a parametrizable location on disk at startup
  • In the HotShot example code, add the ability for a node to write its config file to a parametrizable location on disk at startup
  • Test by running a small network of nodes and killing / restarting a range of those nodes. Ensure that the network continues to function.
  • Discuss integration of changes with sequencer team
  • Integrate into sequencer code
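The read/write behavior described in the tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the HotShot implementation: `NodeConfig`, its two fields, the `key=value` file format, and `load_or_fetch_config` are all hypothetical, and the `fetch` closure merely stands in for the request to the orchestrator.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical, minimal stand-in for a node's configuration.
/// The real HotShot config carries far more than this.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeConfig {
    pub node_index: u64,
    pub total_nodes: u64,
}

impl NodeConfig {
    /// Serialize to simple `key=value` lines (illustrative format only).
    fn to_text(&self) -> String {
        format!(
            "node_index={}\ntotal_nodes={}\n",
            self.node_index, self.total_nodes
        )
    }

    /// Parse the `key=value` format written by `to_text`.
    /// Returns `None` if a line is malformed or a field is missing.
    fn from_text(text: &str) -> Option<Self> {
        let mut node_index: Option<u64> = None;
        let mut total_nodes: Option<u64> = None;
        for line in text.lines() {
            if line.trim().is_empty() {
                continue;
            }
            let (key, value) = line.split_once('=')?;
            match key.trim() {
                "node_index" => node_index = value.trim().parse().ok(),
                "total_nodes" => total_nodes = value.trim().parse().ok(),
                _ => {} // ignore unknown keys
            }
        }
        Some(Self {
            node_index: node_index?,
            total_nodes: total_nodes?,
        })
    }
}

/// If `path` is given and a file exists there, load the config from it.
/// Otherwise call `fetch` (standing in for the orchestrator request) and,
/// when `path` is given, persist the result so a later restart can reuse it.
pub fn load_or_fetch_config<F>(path: Option<&Path>, fetch: F) -> io::Result<NodeConfig>
where
    F: FnOnce() -> NodeConfig,
{
    if let Some(p) = path {
        if p.exists() {
            let text = fs::read_to_string(p)?;
            return NodeConfig::from_text(&text).ok_or_else(|| {
                io::Error::new(io::ErrorKind::InvalidData, "malformed config file")
            });
        }
    }
    let config = fetch();
    if let Some(p) = path {
        fs::write(p, config.to_text())?;
    }
    Ok(config)
}
```

Passing `None` for the path keeps the existing behavior (always ask the orchestrator), which is what makes the disk read optional in this sketch. A restarted node that passes the same path as before gets back the identity it had prior to shutdown.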

Are there any other details to include?

Ideally we should add the ability to fully shut down nodes to our testing harness, but that is outside the scope of this issue.

What are the acceptance criteria to close this issue?

  • Sequencer tests pass
  • HotShot test network with 5 nodes successfully handles shutting down and restarting all 5 nodes at different times (such that only 1 is ever offline at a time). Note that this will be a manual test.
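For the manual test, it may help to write down the restart schedule in advance and check that it never takes more than one node offline at once. The sketch below is purely illustrative: `Outage`, `rolling_restart_plan`, and `max_offline` are hypothetical names, the times are arbitrary units, and nothing here actually starts or stops a node.

```rust
/// One planned outage: node `node` is offline during `[down_at, up_at)`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Outage {
    pub node: usize,
    pub down_at: u64,
    pub up_at: u64,
}

/// Plan one outage per node, in node order. Each node stays down for
/// `downtime`, and `gap` separates one node coming back up from the next
/// node going down (time for the restarted node to catch up).
pub fn rolling_restart_plan(num_nodes: usize, downtime: u64, gap: u64) -> Vec<Outage> {
    (0..num_nodes)
        .map(|node| {
            let down_at = node as u64 * (downtime + gap);
            Outage {
                node,
                down_at,
                up_at: down_at + downtime,
            }
        })
        .collect()
}

/// Largest number of nodes offline at the same time under `plan`.
/// The maximum overlap of half-open intervals is always reached at the
/// start of some interval, so only those instants are checked.
pub fn max_offline(plan: &[Outage]) -> usize {
    plan.iter()
        .map(|a| {
            plan.iter()
                .filter(|b| b.down_at <= a.down_at && a.down_at < b.up_at)
                .count()
        })
        .max()
        .unwrap_or(0)
}
```

With 5 nodes, `max_offline(&rolling_restart_plan(5, 30, 60))` evaluates to 1, which matches the "only 1 is ever offline at a time" condition above.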
@rob-maron
Collaborator

Status: I am running into an issue with the webserver locally, where it is not pulling down the most current proposal. The config changes are done, but catchup works only intermittently.

@rob-maron
Collaborator

We may be able to merge #2168, keep this issue open, and then address the catchup part in a separate PR.

@rob-maron
Collaborator

Fixed an issue with catchup in #2192. There's still a problem: rounds in which the catchup node is leader still time out, even after it has caught up. This may be a voting issue; looking into it.

@rob-maron
Collaborator

The last step we're waiting on is for it to pass the sequencer tests.

@jbearer closed this as completed Dec 12, 2023