enable system instance to be restarted without affecting running jobs #3801

Open
garlick opened this issue Jul 27, 2021 · 2 comments
Comments

@garlick
Member

garlick commented Jul 27, 2021

As we work through failure modes in the system instance, it would be helpful if brokers or broker subtrees could be rebooted without affecting workloads.

@grondo
Contributor

grondo commented Oct 5, 2021

Let's use this issue as a tracking issue for basic support for restarting brokers without affecting running jobs. If that is OK, I will open the missing individual issues and link them, along with existing issues, in a checklist in the initial post.

To get things started, here's a list of items off the top of my head (I'll open issues on some of these and move them to a checklist above). Anyone should feel free to edit and add to this list or to the checklist above.

@grondo
Contributor

grondo commented Oct 5, 2021

It occurs to me we could develop support for this issue in several phases:

  1. restart a leaf broker in an instance of size > 1 and recover jobs
  2. restart a non-rank-0 broker and its subtree and recover jobs
  3. restart the rank 0 broker (effectively restarting the entire instance)

For the job shell, we could first support restart of single-shell jobs, then tackle restart of multi-shell jobs.

So as a first step, we should target restarting a leaf broker that is running a job with a single shell rank.
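
For illustration only, here is a rough sketch of how a restart request could be mapped to one of the three phases above. It assumes the brokers form a k-ary tree in which the children of rank `r` are ranks `k*r + 1` through `k*r + k`. The helper names (`children`, `subtree`, `restart_phase`) are hypothetical and are not part of any existing API.

```python
# Hypothetical sketch: classify a broker restart by phase, assuming a
# k-ary tree overlay where rank r has children k*r + 1 .. k*r + k.

def children(rank, size, k=2):
    """Return the child ranks of `rank` in a k-ary tree of `size` brokers."""
    first = k * rank + 1
    return [r for r in range(first, first + k) if r < size]

def subtree(rank, size, k=2):
    """Return `rank` plus all of its descendants (the brokers that restart together)."""
    ranks = [rank]
    for child in children(rank, size, k):
        ranks.extend(subtree(child, size, k))
    return ranks

def restart_phase(rank, size, k=2):
    """Map a restart of `rank` to phase 1, 2, or 3 from the list above."""
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is outside instance of size {size}")
    if rank == 0:
        return 3  # rank 0: effectively a restart of the entire instance
    if children(rank, size, k):
        return 2  # interior broker: its whole subtree goes down with it
    return 1      # leaf broker: only this broker's jobs need recovery

if __name__ == "__main__":
    size = 7
    for rank in range(size):
        print(rank, restart_phase(rank, size), subtree(rank, size))
```

In this sketch, `subtree(rank, size)` gives the set of brokers whose jobs would have to be recovered, which is a single rank for phase 1 and grows to the whole instance for phase 3.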
