Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic #707
Ok(handles)
let (task_group, task_manager) = TaskGroup::new();
for (name, handle) in handles {
    let _ = task_group.spawn(name, handle);
what would be returned here?
what are we spawning exactly?
Added a comment. Basically, a future that can be awaited like a `JoinHandle` is returned. But the returned future is unnecessary because we will use the `task_manager` to await task termination.
node/src/lib.rs
Outdated
@@ -341,8 +347,12 @@ impl Node {
store.batch_store.clone(),
metrics.clone(),
);
handles.extend(worker_handles);
// TODO: propagate worker task names if needed.
should this be an issue?
Opened an issue and linked here. If other aspects of this PR look ok, I can make the change in this PR too.
Sorry for the late review! This LGTM modulo a rebase.
Another potential refactor is to simplify shutdown handling: instead of each task implementing logic to handle `ReconfigureNotification::Shutdown`, we may be able to handle it in one place to shut down all tasks.
I think the goal of `ReconfigureNotification::Shutdown` is to allow more ad-hoc graceful shutdown logic than killing the task.
Makes sense.
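To make the distinction in this thread concrete, here is a minimal sketch of a task that runs its own cleanup when it is told to shut down, compared with one whose input simply disappears. It uses only the standard library. `Notification` and `drain_until_shutdown` are hypothetical names standing in for `ReconfigureNotification` and a real task loop, and the `std::sync::mpsc` channel is an assumption made for illustration, not how the codebase delivers notifications.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the notification a task listens to.
#[derive(Debug)]
pub enum Notification {
    Work(u32),
    Shutdown,
}

/// Buffers work items and flushes them only when asked to shut down.
/// Returns the flushed items so the caller can observe the cleanup.
pub fn drain_until_shutdown(rx: Receiver<Notification>) -> Vec<u32> {
    let mut pending = Vec::new();
    let mut flushed = Vec::new();
    // The loop ends on Shutdown, or when every sender has been dropped.
    for msg in rx {
        match msg {
            Notification::Work(item) => pending.push(item),
            Notification::Shutdown => {
                // Task-specific cleanup: the step an abrupt stop would skip.
                flushed.append(&mut pending);
                break;
            }
        }
    }
    flushed
}
```

If the sender is dropped without sending `Shutdown`, the loop ends with nothing flushed. That is roughly the behaviour that cancelling a task from the outside would give.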
TY!
Revert "Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic (MystenLabs/narwhal#707)" This reverts commit 693e87979d6be32d29414f1639735760d55c0b21.
Put all `JoinHandle`s of primary, worker and consensus into a `TaskGroup`, so if one task returns an error or panics, the whole group will be cancelled.
If we want to stop the primary or worker, we can just drop the `TaskManager` associated with the `TaskGroup`.
For cluster tests, I'm assuming `JoinHandle`s will finish on their own, i.e. checking whether a `JoinHandle` `is_finished()` is equivalent to checking if it has been cancelled.
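The group behaviour described above (one task failing or panicking stops all the others) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: the PR puts async tasks into a `TaskGroup`, while the sketch swaps in OS threads and a shared `AtomicBool` that each task must poll, because std threads cannot be aborted from the outside. `Outcome`, `Task`, `run_group` and `run_until_cancelled` are hypothetical names.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// How one task in the group ended.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Returned Ok, either on its own or after seeing the cancel flag.
    Stopped,
    /// Returned an error.
    Failed(String),
    /// Panicked.
    Panicked,
}

/// Hypothetical task type: it receives the shared cancel flag and polls it.
pub type Task = Box<dyn FnOnce(&AtomicBool) -> Result<(), String> + Send + 'static>;

/// Runs each named task on its own thread and waits for all of them.
/// The first error or panic sets the shared flag, so the remaining tasks
/// stop at their next check instead of running forever.
pub fn run_group(tasks: Vec<(&'static str, Task)>) -> Vec<(&'static str, Outcome)> {
    let cancel = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|(name, task)| {
            let cancel = Arc::clone(&cancel);
            let handle = thread::spawn(move || {
                // catch_unwind lets a panic trigger cancellation like an error does.
                let outcome = match catch_unwind(AssertUnwindSafe(|| task(&*cancel))) {
                    Ok(Ok(())) => Outcome::Stopped,
                    Ok(Err(e)) => Outcome::Failed(e),
                    Err(_) => Outcome::Panicked,
                };
                if outcome != Outcome::Stopped {
                    cancel.store(true, Ordering::SeqCst);
                }
                outcome
            });
            (name, handle)
        })
        .collect();
    handles
        .into_iter()
        .map(|(name, h)| (name, h.join().expect("panics are caught inside the thread")))
        .collect()
}

/// A long-running worker that only exits once the group is cancelled.
pub fn run_until_cancelled(cancel: &AtomicBool) -> Result<(), String> {
    while !cancel.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(5));
    }
    Ok(())
}
```

Passing `run_until_cancelled` together with a task that returns `Err` makes `run_group` return once the failure has been observed, with the worker reported as `Stopped`. Unlike dropping a `TaskManager`, this sketch offers no way to cancel from the outside; it only models cancellation triggered by a failing member.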