Skip to content
This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

Use TaskGroup to ensure all primary / worker tasks are cancelled on error and panic #707

Merged
merged 6 commits into from
Aug 13, 2022

Conversation

mwtian
Copy link
Member

@mwtian mwtian commented Aug 7, 2022

Put all JobHandle of primary, worker and consensus into a TaskGroup, so if one task returns an error or panics, the whole group will be cancelled.

If we want to stop the primary or worker, we can just drop the TaskManager associated with the TaskGroup.

For cluster tests, I'm assuming JobHandles will be finish on their own, i.e. checking where a JobHandle is_finished() is equivalent to checking if it has been cancelled.

@mwtian mwtian marked this pull request as ready for review August 7, 2022 17:28
@mwtian mwtian requested a review from asonnino as a code owner August 7, 2022 17:28
@mwtian mwtian changed the title [WIP] Use task group to ensure all primary / worker tasks are cancelled on error and panic Use TaskGroup to ensure all primary / worker tasks are cancelled on error and panic Aug 7, 2022
@mwtian mwtian requested review from akichidis and huitseeker August 7, 2022 18:08
Ok(handles)
let (task_group, task_manager) = TaskGroup::new();
for (name, handle) in handles {
let _ = task_group.spawn(name, handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be returned here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we spawning exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. Basically a future that can be awaited as the JoinHandle is returned. But the returned future is unnecessary because we will use the task_manager to await for task terminations.

node/src/lib.rs Outdated
@@ -341,8 +347,12 @@ impl Node {
store.batch_store.clone(),
metrics.clone(),
);
handles.extend(worker_handles);
// TODO: propagate worker task names if needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be an issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue and linked here. If other aspects of this PR look ok, I can make the change in this PR too.

@mwtian
Copy link
Member Author

mwtian commented Aug 8, 2022

Another potential refactor is to simplify shutdown handling: instead of each task implement logic to handle ReconfigureNotification::Shutdown, we may be able to handle it in one place to shutdown all tasks.

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review! This LGTM modulo a rebase,

Another potential refactor is to simplify shutdown handling: instead of each task implement logic to handle ReconfigureNotification::Shutdown, we may be able to handle it in one place to shutdown all tasks.

I think the goal of the ReconfigureNotification::Shutdown is to allow more ad-hoc graceful shutdown logic than killing the task.

@mwtian
Copy link
Member Author

mwtian commented Aug 13, 2022

Sorry for the late review! This LGTM modulo a rebase,

Another potential refactor is to simplify shutdown handling: instead of each task implement logic to handle ReconfigureNotification::Shutdown, we may be able to handle it in one place to shutdown all tasks.

I think the goal of the ReconfigureNotification::Shutdown is to allow more ad-hoc graceful shutdown logic than killing the task.

Makes sense.

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY!

akichidis added a commit that referenced this pull request Aug 18, 2022
Revert "Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic (#707)

This reverts commit 693e879.
huitseeker pushed a commit to huitseeker/narwhal that referenced this pull request Aug 25, 2022
Revert "Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic (MystenLabs#707)

This reverts commit 693e879.
huitseeker pushed a commit that referenced this pull request Aug 25, 2022
Revert "Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic (#707)

This reverts commit 693e879.
mwtian added a commit to mwtian/sui that referenced this pull request Sep 30, 2022
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
Revert "Use `TaskGroup` to ensure all primary / worker tasks are cancelled on error and panic (MystenLabs/narwhal#707)

This reverts commit 693e87979d6be32d29414f1639735760d55c0b21.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants