
Shut down channels and tasks on PeerSet Drop #3078

Merged: 5 commits into main from peer-set-task-drop on Nov 23, 2021
Conversation

teor2345
Contributor

Motivation

We want to shut down the peer set properly. We also want to detect peer and background task errors during shutdown.

This is a partial fix for #1678 and #1351, before we modify the peer set in #2262.

Solution

  • when the peer set shuts down, stop background tasks, close channels, and drop peers
  • shut down the peer set when it is dropped
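
As a rough illustration of the drop-based shutdown in these two bullets, here is a minimal sketch; the struct, field names, and channel type are illustrative stand-ins, not the real zebra-network types (only the shut_down_tasks_and_channels method name comes from the PR's diff):

use tokio::{sync::mpsc, task::JoinHandle};

// Minimal stand-in for the peer set state that matters for shutdown.
struct PeerSetSketch {
    // Handles for the background tasks (listener, crawler, address book updater).
    guards: Vec<JoinHandle<()>>,
    // Sender half of an internal channel; dropping it closes the channel.
    demand_tx: Option<mpsc::Sender<()>>,
}

impl PeerSetSketch {
    // Stop background tasks, close channels, and drop peers.
    fn shut_down_tasks_and_channels(&mut self) {
        // Abort the background tasks so they stop promptly.
        for guard in &self.guards {
            guard.abort();
        }
        // Drop the sender, which closes the channel for its receiver.
        self.demand_tx = None;
        // Any ready/unready peer fields would also be cleared here,
        // dropping their connections.
    }
}

impl Drop for PeerSetSketch {
    fn drop(&mut self) {
        self.shut_down_tasks_and_channels();
    }
}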

Review

This is a fix from @jvff's review in #3072 (comment), but the changes were originally part of PR #3071.

Reviewer Checklist

  • Code implements Specs and Designs
  • Tests for Expected Behaviour
  • Tests for Errors

Follow Up Work

  • Keep implementing shutdown for the rest of Zebra
  • Give the peer set an explicit shutdown handle?

@teor2345 added labels on Nov 19, 2021: C-bug (Category: This is a bug), A-rust (Area: Updates to Rust code), P-Low, I-panic (Zebra panics with an internal error message), I-hang (A Zebra component stops responding to requests)
@teor2345 teor2345 added this to the 2021 Sprint 23 milestone Nov 19, 2021
@teor2345 teor2345 requested a review from jvff November 19, 2021 01:03
@teor2345 teor2345 self-assigned this Nov 19, 2021
@teor2345
Contributor Author

teor2345 commented Nov 19, 2021

@jvff asked:

Question: Why aren't these background tasks handled separately? Wouldn't it make more sense to have this service have a Request::Stop or a cancel_signal: oneshot::Receiver<()>?

#3072 (comment)

How did you want to handle the background tasks separately?
(I didn't want to make too many changes in this PR.)

The listener, crawler, and address book updater background tasks are required for the PeerSet to operate correctly. So if they shut down, the PeerSet also needs to shut down.

I'm also not sure if we need to do anything else to shut down.

On startup, zebra_network::init returns:

  • a buffered peer set service
  • an Arc<Mutex<AddressBook>>

Once the last peer set clone is dropped, all the zebra-network tasks, channels, and peers shut down. Then, when the last address book clone is dropped, the address book gets dropped.

So I'm not sure if a separate shutdown channel is needed?
(And I'd like to avoid using a Request::Stop, because channel ownership is more flexible than buffered service ownership.)

When we do full graceful shutdown in #1678, we can add channels as needed.

@jvff
Contributor

jvff commented Nov 19, 2021

How did you want to handle the background tasks separately? (I didn't want to make too many changes in this PR.)

...

So I'm not sure if a separate shutdown channel is needed? (And I'd like to avoid using a Request::Stop, because channel ownership is more flexible than buffered service ownership.)

I'm not exactly sure, but I guess the core of what I'm worried about is that PeerSet seems to be doing a lot of things. I think currently it:

  • tracks ready and unready peers
  • routes requests to ready peers
  • handles load balancing (with power of two choices)
  • handles background tasks

I think the zebra_network::init API is good (ideally we could avoid needing to return the address book, but that's probably a separate thing and lower priority IMHO). But I think that the internal organization of the service could be split up into different types that handle different things.

Regarding handling background/startup tasks, I think the goals are:

  1. to stop the network-specific tasks if the PeerSet service is stopped
  2. to stop the PeerService if one of the background tasks stops
  3. to stop the PeerService if one of the startup tasks errors

In order to achieve those goals, when focusing on the PeerService, I think it only needs to:

  • have some sort of signal to shut down when 2 or 3 happens
  • report when it stops in order to achieve 1

I'm not sure how to handle this in this PR. It might make sense not to change the PR right now and to think more about this in a broader "PeerSet refactor" issue. Or we could see if there are any changes that are simple and "obvious" (as in, likely not needed to change regardless of any refactor idea discussions) to do in this PR that work toward such a refactor.

One idea for the latter might be to create BackgroundTask and/or StartUpTask helper types, which contain:

  • a JoinHandle for the wrapped task
  • a &'static str with the name of the task
  • a broadcast::Sender or an Arc<watch::Sender> that sends something when the task finishes (or errors, in case of a StartUpTask)

And such types would have a Drop implementation to send the signal and call JoinHandle::abort.
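
For concreteness, a rough sketch of that helper, assuming tokio's watch channel variant (the BackgroundTask name and its fields are hypothetical, not code from this PR):

use std::sync::Arc;
use tokio::{sync::watch, task::JoinHandle};

// Hypothetical helper wrapping a named background task.
struct BackgroundTask {
    // The wrapped task, aborted when this helper is dropped.
    handle: JoinHandle<()>,
    // Task name, for reporting which task stopped.
    name: &'static str,
    // Sends a signal when the task finishes or the helper is dropped.
    finished: Arc<watch::Sender<bool>>,
}

impl Drop for BackgroundTask {
    fn drop(&mut self) {
        // Tell any watchers that this task is stopping...
        let _ = self.finished.send(true);
        // ...then abort the wrapped task.
        self.handle.abort();
    }
}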

Anyway, I'm not sure if I'm helping or just brain dumping a wild idea 😓

jvff previously approved these changes Nov 19, 2021
Comment on lines 242 to 244
self.shut_down_tasks_and_channels();

exit_error

Optional idea: I'm worried that poll_background_errors might become hard to read, because at a quick glance it isn't obvious that the function both polls the tasks and shuts down the tasks and channels. I'd suggest refactoring this so that the normal flow is more evident. Maybe splitting this method into more methods would help:

fn poll_background_errors(&mut self, cx: &mut Context) -> Result<(), BoxError> {
    self.receive_guards_if_needed();

    match Pin::new(&mut self.guards).poll_next(cx) {
        Poll::Pending => Ok(()),
        Poll::Ready(maybe_task_result) => self.shutdown_with_result(
            maybe_task_result.unwrap_or_else(|| {
                Err("all peer set background tasks have exited".into())
            }),
        ),
    }
}

@teor2345 (Author)

Fixed in e15e1b1, along with checking the channel during shutdown.

@teor2345
Contributor Author

teor2345 commented Nov 21, 2021

Just a reminder that the goal of this PR is:

We want to shut down the peer set properly. We also want to detect peer and background task errors during shutdown.

Anything else is out of scope for the NU5 mainnet activation sprints, unless it's important for doing #2262.

I think the zebra_network::init API is good (ideally we could avoid needing to return the address book, but that's probably a separate thing and lower priority IMHO).

The inbound service uses the address book, so we would either need to:

Feel free to open a ticket for this change if you'd like, and we can schedule it in a future sprint.

But I think that the internal organization of the service could be split up into different types to handle different things.

Regarding handling background/startup tasks, I think the goals are:

  1. to stop the network-specific tasks if the PeerSet service is stopped
  2. to stop the PeerService if one of the background tasks stops
  3. to stop the PeerService if one of the startup tasks errors

In order to achieve those goals, when focusing on the PeerService, I think it only needs to:

  • have some sort of signal to shut down when 2 or 3 happens
  • report when it stops in order to achieve 1

We do this already:

  1. all tasks that use the network should stop if we return an error from PeerSet::poll_ready. This is how tower Services are meant to work (see the sketch below).
  2. the PeerSet polls the background tasks.
  3. if a startup task errors, or any ongoing task exits, we return from StartCommand::run, which should make the other tasks exit, and eventually drop the PeerSet.

There might be bugs or missing parts of this implementation, but they're covered by ticket #1678.
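
As a sketch of point 1, assuming the standard tower ServiceExt API (the drive_network function and its names are illustrative, not Zebra code):

use tower::{Service, ServiceExt};

// Illustrative caller: once poll_ready returns an error, ready() yields
// that error, and the calling task can stop using the network and exit.
async fn drive_network<S, Req>(mut network: S, request: Req) -> Result<S::Response, S::Error>
where
    S: Service<Req>,
{
    // ready() drives poll_ready; an Err here means the peer set
    // (or the Buffer wrapping it) has shut down.
    let ready_service = network.ready().await?;
    ready_service.call(request).await
}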

I'm not sure how to handle this in this PR. It might make sense to not change the PR right now and think more about this in a broader "PeerSet refactor" issue.

I'd like to leave this PR where it is, unless there is an obvious major bug that blocks #2262.

Or we could see if there are any changes that are simple and "obvious" (as in, likely not needed to change regardless of any refactor idea discussions) to do in this PR that strive toward such refactor.

One idea for the latter might be to create BackgroundTask and/or StartUpTask helper types ...

Can you open a ticket for this change, and we can schedule it in a future sprint?

@teor2345 teor2345 requested a review from jvff November 21, 2021 21:54
@teor2345 teor2345 enabled auto-merge (squash) November 21, 2021 21:58
@teor2345
Contributor Author

Bumping this to medium priority, because it's blocking #2262.

@jvff left a comment

Everything looks good. I just added two small optional suggestions.

Comment on lines +131 to +132
/// Stores gossiped inventory from connected peers.
/// Used to route inventory requests to peers that are likely to have it.

Nit: Not sure if this was on purpose, but maybe separate the second sentence into the details section?

Suggested change
- /// Stores gossiped inventory from connected peers.
- /// Used to route inventory requests to peers that are likely to have it.
+ /// Stores gossiped inventory from connected peers.
+ ///
+ /// Used to route inventory requests to peers that are likely to have it.

@teor2345 (Author)

Fixed in 7849ca6 in PR #3090.

Comment on lines +295 to +297
for handle in handles {
    self.guards.push(handle);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional suggestion to make it shorter:

Suggested change
- for handle in handles {
-     self.guards.push(handle);
- }
+ self.guards.extend(handles);

@teor2345 (Author)

Fixed in 7849ca6 in PR #3090.

@teor2345 teor2345 merged commit b39f4ca into main Nov 23, 2021
@teor2345 teor2345 deleted the peer-set-task-drop branch November 23, 2021 01:29
@teor2345 teor2345 restored the peer-set-task-drop branch November 23, 2021 02:33
@teor2345 teor2345 deleted the peer-set-task-drop branch November 23, 2021 02:34