
Shut down channels and tasks on PeerSet Drop #3078

Merged: 5 commits into main from peer-set-task-drop on Nov 23, 2021
Conversation

teor2345
Contributor

Motivation

We want to shut down the peer set properly. We also want to detect peer and background task errors during shutdown.

This is a partial fix for #1678 and #1351, before we modify the peer set in #2262.

Solution

  • when the peer set shuts down, stop background tasks, close channels, and drop peers
  • shut down the peer set when it is dropped
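
As a rough illustration of the drop-based shutdown in these two bullets, here is a minimal sketch; the struct, field names, and channel type are illustrative stand-ins, not the real zebra-network types (only the shut_down_tasks_and_channels method name comes from the PR's diff):

use tokio::{sync::mpsc, task::JoinHandle};

// Minimal stand-in for the peer set state that matters for shutdown.
struct PeerSetSketch {
    // Handles for the background tasks (listener, crawler, address book updater).
    guards: Vec<JoinHandle<()>>,
    // Sender half of an internal channel; dropping it closes the channel.
    demand_tx: Option<mpsc::Sender<()>>,
}

impl PeerSetSketch {
    // Stop background tasks, close channels, and drop peers.
    fn shut_down_tasks_and_channels(&mut self) {
        // Abort the background tasks so they stop promptly.
        for guard in &self.guards {
            guard.abort();
        }
        // Drop the sender, which closes the channel for its receiver.
        self.demand_tx = None;
        // Any ready/unready peer fields would also be cleared here,
        // dropping their connections.
    }
}

impl Drop for PeerSetSketch {
    fn drop(&mut self) {
        self.shut_down_tasks_and_channels();
    }
}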

Review

This is a fix from @jvff's review in #3072 (comment), but the changes were originally part of PR #3071.

Reviewer Checklist

  • Code implements Specs and Designs
  • Tests for Expected Behaviour
  • Tests for Errors

Follow Up Work

  • Keep implementing shutdown for the rest of Zebra
  • Give the peer set an explicit shutdown handle?

@teor2345 added labels on Nov 19, 2021: C-bug (Category: This is a bug), A-rust (Area: Updates to Rust code), P-Low, I-panic (Zebra panics with an internal error message), I-hang (A Zebra component stops responding to requests)
@teor2345 teor2345 added this to the 2021 Sprint 23 milestone Nov 19, 2021
@teor2345 teor2345 requested a review from jvff November 19, 2021 01:03
@teor2345 teor2345 self-assigned this Nov 19, 2021
@teor2345
Contributor Author

teor2345 commented Nov 19, 2021

@jvff asked:

Question: Why aren't these background tasks handled separately? Wouldn't it make more sense to have this service have a Request::Stop or a cancel_signal: oneshot::Receiver<()>?

#3072 (comment)

How did you want to handle the background tasks separately?
(I didn't want to make too many changes in this PR.)

The listener, crawler, and address book updater background tasks are required for the PeerSet to operate correctly. So if they shut down, the PeerSet also needs to shut down.

I'm also not sure if we need to do anything else to shut down.

On startup, zebra_network::init returns:

  • a buffered peer set service
  • an Arc<Mutex<AddressBook>>

Once the last peer set clone is dropped, all the zebra-network tasks, channels, and peers shut down. Then, when the last address book clone is dropped, the address book gets dropped.

So I'm not sure if a separate shutdown channel is needed?
(And I'd like to avoid using a Request::Stop, because channel ownership is more flexible than buffered service ownership.)

When we do full graceful shutdown in #1678, we can add channels as needed.

@jvff
Contributor

jvff commented Nov 19, 2021

How did you want to handle the background tasks separately? (I didn't want to make too many changes in this PR.)

...

So I'm not sure if a separate shutdown channel is needed? (And I'd like to avoid using a Request::Stop, because channel ownership is more flexible than buffered service ownership.)

I'm not exactly sure, but I guess the core of what I'm worried about is that PeerSet seems to be doing a lot of things. I think currently it:

  • tracks ready and unready peers
  • routes requests to ready peers
  • handles load balancing (with power of two choices)
  • handles background tasks

I think the zebra_network::init API is good (ideally we could avoid needing to return the address book, but that's probably a separate thing and lower priority IMHO). But I think that the internal organization of the service could be split up into different types that handle different things.

Regarding handling background/startup tasks, I think the goals are:

  1. to stop the network-specific tasks if the PeerSet service is stopped
  2. to stop the PeerService if one of the background tasks stops
  3. to stop the PeerService if one of the startup tasks errors

In order to achieve those goals, when focusing on the PeerService, I think it only needs to:

  • have some sort of signal to shut down when 2 or 3 happens
  • report when it stops in order to achieve 1

I'm not sure how to handle this in this PR. It might make sense not to change the PR right now and to think more about this in a broader "PeerSet refactor" issue. Or we could see if there are any changes that are simple and "obvious" (as in, likely not needed to change regardless of any refactor idea discussions) to do in this PR that work toward such a refactor.

One idea for the latter might be to create BackgroundTask and/or StartUpTask helper types, which contain:

  • a JoinHandle for the wrapped task
  • a &'static str with the name of the task
  • a broadcast::Sender or an Arc<watch::Sender> that sends something when the task finishes (or errors, in case of a StartUpTask)

And such types would have a Drop implementation to send the signal and call JoinHandle::abort.
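
For concreteness, a rough sketch of that helper, assuming tokio's watch channel variant (the BackgroundTask name and its fields are hypothetical, not code from this PR):

use std::sync::Arc;
use tokio::{sync::watch, task::JoinHandle};

// Hypothetical helper wrapping a named background task.
struct BackgroundTask {
    // The wrapped task, aborted when this helper is dropped.
    handle: JoinHandle<()>,
    // Task name, for reporting which task stopped.
    name: &'static str,
    // Sends a signal when the task finishes or the helper is dropped.
    finished: Arc<watch::Sender<bool>>,
}

impl Drop for BackgroundTask {
    fn drop(&mut self) {
        // Tell any watchers that this task is stopping...
        let _ = self.finished.send(true);
        // ...then abort the wrapped task.
        self.handle.abort();
    }
}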

Anyway, I'm not sure if I'm helping or just brain dumping a wild idea 😓

jvff previously approved these changes Nov 19, 2021
Comment on lines 242 to 244
self.shut_down_tasks_and_channels();

exit_error

Optional idea: I'm worried that poll_background_errors might become hard to read, because at a quick glance it isn't obvious that the function both polls the tasks and shuts down the tasks and channels. I'd suggest refactoring this so that the normal flow is more evident. Maybe splitting this method into more methods would help:

fn poll_background_errors(&mut self, cx: &mut Context) -> Result<(), BoxError> {
    self.receive_guards_if_needed();

    match Pin::new(&mut self.guards).poll_next(cx) {
        Poll::Pending => Ok(()),
        Poll::Ready(maybe_task_result) => self.shutdown_with_result(
            maybe_task_result.unwrap_or_else(|| {
                Err("all peer set background tasks have exited".into())
            }),
        ),
    }
}

@teor2345 (Author)

Fixed in e15e1b1, along with checking the channel during shutdown.

@teor2345
Contributor Author

teor2345 commented Nov 21, 2021

Just a reminder that the goal of this PR is:

We want to shut down the peer set properly. We also want to detect peer and background task errors during shutdown.

Anything else is out of scope for the NU5 mainnet activation sprints, unless it's important for doing #2262.

I think the zebra_network::init API is good (ideally we could avoid needing to return the address book, but that's probably a separate thing and lower priority IMHO).

The inbound service uses the address book, so we would either need to:

Feel free to open a ticket for this change if you'd like, and we can schedule it in a future sprint.

But I think that the internal organization of the service could be split up into different types to handle different things.

Regarding handling background/startup tasks, I think the goals are:

  1. to stop the network-specific tasks if the PeerSet service is stopped
  2. to stop the PeerService if one of the background tasks stops
  3. to stop the PeerService if one of the startup tasks errors

In order to achieve those goals, when focusing on the PeerService, I think it only needs to:

  • have some sort of signal to shut down when 2 or 3 happens
  • report when it stops in order to achieve 1

We do this already:

  1. all tasks that use the network should stop if we return an error from PeerSet::poll_ready. This is how tower Services are meant to work (see the sketch below).
  2. the PeerSet polls the background tasks.
  3. if a startup task errors, or any ongoing task exits, we return from StartCommand::run, which should make the other tasks exit, and eventually drop the PeerSet.

There might be bugs or missing parts of this implementation, but they're covered by ticket #1678.
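
As a sketch of point 1, assuming the standard tower ServiceExt API (the drive_network function and its names are illustrative, not Zebra code):

use tower::{Service, ServiceExt};

// Illustrative caller: once poll_ready returns an error, ready() yields
// that error, and the calling task can stop using the network and exit.
async fn drive_network<S, Req>(mut network: S, request: Req) -> Result<S::Response, S::Error>
where
    S: Service<Req>,
{
    // ready() drives poll_ready; an Err here means the peer set
    // (or the Buffer wrapping it) has shut down.
    let ready_service = network.ready().await?;
    ready_service.call(request).await
}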

I'm not sure how to handle this in this PR. It might make sense to not change the PR right now and think more about this in a broader "PeerSet refactor" issue.

I'd like to leave this PR where it is, unless there is an obvious major bug that blocks #2262.

Or we could see if there are any changes that are simple and "obvious" (as in, likely not needed to change regardless of any refactor idea discussions) to do in this PR that strive toward such refactor.

One idea for the latter might be to create BackgroundTask and/or StartUpTask helper types ...

Can you open a ticket for this change, and we can schedule it in a future sprint?

@teor2345 teor2345 requested a review from jvff November 21, 2021 21:54
@teor2345 teor2345 enabled auto-merge (squash) November 21, 2021 21:58
@teor2345
Contributor Author

Bumping this to medium priority, because it's blocking #2262.

@jvff left a comment

Everything looks good. I just added two small optional suggestions.

Comment on lines +131 to +132
/// Stores gossiped inventory from connected peers.
/// Used to route inventory requests to peers that are likely to have it.

Nit: Not sure if this was on purpose, but maybe separate the second sentence into the details section?

Suggested change
- /// Stores gossiped inventory from connected peers.
- /// Used to route inventory requests to peers that are likely to have it.
+ /// Stores gossiped inventory from connected peers.
+ ///
+ /// Used to route inventory requests to peers that are likely to have it.

@teor2345 (Author)

Fixed in 7849ca6 in PR #3090.

Comment on lines +295 to +297
for handle in handles {
    self.guards.push(handle);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional suggestion to make it shorter:

Suggested change
- for handle in handles {
-     self.guards.push(handle);
- }
+ self.guards.extend(handles);

@teor2345 (Author)

Fixed in 7849ca6 in PR #3090.

@teor2345 teor2345 merged commit b39f4ca into main Nov 23, 2021
@teor2345 teor2345 deleted the peer-set-task-drop branch November 23, 2021 01:29
@teor2345 teor2345 restored the peer-set-task-drop branch November 23, 2021 02:33
@teor2345 teor2345 deleted the peer-set-task-drop branch November 23, 2021 02:34