[network] Don't back off forever on the Semaphore #559
… the semaphore
Our BoundedExecutor used to queue messages with unbounded exponential backoff on the semaphore. Now the backoff of those retries occurs outside of the semaphore; the attempts only queue on it when it is their time to try again.
let message_send = move || {
    let mut client = client.clone();
    let message = message.clone();
surely this is not deep-copying the entire payload every time... right?
As of today the message type is essentially a Bytes struct, which has a very efficient clone operation (it's essentially just an atomic increment).
The PR does not change the cloning behavior, but FWIW the client also has a cheap cloning operation.
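To illustrate the point about cheap clones, here is a minimal sketch using only the standard library. It assumes a hypothetical `Message` wrapper around `Arc<[u8]>` as a stand-in for the real message type; the actual type in the PR is described as Bytes-based, so this only demonstrates the general idea that cloning a reference-counted buffer bumps a counter instead of copying the payload.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the message type: a reference-counted byte buffer.
/// (The real type is Bytes-based; `Arc<[u8]>` is used here as an assumption.)
#[derive(Clone)]
struct Message {
    payload: Arc<[u8]>,
}

/// Returns true when both messages point at the same heap allocation.
fn shares_buffer(a: &Message, b: &Message) -> bool {
    Arc::ptr_eq(&a.payload, &b.payload)
}

fn main() {
    // A 1 MiB payload, allocated once.
    let original = Message { payload: Arc::from(vec![0u8; 1 << 20]) };

    // Cloning increments the reference count; the bytes are not copied.
    let copy = original.clone();

    assert!(shares_buffer(&original, &copy));
    assert_eq!(Arc::strong_count(&original.payload), 2);
    println!("clones share one buffer");
}
```

Under this assumption, cloning inside the retry closure costs the same regardless of payload size.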
network/src/worker.rs
Outdated
        response.expect("we retry forever so this shouldn't fail");
    }),
)
.spawn_with_retries(self.retry_config, message_send)
This does create more tasks right?
Overall, yes. But it does not create more concurrent network messages.
3faaa61 to 495cdb1
d221f9f to 5a4611a
5a4611a to 95f6892
… (#763)
We operate an executor with a bound on the number of concurrent messages (see #463, #559, #706). PR #472 added logging for the bound being hit. We expect the executors to operate at this limit for a long time (e.g. in a recovery situation), so the spammy logging is not useful. This removes the logging of the concurrency bound being hit. Fixes #759
Context
#463 introduced a BoundedExecutor to limit the concurrent tasks created for sending messages. However, as a protocol requirement, some of those tasks are on an unbounded exponential backoff, by which they are retried indefinitely (until cancelled).
The issue
The {primary, worker} × {broadcast, send} functions enqueue instances of {Primary, Worker}::send_message on the BoundedExecutor, and send_message is an exponential backoff with infinite retries. These tasks will hence hold a semaphore ticket potentially forever.
This change
The present change makes those tasks hold a semaphore permit only while they are actively sending a message; they re-queue on the semaphore only once their backoff wait, spent outside of the semaphore, is over. This frees up permits for other network tasks and helps the network make progress.
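The idea can be sketched with the standard library alone. This is not the PR's implementation (which is async and goes through the BoundedExecutor); the `Semaphore` type, `send_with_retries`, `run_demo`, and the delay values below are all hypothetical, and threads stand in for tasks. The sketch only shows the ordering that matters: acquire a permit, attempt the send, release the permit, and only then sleep for the backoff.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (a hypothetical stand-in for the executor's own).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Retries `attempt` until it succeeds. A permit is held only while `attempt`
/// runs; the backoff sleep happens after the permit has been released.
/// Returns the number of attempts made.
fn send_with_retries<F: FnMut() -> bool>(sem: &Semaphore, mut attempt: F) -> u32 {
    let mut delay = Duration::from_millis(1);
    let max_delay = Duration::from_millis(8); // hypothetical cap on a single wait
    let mut tries = 0;
    loop {
        sem.acquire();
        let ok = attempt();
        sem.release(); // free the permit before backing off
        tries += 1;
        if ok {
            return tries;
        }
        thread::sleep(delay); // backoff outside of the semaphore
        delay = (delay * 2).min(max_delay);
    }
}

/// Runs `senders` retrying senders against `bound` permits; each sender fails
/// `failures` times before succeeding.
/// Returns (peak concurrent attempts, total attempts).
fn run_demo(bound: usize, senders: usize, failures: u32) -> (usize, u32) {
    let sem = Arc::new(Semaphore::new(bound));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..senders)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                let mut remaining = failures;
                send_with_retries(&sem, || {
                    // Track how many attempts hold a permit at the same time.
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(1)); // simulated network send
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                    if remaining == 0 { true } else { remaining -= 1; false }
                })
            })
        })
        .collect();
    let total: u32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (peak.load(Ordering::SeqCst), total)
}

fn main() {
    let (peak, total) = run_demo(2, 6, 2);
    println!("peak concurrent sends: {}, total attempts: {}", peak, total);
}
```

In this sketch there are more senders than permits, yet the number of simultaneous attempts never exceeds the bound, and a sender that is waiting out its backoff does not block the others from using the permits.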
Future work
The cancellation of our "reliable" network sends is not always as prompt as it could be. Fixing this is the next PR.