This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[network] Don't back off forever on the Semaphore #559

Merged
merged 7 commits into MystenLabs:main from network_refactor on Jul 22, 2022

Conversation

@huitseeker (Contributor) commented Jul 22, 2022

Context

#463 introduced a BoundedExecutor to limit the concurrent tasks created for sending messages.

However, as a protocol requirement, some of those tasks perform an unbounded exponential backoff: they are retried indefinitely (until cancelled).

The issue

The {primary, worker} × {broadcast, send} functions enqueue instances of {Primary, Worker}::send_message on the BoundedExecutor, and send_message retries with exponential backoff indefinitely. These tasks can therefore hold a semaphore permit forever.

This change

The present change makes those tasks hold a semaphore permit only while they are actively sending a message, re-acquiring a permit only once their backoff wait (which now happens outside the semaphore) is over. This frees up permits so that other network tasks can make progress.
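In sketch form, the new behavior looks roughly like this. This is a minimal illustration assuming tokio's Semaphore; send_with_retries and the delay values are illustrative, not the PR's actual code:

use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::sleep};

// Retry `attempt` until it succeeds, holding a semaphore permit only while
// an attempt is in flight; the exponential-backoff sleep happens with no
// permit held, so the slot is free for other network tasks.
async fn send_with_retries<F, Fut, T, E>(semaphore: Arc<Semaphore>, mut attempt: F) -> T
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(100); // illustrative initial delay
    loop {
        // Hold a permit only for the duration of a single attempt.
        let permit = semaphore.acquire().await.expect("semaphore is never closed");
        let result = attempt().await;
        drop(permit); // release the slot before backing off

        match result {
            Ok(value) => return value,
            Err(_) => {
                // Back off outside the semaphore, then re-queue for a permit.
                sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}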

Future work

The cancellation of our "reliable" network sends is not always as prompt as it could be. Fixing this is the next PR.

… the semaphore

Our BoundedExecutor used to queue messages with unbounded exponential backoff while holding the semaphore.
Now the backoff of those retries occurs outside the semaphore; the attempts re-queue only when it is time to try again.
@huitseeker huitseeker requested a review from bmwill as a code owner July 22, 2022 02:14
@huitseeker huitseeker changed the title from "[network] Don't back off on the Semaphore" to "[network] Don't back off forever on the Semaphore" on Jul 22, 2022

let message_send = move || {
    let mut client = client.clone();
    let message = message.clone();
Contributor

surely this is not deep-copying the entire payload every time... right?

Contributor

As of today the message type is essentially a Bytes struct, which has a very efficient clone operation (it's essentially just an atomic increment).
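For illustration, here is why that clone is cheap. A minimal sketch assuming the bytes crate's Bytes type; this is not code from the PR:

use bytes::Bytes;

fn main() {
    // Bytes is a reference-counted view over a shared buffer: cloning it
    // bumps an atomic counter instead of copying the payload.
    let message = Bytes::from(vec![0u8; 1024 * 1024]); // 1 MiB payload
    let for_retry = message.clone(); // O(1): no byte copy
    assert_eq!(message.as_ptr(), for_retry.as_ptr()); // same backing buffer
}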

Contributor Author

The PR does not change the cloning behavior, but FWIW the client also has a cheap cloning operation.

network/src/bounded_executor.rs (review thread resolved; outdated)
    response.expect("we retry forever so this shouldn't fail");
}),
)
.spawn_with_retries(self.retry_config, message_send)
Contributor

This does create more tasks, right?

Contributor Author

Overall, yes. But it does not create more concurrent network messages.
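A toy demonstration of that trade-off, assuming tokio (the bound of 4, the sleep, and the counter are all illustrative; this is not the PR's BoundedExecutor): many retry tasks can exist at once, but the semaphore caps how many attempts are actually in flight.

use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(4)); // at most 4 concurrent sends
    let in_flight = Arc::new(AtomicUsize::new(0));

    let mut handles = Vec::new();
    for _ in 0..100 {
        // 100 spawned tasks, but never more than 4 in-flight attempts.
        let (sem, counter) = (semaphore.clone(), in_flight.clone());
        handles.push(tokio::spawn(async move {
            let _permit = sem.acquire_owned().await.expect("semaphore open");
            let now = counter.fetch_add(1, Ordering::SeqCst) + 1;
            assert!(now <= 4); // the semaphore bound is never exceeded
            tokio::time::sleep(std::time::Duration::from_millis(5)).await; // fake send
            counter.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for handle in handles {
        handle.await.expect("task panicked");
    }
}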

@huitseeker huitseeker merged commit 7e49fa4 into MystenLabs:main Jul 22, 2022
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 14, 2022
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 14, 2022
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706).
PR MystenLabs#472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g., in a recovery situation), so the spammy logging is not useful.

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs#759
huitseeker added a commit that referenced this pull request Aug 15, 2022
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 16, 2022
huitseeker added a commit that referenced this pull request Aug 16, 2022
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
@huitseeker huitseeker deleted the network_refactor branch January 21, 2023 13:37