[executor] Route the executor requests without the `worker_to_worker` interface #706

huitseeker · 2022-08-07T00:24:16Z

the Executor's BatchLoader is using the network to retrieve batches from its own worker,
that's too bad, because we have a BlockWaiter that allows a primary to retrieve payload from any particular block, feeding itself both on its own and on others' workers.
but the very problematic thing with this is that the BatchLoader is using the WorkerToWorkerClient to issue these batch requests, which means that primary is impersonating a worker, in order to contact its own worker, and using the publicly network interface of its worker (as opposed to the loopback primary_to_worker interface).
When the primary and worker are co-located, this round-trip through the exterior could be much more expensive than accessing a loopback interface.
More importantly it would mean that the response packets coming out of the co-located worker are in a contest with the other requests coming from the outside. This would explain encountering the error messages "Failed to receive batch reply from worker ".

Conclusion: I think we should excise the WorkerToWorker network from the BatchLoader. A way to do that seems to be to mimic some of the behavior of the BlockWaiter to retrieve solely batches (rather than blocks). /cc @akichidis for better ideas.

The text was updated successfully, but these errors were encountered:

Context: The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic. The proper fix us to use the `BlockWaiter`. The issue: We would like a mitigation to be deployed and effective sooner. The fix: We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.

akichidis · 2022-08-08T12:40:50Z

@huitseeker totally onboard in swapping the batch_loader with the block_waiter. By using block_waiter we also get on top transparent synchronization capabilities for missing batches. I don't believe we need to mimic block_waiter, we can just go ahead and use it. The executor is using the batch_loader to essentially request a "block" it self - meaning, all the batches associated with a certificate - and that's what we exactly do in the block_waiter.

Context: The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic. The proper fix us to use the `BlockWaiter`. The issue: We would like a mitigation to be deployed and effective sooner. The fix: We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.

We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706). We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes MystenLabs#759

We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706). PR MystenLabs#472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes MystenLabs#759

) We operate an executor with a bound on the concurrent number of messages (see #463, #559, #706). PR #472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes #759

…ystenLabs#763) We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706). PR MystenLabs#472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes MystenLabs#759

) We operate an executor with a bound on the concurrent number of messages (see #463, #559, #706). PR #472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes #759

Context: The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic. The proper fix us to use the `BlockWaiter`. The issue: We would like a mitigation to be deployed and effective sooner. The fix: We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.

…ystenLabs/narwhal#763) We operate an executor with a bound on the concurrent number of messages (see MystenLabs/narwhal#463, MystenLabs/narwhal#559, MystenLabs/narwhal#706). PR MystenLabs/narwhal#472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in recovery situation). The spammy logging is not usfeful This removes the logging of the concurrency bound being hit. Fixes MystenLabs/narwhal#759

huitseeker added the bug Something isn't working label Aug 7, 2022

huitseeker changed the title ~~[executor] Route the executor requests differently than through the worker_to_worker interface~~ [executor] Route the executor requests without the worker_to_worker interface Aug 7, 2022

huitseeker mentioned this issue Aug 7, 2022

fix: Mitigate issue #706 #715

Merged

akichidis self-assigned this Aug 9, 2022

akichidis mentioned this issue Aug 11, 2022

[refactor] use block_waiter instead of batch_loader #738

Merged

huitseeker mentioned this issue Aug 14, 2022

fix: Avoid spammy logging on hitting the network's BoundedExecutor #763

Merged

akichidis closed this as completed in #738 Aug 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[executor] Route the executor requests without the `worker_to_worker` interface #706

[executor] Route the executor requests without the `worker_to_worker` interface #706

huitseeker commented Aug 7, 2022 •

edited

Loading

akichidis commented Aug 8, 2022

[executor] Route the executor requests without the worker_to_worker interface #706

[executor] Route the executor requests without the worker_to_worker interface #706

Comments

huitseeker commented Aug 7, 2022 • edited Loading

akichidis commented Aug 8, 2022

[executor] Route the executor requests without the `worker_to_worker` interface #706

[executor] Route the executor requests without the `worker_to_worker` interface #706

huitseeker commented Aug 7, 2022 •

edited

Loading