Skip to content
This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[executor] Route the executor requests without the worker_to_worker interface #706

Closed
huitseeker opened this issue Aug 7, 2022 · 1 comment · Fixed by #738
Closed

[executor] Route the executor requests without the worker_to_worker interface #706

huitseeker opened this issue Aug 7, 2022 · 1 comment · Fixed by #738
Assignees
Labels
bug Something isn't working

Comments

@huitseeker
Copy link
Contributor

huitseeker commented Aug 7, 2022

  • the Executor's BatchLoader is using the network to retrieve batches from its own worker,
  • that's too bad, because we have a BlockWaiter that allows a primary to retrieve payload from any particular block, feeding itself both on its own and on others' workers.
  • but the very problematic thing with this is that the BatchLoader is using the WorkerToWorkerClient to issue these batch requests, which means that primary is impersonating a worker, in order to contact its own worker, and using the publicly network interface of its worker (as opposed to the loopback primary_to_worker interface).
  • When the primary and worker are co-located, this round-trip through the exterior could be much more expensive than accessing a loopback interface.
  • More importantly it would mean that the response packets coming out of the co-located worker are in a contest with the other requests coming from the outside. This would explain encountering the error messages "Failed to receive batch reply from worker ".

Conclusion: I think we should excise the WorkerToWorker network from the BatchLoader. A way to do that seems to be to mimic some of the behavior of the BlockWaiter to retrieve solely batches (rather than blocks). /cc @akichidis for better ideas.

@huitseeker huitseeker added the bug Something isn't working label Aug 7, 2022
@huitseeker huitseeker changed the title [executor] Route the executor requests differently than through the worker_to_worker interface [executor] Route the executor requests without the worker_to_worker interface Aug 7, 2022
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 7, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 7, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
@akichidis
Copy link
Contributor

@huitseeker totally onboard in swapping the batch_loader with the block_waiter. By using block_waiter we also get on top transparent synchronization capabilities for missing batches. I don't believe we need to mimic block_waiter, we can just go ahead and use it. The executor is using the batch_loader to essentially request a "block" it self - meaning, all the batches associated with a certificate - and that's what we exactly do in the block_waiter.

huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 8, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 8, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit that referenced this issue Aug 8, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 8, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit that referenced this issue Aug 8, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
@akichidis akichidis self-assigned this Aug 9, 2022
huitseeker added a commit that referenced this issue Aug 12, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 14, 2022
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706).

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs#759
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 14, 2022
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706).
PR MystenLabs#472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs#759
huitseeker added a commit that referenced this issue Aug 15, 2022
)

We operate an executor with a bound on the concurrent number of messages (see #463, #559, #706).
PR #472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes #759
huitseeker added a commit to huitseeker/narwhal that referenced this issue Aug 16, 2022
…ystenLabs#763)

We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706).
PR MystenLabs#472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs#759
huitseeker added a commit that referenced this issue Aug 16, 2022
)

We operate an executor with a bound on the concurrent number of messages (see #463, #559, #706).
PR #472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes #759
mwtian pushed a commit to mwtian/sui that referenced this issue Sep 30, 2022
Context:
The `BatchLoader` is establishing a primary->worker communication using a public address, which is adding latency and competing with outbound traffic.
The proper fix us to use the `BlockWaiter`.

The issue:
We would like a mitigation to be deployed and effective sooner.

The fix:
We inspect the worker addresses used by the `BatchLoader`, and rewrite their hostname to localhost when it matches the local primary.
mwtian pushed a commit to mwtian/sui that referenced this issue Sep 30, 2022
…ystenLabs/narwhal#763)

We operate an executor with a bound on the concurrent number of messages (see MystenLabs/narwhal#463, MystenLabs/narwhal#559, MystenLabs/narwhal#706).
PR MystenLabs/narwhal#472 added logging for the bound being hit.

We expect the executors to operate for a long time at this limit (e.g. in recovery situation).
The spammy logging is not usfeful

This removes the logging of the concurrency bound being hit.
Fixes MystenLabs/narwhal#759
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants