[refactor / possible bug] worker synchronizer to reply back for available batches #837
Conversation
Thanks @akichidis, this looks great, and it does cover an important edge case! I would simply ask for a test on the PayloadReceiver checking that sending the same (digest, worker_id) pair twice results in no errors (and the pair being written to the store). It sounds trite, but it would make it explicit that we do expect idempotent behavior from this component.
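For illustration, here's a minimal stand-in for that test - a plain HashMap replaces the real async PayloadStore and the digest type is simplified, so this only sketches the idempotency property being asked for:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types; the actual store is async and
// keyed on (BatchDigest, WorkerId).
type Digest = [u8; 32];
type WorkerId = u32;
type PayloadToken = u8;

// Writing a (digest, worker_id) pair is a plain upsert: a second delivery of
// the same pair must not be treated as an error.
fn write(store: &mut HashMap<(Digest, WorkerId), PayloadToken>, key: (Digest, WorkerId)) {
    store.insert(key, 1);
}

#[test]
fn receive_same_digest_twice_is_idempotent() {
    let mut store = HashMap::new();
    let key = ([7u8; 32], 0);
    write(&mut store, key);
    write(&mut store, key); // second delivery of the same (digest, worker_id)
    assert_eq!(store.get(&key), Some(&1)); // pair is present, no error raised
}
```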
Nit: This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).
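If we did want the fancy variant, the write becomes a counter bump - again sketched on a HashMap stand-in rather than the real PayloadStore:

```rust
use std::collections::HashMap;

type Digest = [u8; 32];
type WorkerId = u32;
type PayloadToken = u8;

// Instead of overwriting, bump the token so the store records how many times
// the same (digest, worker_id) pair has been received.
fn record(store: &mut HashMap<(Digest, WorkerId), PayloadToken>, key: (Digest, WorkerId)) {
    *store.entry(key).or_insert(0) += 1;
}
```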
worker/src/synchronizer.rs
Outdated
debug!("Requesting sync for batch {digest}"); | ||
}, | ||
Ok(Some(_)) => { | ||
// The batch arrived in the meantime: no need to request it. | ||
available.insert(*digest); | ||
debug!("Digest {digest} already in store, nothing to sync"); |
This may end up being quite verbose, do we need this log at any level above trace?
Yeah, I know it will be verbose, but given the current behaviour in DevNet I err on the side of having extra logs to be able to trace any related issues - it was a bit hard to diagnose with the current level of logging. That being said, I'll change this to trace and we can enable the trace level in the worker - and switch back if we no longer need it.
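The change itself is just the macro level - a sketch with the digest passed in, since the real call sits inside the synchronizer loop:

```rust
use tracing::trace;

fn log_already_available(digest: &str) {
    // Demoted from debug! to trace!, so the per-digest line only shows up
    // when the trace level is explicitly enabled for the worker.
    trace!("Digest {digest} already in store, nothing to sync");
}
```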
worker/src/synchronizer.rs
Outdated
```rust
    // worker and message has been missed by primary).
    for digest in available {
        let message = WorkerPrimaryMessage::OthersBatch(digest, self.id);
        _ = self.tx_primary.send(message).await.tap(|err| {
```
The send returns a Result<(), SendError<T>>, so I think you want to use tap_err here, as this closure would otherwise be applied to the whole Result.
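A minimal sketch of the difference, with a plain Result standing in for the channel send:

```rust
use tap::{Tap, TapFallible};

fn main() {
    let res: Result<(), String> = Err("channel closed".into());

    // `tap` runs the closure on the whole value, so `err` here is a
    // `&Result<(), String>` - not the error itself:
    let _ = res.clone().tap(|err| println!("saw: {err:?}"));

    // `tap_err` runs the closure only on the `Err` branch and hands it
    // `&String`, the actual error:
    let _ = res.tap_err(|err| println!("send failed: {err}"));
}
```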
Good point!
Very good point @huitseeker! I've amended the PR with tests on the PayloadReceiver.
Force-pushed …revent from sending unnecessary sync requests from cc6578d to 4b141ff.
@huitseeker Ah yeah, this could create some skew, but to be honest I don't expect it to affect the grand scheme of things much here (the current logging can confirm that expectation). Also, as you said, the refactoring to make the authorities commit to batch rounds will introduce several changes, so we can have a look as part of that - or after it.
…able batches (#837) Refactored the worker synchronizer to reply back immediately when a batch is available, to ensure we don't end up in deadlock situations in the primary. Also refactored the code to not send sync requests to peer workers with an empty payload.
Section 1
Currently, when we request the worker synchronizer to synchronize a batch (PrimaryWorkerMessage::Synchronize), it will first check in storage whether the batch is already there. If it is, nothing is sent back; so if the original worker --> primary message has not been successfully processed for some reason, there is no mechanism to re-trigger it and there will be no way for the primary to get notified about the batch availability - although the batch is in the worker.
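A condensed, self-contained sketch of that pre-PR flow - `stored` stands in for the async store lookup and the returned string for the action taken:

```rust
// `stored` stands in for the result of the worker's async store lookup.
fn handle_synchronize(stored: Option<&[u8]>) -> &'static str {
    match stored {
        Some(_batch) => {
            // Batch already in storage. Nothing is sent back to the primary
            // here, so if the original worker -> primary notification was
            // lost, the primary can never learn the batch is available.
            "no-op (primary may wait forever)"
        }
        None => {
            // Batch genuinely missing: ask peer workers to sync it.
            "request sync from peer workers"
        }
    }
}

fn main() {
    assert_eq!(handle_synchronize(Some(&b"batch"[..])), "no-op (primary may wait forever)");
    assert_eq!(handle_synchronize(None), "request sync from peer workers");
}
```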
Section 2
The above case can create deadlock situations. In DevNet we have recently seen an issue with payload synchronization in a recent incident, which I believe is an artifact of the above behaviour. The scenario is the following:
We assume that we have two validators, (A) & (B).
1. Validator (A)'s worker creates a batch with digest kxmsamyzXZxRPc+C and forwards it to its primary.
2. When the corresponding block is requested, the block_waiter first checks if the payload is available via the block_synchronizer. The block_synchronizer sees that the header has been created by us (Validator A) and doesn't need to check the payload_store cache, but rather assumes the payload is already in the worker - so the block_waiter can just continue and fetch the batch - all OK.
3. Validator (B)'s worker creates a batch with the same digest kxmsamyzXZxRPc+C - yes, this currently happens in DevNet (since the same transactions are posted). The worker broadcasts the batch to the other workers and just waits for a quorum of them to reply back - once the quorum replies it even cancels the network handlers. Let's assume that because of that - or even some network glitch - Validator (A) hasn't received the batch.
4. kxmsamyzXZxRPc+C is still missing from the Validator (A) payload_store.
5. Validator (A) is then asked for a block whose header (created by Validator B) includes kxmsamyzXZxRPc+C. The block_synchronizer will check the payload_store for the batch availability (since it's not our header). The batch is missing, so it will dictate its worker to synchronize it from the other worker peers. The worker, however, already has this batch in its store (it created it in step 1), so - per Section 1 - it never replies back, and the primary keeps asking forever.

From DevNet, the logs below tell that story - in a quite abstract way, admittedly, but that's where I am oriented:
Validator A
We see that Validator A will indefinitely send, via the block_synchronizer, a request to its worker to sync the batch kxmsamyzXZxRPc+C.....
Validator B
We see that Validator B is managing to successfully progress and request the batch from its worker (see log line primary::block_waiter: Sending batch kxmsamyzXZxRPc+C request to worker id 0).
Resolution
This PR attempts to resolve the issue by making the worker synchronizer reply back to the primary immediately when a batch is found available in the store, without making any assumptions about other processes that might be running to notify the primary.
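A self-contained sketch of the two behavioural changes (this immediate reply-back, plus the empty-missing guard described just below) - a HashSet stands in for the store, and the returned vectors for the channel send and the network request:

```rust
use std::collections::HashSet;

type Digest = [u8; 32];

// Returns (digests reported back to the primary, digests requested from peers).
fn synchronize(requested: &[Digest], store: &HashSet<Digest>) -> (Vec<Digest>, Vec<Digest>) {
    let (available, missing): (Vec<Digest>, Vec<Digest>) =
        requested.iter().copied().partition(|d| store.contains(d));

    // Change 1: every already-available batch is reported back to the primary
    // immediately (WorkerPrimaryMessage::OthersBatch in the real code) instead
    // of assuming some other process already did so.
    //
    // Change 2: peer workers are only contacted when something is actually
    // missing; an empty `missing` set triggers no network request at all.
    (available, missing)
}

fn main() {
    let store: HashSet<Digest> = [[1u8; 32]].into_iter().collect();
    let (reply, sync) = synchronize(&[[1u8; 32], [2u8; 32]], &store);
    assert_eq!(reply.len(), 1); // [1; 32] is reported back immediately
    assert_eq!(sync.len(), 1);  // only [2; 32] triggers a peer sync request
}
```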
Also, an additional fix/improvement is made to ensure that the synchronizer will not send synchronization messages to other workers when the vector of missing batch ids is empty - we have also seen this happening in DevNet.

PS: A similar issue to the above can arise even when the worker restarts (so any pending requests from worker to primary are lost) or when, for some reason, the primary doesn't store the batch id in the payload_store.
TODO
Store the batch id in the payload_store in the primary irrespective of whether "we" or an "other" primary has created it. Will need to confirm we'll not break some other assumptions there - which I believe we won't.