Some BlockFetch decline decisions are not revisited after completeFetchBatch #1147

Closed · nfrisby opened this issue Oct 18, 2019 · 3 comments · Fixed by #1705

Labels: bug (Something isn't working), byron (Required for a Byron mainnet: replace the old core nodes with cardano-node)

nfrisby commented Oct 18, 2019

While developing infra-slot network latencies (PR #1131), I'm seeing a violation of our current variant of Chain Growth. This Issue is complete when that test case is no longer a failure.

The current repro, which might not work directly on future master, is as follows in Test.Dynamic.Praos. (If it unexpectedly does not fail on a branch with latencies, my only suggestion is to let the latency seed vary and hope it discovers the failure.)

            testPraos' TestConfig
              { numCoreNodes
              , numSlots
              , nodeJoinPlan = NodeJoinPlan (Map.fromList [(CoreNodeId 0,SlotNo {unSlotNo = 24}),(CoreNodeId 1,SlotNo {unSlotNo = 29}),(CoreNodeId 2,SlotNo {unSlotNo = 29})])
              , nodeTopology = NodeTopology (Map.fromList [(CoreNodeId 0,Set.fromList []),(CoreNodeId 1,Set.fromList [CoreNodeId 0]),(CoreNodeId 2,Set.fromList [CoreNodeId 1])])
              , latencySeed = InjectLatencies (seedSMGen 9511500888644978603 10563634114029886209)
              }
                Seed {getSeed = (10580735904357354200,1771287816820322801,4449307331544544775,3207572460394768516,9310143438266696584)}

The specific failure happens because network latency introduces a non-negligible duration (never before seen in the test suite) between MsgBlock and MsgBatchDone. A particular node makes a FetchDeclineConcurrencyLimit decision (due to peerFetchReqsInFlight = 1) after receiving MsgBlock but before receiving MsgBatchDone (which resets peerFetchReqsInFlight to 0). The BlockFetch decision re-evaluation trigger currently does not include peerFetchReqsInFlight: MsgBlock triggers a re-evaluation but MsgBatchDone does not. Because that decline decision is never revisited, nodes c1 and c2 do not adopt the chain that c0 forged, which is the longest chain in the network. That is a violation of our currently-very-strict variant of Chain Growth.
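
For context, here is a minimal sketch of how a fingerprint-driven decision loop ignores state that sits outside the fingerprint. This is not the actual ouroboros-network code; Fingerprint, PeerStatus, awaitChange, and decisionLoop below are simplified stand-ins. The loop only wakes up when the fingerprint value changes, so a field such as peerFetchReqsInFlight that is not folded into the fingerprint cannot wake it, no matter how it changes.

    -- Simplified sketch; these names are illustrative stand-ins, not the real
    -- ouroboros-network definitions.
    import Control.Concurrent.STM

    -- Only fields inside the fingerprint are compared between iterations.
    -- Nothing corresponding to peerFetchReqsInFlight appears here, so a change
    -- in that counter alone leaves the fingerprint unchanged.
    data PeerStatus = PeerStatusReady | PeerStatusShutdown
      deriving Eq

    newtype Fingerprint = Fingerprint [PeerStatus]
      deriving Eq

    -- Block until the current fingerprint differs from the one we last acted on.
    awaitChange :: TVar Fingerprint -> Fingerprint -> STM Fingerprint
    awaitChange var lastFp = do
      fp <- readTVar var
      check (fp /= lastFp)   -- retry (sleep) while nothing visible has changed
      return fp

    -- The decision logic re-runs only when awaitChange wakes up; an event that
    -- does not alter the fingerprint (like MsgBatchDone in this Issue) never
    -- causes a new round of fetch decisions.
    decisionLoop :: TVar Fingerprint -> Fingerprint -> IO ()
    decisionLoop var lastFp = do
      fp <- atomically (awaitChange var lastFp)
      -- ... recompute fetch requests / decline decisions here ...
      decisionLoop var fp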

Apparent ways to resolve this Issue:

  • Ensure MsgBatchDone triggers re-evaluation (i.e. expand the content of FetchStateFingerprint to include the peerFetchReqsInFlight fields, and perhaps similarly for other bases for decline decisions?); a sketch of this option follows the list.
  • @dcoutts on Slack came up with the idea of adding an "Idle" status to the PeerStatus, which is already part of the FetchStateFingerprint.
  • Accept the current behavior as-is, and adjust the Chain Growth variant Property to anticipate it.
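
A rough sketch of option (1), again with illustrative types rather than the real FetchStateFingerprint (the field names fpPeerStatuses and fpPeerReqsInFlight are invented for this sketch): folding the per-peer in-flight request counts into the fingerprint would make the count dropping to zero on MsgBatchDone a visible change that wakes the decision logic.

    -- Hypothetical extension; types and field names are illustrative only.
    import Data.Map.Strict (Map)

    data PeerStatus = PeerStatusReady | PeerStatusShutdown
      deriving Eq

    data FetchStateFingerprint peer = FetchStateFingerprint
      { fpPeerStatuses     :: Map peer PeerStatus  -- already reflected in the fingerprint
      , fpPeerReqsInFlight :: Map peer Word        -- the proposed addition: MsgBatchDone
                                                   -- resets this to 0, changing the fingerprint
      } deriving Eq
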
@nfrisby nfrisby added consensus issues related to ouroboros-consensus bug Something isn't working labels Oct 18, 2019

nfrisby commented Oct 18, 2019

I added the Bug label. That might be an overstatement, but at the very least Duncan agreed on Slack that the current behavior is not as prompt as we'd prefer.


mrBliss commented Nov 7, 2019

@nfrisby What's the status on this?

@mrBliss mrBliss added networking and removed consensus issues related to ouroboros-consensus labels Nov 7, 2019

nfrisby commented Nov 7, 2019

I responded on Slack. Summary: Duncan had planned to do it, though I have a commit on my branch that is more "surgical" than what he had discussed, if we want to go with that for now.

@coot coot added this to the release-node-1.1.0 development milestone Nov 14, 2019
@dcoutts dcoutts modified the milestones: S2 2019-12-19, S3 2020-01-02 Dec 17, 2019
@mrBliss mrBliss removed this from the S4 2020-01-16 milestone Jan 6, 2020
dcoutts added a commit that referenced this issue Jan 6, 2020
Fixes #1147

The point is that we should only base decisions (whether to do a thing, as
opposed to things that only affect how it is done) on information that is part
of the fingerprint whose changes trigger re-evaluation.

The nub of the bug was that we were deciding to not do this request
based on peerFetchReqsInFlight == 0 which is not part of the
fingerprint. By using PeerFetchStatusReady Set.empty instead we're
making a decision on a slightly different condition, but one that's
within the info covered by the fingerprint.

The condition is different in that instead of saying the requests are
completely done, it says the set of outstanding block replies is empty. The
distinction is the final message after the last block in a range request.

But this difference really doesn't matter at all. Our decision is
simply to avoid too much network concurrency, so it doesn't matter if
there's one remaining small message to arrive.
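
To make the commit's reasoning concrete, here is a hedged sketch of the before/after condition, with simplified types rather than the actual patch (PeerFetchStatus here carries only what the sketch needs, and peerIdleBefore/peerIdleAfter are invented names): the old test looked at peerFetchReqsInFlight, which is invisible to the fingerprint, while the new test asks whether the set of outstanding block replies in the peer's status is empty, which is fingerprint-visible.

    import qualified Data.Set as Set

    -- Simplified stand-in for the peer status carried in the fingerprint.
    data PeerFetchStatus point
      = PeerFetchStatusReady (Set.Set point)  -- points of blocks still expected
      | PeerFetchStatusShutdown

    -- Before (sketch): based on a counter that is not in the fingerprint, so
    -- MsgBatchDone dropping it to 0 never re-triggers the decision logic.
    peerIdleBefore :: Word -> Bool
    peerIdleBefore peerFetchReqsInFlight = peerFetchReqsInFlight == 0

    -- After (sketch): "no outstanding block replies". The final MsgBatchDone
    -- may still be in flight, but that one small message does not affect the
    -- network concurrency the decline decision is trying to limit.
    peerIdleAfter :: PeerFetchStatus point -> Bool
    peerIdleAfter (PeerFetchStatusReady outstanding) = Set.null outstanding
    peerIdleAfter PeerFetchStatusShutdown            = False
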
@dcoutts dcoutts added byron Required for a Byron mainnet: replace the old core nodes with cardano-node priority high and removed priority medium labels Jan 7, 2020
@dcoutts dcoutts added this to the S4 2020-01-16 milestone Jan 7, 2020
@coot coot modified the milestones: S4 2020-01-16, S5 2020-01-30 Jan 9, 2020
@coot coot removed this from the S5 2020-01-30 milestone Jan 31, 2020