Some BlockFetch decline decisions are not revisited after completeFetchBatch #1147

Closed · nfrisby opened this issue Oct 18, 2019 · 3 comments · Fixed by #1705

Labels: bug (Something isn't working), byron (Required for a Byron mainnet: replace the old core nodes with cardano-node)

nfrisby commented Oct 18, 2019

While developing infra-slot network latencies (PR #1131), I'm seeing a violation of our current variant of Chain Growth. This Issue is complete when that test case is no longer a failure.

The current repro, which might not work directly on future master, is as follows in Test.Dynamic.Praos. (If it unexpectedly does not fail on a branch with latencies, my only suggestion is to let the latency seed vary and hope it discovers the failure.)

            testPraos' TestConfig
              { numCoreNodes
              , numSlots
              , nodeJoinPlan = NodeJoinPlan (Map.fromList [(CoreNodeId 0,SlotNo {unSlotNo = 24}),(CoreNodeId 1,SlotNo {unSlotNo = 29}),(CoreNodeId 2,SlotNo {unSlotNo = 29})])
              , nodeTopology = NodeTopology (Map.fromList [(CoreNodeId 0,Set.fromList []),(CoreNodeId 1,Set.fromList [CoreNodeId 0]),(CoreNodeId 2,Set.fromList [CoreNodeId 1])])
              , latencySeed = InjectLatencies (seedSMGen 9511500888644978603 10563634114029886209)
              }
                Seed {getSeed = (10580735904357354200,1771287816820322801,4449307331544544775,3207572460394768516,9310143438266696584)}

The specific failure happens because network latency introduces a non-negligible duration (never before seen in the test suite) between MsgBlock and MsgBatchDone. A particular node makes a FetchDeclineConcurrencyLimit decision (due to peerFetchReqsInFlight = 1) after receiving MsgBlock but before receiving MsgBatchDone (which resets peerFetchReqsInFlight to 0). The BlockFetch decision re-evaluation trigger currently does not include peerFetchReqsInFlight: MsgBlock triggers a re-evaluation but MsgBatchDone does not. Because that decline decision is never revisited, nodes c1 and c2 do not adopt the chain that c0 forged, which is the longest chain in the network. That is a violation of our currently-very-strict variant of Chain Growth.
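
For context, here is a minimal sketch of how a fingerprint-driven decision loop ignores state that sits outside the fingerprint. This is not the actual ouroboros-network code; Fingerprint, PeerStatus, awaitChange, and decisionLoop below are simplified stand-ins. The loop only wakes up when the fingerprint value changes, so a field such as peerFetchReqsInFlight that is not folded into the fingerprint cannot wake it, no matter how it changes.

    -- Simplified sketch; these names are illustrative stand-ins, not the real
    -- ouroboros-network definitions.
    import Control.Concurrent.STM

    -- Only fields inside the fingerprint are compared between iterations.
    -- Nothing corresponding to peerFetchReqsInFlight appears here, so a change
    -- in that counter alone leaves the fingerprint unchanged.
    data PeerStatus = PeerStatusReady | PeerStatusShutdown
      deriving Eq

    newtype Fingerprint = Fingerprint [PeerStatus]
      deriving Eq

    -- Block until the current fingerprint differs from the one we last acted on.
    awaitChange :: TVar Fingerprint -> Fingerprint -> STM Fingerprint
    awaitChange var lastFp = do
      fp <- readTVar var
      check (fp /= lastFp)   -- retry (sleep) while nothing visible has changed
      return fp

    -- The decision logic re-runs only when awaitChange wakes up; an event that
    -- does not alter the fingerprint (like MsgBatchDone in this Issue) never
    -- causes a new round of fetch decisions.
    decisionLoop :: TVar Fingerprint -> Fingerprint -> IO ()
    decisionLoop var lastFp = do
      fp <- atomically (awaitChange var lastFp)
      -- ... recompute fetch requests / decline decisions here ...
      decisionLoop var fp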

Apparent ways to resolve this Issue:

  • Ensure MsgBatchDone triggers re-evaluation (i.e. expand the content of FetchStateFingerprint to include the peerFetchReqsInFlight fields, and perhaps similarly for other bases for decline decisions?); a sketch of this option follows the list.
  • @dcoutts on Slack came up with the idea of adding an "Idle" status to the PeerStatus, which is already part of the FetchStateFingerprint.
  • Accept the current behavior as-is, and adjust the Chain Growth variant Property to anticipate it.
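
A rough sketch of option (1), again with illustrative types rather than the real FetchStateFingerprint (the field names fpPeerStatuses and fpPeerReqsInFlight are invented for this sketch): folding the per-peer in-flight request counts into the fingerprint would make the count dropping to zero on MsgBatchDone a visible change that wakes the decision logic.

    -- Hypothetical extension; types and field names are illustrative only.
    import Data.Map.Strict (Map)

    data PeerStatus = PeerStatusReady | PeerStatusShutdown
      deriving Eq

    data FetchStateFingerprint peer = FetchStateFingerprint
      { fpPeerStatuses     :: Map peer PeerStatus  -- already reflected in the fingerprint
      , fpPeerReqsInFlight :: Map peer Word        -- the proposed addition: MsgBatchDone
                                                   -- resets this to 0, changing the fingerprint
      } deriving Eq
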
@nfrisby nfrisby added consensus issues related to ouroboros-consensus bug Something isn't working labels Oct 18, 2019

nfrisby commented Oct 18, 2019

I added the Bug label. That might be an overstatement, but at the very least Duncan agreed on Slack that the current behavior is not as prompt as we'd prefer.


mrBliss commented Nov 7, 2019

@nfrisby What's the status on this?

@mrBliss mrBliss added networking and removed consensus issues related to ouroboros-consensus labels Nov 7, 2019

nfrisby commented Nov 7, 2019

I responded on Slack. Summary: Duncan had planned to do it, though I have a commit on my branch that is more "surgical" than what he had discussed, if we want to go with that for now.

@coot coot added this to the release-node-1.1.0 development milestone Nov 14, 2019
@dcoutts dcoutts modified the milestones: S2 2019-12-19, S3 2020-01-02 Dec 17, 2019
@mrBliss mrBliss removed this from the S4 2020-01-16 milestone Jan 6, 2020
dcoutts added a commit that referenced this issue Jan 6, 2020
Fixes #1147

The point is that we should only base decisions (whether to do a thing, as
opposed to things that only affect how it is done) on information that is part
of the fingerprint whose changes trigger re-evaluation.

The nub of the bug was that we were deciding to not do this request
based on peerFetchReqsInFlight == 0 which is not part of the
fingerprint. By using PeerFetchStatusReady Set.empty instead we're
making a decision on a slightly different condition, but one that's
within the info covered by the fingerprint.

The condition is different in that instead of saying the requests are
completely done, it says the set of outstanding block replies is empty. The
distinction is the final message after the last block in a range request.

But this difference really doesn't matter at all. Our decision is
simply to avoid too much network concurrency, so it doesn't matter if
there's one remaining small message to arrive.
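
To make the commit's reasoning concrete, here is a hedged sketch of the before/after condition, with simplified types rather than the actual patch (PeerFetchStatus here carries only what the sketch needs, and peerIdleBefore/peerIdleAfter are invented names): the old test looked at peerFetchReqsInFlight, which is invisible to the fingerprint, while the new test asks whether the set of outstanding block replies in the peer's status is empty, which is fingerprint-visible.

    import qualified Data.Set as Set

    -- Simplified stand-in for the peer status carried in the fingerprint.
    data PeerFetchStatus point
      = PeerFetchStatusReady (Set.Set point)  -- points of blocks still expected
      | PeerFetchStatusShutdown

    -- Before (sketch): based on a counter that is not in the fingerprint, so
    -- MsgBatchDone dropping it to 0 never re-triggers the decision logic.
    peerIdleBefore :: Word -> Bool
    peerIdleBefore peerFetchReqsInFlight = peerFetchReqsInFlight == 0

    -- After (sketch): "no outstanding block replies". The final MsgBatchDone
    -- may still be in flight, but that one small message does not affect the
    -- network concurrency the decline decision is trying to limit.
    peerIdleAfter :: PeerFetchStatus point -> Bool
    peerIdleAfter (PeerFetchStatusReady outstanding) = Set.null outstanding
    peerIdleAfter PeerFetchStatusShutdown            = False
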
@dcoutts dcoutts added byron Required for a Byron mainnet: replace the old core nodes with cardano-node priority high and removed priority medium labels Jan 7, 2020
@dcoutts dcoutts added this to the S4 2020-01-16 milestone Jan 7, 2020
@coot coot modified the milestones: S4 2020-01-16, S5 2020-01-30 Jan 9, 2020
@coot coot removed this from the S5 2020-01-30 milestone Jan 31, 2020