Test with infra-slot message latency #1131
Conversation
Currently, this implementation only requires the CS and BF mini protocols to settle down. The transaction submission mini protocols never seem to quiet down -- if I wait for those mini protocol instances to settle down, the test execution live locks. Edit: It now waits for all three mini protocols to settle. I had a bug where the state forgot now-dead live pipes after blocking until the next slot instead of forgetting them before blocking. That was a deadlock, since "unforgotten" "dead" live pipes can prevent the next slot from starting.
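For concreteness, the fix amounts to the following ordering. This is only a rough sketch with made-up names (`forgetDeadPipes`, `endOfSlot`, `blockUntilNextSlot`) and plain STM instead of the MonadSTM classes the tests actually use:

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

type PipeId = Int

data PipeStatus = Live | Dead
  deriving (Eq, Show)

-- Drop the entries for pipes that have gone dead.
forgetDeadPipes :: TVar (Map.Map PipeId PipeStatus) -> STM ()
forgetDeadPipes var = modifyTVar' var (Map.filter (== Live))

-- End-of-slot handler: forget dead pipes *before* blocking until the next
-- slot. The buggy version did these two steps in the other order, so the
-- unforgotten dead pipes could prevent the next slot from ever starting.
endOfSlot :: TVar (Map.Map PipeId PipeStatus) -> IO () -> IO ()
endOfSlot livePipesVar blockUntilNextSlot = do
  atomically (forgetDeadPipes livePipesVar)
  blockUntilNextSlot
```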
force-pushed from 8413f0c to ae84f1a
It seems like a good idea to open a specific Issue for this PR, but it also seems like that should wait until PR 1128 is resolved.
=> ResourceRegistry m
-> NumSlots          -- ^ Number of slots
-> DiffTime          -- ^ Slot duration
-> (SlotNo -> m ())  -- ^ Blocks until slot is finished
This comment confused me. After a slot is finished, this function is called and the next slot will not start until the call (successfully) terminates.
Your sentence conveys a lot of the right intuition. However, I have a pedantic objection: one slot finishes exactly when the next starts, there's no duration in between.
I hope to rework this function (see other PR comments), but I will at the very least improve the names and comments.
How about now?
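To spell out the intended contract, here is a sketch of a slot driver with that behaviour -- plain IO rather than the test suite's polymorphic monad, and `runSlots` is a made-up name. The callback runs once per slot, and the next slot cannot begin until the callback for the current slot has returned.

```haskell
import Control.Concurrent (threadDelay)
import Data.Word (Word64)

newtype SlotNo = SlotNo Word64
  deriving (Show)

runSlots
  :: Int                -- ^ number of slots
  -> Int                -- ^ slot duration in microseconds
  -> (SlotNo -> IO ())  -- ^ called once the slot's duration has elapsed;
                        --   the next slot starts only after this returns
  -> IO ()
runSlots n slotLen onSlotEnd =
    mapM_ step (take n (map SlotNo [0 ..]))
  where
    step s = do
      threadDelay slotLen   -- let the slot's wall-clock duration pass
      onSlotEnd s           -- gate: the next slot cannot start until this returns
```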
}
onSlotChange btime $ \s -> do
  blockUntilQuiescent livePipesVar quiescenceThreshold
  atomically $ writeTVar latestDoneSlot s
It would be nicer if we didn't need to go through latestDoneSlot to tell the btime to advance. I suppose that otherwise blockUntilQuiescent would have to move to where the btime is created. Maybe the creation of the btime can be done in runNodeNetwork instead? Is there any reason left to do it in runTestNetwork?
My main reasons for now:
- Test.Dynamic.General is not the only call-site for newTestBlockchainTime.
- This current approach is pretty awkward, still. I'm not even sure "the slot length" is relevant anymore?
- So I thought we'd figure it out at this call-site before I attempt to alter a function with multiple call-sites.
The only reason I'm thinking of for leaving it in runTestNetwork is that it's free to introduce further details (beyond the latestDoneSlot interactions) that runNodeNetwork does not care about, so defining btime outside seems right.
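For reference, the coupling we're discussing is roughly the following -- made-up names and plain STM instead of the MonadSTM classes: the slot-change handler records the slot it finished in latestDoneSlot, and a btime-like clock blocks until that variable has caught up before it starts the next slot.

```haskell
import Control.Concurrent.STM
import Data.Word (Word64)

newtype SlotNo = SlotNo Word64
  deriving (Eq, Ord, Show)

-- What the handler does at the end of slot s (cf. the diff above).
markSlotDone :: TVar (Maybe SlotNo) -> SlotNo -> STM ()
markSlotDone latestDoneSlot s = writeTVar latestDoneSlot (Just s)

-- What a btime-like clock could do before starting the slot after s:
-- block until latestDoneSlot has caught up with s.
waitForSlotDone :: TVar (Maybe SlotNo) -> SlotNo -> STM ()
waitForSlotDone latestDoneSlot s = do
  done <- readTVar latestDoneSlot
  case done of
    Just s' | s' >= s -> pure ()
    _                 -> retry
```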
sig2 <- get
if sig1 == sig2 then pure () else go sig2

-- | Create a pipe backed by a 'LazyTMVar', add it to the live pipes, and
s/LazyTMVar/StrictTMVar?
I only used LazyTMVar because that's what was already being used. Since these mvars are simulating a network, making them strict seems in a sense more accurate. Though "time" is somewhat weird here due to io-sim, so I don't think it ultimately matters -- maybe only for tracing error calls? I'll change to StrictTMVar and see if someone else objects.
Ha, no, I think you're already using a StrictTMVar, since you're using Control.Monad.Class.MonadSTM.Strict. A partial type signature on buffer confirms this. So my comment only applies to the docstring 🙂
It now uses TQueues. Though the function's comment is now pretty big and does not mention the exact data type.
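For readers of this thread, a TQueue-backed live pipe looks roughly like this -- made-up names and plain STM, not the actual code in LivePipes.hs:

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

type PipeId = Int

-- One direction of a pipe, backed by a TQueue.
data Pipe a = Pipe
  { pipeSend :: a -> STM ()  -- ^ enqueue a message
  , pipeRecv :: STM a        -- ^ dequeue the next message (retries if empty)
  }

-- Create a TQueue-backed pipe and register it in the shared live-pipes map,
-- so that the quiescence check knows about it.
newLivePipe :: TVar (Map.Map PipeId ()) -> PipeId -> STM (Pipe a)
newLivePipe livePipesVar pid = do
  q <- newTQueue
  modifyTVar' livePipesVar (Map.insert pid ())
  pure Pipe { pipeSend = writeTQueue q, pipeRecv = readTQueue q }
```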
force-pushed from 1badc07 to 2dc06f8
OK: I fixed bugs and squashed it down to a single commit. I still haven't addressed your prior comments. FYI I intend to improve the organization, but I wanted to push up the first version that seems to be correct even if not yet good :)
force-pushed from 302ff08 to 4e1180c
I rebased onto master, improved the organization of the commits, simplified the code some, and commented it some. I still plan to break some of the new code (the "live pipes" stuff) out into its own module.
force-pushed from 18268c2 to 4a59748
I believe these commits are ready for review.
Nice work 👍 Is it correct that this helped us find one bug, namely #1147?
(Two resolved review comments on ouroboros-consensus/test-consensus/Test/Dynamic/Util/LivePipes.hs, now outdated.)
Yep, these tests revealed Issue 1147 👍 Related: I re-requested review just now only to clear your Approval. I haven't made any changes yet, but the commits currently include a workaround for Issue 1147 that does not belong on master as-is. I'd switch the PR back to Draft status if I could.
OK, I think I addressed your latest comments and then some. This PR is blocked by a proper fix for Issue 1147 -- its current commits contain a workaround. Once a fix is on master, I'll rebase (and squash) and then we can try to merge.
In particular, it no longer assumes that all slots have the same duration.
We add random network latencies but crucially also ensure that each test slot cannot end until all network channels have been empty for some duration that reasonably dwarfs any computational delays. The goal is for the node network to always reach a steady state before the next slot begins, despite random network latencies.
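The quiescence check itself can be sketched as follows -- made-up names, plain IO/STM rather than io-sim, and a single counter standing in for the per-pipe bookkeeping: snapshot a monotonically increasing traffic signature, wait for the threshold, and declare the network quiescent only if the signature did not change in the meantime.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM

blockUntilQuiescent
  :: TVar Word   -- ^ traffic signature: bumped on every message send/receive
  -> Int         -- ^ quiescence threshold, in microseconds
  -> IO ()
blockUntilQuiescent sigVar threshold = go =<< readTVarIO sigVar
  where
    go sig1 = do
      threadDelay threshold
      sig2 <- readTVarIO sigVar
      -- If nothing moved during the threshold window, the network is
      -- quiescent; otherwise keep waiting from the new signature.
      if sig1 == sig2 then pure () else go sig2
```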
* stylish-haskell fixes to this PR's diff
* typos
* change vestigial mb prefix to li
* add ioSimSecondsToDiffTime for self-documenting purposes
* refactor `diffCtor` for maintainability
force-pushed from f47d83f to 1ed2a89
force-pushed from 1ed2a89 to 0d2fa18
Status: I just rebased and tweaked the commits. I think 0d2fa18 is mergeable, but the last commit is a "surgical" minimal-diff simplification of what Duncan is planning to do, if I understand correctly.
@dcoutts On my local copy of this PR, I reverted 0d2fa18, which is the old, demonstrative commit for the Issue 1147 bug, and experimentally added the following commit which you suggested this morning on Slack.
The 1147 repro failed :( I haven't yet investigated why -- just FYI. Edit: Here's a trace of a failure. It's the original repro, but it required a different RNG seed for the message latencies. So: the topology is c0 <-> c1 <-> c2, c0 joins and leads in s24, c1 and c2 join in s29, and c0 and c2 lead in s29 (but c2 hasn't sync'd anything before leading, so its new block is too short and will eventually be discarded). The failure is that c1 (and therefore c2) does not adopt the block that c0 forges in s29 (well, at least not during s29). Here is a trace of the s29 Forges and Switches of all three nodes, as well as the BF decisions and traffic for just c1.
I suspect node 2 is irrelevant, so here is the same trace with the node 2 mentions removed.
I haven't revisited the BF logic in detail, but note the likely suspects.
Thanks! I still think my basic approach should work, but I noticed a probable bug in my solution. I've updated the branch.
It fails with the same trace as in my previous comment :( Here's what I tested with:
I just pushed 3d45937 to that branch.
This Draft PR is my first attempt at injecting networking latency into the test-consensus tests while simultaneously ensuring that every mini protocol message arrives in the same slot it was sent. The purpose is to permute the interleaving of various events without adding new difficult-to-predict mechanisms for Common Prefix violations. This PR is partial progress towards Issue IntersectMBO/ouroboros-consensus#802. This intermediate milestone is motivated by the test plan in (pending) PR #1128.
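The core mechanism can be illustrated with a small sketch -- made-up names, plain IO with System.Random instead of io-sim's simulated time and the tests' actual RNG plumbing: each send is delayed by a random latency strictly smaller than the time remaining in the current slot, so the message still arrives in the slot in which it was sent.

```haskell
import Control.Concurrent (threadDelay)
import System.Random (randomRIO)

sendWithLatency
  :: Int             -- ^ microseconds left in the current slot (assumed positive)
  -> (msg -> IO ())  -- ^ the underlying send over the channel
  -> msg
  -> IO ()
sendWithLatency timeLeftInSlot send msg = do
  latency <- randomRIO (0, timeLeftInSlot - 1)  -- strictly within the slot
  threadDelay latency
  send msg
```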