-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix model/SUT discrepancy in Chain DB tests #3389
Comments
Here are some initial thoughts. For ease of reference, the ultimate error message is this:
Here is where the SUT actually concludes Note that So -- as a bit of a surprise to me -- it seems that this part of the LedgerDB tracking which points are in the VolDB. (Edit: After writing the rest of this, I just think the above comment is a bit confusing.) Here is where the It's the And that function is called only after chain selection: https://github.com/input-output-hk/ouroboros-network/blob/c82309f403e99d916a76bb4d96d6812fb0a9db81/ouroboros-consensus/src/Ouroboros/Consensus/Storage/ChainDB/Impl/ChainSel.hs#L808 So: the key point is not just that s 11, b 5 is still in the VolDB, but rather that it's still in the VolDB and it had been on our current chain at some previous time. So maybe we can avoid this error by enriching the model to remember which blocks that were earlier on the selected chain (and have not been GCed) and refine its
Is it because it requires generating |
I thought
So as I understand it, a point is in |
I would not expect this. "All" we need to generate is a chain that forks after the immutable tip, and ask for the validity of a point at a fork. It might be worthwhile checking what is the percentage of chains with forks are generated. |
Proposed next steps. Understand why this error was not caught beforeA necessary condition for this error is that we have a command sequence that:
The first step would be to add labelling to understand if the probability of If we find out that the probability of generating If the probabilities above do not explain why this bug was not caught earlier, we can analyze was is the probability of asking for the validity of a point that is in the volatile DB, and in a fork of the current chain. Fix the model(Edit by @nfrisby: See my earlier comment for a tour of the analogous logic that is happening in the SUT: #3389 (comment)) We should adjust the model so that it keeps track of previously applied blocks that are still in the volatile DB. Below is a diff that sketches how this could be done. -blocks m = volatileDbBlocks m <> immutableDbBlocks m
+blocks m = volatileDbBlocks m <> immutableDbBlocks m <> garbageCollectedBlocks m
getIsValid cfg m = \pt@(RealPoint _ hash) -> if
+ | pt `in` garbageCollectedBlocks ... -> Nothing
| Set.member pt validPoints -> Just True
garbageCollect :: forall blk. HasHeader blk
=> SecurityParam -> Model blk -> Model blk
garbageCollect secParam m@Model{..} = m {
- volatileDbBlocks = Map.filter (not . collectable) volatileDbBlocks
+ volatileDbBlocks = notGCed
+ garbageCollectedBlocks = gcEd
}
-- TODO what about iterators that will stream garbage collected blocks?
where
+ (notGCed, gcEd) = Map.partion (not . collectable) volatileDbBlocks
Removing "unreachable" blocks from the SUT's volatile DB is not an option, since this will make garbage collection slower (we'd need to compute the set of unreachable blocks instead of simply having to filter them based on slot numbers). |
3651: Fix: ChainDB QSM test flakyness r=Jasagredo a=Jasagredo ChainDB QSM didn't have a notion of validity for blocks that were in chain suffixes no longer connected to the Genesis block. It now carries a Set of known-valid blocks to answer the `GetIsValid` query. Closes CAD-4002 Closes CAD-3590 Closes CAD-3601 Closes #3389 Closes #3440 Co-authored-by: Javier Sagredo <[email protected]>
3651: Fix: ChainDB QSM test flakyness r=Jasagredo a=Jasagredo ChainDB QSM didn't have a notion of validity for blocks that were in chain suffixes no longer connected to the Genesis block. It now carries a Set of known-valid blocks to answer the `GetIsValid` query. Closes CAD-4002 Closes CAD-3590 Closes CAD-3601 Closes #3389 Closes #3440 Co-authored-by: Javier Sagredo <[email protected]>
3651: Fix: ChainDB QSM test flakyness r=Jasagredo a=Jasagredo ChainDB QSM didn't have a notion of validity for blocks that were in chain suffixes no longer connected to the Genesis block. It now carries a Set of known-valid blocks to answer the `GetIsValid` query. Closes CAD-4002 Closes CAD-3590 Closes CAD-3601 Closes #3389 Closes #3440 Co-authored-by: Javier Sagredo <[email protected]>
3651: Fix: ChainDB QSM test flakyness r=Jasagredo a=Jasagredo ChainDB QSM didn't have a notion of validity for blocks that were in chain suffixes no longer connected to the Genesis block. It now carries a Set of known-valid blocks to answer the `GetIsValid` query. Closes CAD-4002 Closes CAD-3590 Closes CAD-3601 Closes #3389 Closes #3440 Co-authored-by: Javier Sagredo <[email protected]>
3651: Fix: ChainDB QSM test flakyness r=jbgi a=Jasagredo ChainDB QSM didn't have a notion of validity for blocks that were in chain suffixes no longer connected to the Genesis block. It now carries a Set of known-valid blocks to answer the `GetIsValid` query. Closes CAD-4002 Closes CAD-3590 Closes CAD-3601 Closes #3389 Closes #3440 Co-authored-by: Javier Sagredo <[email protected]>
Previously, the model might forget about disconnected blocks still in the VolatileDB that it had validated previously, which is not something that can happen with the SUT, resulting in a test failure. This can arise in specific cases after garbage collection, see IntersectMBO/ouroboros-network#3389 for a concrete example.
Previously, the model might forget about disconnected blocks still in the VolatileDB that it had validated previously, which is not something that can happen with the SUT, resulting in a test failure. This can arise in specific cases after garbage collection, see IntersectMBO/ouroboros-network#3389 for a concrete example. Co-authored-by: Joris Dral <[email protected]>
Previously, the model might forget about disconnected blocks still in the VolatileDB that it had validated previously, which is not something that can happen with the SUT, resulting in a test failure. This can arise in specific cases after garbage collection, see IntersectMBO/ouroboros-network#3389 for a concrete example. Co-authored-by: Joris Dral <[email protected]>
The error
The model and SUT disagree on the validity of a point. More precisely, the point that corresponds to block with slot 11, block number 5.
For ease of reference, the ultimate error message is this:
Click to see the full error message
The error can be reproduced running:
cabal new-run test-storage -- -p "ChainDB q-s-m" --quickcheck-replay 999301
Counterexample actions
Adding these blocks would result in this chain:
What might be going wrong
The SUT considers the point that corresponds to block
s 11, b 5
as valid,whereas the model does not have information about its validity.
The model reports that it does not know about the validity of point that
corresponds to block:
i.e., it returns
Nothing
. Why?that start from genesis with the blocks from the immutable DB and the
volatile DB.
the logs, and infer this from the fact that the last seen slot when we
call GC is
14
andk=2
, so blocks 11, b 4
is at the tip of theimmutable DB).
s 10, b 4
gets removed from the volatile DB because its slot number is smaller than
slot number of the immutable tip.
s 11, b 5
using blocks in the volatile DB and immutable DB because its predecessor
was removed from the volatile DB.
block at
s 11, b 5
is unknown (it returnsNothing
).The SUT considers the point as valid because block
s 11, b 5
is still inthe volatile DB after garbage collection.
Based on this analysis, I'd conclude that the comment about
IsValidResult
is wrong: the model does not know about all valid blocks.I do not know how we want to fix this. Edsko seems to prefer weakening the
equality instance of
IsValidResult
so that it returnsTrue
when eitherthe SUT or the model return
Nothing
.I am worried that the SUT and model definitions of
getIsValid
do notmatch. I'd argue that in this case both the model and the SUT know that the
point that corresponds to
s 11, b 5
is valid.Additional questions
The text was updated successfully, but these errors were encountered: