
Test network partitioning #801

Open
edsko opened this issue Feb 7, 2019 · 7 comments
Labels
better-tests Ideas to improve the tests

Comments

@edsko
Contributor

edsko commented Feb 7, 2019

We should test what happens when the network gets temporarily partitioned. This is a particularly important test because it is the scenario that causes rollbacks.

@edsko
Contributor Author

edsko commented Feb 7, 2019

In prop_simple_bft_convergence we have

takeChainPrefix :: Chain (Block DemoBFT) -> Chain (Block DemoBFT)
takeChainPrefix = id -- in BFT, chains should indeed all be equal.

However, I think this is no longer true once we start having network failures (@brunjlar, do you agree?)

@nfrisby nfrisby self-assigned this Jul 19, 2019
@nfrisby
Contributor

nfrisby commented Jul 19, 2019

@mrBliss Here's my initial approach. Seem reasonable? Might need more complexity when eventually combined with IntersectMBO/ouroboros-network#231.

broadcastNetwork defines a mesh network for each mini protocol: every node is connected to every other node via two unidirectional Channels per mini protocol. We have `data Channel m a = Channel { send :: a -> m (), recv :: m (Maybe a) }`. So we could simulate a temporary partitioning into two node sets A and B by having the send function of every Channel connecting A and B simply discard its input when btime is inside some interval(s). A new QuickCheck generator would randomize the node sets (without IntersectMBO/ouroboros-network#231, a simple threshold on node ID should be general enough) and the disconnection interval(s).

Edit: It seems like the node may need to be notified (e.g. by an exception) that its send failed. With the approach described above, it looks like the nodes just lock up; I'm guessing they're blocked waiting on a response. I'm learning about Channel and its use in the nodes/protocols now.

Edit 2: Actually, it seems much more reasonable to block the send until the end of the network interruption instead of silently discarding it.
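
For concreteness, here is a minimal sketch of that last idea. The `Channel` type mirrors the one above; `delayDuring`, `isPartitioned` and `waitUntilHealed` are hypothetical names (presumably derived from `btime` and the generated interruption interval), not existing test-suite functions:

```haskell
-- Sketch only: block sends issued during the partition until it heals,
-- rather than silently dropping them.

-- Mirrors the Channel type used by the mock network.
data Channel m a = Channel
  { send :: a -> m ()
  , recv :: m (Maybe a)
  }

delayDuring
  :: Monad m
  => m Bool       -- ^ are we currently inside the interruption interval?
  -> m ()         -- ^ block until the interruption interval has ended
  -> Channel m a
  -> Channel m a
delayDuring isPartitioned waitUntilHealed chan = chan
  { send = \a -> do
      partitioned <- isPartitioned
      if partitioned
        then waitUntilHealed >> send chan a  -- deliver late, after the healing
        else send chan a                     -- deliver normally
  }
```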

@nfrisby
Contributor

nfrisby commented Jul 22, 2019

I have a local patch that splits the network into two non-empty partitions during one interval of the run. Once the interruption ceases, the two partitions exchange rollbacks (in the common case where they both extend their chains incompatibly during the interruption).

For BFT and PBFT, if I include logic in the interruption plan generator to bound the duration of the network interruption, then the subsequent rollbacks respect k and the tests pass. If the interruption duration is not bounded in that way, then the resulting rollbacks may exceed k, causing the tests to fail. So it's a well-behaved and useful property test.
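
Concretely, by "bound the duration" I mean something like the following generator sketch (hypothetical types, not the actual test-suite ones; it assumes the run has at least one slot):

```haskell
import Test.QuickCheck

newtype SlotNo = SlotNo Word deriving (Show, Eq, Ord)

data InterruptionPlan = InterruptionPlan
  { interruptionStart  :: SlotNo
  , interruptionLength :: Word   -- ^ in slots
  } deriving Show

-- | For BFT/PBFT every slot has exactly one leader, so each side of the
-- partition grows by at most one block per slot of the interruption.
-- Keeping the interruption at most k slots long therefore keeps the
-- eventual rollback within k.
genInterruptionPlan :: Word -> Word -> Gen InterruptionPlan
genInterruptionPlan k numSlots = do
  len   <- choose (1, min k numSlots)
  start <- choose (0, numSlots - len)
  pure (InterruptionPlan (SlotNo start) len)
```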

For the Praos test, however, the logic I used for BFT and PBFT is insufficient. For example, it schedules an interruption that BFT and PBFT would handle without failure, but that same interruption schedule happens to bridge the gap between two "crowded runs" in the Praos test. The crowded+partitioned+crowded run's duration is long enough to create unacceptably large rollbacks once it ends.

In other words, the network partitioning introduces an "interruption duration too long" concept similar to the "LeaderSchedule too crowded" concept. Moreover, the two concepts can interact to create failures when neither would individually cause a failure.

I'm unsure how to proceed and would appreciate advice. Here are some related observations.

  • The LeaderSchedule test uses suchThat to avoid generating "too crowded" leader schedules.
  • The Praos test uses the nodes' uninterpretable PRNG seed instead of a LeaderSchedule generator, so it can't avoid "too crowded" LeaderSchedules a priori. It instead dismisses them after the fact, in a guard in the Property.
  • My generator for BFT and PBFT is comparable to the LeaderSchedule test's approach, but uses carefully chosen choose boundaries rather than suchThat (both dismissal styles are sketched at the end of this comment).

And here are some options.

  • Extend the Praos property to be aware of the interruption and dismiss those cases too.
  • Extend the interruption plan generator to take a LeaderSchedule argument and avoid interruptions that problematically interact with crowded runs.
  • Don't write tests that mix the two concepts, at least at first.

Edit: removed the sentence about the LeaderSchedule test not failing; it is failing as expected.
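
For reference, the two dismissal idioms from the observations above look roughly like this (placeholder types and predicates, not the real test-suite ones):

```haskell
import Test.QuickCheck

-- Placeholder stand-in for the real LeaderSchedule.
newtype LeaderSchedule = LeaderSchedule [Int] deriving Show

tooCrowded :: LeaderSchedule -> Bool
tooCrowded (LeaderSchedule slots) = length slots > 10  -- placeholder criterion

-- LeaderSchedule-test style: avoid bad cases at generation time.
genLeaderSchedule :: Gen LeaderSchedule
genLeaderSchedule =
  (LeaderSchedule <$> arbitrary) `suchThat` (not . tooCrowded)

-- Praos-test style: the schedule is only known after interpreting the seed,
-- so generate freely and dismiss bad cases with a guard in the Property.
prop_withSeed :: Int -> Property
prop_withSeed seed =
  let schedule = scheduleFromSeed seed
  in  not (tooCrowded schedule) ==> property True  -- real consensus check here

scheduleFromSeed :: Int -> LeaderSchedule
scheduleFromSeed s = LeaderSchedule [s]  -- placeholder
```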

@mrBliss
Contributor

mrBliss commented Jul 23, 2019

What about the following approach:

  • We stop carefully tuning our generators to avoid rollbacks of more than k blocks and to avoid hard-to-predict conflicts between the network interruption and "crowded runs", and accept the fact that in some test cases we won't be able to reach consensus. We can still write our generators such that in most cases we do reach consensus (verify this with labelling!); otherwise we're not testing the interesting scenarios.
  • Define a function that, for a whole test setup (including the LeaderSchedule, any information that may only be available after running the test, etc.), determines whether we should have been able to reach consensus.
  • After running each test, we check whether we have reached consensus; if not, then our function must tell us that it was indeed acceptable not to reach consensus for the given test setup.

I have not tried it out, but I hope this makes some things easier. What do you think?
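
Roughly, in QuickCheck terms, I'm imagining something like this (a sketch with hypothetical names, not tried out; running it would additionally need Arbitrary and Show instances for the setup type):

```haskell
import Test.QuickCheck

data TestSetup  = TestSetup   -- leader schedule, interruption plan, ... (placeholder)
data TestOutput = TestOutput { reachedConsensus :: Bool }

-- | Conservative oracle: given the whole setup (and anything only known
-- after the run), should the nodes have been able to reach consensus?
consensusExpected :: TestSetup -> Bool
consensusExpected _ = True   -- placeholder

runTest :: TestSetup -> TestOutput
runTest _ = TestOutput True  -- placeholder

prop_convergence :: TestSetup -> Property
prop_convergence setup =
  label (if consensusExpected setup then "consensus expected" else "divergence acceptable") $
    reachedConsensus (runTest setup) || not (consensusExpected setup)
```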

@nfrisby
Contributor

nfrisby commented Jul 23, 2019

Yes, I've reached the same conclusion: the next thing to try is writing a dismissal predicate that determines (conservatively) when a combination of the LeaderSchedule, InterruptionPlan, etc. (eventually including #802, IntersectMBO/ouroboros-network#231, #800, ...) risks rollbacks > k. We can use it in generators with suchThat (as in the LeaderSchedule test) or in properties when an a priori check is not an option (as in the Praos test).

For that reason, I'm now working to understand the details of the LeaderSchedule (and by proxy the Praos) test property.

iohk-bors bot referenced this issue in IntersectMBO/ouroboros-network Oct 1, 2019
1074: Add basic test for Chain Growth r=nfrisby a=nfrisby

This PR is "part one" for Issue #235.

It adds a `Property` that tests for Chain Growth modulo EBBs and late joins. In particular, it assumes there are no message delays, which is effectively true now and will remain so until Issue #229 and/or #230.

From https://eprint.iacr.org/2017/573/20171115:001835

> Chain Growth (CG); with parameters τ ∈ (0, 1], s ∈ N. Consider the chains C1, C2 possessed by two honest parties at the onset of two slots sl1, sl2 with sl2 at least s slots ahead of sl1. Then it holds that len(C2) − len(C1) ≥ τ · s. We call τ the speed coefficient.

We test with a deterministic variety of values for the `τ` and `s` parameters. Because our current test framework has a limited set of possible faults, we can test Chain Growth for all values of `s` that are greater than 0. Similarly, we're able to compute a `τ` value for any given interval of slots based on the leader schedule (roughly -- see the relevant comments in the diff). Once we add message delays (latencies, partitions, etc.) these tests as-is will fail -- I'm still thinking about how to handle that, but I've decided to open this PR for now.

Co-authored-by: Nicolas Frisby <[email protected]>
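
For illustration, the quoted Chain Growth condition amounts to a check along these lines (a hypothetical helper, not the PR's actual code):

```haskell
-- | For any two honest-party observations at least s slots apart, the chain
-- must have grown by at least tau * s blocks.
chainGrowthHolds
  :: Double          -- ^ speed coefficient tau
  -> Int             -- ^ slot window s
  -> [(Int, Int)]    -- ^ (slot, chain length) observations
  -> Bool
chainGrowthHolds tau s observations = and
  [ fromIntegral (len2 - len1) >= tau * fromIntegral s
  | (sl1, len1) <- observations
  , (sl2, len2) <- observations
  , sl2 - sl1 >= s
  ]
```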
iohk-bors bot referenced this issue in IntersectMBO/ouroboros-network Oct 7, 2019
1107: Test for unexpected message delays r=nfrisby a=nfrisby

This test confirms that our current test suite is not introducing any unexpected message delays, where _message delay_ is analogous to the notion in the paper https://eprint.iacr.org/2017/573/20171115:001835.

Issues #229 and #230 will cause this property to fail; they'll have to refine it somehow.

Co-authored-by: Nicolas Frisby <[email protected]>
@edsko edsko changed the title Protocol testing: test network partitioning Test network partitioning Jan 6, 2020
@nfrisby
Contributor

nfrisby commented Mar 31, 2020

While recently re-reading the Ouroboros papers (see e.g. Issue #695), I noticed that the BFT paper does not limit how many blocks a node can roll back. Our BFT implementation, though, does limit that to the security parameter k.

This observation relates to this Issue because it invalidates the paper's claim that BFT is resilient to network partitions.

> Network splits. In the case of a network split, the network is temporarily partitioned into s connected components for some s ≥ 2, each one containing n1, ..., ns servers for a sequence of slots D. Assuming no other failures, it is easy to see that transaction processing will continue normally in each connected component [edit: the paper does not mention that chain density will decrease]. Furthermore, by slot max D + n, all servers will be activated and the system will converge to a unique blockchain. Indeed, let i be the maximal connected component that includes the server that controls the earliest time slot in the n slots that follow D, say slj. It is easy to see that after slj all servers will converge to the blockchain emitted by this server. It follows that Ouroboros-BFT is resilient to network splits. Note that transactions processed within any other connected component other than the maximal component with the winning server may be lost and hence they have to be resubmitted.

For D "large enough" (possibly also including the effects of other confounding factors like network latency etc), the connected components' chains will have unique suffixes that include more than k blocks, and so nodes running our implementation of BFT will not be able to all switch to the same chain after the network split ends.

@dcoutts
Contributor

dcoutts commented Mar 31, 2020

Yes, we inherit this aspect of Ouroboros classic, and when we're in Praos then we'll also have a k rollback limit. So this aspect of BFT is something we do take advantage of, and of course it would only be temporary anyway. So yes, we can still only tolerate partitions up to k blocks long.

@dnadales dnadales added the better-tests Ideas to improve the tests label Dec 1, 2023
@dnadales dnadales transferred this issue from IntersectMBO/ouroboros-network Dec 1, 2023