
Test network partitioning #801

Open
edsko opened this issue Feb 7, 2019 · 7 comments
Labels
better-tests Ideas to improve the tests

Comments

@edsko
Contributor

edsko commented Feb 7, 2019

We should test what happens when the network gets temporarily partitioned. This is a particularly important test because it is the scenario that causes rollbacks.

@edsko
Contributor Author

edsko commented Feb 7, 2019

In prop_simple_bft_convergence we have

takeChainPrefix :: Chain (Block DemoBFT) -> Chain (Block DemoBFT)
takeChainPrefix = id -- in BFT, chains should indeed all be equal.

However, I think this is no longer true once we start having network failures (@brunjlar, do you agree?)

@nfrisby nfrisby self-assigned this Jul 19, 2019
@nfrisby
Contributor

nfrisby commented Jul 19, 2019

@mrBliss Here's my initial approach. Seem reasonable? Might need more complexity when eventually combined with IntersectMBO/ouroboros-network#231.

broadcastNetwork defines a mesh network for each mini protocol: every node is connected to every other node via two unidirectional Channels per mini protocol. We have `data Channel m a = Channel { send :: a -> m (), recv :: m (Maybe a) }`. So we could simulate a temporary partitioning into two node sets A and B by having the send function of every Channel connecting A and B simply discard its input when btime is inside some interval(s). A new QuickCheck generator would randomize the node sets (without IntersectMBO/ouroboros-network#231, a simple threshold on node ID should be general enough) and the disconnection interval(s).

Edit: It seems like the node may need to be notified (e.g. by an exception) that its send failed. With the approach described above, it looks like the nodes just lock up; I'm guessing they're blocked waiting on a response. I'm learning about Channel and its use in the nodes/protocols now.

Edit 2: Actually, it seems much more reasonable to block the send until the end of the network interruption instead of silently discarding it.
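
For concreteness, here is a minimal sketch of that last idea. The `Channel` type mirrors the one above; `delayDuring`, `isPartitioned` and `waitUntilHealed` are hypothetical names (presumably derived from `btime` and the generated interruption interval), not existing test-suite functions:

```haskell
-- Sketch only: block sends issued during the partition until it heals,
-- rather than silently dropping them.

-- Mirrors the Channel type used by the mock network.
data Channel m a = Channel
  { send :: a -> m ()
  , recv :: m (Maybe a)
  }

delayDuring
  :: Monad m
  => m Bool       -- ^ are we currently inside the interruption interval?
  -> m ()         -- ^ block until the interruption interval has ended
  -> Channel m a
  -> Channel m a
delayDuring isPartitioned waitUntilHealed chan = chan
  { send = \a -> do
      partitioned <- isPartitioned
      if partitioned
        then waitUntilHealed >> send chan a  -- deliver late, after the healing
        else send chan a                     -- deliver normally
  }
```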

@nfrisby
Contributor

nfrisby commented Jul 22, 2019

I have a local patch that splits the network into two non-empty partitions during one interval of the run. Once the interruption ceases, the two partitions exchange rollbacks (in the common case where they both extend their chains incompatibly during the interruption).

For BFT and PBFT, if I include logic in the interruption plan generator to bound the duration of the network interruption, then the subsequent rollbacks respect k and the tests pass. If the interruption duration is not bounded in that way, then the resulting rollbacks may exceed k, causing the tests to fail. So it's a well-behaved and useful property test.
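
Concretely, by "bound the duration" I mean something like the following generator sketch (hypothetical types, not the actual test-suite ones; it assumes the run has at least one slot):

```haskell
import Test.QuickCheck

newtype SlotNo = SlotNo Word deriving (Show, Eq, Ord)

data InterruptionPlan = InterruptionPlan
  { interruptionStart  :: SlotNo
  , interruptionLength :: Word   -- ^ in slots
  } deriving Show

-- | For BFT/PBFT every slot has exactly one leader, so each side of the
-- partition grows by at most one block per slot of the interruption.
-- Keeping the interruption at most k slots long therefore keeps the
-- eventual rollback within k.
genInterruptionPlan :: Word -> Word -> Gen InterruptionPlan
genInterruptionPlan k numSlots = do
  len   <- choose (1, min k numSlots)
  start <- choose (0, numSlots - len)
  pure (InterruptionPlan (SlotNo start) len)
```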

For the Praos test, however, the logic I used for BFT and PBFT is insufficient. For example, it schedules an interruption that BFT and PBFT would handle without failure, but that same interruption schedule happens to bridge the gap between two "crowded runs" in the Praos test. The crowded+partitioned+crowded run's duration is long enough to create unacceptably large rollbacks once it ends.

In other words, the network partitioning introduces an "interruption duration too long" concept similar to the "LeaderSchedule too crowded" concept. Moreover, the two concepts can interact to create failures when neither would individually cause a failure.

I'm unsure how to proceed and would appreciate advice. Here are some related observations.

  • The LeaderSchedule test uses suchThat to avoid generating "too crowded" leader schedules.
  • The Praos test uses the nodes' uninterpretable PRNG seed instead of a LeaderSchedule generator, so it can't avoid "too crowded" LeaderSchedules a priori. It instead dismisses them after the fact, in a guard in the Property.
  • My generator for BFT and PBFT is comparable to the LeaderSchedule test's approach, but uses carefully chosen choose boundaries rather than suchThat (both dismissal styles are sketched at the end of this comment).

And here are some options.

  • Extend the Praos property to be aware of the interruption and dismiss those cases too.
  • Extend the interruption plan generator to take a LeaderSchedule argument and avoid interruptions that problematically interact with crowded runs.
  • Don't write tests that mix the two concepts, at least at first.

Edit: removed the sentence about the LeaderSchedule test not failing; it is failing as expected.
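
For reference, the two dismissal idioms from the observations above look roughly like this (placeholder types and predicates, not the real test-suite ones):

```haskell
import Test.QuickCheck

-- Placeholder stand-in for the real LeaderSchedule.
newtype LeaderSchedule = LeaderSchedule [Int] deriving Show

tooCrowded :: LeaderSchedule -> Bool
tooCrowded (LeaderSchedule slots) = length slots > 10  -- placeholder criterion

-- LeaderSchedule-test style: avoid bad cases at generation time.
genLeaderSchedule :: Gen LeaderSchedule
genLeaderSchedule =
  (LeaderSchedule <$> arbitrary) `suchThat` (not . tooCrowded)

-- Praos-test style: the schedule is only known after interpreting the seed,
-- so generate freely and dismiss bad cases with a guard in the Property.
prop_withSeed :: Int -> Property
prop_withSeed seed =
  let schedule = scheduleFromSeed seed
  in  not (tooCrowded schedule) ==> property True  -- real consensus check here

scheduleFromSeed :: Int -> LeaderSchedule
scheduleFromSeed s = LeaderSchedule [s]  -- placeholder
```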

@mrBliss
Contributor

mrBliss commented Jul 23, 2019

What about the following approach:

  • We stop carefully tuning our generators to avoid rollbacks of more than k blocks and to avoid hard-to-predict conflicts between the network interruption and "crowded runs", and accept the fact that in some test cases we won't be able to reach consensus. We can still write our generators such that in most cases we do reach consensus (verify this with labelling!); otherwise we're not testing the interesting scenarios.
  • Define a function that, for a whole test setup (including the LeaderSchedule, any information that may only be available after running the test, etc.), determines whether we should have been able to reach consensus.
  • After running each test, we check whether we have reached consensus; if not, then our function must tell us that it was indeed acceptable not to reach consensus for the given test setup.

I have not tried it out, but I hope this makes some things easier. What do you think?
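
Roughly, in QuickCheck terms, I'm imagining something like this (a sketch with hypothetical names, not tried out; running it would additionally need Arbitrary and Show instances for the setup type):

```haskell
import Test.QuickCheck

data TestSetup  = TestSetup   -- leader schedule, interruption plan, ... (placeholder)
data TestOutput = TestOutput { reachedConsensus :: Bool }

-- | Conservative oracle: given the whole setup (and anything only known
-- after the run), should the nodes have been able to reach consensus?
consensusExpected :: TestSetup -> Bool
consensusExpected _ = True   -- placeholder

runTest :: TestSetup -> TestOutput
runTest _ = TestOutput True  -- placeholder

prop_convergence :: TestSetup -> Property
prop_convergence setup =
  label (if consensusExpected setup then "consensus expected" else "divergence acceptable") $
    reachedConsensus (runTest setup) || not (consensusExpected setup)
```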

@nfrisby
Contributor

nfrisby commented Jul 23, 2019

Yes, I've reached the same conclusion: the next thing to try is writing a dismissal predicate that determines (conservatively) when a combination of the LeaderSchedule, InterruptionPlan, etc. (eventually including #802, IntersectMBO/ouroboros-network#231, #800, ...) risks rollbacks > k. We can use it in generators with suchThat (as in the LeaderSchedule test) or in properties when an a priori check is not an option (as in the Praos test).

For that reason, I'm now working to understand the details of the LeaderSchedule (and by proxy the Praos) test property.

iohk-bors bot referenced this issue in IntersectMBO/ouroboros-network Oct 1, 2019
1074: Add basic test for Chain Growth r=nfrisby a=nfrisby

This PR is "part one" for Issue #235.

It adds a `Property` that tests for Chain Growth modulo EBBs and late joins. In particular, it assumes there are no message delays, which is effectively true now and will remain so until Issue #229 and/or #230.

From https://eprint.iacr.org/2017/573/20171115:001835

> Chain Growth (CG); with parameters τ ∈ (0, 1], s ∈ N. Consider the chains C1, C2 possessed by two honest parties at the onset of two slots sl1, sl2 with sl2 at least s slots ahead of sl1. Then it holds that len(C2) − len(C1) ≥ τ · s. We call τ the speed coefficient.

We test with a deterministic variety of values for the `τ` and `s` parameters. Because our current test framework has a limited set of possible faults, we can test Chain Growth for all values of `s` that are greater than 0. Similarly, we're able to compute a `τ` value for any given interval of slots based on the leader schedule (roughly -- see the relevant comments in the diff). Once we add message delays (latencies, partitions, etc.) these tests as-is will fail -- I'm still thinking about how to handle that, but I've decided to open this PR for now.

Co-authored-by: Nicolas Frisby <[email protected]>
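
For illustration, the quoted Chain Growth condition amounts to a check along these lines (a hypothetical helper, not the PR's actual code):

```haskell
-- | For any two honest-party observations at least s slots apart, the chain
-- must have grown by at least tau * s blocks.
chainGrowthHolds
  :: Double          -- ^ speed coefficient tau
  -> Int             -- ^ slot window s
  -> [(Int, Int)]    -- ^ (slot, chain length) observations
  -> Bool
chainGrowthHolds tau s observations = and
  [ fromIntegral (len2 - len1) >= tau * fromIntegral s
  | (sl1, len1) <- observations
  , (sl2, len2) <- observations
  , sl2 - sl1 >= s
  ]
```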
iohk-bors bot referenced this issue in IntersectMBO/ouroboros-network Oct 7, 2019
1107: Test for unexpected message delays r=nfrisby a=nfrisby

This test confirms that our current test suite is not introducing any unexpected message delays, where _message delay_ is analogous to the notion in the paper https://eprint.iacr.org/2017/573/20171115:001835.

Issues #229 and #230 will cause this property to fail; they'll have to refine it somehow.

Co-authored-by: Nicolas Frisby <[email protected]>
@edsko edsko changed the title Protocol testing: test network partitioning Test network partitioning Jan 6, 2020
@nfrisby
Contributor

nfrisby commented Mar 31, 2020

While recently re-reading the Ouroboros papers (see e.g. Issue #695), I noticed that the BFT paper does not limit how many blocks a node can roll back. Our BFT implementation, though, does limit that to the security parameter k.

This observation relates to this Issue because it invalidates the paper's claim that BFT is resilient to network partitions.

> Network splits. In the case of a network split, the network is temporarily partitioned into s connected components for some s ≥ 2, each one containing n1, ..., ns servers for a sequence of slots D. Assuming no other failures, it is easy to see that transaction processing will continue normally in each connected component [edit: the paper does not mention that chain density will decrease]. Furthermore, by slot max D + n, all servers will be activated and the system will converge to a unique blockchain. Indeed, let i be the maximal connected component that includes the server that controls the earliest time slot in the n slots that follow D, say slj. It is easy to see that after slj all servers will converge to the blockchain emitted by this server. It follows that Ouroboros-BFT is resilient to network splits. Note that transactions processed within any other connected component other than the maximal component with the winning server may be lost and hence they have to be resubmitted.

For D "large enough" (possibly also including the effects of other confounding factors like network latency etc), the connected components' chains will have unique suffixes that include more than k blocks, and so nodes running our implementation of BFT will not be able to all switch to the same chain after the network split ends.

@dcoutts
Contributor

dcoutts commented Mar 31, 2020

Yes, we inherit this aspect of Ouroboros classic, and when we're in Praos then we'll also have a k rollback limit. So this aspect of BFT is something we do take advantage of, and of course it would only be temporary anyway. So yes, we can still only tolerate partitions up to k blocks long.

@dnadales dnadales added the better-tests Ideas to improve the tests label Dec 1, 2023
@dnadales dnadales transferred this issue from IntersectMBO/ouroboros-network Dec 1, 2023