Discussion of weaker failure modes #3
Wrt "up to F server faults": I agree, there's abuse of this model all over the place (including in a bunch of things I wrote). There's a trivial analogue of CAP for this failure mode that says there's always a non-total failure that causes you to give up either consistency or availability of the non-failed machines in an asynchronous network. I think this is true, and scribbled down a proof once. It's this, rather than network partitions, that I think leads more directly to quorum models and the tradeoff between latency and availability that Daniel Abadi wrote about.

I think your suggestions are good: I included "15. Is a failed machine the same as a partitioned one?" deliberately to address the difference between the two failure models, and making this more explicit is a good thing.

What do you mean by "in the limit, failures can render..."? A crash-stop failure means messages are never received, rather than eventually received after an arbitrary delay.
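A toy sketch of the quorum connection above (the function names and the `allow_stale` switch are illustrative, not from the FAQ): with N = 2F + 1 replicas and majority quorums, a non-total failure that isolates a minority forces each minority node to choose between refusing requests (giving up availability) and answering without a quorum (giving up consistency).

```python
def majority(n: int) -> int:
    """Smallest quorum such that any two quorums intersect: floor(n/2) + 1."""
    return n // 2 + 1

def handle_read(reachable: int, n: int, allow_stale: bool) -> str:
    """Decide how a node that can reach `reachable` replicas (itself
    included) out of `n` responds to a client request."""
    if reachable >= majority(n):
        return "consistent response"      # quorum reached: both C and A hold
    if allow_stale:
        return "possibly stale response"  # gives up C to stay available
    return "refuse / timeout"             # gives up A to stay consistent

n = 5                                     # tolerates F = 2 faults, quorum = 3
assert handle_read(3, n, allow_stale=False) == "consistent response"
assert handle_read(2, n, allow_stale=False) == "refuse / timeout"
assert handle_read(2, n, allow_stale=True) == "possibly stale response"
```

The point is that only the *non-failed* minority machines face this dilemma, which is exactly the non-total-failure analogue of CAP described above.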
"There's a trivial analogue of CAP for this failure mode that says there's always a non-total failure mode that causes you to give up either consistency or availability of the non-failed machines in an asynchronous network." I think I'd agree; the idea of a (series of) non-total failure(s) that still leads to unavailability reminds me of the bivalency argument from FLP. However, I've never found bivalency to be as intuitive as, say, the ability to arbitrarily place and schedule partitions.

"...the tradeoff between latency and availability that Daniel Abadi wrote about." The notion of HA as a proxy for "coordination complexity" is really exciting. I haven't yet made an issue, but I think the latency-consistency trade-off he's highlighted is terribly important.

"What do you mean by 'in the limit, failures can render...'? A crash-stop failure means messages are never received, rather than eventually received after an arbitrary delay." Ah, sorry; I was trying to say something much more mundane. I meant "even if you build fifty redundant networks, in the extreme, you'll still be unavailable when all fifty fail"; this is related to your discussion in (10): your network is probably asynchronous.

Hmm. Perhaps it's best to simply beef up (12) to say something like: "A synchronous network must guarantee delivery of 100% of messages within a finite amount of time; if any message is delayed indefinitely, the network is asynchronous." Writing this up, I'm realizing how difficult it is to map these models to real-world networks (more kudos on your work so far!), but I think this is where the confusion lies. Perhaps giving a few examples would be useful (e.g., "what if I have redundant networks?"). But, that said, maybe this is splitting hairs.
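The proposed definition for (12) can be sketched as a predicate (a toy model of my own, not from the FAQ): a network model is synchronous only if every message arrives within some known finite bound; a single message that can be delayed indefinitely makes the model asynchronous.

```python
from typing import Iterable, Optional

def is_synchronous(delays: Iterable[Optional[float]], bound: float) -> bool:
    """True iff every observed delivery delay is finite and within `bound`.
    `None` models a message delayed indefinitely (or lost outright)."""
    return all(d is not None and d <= bound for d in delays)

# Redundant networks only shrink the probability of unbounded delay;
# they don't eliminate it. One indefinitely delayed message is enough.
assert is_synchronous([0.1, 0.5, 2.0], bound=3.0)
assert not is_synchronous([0.1, None, 2.0], bound=3.0)
```

This is exactly the "fifty redundant networks" point: composing networks changes the delays you observe, not the fact that `None` remains possible.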
In that sense there are no synchronous networks: when the Earth explodes, any messages in transit will be lost forever.
One can only hope that the development and deployment of inter-planetary distributed databases will predate the Earth's explosion! But, more seriously, while synchronous networks are indeed unrealistic in many cases, they allow for a range of "stronger" semantics/mechanisms than asynchronous networks (e.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.5990&rep=rep1&type=pdf).

I think these synchronous networks have use beyond distributed systems theory in that they allow us to consider system designs under a much less rigorous set of failure scenarios. If your networks don't partition, you can achieve a lot more than if they do. The real question is what happens when the failure scenarios that you assumed weren't going to occur actually do occur--the correct answer is usually unavailability but can be worse in practice if these corner cases haven't been accounted for in the protocol.
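One example of the "stronger mechanisms" a synchronous network buys (a sketch under that assumption; the names are mine): if every message from a live process arrives within a known bound, then silence longer than the bound is proof of a crash, so a plain timeout implements a perfect failure detector. In an asynchronous network the very same timeout can wrongly suspect a slow-but-live node.

```python
def suspect_crashed(last_heard_ago: float, bound: float) -> bool:
    """Under the synchrony assumption (all messages delivered within
    `bound` seconds), this never wrongly suspects a live node."""
    return last_heard_ago > bound

assert not suspect_crashed(0.5, bound=1.0)  # recent message: node is alive
assert suspect_crashed(5.0, bound=1.0)      # silence past the bound: crashed
```

This is precisely the kind of reasoning that collapses when the assumed failure scenario ("the bound always holds") turns out to be false in production.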
I think there are two major causes of confusion when it comes to "beating CAP." Number one is the fault model that Gilbert and Lynch assume. (Number two is the confusion over "application-level" versus "CAP" consistency, as in #4.)
I've recently seen several discussions of CAP (even in academic publications) that relax the availability requirement. Neither of these is actually Gilbert and Lynch HA; rather, each guarantees a response only within its relaxed failure domain. Here are two examples:
I'm not quite sure how best to address these in the text, but it might be useful to do so.
Two concrete suggestions:
Under "15. Is a failed machine the same as a partitioned one?" the FAQ could mention that, in an HA system, a minority partitioned server still needs to guarantee a response.
Under "12. Is my network really asynchronous?" the FAQ could mention that, in the limit, failures can render any communication network asynchronous.
Alternatively, (at the risk of starting a "list of shame"), the FAQ might expand "17. Have I 'got around' or 'beaten' the CAP theorem?" into a "list of common fallacies" like those above.
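The distinction behind suggestion 1 can be sketched as follows (a toy model; the class and reply strings are illustrative, not from Gilbert and Lynch): a crash-stopped server answers no one, so the HA requirement makes no demand of it, whereas a partitioned server still receives client requests and, to count as available, must answer them even though it cannot coordinate with the majority.

```python
from typing import Optional

class Server:
    def __init__(self, crashed: bool = False, partitioned: bool = False):
        self.crashed = crashed
        self.partitioned = partitioned

    def handle(self, request: str) -> Optional[str]:
        if self.crashed:
            return None  # never responds: exempt from the HA requirement
        if self.partitioned:
            # HA still demands a response, even without quorum coordination
            return f"uncoordinated reply to {request}"
        return f"coordinated reply to {request}"

assert Server(crashed=True).handle("read") is None
assert Server(partitioned=True).handle("read") == "uncoordinated reply to read"
```

This is why conflating the two failure models makes "beating CAP" look easier than it is: exempting partitioned minorities from responding quietly weakens Gilbert and Lynch's availability.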
I'm curious what you think and am happy to drop a pull request if there's interest.