Network partitions can cause divergence, dirty reads, and lost updates. #20031

aphyr · 2016-08-17T17:23:08Z

Hello again! I hope everyone's having a nice week. :-)

Since #7572 (A network partition can cause in flight documents to be lost) is now closed, and the resiliency page reports that "Loss of documents during network partition" is now addressed in 5.0.0, I've updated the Jepsen Elasticsearch tests for 5.0.0-alpha5. In these tests, Elasticsearch appears to allow:

Dirty reads
Divergence between replicas
The loss of successfully inserted documents

This work was funded by Crate.io, which uses Elasticsearch internally and is affected by the same issues in their fork.

Here's the test code, and the supporting library which uses the Java client (also at 5.0.0-alpha5). I've lowered timeouts to help ES recover faster from the network shenanigans in our tests.

Several clients concurrently index documents, each with a unique ID that is never retried. Writes succeed if the index request returns RestStatus.CREATED.

Meanwhile, several clients concurrently attempt to get recently inserted documents by ID. Reads are considered successful if response.isExists() is true.

During this time, we perform a sequence of simple minority/majority network partitions: 20 seconds on, 10 seconds off.

At the end of the test, we heal the network, allow 10 seconds for quiescence (I've set ES timeouts/polling intervals much lower than normal to allow faster convergence), and have every client perform an index refresh. Refreshes are retried until every shard reports successful. Somewhat confusingly, getSuccessfulShards + getFailedShards is rarely equal to getTotalShards, and .getShardFailures appears to always be empty; perhaps there's an indeterminate, unreported shard state between success and failure? In any case, I check that getTotalShards is equal to getSuccessfulShards.

Once every client has performed a refresh, each client performs a read of all documents in the index. We're looking for four-ish cases:

Divergence: do nodes agree on the documents in the index?
Dirty reads: Are we able to read records which were not successfully inserted, as determined by their absence in a final read?
Some lost updates: Are successful writes missing on some nodes but not others?
Lost updates: Are successful writes missing on every node?

All of these cases appear to occur. Here's a case where some acknowledged writes were lost by every node (757 lost, 3374 present):

{:dirty-read
 {:dirty
  #{1111 1116 1121 1123 1124 1125 1126 1128 1129 1131 1137 1141 1143
    ...
    2311 2312 2313 2314 2315 2317 2321 2322 2324 2325 2326 2327 2329
    2333},
  :nodes-agree? true,
  :valid? false,
  :lost-count 757,
  :some-lost-count 757,
  :lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :dirty-count 573,
  :read-count 3101,
  :on-some-count 3374,
  :not-on-all #{},
  :some-lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :not-on-all-count 0,
  :on-all-count 3374,
  :unchecked-count 846},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

And here's a case where nodes disagree on the contents of the index

{:dirty-read
 {:dirty #{},
  :nodes-agree? false,
  :valid? false,
  :lost-count 0,
  :some-lost-count 109,
  :lost #{},
  :dirty-count 0,
  :read-count 2014,
  :on-some-count 5738,
  :not-on-all
  #{2835 2937 2945 2950 2967 2992 2999 3004 3007 3011 3014 3019 3027
    ...
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :some-lost
  #{4556 4557 4559 4560 4573 4575 4577 4583 4589 4595 4608 4613 4619
    4621 4629 4637 4639 4642 4651 4653 4655 4660 4664 4671 4680 4681
    4687 4702 4708 4721 4722 4723 4725 4728 4741 4744 4746 4759 4763
    4768 4769 4784 4795 4819 4832 4838 4849 4850 4858 4872 4878 4884
    4894 4899 4904 4916 4920 4921 4923 4928 4930 4935 4938 4942 4949
    4952 4958 4961 4971 4985 5022 5046 5056 5067 5082 5088 5104 5123
    5125 5126 5138 5147 5159 5172 5187 5188 5200 5231 5241 5252 5257
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :not-on-all-count 317,
  :on-all-count 5421,
  :unchecked-count 3724},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

not-on-all denotes documents which were present on some, but not all, nodes. some-lost is the subset of those documents which were successfully inserted and should have been present.

I've also demonstrated lost updates due to dirty read semantics plus _version divergence, but I haven't ported that test to ES 5.0.0 yet.

Cheers!

Elasticsearch version: 5.0.0-alpha-5

Plugins installed: []

JVM version: Oracle JDK 1.8.0_91

OS version: Debian Jessie

Description of the problem including expected versus actual behavior: Elasticsearch should probably not forget about inserted documents which were acknowledged

Steps to reproduce:

Clone Jepsen at 63c34b16679f1aad396c1b3e212f92646dcff1be
cd elasticsearch
lein test
Wait a few hours. I haven't had time to narrow down the conditions required to trigger this bug.

The text was updated successfully, but these errors were encountered:

clintongormley · 2016-08-17T17:50:24Z

thanks @aphyr - we'll dive into it

jeffknupp · 2016-08-30T18:09:54Z

Will this block the 5.0 release (it seems as though it should, since resiliency is one of the features being heavily touted in the new version)?

bleskes · 2016-09-19T07:16:04Z

@aphyr thanks again for the clear description. Responding to the separate issues you mention:

Dirty reads - as discussed before this is possible with Elasticsearch. We are working on adding documentation to make it clear.
Replica/Primary divergence - the nature of primary backup replication allows replicas to be out of sync with each other while operations are in flight (not committed). A Primary failure at the moment will mean that the new primary is potentially out of sync with the other replicas but unlike the old primary it has no knowledge of which non-acked documents were still in flight and are missing from some of the replicas. Currently this is only fixed with the next replica relocation or recovery - which takes too long to correct it self. Dirty reads exposes this as well. We are currently working on document level primary/replica syncing method (see Add Sequence Numbers to write operations #10708) which will be used immediately upon primary promotion.
Loss of acknowledge writes - this one was interesting :) @ywelsch and I tracked it down and this is what happens:

With 5.0 we added the notion of "in sync copies" to keep track of valid copies. The valid copies are used to track which copies are allowed to become primaries. That set of what we call allocation ids is tracked in the cluster state. In the test a change to this set that was committed by the master was lost during a partition, causing a stale shard copy to be promoted to primary when the partition healed. This caused data loss. This issue is addressed in #20384 and will be part of the imminent 5.0 beta1 release. With it in place we run Jepsen for almost two days straight with no data loss. However, as mentioned in the ticket, there is still a very small chance of this happening. We have added an entry to our resiliency page about this. We are working to fix that too, although that will take quite a bit longer and will not make it for the 5.0 release.

@aphyr I would love it if you can verify our findings. I will be happy to supply a snapshot build of beta1.

aphyr · 2016-09-22T19:50:22Z

Hi @bleskes. I'd be delighted to work on verification. I've got an ongoing contract that's keeping me pretty busy right now, but if you'd like to send me an email ([email protected]) I'll reach out when I've got availability for more testing.

elasticmachine · 2018-03-27T14:43:02Z

Pinging @elastic/es-distributed

saiergong · 2018-05-18T01:30:46Z

i wonder when primary shard send operation to its replicas, and one of the replicas acknowleged this request to primary but the other replica did not conform, and primary not acknowlege to client. then a read request send to the acknowleged replica,Can it read the latest data that has not been acknowlege to the client?@bleskes

jasontedor · 2018-05-18T01:41:15Z

@saiergong Yes.

saiergong · 2018-05-18T01:56:46Z

many thanks @jasontedor , but do we think its ok? or we have plan to resolve this?

jasontedor · 2018-05-18T02:17:55Z

@saiergong Indeed it is not ideal however we consider it to be a lower priority than the other problems that we are currently trying to solve.

saiergong · 2018-05-18T02:49:35Z

i know , thanks @jasontedor

ywelsch · 2020-02-17T07:56:03Z

All known issues raised here related to divergence and lost updates of documents have been fixed as part of the sequence numbers effort and the new cluster coordination subsystem introduced in ES 7, with tests in place that run checks similar to what the Jepsen Elasticsearch tests do.

Dirty reads are documented in the Elasticsearch reference docs, and I've opened a dedicated issue (#52400) to track any related work on this.

clintongormley added resiliency :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. Meta labels Aug 17, 2016

aphyr mentioned this issue Aug 25, 2016

[very sorry to bother]based on your test, latest version of elasticsearch won't have split brain problem, right? jepsen-io/jepsen#151

Closed

bleskes added the Pioneer Program label Sep 19, 2016

dakrone mentioned this issue Sep 27, 2016

Jepsen transient failures under network partition conditions #7549

Closed

clintongormley removed the Pioneer Program label May 9, 2017

dakrone added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. labels Mar 27, 2018

ywelsch mentioned this issue May 2, 2019

_version does not uniquely identify a particular version of a document #19269

Closed

ywelsch mentioned this issue Feb 17, 2020

Only expose safely stored data to reads #52400

Open

ywelsch closed this as completed Feb 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Network partitions can cause divergence, dirty reads, and lost updates. #20031

Network partitions can cause divergence, dirty reads, and lost updates. #20031

aphyr commented Aug 17, 2016 •

edited

Loading

clintongormley commented Aug 17, 2016

jeffknupp commented Aug 30, 2016

bleskes commented Sep 19, 2016 •

edited

Loading

aphyr commented Sep 22, 2016

elasticmachine commented Mar 27, 2018

saiergong commented May 18, 2018 •

edited

Loading

jasontedor commented May 18, 2018

saiergong commented May 18, 2018

jasontedor commented May 18, 2018

saiergong commented May 18, 2018

ywelsch commented Feb 17, 2020

Network partitions can cause divergence, dirty reads, and lost updates. #20031

Network partitions can cause divergence, dirty reads, and lost updates. #20031

Comments

aphyr commented Aug 17, 2016 • edited Loading

clintongormley commented Aug 17, 2016

jeffknupp commented Aug 30, 2016

bleskes commented Sep 19, 2016 • edited Loading

aphyr commented Sep 22, 2016

elasticmachine commented Mar 27, 2018

saiergong commented May 18, 2018 • edited Loading

jasontedor commented May 18, 2018

saiergong commented May 18, 2018

jasontedor commented May 18, 2018

saiergong commented May 18, 2018

ywelsch commented Feb 17, 2020

aphyr commented Aug 17, 2016 •

edited

Loading

bleskes commented Sep 19, 2016 •

edited

Loading

saiergong commented May 18, 2018 •

edited

Loading