_version does not uniquely identify a particular version of a document #19269

seut · 2016-07-05T13:11:35Z

@aphyr recently discovered this resilience issue [https://github.com/crate/crate/issues/3711] while running the Jepsen test suite against Crate.
I turned the relevant Jepsen code into an integration test (based on current ES master) [https://github.com/crate/elasticsearch/commit/41ed5ebe7304710fda4de4e69479e17081042c38] using your nice network partition simulation helper, and was able to reproduce this error not only with Crate but also with plain Elasticsearch.

I've reproduced this issue on ES 2.3, 5.0-alpha3, and master.
The longer the test runs, the more often it fails; with the current default runtime of 180 seconds it fails almost always on my machine (the relevant Jepsen test runs for 360 seconds).

Currently I have no real idea why this is happening. My guess is that some reads return a stale version value, but I have not yet figured out how or why.
I also ran this scenario on a single node with one shard, because my first guess was that this might not be related to network partitions, but that test never failed.

I've read the current ES resilience issues and I couldn't see anything which could be related to this issue, but I'm also not completely sure.
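For readers who want to see the property being tested, here is a minimal sketch of the kind of check involved: group observed reads by _version and flag any version seen with more than one value. The function name and the `(version, value)` history format are assumptions made for illustration; this is not the actual Jepsen or integration test code.

```python
from collections import defaultdict


def find_version_conflicts(reads):
    """Return {version: set_of_values} for versions observed with more than one value.

    `reads` is an iterable of (version, value) pairs, a hypothetical
    simplification of the read history a test client would record.
    """
    values_by_version = defaultdict(set)
    for version, value in reads:
        values_by_version[version].add(value)
    # A version is a conflict if it was ever read with two different values.
    return {v: vals for v, vals in values_by_version.items() if len(vals) > 1}


# Hypothetical history: version 2 was observed with two different values.
history = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
conflicts = find_version_conflicts(history)
print(sorted(conflicts))  # versions that do not identify a single value
```

If _version uniquely identified a document version, this check would always return an empty dict.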

ywelsch · 2016-07-05T19:52:21Z

@seut @aphyr the issue you're observing is due to dirty reads (which can happen in all current ES versions). ES does not provide any stronger read guarantees at the moment. I've quickly hacked an integration test to illustrate why dirty reads are at play here (https://gist.github.com/ywelsch/8a5334cd59d922f5c48074fec578e71c).
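To make the dirty-read explanation concrete, here is a toy model (not Elasticsearch code, and not the linked gist). It assumes a primary applies a write locally and bumps the version before replication is confirmed, so a read served by an isolated primary can see a version number that the rest of the cluster later assigns to a different write. The `Primary` class and the values are hypothetical.

```python
class Primary:
    """Toy primary shard copy that applies writes locally before replicating."""

    def __init__(self, version, value):
        self.version = version
        self.value = value

    def index(self, value):
        # The write is visible locally as soon as it is applied, i.e. before
        # any replica has confirmed it. This is the window for a dirty read.
        self.version += 1
        self.value = value

    def read(self):
        return (self.version, self.value)


# Both copies start from the same acknowledged state at version 1.
old_primary = Primary(version=1, value="v1")  # isolated by a network partition
new_primary = Primary(version=1, value="v1")  # replica promoted on the majority side

old_primary.index("written-on-minority-side")  # never replicated or acknowledged
dirty_read = old_primary.read()                # served before the old primary steps down

new_primary.index("written-on-majority-side")  # acknowledged write
clean_read = new_primary.read()

print(dirty_read)
print(clean_read)
```

Both reads report version 2 but carry different values, which is the symptom described in this issue.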

Note that we have made some major improvements in ES v5.0.0 to ensure that all replicas have the same data once the cluster is healed and that we don't lose acknowledged writes. We have also ported the published Jepsen scenarios to our testing infrastructure (successfully passing). To verify that we are properly modeling the original Jepsen tests, we are spending some effort as well to update the original tests so that they compile against current ES versions. While we’re constantly improving the resiliency of the system we are also spending some effort on documenting the above read/write guarantees and illustrating them with test cases under simulated conditions (see section “Documentation of guarantees” in our resiliency docs: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html).

aphyr · 2016-07-06T17:02:16Z

> We have also ported the published Jepsen scenarios to our testing infrastructure (successfully passing).

If your Jepsen tests are passing, you... may want to revisit them. One of the original Jepsen tests was for linearizable operations, and since ES is clearly not linearizable, your version of the tests probably shouldn't pass.

ywelsch · 2016-07-07T17:09:52Z

@aphyr right, I should have been more specific. The challenge we're solving first is not to lose acknowledged writes. We only ported the parts of the Jepsen tests that account for this aspect. I've opened a PR to update the docs to that effect (#19303).

ywelsch · 2019-05-02T09:01:18Z

We have switched optimistic concurrency control (OCC) from the _version field to the new _seq_no (sequence number) and _primary_term (primary term) fields, which do uniquely identify each operation. To be clear, the _version field continues not to uniquely identify a particular version of a document; if you need to do this then you should move to using the _seq_no and _primary_term fields. We have updated the resiliency status page accordingly. All internal consumers that are making use of OCC (e.g. reindex, update-by-query, ...) have been switched to these new fields, and extended testing has been used to validate the semantics of this subset of the API (see e.g. #38561). Note that the above work does not address dirty reads in the broader API, which remain possible (documented here), but sequence number + primary term fields now allow uniquely identifying a change even in presence of those dirty reads. Given that dirty reads are tracked in #20031, I'm closing this issue.
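As an illustration of compare-and-set on these fields, here is a toy in-memory sketch. `ToyShard`, `Doc`, and `VersionConflict` are hypothetical names, and the model only assumes that each write gets the next sequence number under the current primary term. It is not the Elasticsearch implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    value: str
    version: int
    seq_no: int
    primary_term: int


class VersionConflict(Exception):
    """Raised when the stored document does not match the expected identifiers."""


class ToyShard:
    def __init__(self, primary_term):
        self.primary_term = primary_term
        self.next_seq_no = 0
        self.docs = {}

    def index(self, doc_id, value, if_seq_no=None, if_primary_term=None):
        # Both conditions must be given together, or not at all.
        if (if_seq_no is None) != (if_primary_term is None):
            raise ValueError("if_seq_no and if_primary_term must be used together")
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            expected = (if_seq_no, if_primary_term)
            if current is None or (current.seq_no, current.primary_term) != expected:
                raise VersionConflict(doc_id)
        version = 1 if current is None else current.version + 1
        doc = Doc(value, version, self.next_seq_no, self.primary_term)
        self.next_seq_no += 1
        self.docs[doc_id] = doc
        return doc


shard = ToyShard(primary_term=1)
first = shard.index("1", "a")
# Conditional update that succeeds because nothing changed since `first`.
second = shard.index("1", "b", if_seq_no=first.seq_no, if_primary_term=first.primary_term)
try:
    # Reusing the stale identifiers from `first` is rejected.
    shard.index("1", "c", if_seq_no=first.seq_no, if_primary_term=first.primary_term)
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
print(second, stale_write_rejected)
```

In this model a promoted primary would run with a higher primary term, so an operation applied only on an isolated old primary carries a different (sequence number, primary term) pair even when its _version collides.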

vptech20nn · 2019-05-30T17:51:58Z

Which Elasticsearch version switched to _seq_no and _primary_term? We are planning to use 6.7.

ywelsch · 2019-05-31T08:52:38Z

@vptech20nn ES 6.6 already provides optimistic concurrency control based on the _seq_no and _primary_term fields. While this addresses issues with the data replication subsystem, this subsystem also depends on the safety of the cluster coordination subsystem, which has seen a major overhaul in 7.0, allowing us to finally close out an important resiliency-related item (see "Repeated network partitions can cause cluster state updates to be lost" on the resiliency status page). If you're setting up a new cluster, consider using ES 7.x instead of 6.x. Also, please consider using the forums for user-related questions.
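For anyone looking for what such a conditional write looks like on the wire, here is a small sketch that builds the request path using the `if_seq_no` and `if_primary_term` query parameters. The helper name, index name, and document ID are hypothetical, and the `_doc` path segment assumes a typeless-style endpoint; on 6.x the type segment depends on your mapping, so check the docs for your version.

```python
from urllib.parse import quote, urlencode


def conditional_index_path(index, doc_id, seq_no, primary_term):
    """Build the path and query string for a write that only succeeds if the
    stored document still has the given _seq_no and _primary_term."""
    query = urlencode({"if_seq_no": seq_no, "if_primary_term": primary_term})
    return f"/{quote(index, safe='')}/_doc/{quote(str(doc_id), safe='')}?{query}"


# Hypothetical values, as they might be copied from a previous GET response.
path = conditional_index_path("products", "1567", seq_no=362, primary_term=2)
print(path)  # /products/_doc/1567?if_seq_no=362&if_primary_term=2
```

The values for both parameters would normally come from the `_seq_no` and `_primary_term` fields returned when the document was last read or written.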

seut mentioned this issue Jul 5, 2016

_version does not uniquely identify a particular version of a row crate/crate#3711

Closed

seut changed the title ~~_version does not uniquely identify a particular version of a row~~ _version does not uniquely identify a particular version of a document Jul 5, 2016

clintongormley added resiliency :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. labels Jul 6, 2016

ywelsch closed this as completed May 2, 2019

FrankHassanabad mentioned this issue Sep 23, 2020

[Detections][EQL] EQL rule execution in detection engine elastic/kibana#77419

Merged

6 tasks

vitaliidm mentioned this issue Aug 4, 2023

[Security Solution][Detection Engine] move lists to data stream elastic/kibana#162508

Merged

1 task
