Do not optimize append-only operation if normal operation with higher seq# was seen #28787
Conversation
I left some suggestions but LGTM otherwise
private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
    assert op.origin() != Operation.Origin.PRIMARY;
    maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(op.seqNo(), curr));
    assert maxSeqNoOfNonAppendOnlyOperations.get() >= op.seqNo();
this needs a message
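For illustration only, a minimal sketch of the snippet above with messages on both asserts; the message wording is my own, not taken from the PR:

private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
    assert op.origin() != Operation.Origin.PRIMARY
        : "updating max_seqno of non-append-only from a primary-origin op [" + op + "]";
    maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(op.seqNo(), curr));
    assert maxSeqNoOfNonAppendOnlyOperations.get() >= op.seqNo()
        : "max_seqno of non-append-only [" + maxSeqNoOfNonAppendOnlyOperations.get() + "] < seq_no of [" + op + "]";
}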
    return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
}

private void updateMaxSeqNoOfNonAppendOnlyOperations(Operation op) {
maybe just inline this into the planIndexingAsNonPrimary method? I think that would be cleaner.
 */
assert canOptimizeAddDocument(index);
assert index.origin() != Operation.Origin.PRIMARY;
return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
I think you should inline this into planIndexingAsNonPrimary, then we don't need all the asserts.
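For context, a rough sketch of what the inlined form inside planIndexingAsNonPrimary might look like; the surrounding control flow is assumed from the hunks quoted in this review, not copied from the PR:

// sketch only: check the non-append-only marker directly in the condition instead of a helper
final boolean appendOnlyRequest = canOptimizeAddDocument(index);
if (appendOnlyRequest && mayHaveBeenIndexedBefore(index) == false
        && index.seqNo() > maxSeqNoOfNonAppendOnlyOperations.get()) {
    assert index.version() == 1L : "can optimize on replicas but incoming version is [" + index.version() + "]";
    plan = IndexingStrategy.optimizedAppendOnly(index.seqNo());
} else {
    if (appendOnlyRequest == false) {
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(index.seqNo(), curr));
    }
    // ... fall through to the regular version-map based planning
}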
Thanks @dnhatn - I think we should agree on elastic/elasticsearch-formal-models#28 before we proceed with this. In particular, it's not yet clear that this is the only problem in this area, and we haven't settled whether we think replicas should have a different version map that only considers seq#s (the document version numbers are now only of importance on the primary).
@DaveCTurner I understand your concern. Boaz and I discussed and agreed to start the implementation for this and #28790. We hope that our reasoning for these holds up. Certainly, we need to keep the formal model and the implementation in sync.
This is great. I left some nits. As discussed before, we should wait for blessing via @DaveCTurner's work before merging.
break;
final String key = entry.getKey();
if (key.equals(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID)) {
    maxUnsafeAutoIdTimestamp.set(Math.max(maxUnsafeAutoIdTimestamp.get(), Long.parseLong(entry.getValue())));
question - why the leniency with max?
I removed the max expr.
Looks like the max is still here for timestamps? Maybe assert it's -1 (for both seq# and timestamp).
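For illustration, the suggested shape might look roughly like this, assuming the field still holds its unassigned value (-1) when the commit data is read:

if (key.equals(MAX_UNSAFE_AUTO_ID_TIMESTAMP_COMMIT_ID)) {
    assert maxUnsafeAutoIdTimestamp.get() == -1
        : "max unsafe auto-id timestamp was already set [" + maxUnsafeAutoIdTimestamp.get() + "]";
    maxUnsafeAutoIdTimestamp.set(Long.parseLong(entry.getValue()));
}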
    assert index.version() == 1L : "can optimize on replicas but incoming version is [" + index.version() + "]";
    plan = IndexingStrategy.optimizedAppendOnly(index.seqNo());
} else {
    if (appendOnlyRequest == false) {
Can we introduce a method similar to mayHaveBeenIndexedBefore that does the check and also updates maxSeqNoOfNonAppendOnlyOperations? I think it's good to keep the handling of both markers consistent.
Apparently @s1monw preferred the reverse. I'm fine with leaving it as is.
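For reference, a sketch of the kind of combined helper being suggested above; the method name and exact placement are hypothetical:

// hypothetical helper, mirroring mayHaveBeenIndexedBefore: records the seq# of normal
// (non-append-only) operations and reports whether an append-only op may conflict with one.
private boolean mayConflictWithNonAppendOnlyOperation(Index index, boolean appendOnlyRequest) {
    if (appendOnlyRequest == false) {
        // record the seq# of this normal operation so later append-only ops can be checked against it
        maxSeqNoOfNonAppendOnlyOperations.updateAndGet(curr -> Math.max(index.seqNo(), curr));
        return false; // irrelevant for non-append-only requests; they always consult the version map
    }
    // an append-only op may conflict if a normal op with an equal or higher seq# was already processed
    return index.seqNo() <= maxSeqNoOfNonAppendOnlyOperations.get();
}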
}
// doc1 is delayed and arrives after a non-append-only op.
final long seqNoDoc1 = localCheckpointTracker.generateSeqNo();
Engine.IndexResult regularDoc = engine.index(replicaIndexForDoc(
can we also test a delete?
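A very rough sketch of a delete-based variant of this test; replicaDeleteForDoc and getNumVersionLookups are assumed to exist in the test harness analogously to replicaIndexForDoc, so treat names and signatures as placeholders:

// the delete (higher seq#) is processed on the replica before the delayed append-only op (lower seq#)
final long seqNoDelete = localCheckpointTracker.generateSeqNo();
engine.delete(replicaDeleteForDoc(doc.id(), 2, seqNoDelete, System.nanoTime()));

// the delayed append-only op must now go through the version map instead of being blindly appended
engine.index(replicaIndexForDoc(doc, 1, seqNoDoc1, false));
assertThat(engine.getNumVersionLookups(), greaterThan(0L));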
One more nit - can we make the PR title more exact about what the code changes? Something like "do not optimize if a normal operation was seen with a higher seq#"..
@bleskes I've addressed your comments. Would you please take another look? Thank you.
I have updated elastic/elasticsearch-formal-models#28 as we discussed, so that's awaiting further comments. AIUI the model covers both this PR and #28790. From a first look, this seems to be quite different from the modelled solution and I'd be more comfortable if we brought them closer together (from either/both ends). In particular, we don't yet seem to be looking at the sequence numbers of replication requests here and I think we should. I'm still travelling and haven't had a lot of sleep so expect more useful/correct/accurate feedback at a later date. I'm sending this now in case it's useful to those in more westerly timezones in the meantime.
@DaveCTurner can you clarify what you mean? We use the seq# of the incoming op to update the
Apologies, that was written on my phone and is inaccurate. I meant that the modelled solution stores seq#s in the version map.
Models the fix implemented in elastic/elasticsearch#28787
LGTM.
# Conflicts:
#	server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
#	server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
#	server/src/test/java/org/elasticsearch/index/replication/ESIndexLevelReplicationTestCase.java
#	server/src/test/java/org/elasticsearch/index/replication/IndexLevelReplicationTests.java
Thanks @s1monw, @bleskes and @DaveCTurner for reviewing.
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
This models how indexing and deletion operations are handled on the replica, including the optimisations for append-only operations and the interaction with Lucene commits and the version map. It incorporates
- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
When processing an append-only operation, the primary knows that the operation can only conflict with another instance of the same operation. This is true because the id was freshly generated. However, this property doesn't hold for replicas: as soon as an auto-generated ID has been indexed into the primary, it can be exposed to a search and users can issue a follow-up operation on it. In extremely rare cases, the follow-up operation can arrive and be processed on a replica before the original append-only request. In that case we can't simply proceed with the append-only request and blindly add it to the index without consulting the version map. The following scenario can cause a difference between primary and replica:

1. The primary indexes an auto-gen-id doc (id=X, v=1, seq#=20).
2. A refresh cycle happens on the primary.
3. The new doc is picked up and modified - say by a delete-by-query request - and the primary gets a delete doc (id=X, v=2, seq#=30).
4. The delete doc is processed first on the replica (id=X, v=2, seq#=30).
5. The indexing operation arrives on the replica; since it's an auto-gen-id request and the retry marker is lower, we put it into Lucene without any check. The replica now has a doc the primary doesn't have.

To deal with a potential conflict between an append-only operation and a normal operation on replicas, we need to rely on sequence numbers. This commit maintains the max seq# of non-append-only operations on the replica and applies the optimization to an append-only operation only if its seq# is higher than the seq# of every non-append-only operation seen so far, as sketched below.
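To make the fix concrete, here is a short sketch of how the scenario above plays out on a replica once the maximum seq# of non-append-only operations is tracked; this is illustrative Java, not the exact patch:

// step 4: the delete (id=X, v=2, seq#=30) is processed first on the replica
//         -> maxSeqNoOfNonAppendOnlyOperations is raised to 30
// step 5: the delayed append-only index op (id=X, v=1, seq#=20) arrives
boolean canOptimize = canOptimizeAddDocument(index)                      // auto-gen-id, not a retry
        && index.seqNo() > maxSeqNoOfNonAppendOnlyOperations.get();      // 20 > 30 == false
// the optimization is skipped, the op is resolved through the version map,
// and the delete with the higher seq# wins - replica and primary stay in sync.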