Integrates soft-deletes into Elasticsearch #33222
Conversation
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlights of this PR include:
1. Replace hard-deletes by soft-deletes in InternalEngine
2. Use _recovery_source if _source is disabled or modified (elastic#31106)
3. Soft-deletes retention policy based on the global checkpoint (elastic#30335)
4. Read operation history from Lucene instead of translog (elastic#30120)
5. Use Lucene history in peer-recovery (elastic#30522)

This work was done by the whole team; however, the following individuals (in lexical order) contributed significantly to coding and reviewing:
Co-authored-by: Adrien Grand <[email protected]>
Co-authored-by: Boaz Leskes <[email protected]>
Co-authored-by: Jason Tedor <[email protected]>
Co-authored-by: Martijn van Groningen <[email protected]>
Co-authored-by: Nhat Nguyen <[email protected]>
Co-authored-by: Simon Willnauer <[email protected]>
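To make the core idea concrete, here is a minimal conceptual sketch (plain Java, not the actual Elasticsearch or Lucene API): a hard delete removes a document from the index, while a soft delete keeps it and marks it dead with a tombstone flag, so the full operation history stays readable for recovery and changes snapshots. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: soft deletes append tombstones instead of erasing documents.
class SoftDeleteSketch {
    static final class Op {
        final String id;
        final long seqNo;
        final boolean tombstone;
        Op(String id, long seqNo, boolean tombstone) {
            this.id = id; this.seqNo = seqNo; this.tombstone = tombstone;
        }
    }

    final List<Op> history = new ArrayList<>();

    void index(String id, long seqNo) {
        history.add(new Op(id, seqNo, false));
    }

    // Soft delete: append a tombstone instead of erasing the document.
    void softDelete(String id, long seqNo) {
        history.add(new Op(id, seqNo, true));
    }

    // Live view: the latest op per id wins; tombstoned ids are not live.
    List<String> liveIds() {
        Map<String, Op> latest = new HashMap<>();
        for (Op op : history) {
            Op prev = latest.get(op.id);
            if (prev == null || op.seqNo > prev.seqNo) {
                latest.put(op.id, op);
            }
        }
        List<String> live = new ArrayList<>();
        for (Op op : latest.values()) {
            if (!op.tombstone) live.add(op.id);
        }
        return live;
    }
}
```

Unlike a hard delete, the delete operation itself remains visible in `history`, which is what makes operation-based recovery from Lucene possible.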
Pinging @elastic/es-distributed
@elastic/es-distributed Can you please also have a look? Thank you!
We should enable this by default in 7.0 in a follow-up.
Looks awesome. Great piece of work! I left some comments; most of them are nits, but a couple are minor issues.
}
// TODO: Avoid recalculating numDocs every time.
int numDocs = 0;
for (int i = 0; i < hardLiveDocs.length(); i++) {
We are still waiting for a Lucene release to fix this, right? This is https://issues.apache.org/jira/browse/LUCENE-8458?
If so, please put a comment in here referencing the issue.
Yes, but there is something not clear on my side. I will add a comment then make the change in a follow-up so we can clarify things.
What is not clear on your side?
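For context on the TODO above: Lucene's `Bits` interface only exposes `get(i)` and `length()`, so the number of live documents has to be counted by scanning, which is why the comment suggests caching the result instead of recomputing it. A stand-in sketch of that loop, using `java.util.BitSet` in place of Lucene's `Bits` (illustrative only):

```java
// Illustrative stand-in for the loop quoted above; BitSet substitutes for Lucene's Bits.
class LiveDocCounter {
    static int countLive(java.util.BitSet hardLiveDocs, int length) {
        int numDocs = 0;
        for (int i = 0; i < length; i++) {
            if (hardLiveDocs.get(i)) { // document i is live (not hard-deleted)
                numDocs++;
            }
        }
        return numDocs;
    }
}
```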
/**
 * Specifies if the index should use soft-delete instead of hard-delete for update/delete operations.
 */
public static final Setting<Boolean> INDEX_SOFT_DELETES_SETTING =
I wonder if we should use a validator here like we use here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java#L200 to validate that the index this is set on is on version 7.0 or higher? I think this would prevent setting this on other indices and prevent confusion.
I tried but was unable to get it done with the current Validator. I think we need to extend the Validator to pass an entire Settings instance to make this possible (previous discussion #25560 (comment)). Would it be okay if I do this in a follow-up?
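The kind of check being discussed could look roughly like the following. This is a hypothetical sketch, not the actual `Setting.Validator` API (which, per the comment above, could not see the full Settings instance at the time): reject enabling soft-deletes on indices created before version 7.0.

```java
// Hypothetical sketch of the version guard discussed above; names are illustrative.
class SoftDeletesVersionCheck {
    static void validate(boolean softDeletesEnabled, int indexCreatedMajorVersion) {
        if (softDeletesEnabled && indexCreatedMajorVersion < 7) {
            throw new IllegalArgumentException(
                "soft deletes can only be enabled on indices created on or after 7.0");
        }
    }
}
```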
 * documents increases the chance of operation-based recoveries and allows querying a longer history of documents.
 * If soft-deletes is enabled, an engine by default will retain all operations up to the global checkpoint.
 **/
public static final Setting<Long> INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING =
same comment as above?
should we only allow setting this one if soft deletes are enabled?
+1
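My reading of the retention description above can be sketched as a single bound: operations whose seq_no is at or above the bound are retained; older soft-deleted documents may be reclaimed by merges. This is an illustrative formula, not the actual SoftDeletesPolicy code.

```java
// Sketch of a minimum retained seq_no derived from the global checkpoint and
// the retention-operations setting discussed above (illustrative only).
class RetentionBound {
    static long minRetainedSeqNo(long globalCheckpoint, long retentionOperations) {
        // Keep retentionOperations extra operations below the global checkpoint.
        return Math.max(0L, globalCheckpoint + 1 - retentionOperations);
    }
}
```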
/**
 * Creates a new history snapshot from Lucene for reading operations whose seq_no is in the requested seq_no range (both inclusive).
 */
public abstract Translog.Snapshot newLuceneChangesSnapshot(String source, MapperService mapperService,
can we just call this newChangesSnapshot?
builder.add(new DocValuesFieldExistsQuery(recoverySourceField), BooleanClause.Occur.FILTER);
builder.add(retainSourceQuerySupplier.get(), BooleanClause.Occur.FILTER);
IndexSearcher s = new IndexSearcher(reader);
s.setQueryCache(null);
I know this is not necessary per se, but I think we should call s.rewrite(builder.build()) and pass the result to s.createWeight(...). We missed this in Lucene too, i.e. if you passed a prefix query to this it would fail. I realized this yesterday when I worked on something else. :)
Done
// TODO: We don't have timestamps for Index operations in Lucene yet; we need to loosen this check when the timestamp is missing.
// We don't store versionType in the Lucene index, so we need to exclude it from this check.
final boolean sameOp;
if (newOp instanceof Translog.Index && prvOp instanceof Translog.Index) {
Can we maybe fix the equals method of Operation rather than this? We can do it in a follow-up, no worries. I think versionType is maybe not needed for comparison?
I removed versionType from the comment. I will move this comparison to the Operation's equals.
try (Translog.Snapshot snapshot =
         newLuceneChangesSnapshot(source, mapperService, Math.max(0, startingSeqNo), Long.MAX_VALUE, false)) {
    return snapshot.totalOperations();
} catch (IOException ex) {
should we catch Exception here instead of IOException and check if we need to fail?
I removed this catch and added a tragic check when we create a new snapshot.
@Override
public void afterRefresh(boolean didRefresh) {
    if (didRefresh) {
        refreshedCheckpoint.set(pendingCheckpoint);
can we add an assertion that we actually set the pendingCheckpoint?
There is a bug with the current implementation. I added a test and fixed this in 4ed7907.
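The pending/refreshed checkpoint handoff being reviewed here can be simplified to the following sketch (illustrative names, not the actual InternalEngine listener): beforeRefresh records the checkpoint the refresh will expose, and afterRefresh publishes it only if the refresh actually happened, with the assertion the reviewer asked for.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplification of the refresh listener discussed above.
class LastRefreshedCheckpointListener {
    final AtomicLong refreshedCheckpoint;
    long pendingCheckpoint = -1;

    LastRefreshedCheckpointListener(long initial) {
        this.refreshedCheckpoint = new AtomicLong(initial);
    }

    void beforeRefresh(long currentCheckpoint) {
        // All operations up to this checkpoint will be visible after the refresh.
        pendingCheckpoint = currentCheckpoint;
    }

    void afterRefresh(boolean didRefresh) {
        assert pendingCheckpoint >= refreshedCheckpoint.get() : "pendingCheckpoint was not set";
        if (didRefresh) {
            refreshedCheckpoint.set(pendingCheckpoint);
        }
    }
}
```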
 */
final long lastRefreshedCheckpoint() {
    return lastRefreshedCheckpointListener.refreshedCheckpoint.get();
}
extra newline please
@@ -101,6 +104,9 @@ private void updateTranslogDeletionPolicy() throws IOException {
    assert minRequiredGen <= lastGen : "minRequiredGen must not be greater than lastGen";
    translogDeletionPolicy.setTranslogGenerationOfLastCommit(lastGen);
    translogDeletionPolicy.setMinTranslogGenerationForRecovery(minRequiredGen);

    softDeletesPolicy.setLocalCheckpointOfSafeCommit(
        Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY)));
Do we need to care about commits that don't have values for this key?
We ensure that all index commits have this key.
private static final class IndexingStrategy {
private void addStaleDocs(final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException {
    assert softDeleteEnabled : "Add history documents but soft-deletes is disabled";
    docs.forEach(d -> d.add(softDeletesField));
nit: a for loop would feel more natural here?
@Override
public Closeable acquireRetentionLockForPeerRecovery() {
    final Closeable translogLock = translog.acquireRetentionLock();
    final Releasable softDeletesLock = softDeletesPolicy.acquireRetentionLock();
do we need to take care of releasing the translog lock if acquiring the soft deletes lock fails?
Great catch. We should acquire one of them but not both.
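The leak the reviewer caught has a standard shape: when acquiring the second lock throws, the first must be released before the exception propagates. A self-contained sketch of that pattern (illustrative names, not the actual InternalEngine code):

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of acquiring two retention locks safely: release the first if
// acquiring the second fails, so neither lock is leaked.
class RetentionLocks {
    interface LockSource {
        Closeable acquire() throws IOException;
    }

    static Closeable acquireBoth(LockSource translog, LockSource softDeletes) throws IOException {
        Closeable first = translog.acquire();
        Closeable second;
        try {
            second = softDeletes.acquire();
        } catch (Exception e) {
            first.close(); // release the translog lock on failure
            throw e;
        }
        // Closing the combined lock releases both underlying locks.
        return () -> {
            try (first; second) {
            }
        };
    }
}
```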
final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
final Sort sortedBySeqNoThenByTerm = new Sort(
    new SortedNumericSortField(SeqNoFieldMapper.NAME, SortField.Type.LONG),
    new SortedNumericSortField(SeqNoFieldMapper.PRIMARY_TERM_NAME, SortField.Type.LONG, true)
you could use regular SortField instances since these fields are single-valued
}

private TopDocs searchOperations(ScoreDoc after) throws IOException {
    final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
I'm not sure how much it would help, but when after is not null, we could cast it to a FieldDoc to extract its seq_no and use it as a lower bound of the range query. That would help skip documents that have already been visited more efficiently.
Yes, I use "lastSeenSeqNo + 1" as the lower bound.
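The optimization can be illustrated with a plain-Java sketch (names illustrative, not the actual LuceneChangesSnapshot code): the range lower bound is raised to lastSeenSeqNo + 1 so already-visited documents are skipped. Note the caveat found later in a follow-up (#33318): when two documents share the same seq_no (e.g. parent/child documents), the second one is skipped as well.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of paging operations by seq_no with a raised lower bound.
class SeqNoPaging {
    static List<Long> nextBatch(List<Long> seqNosInOrder, long lastSeenSeqNo, int batchSize) {
        List<Long> batch = new ArrayList<>();
        for (long seqNo : seqNosInOrder) {
            if (seqNo >= lastSeenSeqNo + 1) { // the raised lower bound
                batch.add(seqNo);
                if (batch.size() == batchSize) {
                    break;
                }
            }
        }
        return batch;
    }
}
```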
if (fromSeqNo < 0 || toSeqNo < 0 || fromSeqNo > toSeqNo) {
    throw new IllegalArgumentException("Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
}
if (searchBatchSize < 0) {
I think we need to reject 0 too?
LGTM
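The agreed-upon checks can be sketched as follows (names illustrative), with the batch-size test tightened to reject zero as the reviewer suggested:

```java
// Sketch of the argument validation discussed in this thread.
class ChangesSnapshotArgs {
    static void validate(long fromSeqNo, long toSeqNo, int searchBatchSize) {
        if (fromSeqNo < 0 || toSeqNo < 0 || fromSeqNo > toSeqNo) {
            throw new IllegalArgumentException(
                "Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
        }
        if (searchBatchSize <= 0) { // reject 0 as well as negative sizes
            throw new IllegalArgumentException(
                "Search_batch_size must be positive [" + searchBatchSize + "]");
        }
    }
}
```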
Today we add a NoOp to both Lucene and the translog if we fail to process an indexing operation; however, for failed delete operations we only add a NoOp to the translog. In order to have a complete history in Lucene, we should add NoOps for failed delete operations to both Lucene and the translog. Relates elastic#29530
Hmm. CI failed due to #32299. @elasticmachine retest this please.
@elasticmachine run sample packaging tests please
@elasticmachine retest this please.
Another watcher test failure. @elasticmachine test this please.
Revert to correct co-author tags. This reverts commit 6dd0aa5.
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlights of this PR include:
- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (elastic#31106)
- Soft-deletes retention policy based on the global checkpoint (elastic#30335)
- Read operation history from Lucene instead of translog (elastic#30120)
- Use Lucene history in peer-recovery (elastic#30522)

Relates elastic#30086
Closes elastic#29530

---

This work was done by the whole team; however, the following individuals (in lexical order) contributed significantly to coding and reviewing:
Co-authored-by: Adrien Grand <[email protected]>
Co-authored-by: Boaz Leskes <[email protected]>
Co-authored-by: Jason Tedor <[email protected]>
Co-authored-by: Martijn van Groningen <[email protected]>
Co-authored-by: Nhat Nguyen <[email protected]>
Co-authored-by: Simon Willnauer <[email protected]>
* 6.x:
  - Mute test watcher usage stats output
  - [Rollup] Fix FullClusterRestart test
  - TEST: Disable soft-deletes in ParentChildTestCase
  - TEST: Disable randomized soft-deletes settings
  - Integrates soft-deletes into Elasticsearch (#33222)
  - drop `index.shard.check_on_startup: fix` (#32279)
  - Fix AwaitsFix issue number
  - Mute SmokeTestWatcherWithSecurityIT tests
  - [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  - TEST: mute more SmokeTestWatcherWithSecurityIT tests
  - [DOCS] Move rollup APIs to docs (#31450)
  - [DOCS] Rename X-Pack Commands section (#33005)
  - Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  - TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  - MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  - Fix pom for build-tools (#33300)
  - Lazy evaluate java9home (#33301)
  - SQL: test coverage for JdbcResultSet (#32813)
  - Work around to be able to generate eclipse projects (#33295)
  - Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  - [ML] Refactor delimited file structure detection (#33233)
  - SQL: Support multi-index format as table identifier (#33278)
  - Enable forbiddenapis server java9 (#33245)
  - [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  - Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  - Don't be strict for 6.x
  - Update serialization versions for custom IndexMetaData backport
  - Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  - Painless: Fix Bindings Bug (#33274)
  - SQL: prevent duplicate generation for repeated aggs (#33252)
  - TEST: Mute testMonitorClusterHealth
  - Fix serialization of empty field capabilities response (#33263)
  - Fix nested _source retrieval with includes/excludes (#33180)
  - [DOCS] TLS file resources are reloadable (#33258)
  - Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  - Ignore module-info in jar hell checks (#33011)
  - Fix docs build after #33241
  - [DOC] Repository GCS ADC not supported (#33238)
  - Upgrade to latest Gradle 4.10 (#32801)
  - Fix/30904 cluster formation part2 (#32877)
  - Move file-based discovery to core (#33241)
  - HLRC: add client side RefreshPolicy (#33209)
  - [Kerberos] Add unsupported languages for tests (#33253)
  - Watcher: Reload properly on remote shard change (#33167)
  - Fix classpath security checks for external tests. (#33066)
  - [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  - Added deprecation warning for rescore in scroll queries (#33070)
  - Apply settings filter to get cluster settings API (#33247)
  - [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability (#32743)
  - HLRC: create base timed request class (#33216)
  - HLRC: Use Optional in validation logic (#33104)
  - Painless: Add Bindings (#33042)
We can have multiple documents in Lucene with the same seq_no for parent-child documents (or without rollback). In this case, using "lastSeenSeqNo + 1" as the lower bound is an off-by-one error, as it may miss some documents. This error merely affects the `skippedOperations` contract. See: #33222 (comment) Closes #33318
Today we don't store the auto-generated timestamp of append-only operations in Lucene, and we assign -1 to every index operation constructed from LuceneChangesSnapshot. This looks innocent, but it generates duplicate documents on a replica if a retry append-only arrives first via peer-recovery, then the original append-only arrives via replication. Since the retry append-only (delivered via recovery) does not have a timestamp, the replica will happily optimize the original request while it should not. This change transmits the max auto-generated timestamp from the primary to replicas before the translog phase in peer recovery. This timestamp will prevent replicas from optimizing append-only requests if retry counterparts have been processed. Relates #33656 Relates #33222
This change enables soft-deletes by default on ES 7.0.0 or later. Relates #33222 Co-authored-by: Jason Tedor <[email protected]>
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch.
Highlights of this PR include:
- Replace hard-deletes by soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of translog (#30120)
- Use Lucene history in peer-recovery (#30522)

These pieces were reviewed already in the feature branch, but we would like to give them an extra look before pulling into the upstream.

Relates #30086
Closes #29530
This work was done by the whole team; however, the following individuals (in lexical order) contributed significantly to coding and reviewing:
Co-authored-by: Adrien Grand [email protected]
Co-authored-by: Boaz Leskes [email protected]
Co-authored-by: Jason Tedor [email protected]
Co-authored-by: Martijn van Groningen [email protected]
Co-authored-by: Nhat Nguyen [email protected]
Co-authored-by: Simon Willnauer [email protected]