Integrates soft-deletes into Elasticsearch #33222

Merged 11 commits into elastic:master on Aug 31, 2018
Conversation

@dnhatn (Member) commented Aug 29, 2018

This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch.
Highlights of the work in this PR include:

  1. Replace hard-deletes by soft-deletes in InternalEngine
  2. Use _recovery_source if _source is disabled or modified (Use a _recovery_source if source is omitted or modified #31106)
  3. Soft-deletes retention policy based on the global checkpoint (Introduce soft-deletes retention policy based on global checkpoint #30335)
  4. Read operation history from Lucene instead of translog ([CCR] Read changes from Lucene instead of translog #30120)
  5. Use Lucene history in peer-recovery (Use Lucene soft-deletes in peer recovery #30522)

These pieces were already reviewed in the feature branch, but we
would like to give them an extra look before merging into upstream.

Relates #30086
Closes #29530

This work was done by the whole team; however, the following individuals
(in lexical order) made significant contributions in coding and reviewing:

Co-authored-by: Adrien Grand <[email protected]>
Co-authored-by: Boaz Leskes <[email protected]>
Co-authored-by: Jason Tedor <[email protected]>
Co-authored-by: Martijn van Groningen <[email protected]>
Co-authored-by: Nhat Nguyen <[email protected]>
Co-authored-by: Simon Willnauer <[email protected]>

@dnhatn added the >enhancement, :Distributed Indexing/Recovery, :Distributed Indexing/Engine, :Distributed Indexing/CCR, v6.5.0, and v7.0.0 labels on Aug 29, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@dnhatn (Member, Author) commented Aug 29, 2018

@elastic/es-distributed Can you please also have a look? Thank you!

@s1monw (Contributor) left a review:

Looks awesome. Great piece of work! I left some comments; most of them are nits, but there are a couple of minor issues.

}
// TODO: Avoid recalculating numDocs every time.
int numDocs = 0;
for (int i = 0; i < hardLiveDocs.length(); i++) {
Contributor:

we are still waiting for a lucene release to fix this right? this is https://issues.apache.org/jira/browse/LUCENE-8458 ?

Contributor:

if so please put a comment in here referencing the issue

dnhatn (Member, Author):

Yes, but there is something not clear on my side. I will add a comment then make the change in a follow-up so we can clarify things.

Contributor:

what is not clear on your side?

/**
* Specifies if the index should use soft-delete instead of hard-delete for update/delete operations.
*/
public static final Setting<Boolean> INDEX_SOFT_DELETES_SETTING =
Contributor:

I wonder if we should use a validator here like we use here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java#L200 to validate that the index this is set on is on version 7.0 or higher? I think this would prevent setting this on other indices and prevent confusion?

dnhatn (Member, Author):

I tried but was unable to get it done with the current Validator. I think we need to extend the Validator to pass an entire Settings instance to make this possible (previous discussion: #25560 (comment)). Would it be okay if I do this in a follow-up?

* documents increases the chance of operation-based recoveries and allows querying a longer history of documents.
* If soft-deletes is enabled, an engine by default will retain all operations up to the global checkpoint.
**/
public static final Setting<Long> INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING =
Contributor:

same comment as above?

Contributor:

should we only allow setting this one if soft deletes are enabled?

dnhatn (Member, Author):

+1
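
For illustration, a minimal sketch of the dependency check discussed above. The helper and its wiring are hypothetical — as noted above, the Validator at this point could not see sibling settings — but the setting names are the real ones:

// Hypothetical helper: reject a positive retention value when soft-deletes is disabled.
static void validateRetentionOperations(long retentionOps, boolean softDeletesEnabled) {
    if (softDeletesEnabled == false && retentionOps > 0) {
        throw new IllegalArgumentException(
            "[index.soft_deletes.retention.operations] requires [index.soft_deletes.enabled] to be true");
    }
}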

/**
* Creates a new history snapshot from Lucene for reading operations whose seq_no is in the requested seq_no range (both inclusive)
*/
public abstract Translog.Snapshot newLuceneChangesSnapshot(String source, MapperService mapperService,
Contributor:

can we just call this newChangesSnapshot ?

builder.add(new DocValuesFieldExistsQuery(recoverySourceField), BooleanClause.Occur.FILTER);
builder.add(retainSourceQuerySupplier.get(), BooleanClause.Occur.FILTER);
IndexSearcher s = new IndexSearcher(reader);
s.setQueryCache(null);
Contributor:

I know this is not necessary per se, but I think we should call s.rewrite(builder.build()) and pass the result to s.createWeight(...). We missed this in Lucene as well, i.e. if you'd pass a prefix query to this it would fail. I realized this yesterday when I worked on something else. :)

dnhatn (Member, Author):

Done
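
For reference, a minimal sketch of the rewrite-then-createWeight pattern described above (assuming Lucene 7.x, where createWeight takes a needsScores boolean; the signature differs in later Lucene versions):

IndexSearcher s = new IndexSearcher(reader);
s.setQueryCache(null);
// Rewrite first so rewritable queries (e.g. a prefix query) do not fail in createWeight.
Query rewritten = s.rewrite(builder.build());
Weight weight = s.createWeight(rewritten, false, 1.0f); // needsScores=false, boost=1.0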

// TODO: We don't have timestamps for Index operations in Lucene yet; we need to loosen this check until we do.
// We don't store versionType in the Lucene index, so we need to exclude it from this check.
final boolean sameOp;
if (newOp instanceof Translog.Index && prvOp instanceof Translog.Index) {
Contributor:

can we maybe fix the equals method of Operation rather than this? We can do it in a follow-up, no worries. I think versionType is maybe not needed for the comparison?

dnhatn (Member, Author):

I removed versionType from the comment. I will move this comparison to the Operation's equals.

try (Translog.Snapshot snapshot =
newLuceneChangesSnapshot(source, mapperService, Math.max(0, startingSeqNo), Long.MAX_VALUE, false)) {
return snapshot.totalOperations();
} catch (IOException ex) {
Contributor:

should we catch Exception here instead of IOException and check if we need to fail?

dnhatn (Member, Author):

I removed this catch and added a tragic check when we create a new snapshot.

@Override
public void afterRefresh(boolean didRefresh) {
if (didRefresh) {
refreshedCheckpoint.set(pendingCheckpoint);
Contributor:

can we add an assertion that we actually set the pendingCheckpoint?

dnhatn (Member, Author):

There is a bug with the current implementation. I added a test and fixed this in
4ed7907
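
A sketch of the listener shape with the suggested assertion (the field and supplier names are assumptions based on the snippet above, not the exact implementation):

@Override
public void beforeRefresh() {
    // Capture the checkpoint that the upcoming refresh will expose to searchers.
    pendingCheckpoint = currentCheckpointSupplier.getAsLong();
}

@Override
public void afterRefresh(boolean didRefresh) {
    assert pendingCheckpoint >= refreshedCheckpoint.get() : "pendingCheckpoint was not set before this refresh";
    if (didRefresh) {
        refreshedCheckpoint.set(pendingCheckpoint);
    }
}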

*/
final long lastRefreshedCheckpoint() {
return lastRefreshedCheckpointListener.refreshedCheckpoint.get();
}
Contributor:

extra newline please

@@ -101,6 +104,9 @@ private void updateTranslogDeletionPolicy() throws IOException {
assert minRequiredGen <= lastGen : "minRequiredGen must not be greater than lastGen";
translogDeletionPolicy.setTranslogGenerationOfLastCommit(lastGen);
translogDeletionPolicy.setMinTranslogGenerationForRecovery(minRequiredGen);

softDeletesPolicy.setLocalCheckpointOfSafeCommit(
Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY)));
Contributor:

Do we need to care about commits that don't have values for this key?

dnhatn (Member, Author):

We ensure that all index commits have this key.

private static final class IndexingStrategy {
private void addStaleDocs(final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException {
assert softDeleteEnabled : "Add history documents but soft-deletes is disabled";
docs.forEach(d -> d.add(softDeletesField));
Contributor:

nit: a for loop would feel more natural here?
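
That is, something like:

for (ParseContext.Document doc : docs) {
    doc.add(softDeletesField);
}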

@Override
public Closeable acquireRetentionLockForPeerRecovery() {
final Closeable translogLock = translog.acquireRetentionLock();
final Releasable softDeletesLock = softDeletesPolicy.acquireRetentionLock();
Contributor:

do we need to take care of releasing the translog lock if acquiring the soft deletes lock fails?

dnhatn (Member, Author):

Great catch. Otherwise we could end up acquiring one of them but not both.
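
A minimal sketch of the safe acquisition order — release the translog lock if acquiring the soft-deletes lock throws. IOUtils here is Lucene's utility class, and Releasable is assumed to be Closeable:

@Override
public Closeable acquireRetentionLockForPeerRecovery() {
    final Closeable translogLock = translog.acquireRetentionLock();
    try {
        final Releasable softDeletesLock = softDeletesPolicy.acquireRetentionLock();
        return () -> IOUtils.close(translogLock, softDeletesLock);
    } catch (Exception e) {
        IOUtils.closeWhileHandlingException(translogLock); // do not leak the first lock
        throw e;
    }
}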

final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
final Sort sortedBySeqNoThenByTerm = new Sort(
new SortedNumericSortField(SeqNoFieldMapper.NAME, SortField.Type.LONG),
new SortedNumericSortField(SeqNoFieldMapper.PRIMARY_TERM_NAME, SortField.Type.LONG, true)
Contributor:

you could use regular SortField instances since these fields are single-valued
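
That is:

final Sort sortedBySeqNoThenByTerm = new Sort(
    new SortField(SeqNoFieldMapper.NAME, SortField.Type.LONG),
    new SortField(SeqNoFieldMapper.PRIMARY_TERM_NAME, SortField.Type.LONG, true));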

}

private TopDocs searchOperations(ScoreDoc after) throws IOException {
final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
Contributor:

I'm not sure how much it would help, but when after is not null, we could cast it to a FieldDoc to extract its seq_no and use it as a lower bound of the range query. That would help skip documents that have already been visited more efficiently.

dnhatn (Member, Author):

Yes, I use "lastSeenSeqNo + 1" as the lower bound.
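
A sketch of the resulting paging logic (lastSeenSeqNo and sortedBySeqNo are assumed fields of the snapshot; note the Sep 3 follow-up commit below, where duplicate seq_nos make "lastSeenSeqNo + 1" an off-by-one):

private TopDocs searchOperations(ScoreDoc after) throws IOException {
    // Use the last seen seq_no + 1 as the lower bound so already-visited documents
    // are skipped by the range query itself rather than being re-read and filtered.
    final long lowerBound = after == null ? fromSeqNo : lastSeenSeqNo + 1;
    final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, lowerBound, toSeqNo);
    return indexSearcher.searchAfter(after, rangeQuery, searchBatchSize, sortedBySeqNo);
}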

if (fromSeqNo < 0 || toSeqNo < 0 || fromSeqNo > toSeqNo) {
throw new IllegalArgumentException("Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
}
if (searchBatchSize < 0) {
Contributor:

I think we need to reject 0 too?
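
That is, tightening the guard to:

if (searchBatchSize <= 0) {
    throw new IllegalArgumentException("search_batch_size must be positive [" + searchBatchSize + "]");
}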

@dnhatn requested a review from s1monw on August 29, 2018 21:47
@dnhatn (Member, Author) commented Aug 29, 2018

@s1monw and @jpountz I've addressed all your comments except the settings suggestion. Can you please have another look? Thank you!

@s1monw (Contributor) left a review:

LGTM

dnhatn added 3 commits August 30, 2018 07:57
Today we add a NoOp to both Lucene and the translog if we fail to process an
indexing operation. However, for failed delete operations we only add NoOps
to the translog. In order to have a complete history in Lucene, we should add
NoOps of failed delete operations to both Lucene and the translog.

Relates elastic#29530
@dnhatn (Member, Author) commented Aug 30, 2018

The last CI run failed because we changed the soft-deletes setting while restoring a snapshot. I added the soft-deletes setting to the UNMODIFIABLE_SETTINGS list in 96d5f0d.

@dnhatn (Member, Author) commented Aug 30, 2018

Hmm. CI failed due to #32299.

@elasticmachine retest this please.

@dnhatn (Member, Author) commented Aug 30, 2018

Thanks so much @s1monw and @jpountz!

@dnhatn (Member, Author) commented Aug 30, 2018

@elasticmachine run sample packaging tests please

@dnhatn (Member, Author) commented Aug 30, 2018

@elasticmachine retest this please.

@dnhatn (Member, Author) commented Aug 30, 2018

Another watcher test failure: ./gradlew :x-pack:plugin:integTestRunner -Dtests.seed=3DA1EDA8F6DFAADB -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT -Dtests.method="test {p0=watcher/usage/10_basic/Test watcher usage stats output}" -Dtests.security.manager=true -Dtests.locale=nl -Dtests.timezone=Africa/Gaborone -Dcompiler.java=10 -Druntime.java=8 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

@elasticmachine test this please.

@dnhatn merged commit 6dd0aa5 into elastic:master on Aug 31, 2018
@dnhatn deleted the soft-deletes branch on August 31, 2018 02:11
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Aug 31, 2018
Revert to correct co-author tags.
This reverts commit 6dd0aa5.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Aug 31, 2018
dnhatn added a commit that referenced this pull request Aug 31, 2018
* master:
  Integrates soft-deletes into Elasticsearch (#33222)
  Revert "Integrates soft-deletes into Elasticsearch (#33222)"
  Add support for "authorization_realms" (#33262)
dnhatn added a commit that referenced this pull request Aug 31, 2018
dnhatn added a commit that referenced this pull request Aug 31, 2018
dnhatn added a commit that referenced this pull request Sep 1, 2018
* 6.x:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  TEST: Disable soft-deletes in ParentChildTestCase
  TEST: Disable randomized soft-deletes settings
  Integrates soft-deletes into Elasticsearch (#33222)
  drop `index.shard.check_on_startup: fix` (#32279)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  TEST: mute more SmokeTestWatcherWithSecurityIT tests
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  Enable forbiddenapis server java9 (#33245)
  [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  Don't be strict for 6.x
  Update serialization versions for custom IndexMetaData backport
  Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  Painless: Fix Bindings Bug (#33274)
  SQL: prevent duplicate generation for repeated aggs (#33252)
  TEST: Mute testMonitorClusterHealth
  Fix serialization of empty field capabilities response (#33263)
  Fix nested _source retrieval with includes/excludes (#33180)
  [DOCS] TLS file resources are reloadable (#33258)
  Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  Ignore module-info in jar hell checks (#33011)
  Fix docs build after #33241
  [DOC] Repository GCS ADC not supported (#33238)
  Upgrade to latest Gradle 4.10  (#32801)
  Fix/30904 cluster formation part2 (#32877)
  Move file-based discovery to core (#33241)
  HLRC: add client side RefreshPolicy (#33209)
  [Kerberos] Add unsupported languages for tests (#33253)
  Watcher: Reload properly on remote shard change (#33167)
  Fix classpath security checks for external tests. (#33066)
  [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  Added deprecation warning for rescore in scroll queries (#33070)
  Apply settings filter to get cluster settings API (#33247)
  [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability   (#32743)
  HLRC: create base timed request class (#33216)
  HLRC: Use Optional in validation logic (#33104)
  Painless: Add Bindings (#33042)
dnhatn added a commit that referenced this pull request Sep 3, 2018
We can have multiple documents in Lucene with the same seq_no for
parent-child documents (or without rollback). In this case, the usage
"lastSeenSeqNo + 1" is an off-by-one error as it may miss some
documents. This error merely affects the `skippedOperations` contract.

See: #33222 (comment)

Closes #33318
dnhatn added a commit that referenced this pull request Sep 5, 2018
dnhatn added a commit that referenced this pull request Sep 20, 2018
Today we don't store the auto-generated timestamp of append-only
operations in Lucene, and assign -1 to every index operation
constructed from LuceneChangesSnapshot. This looks innocent but it
generates duplicate documents on a replica if a retry append-only
arrives first via peer-recovery and then the original append-only arrives
via replication. Since the retry append-only (delivered via recovery)
does not have a timestamp, the replica will happily optimize the original
request while it should not.

This change transmits the max auto-generated timestamp from the primary
to replicas before translog phase in peer recovery. This timestamp will
prevent replicas from optimizing append-only requests if retry
counterparts have been processed.

Relates #33656 
Relates #33222
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 23, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
@colings86 removed the :Distributed Indexing/CCR and :Distributed Indexing/Engine labels on Nov 2, 2018
dnhatn added a commit that referenced this pull request Dec 11, 2018
This change enables soft-deletes by default on ES 7.0.0 or later.

Relates #33222

Co-authored-by: Jason Tedor <[email protected]>
Labels: >enhancement, :Distributed Indexing/Recovery, v6.5.0, v7.0.0-beta1

5 participants