-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not expose hard-deleted docs in Lucene history #32333
Conversation
Today when reading operation history in Lucene, we read all documents. However, if indexing a document is aborted, IndexWriter will hard-delete it; we, therefore, need to exclude that document from Lucene history. This commit makes sure that we exclude aborted documents by using the hard liveDocs of a SegmentReader if there are deletes. Note that this wrapper does not work well with IndexWriter#tryDeleteDocument. We need to revisit the wrapper after LUCENE-8425 gets in.
This reverts commit 8e66a93.
Pinging @elastic/es-distributed |
if (si.getDelCount() == 0) { | ||
return new LeafReaderWithLiveDocs(segmentReader, null, segmentReader.maxDoc()); | ||
} else { | ||
Bits hardLiveBits = si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to think this through and we might be subject to concurrent deletes if we do it this way. Can't we instead do something like this:
DocIdSetIterator soft_deletes = DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator("soft_deletes", sr);
Bits liveDocs = sr.getLiveDocs();
FixedBitSet hardLiveDocs = new FixedBitSet(numDeletes);
hardLiveDocs.set(0, numDeletes);
for (int i = 0; i < liveDocs.length(); i++) {
if (liveDocs.get(i) == false) {
if (soft_deletes.docID() < i) {
int doc = soft_deletes.docID() == DocIdSetIterator.NO_MORE_DOCS ? DocIdSetIterator.NO_MORE_DOCS : soft_deletes.advance(i);
if (doc != i) {
hardLiveDocs.clear(i);
}
}
}
}
note I didn't try this out.. just to provide an idea
@s1monw I've updated the PR using the hardLiveDocs exposed in Lucene. I wonder if we should expose "isNRT" property of a SegmentReader so that we know if we can use SegmentInfos to calculate numDocs instead of getting the cardinality of the liveDocs. Can you please have another look? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments.
return new LeafReaderWithLiveDocs(leaf, null, leaf.maxDoc()); | ||
} | ||
final int numDocs; | ||
if (hardLiveDocs instanceof FixedBitSet) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it ever kick in? I would assume you get a FixedBits
instance for live docs instead of a mutable FixedBitSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we convert then return a read-only bits. I removed this 💯.
assertThat(actualDocs, equalTo(liveDocs)); | ||
} | ||
IOUtils.close(writer, dir); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make it two tests? one that never inserts a document that fails indexing, and another one that always inserts one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@jpountz I have addressed your feedbacks? Can you please have another look? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We should probably leave a TODO about avoiding to count live docs. Unfortunately we can't do it easily with a FilterDirectoryReader since composite readers count documents eagerly, but maybe we could find a different way to consume hard deletes that doesn't require to count them.
@jpountz Thanks for reviewing. I added a TODO. |
* elastic/ccr: (57 commits) ShardFollowNodeTask should fetch operation once (elastic#32455) Do not expose hard-deleted docs in Lucene history (elastic#32333) Tests: Fix convert error tests to use fixed value (elastic#32415) IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (elastic#32374) REST high-level client: parse back _ignored meta field (elastic#32362) [CI] Mute DocumentSubsetReaderTests testSearch Reject follow request if following setting not enabled on follower (elastic#32448) TEST: testDocStats should always use forceMerge (elastic#32450) TEST: avoid merge in testSegmentMemoryTrackedInBreaker TEST: Avoid deletion in FlushIT AwaitsFix IndexShardTests#testDocStats Painless: Add method type to method. (elastic#32441) Remove reference to non-existent store type (elastic#32418) [TEST] Mute failing FlushIT test Fix ordering of bootstrap checks in docs (elastic#32417) [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints Validate source of an index in LuceneChangesSnapshot (elastic#32288) [TEST] Mute failing testConvertLongHexError bump lucene version after backport Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (elastic#32390) ...
Today when reading operation history in Lucene, we read all documents. However, if indexing a document is aborted, IndexWriter will hard-delete it; we, therefore, need to exclude that document from Lucene history. This commit makes sure that we exclude aborted documents by using the hard liveDocs of a SegmentReader if there are deletes.
Today when reading operation history in Lucene, we read all documents.
However, if indexing a document is aborted, IndexWriter will hard-delete
it; we, therefore, need to exclude that document from Lucene history.
This commit makes sure that we exclude aborted documents by using the
hard liveDocs of a SegmentReader if there are deletes.
Closes #32269