-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add method to Engine to fetch max seq no of given SegmentInfos #5970
Add method to Engine to fetch max seq no of given SegmentInfos #5970
Conversation
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #5970 +/- ##
============================================
- Coverage 70.95% 70.78% -0.17%
+ Complexity 58829 58717 -112
============================================
Files 4771 4771
Lines 280817 280830 +13
Branches 40568 40571 +3
============================================
- Hits 199253 198790 -463
- Misses 65238 65702 +464
- Partials 16326 16338 +12
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Gradle Check (Jenkins) Run Completed with:
|
* This method fetches the _id of last indexed document that was part of refresh and | ||
* retrieves the _seq_no of the document. | ||
*/ | ||
public long getMaxSeqNoRefreshed(String source) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been considering a similar change for segrep as well where we fetch the max seqNo contained in a set of segments, so that our seqNo passed to replicas is accurate to whats in the infos. A few questions/thoughts.
- Should we use a similar query to restoreVersionmapAndCheckpointTracker that uses a range query with a starting seqNo?
- IE is concurrently refreshing, meaning we could have a mismatch here depending on when the infos are fetched and this method is invoked while using acquireSearcher on the engine. What if we constructed a reader/IndexSearcher directly from the latest (to be uploaded) infos? Similar to NRTReplicationReaderManager:
try (GatedCloseable<SegmentInfos> segmentInfosGatedCloseable = indexShard.getSegmentInfosSnapshot()) {
SegmentInfos segmentInfos = segmentInfosGatedCloseable.get();
DirectoryReader innerReader = StandardDirectoryReader.open(referenceToRefresh.directory(), segmentInfos, null, null);
final IndexSearcher searcher = new IndexSearcher(innerReader);
...
}
If we had a method that does this search based on the infos, we would reuse it with SR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a similar query to restoreVersionmapAndCheckpointTracker that uses a range query with a starting seqNo?
The query used in restoreVersionmapAndCheckpointTracker
would be a bit expensive than the one we are using here. Any particular reason for using the same query?
IE is concurrently refreshing, meaning we could have a mismatch here depending on when the infos are fetched and this method is invoked while using acquireSearcher on the engine.
Right. Currently, this method does not make any assumption on SegmentInfos and how it is used in conjunction with either SegRep or Remote Store. The method always return the maxSeqNo based on searcher's view of segments. I will add more info on the javadoc clarifying this.
But I got your point and is valid for remote store as well. The solution you provided would work.
I am thinking of overloading this method which takes IndexSearcher
as input. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have to experiment but my thinking is the range query would be less expensive than the matchall, but I am no expert here in this optimization.
Overloading and taking in an IndexSearcher or even a SegmentInfos would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on 2nd point which Marc brought up. I think it makes sense to build methods that takes the segment infos that we are interested in. This way it is more deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested review from @msfroh to understand the performance implications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added method to accept SegmentInfos
. We no longer require to fetch max seq number of last refresh (if required, can be added later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the PR title accordingly.
@@ -2026,4 +2069,5 @@ public long getMaxSeenAutoIdTimestamp() { | |||
* to advance this marker to at least the given sequence number. | |||
*/ | |||
public abstract void advanceMaxSeqNoOfUpdatesOrDeletes(long maxSeqNoOfUpdatesOnPrimary); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove this extra line.
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
fae6fbe
to
e8ae39d
Compare
Gradle Check (Jenkins) Run Completed with:
|
Following is the perf comparison: Setup
Results
|
@mch2 Please review the perf test results posted above. I have specifically checked for refresh time and indexing throughput and haven't observed much degradation. |
The user data is only updated at lucene commit / flush time, so I wouldn't expect this would impact refresh times / throughput much. Are you computing user data in your test in a different spot? With that said I see no impact to cumulative flush time. I think we should do a query comparison of the matchall vs range but think its ok to start as is. We can do this after integrating with SR & remote store flows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this change @sachinpkale. Lets discuss in a separate issue refactoring our checkpoint listeners so we aren't computing infos snapshots & this seqNo twice per refresh.
* Add method to Engine to fetch max seq no of last refresh Co-authored-by: Sachin Kale <[email protected]> (cherry picked from commit 750bfc6) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…earch-project#5970) * Add method to Engine to fetch max seq no of last refresh Co-authored-by: Sachin Kale <[email protected]>
…earch-project#5970) * Add method to Engine to fetch max seq no of last refresh Co-authored-by: Sachin Kale <[email protected]>
Signed-off-by: Sachin Kale [email protected]
Description
Engine
class does not expose a method which provides max sequence number that was part of last refresh.InternalEngine.lastRefreshedCheckpoint()
does not guarantee to provide sequence number that is part of refreshed segments._seq_no
field is part of each indexed document. We can use this to query refreshed segments and fetch the last sequence number.Issues Resolved
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.