Use IndexInput#prefetch for terms dictionary lookups. #13359

Merged: 17 commits, Jul 10, 2024
@@ -31,6 +31,8 @@
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefBuilder;
 import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.fst.BytesRefFSTEnum;
+import org.apache.lucene.util.fst.BytesRefFSTEnum.InputOutput;
 import org.apache.lucene.util.fst.FST;
 import org.apache.lucene.util.fst.Util;

@@ -307,6 +309,31 @@ private boolean setEOF() {
     return true;
   }

+  @Override
+  public void prepareSeekExact(BytesRef target) throws IOException {
+    if (fr.index == null) {
+      throw new IllegalStateException("terms index was not loaded");
+    }
+
+    if (fr.size() == 0 || target.compareTo(fr.getMin()) < 0 || target.compareTo(fr.getMax()) > 0) {
+      return;
+    }
+
+    // TODO: should we try to reuse the current state of this terms enum when applicable?
+    BytesRefFSTEnum<BytesRef> indexEnum = new BytesRefFSTEnum<>(fr.index);
+    InputOutput<BytesRef> output = indexEnum.seekFloor(target);
+    final long code =
+        fr.readVLongOutput(
+            new ByteArrayDataInput(
+                output.output.bytes, output.output.offset, output.output.length));
+    final long fpSeek = code >>> Lucene90BlockTreeTermsReader.OUTPUT_FLAGS_NUM_BITS;
+    initIndexInput();
+    final long fp = in.getFilePointer();
+    in.seek(fpSeek);
+    in.prefetch(1); // TODO: could we know the length of the block?
+    in.seek(fp); // TODO: do we actually need to do this?
Member:
I really don't like these calls to seek() just to prefetch data. Since it is just prefetching, I'd prefer if this "dance" were an implementation detail, if needed. It would make the code simpler to pass a parameter to prefetch rather than do this. Then it is clear that the default implementation won't cause harm (unnecessary I/O) for any Directory subclass.

So I think prefetch should take the location as an argument? It is just a hint, not real I/O performed by the calling thread. It is intentionally not sequential, and a sequential API for it only hurts.

Contributor Author:
I opened #13363.

+  }

   @Override
   public boolean seekExact(BytesRef target) throws IOException {

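The review thread above objects to the seek/prefetch/seek-back "dance". The difference between the two API shapes can be sketched with a toy input class; ToyInput, prefetchWithDance, and prefetchAtLocation are illustrative names for this sketch, not Lucene's actual IndexInput API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index input: tracks a file pointer and records prefetch hints.
class ToyInput {
    private long filePointer;
    final List<long[]> prefetched = new ArrayList<>();

    void seek(long fp) { filePointer = fp; }

    long getFilePointer() { return filePointer; }

    // Sequential style: hints at bytes starting from the current file pointer,
    // forcing callers into the seek/prefetch/seek-back dance.
    void prefetch(long length) { prefetched.add(new long[] {filePointer, length}); }

    // Location-taking style suggested by the reviewer: a pure hint that leaves
    // the file pointer untouched.
    void prefetch(long offset, long length) { prefetched.add(new long[] {offset, length}); }
}

public class PrefetchShapes {
    // The dance: save the pointer, seek to the block, hint, restore.
    static void prefetchWithDance(ToyInput in, long blockFP) {
        long fp = in.getFilePointer();
        in.seek(blockFP);
        in.prefetch(1);
        in.seek(fp);
    }

    // The suggested alternative: one call, no pointer mutation.
    static void prefetchAtLocation(ToyInput in, long blockFP) {
        in.prefetch(blockFP, 1);
    }

    public static void main(String[] args) {
        ToyInput in = new ToyInput();
        in.seek(42);
        prefetchWithDance(in, 1000);
        prefetchAtLocation(in, 2000);
        System.out.println(in.getFilePointer()); // 42 in both cases
        System.out.println(in.prefetched.size());
    }
}
```

Both styles record the same hint; only the location-taking one avoids mutating reader state, which is why the default implementation is trivially safe for any subclass.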
38 changes: 32 additions & 6 deletions lucene/core/src/java/org/apache/lucene/index/TermStates.java
@@ -17,10 +17,12 @@
 package org.apache.lucene.index;

 import java.io.IOException;
+import java.io.UncheckedIOException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
 import java.util.concurrent.Callable;
+import java.util.function.Supplier;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.TaskExecutor;

@@ -179,15 +181,39 @@ public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
   * @return the {@link TermState} for the given readers ord or <code>null</code> if no {@link
   *     TermState} for the reader was registered
   */
-  public TermState get(LeafReaderContext ctx) throws IOException {
+  public Supplier<TermState> get(LeafReaderContext ctx) throws IOException {
Member:
This sort of reminds me of two-phase commit, except at read time rather than write time: we now break these I/O-heavy read APIs into two phases, where step 1 declares the intention to get X soon (allowing prefetch to happen concurrently, not just in the background of the calling thread, across the N different Xs we want to retrieve). Step 2 then blocks on the I/O to retrieve each of the N Xs. Two-phased reads!

     assert ctx.ord >= 0 && ctx.ord < states.length;
-    if (term == null) return states[ctx.ord];
+    if (term == null) return () -> states[ctx.ord];
     if (this.states[ctx.ord] == null) {
-      TermsEnum te = loadTermsEnum(ctx, term);
-      this.states[ctx.ord] = te == null ? EMPTY_TERMSTATE : te.termState();
+      final Terms terms = Terms.getTerms(ctx.reader(), term.field());
+      final TermsEnum termsEnum = terms.iterator();
+      termsEnum.prepareSeekExact(term.bytes());
+      return () -> {
+        if (this.states[ctx.ord] == null) {
+          try {
+            TermState state = null;
+            if (termsEnum.seekExact(term.bytes())) {
+              state = termsEnum.termState();
+            }
+            this.states[ctx.ord] = state == null ? EMPTY_TERMSTATE : state;
+          } catch (IOException e) {
+            throw new UncheckedIOException(e);
+          }
+        }
+        TermState state = this.states[ctx.ord];
+        if (state == EMPTY_TERMSTATE) {
+          return null;
+        }
+        return state;
+      };
     }
-    if (this.states[ctx.ord] == EMPTY_TERMSTATE) return null;
-    return this.states[ctx.ord];
+    return () -> {
+      TermState state = this.states[ctx.ord];
+      if (state == EMPTY_TERMSTATE) {
+        return null;
+      }
+      return state;
+    };
   }

/**
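The "two-phased reads" idea behind the new Supplier-returning get() can be sketched outside Lucene with a memoizing supplier: phase one issues a cheap prefetch hint and returns immediately, phase two performs the blocking read at most once. All names here (prepareLookup, the ioLog list) are illustrative, not Lucene APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class TwoPhasedReads {
    // Phase 1: issue a cheap, non-blocking prefetch hint for the key and hand
    // back a supplier. Phase 2: get() performs the blocking read, at most once.
    static Supplier<String> prepareLookup(String key, List<String> ioLog) {
        ioLog.add("prefetch " + key); // hint only; returns immediately
        return new Supplier<>() {
            private String cached;

            @Override
            public String get() {
                if (cached == null) {
                    ioLog.add("read " + key); // the actual blocking read
                    cached = "value-of-" + key;
                }
                return cached;
            }
        };
    }

    public static void main(String[] args) {
        List<String> ioLog = new ArrayList<>();
        List<Supplier<String>> pending = new ArrayList<>();
        // Phase 1 across all keys first, so the hints can overlap in the I/O layer...
        for (String key : new String[] {"a", "b"}) {
            pending.add(prepareLookup(key, ioLog));
        }
        // ...then phase 2 resolves each supplier.
        for (Supplier<String> s : pending) {
            s.get();
        }
        System.out.println(ioLog); // every prefetch precedes every read
    }
}
```

The memoization mirrors the states[ctx.ord] caching in the patch: calling get() twice must not repeat the I/O.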
10 changes: 10 additions & 0 deletions lucene/core/src/java/org/apache/lucene/index/TermsEnum.java
@@ -17,6 +17,7 @@
 package org.apache.lucene.index;

 import java.io.IOException;
+import org.apache.lucene.store.IndexInput;
 import org.apache.lucene.util.AttributeSource;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefIterator;
@@ -61,6 +62,15 @@ public enum SeekStatus {
   */
  public abstract boolean seekExact(BytesRef text) throws IOException;

+  /**
+   * Prepare a future call to {@link #seekExact}. This typically calls {@link IndexInput#prefetch}
+   * on the right range of bytes under the hood so that the next call to {@link #seekExact} is
+   * faster. This can be used to parallelize I/O across multiple terms by calling {@link
+   * #prepareSeekExact} on multiple terms enums before calling {@link #seekExact(BytesRef)} on the
+   * same {@link TermsEnum}s.
+   */
+  public void prepareSeekExact(BytesRef text) throws IOException {}
Member:
Can we look into subclasses such as FilterLeafReader.FilterTermsEnum to make sure this new method behaves correctly?

  /**
   * Seeks to the specified term, if it exists, or to the next (ceiling) term. Returns SeekStatus to
   * indicate whether exact term was found, a different term was found, or EOF was hit. The target
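The intended calling pattern for prepareSeekExact, and the contract implied by the no-op default (a hint that must never change the answer of seekExact), can be sketched with a toy enum; ToyTermsEnum and MapTermsEnum are hypothetical stand-ins for this sketch, not Lucene classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the API contract: prepareSeekExact is a best-effort hint with a
// no-op default, and seekExact must return the same answer with or without it.
abstract class ToyTermsEnum {
    public void prepareSeekExact(String term) {} // mirrors the new no-op default

    public abstract boolean seekExact(String term);
}

public class PrepareSeekDemo {
    static class MapTermsEnum extends ToyTermsEnum {
        final Map<String, Long> terms;
        final List<String> hints = new ArrayList<>();

        MapTermsEnum(Map<String, Long> terms) { this.terms = terms; }

        @Override
        public void prepareSeekExact(String term) {
            hints.add(term); // a real impl would issue an I/O prefetch here
        }

        @Override
        public boolean seekExact(String term) { return terms.containsKey(term); }
    }

    public static void main(String[] args) {
        List<MapTermsEnum> enums = List.of(
            new MapTermsEnum(Map.of("lucene", 1L)),
            new MapTermsEnum(Map.of("prefetch", 2L)));
        // Hint every enum first so the I/O could overlap...
        for (MapTermsEnum te : enums) te.prepareSeekExact("lucene");
        // ...then do the actual (blocking) seeks.
        for (MapTermsEnum te : enums) System.out.println(te.seekExact("lucene"));
    }
}
```

Hinting all enums before seeking any of them is exactly the parallelize-I/O pattern the javadoc above describes.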
@@ -316,7 +316,7 @@ private static TermStates adjustFrequencies(
List<LeafReaderContext> leaves = readerContext.leaves();
TermStates newCtx = new TermStates(readerContext);
for (int i = 0; i < leaves.size(); ++i) {
-      TermState termState = ctx.get(leaves.get(i));
+      TermState termState = ctx.get(leaves.get(i)).get();
if (termState == null) {
continue;
}
@@ -277,7 +277,7 @@ protected PhraseMatcher getPhraseMatcher(
List<PostingsEnum> postings = new ArrayList<>();

for (Term term : terms) {
-      TermState termState = termStates.get(term).get(context);
+      TermState termState = termStates.get(term).get(context).get();
if (termState != null) {
termsEnum.seekExact(term.bytes(), termState);
postings.add(
@@ -501,7 +501,7 @@ protected PhraseMatcher getPhraseMatcher(

for (int i = 0; i < terms.length; i++) {
final Term t = terms[i];
-      final TermState state = states[i].get(context);
+      final TermState state = states[i].get(context).get();
if (state == null) {
/* term doesnt exist in this segment */
assert termNotInReader(reader, t) : "no termstate found but term exists in reader";
@@ -281,7 +281,7 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
List<ImpactsEnum> impacts = new ArrayList<>();
List<Float> termBoosts = new ArrayList<>();
for (int i = 0; i < terms.length; i++) {
-      TermState state = termStates[i].get(context);
+      TermState state = termStates[i].get(context).get();
if (state != null) {
TermsEnum termsEnum = context.reader().terms(field).iterator();
termsEnum.seekExact(terms[i].term, state);
37 changes: 30 additions & 7 deletions lucene/core/src/java/org/apache/lucene/search/TermQuery.java
@@ -17,7 +17,9 @@
 package org.apache.lucene.search;

 import java.io.IOException;
+import java.io.UncheckedIOException;
 import java.util.Objects;
+import java.util.function.Supplier;
 import org.apache.lucene.index.IndexReaderContext;
 import org.apache.lucene.index.LeafReader;
 import org.apache.lucene.index.LeafReaderContext;
@@ -119,18 +121,34 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
: "The top-reader used to create Weight is not the same as the current reader's top-reader ("
+ ReaderUtil.getTopLevelContext(context);

-    final TermsEnum termsEnum = getTermsEnum(context);
-    if (termsEnum == null) {
-      return null;
-    }
-    final int docFreq = termsEnum.docFreq();
+    final Supplier<TermState> stateSupplier = termStates.get(context);

    return new ScorerSupplier() {

+      private TermsEnum termsEnum;
      private boolean topLevelScoringClause = false;

+      private TermsEnum getTermsEnum() throws IOException {
+        if (termsEnum == null) {
+          TermState state = stateSupplier.get();
+          if (state == null) {
+            return null;
+          }
+          termsEnum = context.reader().terms(term.field()).iterator();
+          termsEnum.seekExact(term.bytes(), state);
+        }
+        return termsEnum;
+      }
+
      @Override
      public Scorer get(long leadCost) throws IOException {
+        TermsEnum termsEnum = getTermsEnum();
+        if (termsEnum == null) {
+          // nocommit: should we start allowing ScorerSupplier#get to return null?
+          return new ConstantScoreScorer(
+              TermWeight.this, 0f, scoreMode, DocIdSetIterator.empty());
+        }

        LeafSimScorer scorer =
            new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores());
        if (scoreMode == ScoreMode.TOP_SCORES) {
@@ -150,7 +168,12 @@ public Scorer get(long leadCost) throws IOException {

      @Override
      public long cost() {
-        return docFreq;
+        try {
+          TermsEnum te = getTermsEnum();
Member:
Hmm, this getter got more costly. It's too bad TermState is so opaque: under the hood it (BlockTermState) is already storing docFreq.

Contributor Author:
Note that we were already pulling a TermsEnum to get the cost before; it just happened before creating the ScorerSupplier. So the additional cost here is only the null check on termsEnum inside getTermsEnum().

Agreed that it's a pity to pull a terms enum only to get a cost that is already encapsulated in the term state, though I don't expect it to be a major issue in practice.

+          return te == null ? 0 : te.docFreq();
+        } catch (IOException e) {
+          throw new UncheckedIOException(e);
+        }
      }

      @Override
@@ -183,7 +206,7 @@ private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
    assert termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context))
        : "The top-reader used to create Weight is not the same as the current reader's top-reader ("
            + ReaderUtil.getTopLevelContext(context);
-    final TermState state = termStates.get(context);
+    final TermState state = termStates.get(context).get();
    if (state == null) { // term is not present in that reader
      assert termNotInReader(context.reader(), term)
          : "no termstate found but term exists in reader term=" + term;
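The cost() change above follows a common lazy-memoization idiom: a nullable field guards the expensive, possibly I/O-bound initialization, and the checked IOException is tunneled through UncheckedIOException because cost() is not declared to throw. A stripped-down sketch of the idiom, with illustrative names rather than Lucene's:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class LazyCost {
    interface Loader {
        int docFreq() throws IOException;
    }

    static class CostSupplier {
        private final Loader loader;
        private Integer docFreq; // memoized after the first, possibly I/O-bound, load

        CostSupplier(Loader loader) { this.loader = loader; }

        long cost() {
            if (docFreq == null) {
                try {
                    docFreq = loader.docFreq();
                } catch (IOException e) {
                    // cost() is not declared to throw, so tunnel the checked exception
                    throw new UncheckedIOException(e);
                }
            }
            return docFreq;
        }
    }

    public static void main(String[] args) {
        int[] loads = {0};
        CostSupplier c = new CostSupplier(() -> {
            loads[0]++;
            return 7;
        });
        System.out.println(c.cost()); // 7; the loader runs once
        System.out.println(c.cost()); // 7 again, served from the memoized field
        System.out.println(loads[0]); // 1
    }
}
```

Memoizing keeps repeated cost() calls as cheap as the old docFreq field after the first invocation, which is the trade-off discussed in the thread above.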
@@ -135,7 +135,7 @@ public Spans getSpans(final LeafReaderContext context, Postings requiredPostings
: "The top-reader used to create Weight is not the same as the current reader's top-reader ("
+ ReaderUtil.getTopLevelContext(context);

-    final TermState state = termStates.get(context);
+    final TermState state = termStates.get(context).get();
if (state == null) { // term is not present in that reader
assert context.reader().docFreq(term) == 0
: "no termstate found but term exists in reader term=" + term;
@@ -404,7 +404,7 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
List<PostingsEnum> iterators = new ArrayList<>();
List<FieldAndWeight> fields = new ArrayList<>();
for (int i = 0; i < fieldTerms.length; i++) {
-      TermState state = termStates[i].get(context);
+      TermState state = termStates[i].get(context).get();
if (state != null) {
TermsEnum termsEnum = context.reader().terms(fieldTerms[i].field()).iterator();
termsEnum.seekExact(fieldTerms[i].bytes(), state);
@@ -387,7 +387,7 @@ protected int collectSingleTermData(
Terms terms = leafReaderContext.reader().terms(term.field());
if (terms != null) {
checkTermsHavePositions(terms);
-        TermState termState = termStates.get(leafReaderContext);
+        TermState termState = termStates.get(leafReaderContext).get();
if (termState != null) {
termMatchesInSegment = true;
numMatches++;
@@ -415,7 +415,7 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
: "The top-reader used to create Weight is not the same as the current reader's top-reader ("
+ ReaderUtil.getTopLevelContext(context);
BytesRef term = idToTerm.get(ent.getKey());
-      TermState state = termStates.get(context);
+      TermState state = termStates.get(context).get();
if (state != null) {
TermsEnum termsEnum = context.reader().terms(field).iterator();
termsEnum.seekExact(term, state);