Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-12697 Add pure DocValues support to FieldValueFeature #123

Merged
merged 30 commits into from
May 28, 2021
Merged
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2ee8779
[SOLR-12697] add DocValuesFieldValueFeatureScorer to read docValues f…
May 12, 2021
bdce029
[SOLR-12697] formatting changes
May 12, 2021
e6601ee
[SOLR-12697] only apply new scorer to fields that are not stored
May 12, 2021
d6e1477
[SOLR-12697] remove BINARY case because it is not supported
May 12, 2021
5bc995c
[SOLR-12697] only pass fieldType to constructor; determine numberType…
May 13, 2021
4559415
[SOLR-12697] remove - from fieldnames; randomize indexing order for d…
May 13, 2021
ec4cbfb
[SOLR-12697] determine docValuesType before creating DocValuesFieldVa…
May 13, 2021
e6f20f1
split dual-purpose DocValuesFieldValueFeatureScorer into two
cpoerschke May 14, 2021
f16ce3d
add TestFieldValueFeature test coverage (with caveat)
cpoerschke May 14, 2021
e5954eb
[SOLR-12697] remove method to read sorted values from Scorer for nume…
May 16, 2021
da6a635
[SOLR-12697] add fallback feature scorer that always returns the defa…
May 17, 2021
b105627
[SOLR-12697] test that exception is thrown for unsupported dv type, t…
May 19, 2021
e07c432
[SOLR-12697] add tests for parsing different sortedDocValues, add ent…
May 19, 2021
443a396
solr/CHANGES.txt edit
cpoerschke May 20, 2021
2dbd94e
in TestFieldValueFeature reduce potential test interaction
cpoerschke May 20, 2021
9b77154
in FieldValueFeature clarify 'searcher instanceof SolrIndexSearcher' use
cpoerschke May 20, 2021
c1f3a8e
TestFieldValueFeature: replace dvBoolPopularity with dvIsTrendy (form…
cpoerschke May 20, 2021
3c38e91
out-scope TestLTRReRankingPipeline changes
cpoerschke May 20, 2021
53cd2fb
FieldValueFeature: mention stored=true or docValues=true in javadocs
cpoerschke May 20, 2021
e854f50
FieldValueFeature polishes:
cpoerschke May 20, 2021
b9d3cd0
[SOLR-12697] add javadoc to explain which type of FieldValueFeatureSc…
May 20, 2021
da57e9c
Merge remote-tracking branch 'github_tomglk/jira/SOLR-12697' into jir…
cpoerschke May 20, 2021
abb3632
Revert "out-scope TestLTRReRankingPipeline changes"
cpoerschke May 20, 2021
a789b12
fix for SOLR-11134
cpoerschke May 20, 2021
c42be54
[SOLR-12697] out-scope TestLTRReRankingPipeline
May 21, 2021
83bc1ee
apologies, multiple TestFieldValueFeature polishes in one commit, app…
cpoerschke May 24, 2021
385d8b2
add TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed()
cpoerschke May 25, 2021
4348d04
Merge remote-tracking branch 'origin/main' into jira/SOLR-12697
cpoerschke May 25, 2021
ad489d0
small TestLTROnSolrCloud polish:
cpoerschke May 26, 2021
2c3a368
Merge branch 'main' into jira/SOLR-12697
cpoerschke May 28, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions solr/CHANGES.txt
Original file line number Diff line number Diff line change
Expand Up @@ -331,6 +331,9 @@ New Features

* SOLR-15397: Expose zookeeper status in the Prometheus exporter (janhoy)

* SOLR-12697: In contrib/ltr FieldValueFeature support "stored=false docValues=true" a.k.a. pure DocValues fields.
(Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)

Improvements
---------------------
* SOLR-15081: Metrics for a core: add SolrCloud "isLeader" and "replicaState". (David Smiley)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,16 +23,26 @@
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.BoolField;
import org.apache.solr.schema.NumberType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.SolrIndexSearcher;

/**
* This feature returns the value of a field in the current document
* This feature returns the value of a field in the current document.
* The field must have stored="true" or docValues="true" properties.
* Example configuration:
* <pre>{
"name": "rawHits",
Expand All @@ -41,6 +51,17 @@
"field": "hits"
}
}</pre>
*
* <p>There are 4 different types of FeatureScorers that a FieldValueFeatureWeight may use.
* The chosen scorer depends on the field attributes.</p>
*
* <p>FieldValueFeatureScorer (FVFS): used for stored=true, no matter if docValues=true or docValues=false</p>
*
* <p>NumericDocValuesFVFS: used for stored=false and docValues=true, if docValueType == NUMERIC</p>
* <p>SortedDocValuesFVFS: used for stored=false and docValues=true, if docValueType == SORTED
*
* <p>DefaultValueFVFS: used for stored=false and docValues=true, a fallback scorer that is used on segments
* where no document has a value set in the field of this feature</p>
*/
public class FieldValueFeature extends Feature {

Expand Down Expand Up @@ -83,18 +104,52 @@ public FeatureWeight createWeight(IndexSearcher searcher, boolean needsScores,
}

public class FieldValueFeatureWeight extends FeatureWeight {
private final SchemaField schemaField;

public FieldValueFeatureWeight(IndexSearcher searcher,
SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi) {
super(FieldValueFeature.this, searcher, request, originalQuery, efi);
if (searcher instanceof SolrIndexSearcher) {
schemaField = ((SolrIndexSearcher) searcher).getSchema().getFieldOrNull(field);
} else { // some tests pass a null or a non-SolrIndexSearcher searcher
schemaField = null;
}
}

/**
* Return a FeatureScorer that uses docValues or storedFields if no docValues are present
*
* @param context the segment this FeatureScorer is working with
* @return FeatureScorer for the current segment and field
* @throws IOException as defined by abstract class Feature
*/
@Override
public FeatureScorer scorer(LeafReaderContext context) throws IOException {
if (schemaField != null && !schemaField.stored() && schemaField.hasDocValues()) {

final FieldInfo fieldInfo = context.reader().getFieldInfos().fieldInfo(field);
final DocValuesType docValuesType = fieldInfo != null ? fieldInfo.getDocValuesType() : DocValuesType.NONE;

if (DocValuesType.NUMERIC.equals(docValuesType)) {
return new NumericDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS), schemaField.getType().getNumberType());
} else if (DocValuesType.SORTED.equals(docValuesType)) {
return new SortedDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
} else if (DocValuesType.NONE.equals(docValuesType)) {
// Using a fallback feature scorer because this segment has no documents with a doc value for the current field
return new DefaultValueFieldValueFeatureScorer(this, DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
}
Comment on lines +133 to +142
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/2 ... combined with here diverting any unexpected getNumberType() enums onto the "is not supported" code path? Although having said that, I don't know what the existing FieldValueFeatureScorer would return for a stored DATE field i.e. perhaps back compat considerations apply, hmm.

throw new IllegalArgumentException("Doc values type " + docValuesType.name() + " of field " + field
+ " is not supported");
}
return new FieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
}

/**
* A FeatureScorer that reads the stored value for a field
*/
public class FieldValueFeatureScorer extends FeatureScorer {

LeafReaderContext context = null;
Expand Down Expand Up @@ -146,5 +201,137 @@ public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}

/**
* A FeatureScorer that reads the numeric docValues for a field
*/
public final class NumericDocValuesFieldValueFeatureScorer extends FeatureScorer {
private final NumericDocValues docValues;
private final NumberType numberType;

public NumericDocValuesFieldValueFeatureScorer(final FeatureWeight weight, final LeafReaderContext context,
final DocIdSetIterator itr, final NumberType numberType) {
super(weight, itr);
this.numberType = numberType;

NumericDocValues docValues;
try {
docValues = DocValues.getNumeric(context.reader(), field);
} catch (IOException e) {
throw new IllegalArgumentException("Could not read numeric docValues for field " + field);
}
this.docValues = docValues;
}

@Override
public float score() throws IOException {
if (docValues.advanceExact(itr.docID())) {
return readNumericDocValues();
}
return FieldValueFeature.this.getDefaultValue();
}

/**
* Read the numeric value for a field and convert the different number types to float.
*
* @return The numeric value that the docValues contain for the current document
* @throws IOException if docValues cannot be read
*/
private float readNumericDocValues() throws IOException {
if (NumberType.FLOAT.equals(numberType)) {
// convert float value that was stored as long back to float
return Float.intBitsToFloat((int) docValues.longValue());
} else if (NumberType.DOUBLE.equals(numberType)) {
// handle double value conversion
return (float) Double.longBitsToDouble(docValues.longValue());
}
// just take the long value
return docValues.longValue();
}
Comment on lines +240 to +250
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/2 So I was trying to learn and understand better w.r.t. why the intBitsToFloat and longBitsToDouble are the conversions to use here and that led to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L603-L644 and the realisation that DATE is a numberType possibility!

Would you have any thoughts on using a switch statement here with FLOAT/DOUBLE/INT/LONG all valid and to throw an exception for anything else ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat strangely so perhaps the use of DATE fields "just works" for both the existing stored and the new docValues implementations. I've added a TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed test to demonstrate that and so then no need to change the implementation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... the date being a numeric type really is quite unintuitive.
I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. :/

I will go and make a tea and then look into that.
Thank you for adding the test for the date field. The new clarity on what value is returned for the date type is very good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Thanks github for not refreshing the page... 🙄
After looking into that I saw the same behavior as you did and was just about to comment that. :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. ...

Likewise, I would have preferred for there to be some existing reusable method and was surprised that there isn't one (or not an obviously findable one). The SolrDocumentFetcher one has the sortableNumeric flag (which we don't need here) and some special logic for LatLonPointSpatialField and AbstractEnumField scenarios (also not applicable here) ... otherwise it would have been a clear "okay, let's factor out a method here and put it somewhere that both SolrDocumentFeature and FieldValueFeature can use it" kind of scenario. Oh well.


@Override
public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}

/**
* A FeatureScorer that reads the sorted docValues for a field
*/
public final class SortedDocValuesFieldValueFeatureScorer extends FeatureScorer {
private final SortedDocValues docValues;

public SortedDocValuesFieldValueFeatureScorer(final FeatureWeight weight, final LeafReaderContext context,
final DocIdSetIterator itr) {
super(weight, itr);

SortedDocValues docValues;
try {
docValues = DocValues.getSorted(context.reader(), field);
} catch (IOException e) {
throw new IllegalArgumentException("Could not read sorted docValues for field " + field);
}
this.docValues = docValues;
}

@Override
public float score() throws IOException {
if (docValues.advanceExact(itr.docID())) {
int ord = docValues.ordValue();
return readSortedDocValues(docValues.lookupOrd(ord));
}
return FieldValueFeature.this.getDefaultValue();
}

/**
* Interprets the bytesRef either as true / false token or tries to read it as number string
*
* @param bytesRef the value of the field that should be used as score
* @return the input converted to a number
*/
private float readSortedDocValues(BytesRef bytesRef) {
String string = bytesRef.utf8ToString();
if (string.length() == 1) {
// boolean values in the index are encoded with the
// a single char contained in TRUE_TOKEN or FALSE_TOKEN
// (see BoolField)
if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
return 1;
}
if (string.charAt(0) == BoolField.FALSE_TOKEN[0]) {
return 0;
}
}
return FieldValueFeature.this.getDefaultValue();
}

@Override
public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}

/**
* A FeatureScorer that always returns the default value.
*
* It is used as a fallback for cases when a segment does not have any documents that contain doc values for a field.
* By doing so, we prevent a fallback to the FieldValueFeatureScorer, which would also return the default value but
* in a less performant way because it would first try to read the stored fields for the doc (which aren't present).
*/
public final class DefaultValueFieldValueFeatureScorer extends FeatureScorer {
public DefaultValueFieldValueFeatureScorer(final FeatureWeight weight, final DocIdSetIterator itr) {
super(weight, itr);
}

@Override
public float score() throws IOException {
return FieldValueFeature.this.getDefaultValue();
}

@Override
public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}
}
}
27 changes: 27 additions & 0 deletions solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,28 @@
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: via the docValues="${solr.tests.numeric.dv}" in the <fieldType name="int" ... and popularity being used in the TestFieldValueFeature we have some existing test coverage for doc values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

<field name="dvIntPopularity" type="int" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvLongPopularity" type="long" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvFloatPopularity" type="float" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvDoublePopularity" type="double" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStringPopularity" type="string" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStringPopularities" type="string" indexed="false" docValues="true" stored="false" multiValued="true"/>
<field name="normHits" type="float" indexed="true" stored="true" />

<field name="isTrendy" type="boolean" indexed="true" stored="true" />
<field name="dvIsTrendy" type="boolean" indexed="true" stored="false" docValues="true"/>

<field name="dvIntField" type="int" indexed="false" docValues="true" stored="false" default="-1" multiValued="false"/>
<field name="dvLongField" type="long" indexed="false" docValues="true" stored="false" default="-2" multiValued="false"/>
<field name="dvFloatField" type="float" indexed="false" docValues="true" stored="false" default="-3" multiValued="false"/>
<field name="dvDoubleField" type="double" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStrNumField" type="string" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStrBoolField" type="boolean" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvDateField" type="date" indexed="false" docValues="true" stored="false" multiValued="false"/>

<field name="noDvFloatField" type="float" indexed="false" docValues="false" stored="true" multiValued="false"/>
<field name="noDvStrNumField" type="string" indexed="false" docValues="false" stored="true" multiValued="false"/>
<field name="noDvDateField" type="date" indexed="false" docValues="false" stored="true" multiValued="false"/>

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
Expand All @@ -41,6 +61,13 @@
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>

<copyField source="popularity" dest="dvIntPopularity"/>
<copyField source="popularity" dest="dvLongPopularity"/>
<copyField source="popularity" dest="dvFloatPopularity"/>
<copyField source="popularity" dest="dvDoublePopularity"/>
Comment on lines +64 to +67
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: copyField here works for the dv*Popularity fields but the dv(Int|Long|Float|Double)Field fields cannot use it since the defaulting logic and the popularity/10 logic is happening there.


<copyField source="isTrendy" dest="dvIsTrendy"/>

<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
Expand Down
Loading