Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-12697 Add pure DocValues support to FieldValueFeature #123

Merged
merged 30 commits into from
May 28, 2021
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2ee8779
[SOLR-12697] add DocValuesFieldValueFeatureScorer to read docValues f…
May 12, 2021
bdce029
[SOLR-12697] formatting changes
May 12, 2021
e6601ee
[SOLR-12697] only apply new scorer to fields that are not stored
May 12, 2021
d6e1477
[SOLR-12697] remove BINARY case because it is not supported
May 12, 2021
5bc995c
[SOLR-12697] only pass fieldType to constructor; determine numberType…
May 13, 2021
4559415
[SOLR-12697] remove - from fieldnames; randomize indexing order for d…
May 13, 2021
ec4cbfb
[SOLR-12697] determine docValuesType before creating DocValuesFieldVa…
May 13, 2021
e6f20f1
split dual-purpose DocValuesFieldValueFeatureScorer into two
cpoerschke May 14, 2021
f16ce3d
add TestFieldValueFeature test coverage (with caveat)
cpoerschke May 14, 2021
e5954eb
[SOLR-12697] remove method to read sorted values from Scorer for nume…
May 16, 2021
da6a635
[SOLR-12697] add fallback feature scorer that always returns the defa…
May 17, 2021
b105627
[SOLR-12697] test that exception is thrown for unsupported dv type, t…
May 19, 2021
e07c432
[SOLR-12697] add tests for parsing different sortedDocValues, add ent…
May 19, 2021
443a396
solr/CHANGES.txt edit
cpoerschke May 20, 2021
2dbd94e
in TestFieldValueFeature reduce potential test interaction
cpoerschke May 20, 2021
9b77154
in FieldValueFeature clarify 'searcher instanceof SolrIndexSearcher' use
cpoerschke May 20, 2021
c1f3a8e
TestFieldValueFeature: replace dvBoolPopularity with dvIsTrendy (form…
cpoerschke May 20, 2021
3c38e91
out-scope TestLTRReRankingPipeline changes
cpoerschke May 20, 2021
53cd2fb
FieldValueFeature: mention stored=true or docValues=true in javadocs
cpoerschke May 20, 2021
e854f50
FieldValueFeature polishes:
cpoerschke May 20, 2021
b9d3cd0
[SOLR-12697] add javadoc to explain which type of FieldValueFeatureSc…
May 20, 2021
da57e9c
Merge remote-tracking branch 'github_tomglk/jira/SOLR-12697' into jir…
cpoerschke May 20, 2021
abb3632
Revert "out-scope TestLTRReRankingPipeline changes"
cpoerschke May 20, 2021
a789b12
fix for SOLR-11134
cpoerschke May 20, 2021
c42be54
[SOLR-12697] out-scope TestLTRReRankingPipeline
May 21, 2021
83bc1ee
apologies, multiple TestFieldValueFeature polishes in one commit, app…
cpoerschke May 24, 2021
385d8b2
add TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed()
cpoerschke May 25, 2021
4348d04
Merge remote-tracking branch 'origin/main' into jira/SOLR-12697
cpoerschke May 25, 2021
ad489d0
small TestLTROnSolrCloud polish:
cpoerschke May 26, 2021
2c3a368
Merge branch 'main' into jira/SOLR-12697
cpoerschke May 28, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -23,24 +23,34 @@
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.BoolField;
import org.apache.solr.schema.NumberType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.SolrIndexSearcher;

/**
* This feature returns the value of a field in the current document
* Example configuration:
* <pre>{
"name": "rawHits",
"class": "org.apache.solr.ltr.feature.FieldValueFeature",
"params": {
"field": "hits"
}
}</pre>
* "name": "rawHits",
* "class": "org.apache.solr.ltr.feature.FieldValueFeature",
* "params": {
* "field": "hits",
* "defaultValue": -1
* }
* }</pre>
*/
tomglk marked this conversation as resolved.
Show resolved Hide resolved
public class FieldValueFeature extends Feature {

Expand All @@ -57,50 +67,79 @@ public void setField(String field) {
}

@Override
public LinkedHashMap<String,Object> paramsToMap() {
final LinkedHashMap<String,Object> params = defaultParamsToMap();
public LinkedHashMap<String, Object> paramsToMap() {
final LinkedHashMap<String, Object> params = defaultParamsToMap();
params.put("field", field);
return params;
}

@Override
protected void validate() throws FeatureException {
if (field == null || field.isEmpty()) {
throw new FeatureException(getClass().getSimpleName()+
": field must be provided");
throw new FeatureException(getClass().getSimpleName() + ": field must be provided");
}
}

public FieldValueFeature(String name, Map<String,Object> params) {
public FieldValueFeature(String name, Map<String, Object> params) {
super(name, params);
}

@Override
public FeatureWeight createWeight(IndexSearcher searcher, boolean needsScores,
SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi)
throws IOException {
public FeatureWeight createWeight(IndexSearcher searcher, boolean needsScores, SolrQueryRequest request,
Query originalQuery, Map<String, String[]> efi) throws IOException {
return new FieldValueFeatureWeight(searcher, request, originalQuery, efi);
}

public class FieldValueFeatureWeight extends FeatureWeight {
private final SchemaField schemaField;

public FieldValueFeatureWeight(IndexSearcher searcher,
SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi) {
SolrQueryRequest request, Query originalQuery, Map<String, String[]> efi) {
super(FieldValueFeature.this, searcher, request, originalQuery, efi);
if (searcher instanceof SolrIndexSearcher) {
schemaField = ((SolrIndexSearcher) searcher).getSchema().getFieldOrNull(field);
} else {
schemaField = null;
}
}

/**
* Return a FeatureScorer that uses docValues or storedFields if no docValues are present
*
* @param context the segment this FeatureScorer is working with
* @return FeatureScorer for the current segment and field
* @throws IOException as defined by abstract class Feature
*/
@Override
public FeatureScorer scorer(LeafReaderContext context) throws IOException {
if (schemaField != null && !schemaField.stored() && schemaField.hasDocValues()) {

FieldInfo fieldInfo = context.reader().getFieldInfos().fieldInfo(field);
DocValuesType docValuesType = fieldInfo != null ? fieldInfo.getDocValuesType() : DocValuesType.NONE;

if (DocValuesType.NUMERIC.equals(docValuesType)) {
return new NumericDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS), schemaField.getType().getNumberType());
} else if (DocValuesType.SORTED.equals(docValuesType)) {
return new SortedDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
// If type is NONE, this segment has no docs with this field. That's not a problem, because we won't call score() anyway
tomglk marked this conversation as resolved.
Show resolved Hide resolved
tomglk marked this conversation as resolved.
Show resolved Hide resolved
} else if (!DocValuesType.NONE.equals(docValuesType)) {
throw new IllegalArgumentException("Doc values type " + docValuesType.name() + " of field " + field
+ " is not supported!");
}
}
return new FieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
}

/**
* A FeatureScorer that reads the stored value for a field
*/
public class FieldValueFeatureScorer extends FeatureScorer {
LeafReaderContext context;

LeafReaderContext context = null;

public FieldValueFeatureScorer(FeatureWeight weight,
LeafReaderContext context, DocIdSetIterator itr) {
public FieldValueFeatureScorer(FeatureWeight weight, LeafReaderContext context, DocIdSetIterator itr) {
super(weight, itr);
this.context = context;
}
Expand All @@ -109,8 +148,7 @@ public FieldValueFeatureScorer(FeatureWeight weight,
public float score() throws IOException {

try {
final Document document = context.reader().document(itr.docID(),
fieldAsSet);
final Document document = context.reader().document(itr.docID(), fieldAsSet);
final IndexableField indexableField = document.getField(field);
if (indexableField == null) {
return getDefaultValue();
Expand All @@ -121,22 +159,18 @@ public float score() throws IOException {
} else {
final String string = indexableField.stringValue();
if (string.length() == 1) {
// boolean values in the index are encoded with the
// a single char contained in TRUE_TOKEN or FALSE_TOKEN
// boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
// (see BoolField)
if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
return 1;
return 1f;
}
if (string.charAt(0) == BoolField.FALSE_TOKEN[0]) {
return 0;
return 0f;
}
}
}
} catch (final IOException e) {
throw new FeatureException(
e.toString() + ": " +
"Unable to extract feature for "
+ name, e);
throw new FeatureException(e.toString() + ": " + "Unable to extract feature for " + name, e);
}
return getDefaultValue();
}
Expand All @@ -146,5 +180,141 @@ public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}

/**
* A FeatureScorer that reads the numeric docValues for a field
*/
public class NumericDocValuesFieldValueFeatureScorer extends FeatureWeight.FeatureScorer {
NumericDocValues docValues;
NumberType numberType;

public NumericDocValuesFieldValueFeatureScorer(final FeatureWeight weight, final LeafReaderContext context,
final DocIdSetIterator itr, final NumberType numberType) {
super(weight, itr);
this.numberType = numberType;

try {
docValues = DocValues.getNumeric(context.reader(), field);
} catch (IOException e) {
throw new IllegalArgumentException("Could not read numeric docValues for field " + field);
}
}

@Override
public float score() throws IOException {
if (docValues.advanceExact(itr.docID())) {
return readNumericDocValues();
}
return FieldValueFeature.this.getDefaultValue();
}

/**
* Read the numeric value for a field and convert the different number types to float.
*
* @return The numeric value that the docValues contain for the current document
* @throws IOException if docValues cannot be read
*/
private float readNumericDocValues() throws IOException {
if (NumberType.FLOAT.equals(numberType)) {
// convert float value that was stored as long back to float
return Float.intBitsToFloat((int) docValues.longValue());
} else if (NumberType.DOUBLE.equals(numberType)) {
// handle double value conversion
return (float) Double.longBitsToDouble(docValues.longValue());
}
// just take the long value
return docValues.longValue();
}
Comment on lines +240 to +250
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/2 So I was trying to learn and understand better w.r.t. why the intBitsToFloat and longBitsToDouble are the conversions to use here and that led to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L603-L644 and the realisation that DATE is a numberType possibility!

Would you have any thoughts on using a switch statement here with FLOAT/DOUBLE/INT/LONG all valid and to throw an exception for anything else ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat strangely so perhaps the use of DATE fields "just works" for both the existing stored and the new docValues implementations. I've added a TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed test to demonstrate that and so then no need to change the implementation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... the date being a numeric type really is quite unintuitive.
I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. :/

I will go and make a tea and then look into that.
Thank you for adding the test for the date field. The new clarity on what value is returned for the date type is very good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Thanks github for not refreshing the page... 🙄
After looking into that I saw the same behavior as you did and was just about to comment that. :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. ...

Likewise, I would have preferred for there to be some existing reusable method and was surprised that there isn't one (or not an obviously findable one). The SolrDocumentFetcher one has the sortableNumeric flag (which we don't need here) and some special logic for LatLonPointSpatialField and AbstractEnumField scenarios (also not applicable here) ... otherwise it would have been a clear "okay, let's factor out a method here and put it somewhere that both SolrDocumentFeature and FieldValueFeature can use it" kind of scenario. Oh well.


/**
* Interprets the bytesRef either as true / false token or tries to read it as number string
*
* @param bytesRef the value of the field that should be used as score
* @return the input converted to a number
*/
private float readSortedDocValues(BytesRef bytesRef) {
String string = bytesRef.utf8ToString();
if (string.length() == 1
&& (string.charAt(0) == BoolField.TRUE_TOKEN[0] || string.charAt(0) == BoolField.FALSE_TOKEN[0])) {
// boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
// (see BoolField)
if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
return 1f;
} else {
return 0f;
}
} else {
try {
return Float.parseFloat(string);
} catch (NumberFormatException ex) {
throw new FeatureException("Cannot parse value " + string + " of field " + schemaField.getName() + " to float.");
}
}
}

@Override
public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}
/**
* A FeatureScorer that reads the sorted docValues for a field
*/
public class SortedDocValuesFieldValueFeatureScorer extends FeatureWeight.FeatureScorer {
SortedDocValues docValues;
NumberType fieldNumberType;

public SortedDocValuesFieldValueFeatureScorer(final FeatureWeight weight, final LeafReaderContext context,
final DocIdSetIterator itr) {
super(weight, itr);

try {
docValues = DocValues.getSorted(context.reader(), field);
} catch (IOException e) {
throw new IllegalArgumentException("Could not read sorted docValues for field " + field);
}
}

@Override
public float score() throws IOException {
if (docValues.advanceExact(itr.docID())) {
int ord = docValues.ordValue();
return readSortedDocValues(docValues.lookupOrd(ord));
}
return FieldValueFeature.this.getDefaultValue();
}

/**
* Interprets the bytesRef either as true / false token or tries to read it as number string
*
* @param bytesRef the value of the field that should be used as score
* @return the input converted to a number
*/
private float readSortedDocValues(BytesRef bytesRef) {
String string = bytesRef.utf8ToString();
if (string.length() == 1
&& (string.charAt(0) == BoolField.TRUE_TOKEN[0] || string.charAt(0) == BoolField.FALSE_TOKEN[0])) {
// boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
// (see BoolField)
if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
return 1f;
} else {
return 0f;
}
} else {
try {
return Float.parseFloat(string);
} catch (NumberFormatException ex) {
throw new FeatureException("Cannot parse value " + string + " of field " + schemaField.getName() + " to float.");
}
}
}

@Override
public float getMaxScore(int upTo) throws IOException {
return Float.POSITIVE_INFINITY;
}
}

}
}
20 changes: 19 additions & 1 deletion solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,26 @@

<schema name="example" version="1.5">
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/>
<field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: via the docValues="${solr.tests.numeric.dv}" in the <fieldType name="int" ... and popularity being used in the TestFieldValueFeature we have some existing test coverage for doc values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

<field name="dvIntPopularity" type="int" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvLongPopularity" type="long" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvFloatPopularity" type="float" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvDoublePopularity" type="double" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="normHits" type="float" indexed="true" stored="true" />
<field name="isTrendy" type="boolean" indexed="true" stored="true" />
<field name="dvIntField" type="int" indexed="false" docValues="true" stored="false" default="-1" multiValued="false"/>
<field name="dvLongField" type="long" indexed="false" docValues="true" stored="false" default="-2" multiValued="false"/>
<field name="dvFloatField" type="float" indexed="false" docValues="true" stored="false" default="-3" multiValued="false"/>
<field name="dvDoubleField" type="double" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStrNumField" type="string" indexed="false" docValues="true" stored="false" multiValued="false"/>
<field name="dvStrBoolField" type="boolean" indexed="false" docValues="true" stored="false" multiValued="false"/>

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
Expand All @@ -38,6 +51,11 @@
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>

<copyField source="popularity" dest="dvIntPopularity"/>
<copyField source="popularity" dest="dvLongPopularity"/>
<copyField source="popularity" dest="dvFloatPopularity"/>
<copyField source="popularity" dest="dvDoublePopularity"/>
Comment on lines +64 to +67
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: copyField here works for the dv*Popularity fields but the dv(Int|Long|Float|Double)Field fields cannot use it since the defaulting logic and the popularity/10 logic is happening there.


<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
Expand Down
Loading