-
Notifications
You must be signed in to change notification settings - Fork 673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-12697 Add pure DocValues support to FieldValueFeature #123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tomglk for opening this pull request today!
I've added some comments from an initial review and hope to look more fully maybe later this week.
@alessandrobenedetti might also be interested in this.
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java
Outdated
Show resolved
Hide resolved
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
Thank you very much for the quick and detailed feedback, @cpoerschke ! |
… in constructor instead of once per doc
…ocuments and commit during indexing; improve formatting of test
…lueFeatureScorer so only the supported types have to be handled later; extract number-conversion to separate method
* NumericDocValuesFieldValueFeatureScorer * SortedDocValuesFieldValueFeatureScorer
caveat: TestFieldValueFeature.testIfADocumentDoesntHaveAFieldDefaultValueIsReturned fails
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
final int collectionSize = 8; | ||
for (int docId = 1; docId <= collectionSize; docId++) { | ||
// put documents in random order to check that advanceExact is working correctly | ||
List<Integer> docIds = IntStream.rangeClosed(1, collectionSize).boxed().collect(toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I haven't across this IntStream.rangeClosed
thing before!
if (popularity != 1) { | ||
// check that empty values will be read as default | ||
doc.setField("dvIntField", popularity); | ||
doc.setField("dvLongField", popularity); | ||
doc.setField("dvFloatField", ((float) popularity) / 10); | ||
doc.setField("dvDoubleField", ((double) popularity) / 10); | ||
doc.setField("dvStrNumField", popularity); | ||
doc.setField("dvStrBoolField", popularity % 2 == 0 ? "T" : "F"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at TestLTROnSolrCloud
changes today, this approach here is very elegant, combined with the resultN_features
and schema.xml
change.
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/> | ||
<field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/> | ||
<field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/> | ||
<field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/> | ||
<field name="title" type="text_general" indexed="true" stored="true"/> | ||
<field name="description" type="text_general" indexed="true" stored="true"/> | ||
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/> | ||
<field name="popularity" type="int" indexed="true" stored="true" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: via the docValues="${solr.tests.numeric.dv}"
in the <fieldType name="int" ...
and popularity
being used in the TestFieldValueFeature
we have some existing test coverage for doc values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit with additional TestFieldValueFeature
coverage, for the numeric fields only so far i.e. dvStrNumPopularity
and dvStrBoolPopularity
not yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tomglk. Thanks for working on this pull request! I've added two commits to the branch, let me know what you think? Yet to look at the TestLTRReRankingPipeline
test (next week).
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/> | ||
<field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/> | ||
<field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/> | ||
<field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/> | ||
<field name="title" type="text_general" indexed="true" stored="true"/> | ||
<field name="description" type="text_general" indexed="true" stored="true"/> | ||
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/> | ||
<field name="popularity" type="int" indexed="true" stored="true" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit with additional TestFieldValueFeature
coverage, for the numeric fields only so far i.e. dvStrNumPopularity
and dvStrBoolPopularity
not yet.
<copyField source="popularity" dest="dvIntPopularity"/> | ||
<copyField source="popularity" dest="dvLongPopularity"/> | ||
<copyField source="popularity" dest="dvFloatPopularity"/> | ||
<copyField source="popularity" dest="dvDoublePopularity"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: copyField
here works for the dv*Popularity
fields but the dv(Int|Long|Float|Double)Field
fields cannot use it since the defaulting logic and the popularity/10 logic is happening there.
assertJQ("/query" + query.toQueryString(), | ||
"/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails here on the second and subsequent run of the loop. I don't know yet if that is a test artifact or an actual code issue. The testIfADocumentDoesntHaveAFieldASetDefaultValueIsReturned
test does not have the same behaviour since each loop turn uses a different f(feature )store
(and its own model) whereas this test uses the test-wide models and its default feature store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is final private boolean extractAllFeatures;
in LTRScoringQuery
. It is set to true and therefore all previous features are added to the response instead of just the model features.
This is the logic (line 210 in LTRScoringQuery):
if (this.extractAllFeatures) {
features = allFeatures;
} else{
features = modelFeatures;
}
I didn't find a way to set it to false with a query param.
So we would have two options:
- use this constructor
public LTRScoringQuery(LTRScoringModel ltrScoringModel, boolean extractAllFeatures)
- adjust the xpath to something like
[ends-with(., '<fieldname>=0.0')]
I am trying to adjust the xpath, because I would prefer that solution, but I was not able to find the right pattern so far.
Edit:
Ok, I took a break and realized my mistake (the break seemed to be necessary smh).
I took another look at the jsonpath-adjustment. A pattern like this "['response']['docs'][0][?(@['[fv]']=~ /.* <fieldName> =0.0.*/i )]"
should work but it would require an additional dependency for jayway.jsonpath
and we would have to manually call JsonPath.read(...)
and assert that the result is not null.
I am not sure, whether that's nice, but maybe you want to build up upon that / can generate an idea from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures
boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
assertJQ("/query" + query.toQueryString(), | ||
"/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures
boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...
for (String field : FIELD_NAMES) { | ||
loadFeature(field, FieldValueFeature.class.getName(), | ||
"{\"field\":\""+field+"\"}"); | ||
|
||
loadModel("popularity-model", LinearModel.class.getName(), | ||
new String[] {"popularity"}, "{\"weights\":{\"popularity\":1.0}}"); | ||
loadModel(field + "-model", LinearModel.class.getName(), | ||
new String[] {field}, "{\"weights\":{\""+field+"\":1.0}}"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2/2 ... I think the explanation for the mystery is here i.e. the LinearModel
constructor signature and the interleaving of feature and model loading results in the models having different definitions of what allFeatures
means. One solution then would be for the test expectations to account for the multiple meanings of allFeatures
or (my preference probably) for there to be only one allFeatures
meaning e.g. via the separation of feature and model loading. Or perhaps you can think of a third solution even?
for (String field : FIELD_NAMES) {
loadFeature(field, ...);
}
for (String field : FIELD_NAMES) {
loadModel(field + "-model", ...);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing the investigation!
I think the example you posted here should work, although I am still a bit irritated regarding the extractAllFeatures = true
.
As user I would expect it to always return all the features that were created, no matter if that happened before or after the model.
But the code makes it explicit, that these information should not be modified (allFeatures
is an unmodifiableList).
I think that a nicely placed docString could avoid some confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a somewhat quirky behaviour. Operationally it could also cause mysteries e.g. when new features are added to an existing store and the (reasonable) expectation is that they will start producing feature values right away when actually that only happens when the existing model(s) are next re-instantiated. +1 to try and reduce such confusion via documentation somehow.
…ult value; only use one model in tests because of extractAllFeatures==true
…est that right scorer classes are used, add more fields to test
…ry to CHANGES.txt
I have added more tests and an entry to the changes.txt. Also: From my side this looks quite finished now. Do you miss anything? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tomglk!
... From my side this looks quite finished now. Do you miss anything?
I agree, we're close to wrapping this up (and am also mindful of the 8.9.0 release coming up soon and us wishing to include this in it). There's a couple of "polishing" bits I'd like to do and me adding small commits directly to the branch (like the three commits just added) might be the most efficient way if that works for you?
Really looking forward to this being in the next release (and also have some ideas on how afterwards we potentially might (under separate JIRA and pull request) (safely) break backwards compatibility in Solr 9.0.0 to allow stored=true docValues=true
fields to also benefit from the docValues implementation whilst a deprecated (say) LegacyFieldValueFeature
until (say) Solr 10.0.0 retains the existing implementation for users who need that ...
edit: https://issues.apache.org/jira/browse/SOLR-15440 "contrib/ltr FieldValueFeature: DocValues use for stored fields" created for this
/** | ||
* This class is used to track which specific FieldValueFeature is used so that we can test, whether the | ||
* fallback mechanism works correctly. | ||
*/ | ||
public static class ObservingFieldValueFeature extends FieldValueFeature { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The selection of test coverage using this new ObservingFieldValueFeature
and its observing the scorer type is very comprehensive and elegant, thanks for adding it!
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
Outdated
Show resolved
Hide resolved
import org.apache.solr.core.SolrResourceLoader; | ||
import org.apache.solr.ltr.feature.Feature; | ||
import org.apache.solr.ltr.feature.FieldValueFeature; | ||
import org.apache.solr.ltr.model.LTRScoringModel; | ||
import org.apache.solr.ltr.model.TestLinearModel; | ||
import org.apache.solr.ltr.norm.IdentityNormalizer; | ||
import org.apache.solr.ltr.norm.Normalizer; | ||
import org.apache.solr.request.LocalSolrQueryRequest; | ||
import org.apache.solr.request.SolrQueryRequest; | ||
import org.junit.BeforeClass; | ||
import org.junit.Test; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After multiple attempts I'm still struggling to understand and appreciate how the TestLTRReRankingPipeline
changes relate to and fit in with the other changes. @tomglk would you have an extra perspective on this perhaps?
I'm speculating that the changes were required when the SolrDocumentFetcher
implementation was used but now with the TestFieldValueFeature
changes the picture has changed.
If the TestLTRReRankingPipeline
changes are no longer required and relatively unrelated I would favour git checkout origin/main -- TestLTRReRankingPipeline.java
removing them from the pull request scope. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to this test are not necessary for it to work and I agree that they are only loosely related to the new docValues support.
However, I think that the test is way more comprehensive and better to read now.
I would argue that the test is connected to the changes because of the use of the FieldValueFeature (and not e.g. an OriginalScoreFeature) and future developers can benefit from the increased readability.
So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Yes, once the "where is the pure docValues connection?" question is gone and the mental focus is on "code comprehension and readability" it is clear that this is a refactoring change that is indeed very beneficial.
And when considering if the refactoring changes inadvertently changes any existing behaviour the @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-11134")
caught my eye and I think I managed to diagnose this long-standing issue:
- the problem was that
new FloatDocValuesField("final-score", ...)
resulted in astored=false
field (which was not very apparent) and itsschema.xml
replacement namedfinalScore
isstored=true
i.e. immediate readability and comprehension benefit there. - then next the
testDifferentTopN
turned out to have an incorrect expectation w.r.t. what the score should be (addition of an alternativeTestLinearModel.makeFeatureWeights
variant tries to make that clearer). - and lastly once the
testDifferentTopN
test passed it interacted with thetestRescorer
test i.e. if the former ran first then the latter would see not only its own 2 documents but also the documents added by the former test, this is easily fixed by everyone deleting all documents before adding their documents.
... So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.
I agree we should keep these very beneficial TestLTRReRankingPipeline
changes.
If (and that is an if) the SOLR-11134 fix (or something along those lines) makes sense, then how would you feel about us still removing the TestLTRReRankingPipeline
change from this PR and instead to have it in a separate PR and to "book it" only under SOLR-11134 or jointly under SOLR-11134 and SOLR-12697 tickets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking the time to investigate this!
I just had a look at your changes and they totally make sense to me.
Because of the existing jira ticket for 11134 it makes sense to me to remove the file from this PR and to book it under a new PR like "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at and reviewing the changes! I've just opened #145 PR for the "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test" changes then.
Hi @cpoerschke , Your idea to make a new jira ticket and PR to break backwards compatibility sounds good to me. It would be useful to profit from the better performance for more fields. I am on board to continue working on this topic with you. I'm also exited to see this in solr 8.9.0! |
…er is more numeric, latter is more boolean and copyField from isTrend simplies the document( add)s
* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr) * use private and final where possible * make new scorers final (but not existing scorer for back compat reasons) since no obvious need to extend
…orer is used for different types of fields
…a/SOLR-12697 Resolved Conflicts: solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
This reverts commit 3c38e91.
I am currently debugging some ltr stuff and stumbled upon this: final String string = indexableField.stringValue();
if (string.length() == 1) {
// boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
// (see BoolField)
if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
return 1f;
}
if (string.charAt(0) == BoolField.FALSE_TOKEN[0]) {
return 0f;
}
} This code is in the |
…roximately: * minor style tweaks e.g. new SolrQuery(id:21) instead of new SolrQuery(); setQuery(id:21) * in testIfADocumentDoesntHaveAFieldTheDefaultValueFromSchemaIsReturned for clarity remove assumption w.r.t. field default values being -1/-2/-3 sequential * in testThatDefaultFieldValueScorerIsUsedAndDefaultIsReturned replace generic dvTestField name with dvDoubleField and also dvStrBoolField to also cover a non-numeric field * in testThatExceptionIsThrownForUnsupportedType replace generic dvUnsupportedField with dvStringPopularities for the popularities (plural) naming to help signal unsupportedness * in testThatCorrectFieldValueFeatureIsUsedForDocValueTypes replace noDocValuesField with noDvFloatField and noDvStrNumField to cover both numeric and non-numeric * in testThatStringValuesAreCorrectlyParsed also cover non-docValues field _and_ ensure behavioural consistency between dv and non-dv fields [*** this required implementation adjustment for the dv field ***]
Hi @tomglk!
Ah, we seem to have both stumbled upon the same question though perhaps via different routes?! I was trying to extend And advance apologies, the commit has multiple things in one commit though I've tried to at least describe in the commit message details what is in it. Looking forward to your thoughts on it. |
private float readNumericDocValues() throws IOException { | ||
if (NumberType.FLOAT.equals(numberType)) { | ||
// convert float value that was stored as long back to float | ||
return Float.intBitsToFloat((int) docValues.longValue()); | ||
} else if (NumberType.DOUBLE.equals(numberType)) { | ||
// handle double value conversion | ||
return (float) Double.longBitsToDouble(docValues.longValue()); | ||
} | ||
// just take the long value | ||
return docValues.longValue(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/2 So I was trying to learn and understand better w.r.t. why the intBitsToFloat
and longBitsToDouble
are the conversions to use here and that led to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L603-L644 and the realisation that DATE
is a numberType
possibility!
Would you have any thoughts on using a switch
statement here with FLOAT/DOUBLE/INT/LONG all valid and to throw an exception for anything else ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat strangely so perhaps the use of DATE
fields "just works" for both the existing stored and the new docValues implementations. I've added a TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed
test to demonstrate that and so then no need to change the implementation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... the date being a numeric type really is quite unintuitive.
I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. :/
I will go and make a tea and then look into that.
Thank you for adding the test for the date field. The new clarity on what value is returned for the date type is very good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... Thanks github for not refreshing the page... 🙄
After looking into that I saw the same behavior as you did and was just about to comment that. :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. ...
Likewise, I would have preferred for there to be some existing reusable method and was surprised that there isn't one (or not an obviously findable one). The SolrDocumentFetcher
one has the sortableNumeric
flag (which we don't need here) and some special logic for LatLonPointSpatialField
and AbstractEnumField
scenarios (also not applicable here) ... otherwise it would have been a clear "okay, let's factor out a method here and put it somewhere that both SolrDocumentFeature and FieldValueFeature can use it" kind of scenario. Oh well.
if (DocValuesType.NUMERIC.equals(docValuesType)) { | ||
return new NumericDocValuesFieldValueFeatureScorer(this, context, | ||
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS), schemaField.getType().getNumberType()); | ||
} else if (DocValuesType.SORTED.equals(docValuesType)) { | ||
return new SortedDocValuesFieldValueFeatureScorer(this, context, | ||
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS)); | ||
} else if (DocValuesType.NONE.equals(docValuesType)) { | ||
// Using a fallback feature scorer because this segment has no documents with a doc value for the current field | ||
return new DefaultValueFieldValueFeatureScorer(this, DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2/2 ... combined with here diverting any unexpected getNumberType()
enums onto the "is not supported" code path? Although having said that, I don't know what the existing FieldValueFeatureScorer
would return for a stored DATE
field i.e. perhaps back compat considerations apply, hmm.
Hi @cpoerschke! :)
I just had the time to look through your changes.
No worries, I read you notes in the commit while looking through the code and had no problem with understanding your thoughts behind the changes. |
* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr)
Thanks @tomglk for taking a look! I've added one more small polish commit to workaround us not yet having 'spotless' or other automatic code formatting for the Otherwise I think we're all done here, what do you think? |
Hi @cpoerschke ,thank you for the polishing and please excuse me for adding some spaces in these places. I tried to set up my IDE to mimic the correct formatting, but did not fully succeed.
I think so too. :) |
Thank you for the great collaboration on this PR, @cpoerschke . I really did learn a lot thanks to your constructive remarks and had a lot of fun working on this. :) |
…cValues=true" a.k.a. pure DocValues fields. (apache#123) (Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)
…cValues=true" a.k.a. pure DocValues fields. (apache#123) (Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)
Description
This PR refers to the issue SOLR-12697.
The problem is, that the current FieldValueFeature only works for stored fields.
This is not optimal, because using DocValues is faster for this use case. Also it increases the index size if you have to store fields only to use them for ltr.
Solution
Note: This PR is based on the work of Stanislav Livotov and Christine Poerschke that can be seen in the jira ticket.
It uses the latest patch (17th May 2019) from the jira ticket as base. I combined that with suggestions from Mrs. Poerschke and my own approach to the problem.
This PR adds the DocValuesFieldValueFeatureScorer as a new Scorer used by the FieldValueFeatureWeight.
The new scorer is used whenever a field has docValues and is not stored. Therefore it does not affect the current functionality but only is applied for fields that could not be used before.
The new scorer checks the type of docValues a field has and handles NUMERIC and SORTED types. For NUMERIC fields, it simply uses the value, the SORTED type gets parsed as number or boolean-flag.
Tests
New fields that have docValues=true were added to the schema.xml in order to test in TestLTROnSolrCloud that the feature-requests also return values for these fields.
The TestLTRReRankingPipeline was changed from a SolrTestCase to a SolrTestCaseJ4 in order to improve readability.
I ran all tests in the package
org.apache.solr.ltr
.Please note
I am aware that the structure of the FieldValueFeature is now quite hard to read and the new Scorer is a bit hidden. I decided to add another nested class to the FieldValueFeatureWeight to avoid having to duplicate a lot of code just to change the inner functionality.
Unit tests for the handleBytesRef are still missing. I plan to add them, but wanted to create the PR already so that the general approach can be reviewed and discussed.Edit: Tests for the parsing of different Types of sorted docValues (T and F flag, numbers, no number) were added in e07c432.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
../gradlew check -x test
.