SOLR-12697 Add pure DocValues support to FieldValueFeature #123

tomglk · 2021-05-12T15:59:07Z

Description

This PR refers to the issue SOLR-12697.
The problem is, that the current FieldValueFeature only works for stored fields.
This is not optimal, because using DocValues is faster for this use case. Also it increases the index size if you have to store fields only to use them for ltr.

Solution

Note: This PR is based on the work of Stanislav Livotov and Christine Poerschke that can be seen in the jira ticket.

It uses the latest patch (17th May 2019) from the jira ticket as base. I combined that with suggestions from Mrs. Poerschke and my own approach to the problem.

This PR adds the DocValuesFieldValueFeatureScorer as a new Scorer used by the FieldValueFeatureWeight.
The new scorer is used whenever a field has docValues and is not stored. Therefore it does not affect the current functionality but only is applied for fields that could not be used before.
The new scorer checks the type of docValues a field has and handles NUMERIC and SORTED types. For NUMERIC fields, it simply uses the value, the SORTED type gets parsed as number or boolean-flag.

Tests

New fields that have docValues=true were added to the schema.xml in order to test in TestLTROnSolrCloud that the feature-requests also return values for these fields.
The TestLTRReRankingPipeline was changed from a SolrTestCase to a SolrTestCaseJ4 in order to improve readability.

I ran all tests in the package org.apache.solr.ltr.

Please note

I am aware that the structure of the FieldValueFeature is now quite hard to read and the new Scorer is a bit hidden. I decided to add another nested class to the FieldValueFeatureWeight to avoid having to duplicate a lot of code just to change the inner functionality.

~~Unit tests for the handleBytesRef are still missing. I plan to add them, but wanted to create the PR already so that the general approach can be reviewed and discussed.~~
Edit: Tests for the parsing of different Types of sorted docValues (T and F flag, numbers, no number) were added in e07c432.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title. (Issue was already present)
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have run ./gradlew check -x test.
I have added tests for my changes.
I have added documentation for the Reference Guide

…or ltr feature calculation

cpoerschke

Thanks @tomglk for opening this pull request today!

I've added some comments from an initial review and hope to look more fully maybe later this week.

@alessandrobenedetti might also be interested in this.

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml

solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

tomglk · 2021-05-12T20:22:39Z

Thank you very much for the quick and detailed feedback, @cpoerschke !
I already have some ideas in mind and will start to improve the code next thing tomorrow morning.

… in constructor instead of once per doc

…ocuments and commit during indexing; improve formatting of test

…lueFeatureScorer so only the supported types have to be handled later; extract number-conversion to separate method

* NumericDocValuesFieldValueFeatureScorer * SortedDocValuesFieldValueFeatureScorer

caveat: TestFieldValueFeature.testIfADocumentDoesntHaveAFieldDefaultValueIsReturned fails

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

cpoerschke · 2021-05-14T19:47:09Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java

    final int collectionSize = 8;
-    for (int docId = 1; docId <= collectionSize;  docId++) {
+    // put documents in random order to check that advanceExact is working correctly
+    List<Integer> docIds = IntStream.rangeClosed(1, collectionSize).boxed().collect(toList());


Nice, I haven't across this IntStream.rangeClosed thing before!

cpoerschke · 2021-05-14T19:50:00Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTROnSolrCloud.java

+    if (popularity != 1) {
+      // check that empty values will be read as default
+      doc.setField("dvIntField", popularity);
+      doc.setField("dvLongField", popularity);
+      doc.setField("dvFloatField", ((float) popularity) / 10);
+      doc.setField("dvDoubleField", ((double) popularity) / 10);
+      doc.setField("dvStrNumField", popularity);
+      doc.setField("dvStrBoolField", popularity % 2 == 0 ? "T" : "F");
+    }


I looked at TestLTROnSolrCloud changes today, this approach here is very elegant, combined with the resultN_features and schema.xml change.

cpoerschke · 2021-05-14T19:53:34Z

solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml

+    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
+    <field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/>
+    <field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/>
+    <field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="description" type="text_general" indexed="true" stored="true"/>
    <field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="popularity" type="int" indexed="true" stored="true" />


observation: via the docValues="${solr.tests.numeric.dv}" in the <fieldType name="int" ... and popularity being used in the TestFieldValueFeature we have some existing test coverage for doc values.

Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

cpoerschke

Hi @tomglk. Thanks for working on this pull request! I've added two commits to the branch, let me know what you think? Yet to look at the TestLTRReRankingPipeline test (next week).

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

cpoerschke · 2021-05-14T20:10:14Z

solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml

+    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
+    <field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/>
+    <field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/>
+    <field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="description" type="text_general" indexed="true" stored="true"/>
    <field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="popularity" type="int" indexed="true" stored="true" />


Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

cpoerschke · 2021-05-14T20:12:21Z

solr/contrib/ltr/src/test-files/solr/collection1/conf/schema.xml

+  <copyField source="popularity" dest="dvIntPopularity"/>
+  <copyField source="popularity" dest="dvLongPopularity"/>
+  <copyField source="popularity" dest="dvFloatPopularity"/>
+  <copyField source="popularity" dest="dvDoublePopularity"/>


observation: copyField here works for the dv*Popularity fields but the dv(Int|Long|Float|Double)Field fields cannot use it since the defaulting logic and the popularity/10 logic is happening there.

cpoerschke · 2021-05-14T20:18:09Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/feature/TestFieldValueFeature.java

+      assertJQ("/query" + query.toQueryString(),
+              "/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}");


This test fails here on the second and subsequent run of the loop. I don't know yet if that is a test artifact or an actual code issue. The testIfADocumentDoesntHaveAFieldASetDefaultValueIsReturned test does not have the same behaviour since each loop turn uses a different f(feature )store (and its own model) whereas this test uses the test-wide models and its default feature store.

I think the problem is final private boolean extractAllFeatures; in LTRScoringQuery. It is set to true and therefore all previous features are added to the response instead of just the model features.

This is the logic (line 210 in LTRScoringQuery):

if (this.extractAllFeatures) { features = allFeatures; } else{ features = modelFeatures; }

I didn't find a way to set it to false with a query param.
So we would have two options:

use this constructor public LTRScoringQuery(LTRScoringModel ltrScoringModel, boolean extractAllFeatures)

adjust the xpath to something like [ends-with(., '<fieldname>=0.0')]

I am trying to adjust the xpath, because I would prefer that solution, but I was not able to find the right pattern so far.

Edit:
Ok, I took a break and realized my mistake (the break seemed to be necessary smh).
I took another look at the jsonpath-adjustment. A pattern like this "['response']['docs'][0][?(@['[fv]']=~ /.* <fieldName> =0.0.*/i )]" should work but it would require an additional dependency for jayway.jsonpath and we would have to manually call JsonPath.read(...) and assert that the result is not null.

I am not sure, whether that's nice, but maybe you want to build up upon that / can generate an idea from here?

1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

…ric docValues

cpoerschke · 2021-05-16T20:19:36Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/feature/TestFieldValueFeature.java

+      assertJQ("/query" + query.toQueryString(),
+              "/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}");


1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...

cpoerschke · 2021-05-16T20:26:45Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/feature/TestFieldValueFeature.java

+    for (String field : FIELD_NAMES) {
+      loadFeature(field, FieldValueFeature.class.getName(),
+              "{\"field\":\""+field+"\"}");

-    loadModel("popularity-model", LinearModel.class.getName(),
-            new String[] {"popularity"}, "{\"weights\":{\"popularity\":1.0}}");
+      loadModel(field + "-model", LinearModel.class.getName(),
+              new String[] {field}, "{\"weights\":{\""+field+"\":1.0}}");
+    }


2/2 ... I think the explanation for the mystery is here i.e. the LinearModel constructor signature and the interleaving of feature and model loading results in the models having different definitions of what allFeatures means. One solution then would be for the test expectations to account for the multiple meanings of allFeatures or (my preference probably) for there to be only one allFeatures meaning e.g. via the separation of feature and model loading. Or perhaps you can think of a third solution even?

for (String field : FIELD_NAMES) { loadFeature(field, ...); } for (String field : FIELD_NAMES) { loadModel(field + "-model", ...); }

Thanks for continuing the investigation!
I think the example you posted here should work, although I am still a bit irritated regarding the extractAllFeatures = true.

As user I would expect it to always return all the features that were created, no matter if that happened before or after the model.
But the code makes it explicit, that these information should not be modified (allFeatures is an unmodifiableList).
I think that a nicely placed docString could avoid some confusion.

Yes, it's a somewhat quirky behaviour. Operationally it could also cause mysteries e.g. when new features are added to an existing store and the (reasonable) expectation is that they will start producing feature values right away when actually that only happens when the existing model(s) are next re-instantiated. +1 to try and reduce such confusion via documentation somehow.

…ult value; only use one model in tests because of extractAllFeatures==true

…est that right scorer classes are used, add more fields to test

…ry to CHANGES.txt

tomglk · 2021-05-19T19:06:11Z

I have added more tests and an entry to the changes.txt.
For now, I just added our names, @cpoerschke , because I wasn't sure about the people in the Jira ticket. I think it would be nice to include them because this PR was built upon that and the patches, but I have no overview who participated how much and the last patch was created ~ 2 years ago.
What do you think?

Also: From my side this looks quite finished now. Do you miss anything?

cpoerschke

Hi @tomglk!

... From my side this looks quite finished now. Do you miss anything?

I agree, we're close to wrapping this up (and am also mindful of the 8.9.0 release coming up soon and us wishing to include this in it). There's a couple of "polishing" bits I'd like to do and me adding small commits directly to the branch (like the three commits just added) might be the most efficient way if that works for you?

Really looking forward to this being in the next release (and also have some ideas on how afterwards we potentially might (under separate JIRA and pull request) (safely) break backwards compatibility in Solr 9.0.0 to allow stored=true docValues=true fields to also benefit from the docValues implementation whilst a deprecated (say) LegacyFieldValueFeature until (say) Solr 10.0.0 retains the existing implementation for users who need that ...

edit: https://issues.apache.org/jira/browse/SOLR-15440 "contrib/ltr FieldValueFeature: DocValues use for stored fields" created for this

cpoerschke · 2021-05-20T10:33:11Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/feature/TestFieldValueFeature.java

+  /**
+   * This class is used to track which specific FieldValueFeature is used so that we can test, whether the
+   * fallback mechanism works correctly.
+   */
+  public static class ObservingFieldValueFeature extends FieldValueFeature {


The selection of test coverage using this new ObservingFieldValueFeature and its observing the scorer type is very comprehensive and elegant, thanks for adding it!

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

cpoerschke · 2021-05-20T10:35:11Z

solr/contrib/ltr/src/test/org/apache/solr/ltr/TestLTRReRankingPipeline.java

 import org.apache.solr.core.SolrResourceLoader;
 import org.apache.solr.ltr.feature.Feature;
 import org.apache.solr.ltr.feature.FieldValueFeature;
 import org.apache.solr.ltr.model.LTRScoringModel;
 import org.apache.solr.ltr.model.TestLinearModel;
 import org.apache.solr.ltr.norm.IdentityNormalizer;
 import org.apache.solr.ltr.norm.Normalizer;
+import org.apache.solr.request.LocalSolrQueryRequest;
+import org.apache.solr.request.SolrQueryRequest;
+import org.junit.BeforeClass;
 import org.junit.Test;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;



After multiple attempts I'm still struggling to understand and appreciate how the TestLTRReRankingPipeline changes relate to and fit in with the other changes. @tomglk would you have an extra perspective on this perhaps?

I'm speculating that the changes were required when the SolrDocumentFetcher implementation was used but now with the TestFieldValueFeature changes the picture has changed.

If the TestLTRReRankingPipeline changes are no longer required and relatively unrelated I would favour git checkout origin/main -- TestLTRReRankingPipeline.java removing them from the pull request scope. What do you think?

The changes to this test are not necessary for it to work and I agree that they are only loosely related to the new docValues support.
However, I think that the test is way more comprehensive and better to read now.

I would argue that the test is connected to the changes because of the use of the FieldValueFeature (and not e.g. an OriginalScoreFeature) and future developers can benefit from the increased readability.
So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.

Thanks for this! Yes, once the "where is the pure docValues connection?" question is gone and the mental focus is on "code comprehension and readability" it is clear that this is a refactoring change that is indeed very beneficial.

And when considering if the refactoring changes inadvertently changes any existing behaviour the @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-11134") caught my eye and I think I managed to diagnose this long-standing issue:

the problem was that new FloatDocValuesField("final-score", ...) resulted in a stored=false field (which was not very apparent) and its schema.xml replacement named finalScore is stored=true i.e. immediate readability and comprehension benefit there.

then next the testDifferentTopN turned out to have an incorrect expectation w.r.t. what the score should be (addition of an alternative TestLinearModel.makeFeatureWeights variant tries to make that clearer).

and lastly once the testDifferentTopN test passed it interacted with the testRescorer test i.e. if the former ran first then the latter would see not only its own 2 documents but also the documents added by the former test, this is easily fixed by everyone deleting all documents before adding their documents.

... So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.

I agree we should keep these very beneficial TestLTRReRankingPipeline changes.

If (and that is an if) the SOLR-11134 fix (or something along those lines) makes sense, then how would you feel about us still removing the TestLTRReRankingPipeline change from this PR and instead to have it in a separate PR and to "book it" only under SOLR-11134 or jointly under SOLR-11134 and SOLR-12697 tickets?

Thank you for taking the time to investigate this!
I just had a look at your changes and they totally make sense to me.

Because of the existing jira ticket for 11134 it makes sense to me to remove the file from this PR and to book it under a new PR like "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test".

Thanks for looking at and reviewing the changes! I've just opened #145 PR for the "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test" changes then.

tomglk · 2021-05-20T11:21:06Z

Hi @cpoerschke ,
feel free to directly commit to the branch. :)

Your idea to make a new jira ticket and PR to break backwards compatibility sounds good to me. It would be useful to profit from the better performance for more fields. I am on board to continue working on this topic with you.

I'm also exited to see this in solr 8.9.0!

…er is more numeric, latter is more boolean and copyField from isTrend simplies the document( add)s

* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr) * use private and final where possible * make new scorers final (but not existing scorer for back compat reasons) since no obvious need to extend

…orer is used for different types of fields

…a/SOLR-12697 Resolved Conflicts: solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

This reverts commit 3c38e91.

tomglk · 2021-05-21T07:16:46Z

I am currently debugging some ltr stuff and stumbled upon this:

final String string = indexableField.stringValue();
if (string.length() == 1) {
    // boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
    // (see BoolField)
    if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
      return 1f;
    }
    if (string.charAt(0) == BoolField.FALSE_TOKEN[0]) {
      return 0f;
    }
}

This code is in the FieldValueFeatureScorer. Why don't we try to parse the string value as number?
We do that for docValues, but not for stored fields.

…roximately: * minor style tweaks e.g. new SolrQuery(id:21) instead of new SolrQuery(); setQuery(id:21) * in testIfADocumentDoesntHaveAFieldTheDefaultValueFromSchemaIsReturned for clarity remove assumption w.r.t. field default values being -1/-2/-3 sequential * in testThatDefaultFieldValueScorerIsUsedAndDefaultIsReturned replace generic dvTestField name with dvDoubleField and also dvStrBoolField to also cover a non-numeric field * in testThatExceptionIsThrownForUnsupportedType replace generic dvUnsupportedField with dvStringPopularities for the popularities (plural) naming to help signal unsupportedness * in testThatCorrectFieldValueFeatureIsUsedForDocValueTypes replace noDocValuesField with noDvFloatField and noDvStrNumField to cover both numeric and non-numeric * in testThatStringValuesAreCorrectlyParsed also cover non-docValues field _and_ ensure behavioural consistency between dv and non-dv fields [*** this required implementation adjustment for the dv field ***]

cpoerschke · 2021-05-24T12:00:35Z

Hi @tomglk!

I am currently debugging some ltr stuff and stumbled upon this: ... Why don't we try to parse the string value as number? We do that for docValues, but not for stored fields.

Ah, we seem to have both stumbled upon the same question though perhaps via different routes?!

I was trying to extend TestFieldValueFeature.testThatStringValuesAreCorrectlyParsed to also cover non-docValues field and to ensure behavioural consistency between dv and non-dv fields i.e. docValues and stored fields. My conclusion was that for back compat reasons the docValues would need to match stored fields behaviour -- I will add a commit with proposed changes shortly -- and that what looks like a "string" value in the implementation is for "boolean" values only really i.e. F/0 and T/1 and when number values are required a number field type would/should be used instead.

And advance apologies, the commit has multiple things in one commit though I've tried to at least describe in the commit message details what is in it. Looking forward to your thoughts on it.

cpoerschke · 2021-05-24T12:13:40Z

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

+      private float readNumericDocValues() throws IOException {
+        if (NumberType.FLOAT.equals(numberType)) {
+          // convert float value that was stored as long back to float
+          return Float.intBitsToFloat((int) docValues.longValue());
+        } else if (NumberType.DOUBLE.equals(numberType)) {
+          // handle double value conversion
+          return (float) Double.longBitsToDouble(docValues.longValue());
+        }
+        // just take the long value
+        return docValues.longValue();
+      }


1/2 So I was trying to learn and understand better w.r.t. why the intBitsToFloat and longBitsToDouble are the conversions to use here and that led to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L603-L644 and the realisation that DATE is a numberType possibility!

Would you have any thoughts on using a switch statement here with FLOAT/DOUBLE/INT/LONG all valid and to throw an exception for anything else ...

Somewhat strangely so perhaps the use of DATE fields "just works" for both the existing stored and the new docValues implementations. I've added a TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed test to demonstrate that and so then no need to change the implementation here.

Hm... the date being a numeric type really is quite unintuitive.
I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. :/

I will go and make a tea and then look into that.
Thank you for adding the test for the date field. The new clarity on what value is returned for the date type is very good.

Oh... Thanks github for not refreshing the page... 🙄
After looking into that I saw the same behavior as you did and was just about to comment that. :D

... I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. ...

Likewise, I would have preferred for there to be some existing reusable method and was surprised that there isn't one (or not an obviously findable one). The SolrDocumentFetcher one has the sortableNumeric flag (which we don't need here) and some special logic for LatLonPointSpatialField and AbstractEnumField scenarios (also not applicable here) ... otherwise it would have been a clear "okay, let's factor out a method here and put it somewhere that both SolrDocumentFeature and FieldValueFeature can use it" kind of scenario. Oh well.

cpoerschke · 2021-05-24T12:18:22Z

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

+        if (DocValuesType.NUMERIC.equals(docValuesType)) {
+          return new NumericDocValuesFieldValueFeatureScorer(this, context,
+                  DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS), schemaField.getType().getNumberType());
+        } else if (DocValuesType.SORTED.equals(docValuesType)) {
+          return new SortedDocValuesFieldValueFeatureScorer(this, context,
+                  DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
+        } else if (DocValuesType.NONE.equals(docValuesType)) {
+          // Using a fallback feature scorer because this segment has no documents with a doc value for the current field
+          return new DefaultValueFieldValueFeatureScorer(this, DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
+        }


2/2 ... combined with here diverting any unexpected getNumberType() enums onto the "is not supported" code path? Although having said that, I don't know what the existing FieldValueFeatureScorer would return for a stored DATE field i.e. perhaps back compat considerations apply, hmm.

tomglk · 2021-05-25T17:30:40Z

Hi @cpoerschke! :)

My conclusion was that for back compat reasons the docValues would need to match stored fields behaviour -- I will add a commit with proposed changes shortly -- and that what looks like a "string" value in the implementation is for "boolean" values only really i.e. F/0 and T/1 and when number values are required a number field type would/should be used instead.

I just had the time to look through your changes.
Although I liked the possibility to also handle numbers that are present in a string field, I agree that one should use a numeric field for that. One shouldn't expect the code to handle such cases.
I think your changes made the functionalities that are supported much cleaner. And the consistent behavior between dv and non-dv fields also is a very important point that I admittedly paid not enough attention to. Thanks for keeping an eye on such things!

And advance apologies, the commit has multiple things in one commit though I've tried to at least describe in the commit message details what is in it.

No worries, I read you notes in the commit while looking through the code and had no problem with understanding your thoughts behind the changes.

* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr)

cpoerschke · 2021-05-26T17:04:56Z

... I just had the time to look through your changes. ...

Thanks @tomglk for taking a look!

I've added one more small polish commit to workaround us not yet having 'spotless' or other automatic code formatting for the contrib/ltr though #126 hopes to be a step in that direction.

Otherwise I think we're all done here, what do you think?

tomglk · 2021-05-26T17:26:22Z

Hi @cpoerschke ,thank you for the polishing and please excuse me for adding some spaces in these places. I tried to set up my IDE to mimic the correct formatting, but did not fully succeed.
#126 sounds like a good approach to get to a point where we don't have many formatting changes that blow up the change list of a PR. That would make everything easier.

Otherwise I think we're all done here, what do you think?

I think so too. :)

tomglk · 2021-05-28T18:34:06Z

Thank you for the great collaboration on this PR, @cpoerschke . I really did learn a lot thanks to your constructive remarks and had a lot of fun working on this. :)

…cValues=true" a.k.a. pure DocValues fields. (apache#123) (Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)

tomglk added 4 commits May 12, 2021 16:43

[SOLR-12697] add DocValuesFieldValueFeatureScorer to read docValues f…

2ee8779

…or ltr feature calculation

[SOLR-12697] formatting changes

bdce029

[SOLR-12697] only apply new scorer to fields that are not stored

e6601ee

[SOLR-12697] remove BINARY case because it is not supported

d6e1477

cpoerschke reviewed May 12, 2021

View reviewed changes

tomglk and others added 5 commits May 13, 2021 09:46

[SOLR-12697] only pass fieldType to constructor; determine numberType…

5bc995c

… in constructor instead of once per doc

[SOLR-12697] remove - from fieldnames; randomize indexing order for d…

4559415

…ocuments and commit during indexing; improve formatting of test

[SOLR-12697] determine docValuesType before creating DocValuesFieldVa…

ec4cbfb

…lueFeatureScorer so only the supported types have to be handled later; extract number-conversion to separate method

split dual-purpose DocValuesFieldValueFeatureScorer into two

e6f20f1

* NumericDocValuesFieldValueFeatureScorer * SortedDocValuesFieldValueFeatureScorer

add TestFieldValueFeature test coverage (with caveat)

f16ce3d

caveat: TestFieldValueFeature.testIfADocumentDoesntHaveAFieldDefaultValueIsReturned fails

cpoerschke reviewed May 14, 2021

View reviewed changes

cpoerschke reviewed May 15, 2021

View reviewed changes

solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java Outdated Show resolved Hide resolved

[SOLR-12697] remove method to read sorted values from Scorer for nume…

e5954eb

…ric docValues

cpoerschke reviewed May 16, 2021

View reviewed changes

tomglk added 3 commits May 17, 2021 22:36

[SOLR-12697] add fallback feature scorer that always returns the defa…

da6a635

…ult value; only use one model in tests because of extractAllFeatures==true

[SOLR-12697] test that exception is thrown for unsupported dv type, t…

b105627

…est that right scorer classes are used, add more fields to test

[SOLR-12697] add tests for parsing different sortedDocValues, add ent…

e07c432

…ry to CHANGES.txt

cpoerschke added 3 commits May 20, 2021 09:45

solr/CHANGES.txt edit

443a396

in TestFieldValueFeature reduce potential test interaction

2dbd94e

in FieldValueFeature clarify 'searcher instanceof SolrIndexSearcher' use

9b77154

cpoerschke reviewed May 20, 2021

View reviewed changes

tomglk changed the title ~~[SOLR-12697] Add pure DocValues support to FieldValueFeature~~ SOLR-12697 Add pure DocValues support to FieldValueFeature May 20, 2021

cpoerschke added 4 commits May 20, 2021 18:14

TestFieldValueFeature: replace dvBoolPopularity with dvIsTrendy (form…

c1f3a8e

…er is more numeric, latter is more boolean and copyField from isTrend simplies the document( add)s

out-scope TestLTRReRankingPipeline changes

3c38e91

FieldValueFeature: mention stored=true or docValues=true in javadocs

53cd2fb

tomglk and others added 4 commits May 20, 2021 20:01

[SOLR-12697] add javadoc to explain which type of FieldValueFeatureSc…

b9d3cd0

…orer is used for different types of fields

Merge remote-tracking branch 'github_tomglk/jira/SOLR-12697' into jir…

da57e9c

…a/SOLR-12697 Resolved Conflicts: solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

Revert "out-scope TestLTRReRankingPipeline changes"

abb3632

This reverts commit 3c38e91.

fix for SOLR-11134

a789b12

tomglk and others added 2 commits May 21, 2021 18:22

[SOLR-12697] out-scope TestLTRReRankingPipeline

c42be54

cpoerschke reviewed May 24, 2021

View reviewed changes

cpoerschke mentioned this pull request May 24, 2021

SOLR-11134: restructure TestLTRReRankingPipeline and fix testDifferentTopN test #145

Merged

cpoerschke added 2 commits May 25, 2021 13:21

add TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed()

385d8b2

Merge remote-tracking branch 'origin/main' into jira/SOLR-12697

4348d04

small TestLTROnSolrCloud polish:

ad489d0

* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr)

cpoerschke approved these changes May 26, 2021

View reviewed changes

Merge branch 'main' into jira/SOLR-12697

2c3a368

cpoerschke merged commit 3db4cdd into apache:main May 28, 2021

janhoy mentioned this pull request Apr 8, 2022

Update github PR link to use the new solr repo #797

Merged

This was referenced Sep 1, 2023

NO JIRA: add TestFieldValueFeature.testBooleanValue_docValues() case #1881

Merged

SOLR-15440: FieldValueFeature now uses DocValues when docValues=true and stored=true are combined. #1883

Merged

		assertJQ("/query" + query.toQueryString(),
		"/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}");

SOLR-12697 Add pure DocValues support to FieldValueFeature #123

SOLR-12697 Add pure DocValues support to FieldValueFeature #123

Conversation

tomglk commented May 12, 2021 • edited Loading

Description

Solution

Tests

Please note

Checklist

cpoerschke left a comment

Choose a reason for hiding this comment

tomglk commented May 12, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpoerschke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomglk May 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomglk commented May 19, 2021

cpoerschke left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomglk commented May 20, 2021

tomglk commented May 21, 2021

cpoerschke commented May 24, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomglk commented May 25, 2021

cpoerschke commented May 26, 2021

tomglk commented May 26, 2021 • edited Loading

tomglk commented May 28, 2021

tomglk commented May 12, 2021 •

edited

Loading

tomglk May 16, 2021 •

edited

Loading

cpoerschke left a comment •

edited

Loading

tomglk commented May 26, 2021 •

edited

Loading