Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-12697 Add pure DocValues support to FieldValueFeature #123

Merged
merged 30 commits into from
May 28, 2021

Conversation

tomglk
Copy link
Contributor

@tomglk tomglk commented May 12, 2021

Description

This PR refers to the issue SOLR-12697.
The problem is, that the current FieldValueFeature only works for stored fields.
This is not optimal, because using DocValues is faster for this use case. Also it increases the index size if you have to store fields only to use them for ltr.

Solution

Note: This PR is based on the work of Stanislav Livotov and Christine Poerschke that can be seen in the jira ticket.

It uses the latest patch (17th May 2019) from the jira ticket as base. I combined that with suggestions from Mrs. Poerschke and my own approach to the problem.

This PR adds the DocValuesFieldValueFeatureScorer as a new Scorer used by the FieldValueFeatureWeight.
The new scorer is used whenever a field has docValues and is not stored. Therefore it does not affect the current functionality but only is applied for fields that could not be used before.
The new scorer checks the type of docValues a field has and handles NUMERIC and SORTED types. For NUMERIC fields, it simply uses the value, the SORTED type gets parsed as number or boolean-flag.

Tests

New fields that have docValues=true were added to the schema.xml in order to test in TestLTROnSolrCloud that the feature-requests also return values for these fields.
The TestLTRReRankingPipeline was changed from a SolrTestCase to a SolrTestCaseJ4 in order to improve readability.

I ran all tests in the package org.apache.solr.ltr.

Please note

I am aware that the structure of the FieldValueFeature is now quite hard to read and the new Scorer is a bit hidden. I decided to add another nested class to the FieldValueFeatureWeight to avoid having to duplicate a lot of code just to change the inner functionality.

Unit tests for the handleBytesRef are still missing. I plan to add them, but wanted to create the PR already so that the general approach can be reviewed and discussed.
Edit: Tests for the parsing of different Types of sorted docValues (T and F flag, numbers, no number) were added in e07c432.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title. (Issue was already present)
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have run ./gradlew check -x test.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Copy link
Contributor

@cpoerschke cpoerschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tomglk for opening this pull request today!

I've added some comments from an initial review and hope to look more fully maybe later this week.

@alessandrobenedetti might also be interested in this.

@tomglk
Copy link
Contributor Author

tomglk commented May 12, 2021

Thank you very much for the quick and detailed feedback, @cpoerschke !
I already have some ideas in mind and will start to improve the code next thing tomorrow morning.

tomglk and others added 5 commits May 13, 2021 09:46
…ocuments and commit during indexing; improve formatting of test
…lueFeatureScorer so only the supported types have to be handled later; extract number-conversion to separate method
* NumericDocValuesFieldValueFeatureScorer
* SortedDocValuesFieldValueFeatureScorer
caveat: TestFieldValueFeature.testIfADocumentDoesntHaveAFieldDefaultValueIsReturned fails
final int collectionSize = 8;
for (int docId = 1; docId <= collectionSize; docId++) {
// put documents in random order to check that advanceExact is working correctly
List<Integer> docIds = IntStream.rangeClosed(1, collectionSize).boxed().collect(toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I haven't across this IntStream.rangeClosed thing before!

Comment on lines 259 to 267
if (popularity != 1) {
// check that empty values will be read as default
doc.setField("dvIntField", popularity);
doc.setField("dvLongField", popularity);
doc.setField("dvFloatField", ((float) popularity) / 10);
doc.setField("dvDoubleField", ((double) popularity) / 10);
doc.setField("dvStrNumField", popularity);
doc.setField("dvStrBoolField", popularity % 2 == 0 ? "T" : "F");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at TestLTROnSolrCloud changes today, this approach here is very elegant, combined with the resultN_features and schema.xml change.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/>
<field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: via the docValues="${solr.tests.numeric.dv}" in the <fieldType name="int" ... and popularity being used in the TestFieldValueFeature we have some existing test coverage for doc values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

Copy link
Contributor

@cpoerschke cpoerschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tomglk. Thanks for working on this pull request! I've added two commits to the branch, let me know what you think? Yet to look at the TestLTRReRankingPipeline test (next week).

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="finalScore" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="finalScoreFloat" type="float" indexed="true" stored="true" multiValued="false"/>
<field name="field" type="text_general" indexed="true" stored="false" multiValued="false"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit with additional TestFieldValueFeature coverage, for the numeric fields only so far i.e. dvStrNumPopularity and dvStrBoolPopularity not yet.

Comment on lines +54 to +57
<copyField source="popularity" dest="dvIntPopularity"/>
<copyField source="popularity" dest="dvLongPopularity"/>
<copyField source="popularity" dest="dvFloatPopularity"/>
<copyField source="popularity" dest="dvDoublePopularity"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: copyField here works for the dv*Popularity fields but the dv(Int|Long|Float|Double)Field fields cannot use it since the defaulting logic and the popularity/10 logic is happening there.

Comment on lines 132 to 133
assertJQ("/query" + query.toQueryString(),
"/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails here on the second and subsequent run of the loop. I don't know yet if that is a test artifact or an actual code issue. The testIfADocumentDoesntHaveAFieldASetDefaultValueIsReturned test does not have the same behaviour since each loop turn uses a different f(feature )store (and its own model) whereas this test uses the test-wide models and its default feature store.

Copy link
Contributor Author

@tomglk tomglk May 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is final private boolean extractAllFeatures; in LTRScoringQuery. It is set to true and therefore all previous features are added to the response instead of just the model features.

This is the logic (line 210 in LTRScoringQuery):

if (this.extractAllFeatures) {
    features = allFeatures;
} else{
    features =  modelFeatures;
}

I didn't find a way to set it to false with a query param.
So we would have two options:

  • use this constructor public LTRScoringQuery(LTRScoringModel ltrScoringModel, boolean extractAllFeatures)
  • adjust the xpath to something like [ends-with(., '<fieldname>=0.0')]

I am trying to adjust the xpath, because I would prefer that solution, but I was not able to find the right pattern so far.

Edit:
Ok, I took a break and realized my mistake (the break seemed to be necessary smh).
I took another look at the jsonpath-adjustment. A pattern like this "['response']['docs'][0][?(@['[fv]']=~ /.* <fieldName> =0.0.*/i )]" should work but it would require an additional dependency for jayway.jsonpath and we would have to manually call JsonPath.read(...) and assert that the result is not null.

I am not sure, whether that's nice, but maybe you want to build up upon that / can generate an idea from here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...

Comment on lines 132 to 133
assertJQ("/query" + query.toQueryString(),
"/response/docs/[0]/=={'[fv]':'"+FeatureLoggerTestUtils.toFeatureVector(field,Float.toString(FIELD_VALUE_FEATURE_DEFAULT_VAL))+"'}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/2 Thanks for investigating in detail here! I agree the explanation is with the extractAllFeatures boolean i.e. the response contains the values for all the features in the feature store. The puzzling thing then though is why the first run of the loop succeeds and only subsequent runs fail. Since all loop runs use the same feature store and surely each loop run should return the same feature values? ...

Comment on lines 65 to 71
for (String field : FIELD_NAMES) {
loadFeature(field, FieldValueFeature.class.getName(),
"{\"field\":\""+field+"\"}");

loadModel("popularity-model", LinearModel.class.getName(),
new String[] {"popularity"}, "{\"weights\":{\"popularity\":1.0}}");
loadModel(field + "-model", LinearModel.class.getName(),
new String[] {field}, "{\"weights\":{\""+field+"\":1.0}}");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/2 ... I think the explanation for the mystery is here i.e. the LinearModel constructor signature and the interleaving of feature and model loading results in the models having different definitions of what allFeatures means. One solution then would be for the test expectations to account for the multiple meanings of allFeatures or (my preference probably) for there to be only one allFeatures meaning e.g. via the separation of feature and model loading. Or perhaps you can think of a third solution even?

for (String field : FIELD_NAMES) {
  loadFeature(field, ...);
}
for (String field : FIELD_NAMES) {
  loadModel(field + "-model", ...);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing the investigation!
I think the example you posted here should work, although I am still a bit irritated regarding the extractAllFeatures = true.

As user I would expect it to always return all the features that were created, no matter if that happened before or after the model.
But the code makes it explicit, that these information should not be modified (allFeatures is an unmodifiableList).
I think that a nicely placed docString could avoid some confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a somewhat quirky behaviour. Operationally it could also cause mysteries e.g. when new features are added to an existing store and the (reasonable) expectation is that they will start producing feature values right away when actually that only happens when the existing model(s) are next re-instantiated. +1 to try and reduce such confusion via documentation somehow.

tomglk added 3 commits May 17, 2021 22:36
…ult value; only use one model in tests because of extractAllFeatures==true
…est that right scorer classes are used, add more fields to test
@tomglk
Copy link
Contributor Author

tomglk commented May 19, 2021

I have added more tests and an entry to the changes.txt.
For now, I just added our names, @cpoerschke , because I wasn't sure about the people in the Jira ticket. I think it would be nice to include them because this PR was built upon that and the patches, but I have no overview who participated how much and the last patch was created ~ 2 years ago.
What do you think?

Also: From my side this looks quite finished now. Do you miss anything?

Copy link
Contributor

@cpoerschke cpoerschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tomglk!

... From my side this looks quite finished now. Do you miss anything?

I agree, we're close to wrapping this up (and am also mindful of the 8.9.0 release coming up soon and us wishing to include this in it). There's a couple of "polishing" bits I'd like to do and me adding small commits directly to the branch (like the three commits just added) might be the most efficient way if that works for you?

Really looking forward to this being in the next release (and also have some ideas on how afterwards we potentially might (under separate JIRA and pull request) (safely) break backwards compatibility in Solr 9.0.0 to allow stored=true docValues=true fields to also benefit from the docValues implementation whilst a deprecated (say) LegacyFieldValueFeature until (say) Solr 10.0.0 retains the existing implementation for users who need that ...

edit: https://issues.apache.org/jira/browse/SOLR-15440 "contrib/ltr FieldValueFeature: DocValues use for stored fields" created for this

Comment on lines 386 to 390
/**
* This class is used to track which specific FieldValueFeature is used so that we can test, whether the
* fallback mechanism works correctly.
*/
public static class ObservingFieldValueFeature extends FieldValueFeature {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selection of test coverage using this new ObservingFieldValueFeature and its observing the scorer type is very comprehensive and elegant, thanks for adding it!

import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.ltr.feature.Feature;
import org.apache.solr.ltr.feature.FieldValueFeature;
import org.apache.solr.ltr.model.LTRScoringModel;
import org.apache.solr.ltr.model.TestLinearModel;
import org.apache.solr.ltr.norm.IdentityNormalizer;
import org.apache.solr.ltr.norm.Normalizer;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.junit.BeforeClass;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After multiple attempts I'm still struggling to understand and appreciate how the TestLTRReRankingPipeline changes relate to and fit in with the other changes. @tomglk would you have an extra perspective on this perhaps?

I'm speculating that the changes were required when the SolrDocumentFetcher implementation was used but now with the TestFieldValueFeature changes the picture has changed.

If the TestLTRReRankingPipeline changes are no longer required and relatively unrelated I would favour git checkout origin/main -- TestLTRReRankingPipeline.java removing them from the pull request scope. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to this test are not necessary for it to work and I agree that they are only loosely related to the new docValues support.
However, I think that the test is way more comprehensive and better to read now.

I would argue that the test is connected to the changes because of the use of the FieldValueFeature (and not e.g. an OriginalScoreFeature) and future developers can benefit from the increased readability.
So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Yes, once the "where is the pure docValues connection?" question is gone and the mental focus is on "code comprehension and readability" it is clear that this is a refactoring change that is indeed very beneficial.

And when considering if the refactoring changes inadvertently changes any existing behaviour the @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-11134") caught my eye and I think I managed to diagnose this long-standing issue:

  • the problem was that new FloatDocValuesField("final-score", ...) resulted in a stored=false field (which was not very apparent) and its schema.xml replacement named finalScore is stored=true i.e. immediate readability and comprehension benefit there.
  • then next the testDifferentTopN turned out to have an incorrect expectation w.r.t. what the score should be (addition of an alternative TestLinearModel.makeFeatureWeights variant tries to make that clearer).
  • and lastly once the testDifferentTopN test passed it interacted with the testRescorer test i.e. if the former ran first then the latter would see not only its own 2 documents but also the documents added by the former test, this is easily fixed by everyone deleting all documents before adding their documents.

... So I would like to keep it in this PR to make it more accessible and easier to built up upon in the future.

I agree we should keep these very beneficial TestLTRReRankingPipeline changes.

If (and that is an if) the SOLR-11134 fix (or something along those lines) makes sense, then how would you feel about us still removing the TestLTRReRankingPipeline change from this PR and instead to have it in a separate PR and to "book it" only under SOLR-11134 or jointly under SOLR-11134 and SOLR-12697 tickets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to investigate this!
I just had a look at your changes and they totally make sense to me.

Because of the existing jira ticket for 11134 it makes sense to me to remove the file from this PR and to book it under a new PR like "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at and reviewing the changes! I've just opened #145 PR for the "SOLR-11134 fix and restructure TestLTRReRankingPipeline.testDifferentTopN test" changes then.

@tomglk
Copy link
Contributor Author

tomglk commented May 20, 2021

Hi @cpoerschke ,
feel free to directly commit to the branch. :)

Your idea to make a new jira ticket and PR to break backwards compatibility sounds good to me. It would be useful to profit from the better performance for more fields. I am on board to continue working on this topic with you.

I'm also exited to see this in solr 8.9.0!

@tomglk tomglk changed the title [SOLR-12697] Add pure DocValues support to FieldValueFeature SOLR-12697 Add pure DocValues support to FieldValueFeature May 20, 2021
…er is more numeric, latter is more boolean and copyField from isTrend simplies the document( add)s
* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr)
* use private and final where possible
* make new scorers final (but not existing scorer for back compat reasons) since no obvious need to extend
tomglk and others added 4 commits May 20, 2021 20:01
…orer is used for different types of fields
…a/SOLR-12697

Resolved Conflicts:
	solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
@tomglk
Copy link
Contributor Author

tomglk commented May 21, 2021

I am currently debugging some ltr stuff and stumbled upon this:

final String string = indexableField.stringValue();
if (string.length() == 1) {
    // boolean values in the index are encoded with a single char contained in TRUE_TOKEN or FALSE_TOKEN
    // (see BoolField)
    if (string.charAt(0) == BoolField.TRUE_TOKEN[0]) {
      return 1f;
    }
    if (string.charAt(0) == BoolField.FALSE_TOKEN[0]) {
      return 0f;
    }
}

This code is in the FieldValueFeatureScorer. Why don't we try to parse the string value as number?
We do that for docValues, but not for stored fields.

tomglk and others added 2 commits May 21, 2021 18:22
…roximately:

* minor style tweaks e.g. new SolrQuery(id:21) instead of new SolrQuery(); setQuery(id:21)

* in testIfADocumentDoesntHaveAFieldTheDefaultValueFromSchemaIsReturned for clarity remove assumption w.r.t. field default values being -1/-2/-3 sequential

* in testThatDefaultFieldValueScorerIsUsedAndDefaultIsReturned replace generic dvTestField name with dvDoubleField and also dvStrBoolField to also cover a non-numeric field

* in testThatExceptionIsThrownForUnsupportedType replace generic dvUnsupportedField with dvStringPopularities for the popularities (plural) naming to help signal unsupportedness

* in testThatCorrectFieldValueFeatureIsUsedForDocValueTypes replace noDocValuesField with noDvFloatField and noDvStrNumField to cover both numeric and non-numeric

* in testThatStringValuesAreCorrectlyParsed also cover non-docValues field _and_ ensure behavioural consistency between dv and non-dv fields [*** this required implementation adjustment for the dv field ***]
@cpoerschke
Copy link
Contributor

Hi @tomglk!

I am currently debugging some ltr stuff and stumbled upon this: ... Why don't we try to parse the string value as number? We do that for docValues, but not for stored fields.

Ah, we seem to have both stumbled upon the same question though perhaps via different routes?!

I was trying to extend TestFieldValueFeature.testThatStringValuesAreCorrectlyParsed to also cover non-docValues field and to ensure behavioural consistency between dv and non-dv fields i.e. docValues and stored fields. My conclusion was that for back compat reasons the docValues would need to match stored fields behaviour -- I will add a commit with proposed changes shortly -- and that what looks like a "string" value in the implementation is for "boolean" values only really i.e. F/0 and T/1 and when number values are required a number field type would/should be used instead.

And advance apologies, the commit has multiple things in one commit though I've tried to at least describe in the commit message details what is in it. Looking forward to your thoughts on it.

Comment on lines +240 to +250
private float readNumericDocValues() throws IOException {
if (NumberType.FLOAT.equals(numberType)) {
// convert float value that was stored as long back to float
return Float.intBitsToFloat((int) docValues.longValue());
} else if (NumberType.DOUBLE.equals(numberType)) {
// handle double value conversion
return (float) Double.longBitsToDouble(docValues.longValue());
}
// just take the long value
return docValues.longValue();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/2 So I was trying to learn and understand better w.r.t. why the intBitsToFloat and longBitsToDouble are the conversions to use here and that led to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L603-L644 and the realisation that DATE is a numberType possibility!

Would you have any thoughts on using a switch statement here with FLOAT/DOUBLE/INT/LONG all valid and to throw an exception for anything else ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat strangely so perhaps the use of DATE fields "just works" for both the existing stored and the new docValues implementations. I've added a TestFieldValueFeature.testThatDateValuesAreCorrectlyParsed test to demonstrate that and so then no need to change the implementation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... the date being a numeric type really is quite unintuitive.
I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. :/

I will go and make a tea and then look into that.
Thank you for adding the test for the date field. The new clarity on what value is returned for the date type is very good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Thanks github for not refreshing the page... 🙄
After looking into that I saw the same behavior as you did and was just about to comment that. :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I am not really fond of copying the switch-logic that you referenced, but sadly the method is private. ...

Likewise, I would have preferred for there to be some existing reusable method and was surprised that there isn't one (or not an obviously findable one). The SolrDocumentFetcher one has the sortableNumeric flag (which we don't need here) and some special logic for LatLonPointSpatialField and AbstractEnumField scenarios (also not applicable here) ... otherwise it would have been a clear "okay, let's factor out a method here and put it somewhere that both SolrDocumentFeature and FieldValueFeature can use it" kind of scenario. Oh well.

Comment on lines +133 to +142
if (DocValuesType.NUMERIC.equals(docValuesType)) {
return new NumericDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS), schemaField.getType().getNumberType());
} else if (DocValuesType.SORTED.equals(docValuesType)) {
return new SortedDocValuesFieldValueFeatureScorer(this, context,
DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
} else if (DocValuesType.NONE.equals(docValuesType)) {
// Using a fallback feature scorer because this segment has no documents with a doc value for the current field
return new DefaultValueFieldValueFeatureScorer(this, DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/2 ... combined with here diverting any unexpected getNumberType() enums onto the "is not supported" code path? Although having said that, I don't know what the existing FieldValueFeatureScorer would return for a stored DATE field i.e. perhaps back compat considerations apply, hmm.

@tomglk
Copy link
Contributor Author

tomglk commented May 25, 2021

Hi @cpoerschke! :)

My conclusion was that for back compat reasons the docValues would need to match stored fields behaviour -- I will add a commit with proposed changes shortly -- and that what looks like a "string" value in the implementation is for "boolean" values only really i.e. F/0 and T/1 and when number values are required a number field type would/should be used instead.

I just had the time to look through your changes.
Although I liked the possibility to also handle numbers that are present in a string field, I agree that one should use a numeric field for that. One shouldn't expect the code to handle such cases.
I think your changes made the functionalities that are supported much cleaner. And the consistent behavior between dv and non-dv fields also is a very important point that I admittedly paid not enough attention to. Thanks for keeping an eye on such things!

And advance apologies, the commit has multiple things in one commit though I've tried to at least describe in the commit message details what is in it.

No worries, I read you notes in the commit while looking through the code and had no problem with understanding your thoughts behind the changes.

* undo distracting reformatting (hopefully one-off and next time 'spotless' gradle plugin will be available for contrib/ltr)
@cpoerschke
Copy link
Contributor

... I just had the time to look through your changes. ...

Thanks @tomglk for taking a look!

I've added one more small polish commit to workaround us not yet having 'spotless' or other automatic code formatting for the contrib/ltr though #126 hopes to be a step in that direction.

Otherwise I think we're all done here, what do you think?

@tomglk
Copy link
Contributor Author

tomglk commented May 26, 2021

Hi @cpoerschke ,thank you for the polishing and please excuse me for adding some spaces in these places. I tried to set up my IDE to mimic the correct formatting, but did not fully succeed.
#126 sounds like a good approach to get to a point where we don't have many formatting changes that blow up the change list of a PR. That would make everything easier.

Otherwise I think we're all done here, what do you think?

I think so too. :)

@cpoerschke cpoerschke merged commit 3db4cdd into apache:main May 28, 2021
@tomglk
Copy link
Contributor Author

tomglk commented May 28, 2021

Thank you for the great collaboration on this PR, @cpoerschke . I really did learn a lot thanks to your constructive remarks and had a lot of fun working on this. :)

bszabo97 pushed a commit to bszabo97/solr that referenced this pull request Jun 14, 2021
…cValues=true" a.k.a. pure DocValues fields. (apache#123)

(Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)
epugh pushed a commit to epugh/solr that referenced this pull request Oct 22, 2021
…cValues=true" a.k.a. pure DocValues fields. (apache#123)

(Stanislav Livotov, Erick Erickson, Tobias Kässmann, Tom Gilke, Christine Poerschke)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants