Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-10255 Add support for docValues to solr.BinaryField #2536

Merged
merged 5 commits into from
Jul 17, 2024

Conversation

serba
Copy link
Contributor

@serba serba commented Jun 26, 2024

Description

Add support for docValues to solr.BinaryField

Solution

Lucene has support for BinaryDocValuesField since forever. This is pretty straightforward PR with exposing that Lucene field in Solr.

Tests

Added additional asserts to test solr.BinaryField with docValues="true" in existing TestBinaryField test.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

return new org.apache.lucene.document.StoredField(field.getName(), getBytesRef(val));
}

private static BytesRef getBytesRef(Object val) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the code for converting Object val to BytesRef for code reuse (and readability)

}

@Override
public List<IndexableField> createFields(SchemaField field, Object val) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much borrowed from StrField type that supports creating stored and/or docValued Lucene fields for a single Solr field.

@@ -609,7 +609,7 @@ private Object decodeDVField(
case BINARY:
BinaryDocValues bdv = e.getBinaryDocValues(localId, leafReader, readerOrd);
if (bdv != null) {
return BytesRef.deepCopyOf(bdv.binaryValue());
return BytesRef.deepCopyOf(bdv.binaryValue()).bytes;
Copy link
Contributor Author

@serba serba Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this code was returning BytesRef objects that were serialized up the stack with a simple .toString into BytesRef@<hash-code> string. You can revert this change and see it yourself in a test failure.

Now we return a byte[] which is correctly supported by Solr/SolrJ client and serialized as base64.

Note that there's no any additional performance overhead as we were already doing deepCopy into a new byte array and .bytes simply return a reference to a byte array.

for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) i, b);
for (String field : new String[] {"data", "data_dv"}) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review these code changes with "Hide Whitespaces" toggle on.

Screenshot 2024-06-25 at 11 11 10 PM

Alexey Serba added 2 commits June 26, 2024 09:49
We can add it later if it is needed. I don't see a big value in storing
multiple binary payloads where you simply can store a single binary payload
with multiple values.
IndexableField fval = createField(field, val);

if (field.hasDocValues() && !field.multiValued()) {
IndexableField docval = new BinaryDocValuesField(field.getName(), getBytesRef(val));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the crux of the changes - we want to use Lucene BinaryDocValuesField if user specified docValues="true" in the schema.


fval = docval;
}
return Collections.singletonList(fval);
Copy link
Member

@mkhludnev mkhludnev Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May it be simplified as

Stream.of(createField(field, val), 
(field.hasDocValues() && !field.multiValued()) ? new BinaryDocValuesField(field.getName(), getBytesRef(val)):null)
                 .filter(obj -> obj != null)
                 .collect(Collectors.toList());

?

UPD: However, Streams may put too much footprint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my understanding is that this code is getting executed every time the field of this type is being indexed.

I borrowed the code from StrField type that supports creating stored and/or docValued Lucene fields for a single Solr field and there were such optimizations to create an ArrayList only in case of when both stored and docValues options are enabled.

I guess it makes sense to keep this code consistent across different field types until there's some appetite in rewriting this in streams or List.of style across the board.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why to bother with random field. Shouldn't we just remove this file and test, since now we have binary DV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is not testing binary doc values per say, but exception handling for any field type that does not support docValues. I was going through different field types to find the best candidate now and figured RandomSortField is the best one as it will never get DocValues support.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 nice straight-forward change!

@serba
Copy link
Contributor Author

serba commented Jul 17, 2024

@dsmiley @mkhludnev Could you please guys merge this MR if you think it is ready? I don't have write access. Thank you!

@dsmiley dsmiley merged commit 6ab6c4a into apache:main Jul 17, 2024
3 checks passed
dsmiley pushed a commit that referenced this pull request Jul 17, 2024
Co-authored-by: Alexey Serba <[email protected]>
(cherry picked from commit 6ab6c4a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants