Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-10255 Add support for docValues to solr.BinaryField #2536

Merged
merged 5 commits into from
Jul 17, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 38 additions & 1 deletion solr/core/src/java/org/apache/solr/schema/BinaryField.java
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,11 @@
import java.lang.invoke.MethodHandles;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.BytesRef;
Expand All @@ -42,6 +46,11 @@ public void checkSchemaField(SchemaField field) {
SolrException.ErrorCode.SERVER_ERROR,
"Field type " + this + " is 'large'; not supported (yet)");
}
if (field.hasDocValues() && field.multiValued()) {
throw new SolrException(
SolrException.ErrorCode.SERVER_ERROR,
"Field type " + this + " does not support multiple doc values");
}
}

private String toBase64String(ByteBuffer buf) {
Expand Down Expand Up @@ -94,6 +103,10 @@ public IndexableField createField(SchemaField field, Object val) {
log.trace("Ignoring unstored binary field: {}", field);
return null;
}
return new org.apache.lucene.document.StoredField(field.getName(), getBytesRef(val));
}

private static BytesRef getBytesRef(Object val) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the code for converting Object val to BytesRef for code reuse (and readability)

byte[] buf = null;
int offset = 0, len = 0;
if (val instanceof byte[]) {
Expand All @@ -112,7 +125,31 @@ public IndexableField createField(SchemaField field, Object val) {
len = buf.length;
}

return new org.apache.lucene.document.StoredField(field.getName(), buf, offset, len);
return new BytesRef(buf, offset, len);
}

@Override
public List<IndexableField> createFields(SchemaField field, Object val) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much borrowed from StrField type that supports creating stored and/or docValued Lucene fields for a single Solr field.

IndexableField fval = createField(field, val);

if (field.hasDocValues() && !field.multiValued()) {
IndexableField docval = new BinaryDocValuesField(field.getName(), getBytesRef(val));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the crux of the changes - we want to use Lucene BinaryDocValuesField if user specified docValues="true" in the schema.


// Only create list if we have 2 values...
if (fval != null) {
List<IndexableField> fields = new ArrayList<>(2);
fields.add(fval);
fields.add(docval);
return fields;
}

fval = docval;
}
return Collections.singletonList(fval);
Copy link
Member

@mkhludnev mkhludnev Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May it be simplified as

Stream.of(createField(field, val), 
(field.hasDocValues() && !field.multiValued()) ? new BinaryDocValuesField(field.getName(), getBytesRef(val)):null)
                 .filter(obj -> obj != null)
                 .collect(Collectors.toList());

?

UPD: However, Streams may put too much footprint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my understanding is that this code is getting executed every time the field of this type is being indexed.

I borrowed the code from StrField type that supports creating stored and/or docValued Lucene fields for a single Solr field and there were such optimizations to create an ArrayList only in case of when both stored and docValues options are enabled.

I guess it makes sense to keep this code consistent across different field types until there's some appetite in rewriting this in streams or List.of style across the board.

}

@Override
protected void checkSupportsDocValues() { // we support DocValues
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -639,7 +639,7 @@ private Object decodeDVField(
case BINARY:
BinaryDocValues bdv = e.getBinaryDocValues(localId, leafReader, readerOrd);
if (bdv != null) {
return BytesRef.deepCopyOf(bdv.binaryValue());
return BytesRef.deepCopyOf(bdv.binaryValue()).bytes;
Copy link
Contributor Author

@serba serba Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this code was returning BytesRef objects that were serialized up the stack with a simple .toString into BytesRef@<hash-code> string. You can revert this change and see it yourself in a test failure.

Now we return a byte[] which is correctly supported by Solr/SolrJ client and serialized as base64.

Note that there's no any additional performance overhead as we were already doing deepCopy into a new byte array and .bytes simply return a reference to a byte array.

}
return null;
case SORTED:
Expand Down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why to bother with random field. Shouldn't we just remove this file and test, since now we have binary DV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is not testing binary doc values per say, but exception handling for any field type that does not support docValues. I was going through different field types to find the best candidate now and figured RandomSortField is the best one as it will never get DocValues support.

Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,10 @@
-->

<schema name="bad-schema-docValues-unsupported" version="1.6">
<fieldType name="binary" class="solr.BinaryField"/>
<fieldType name="random" class="solr.RandomSortField" />


<!-- change the type if BinaryField gets doc values -->
<field name="id" type="binary" docValues="true"/>
<!-- change the type if RandomSortField gets doc values -->
<field name="id" type="random" docValues="true"/>

</schema>
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@

<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="data" type="binary" stored="true"/>
<field name="data_dv" type="binary" stored="false" docValues="true" />


<uniqueKey>id</uniqueKey>
Expand Down
90 changes: 48 additions & 42 deletions solr/core/src/test/org/apache/solr/schema/TestBinaryField.java
Original file line number Diff line number Diff line change
Expand Up @@ -82,73 +82,78 @@ public void testSimple() throws Exception {
doc = new SolrInputDocument();
doc.addField("id", 1);
doc.addField("data", ByteBuffer.wrap(buf, 2, 5));
doc.addField("data_dv", ByteBuffer.wrap(buf, 2, 5));
client.add(doc);

doc = new SolrInputDocument();
doc.addField("id", 2);
doc.addField("data", ByteBuffer.wrap(buf, 4, 3));
doc.addField("data_dv", ByteBuffer.wrap(buf, 4, 3));
client.add(doc);

doc = new SolrInputDocument();
doc.addField("id", 3);
doc.addField("data", buf);
doc.addField("data_dv", buf);
client.add(doc);

client.commit();

QueryResponse resp = client.query(new SolrQuery("*:*"));
QueryResponse resp = client.query(new SolrQuery("*:*").setFields("id", "data", "data_dv"));
SolrDocumentList res = resp.getResults();
List<Bean> beans = resp.getBeans(Bean.class);
assertEquals(3, res.size());
assertEquals(3, beans.size());
for (SolrDocument d : res) {

Integer id = Integer.parseInt(d.getFieldValue("id").toString());
byte[] data = (byte[]) d.getFieldValue("data");
if (id == 1) {
assertEquals(5, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 2), b);
}

} else if (id == 2) {
assertEquals(3, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 4), b);
}

} else if (id == 3) {
assertEquals(10, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) i, b);
for (String field : new String[] {"data", "data_dv"}) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review these code changes with "Hide Whitespaces" toggle on.

Screenshot 2024-06-25 at 11 11 10 PM

byte[] data = (byte[]) d.getFieldValue(field);
if (id == 1) {
assertEquals(5, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 2), b);
}

} else if (id == 2) {
assertEquals(3, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 4), b);
}

} else if (id == 3) {
assertEquals(10, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) i, b);
}
}
}
}
for (Bean d : beans) {
Integer id = Integer.parseInt(d.id);
byte[] data = d.data;
if (id == 1) {
assertEquals(5, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 2), b);
}

} else if (id == 2) {
assertEquals(3, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 4), b);
}

} else if (id == 3) {
assertEquals(10, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) i, b);
for (byte[] data : new byte[][] {d.data, d.data_dv}) {
if (id == 1) {
assertEquals(5, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 2), b);
}

} else if (id == 2) {
assertEquals(3, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) (i + 4), b);
}

} else if (id == 3) {
assertEquals(10, data.length);
for (int i = 0; i < data.length; i++) {
byte b = data[i];
assertEquals((byte) i, b);
}
}
}
}
Expand All @@ -158,5 +163,6 @@ public void testSimple() throws Exception {
public static class Bean {
@Field String id;
@Field byte[] data;
@Field byte[] data_dv;
}
}