Query range fields by doc values when they are expected to be more efficient than points #24823

Merged 1 commit on Jul 12, 2017

Conversation

martijnvg (Member):

  • Enable doc values for range fields by default.
  • Store ranges in a binary format that supports multi-valued fields.
  • Add BinaryDocValuesRangeQuery, which can query ranges that have been encoded into a binary doc values field.
  • Wrap range queries on a range field in an IndexOrDocValuesQuery.

Closes #24314

martijnvg added the :Search/Search (Search-related issues that do not fall into other categories), >enhancement, review, and v6.0.0 labels on May 22, 2017

jpountz (Contributor) left a comment:

It looks good overall. My main concern is that given how the encoding works, we have to deserialize the ranges at query time back to two Comparable instances. I think we should design it in such a way that queries can work directly on the encoded representation?

@Override
public boolean matches() throws IOException {
    BytesRef encodedRanges = values.binaryValue();
    NormalizedRange[] ranges = rangeType.decodeRanges(encodedRanges);

jpountz (Contributor):

Could we work directly on the encoded bytes instead of decoding?
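
For illustration, a minimal sketch of what matching on the encoded bytes could look like, assuming both the stored bounds and the query bounds use the same order-preserving byte encoding; the helper name is hypothetical, and BytesRef.compareTo (org.apache.lucene.util.BytesRef) is an unsigned lexicographic comparison:

    // Hypothetical INTERSECTS check that never decodes the bounds.
    static boolean intersects(BytesRef queryFrom, BytesRef queryTo,
                              BytesRef otherFrom, BytesRef otherTo) {
        // Disjoint when the query ends before the stored range starts,
        // or starts after it ends; intersection is the negation.
        return (queryFrom.compareTo(otherTo) > 0 || queryTo.compareTo(otherFrom) < 0) == false;
    }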


@Override
public int hashCode() {
    return classHash() + Objects.hash(fieldName, queryType, expectedRange);

jpountz (Contributor):

nitpick: I'd prefer return Objects.hash(getClass(), fieldName, queryType, expectedRange); which is supposed to be less subject to collisions.

public BytesRef encodeRanges(Set<Range> ranges) throws IOException {
    final byte[] encoded = new byte[ByteUtils.MAX_BYTES_VLONG + (16 * 2) * ranges.size()];
    ByteArrayDataOutput out = new ByteArrayDataOutput(encoded);
    out.writeVInt(ranges.size());

jpountz (Contributor):

since we use a fixed length, maybe we do not need to encode the number of values?

jpountz (Contributor):

Oh, I see: it is to be consistent with other types, which use vlong.

martijnvg (Member Author):

I think we should design it in such a way that queries can work directly on the encoded representation?

The idea here is that we can save some space by encoding the ranges as two vlongs. The downside is that we can't work on the encoded representation. Do you think the vlong encoding isn't worth the effort here? Personally I'm OK with changing the encoding so that we don't need to decode at query time, but I'm wondering whether we should trade space for query speed.

jpountz (Contributor) commented May 23, 2017:

You convinced me that we need to perform some sort of compression for numeric values. However I think we should use an encoding that doesn't require the query to know whether the field stores longs or ip addresses?

jpountz (Contributor) left a comment:

I left some comments.

}
if (includeTo) {
    to = rangeType.nextUp(to);
}

jpountz (Contributor):

This looks a bit backward to me, we would usually rather do

if (includeFrom == false) {
    from = rangeType.nextUp(from);
}

Am I missing something?
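
For context, the convention the comment alludes to is normalizing exclusive bounds inward so the stored interval is closed on both ends; a sketch, where nextUp and nextDown are assumed helpers on the range type that step to the adjacent representable value:

    // Normalize query bounds to a closed interval [from, to].
    if (includeFrom == false) {
        from = rangeType.nextUp(from);   // exclusive lower bound: step up
    }
    if (includeTo == false) {
        to = rangeType.nextDown(to);     // exclusive upper bound: step down
    }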

ByteArrayDataInput in = new ByteArrayDataInput(encodedRanges.bytes, encodedRanges.offset, encodedRanges.length);
int numRanges = in.readVInt();
for (int i = 0; i < numRanges; i++) {
    int length = in.readVInt();

jpountz (Contributor):

the first bytes are not a vint when we encode eg. longs? So I don't think it works?

martijnvg (Member Author):

In RangeFieldMapper.RangeType#encodeLongRanges(...) each encoded value is preceded by a vint.

length = in.readVInt();
bytes = new byte[length];
in.readBytes(bytes, 0, length);
BytesRef otherTo = new BytesRef(bytes);

jpountz (Contributor):

We should try to avoid ByteArrayDataInput and BytesRef allocations since this method may be called in a tight loop.

martijnvg (Member Author):

I can remove the BytesRef instance creation and use just a plain byte[].
The reason I used ByteArrayDataInput is its readVInt() and readBytes() methods. Are there static alternatives that I can use? Otherwise maybe it is sufficient to keep a spare instance around and reuse it?
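
A sketch of the reuse idea: Lucene's ByteArrayDataInput has a no-arg constructor and a reset() method, so a single instance can be pointed at each document's bytes, assuming the matcher is confined to one thread per segment:

    // Keep one reusable input instead of allocating per document.
    private final ByteArrayDataInput in = new ByteArrayDataInput();

    // Inside matches(): point the reusable input at the current doc's bytes.
    in.reset(encodedRanges.bytes, encodedRanges.offset, encodedRanges.length);
    int numRanges = in.readVInt();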

// not disjoint AND not within:
result = (from.compareTo(otherTo) > 0 || to.compareTo(otherFrom) < 0) == false &&
        (from.compareTo(otherFrom) <= 0 && to.compareTo(otherTo) >= 0) == false;
break;

jpountz (Contributor):

should we put a method on the queryType enum instead of this switch?
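
A sketch of that refactoring; the constants mirror the quoted switch arm, and everything beyond the QueryType name is illustrative (BytesRef is org.apache.lucene.util.BytesRef):

    enum QueryType {
        INTERSECTS {
            @Override
            boolean matches(BytesRef from, BytesRef to, BytesRef otherFrom, BytesRef otherTo) {
                // not disjoint
                return (from.compareTo(otherTo) > 0 || to.compareTo(otherFrom) < 0) == false;
            }
        },
        CROSSES {
            @Override
            boolean matches(BytesRef from, BytesRef to, BytesRef otherFrom, BytesRef otherTo) {
                // not disjoint AND not within, as in the switch arm above
                return INTERSECTS.matches(from, to, otherFrom, otherTo)
                        && (from.compareTo(otherFrom) <= 0 && to.compareTo(otherTo) >= 0) == false;
            }
        };

        abstract boolean matches(BytesRef from, BytesRef to, BytesRef otherFrom, BytesRef otherTo);
    }

The query would then call queryType.matches(...) once per document instead of switching on the relation.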

}
return encoded;
}

jpountz (Contributor):

I'm wondering whether putting these helpers into a dedicated class would make things easier to read?

if (sign == 1) {
    encoded[encoded.length - 1 - b] = (byte) (l >>> (8 * b));
} else if (sign == 0) {
    encoded[encoded.length - 1 - b] = (byte) (0xFF - (l >>> (8 * b)));

jpountz (Contributor):

I think we miss a mask here: encoded[encoded.length - 1 - b] = (byte) (0xFF - ((l >>> (8 * b)) & 0xFF));

}
long max = Long.MAX_VALUE / 2;
return (max + max) * random().nextLong() - max;
}

jpountz (Contributor):

We had similar random number generation in eg. TestIntRangeFieldQueries but this proved inefficient at finding bugs (eg. it makes it very unlikely to get twice the same value in the same test) so we changed it in https://issues.apache.org/jira/browse/LUCENE-7847. Maybe we should do something similar here (as well as for doubles, ip addresses, etc.)
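
The LUCENE-7847 approach, roughly: bias generation toward a small pool of interesting values so duplicates and boundary cases actually occur. A hedged sketch (names are illustrative):

    static long nextLongValue(Random random) {
        switch (random.nextInt(5)) {
            case 0:  return Long.MIN_VALUE;
            case 1:  return Long.MAX_VALUE;
            default: return random.nextInt(20) - 10; // small pool: frequent collisions
        }
    }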

private final QueryType queryType;
private final NormalizedRange expectedRange;

public BinaryDocValuesRangeQuery(String fieldName, RangeFieldMapper.RangeType rangeType, QueryType queryType, Object from, Object to,

jpountz (Contributor):

Can we avoid passing the range type? I think it would be more robust if it does not need to know about the type of values that are stored and works directly on the encoded data?

martijnvg (Member Author):

@jpountz Thanks for reviewing. I've updated the PR.

jpountz (Contributor) left a comment:

It looks good overall, I left some comments. I'm wondering whether we should sort the ranges before encoding them as it might allow for optimizations in the multi-valued case eg. if the max value of the query is less than the min value of the first encoded range.

It would be nice to have some benchmarks as well in order to make sure that this actually performs better than point-based ranges when the range does not drive iteration (it might make more sense to do it after LUCENE-7828 is merged). I can help with that.

int numRanges = in.readVInt();
for (int i = 0; i < numRanges; i++) {
    otherFrom.length = in.readVInt();
    in.readBytes(otherFrom.bytes, 0, otherFrom.length);

jpountz (Contributor):

maybe we could avoid the copy by doing this instead:

otherFrom.bytes = encodedRanges.bytes;
otherFrom.offset = in.position();
in.skip(otherFrom.length);

and similarly for the to value?
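
Completing that suggestion for both bounds (a sketch; on Lucene's ByteArrayDataInput the actual methods are getPosition() and skipBytes()):

    otherFrom.bytes = encodedRanges.bytes;
    otherFrom.length = in.readVInt();
    otherFrom.offset = in.getPosition(); // position is absolute in the backing array
    in.skipBytes(otherFrom.length);

    otherTo.bytes = encodedRanges.bytes;
    otherTo.length = in.readVInt();
    otherTo.offset = in.getPosition();
    in.skipBytes(otherTo.length);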

return new BytesRef(encoded, 0, out.getPosition());
}

static byte[] encode(long l) {

jpountz (Contributor):

Maybe add javadocs saying that this is a variable-length encoding that preserves ordering?

jpountz (Contributor):

Also I think we need some unit tests for that method in order to ensure it actually preserves ordering?
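
A sketch of such a test, assuming encode(long) is the method above and its output is meant to compare like the inputs under unsigned lexicographic order (BytesRef.compareTo):

    public void testEncodePreservesOrdering() {
        Random random = new Random();
        for (int iter = 0; iter < 10000; iter++) {
            long a = random.nextLong();
            long b = random.nextLong();
            int expected = Long.compare(a, b);
            int actual = new BytesRef(encode(a)).compareTo(new BytesRef(encode(b)));
            // Only the sign has to agree, not the magnitude.
            assertEquals(Integer.signum(expected), Integer.signum(actual));
        }
    }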

l = Double.doubleToRawLongBits(d);
sign = 1; // means positive
}
return encode(l, sign);

jpountz (Contributor):

doubles should probably use a different encoding, but we can work on it as a follow-up

jpountz (Contributor):

Also I think we should specialize floats too
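
For reference, Lucene already ships monotonic bit conversions that such specializations could build on; NumericUtils (org.apache.lucene.util.NumericUtils) is real, and wiring it in here is only a sketch:

    // The resulting long/int order matches the natural order of the
    // floating-point inputs, including negative values.
    long sortableDouble = NumericUtils.doubleToSortableLong(1.5d);
    int sortableFloat = NumericUtils.floatToSortableInt(1.5f);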

martijnvg (Member Author):

@jpountz I've updated the PR.

I'm wondering whether we should sort the ranges before encoding them as it might allow for optimizations in the multi-valued case eg. if the max value of the query is less than the min value of the first encoded range.

I've added the sorting, but I think it will only make sense when there are multiple adjacent ranges; when ranges are overlapping it wouldn't help that much?

It would be nice to have some benchmarks as well in order to make sure that this actually performs better than point-based ranges when the range does not drive iteration (it might make more sense to do it after LUCENE-7828 is merged). I can help with that.

Thanks! I guess we will need a dataset that, when benchmarked with a specific query, forces the range query to be executed in random-access mode.

jpountz (Contributor) commented Jun 7, 2017:

I've added the sorting, but I think it will only make sense when there are multiple adjacent ranges; when ranges are overlapping it wouldn't help that much?

Right, I don't think this is a game changer for queries in general. However it might be the case for aggregations? For instance, say you want to run a histogram aggregation on a range field; it would be easier with sorted ranges, since you can easily track which buckets have already been incremented and which buckets haven't.

jpountz (Contributor) commented Jun 8, 2017:

I merged LUCENE-7828 yesterday so that we can better benchmark this change, but upgrading to a new Lucene 7 snapshot is currently blocked on #25028 since the postings highlighter has been removed from Lucene master. It should hopefully be merged in the next few days.

martijnvg (Member Author) commented Jul 5, 2017:

(Updated the benchmark to index the docs in arbitrary order and captured the 50th percentile instead of the 99th because it is less noisy.)

These are the 50th percentile latency results of the new benchmark added to Rally, comparing master (baseline) against this PR (contender):

Metric                    Operation                                                Baseline    Contender   Diff        Unit
50th percentile latency   range_field_big_range                                    43.802      43.7695     -0.03246    ms
50th percentile latency   range_field_small_range                                  20.4401     14.685      -5.75507    ms
50th percentile latency   range_field_conjunction_big_range_small_term_query      53.0726     19.2134     -33.8592    ms
50th percentile latency   range_field_conjunction_small_range_small_term_query    26.9517     18.5648     -8.38684    ms
50th percentile latency   range_field_conjunction_small_range_big_term_query      75.8826     67.3323     -8.55028    ms
50th percentile latency   range_field_conjunction_big_range_big_term_query        20651       17353       -3298.01    ms
50th percentile latency   range_field_disjunction_small_range_small_term_query    49.0655     39.0006     -10.0649    ms
50th percentile latency   range_field_disjunction_big_range_small_term_query      7185.2      86.6476     -7098.55    ms

jpountz (Contributor) left a comment:

I think we need to do two things before merging:

  • back compat in the mapper, so that pre-6.0 mappers know they do not have doc values
  • sort ranges by from, then to


@Override
public String toString(String field) {
    return "BinaryDocValuesRangeQuery(fieldName=" + field + ")";

jpountz (Contributor):

let's put from and to in the toString()?

long r2From = ((Number) r2.from).longValue();
long r2To = ((Number) r2.to).longValue();
long middle2 = r2From + ((r2To - r2From) / 2);
return Long.compare(middle1, middle2);

jpountz (Contributor):

I'd rather sort by from, then to; it feels like something that is easier to rely on.
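
A sketch of that ordering, assuming the bounds are kept in their encoded, directly comparable form (e.g. BytesRef):

    // Sort primarily by from, breaking ties by to.
    ranges.sort((r1, r2) -> {
        int cmp = r1.from.compareTo(r2.from);
        return cmp != 0 ? cmp : r1.to.compareTo(r2.to);
    });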

@@ -194,7 +201,7 @@ public RangeFieldType(RangeType type) {
         super();
         this.rangeType = Objects.requireNonNull(type);
         setTokenized(false);
-        setHasDocValues(false);
+        setHasDocValues(true);

jpountz (Contributor):

I think we need to handle backcompat in the parser to make sure we use false as a default value if the index was created before 6.0?
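
A hedged sketch of that backcompat check; the exact hook in the mapper's type parser differs, and V_6_0_0_alpha1 stands in for whichever Version constant marks the cutoff:

    // Indices created before 6.0 never indexed doc values for range
    // fields, so only default to true for newer indices.
    boolean defaultDocValues = indexCreatedVersion.onOrAfter(Version.V_6_0_0_alpha1);
    fieldType.setHasDocValues(defaultDocValues);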

martijnvg (Member Author):

@jpountz I've updated the PR.

Commit: Query range fields by doc values when they are expected to be more efficient than points.

* Enable doc values for range fields by default.
* Store ranges in a binary format that supports multi-valued fields.
* Added BinaryDocValuesRangeQuery that can query ranges that have been encoded into a binary doc values field.
* Wrap range queries on a range field in an IndexOrDocValuesQuery.

Closes elastic#24314

Labels: >enhancement, :Search/Search (Search-related issues that do not fall into other categories), v6.0.0-beta1