-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
push value range and set index get operations into BitmapIndex #12315
push value range and set index get operations into BitmapIndex #12315
Conversation
marked WIP because some tests I added for |
This pull request introduces 1 alert when merging 21a3507 into 28f8bcc - view on LGTM.com new alerts:
|
I actually think the "like" filter here could be done a bit differently and could eliminate the need for |
turns out there were still a few users of |
// null should match empty rows in multi-value columns | ||
return nullRow && value == null; | ||
// null should match empty rows in multi-value columns if predicate matches null | ||
return nullRow && value == null && predicate.apply(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache predicate.apply(null)
?
(Cool bug, btw.)
@@ -97,7 +97,7 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector) | |||
@Override | |||
public ValueMatcher makeValueMatcher(final Predicate<String> matcherPredicate) | |||
{ | |||
final boolean matchNull = predicate.apply(null); | |||
final boolean matchNull = predicate.apply(null) && matcherPredicate.apply(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a cool bug.
} | ||
|
||
@Override | ||
public Iterable<ImmutableBitmap> getBitmapsInRange( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is for a bitmap selector on an all-null column, right? It looks overly complex for that. Shouldn't it check if matcher
matches null
and then either return a single getBitmap(0)
, or no bitmaps, appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops yes, I meant to revisit this one. I started with copying the existing path it was taking but forgot about it; it should be better now
|
||
endIndex = startIndex > endIndex ? startIndex : endIndex; | ||
final IntIterator rangeIterator = IntListUtils.fromTo(startIndex, endIndex).iterator(); | ||
return () -> new Iterator<ImmutableBitmap>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really a proper Iterable. It will only be iterable once; a proper Iterable should be re-iterable. If it doesn't need to be re-iterable, it may be simpler to return an Iterator instead of an Iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. I'm not really sure they need to be Iterable
, I was just matching the pattern being used externally on top of BitmapIndex
, so I just fixed it instead of changing them to Iterator
for now, but will revisit this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change this to only return a single Bitmap and union them here, then you don't need to worry about it ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the interface change should be done to return just an ImmutableBitmap instead of the Iterator of them. That said, this could be done in a different PR as well given that these interfaces are technically not part of the public API.
/** | ||
* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the values supplied in the specified range | ||
*/ | ||
default Iterable<ImmutableBitmap> getBitmapsInRange( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that these are always Union'd together and I'm not really seeing a great reason for why this shouldn't just return a single ImmutableBitmap
that has already been unioned instead of an Iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this, but it was a bigger pattern change than I was hoping to deal with right now, so I'd like to save this for a follow-up.
The other thing that I'm unsure of is that currently those bitmap union operations aren't really subject to query timeout or any other sort of limiting, so it might be worth thinking on how we might add some limits to prevent a very large number of bitmap operations from running until completion. The answer may still be to push that enforcement into BitmapIndex
, just need to think a bit more about it first.
@Nullable String startValue, | ||
boolean startStrict, | ||
@Nullable String endValue, | ||
boolean endStrict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to package these things up into a holder Object of some sort so that it's a bit easier to adjust parameters if/when we need to. That could probably be done in a subsequent change though.
* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the values supplied in the specified range | ||
* whose dictionary ids also match some predicate | ||
*/ | ||
Iterable<ImmutableBitmap> getBitmapsInRange( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here on Iterable
* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the specified set of values (if they are | ||
* contained in the underlying column) | ||
*/ | ||
Iterable<ImmutableBitmap> getBitmapsForValues(Set<String> values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here on Iterable too
* Returns the index of "value" in some object whose values are accessible by index some {@link IndexedGetter}, or | ||
* (-(insertion point) - 1) if the value is not present, in the manner of Arrays.binarySearch. | ||
* Returns the index of "value" in some object in the {@link Indexed}, or (-(insertion point) - 1) if the value is | ||
* not present, in the manner of Arrays.binarySearch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally prefer that we remove this bit of documentation. It's an implementation detail that I don't think we should try to enshrine in the docs anymore (especially after this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i just killed this method and moved this logic into GenericIndexed
which was the only caller again after the changes in this PR. I think it is ok for GenericIndexed.indexOf
to mention that it has this contract for indexOf
, since it is an implementation detail that the BitmapIndex
implementation of StringBitmapIndexColumnPartSupplier
cares about.
@@ -65,7 +65,7 @@ | |||
* | |||
* @return index of value, or negative number equal to (-(insertion point) - 1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's here too
|
||
endIndex = startIndex > endIndex ? startIndex : endIndex; | ||
final IntIterator rangeIterator = IntListUtils.fromTo(startIndex, endIndex).iterator(); | ||
return () -> new Iterator<ImmutableBitmap>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change this to only return a single Bitmap and union them here, then you don't need to worry about it ;)
} | ||
|
||
@Override | ||
public int getIndex(@Nullable String value) | ||
public Iterable<ImmutableBitmap> getBitmapsInRange( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the "include" case, the implementations of these methods seem like they should be able to be a lot simpler. The normal use case provides a much smaller list of things to show, it seems like we could quite rapidly filter through that list and then delegate to something that uses that list.
The deny-list case requires different logic. But, then I wonder how important this actually is, or is the difference between the implementation that you have and this more direct implementation just a few milliseconds such that it doesn't actually matter...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the "include" case, the implementations of these methods seem like they should be able to be a lot simpler. The normal use case provides a much smaller list of things to show, it seems like we could quite rapidly filter through that list and then delegate to something that uses that list.
I think this is already doing this, the range iterated over here is the cardinality of the idLookup, not the dictionary of the delegate column, so it only considers ranges of the filtered values. There might be room for improvement, especially in the deny case, though I think it probably would need bigger changes to ListFilteredVirtualColumn
and maybe best to do as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a green checkmark instead of the gray one.
😱 |
…e#12315) * push value range and set index get operations into BitmapIndex * fix bug * oops, fix better * better like, fix test, javadocs * fix checkstyle * simplify and fixes * cache * fix tests * move indexOf into GenericIndexed * oops * fix tests
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
Directly related: 1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8 ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If necessary, the input set is copied into a ValuesSet. Much logic is simplified, because we always know what type the values set will be. I think that there won't even be an efficiency loss in most cases. InDimFilter is most frequently created by deserialization, and this patch updates the JsonCreator constructor to deserialize directly into a ValuesSet. 2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes during index lookups. 3) Add unsigned comparator to ByteBufferUtils and use it in GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8 bytes can be compared as bytes if, and only if, the comparison is unsigned. 4) Add specialization to GenericIndexed.singleThreaded().indexOf that avoids needless ByteBuffer allocations. 5) Clarify that objects returned by ColumnIndexSupplier.as are not thread-safe. DictionaryEncodedStringIndexSupplier now calls singleThreaded() on all relevant GenericIndexed objects, saving a ByteBuffer allocation per access. Also: 1) Fix performance regression in LikeFilter: since apache#12315, it applied the suffix matcher to all values in range even for type MATCH_ALL. 2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark, which was broken due to calls to strategy.compare in GenericIndexed.fromIterable.
apache#12315)" This reverts commit 9cfb239.
* Direct UTF-8 access for "in" filters. Directly related: 1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8 ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If necessary, the input set is copied into a ValuesSet. Much logic is simplified, because we always know what type the values set will be. I think that there won't even be an efficiency loss in most cases. InDimFilter is most frequently created by deserialization, and this patch updates the JsonCreator constructor to deserialize directly into a ValuesSet. 2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes during index lookups. 3) Add unsigned comparator to ByteBufferUtils and use it in GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8 bytes can be compared as bytes if, and only if, the comparison is unsigned. 4) Add specialization to GenericIndexed.singleThreaded().indexOf that avoids needless ByteBuffer allocations. 5) Clarify that objects returned by ColumnIndexSupplier.as are not thread-safe. DictionaryEncodedStringIndexSupplier now calls singleThreaded() on all relevant GenericIndexed objects, saving a ByteBuffer allocation per access. Also: 1) Fix performance regression in LikeFilter: since #12315, it applied the suffix matcher to all values in range even for type MATCH_ALL. 2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark, which was broken due to calls to strategy.compare in GenericIndexed.fromIterable. * Add like-filter implementation tests. * Add in-filter implementation tests. * Add tests, fix issues. * Fix style. * Adjustments from review.
It was using virtualColumns.getColumnCapabilities, which only returns capabilities for virtual columns, not regular columns. The effect of this is that expression filters (and in some cases, arrayContainsElement filters) would build value matchers rather than use indexes. I think this has been like this since apache#12315, which added the getColumnCapabilities method to BitmapIndexSelector, and included the same implementation as exists in the code today. This error is easy to make due to the design of virtualColumns.getColumnCapabilities, so to help avoid it in the future, this patch renames the method to getColumnCapabilitiesWithoutFallback to emphasize that it does not return capabilities for regular columns.
* Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities. It was using virtualColumns.getColumnCapabilities, which only returns capabilities for virtual columns, not regular columns. The effect of this is that expression filters (and in some cases, arrayContainsElement filters) would build value matchers rather than use indexes. I think this has been like this since #12315, which added the getColumnCapabilities method to BitmapIndexSelector, and included the same implementation as exists in the code today. This error is easy to make due to the design of virtualColumns.getColumnCapabilities, so to help avoid it in the future, this patch renames the method to getColumnCapabilitiesWithoutFallback to emphasize that it does not return capabilities for regular columns. * Make getColumnCapabilitiesWithoutFallback package-private. * Fix expression filter bitmap usage.
Description
This PR updates the
BitmapIndex
interface to add several "value" based bitmap retrieval methods, to push the logic of how these are collected into theBitmapIndex
implementation, instead of leaking details about the underlying dictionaries (such as GenericIndexed.indexOf returning position or -insertion point).The major changes of adding these methods is that now "in", "bound", and "like" filters push this logic into
BitmapIndex
, so they don't have to know things about the underlying selectors indexes, such as whether or not they are sorted, etc, and now simply call these methods to get a set ofImmutableBitmap
to do whatever with.I consider this the first step of some work I'd like to do on allowing indexes on other types of values, such as numbers and complex types.
This PR also fixes an unrelated bug with
PredicateFilteredDimensionSelector
, where it would return a "match" for null rows, even if the predicate didn't match null, meaning sometimes rows which were meant to be filtered out would be considered a match inappropriately.Key changed/added classes in this PR
BitmapIndex
BitmapIndexSelector
ColumnSelectorBitmapIndexSelector
StringBitmapIndexColumnPartSupplier
InDimFilter
BoundFilter
LikeFilter
This PR has: