-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direct UTF-8 access for "in" filters. #12517
Conversation
Directly related: 1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8 ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If necessary, the input set is copied into a ValuesSet. Much logic is simplified, because we always know what type the values set will be. I think that there won't even be an efficiency loss in most cases. InDimFilter is most frequently created by deserialization, and this patch updates the JsonCreator constructor to deserialize directly into a ValuesSet. 2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes during index lookups. 3) Add unsigned comparator to ByteBufferUtils and use it in GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8 bytes can be compared as bytes if, and only if, the comparison is unsigned. 4) Add specialization to GenericIndexed.singleThreaded().indexOf that avoids needless ByteBuffer allocations. 5) Clarify that objects returned by ColumnIndexSupplier.as are not thread-safe. DictionaryEncodedStringIndexSupplier now calls singleThreaded() on all relevant GenericIndexed objects, saving a ByteBuffer allocation per access. Also: 1) Fix performance regression in LikeFilter: since apache#12315, it applied the suffix matcher to all values in range even for type MATCH_ALL. 2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark, which was broken due to calls to strategy.compare in GenericIndexed.fromIterable.
This pull request introduces 2 alerts when merging 7f2af63 into 9177515 - view on LGTM.com new alerts:
|
This pull request introduces 2 alerts when merging 301057b into 7ab2170 - view on LGTM.com new alerts:
|
Added tests and made some updates to satisfy the coverage and style checkers. |
This pull request introduces 2 alerts when merging c1a2d94 into 7ab2170 - view on LGTM.com new alerts:
|
This pull request introduces 2 alerts when merging f7ea7e5 into 7ab2170 - view on LGTM.com new alerts:
|
This pull request introduces 2 alerts when merging ca68d16 into 90531fd - view on LGTM.com new alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm 🤘
*/ | ||
BitmapColumnIndex forRange( | ||
@Nullable String startValue, | ||
boolean startStrict, | ||
@Nullable String endValue, | ||
boolean endStrict, | ||
Predicate<String> matcher | ||
@Nullable Predicate<String> matcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about instead of making this @Nullable
to remove the default
implementation of the version of forRange
which does not take a predicate matcher? I think that is actually the code path your benchmarks were measuring due to how you changed LikeFilter
, since DictionaryEncodedStringIndexSupplier
actually does implement that other version.
The reason it had a default implementation is a left-over artifact of #12315 where all of the methods had to be implemented on BitmapIndex
, meaning a lot of silly implementations were required for things that didn't support those kinds of indexes, so the default
implementation was a laziness on my part.
I think the implementation of the version that doesn't have the matcher should be slightly better than the @Nullable
matcher since it can avoid the checks. Though its probably small fries when computing indexes, everything counts.
I think for ListFilteredVirtualColumn
it would be ok to just use the old 'default' implementation for its version, since its the only other thing that implements this index interface.
Alternatively, we could remove the version that doesn't take a matcher argument and make this version with @Nullable
the only method, though I think that's slightly less cool because having it split (and not having a default) sort of pushes implementors towards providing the more optimal predicate-less implementation. If we do go this route, we should change DictionaryEncodedStringIndexSupplier
to make the version without ranges private and just use that when the matcher is null.
This confusion was my fault I think, this PR made it more obvious that the current state isn't the most intuitive on what to call or implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about instead of making this @nullable to remove the default implementation of the version of forRange which does not take a predicate matcher?
☝️ I did this one
if (clazz.equals(StringValueSetIndex.class)) { | ||
return (T) new GenericIndexedDictionaryEncodedStringValueSetIndex(bitmapFactory, dictionary, bitmaps); | ||
} else if (clazz.equals(DruidPredicateIndex.class)) { | ||
if (bitmaps != null && clazz.equals(StringValueSetIndex.class)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch 👍
GenericIndexed<ImmutableBitmap> bitmaps | ||
) | ||
{ | ||
this.bitmapFactory = bitmapFactory; | ||
this.dictionary = dictionary; | ||
this.bitmaps = bitmaps; | ||
this.dictionary = dictionary.singleThreaded(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is a shame that this method loses the more specific type, but is still probably better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤘
This pull request introduces 2 alerts when merging d61bb23 into 90531fd - view on LGTM.com new alerts:
|
Directly related:
InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
necessary, the input set is copied into a ValuesSet. Much logic is
simplified, because we always know what type the values set will be.
I think that there won't even be an efficiency loss in most cases.
InDimFilter is most frequently created by deserialization, and this
patch updates the JsonCreator constructor to deserialize
directly into a ValuesSet.
Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
during index lookups.
Add unsigned comparator to ByteBufferUtils and use it in
GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
bytes can be compared as bytes if, and only if, the comparison
is unsigned.
Add specialization to GenericIndexed.singleThreaded().indexOf that
avoids needless ByteBuffer allocations.
Clarify that objects returned by ColumnIndexSupplier.as are not
thread-safe. DictionaryEncodedStringIndexSupplier now calls
singleThreaded() on all relevant GenericIndexed objects, saving
a ByteBuffer allocation per access.
Also:
Fix performance regression in LikeFilter: since push value range and set index get operations into BitmapIndex #12315, it applied
the suffix matcher to all values in range even for type MATCH_ALL.
Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
which was broken due to calls to strategy.compare in
GenericIndexed.fromIterable.
Benchmarks:
Improvements to "in" filters (due to using UTF-8 comparisons) and "like" filters (due to fixing a perf regression).
patch
9177515 (master)
deb69d1 (before #12388)