Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Direct UTF-8 access for "in" filters. #12517

Merged
merged 7 commits into from
May 20, 2022
Merged

Conversation

gianm
Copy link
Contributor

@gianm gianm commented May 13, 2022

Directly related:

  1. InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
    ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
    necessary, the input set is copied into a ValuesSet. Much logic is
    simplified, because we always know what type the values set will be.
    I think that there won't even be an efficiency loss in most cases.
    InDimFilter is most frequently created by deserialization, and this
    patch updates the JsonCreator constructor to deserialize
    directly into a ValuesSet.

  2. Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
    during index lookups.

  3. Add unsigned comparator to ByteBufferUtils and use it in
    GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
    bytes can be compared as bytes if, and only if, the comparison
    is unsigned.

  4. Add specialization to GenericIndexed.singleThreaded().indexOf that
    avoids needless ByteBuffer allocations.

  5. Clarify that objects returned by ColumnIndexSupplier.as are not
    thread-safe. DictionaryEncodedStringIndexSupplier now calls
    singleThreaded() on all relevant GenericIndexed objects, saving
    a ByteBuffer allocation per access.

Also:

  1. Fix performance regression in LikeFilter: since push value range and set index get operations into BitmapIndex #12315, it applied
    the suffix matcher to all values in range even for type MATCH_ALL.

  2. Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
    which was broken due to calls to strategy.compare in
    GenericIndexed.fromIterable.

Benchmarks:

Improvements to "in" filters (due to using UTF-8 comparisons) and "like" filters (due to fixing a perf regression).

patch

Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                 1000000         10000  avgt   10  3522.824 ± 35.642  us/op

Benchmark                            (cardinality)                Mode  Cnt     Score    Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10   660.164 ± 19.946  us/op

9177515 (master)

Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                 1000000         10000  avgt   10  6723.385 ± 59.267  us/op

Benchmark                            (cardinality)                Mode  Cnt     Score    Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10  1348.478 ± 12.959  us/op

deb69d1 (before #12388)

Benchmark                         (dictionarySize)  (filterSize)  Mode  Cnt     Score    Error  Units
InFilterBenchmark.doFilter                 1000000         10000  avgt   10  6803.080 ± 87.903  us/op

Benchmark                            (cardinality)                Mode  Cnt     Score   Error  Units
LikeFilterBenchmark.matchLikePrefix        1000000                avgt   10   717.091 ± 3.451  us/op

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since apache#12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.
@lgtm-com
Copy link

lgtm-com bot commented May 13, 2022

This pull request introduces 2 alerts when merging 7f2af63 into 9177515 - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

@lgtm-com
Copy link

lgtm-com bot commented May 13, 2022

This pull request introduces 2 alerts when merging 301057b into 7ab2170 - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

@gianm
Copy link
Contributor Author

gianm commented May 14, 2022

Added tests and made some updates to satisfy the coverage and style checkers.

@lgtm-com
Copy link

lgtm-com bot commented May 14, 2022

This pull request introduces 2 alerts when merging c1a2d94 into 7ab2170 - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

@lgtm-com
Copy link

lgtm-com bot commented May 14, 2022

This pull request introduces 2 alerts when merging f7ea7e5 into 7ab2170 - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

@lgtm-com
Copy link

lgtm-com bot commented May 19, 2022

This pull request introduces 2 alerts when merging ca68d16 into 90531fd - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm 🤘

*/
BitmapColumnIndex forRange(
@Nullable String startValue,
boolean startStrict,
@Nullable String endValue,
boolean endStrict,
Predicate<String> matcher
@Nullable Predicate<String> matcher
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about instead of making this @Nullable to remove the default implementation of the version of forRange which does not take a predicate matcher? I think that is actually the code path your benchmarks were measuring due to how you changed LikeFilter, since DictionaryEncodedStringIndexSupplier actually does implement that other version.

The reason it had a default implementation is a left-over artifact of #12315 where all of the methods had to be implemented on BitmapIndex, meaning a lot of silly implementations were required for things that didn't support those kinds of indexes, so the default implementation was a laziness on my part.

I think the implementation of the version that doesn't have the matcher should be slightly better than the @Nullable matcher since it can avoid the checks. Though its probably small fries when computing indexes, everything counts.

I think for ListFilteredVirtualColumn it would be ok to just use the old 'default' implementation for its version, since its the only other thing that implements this index interface.

Alternatively, we could remove the version that doesn't take a matcher argument and make this version with @Nullable the only method, though I think that's slightly less cool because having it split (and not having a default) sort of pushes implementors towards providing the more optimal predicate-less implementation. If we do go this route, we should change DictionaryEncodedStringIndexSupplier to make the version without ranges private and just use that when the matcher is null.

This confusion was my fault I think, this PR made it more obvious that the current state isn't the most intuitive on what to call or implement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about instead of making this @nullable to remove the default implementation of the version of forRange which does not take a predicate matcher?

☝️ I did this one

if (clazz.equals(StringValueSetIndex.class)) {
return (T) new GenericIndexedDictionaryEncodedStringValueSetIndex(bitmapFactory, dictionary, bitmaps);
} else if (clazz.equals(DruidPredicateIndex.class)) {
if (bitmaps != null && clazz.equals(StringValueSetIndex.class)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 👍

GenericIndexed<ImmutableBitmap> bitmaps
)
{
this.bitmapFactory = bitmapFactory;
this.dictionary = dictionary;
this.bitmaps = bitmaps;
this.dictionary = dictionary.singleThreaded();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a shame that this method loses the more specific type, but is still probably better

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤘

@lgtm-com
Copy link

lgtm-com bot commented May 19, 2022

This pull request introduces 2 alerts when merging d61bb23 into 90531fd - view on LGTM.com

new alerts:

  • 2 for Cast from abstract to concrete collection

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants