Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push value range and set index get operations into BitmapIndex #12315

Merged
merged 11 commits into from
Mar 9, 2022

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 7, 2022

Description

This PR updates the BitmapIndex interface to add several "value" based bitmap retrieval methods, to push the logic of how these are collected into the BitmapIndex implementation, instead of leaking details about the underlying dictionaries (such as GenericIndexed.indexOf returning position or -insertion point).

  ImmutableBitmap getBitmapForValue(@Nullable String value);

  default Iterable<ImmutableBitmap> getBitmapsInRange(
      @Nullable String startValue,
      boolean startStrict,
      @Nullable String endValue,
      boolean endStrict
  )
  {
    return getBitmapsInRange(startValue, startStrict, endValue, endStrict, (index) -> true);
  }

  Iterable<ImmutableBitmap> getBitmapsInRange(
      @Nullable String startValue,
      boolean startStrict,
      @Nullable String endValue,
      boolean endStrict,
      IntPredicate matcher
  );

  Iterable<ImmutableBitmap> getBitmapsForValues(Set<String> values);

The major changes of adding these methods is that now "in", "bound", and "like" filters push this logic into BitmapIndex, so they don't have to know things about the underlying selectors indexes, such as whether or not they are sorted, etc, and now simply call these methods to get a set of ImmutableBitmap to do whatever with.

I consider this the first step of some work I'd like to do on allowing indexes on other types of values, such as numbers and complex types.

This PR also fixes an unrelated bug with PredicateFilteredDimensionSelector, where it would return a "match" for null rows, even if the predicate didn't match null, meaning sometimes rows which were meant to be filtered out would be considered a match inappropriately.


Key changed/added classes in this PR
  • BitmapIndex
  • BitmapIndexSelector
  • ColumnSelectorBitmapIndexSelector
  • StringBitmapIndexColumnPartSupplier
  • InDimFilter
  • BoundFilter
  • LikeFilter

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@clintropolis
Copy link
Member Author

marked WIP because some tests I added for ListFilteredVirtualColumn are failing for "deny" lists. However, these tests also fail when not using the bitmap index at all, so I think there is some underlying bug unrelated to the changes in this PR, which I'm still looking into.

@lgtm-com
Copy link

lgtm-com bot commented Mar 7, 2022

This pull request introduces 1 alert when merging 21a3507 into 28f8bcc - view on LGTM.com

new alerts:

  • 1 for Spurious Javadoc @param tags

@clintropolis clintropolis added Bug and removed WIP labels Mar 8, 2022
@clintropolis
Copy link
Member Author

I actually think the "like" filter here could be done a bit differently and could eliminate the need for BitmapIndex.getIndex which is now only used to power BitmapIndexSelector.getDimensionValues, which itself is only used by the LikeFilter to call (index) -> likeMatcher.matchesSuffixOnly(dimValues, index) to filter the values of BitmapIndex.getBitmapsInRange, the version of which is only called by the "like" filter.... I will see if I can further simplify things

@clintropolis
Copy link
Member Author

turns out there were still a few users of BitmapIndexSelector.getDimensionValues, so can't remove yet, however I went ahead and re-arranged the "like" filter as discussed, which I think seems nicer

// null should match empty rows in multi-value columns
return nullRow && value == null;
// null should match empty rows in multi-value columns if predicate matches null
return nullRow && value == null && predicate.apply(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache predicate.apply(null)?

(Cool bug, btw.)

@@ -97,7 +97,7 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
@Override
public ValueMatcher makeValueMatcher(final Predicate<String> matcherPredicate)
{
final boolean matchNull = predicate.apply(null);
final boolean matchNull = predicate.apply(null) && matcherPredicate.apply(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a cool bug.

}

@Override
public Iterable<ImmutableBitmap> getBitmapsInRange(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block is for a bitmap selector on an all-null column, right? It looks overly complex for that. Shouldn't it check if matcher matches null and then either return a single getBitmap(0), or no bitmaps, appropriately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops yes, I meant to revisit this one. I started with copying the existing path it was taking but forgot about it; it should be better now


endIndex = startIndex > endIndex ? startIndex : endIndex;
final IntIterator rangeIterator = IntListUtils.fromTo(startIndex, endIndex).iterator();
return () -> new Iterator<ImmutableBitmap>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really a proper Iterable. It will only be iterable once; a proper Iterable should be re-iterable. If it doesn't need to be re-iterable, it may be simpler to return an Iterator instead of an Iterable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I'm not really sure they need to be Iterable, I was just matching the pattern being used externally on top of BitmapIndex, so I just fixed it instead of changing them to Iterator for now, but will revisit this later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this to only return a single Bitmap and union them here, then you don't need to worry about it ;)

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the interface change should be done to return just an ImmutableBitmap instead of the Iterator of them. That said, this could be done in a different PR as well given that these interfaces are technically not part of the public API.

/**
* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the values supplied in the specified range
*/
default Iterable<ImmutableBitmap> getBitmapsInRange(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that these are always Union'd together and I'm not really seeing a great reason for why this shouldn't just return a single ImmutableBitmap that has already been unioned instead of an Iterable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but it was a bigger pattern change than I was hoping to deal with right now, so I'd like to save this for a follow-up.

The other thing that I'm unsure of is that currently those bitmap union operations aren't really subject to query timeout or any other sort of limiting, so it might be worth thinking on how we might add some limits to prevent a very large number of bitmap operations from running until completion. The answer may still be to push that enforcement into BitmapIndex, just need to think a bit more about it first.

@Nullable String startValue,
boolean startStrict,
@Nullable String endValue,
boolean endStrict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to package these things up into a holder Object of some sort so that it's a bit easier to adjust parameters if/when we need to. That could probably be done in a subsequent change though.

* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the values supplied in the specified range
* whose dictionary ids also match some predicate
*/
Iterable<ImmutableBitmap> getBitmapsInRange(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here on Iterable

* Get an {@link Iterable} of {@link ImmutableBitmap} corresponding to the specified set of values (if they are
* contained in the underlying column)
*/
Iterable<ImmutableBitmap> getBitmapsForValues(Set<String> values);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here on Iterable too

* Returns the index of "value" in some object whose values are accessible by index some {@link IndexedGetter}, or
* (-(insertion point) - 1) if the value is not present, in the manner of Arrays.binarySearch.
* Returns the index of "value" in some object in the {@link Indexed}, or (-(insertion point) - 1) if the value is
* not present, in the manner of Arrays.binarySearch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally prefer that we remove this bit of documentation. It's an implementation detail that I don't think we should try to enshrine in the docs anymore (especially after this PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just killed this method and moved this logic into GenericIndexed which was the only caller again after the changes in this PR. I think it is ok for GenericIndexed.indexOf to mention that it has this contract for indexOf, since it is an implementation detail that the BitmapIndex implementation of StringBitmapIndexColumnPartSupplier cares about.

@@ -65,7 +65,7 @@
*
* @return index of value, or negative number equal to (-(insertion point) - 1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's here too


endIndex = startIndex > endIndex ? startIndex : endIndex;
final IntIterator rangeIterator = IntListUtils.fromTo(startIndex, endIndex).iterator();
return () -> new Iterator<ImmutableBitmap>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this to only return a single Bitmap and union them here, then you don't need to worry about it ;)

}

@Override
public int getIndex(@Nullable String value)
public Iterable<ImmutableBitmap> getBitmapsInRange(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "include" case, the implementations of these methods seem like they should be able to be a lot simpler. The normal use case provides a much smaller list of things to show, it seems like we could quite rapidly filter through that list and then delegate to something that uses that list.

The deny-list case requires different logic. But, then I wonder how important this actually is, or is the difference between the implementation that you have and this more direct implementation just a few milliseconds such that it doesn't actually matter...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "include" case, the implementations of these methods seem like they should be able to be a lot simpler. The normal use case provides a much smaller list of things to show, it seems like we could quite rapidly filter through that list and then delegate to something that uses that list.

I think this is already doing this, the range iterated over here is the cardinality of the idLookup, not the dictionary of the delegate column, so it only considers ranges of the filtered values. There might be room for improvement, especially in the deny case, though I think it probably would need bigger changes to ListFilteredVirtualColumn and maybe best to do as a follow-up.

Copy link
Contributor

@cheddar cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a green checkmark instead of the gray one.

@gianm
Copy link
Contributor

gianm commented Mar 9, 2022

Leaving a green checkmark instead of the gray one.

😱

@clintropolis clintropolis merged commit 9cfb239 into apache:master Mar 9, 2022
@clintropolis clintropolis deleted the bitmap-index-changes-pt-1 branch March 9, 2022 21:31
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
…e#12315)

* push value range and set index get operations into BitmapIndex

* fix bug

* oops, fix better

* better like, fix test, javadocs

* fix checkstyle

* simplify and fixes

* cache

* fix tests

* move indexOf into GenericIndexed

* oops

* fix tests
abhishekagarwal87 pushed a commit that referenced this pull request May 11, 2022
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
gianm added a commit to gianm/druid that referenced this pull request May 13, 2022
Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since apache#12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.
clintropolis added a commit to clintropolis/druid that referenced this pull request May 18, 2022
@abhishekagarwal87 abhishekagarwal87 removed this from the 0.23.0 milestone May 19, 2022
clintropolis pushed a commit that referenced this pull request May 20, 2022
* Direct UTF-8 access for "in" filters.

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since #12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.

* Add like-filter implementation tests.

* Add in-filter implementation tests.

* Add tests, fix issues.

* Fix style.

* Adjustments from review.
gianm added a commit to gianm/druid that referenced this pull request Jan 2, 2024
It was using virtualColumns.getColumnCapabilities, which only returns
capabilities for virtual columns, not regular columns. The effect of this
is that expression filters (and in some cases, arrayContainsElement filters)
would build value matchers rather than use indexes.

I think this has been like this since apache#12315, which added the
getColumnCapabilities method to BitmapIndexSelector, and included the same
implementation as exists in the code today.

This error is easy to make due to the design of virtualColumns.getColumnCapabilities,
so to help avoid it in the future, this patch renames the method to
getColumnCapabilitiesWithoutFallback to emphasize that it does not return
capabilities for regular columns.
clintropolis pushed a commit that referenced this pull request Jan 3, 2024
* Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities.

It was using virtualColumns.getColumnCapabilities, which only returns
capabilities for virtual columns, not regular columns. The effect of this
is that expression filters (and in some cases, arrayContainsElement filters)
would build value matchers rather than use indexes.

I think this has been like this since #12315, which added the
getColumnCapabilities method to BitmapIndexSelector, and included the same
implementation as exists in the code today.

This error is easy to make due to the design of virtualColumns.getColumnCapabilities,
so to help avoid it in the future, this patch renames the method to
getColumnCapabilitiesWithoutFallback to emphasize that it does not return
capabilities for regular columns.

* Make getColumnCapabilitiesWithoutFallback package-private.

* Fix expression filter bitmap usage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants