remake column indexes and query processing of filters #12388

clintropolis · 2022-04-01T20:39:18Z

Description

Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.

Index changes

BitmapIndex has been split into 4 concrete implementations, DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, and DruidPredicateIndex.

`DictionaryEncodedStringValueIndex` and `BitmapColumnIndex`

DictionaryEncodedStringValueIndex is identical to BitmapIndex prior to #12315, and is a 'leaky' abstraction that is very much tied to how string dictionary encoded columns store a lexicographical sorted value dictionary, and also a lexicographical sorted index dictionary for the index on those values. Filter implementations no longer use this index directly, since it requires knowing a bunch of implementation details which filtering need not concern themselves with. Instead, the other 3 implementations produce a new thing

public interface BitmapColumnIndex
{
  ColumnIndexCapabilities getIndexCapabilities();

  <T> T computeBitmapResult(BitmapResultFactory<T> bitmapResultFactory);
}

which captures some information about that index, and can be computed into a T with the BitmapResultFactory (e.g. into a ImmutableBitmap representing the rows selected by that filter to use with a BitmapOffset). A promise of a bitmap, and information about it for that filter against those inputs, if you will.

`StringValueSetIndex`

StringValueSetIndex is the index implementation that the majority of our current filters which support indexes use, and provides facilities for gettingBitmapColumnIndex for a single string value to match, a set of string values.

`LexicographicalRangeIndex`

LexicographicalRangeIndex captures the pattern of the optimization which can be done for bound and like filters in certain cases, where taking advantage of the fact that the indexes are stored in lexicographical order for string dimensions, a ImmutableBitmap can be produced by gathering all indexes within the start and end dictionary id of the underlying values and unioning them.

`DruidPredicateIndex`

DruidPredicateIndex is an index which can be constructed using a DruidPredicateFactory, which for dictionary encoded string columns means the string predicate is used and evaluated against all of the values which are present in the value dictionary.

`ColumnIndexSelector` and `ColumnIndexSupplier`

The methods by which Filter obtain these indexes from columns has also been heavily reworked. BitmapIndexSelector is now called ColumnIndexSelector, and the focus here is a new method:

  @Nullable
  ColumnIndexSupplier getIndexSupplier(String column);

which Filter implementations can call into to get an index supplier for the column they are interested in using indexes.

public interface ColumnIndexSupplier
{
  @Nullable
  <T> T as(Class<T> clazz);
}

All columns which "exist" will have a non-null ColumnIndexSupplier, so Filter implementations can use a null response here to check if the filter matches the null value, and produce an all true or all false index instead.

ColumnHolder has also been modified to instead of having a Supplier<BitmapIndex> and Supplier<SpatialIndex> to use ColumnIndexSupplier. VirtualColumn also contains a method to fetch a ColumnIndexSupplier, with additional access to a ColumnSelector so they can potentially read stuff from an underlying segment when constructing it.

This lets the implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector (formerly known as ColumnSelectorBitmapIndexSelector), pass through the either the virtual or physical column index supplier, or null if the column is missing entirely.

For dictionary encoded string columns, the DictionaryEncodedStringIndexSupplier.getIndex method can supply implementations of DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, DruidPredicateIndex and, through the same interface, SpatialIndex, without any specific methods to fetch any of these implementations.

`ColumnIndexCapabilities`

The vast majority of the changes in this PR are simply re-arranging existing pieces. There is one exception to this however, which is the previously mentioned ColumnIndexCapabilities type. I've preemptively added this type in anticipation of adding additional index types, and gone ahead and wired this up into the filter partitioning logic that currently resides in QueryableIndexStorageAdapter.

ColumnIndexCapabilities is sort of an analog to ColumnCapabilities, but for information about a specific conjuring of BitmapColumnIndex. I've added two methods here which no current filters actually take advantage of, but that I know will be of use in the near future as we begin to add additional index implementations: 'is exact' and 'is invertible', the former of which captures the pattern of indexes which can be used to reduce the total number of rows scanned, or which may produce false positives and so should still be 'post' filtered with a value matcher, and the latter for index types which might have false positives and so cannot be used in a 'not' filter.

Filter changes

The changes to indexes greatly simplify all of the Filter implementations. All Filter bitmap index methods have been consolidated into a single method:

  @Nullable
  BitmapColumnIndex getBitmapColumnIndex(ColumnIndexSelector selector);

which simply if not null, means that the filter has a BitmapColumnIndex that can be computed into an offset.

This single method replaces:

ImmutableBitmap getBitmapIndex(BitmapIndexSelector selector)
<T> T getBitmapResult(BitmapIndexSelector selector, BitmapResultFactory<T> bitmapResultFactory);
boolean supportsBitmapIndex(BitmapIndexSelector selector);
boolean shouldUseBitmapIndex(BitmapIndexSelector selector);

Future work

ColumnCapabilities methods hasBitmapIndexes and hasSpatialIndexes no longer accurately describe columns and should probably be adapted into the set of classes which the ColumnIndexSupplier can provide. Likewise, dimension declarations on ingestion specs should eventually also be modified to be more flexible in the types of indexes which can be specified, though I'm not sure yet what this should look like quite yet.

The mechanics of how indexes work is still a bit leaky after this change, primarily in that QueryableIndexStorageAdapter knows too much about partitioning filters into 'pre' and 'post' filters when building the cursor. I would like to push this into Filter itself if possible, potentially getting to where instead of QueryableIndexStorageAdapter examining and finding which filters support indexes and collecting the bitmaps from them to create a bitmap offset and using any remaining filters as a post-filtered offset, that instead perhaps the Filter itself can create the offset that is used when constructing the cursor. Since ColumnIndexSupplier isn't at all opinionated about what types of indexes columns can provide, it opens up the door to creating offsets from information stored in the segment without bitmaps being required.

However, I defer these changes for the future since I've already changed enough here. This does seem like it should be possible though, since the filters already know how to create everything that goes into the current offset construction (BitmapColumnIndex and ValueMatcher/VectorValueMatcher), but would also allow much larger flexibility when constructing the offsets as well as releasing QueryableIndexStorageAdapter from having to know anything at all about filters.

Key changed/added classes in this PR

BitmapIndex -> DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, DruidPredicateIndex
BitmapColumnIndex
BitmapIndexSelector -> ColumnIndexSelector
ColumnHolder
ColumnIndexSupplier
DictionaryEncodedStringIndexSupplier
QueryableIndexStorageAdapter
ColumnIndexCapabilities
VirtualColumn
Filter and all implementations

This PR has:

been self-reviewed.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

… providing the methods directly, to be more consistent with ColumnHolder

jihoonson

LGTM. The new interfaces including ColumnIndexCapabilities and ColumnIndexSupplier make sense to me.

imply-cheddar

I've left some comments on the general interfaces and structuring. There are ambiguities hidden in them that I think make things more complex than they need to be. I haven't gone into the nitty gritty details of all of the implementations, but after looking at about 50% of the files I convinced myself that the suggestions made sense and would simplify things so I wanted to submit this review with the suggestions first.

It's possible I missed things in the implementations that my suggestions don't line up with, if so, please point them out.

imply-cheddar · 2022-04-11T04:40:32Z

processing/src/main/java/org/apache/druid/query/filter/BooleanFilter.java

-        return false;
+      ColumnIndexCapabilities other = f.getIndexCapabilities(selector);
+      if (other == null) {
+        return null;


Is it really okay to short-circuit all of the other filters in this case?

imply-cheddar · 2022-04-11T04:44:49Z

processing/src/main/java/org/apache/druid/query/filter/Filter.java

@@ -150,7 +129,7 @@ default VectorValueMatcher makeVectorMatcher(VectorColumnSelectorFactory factory
   *
   * @return true if this Filter supports selectivity estimation, false otherwise.
   */
-  boolean supportsSelectivityEstimation(ColumnSelector columnSelector, BitmapIndexSelector indexSelector);
+  boolean supportsSelectivityEstimation(ColumnSelector columnSelector, ColumnIndexSelector indexSelector);


Once we move the selectivityEstimation value as suggested elsewhere, this method can be completely eliminated.

imply-cheddar · 2022-04-11T04:50:17Z

processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java

-      final BitmapIndex bitmapIndex = selector.getBitmapIndex(dimension);
-      return bitmapResultFactory.unionDimensionValueBitmaps(getBitmapIterable(values, bitmapIndex));
+      final StringValueSetIndex valueSetIndex = selector.as(dimension, StringValueSetIndex.class);
+      if (valueSetIndex == null) {


It strikes me that this could be null because the column doesn't exist OR it could be null because the column exists, but doesn't support StringValueSetIndex. the .as() method should likely not actually take a String parameter and instead the call should be split into 2 calls, one for the column and then one for the interface. In this way, the semantics of null becomes extremely clear.

imply-cheddar · 2022-04-11T04:53:36Z

processing/src/main/java/org/apache/druid/query/filter/ColumnIndexSelector.java

  @Nullable
-  ImmutableBitmap getBitmapIndex(String dimension, String value);
-  ImmutableRTree getSpatialIndex(String dimension);
+  <T> T as(String column, Class<T> clazz);


The fact that .as() has two arguments makes this call site ambiguous and introduces superfluous methods like getIndexCapabilities(). We should instead just do a .get(String column) that returns ColumnIndexSupplier (this can be used to determine if the column exists or not) and then we can use the ColumnIndexSupplier object to call .getIndexAs() and get the implementation. If we do this, we don't need getIndexCapabilities() at all.

This makes sense to just expose the index supplier here, and eliminates the need to check capabilities as a proxy for if the column exists or not. However, I think we still need index capabilities for other reasons to provide information about characteristics of the indexes themselves. It would need to return a non-null index supplier for all columns that exist, and that index supplier can just return nulls for capabilities and indexes, which is easy enough.

For the capabilities, what I was suggesting is, e.g. where LexicographicRangeIndex has method

default Iterable<ImmutableBitmap> getBitmapsInRange( @Nullable String startValue, boolean startStrict, @Nullable String endValue, boolean endStrict )

It could have signature

default FilterBundle getBitmapsInRange( @Nullable String startValue, boolean startStrict, @Nullable String endValue, boolean endStrict )

Where FilterBundle would be a new class that could have fields for bitmap indexes, value predicates, selectivity estimation, whether it's safe to invert, etc. etc. Which, I believe meets the need that you are highlighting for ColumnIndexCapabilities.

I think FilterBundle is going to be a bit more complex than you're hoping, which is part of why I didn't try to consolidate getIndexCapabilities and getIndex yet. Not all value matchers are predicates, so it can't strictly operate on predicates, and value matchers need the cursor created from the offset of the 'pre' part of the filter bundle to actually be created, so it might be a factory that can make value matchers to use as a post-filter.

imply-cheddar · 2022-04-11T05:04:04Z

processing/src/main/java/org/apache/druid/query/filter/Filter.java

   */
-  double estimateSelectivity(BitmapIndexSelector indexSelector);
+  double estimateSelectivity(ColumnIndexSelector indexSelector);


Similar to some other suggestions made, I think that selectivity estimation is a function of the bitmap that is returned. We should perhaps move it to an object that is returned when the bitmap is gotten instead of having it here. Looking at, e.g. the BoundsFilter implementation, this method effectively does all of the work to build the bitmap only to then figure out if it's okay to actually use the bitmap. If it is okay, it's going to do all of the work to build the bitmap again. If we have to do the work anyway, might as well just do it once and the current API doesn't actually allow the code to do that...

Maybe we don't fix it in this PR and we fix it in a follow-up that additionally adds the various things for IndexCapabilities, but either way we should put a target on this method as it's an abstraction that can be improved.

I didn't explicitly mention it, but I was opting to refactor selectivity estimation as a follow-up change, since this was already pretty big, so I'd prefer a follow-up I think

imply-cheddar · 2022-04-11T08:15:57Z

processing/src/main/java/org/apache/druid/segment/ColumnSelectorColumnIndexSelector.java

+
+  @Nullable
+  @Override
+  public <T> T as(String column, Class<T> clazz)


Looking at the implementation for as() and getColumnIndexCapabilities(), I really think this should just be getColumnIndexSupplier() where null effectively means "not filterable". We then have a choice of whether we return a NullColumnIndexSupplier for cases where the dimension doesn't exist (i.e. at this level, act like it does exist by returning something that knows how to deal with all nulls) or also have null mean that the column didn't exist. Personally, I tend to think that we should return a NullColumnIndexSupplier and let the callers believe the column actually existed, but maybe there's reasons not to do that.

imply-cheddar · 2022-04-12T02:49:28Z

processing/src/main/java/org/apache/druid/segment/column/ColumnIndexCapabilities.java

+ * Sort of like {@link ColumnCapabilities}, except for indexes supplied by {@link ColumnIndexSelector}, provides
+ * information for how query processing may use indexes.
+ */
+public interface ColumnIndexCapabilities


The values that this class represent (isInvertible and isExact) are not functions of a column, but are functions of a specific Bitmap returned by said column. ColumnIndexCapabilities belongs to a column and is acting like it represents all things from that column, it cannot, however do that. To handle this case, we should instead move these things onto the actually ImmutableBitmap object or some other wrapper object that gets return and indicates what all can be done. I would recommend that put them on an object that also contains the ValueMatcher for when a ValueMatcher is required.

As is, this interface and API structure is not actually in alignment with the physical realities of the data and the extensions, so we will, once again, just be fighting with it. So, we should fix this before merging. One way to fix it (and my recommendation) is to just remove the Capabilities altogether here and re-introduce isInvertible and isExact in the PR that also introduces a combination of BitmapIndex and ValueMatchers.

ColumnIndexCapabilities belongs to a column and is acting like it represents all things from that column, it cannot, however do that.

This isn't true, getIndexCapabilities takes both a column, and a specific class implementation of the index, so these are the capabilities of a specific index of the column. I guess I wasn't considering the case where an index might sometimes produce exact matches and sometimes not, since in my mind these wouldn't be mixed within the same index, but I wasn't thinking clever enough about it and these cases probably do exist.

imply-cheddar · 2022-04-12T02:50:56Z

processing/src/main/java/org/apache/druid/segment/FilteredOffset.java

+      final ColumnIndexCapabilities capabilities = postFilter.getIndexCapabilities(bitmapIndexSelector);
+      // we only consider "exact" filters here, because if false, we've already used the bitmap index for the base
+      // offset
+      if (capabilities != null && capabilities.isExact()) {


Instead of checking capabilities, this can very easily be

BitmapIndex index = postFilter.getBitmapIndex(bitmapIndexSelector); if (index != null) {

imply-cheddar · 2022-04-12T02:55:32Z

processing/src/main/java/org/apache/druid/segment/column/ColumnIndexSupplier.java

+   * which can greatly reduce the total number of rows which need to be scanned and processed.
+   */
+  @Nullable
+  <T> T getIndex(Class<T> clazz);


Naming nit: .getIndexAs() to indicate that we aren't trying to access a field called Index, but we are asking the object to "morph" itself into the specified class.

hmm, this isn't an index itself though (or at least wasn't in my mind), but a sort of index supplier, so in my mind were asking if the supplier could give us an index of a specific type 🤷

TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As was actually coming from me not liking the word Index and wanting something with extra words to make it more descriptive.

oops, I meant to rename this back to as after the changes from the first round of reviews, it only called it getIndex because as seemed strange when the interface also had getIndexCapabilities, so I've renamed this method to as.

I actually want to work on reclaiming the word 'index' to primarily be used for filtering/cursors so that the codebase is more 'friendly', but not in this PR.

imply-cheddar · 2022-04-12T02:56:56Z

processing/src/main/java/org/apache/druid/segment/column/ColumnIndexSupplier.java

+   * in unavailable.
+   */
+  @Nullable
+  <T> ColumnIndexCapabilities getIndexCapabilities(Class<T> clazz);


As suggested in other comments, perhaps move this to be fields on the Bitmap object returned by the index object instead of generic representations of the column as a whole.

… indexes

…ges-pt-2

rohangarg

LGTM after CI 👍

imply-cheddar

Still part way through review, need to run to a meeting, leaving comments so taht they don't sit pending foreverz

imply-cheddar · 2022-05-06T01:27:21Z

processing/src/main/java/org/apache/druid/segment/column/ColumnIndexSupplier.java

+   * which can greatly reduce the total number of rows which need to be scanned and processed.
+   */
+  @Nullable
+  <T> T getIndex(Class<T> clazz);


TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As was actually coming from me not liking the word Index and wanting something with extra words to make it more descriptive.

imply-cheddar · 2022-05-06T01:37:03Z

processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java

-            size += StringUtils.estimatedBinaryLengthAsUTF8(value) *
-                    ((long) bitmapIndex.getBitmapForValue(value).size());
+      ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier();
+      if (indexSupplier != null) {


nit: this is a != on an if with an else, invert them to remove the negation please.

imply-cheddar · 2022-05-06T01:39:53Z

processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java

+        final DictionaryEncodedStringValueIndex valueIndex = indexSupplier.getIndex(
+            DictionaryEncodedStringValueIndex.class
+        );
+        Preconditions.checkNotNull(valueIndex, "Column [%s] does not have a value index", columnName);


This should be equivalent to capabilities.hasBitmapIndexes(). How about we remove that check above and replace it with the 2-phased checks for the ColumnIndexSupplier and then that it can return a DictionaryEncodedStringValueIndex?

imply-cheddar · 2022-05-06T01:40:58Z

processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java

+        cardinality = valueIndex.getCardinality();
+        if (analyzingSize()) {
+          for (int i = 0; i < cardinality; ++i) {
+            String value = valueIndex.getValue(i);


While we are here, could we like... use the getUTF8Bytes and just take .remaining() instead?

ah, I completely agree this is silly, but I have bigger plans with plumbing through the utf8 bytes into filters, so do you mind if I save this for a follow-up PR?

imply-cheddar · 2022-05-06T01:46:23Z

processing/src/main/java/org/apache/druid/segment/ColumnSelectorColumnIndexSelector.java

+      indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector);
+    } else {
+      final ColumnHolder columnHolder = columnSelector.getColumnHolder(column);
+      if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) {


Why the isFilterable check? If the ColumnHolder believes that it is not filterable, wouldn't it just return a null IndexSupplier? Or do we want a not filterable column to act like it doesn't even exist?

Or do we want a not filterable column to act like it doesn't even exist?

This is the way it is currently intended to behave, if it is not filterable the column is meant to be treated as only nulls. I lost some comments about what was going on along the way, which i will try to add back.

// for missing columns and columns with types that do not support filtering, // treat the column as if it were a String column full of nulls. // Create a BitmapIndex so that filters applied to null columns can use // bitmap indexes. Filters check for the presence of a bitmap index, this is used to determine // whether the filter is applied in the pre or post filtering stage.

To be fair, i'm not completely certain this is the best way to handle this, since it does create odd situations where unfilterable IS NULL matches every row for a column that is not filterable when amusingly the value is not null, so I will be revisiting this in the future I think. It should probably never match anything ever if the column is not filterable, but that probably needs discussion since would be a behavior change, so i'm only going to make sure the comments documenting existing behavior are there and save logic changes for future work.

Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...

imply-cheddar · 2022-05-06T01:49:58Z

processing/src/main/java/org/apache/druid/segment/QueryableIndexIndexableAdapter.java

@@ -383,6 +391,9 @@ public BitmapValues getBitmapValues(String dimension, int dictId)
    }
  }

+  /**
+   * this method is only for testing, don't use it for stuff


Does it really need to exist? Could it be a helper method in a class in the test package?

added a getQueryableIndex method so that this method could be moved to the only test that uses it

imply-cheddar · 2022-05-06T01:52:34Z

processing/src/main/java/org/apache/druid/segment/QueryableIndexStorageAdapter.java

@@ -146,8 +148,9 @@ public Comparable getMinValue(String dimension)
  {
    ColumnHolder columnHolder = index.getColumnHolder(dimension);
    if (columnHolder != null && columnHolder.getCapabilities().hasBitmapIndexes()) {
-      BitmapIndex bitmap = columnHolder.getBitmapIndex();
-      return bitmap.getCardinality() > 0 ? bitmap.getValue(0) : null;
+      ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier();


Total drive-by comment as it's not really related to this PR, but this getMinValue seems like a weird method to have on this API. It looks like it's used for some join logic? I wonder if that logic couldn't just specialize to the concrete class it expects and get the method that way instead...

agree it is strange, and is used only by SegmentAnalyzer so agree it should be reworked though i might save it for a follow-up PR

imply-cheddar · 2022-05-06T01:56:01Z

processing/src/main/java/org/apache/druid/segment/column/AllTrueBitmapColumnIndex.java

+  @Override
+  public double estimateSelectivity(int totalRows)
+  {
+    return 1.;


Personal style thing. I like return 1D; better, mostly mentioning just to make sure that you are aware that 1D; is also an option so that you can evaluate according to your own aesthetics.

ah, i think i was just moving the existing code around, the lineage of this is from

if (bitmapIndex == null || bitmapIndex.getCardinality() == 0) { return doesMatchNull() ? 1. : 0.; }

if i remember, just using 1 and 0 seems probably fine too. I'll change it to something else because i also find this to be among the stranger ways of writing it

imply-cheddar · 2022-05-06T02:00:37Z

processing/src/main/java/org/apache/druid/segment/column/DictionaryEncodedStringValueIndex.java

+  /**
+   * Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary
+   */
+  ImmutableBitmap getBitmap(int idx);


I wonder if this shouldn't return BitmapColumnIndex instead of ImmutableBitmap?

i don't think it is necessary; BitmapColumnIndex is specifically for filter processing, while DictionaryEncodedStringValueIndex should never be used by filters because it makes assumptions about how the column is represented that filters shouldn't concern themselves with. It is intended to be a 'leaky' abstraction to the way dictionary encoded columns with bitmap value indexes store things since it basically provides raw access. This class is only used by SegmentAnalyzer, ListFilteredVirtualColumn, search query strategies, and segment merging code, all of which are doing lots of invasive things in general to know things about the segments.

Part of me wants to detach this from the ColumnIndexSupplier and make it its own thing so that it can't be for filtering, but on the other hand, it needs all of the same things that ColumnIndexSupplier has, and all existing methods were using the same machinery as filters, so its a bit disruptive. I will maybe consider this as a follow up, change though, so that way we can tie this class more strongly to dictionary encoded string columns, since it isn't really valid for any other type of column.

Nah, makes sense.

jihoonson

I reviewed only the interface design. The latest change LGTM.

imply-cheddar

A few more minor comments, but overall, looks good.

imply-cheddar · 2022-05-09T02:40:27Z

processing/src/main/java/org/apache/druid/segment/ColumnSelectorColumnIndexSelector.java

+      indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector);
+    } else {
+      final ColumnHolder columnHolder = columnSelector.getColumnHolder(column);
+      if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) {


Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...

imply-cheddar · 2022-05-09T02:44:26Z

processing/src/main/java/org/apache/druid/segment/column/DictionaryEncodedStringValueIndex.java

+  /**
+   * Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary
+   */
+  ImmutableBitmap getBitmap(int idx);


Nah, makes sense.

imply-cheddar · 2022-05-09T02:47:27Z

processing/src/main/java/org/apache/druid/segment/filter/AndFilter.java

-        // Short-circuit.
-        return bitmapResultFactory.wrapAllFalse(Filters.allFalse(selector));
+      final BitmapColumnIndex index = filter.getBitmapColumnIndex(selector);
+      if (index == null) {


When will index be null? I'm trying to think of what that semantic could be and the dots are not really connecting for me? It seems like this should never be null and the filter should always do something to return a non-null response?

I'm thinking it might be related to the whole "not filterable columns" thingie? But, it seems like the logic that we are keeping is that not-filterable-columns are filterable, just all null. Alternatively, if this is an error case, I find myself wondering why there isn't an exception thrown deeper down that gives a meaningful error message about why the error is occurring...

since this is the method on Filter, index here will be null for any sub-filter which was unable to find an index to use, like how numeric columns currently are since they lack indexes.

At the lower level, if ColumnIndexSupplier, then the column doesn't exist (or is not filterable, which as mentioned in another comment I would sort of like to rework since not filterable doesn't really mean not filterable right now, it means all null and matches like that). All columns which exist have a non-null ColumnIndexSupplier. If the ColumnIndexSupplier is not null, then a null response from its as method means that the column doesn't have the type of index it was asked for.

Back to the filter getBitmapColumnIndex method, this method might return a non-null value even if the underlying column doesn't exist (ColumnIndexSupplier was null so it can match the matcher against null to produce an all true or all false), or the column does exist and supports one of the types of indexes that the filter understands.

cheddar

Approving for green checkmark. Review was done as imply-cheddar

gianm · 2022-05-29T06:38:07Z

The changes to NullColumnPartSerde in this patch introduce a compatibility problem: prior to this patch, bitmapSerdeFactory was a required parameter, and now it's gone. So that means segments with explicit null columns written by servers on this patch cannot be read by a server that has #12279. I noticed this when upgrading a dev cluster.

We haven't done a release with #12279 yet, but it's slated to be in 0.23.0. So we should either change the code in 0.23.0 to not require bitmapSerdeFactory, or we should change master to write it out once again, lest we create an accidental storage format incompatibility.

Any thoughts on the best approach @clintropolis, @jihoonson, @abhishekagarwal87, or @imply-cheddar?

clintropolis · 2022-05-29T06:58:19Z

Ah good catch 👍

Hmm, I'm not really sure how to remove bitmapSerdeFactory without the changes here but will think about it, I guess in the worst case we could just leave it there even though it isn't used.

clintropolis · 2022-05-30T09:58:42Z

hmm, I wonder if we could just allow ColumnHolder to be null from the NullColumnPartSerde instead of making a 'null' column holder implementation, then we wouldn't need to serialize rows or the serde factory, and it can behave just like any other actually null column... going to poke around and do some tests to see if this is feasible or if it doesn't work for some reason I'm not thinking of.

clintropolis · 2022-05-31T08:53:11Z

created #12585 to add back bitmapSerdeFactory since that is the easy solution, if a bit ugly

remake how query processing handles column indexes

f44be4d

clintropolis added Area - Querying WIP Design Review Area - Segment Format and Ser/De labels Apr 1, 2022

clintropolis added 8 commits April 1, 2022 14:30

fix build, more javadocs

b824b50

fixes

6b21412

dump segment bitmap test

da84c14

fixes

6614cad

more javadocs, VirtualColumn now has a ColumnIndexSupplier instead of…

337d4c7

… providing the methods directly, to be more consistent with ColumnHolder

cleanup

7fcba94

missed one

65ac563

cleaner

842e619

clintropolis removed the WIP label Apr 4, 2022

checkstyle

b5e9c55

clintropolis added Release Notes and removed Release Notes labels Apr 5, 2022

jihoonson approved these changes Apr 7, 2022

View reviewed changes

imply-cheddar suggested changes Apr 12, 2022

View reviewed changes

clintropolis added 4 commits April 28, 2022 03:13

add BitmapColumnIndex to consolidate index capabilities and computing…

9e905eb

… indexes

Merge remote-tracking branch 'upstream/master' into bitmap-index-chan…

e96656a

…ges-pt-2

remove unused

1c25629

remove more unused

c910bdd

clintropolis requested a review from jihoonson April 29, 2022 06:22

push down selectivity estimation

0c05141

rohangarg approved these changes May 5, 2022

View reviewed changes

imply-cheddar reviewed May 6, 2022

View reviewed changes

clintropolis added 2 commits May 6, 2022 01:38

review stuffs

331220d

fixes

b61a46d

jihoonson approved these changes May 6, 2022

View reviewed changes

imply-cheddar approved these changes May 9, 2022

View reviewed changes

cheddar approved these changes May 9, 2022

View reviewed changes

abhishekagarwal87 merged commit 9e5a940 into apache:master May 11, 2022

gianm mentioned this pull request May 13, 2022

Direct UTF-8 access for "in" filters. #12517

Merged

clintropolis deleted the bitmap-index-changes-pt-2 branch May 18, 2022 05:51

clintropolis mentioned this pull request May 18, 2022

Revert "push value range and set index get operations into BitmapIndex (#12315)" #12533

Merged

clintropolis mentioned this pull request May 31, 2022

fix backwards compatibility for explicit null columns #12585

Merged

clintropolis mentioned this pull request Jun 10, 2022

split out null value index #12627

Merged

3 tasks

clintropolis mentioned this pull request Jun 23, 2022

Druid nested data columns #12695

Open

clintropolis mentioned this pull request Jul 30, 2022

add DictionaryEncodedStringValueIndex implementation to NestedFieldLiteralColumnIndexSupplier #12837

Merged

4 tasks

clintropolis mentioned this pull request Aug 17, 2022

Why druid-sql not support Filtered DimensionSpecs #12911

Closed

abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022

techdocsmith mentioned this pull request Aug 26, 2022

[Draft] 24.0 Release notes #12825

Closed

abhishekagarwal87 mentioned this pull request Sep 8, 2022

Test issue [Please ignore] #13055

Closed

clintropolis mentioned this pull request Oct 19, 2023

[WIP] Add an option to enable bitmap in IncrementalIndex during real time ingestion #11307

Closed

8 tasks

clintropolis mentioned this pull request Feb 6, 2024

adaptive filter partitioning #15838

Merged

8 tasks

remake column indexes and query processing of filters #12388

remake column indexes and query processing of filters #12388

Conversation

clintropolis commented Apr 1, 2022 • edited Loading

Description

Index changes

DictionaryEncodedStringValueIndex and BitmapColumnIndex

StringValueSetIndex

LexicographicalRangeIndex

DruidPredicateIndex

ColumnIndexSelector and ColumnIndexSupplier

ColumnIndexCapabilities

Filter changes

Future work

Key changed/added classes in this PR

jihoonson left a comment

Choose a reason for hiding this comment

imply-cheddar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rohangarg left a comment

Choose a reason for hiding this comment

imply-cheddar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clintropolis May 6, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jihoonson left a comment

Choose a reason for hiding this comment

imply-cheddar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cheddar left a comment

Choose a reason for hiding this comment

gianm commented May 29, 2022 • edited Loading

clintropolis commented May 29, 2022 • edited Loading

clintropolis commented May 30, 2022

clintropolis commented May 31, 2022

clintropolis commented Apr 1, 2022 •

edited

Loading

`DictionaryEncodedStringValueIndex` and `BitmapColumnIndex`

`StringValueSetIndex`

`LexicographicalRangeIndex`

`DruidPredicateIndex`

`ColumnIndexSelector` and `ColumnIndexSupplier`

`ColumnIndexCapabilities`

clintropolis May 6, 2022 •

edited

Loading

gianm commented May 29, 2022 •

edited

Loading

clintropolis commented May 29, 2022 •

edited

Loading