-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remake column indexes and query processing of filters #12388
remake column indexes and query processing of filters #12388
Conversation
… providing the methods directly, to be more consistent with ColumnHolder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The new interfaces including ColumnIndexCapabilities
and ColumnIndexSupplier
make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some comments on the general interfaces and structuring. There are ambiguities hidden in them that I think make things more complex than they need to be. I haven't gone into the nitty gritty details of all of the implementations, but after looking at about 50% of the files I convinced myself that the suggestions made sense and would simplify things so I wanted to submit this review with the suggestions first.
It's possible I missed things in the implementations that my suggestions don't line up with, if so, please point them out.
return false; | ||
ColumnIndexCapabilities other = f.getIndexCapabilities(selector); | ||
if (other == null) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really okay to short-circuit all of the other filters in this case?
@@ -150,7 +129,7 @@ default VectorValueMatcher makeVectorMatcher(VectorColumnSelectorFactory factory | |||
* | |||
* @return true if this Filter supports selectivity estimation, false otherwise. | |||
*/ | |||
boolean supportsSelectivityEstimation(ColumnSelector columnSelector, BitmapIndexSelector indexSelector); | |||
boolean supportsSelectivityEstimation(ColumnSelector columnSelector, ColumnIndexSelector indexSelector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we move the selectivityEstimation value as suggested elsewhere, this method can be completely eliminated.
final BitmapIndex bitmapIndex = selector.getBitmapIndex(dimension); | ||
return bitmapResultFactory.unionDimensionValueBitmaps(getBitmapIterable(values, bitmapIndex)); | ||
final StringValueSetIndex valueSetIndex = selector.as(dimension, StringValueSetIndex.class); | ||
if (valueSetIndex == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It strikes me that this could be null
because the column doesn't exist OR it could be null
because the column exists, but doesn't support StringValueSetIndex
. the .as()
method should likely not actually take a String parameter and instead the call should be split into 2 calls, one for the column and then one for the interface. In this way, the semantics of null
becomes extremely clear.
@Nullable | ||
ImmutableBitmap getBitmapIndex(String dimension, String value); | ||
ImmutableRTree getSpatialIndex(String dimension); | ||
<T> T as(String column, Class<T> clazz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that .as()
has two arguments makes this call site ambiguous and introduces superfluous methods like getIndexCapabilities()
. We should instead just do a .get(String column)
that returns ColumnIndexSupplier
(this can be used to determine if the column exists or not) and then we can use the ColumnIndexSupplier
object to call .getIndexAs()
and get the implementation. If we do this, we don't need getIndexCapabilities()
at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to just expose the index supplier here, and eliminates the need to check capabilities as a proxy for if the column exists or not. However, I think we still need index capabilities for other reasons to provide information about characteristics of the indexes themselves. It would need to return a non-null index supplier for all columns that exist, and that index supplier can just return nulls for capabilities and indexes, which is easy enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the capabilities, what I was suggesting is, e.g. where LexicographicRangeIndex has method
default Iterable<ImmutableBitmap> getBitmapsInRange(
@Nullable String startValue,
boolean startStrict,
@Nullable String endValue,
boolean endStrict
)
It could have signature
default FilterBundle getBitmapsInRange(
@Nullable String startValue,
boolean startStrict,
@Nullable String endValue,
boolean endStrict
)
Where FilterBundle
would be a new class that could have fields for bitmap indexes, value predicates, selectivity estimation, whether it's safe to invert, etc. etc. Which, I believe meets the need that you are highlighting for ColumnIndexCapabilities
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think FilterBundle
is going to be a bit more complex than you're hoping, which is part of why I didn't try to consolidate getIndexCapabilities
and getIndex
yet. Not all value matchers are predicates, so it can't strictly operate on predicates, and value matchers need the cursor created from the offset of the 'pre' part of the filter bundle to actually be created, so it might be a factory that can make value matchers to use as a post-filter.
*/ | ||
double estimateSelectivity(BitmapIndexSelector indexSelector); | ||
double estimateSelectivity(ColumnIndexSelector indexSelector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to some other suggestions made, I think that selectivity estimation is a function of the bitmap that is returned. We should perhaps move it to an object that is returned when the bitmap is gotten instead of having it here. Looking at, e.g. the BoundsFilter implementation, this method effectively does all of the work to build the bitmap only to then figure out if it's okay to actually use the bitmap. If it is okay, it's going to do all of the work to build the bitmap again. If we have to do the work anyway, might as well just do it once and the current API doesn't actually allow the code to do that...
Maybe we don't fix it in this PR and we fix it in a follow-up that additionally adds the various things for IndexCapabilities, but either way we should put a target on this method as it's an abstraction that can be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't explicitly mention it, but I was opting to refactor selectivity estimation as a follow-up change, since this was already pretty big, so I'd prefer a follow-up I think
|
||
@Nullable | ||
@Override | ||
public <T> T as(String column, Class<T> clazz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the implementation for as()
and getColumnIndexCapabilities()
, I really think this should just be getColumnIndexSupplier()
where null
effectively means "not filterable". We then have a choice of whether we return a NullColumnIndexSupplier
for cases where the dimension doesn't exist (i.e. at this level, act like it does exist by returning something that knows how to deal with all nulls) or also have null
mean that the column didn't exist. Personally, I tend to think that we should return a NullColumnIndexSupplier
and let the callers believe the column actually existed, but maybe there's reasons not to do that.
* Sort of like {@link ColumnCapabilities}, except for indexes supplied by {@link ColumnIndexSelector}, provides | ||
* information for how query processing may use indexes. | ||
*/ | ||
public interface ColumnIndexCapabilities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values that this class represent (isInvertible
and isExact
) are not functions of a column, but are functions of a specific Bitmap returned by said column. ColumnIndexCapabilities
belongs to a column and is acting like it represents all things from that column, it cannot, however do that. To handle this case, we should instead move these things onto the actually ImmutableBitmap
object or some other wrapper object that gets return and indicates what all can be done. I would recommend that put them on an object that also contains the ValueMatcher
for when a ValueMatcher is required.
As is, this interface and API structure is not actually in alignment with the physical realities of the data and the extensions, so we will, once again, just be fighting with it. So, we should fix this before merging. One way to fix it (and my recommendation) is to just remove the Capabilities altogether here and re-introduce isInvertible
and isExact
in the PR that also introduces a combination of BitmapIndex
and ValueMatchers
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnIndexCapabilities belongs to a column and is acting like it represents all things from that column, it cannot, however do that.
This isn't true, getIndexCapabilities
takes both a column, and a specific class implementation of the index, so these are the capabilities of a specific index of the column. I guess I wasn't considering the case where an index might sometimes produce exact matches and sometimes not, since in my mind these wouldn't be mixed within the same index, but I wasn't thinking clever enough about it and these cases probably do exist.
final ColumnIndexCapabilities capabilities = postFilter.getIndexCapabilities(bitmapIndexSelector); | ||
// we only consider "exact" filters here, because if false, we've already used the bitmap index for the base | ||
// offset | ||
if (capabilities != null && capabilities.isExact()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking capabilities, this can very easily be
BitmapIndex index = postFilter.getBitmapIndex(bitmapIndexSelector);
if (index != null) {
* which can greatly reduce the total number of rows which need to be scanned and processed. | ||
*/ | ||
@Nullable | ||
<T> T getIndex(Class<T> clazz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming nit: .getIndexAs()
to indicate that we aren't trying to access a field called Index, but we are asking the object to "morph" itself into the specified class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this isn't an index itself though (or at least wasn't in my mind), but a sort of index supplier, so in my mind were asking if the supplier could give us an index of a specific type 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As
was actually coming from me not liking the word Index
and wanting something with extra words to make it more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I meant to rename this back to as
after the changes from the first round of reviews, it only called it getIndex
because as
seemed strange when the interface also had getIndexCapabilities
, so I've renamed this method to as
.
I actually want to work on reclaiming the word 'index' to primarily be used for filtering/cursors so that the codebase is more 'friendly', but not in this PR.
* in unavailable. | ||
*/ | ||
@Nullable | ||
<T> ColumnIndexCapabilities getIndexCapabilities(Class<T> clazz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested in other comments, perhaps move this to be fields on the Bitmap object returned by the index object instead of generic representations of the column as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after CI 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still part way through review, need to run to a meeting, leaving comments so taht they don't sit pending foreverz
* which can greatly reduce the total number of rows which need to be scanned and processed. | ||
*/ | ||
@Nullable | ||
<T> T getIndex(Class<T> clazz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As
was actually coming from me not liking the word Index
and wanting something with extra words to make it more descriptive.
size += StringUtils.estimatedBinaryLengthAsUTF8(value) * | ||
((long) bitmapIndex.getBitmapForValue(value).size()); | ||
ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier(); | ||
if (indexSupplier != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is a !=
on an if with an else, invert them to remove the negation please.
final DictionaryEncodedStringValueIndex valueIndex = indexSupplier.getIndex( | ||
DictionaryEncodedStringValueIndex.class | ||
); | ||
Preconditions.checkNotNull(valueIndex, "Column [%s] does not have a value index", columnName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be equivalent to capabilities.hasBitmapIndexes()
. How about we remove that check above and replace it with the 2-phased checks for the ColumnIndexSupplier
and then that it can return a DictionaryEncodedStringValueIndex
?
cardinality = valueIndex.getCardinality(); | ||
if (analyzingSize()) { | ||
for (int i = 0; i < cardinality; ++i) { | ||
String value = valueIndex.getValue(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are here, could we like... use the getUTF8Bytes
and just take .remaining()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I completely agree this is silly, but I have bigger plans with plumbing through the utf8 bytes into filters, so do you mind if I save this for a follow-up PR?
indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector); | ||
} else { | ||
final ColumnHolder columnHolder = columnSelector.getColumnHolder(column); | ||
if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the isFilterable
check? If the ColumnHolder believes that it is not filterable, wouldn't it just return a null
IndexSupplier? Or do we want a not filterable column to act like it doesn't even exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do we want a not filterable column to act like it doesn't even exist?
This is the way it is currently intended to behave, if it is not filterable the column is meant to be treated as only nulls. I lost some comments about what was going on along the way, which i will try to add back.
// for missing columns and columns with types that do not support filtering,
// treat the column as if it were a String column full of nulls.
// Create a BitmapIndex so that filters applied to null columns can use
// bitmap indexes. Filters check for the presence of a bitmap index, this is used to determine
// whether the filter is applied in the pre or post filtering stage.
To be fair, i'm not completely certain this is the best way to handle this, since it does create odd situations where unfilterable IS NULL
matches every row for a column that is not filterable when amusingly the value is not null, so I will be revisiting this in the future I think. It should probably never match anything ever if the column is not filterable, but that probably needs discussion since would be a behavior change, so i'm only going to make sure the comments documenting existing behavior are there and save logic changes for future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...
@@ -383,6 +391,9 @@ public BitmapValues getBitmapValues(String dimension, int dictId) | |||
} | |||
} | |||
|
|||
/** | |||
* this method is only for testing, don't use it for stuff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it really need to exist? Could it be a helper method in a class in the test package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a getQueryableIndex
method so that this method could be moved to the only test that uses it
@@ -146,8 +148,9 @@ public Comparable getMinValue(String dimension) | |||
{ | |||
ColumnHolder columnHolder = index.getColumnHolder(dimension); | |||
if (columnHolder != null && columnHolder.getCapabilities().hasBitmapIndexes()) { | |||
BitmapIndex bitmap = columnHolder.getBitmapIndex(); | |||
return bitmap.getCardinality() > 0 ? bitmap.getValue(0) : null; | |||
ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total drive-by comment as it's not really related to this PR, but this getMinValue
seems like a weird method to have on this API. It looks like it's used for some join logic? I wonder if that logic couldn't just specialize to the concrete class it expects and get the method that way instead...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree it is strange, and is used only by SegmentAnalyzer
so agree it should be reworked though i might save it for a follow-up PR
@Override | ||
public double estimateSelectivity(int totalRows) | ||
{ | ||
return 1.; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal style thing. I like return 1D;
better, mostly mentioning just to make sure that you are aware that 1D;
is also an option so that you can evaluate according to your own aesthetics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, i think i was just moving the existing code around, the lineage of this is from
if (bitmapIndex == null || bitmapIndex.getCardinality() == 0) {
return doesMatchNull() ? 1. : 0.;
}
if i remember, just using 1
and 0
seems probably fine too. I'll change it to something else because i also find this to be among the stranger ways of writing it
/** | ||
* Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary | ||
*/ | ||
ImmutableBitmap getBitmap(int idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this shouldn't return BitmapColumnIndex
instead of ImmutableBitmap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think it is necessary; BitmapColumnIndex
is specifically for filter processing, while DictionaryEncodedStringValueIndex
should never be used by filters because it makes assumptions about how the column is represented that filters shouldn't concern themselves with. It is intended to be a 'leaky' abstraction to the way dictionary encoded columns with bitmap value indexes store things since it basically provides raw access. This class is only used by SegmentAnalyzer
, ListFilteredVirtualColumn
, search query strategies, and segment merging code, all of which are doing lots of invasive things in general to know things about the segments.
Part of me wants to detach this from the ColumnIndexSupplier
and make it its own thing so that it can't be for filtering, but on the other hand, it needs all of the same things that ColumnIndexSupplier
has, and all existing methods were using the same machinery as filters, so its a bit disruptive. I will maybe consider this as a follow up, change though, so that way we can tie this class more strongly to dictionary encoded string columns, since it isn't really valid for any other type of column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed only the interface design. The latest change LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more minor comments, but overall, looks good.
indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector); | ||
} else { | ||
final ColumnHolder columnHolder = columnSelector.getColumnHolder(column); | ||
if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...
/** | ||
* Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary | ||
*/ | ||
ImmutableBitmap getBitmap(int idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, makes sense.
// Short-circuit. | ||
return bitmapResultFactory.wrapAllFalse(Filters.allFalse(selector)); | ||
final BitmapColumnIndex index = filter.getBitmapColumnIndex(selector); | ||
if (index == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will index be null
? I'm trying to think of what that semantic could be and the dots are not really connecting for me? It seems like this should never be null and the filter should always do something to return a non-null response?
I'm thinking it might be related to the whole "not filterable columns" thingie? But, it seems like the logic that we are keeping is that not-filterable-columns are filterable, just all null. Alternatively, if this is an error case, I find myself wondering why there isn't an exception thrown deeper down that gives a meaningful error message about why the error is occurring...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is the method on Filter
, index here will be null for any sub-filter which was unable to find an index to use, like how numeric columns currently are since they lack indexes.
At the lower level, if ColumnIndexSupplier
, then the column doesn't exist (or is not filterable, which as mentioned in another comment I would sort of like to rework since not filterable doesn't really mean not filterable right now, it means all null and matches like that). All columns which exist have a non-null ColumnIndexSupplier
. If the ColumnIndexSupplier
is not null, then a null response from its as
method means that the column doesn't have the type of index it was asked for.
Back to the filter getBitmapColumnIndex
method, this method might return a non-null value even if the underlying column doesn't exist (ColumnIndexSupplier
was null so it can match the matcher against null to produce an all true or all false), or the column does exist and supports one of the types of indexes that the filter understands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for green checkmark. Review was done as imply-cheddar
The changes to NullColumnPartSerde in this patch introduce a compatibility problem: prior to this patch, We haven't done a release with #12279 yet, but it's slated to be in 0.23.0. So we should either change the code in 0.23.0 to not require Any thoughts on the best approach @clintropolis, @jihoonson, @abhishekagarwal87, or @imply-cheddar? |
Ah good catch 👍 Hmm, I'm not really sure how to remove |
hmm, I wonder if we could just allow |
created #12585 to add back |
Description
Following up on #12315, which pushed most of the logic of building
ImmutableBitmap
intoBitmapIndex
in order to hide the details of how column indexes are implemented from theFilter
implementations, this PR totally refashions howFilter
consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.Index changes
BitmapIndex
has been split into 4 concrete implementations,DictionaryEncodedStringValueIndex
,StringValueSetIndex
,LexicographicalRangeIndex
, andDruidPredicateIndex
.DictionaryEncodedStringValueIndex
andBitmapColumnIndex
DictionaryEncodedStringValueIndex
is identical toBitmapIndex
prior to #12315, and is a 'leaky' abstraction that is very much tied to how string dictionary encoded columns store a lexicographical sorted value dictionary, and also a lexicographical sorted index dictionary for the index on those values.Filter
implementations no longer use this index directly, since it requires knowing a bunch of implementation details which filtering need not concern themselves with. Instead, the other 3 implementations produce a new thingwhich captures some information about that index, and can be computed into a
T
with theBitmapResultFactory
(e.g. into aImmutableBitmap
representing the rows selected by that filter to use with aBitmapOffset
). A promise of a bitmap, and information about it for that filter against those inputs, if you will.StringValueSetIndex
StringValueSetIndex
is the index implementation that the majority of our current filters which support indexes use, and provides facilities for gettingBitmapColumnIndex
for a single string value to match, a set of string values.LexicographicalRangeIndex
LexicographicalRangeIndex
captures the pattern of the optimization which can be done for bound and like filters in certain cases, where taking advantage of the fact that the indexes are stored in lexicographical order for string dimensions, aImmutableBitmap
can be produced by gathering all indexes within the start and end dictionary id of the underlying values and unioning them.DruidPredicateIndex
DruidPredicateIndex
is an index which can be constructed using aDruidPredicateFactory
, which for dictionary encoded string columns means the string predicate is used and evaluated against all of the values which are present in the value dictionary.ColumnIndexSelector
andColumnIndexSupplier
The methods by which
Filter
obtain these indexes from columns has also been heavily reworked.BitmapIndexSelector
is now calledColumnIndexSelector
, and the focus here is a new method:which
Filter
implementations can call into to get an index supplier for the column they are interested in using indexes.All columns which "exist" will have a non-null
ColumnIndexSupplier
, soFilter
implementations can use a null response here to check if the filter matches the null value, and produce an all true or all false index instead.ColumnHolder
has also been modified to instead of having aSupplier<BitmapIndex>
andSupplier<SpatialIndex>
to useColumnIndexSupplier
.VirtualColumn
also contains a method to fetch aColumnIndexSupplier
, with additional access to aColumnSelector
so they can potentially read stuff from an underlying segment when constructing it.This lets the implementation of
ColumnIndexSelector
,ColumnSelectorColumnIndexSelector
(formerly known asColumnSelectorBitmapIndexSelector
), pass through the either the virtual or physical column index supplier, or null if the column is missing entirely.For dictionary encoded string columns, the
DictionaryEncodedStringIndexSupplier.getIndex
method can supply implementations ofDictionaryEncodedStringValueIndex
,StringValueSetIndex
,LexicographicalRangeIndex
,DruidPredicateIndex
and, through the same interface,SpatialIndex
, without any specific methods to fetch any of these implementations.ColumnIndexCapabilities
The vast majority of the changes in this PR are simply re-arranging existing pieces. There is one exception to this however, which is the previously mentioned
ColumnIndexCapabilities
type. I've preemptively added this type in anticipation of adding additional index types, and gone ahead and wired this up into the filter partitioning logic that currently resides inQueryableIndexStorageAdapter
.ColumnIndexCapabilities
is sort of an analog to ColumnCapabilities, but for information about a specific conjuring ofBitmapColumnIndex
. I've added two methods here which no current filters actually take advantage of, but that I know will be of use in the near future as we begin to add additional index implementations: 'is exact' and 'is invertible', the former of which captures the pattern of indexes which can be used to reduce the total number of rows scanned, or which may produce false positives and so should still be 'post' filtered with a value matcher, and the latter for index types which might have false positives and so cannot be used in a 'not' filter.Filter changes
The changes to indexes greatly simplify all of the
Filter
implementations. AllFilter
bitmap index methods have been consolidated into a single method:which simply if not null, means that the filter has a
BitmapColumnIndex
that can be computed into an offset.This single method replaces:
ImmutableBitmap getBitmapIndex(BitmapIndexSelector selector)
<T> T getBitmapResult(BitmapIndexSelector selector, BitmapResultFactory<T> bitmapResultFactory);
boolean supportsBitmapIndex(BitmapIndexSelector selector);
boolean shouldUseBitmapIndex(BitmapIndexSelector selector);
Future work
ColumnCapabilities
methodshasBitmapIndexes
andhasSpatialIndexes
no longer accurately describe columns and should probably be adapted into the set of classes which theColumnIndexSupplier
can provide. Likewise, dimension declarations on ingestion specs should eventually also be modified to be more flexible in the types of indexes which can be specified, though I'm not sure yet what this should look like quite yet.The mechanics of how indexes work is still a bit leaky after this change, primarily in that
QueryableIndexStorageAdapter
knows too much about partitioning filters into 'pre' and 'post' filters when building the cursor. I would like to push this intoFilter
itself if possible, potentially getting to where instead ofQueryableIndexStorageAdapter
examining and finding which filters support indexes and collecting the bitmaps from them to create a bitmap offset and using any remaining filters as a post-filtered offset, that instead perhaps theFilter
itself can create the offset that is used when constructing the cursor. SinceColumnIndexSupplier
isn't at all opinionated about what types of indexes columns can provide, it opens up the door to creating offsets from information stored in the segment without bitmaps being required.However, I defer these changes for the future since I've already changed enough here. This does seem like it should be possible though, since the filters already know how to create everything that goes into the current offset construction (
BitmapColumnIndex
andValueMatcher
/VectorValueMatcher
), but would also allow much larger flexibility when constructing the offsets as well as releasingQueryableIndexStorageAdapter
from having to know anything at all about filters.Key changed/added classes in this PR
BitmapIndex
->DictionaryEncodedStringValueIndex
,StringValueSetIndex
,LexicographicalRangeIndex
,DruidPredicateIndex
BitmapColumnIndex
BitmapIndexSelector
->ColumnIndexSelector
ColumnHolder
ColumnIndexSupplier
DictionaryEncodedStringIndexSupplier
QueryableIndexStorageAdapter
ColumnIndexCapabilities
VirtualColumn
Filter
and all implementationsThis PR has: