Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remake column indexes and query processing of filters #12388

Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Apr 1, 2022

Description

Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.

Index changes

BitmapIndex has been split into 4 concrete implementations, DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, and DruidPredicateIndex.

DictionaryEncodedStringValueIndex and BitmapColumnIndex

DictionaryEncodedStringValueIndex is identical to BitmapIndex prior to #12315, and is a 'leaky' abstraction that is very much tied to how string dictionary encoded columns store a lexicographical sorted value dictionary, and also a lexicographical sorted index dictionary for the index on those values. Filter implementations no longer use this index directly, since it requires knowing a bunch of implementation details which filtering need not concern themselves with. Instead, the other 3 implementations produce a new thing

public interface BitmapColumnIndex
{
  ColumnIndexCapabilities getIndexCapabilities();

  <T> T computeBitmapResult(BitmapResultFactory<T> bitmapResultFactory);
}

which captures some information about that index, and can be computed into a T with the BitmapResultFactory (e.g. into a ImmutableBitmap representing the rows selected by that filter to use with a BitmapOffset). A promise of a bitmap, and information about it for that filter against those inputs, if you will.

StringValueSetIndex

StringValueSetIndex is the index implementation that the majority of our current filters which support indexes use, and provides facilities for gettingBitmapColumnIndex for a single string value to match, a set of string values.

LexicographicalRangeIndex

LexicographicalRangeIndex captures the pattern of the optimization which can be done for bound and like filters in certain cases, where taking advantage of the fact that the indexes are stored in lexicographical order for string dimensions, a ImmutableBitmap can be produced by gathering all indexes within the start and end dictionary id of the underlying values and unioning them.

DruidPredicateIndex

DruidPredicateIndex is an index which can be constructed using a DruidPredicateFactory, which for dictionary encoded string columns means the string predicate is used and evaluated against all of the values which are present in the value dictionary.

ColumnIndexSelector and ColumnIndexSupplier

The methods by which Filter obtain these indexes from columns has also been heavily reworked. BitmapIndexSelector is now called ColumnIndexSelector, and the focus here is a new method:

  @Nullable
  ColumnIndexSupplier getIndexSupplier(String column);

which Filter implementations can call into to get an index supplier for the column they are interested in using indexes.

public interface ColumnIndexSupplier
{
  @Nullable
  <T> T as(Class<T> clazz);
}

All columns which "exist" will have a non-null ColumnIndexSupplier, so Filter implementations can use a null response here to check if the filter matches the null value, and produce an all true or all false index instead.

ColumnHolder has also been modified to instead of having a Supplier<BitmapIndex> and Supplier<SpatialIndex> to use ColumnIndexSupplier. VirtualColumn also contains a method to fetch a ColumnIndexSupplier, with additional access to a ColumnSelector so they can potentially read stuff from an underlying segment when constructing it.

This lets the implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector (formerly known as ColumnSelectorBitmapIndexSelector), pass through the either the virtual or physical column index supplier, or null if the column is missing entirely.

For dictionary encoded string columns, the DictionaryEncodedStringIndexSupplier.getIndex method can supply implementations of DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, DruidPredicateIndex and, through the same interface, SpatialIndex, without any specific methods to fetch any of these implementations.

ColumnIndexCapabilities

The vast majority of the changes in this PR are simply re-arranging existing pieces. There is one exception to this however, which is the previously mentioned ColumnIndexCapabilities type. I've preemptively added this type in anticipation of adding additional index types, and gone ahead and wired this up into the filter partitioning logic that currently resides in QueryableIndexStorageAdapter.

ColumnIndexCapabilities is sort of an analog to ColumnCapabilities, but for information about a specific conjuring of BitmapColumnIndex. I've added two methods here which no current filters actually take advantage of, but that I know will be of use in the near future as we begin to add additional index implementations: 'is exact' and 'is invertible', the former of which captures the pattern of indexes which can be used to reduce the total number of rows scanned, or which may produce false positives and so should still be 'post' filtered with a value matcher, and the latter for index types which might have false positives and so cannot be used in a 'not' filter.

Filter changes

The changes to indexes greatly simplify all of the Filter implementations. All Filter bitmap index methods have been consolidated into a single method:

  @Nullable
  BitmapColumnIndex getBitmapColumnIndex(ColumnIndexSelector selector);

which simply if not null, means that the filter has a BitmapColumnIndex that can be computed into an offset.

This single method replaces:

  • ImmutableBitmap getBitmapIndex(BitmapIndexSelector selector)
  • <T> T getBitmapResult(BitmapIndexSelector selector, BitmapResultFactory<T> bitmapResultFactory);
  • boolean supportsBitmapIndex(BitmapIndexSelector selector);
  • boolean shouldUseBitmapIndex(BitmapIndexSelector selector);

Future work

ColumnCapabilities methods hasBitmapIndexes and hasSpatialIndexes no longer accurately describe columns and should probably be adapted into the set of classes which the ColumnIndexSupplier can provide. Likewise, dimension declarations on ingestion specs should eventually also be modified to be more flexible in the types of indexes which can be specified, though I'm not sure yet what this should look like quite yet.

The mechanics of how indexes work is still a bit leaky after this change, primarily in that QueryableIndexStorageAdapter knows too much about partitioning filters into 'pre' and 'post' filters when building the cursor. I would like to push this into Filter itself if possible, potentially getting to where instead of QueryableIndexStorageAdapter examining and finding which filters support indexes and collecting the bitmaps from them to create a bitmap offset and using any remaining filters as a post-filtered offset, that instead perhaps the Filter itself can create the offset that is used when constructing the cursor. Since ColumnIndexSupplier isn't at all opinionated about what types of indexes columns can provide, it opens up the door to creating offsets from information stored in the segment without bitmaps being required.

However, I defer these changes for the future since I've already changed enough here. This does seem like it should be possible though, since the filters already know how to create everything that goes into the current offset construction (BitmapColumnIndex and ValueMatcher/VectorValueMatcher), but would also allow much larger flexibility when constructing the offsets as well as releasing QueryableIndexStorageAdapter from having to know anything at all about filters.


Key changed/added classes in this PR
  • BitmapIndex -> DictionaryEncodedStringValueIndex, StringValueSetIndex, LexicographicalRangeIndex, DruidPredicateIndex
  • BitmapColumnIndex
  • BitmapIndexSelector -> ColumnIndexSelector
  • ColumnHolder
  • ColumnIndexSupplier
  • DictionaryEncodedStringIndexSupplier
  • QueryableIndexStorageAdapter
  • ColumnIndexCapabilities
  • VirtualColumn
  • Filter and all implementations

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis removed the WIP label Apr 4, 2022
Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The new interfaces including ColumnIndexCapabilities and ColumnIndexSupplier make sense to me.

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments on the general interfaces and structuring. There are ambiguities hidden in them that I think make things more complex than they need to be. I haven't gone into the nitty gritty details of all of the implementations, but after looking at about 50% of the files I convinced myself that the suggestions made sense and would simplify things so I wanted to submit this review with the suggestions first.

It's possible I missed things in the implementations that my suggestions don't line up with, if so, please point them out.

return false;
ColumnIndexCapabilities other = f.getIndexCapabilities(selector);
if (other == null) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really okay to short-circuit all of the other filters in this case?

@@ -150,7 +129,7 @@ default VectorValueMatcher makeVectorMatcher(VectorColumnSelectorFactory factory
*
* @return true if this Filter supports selectivity estimation, false otherwise.
*/
boolean supportsSelectivityEstimation(ColumnSelector columnSelector, BitmapIndexSelector indexSelector);
boolean supportsSelectivityEstimation(ColumnSelector columnSelector, ColumnIndexSelector indexSelector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we move the selectivityEstimation value as suggested elsewhere, this method can be completely eliminated.

final BitmapIndex bitmapIndex = selector.getBitmapIndex(dimension);
return bitmapResultFactory.unionDimensionValueBitmaps(getBitmapIterable(values, bitmapIndex));
final StringValueSetIndex valueSetIndex = selector.as(dimension, StringValueSetIndex.class);
if (valueSetIndex == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It strikes me that this could be null because the column doesn't exist OR it could be null because the column exists, but doesn't support StringValueSetIndex. the .as() method should likely not actually take a String parameter and instead the call should be split into 2 calls, one for the column and then one for the interface. In this way, the semantics of null becomes extremely clear.

@Nullable
ImmutableBitmap getBitmapIndex(String dimension, String value);
ImmutableRTree getSpatialIndex(String dimension);
<T> T as(String column, Class<T> clazz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that .as() has two arguments makes this call site ambiguous and introduces superfluous methods like getIndexCapabilities(). We should instead just do a .get(String column) that returns ColumnIndexSupplier (this can be used to determine if the column exists or not) and then we can use the ColumnIndexSupplier object to call .getIndexAs() and get the implementation. If we do this, we don't need getIndexCapabilities() at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to just expose the index supplier here, and eliminates the need to check capabilities as a proxy for if the column exists or not. However, I think we still need index capabilities for other reasons to provide information about characteristics of the indexes themselves. It would need to return a non-null index supplier for all columns that exist, and that index supplier can just return nulls for capabilities and indexes, which is easy enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the capabilities, what I was suggesting is, e.g. where LexicographicRangeIndex has method

default Iterable<ImmutableBitmap> getBitmapsInRange(
      @Nullable String startValue,
      boolean startStrict,
      @Nullable String endValue,
      boolean endStrict
  )

It could have signature

default FilterBundle getBitmapsInRange(
      @Nullable String startValue,
      boolean startStrict,
      @Nullable String endValue,
      boolean endStrict
  )

Where FilterBundle would be a new class that could have fields for bitmap indexes, value predicates, selectivity estimation, whether it's safe to invert, etc. etc. Which, I believe meets the need that you are highlighting for ColumnIndexCapabilities.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FilterBundle is going to be a bit more complex than you're hoping, which is part of why I didn't try to consolidate getIndexCapabilities and getIndex yet. Not all value matchers are predicates, so it can't strictly operate on predicates, and value matchers need the cursor created from the offset of the 'pre' part of the filter bundle to actually be created, so it might be a factory that can make value matchers to use as a post-filter.

*/
double estimateSelectivity(BitmapIndexSelector indexSelector);
double estimateSelectivity(ColumnIndexSelector indexSelector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to some other suggestions made, I think that selectivity estimation is a function of the bitmap that is returned. We should perhaps move it to an object that is returned when the bitmap is gotten instead of having it here. Looking at, e.g. the BoundsFilter implementation, this method effectively does all of the work to build the bitmap only to then figure out if it's okay to actually use the bitmap. If it is okay, it's going to do all of the work to build the bitmap again. If we have to do the work anyway, might as well just do it once and the current API doesn't actually allow the code to do that...

Maybe we don't fix it in this PR and we fix it in a follow-up that additionally adds the various things for IndexCapabilities, but either way we should put a target on this method as it's an abstraction that can be improved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't explicitly mention it, but I was opting to refactor selectivity estimation as a follow-up change, since this was already pretty big, so I'd prefer a follow-up I think


@Nullable
@Override
public <T> T as(String column, Class<T> clazz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation for as() and getColumnIndexCapabilities(), I really think this should just be getColumnIndexSupplier() where null effectively means "not filterable". We then have a choice of whether we return a NullColumnIndexSupplier for cases where the dimension doesn't exist (i.e. at this level, act like it does exist by returning something that knows how to deal with all nulls) or also have null mean that the column didn't exist. Personally, I tend to think that we should return a NullColumnIndexSupplier and let the callers believe the column actually existed, but maybe there's reasons not to do that.

* Sort of like {@link ColumnCapabilities}, except for indexes supplied by {@link ColumnIndexSelector}, provides
* information for how query processing may use indexes.
*/
public interface ColumnIndexCapabilities
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values that this class represent (isInvertible and isExact) are not functions of a column, but are functions of a specific Bitmap returned by said column. ColumnIndexCapabilities belongs to a column and is acting like it represents all things from that column, it cannot, however do that. To handle this case, we should instead move these things onto the actually ImmutableBitmap object or some other wrapper object that gets return and indicates what all can be done. I would recommend that put them on an object that also contains the ValueMatcher for when a ValueMatcher is required.

As is, this interface and API structure is not actually in alignment with the physical realities of the data and the extensions, so we will, once again, just be fighting with it. So, we should fix this before merging. One way to fix it (and my recommendation) is to just remove the Capabilities altogether here and re-introduce isInvertible and isExact in the PR that also introduces a combination of BitmapIndex and ValueMatchers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnIndexCapabilities belongs to a column and is acting like it represents all things from that column, it cannot, however do that.

This isn't true, getIndexCapabilities takes both a column, and a specific class implementation of the index, so these are the capabilities of a specific index of the column. I guess I wasn't considering the case where an index might sometimes produce exact matches and sometimes not, since in my mind these wouldn't be mixed within the same index, but I wasn't thinking clever enough about it and these cases probably do exist.

final ColumnIndexCapabilities capabilities = postFilter.getIndexCapabilities(bitmapIndexSelector);
// we only consider "exact" filters here, because if false, we've already used the bitmap index for the base
// offset
if (capabilities != null && capabilities.isExact()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking capabilities, this can very easily be

BitmapIndex index = postFilter.getBitmapIndex(bitmapIndexSelector);
if (index != null) {

* which can greatly reduce the total number of rows which need to be scanned and processed.
*/
@Nullable
<T> T getIndex(Class<T> clazz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: .getIndexAs() to indicate that we aren't trying to access a field called Index, but we are asking the object to "morph" itself into the specified class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this isn't an index itself though (or at least wasn't in my mind), but a sort of index supplier, so in my mind were asking if the supplier could give us an index of a specific type 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As was actually coming from me not liking the word Index and wanting something with extra words to make it more descriptive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I meant to rename this back to as after the changes from the first round of reviews, it only called it getIndex because as seemed strange when the interface also had getIndexCapabilities, so I've renamed this method to as.

I actually want to work on reclaiming the word 'index' to primarily be used for filtering/cursors so that the codebase is more 'friendly', but not in this PR.

* in unavailable.
*/
@Nullable
<T> ColumnIndexCapabilities getIndexCapabilities(Class<T> clazz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested in other comments, perhaps move this to be fields on the Bitmap object returned by the index object instead of generic representations of the column as a whole.

Copy link
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI 👍

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still part way through review, need to run to a meeting, leaving comments so taht they don't sit pending foreverz

* which can greatly reduce the total number of rows which need to be scanned and processed.
*/
@Nullable
<T> T getIndex(Class<T> clazz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, the name "Index" was already ruined by the original author of a lot of this code. It's so overloaded in what it can mean, that I find myself wanting us to use a different noun. Upon reflection, my suggestion to add As was actually coming from me not liking the word Index and wanting something with extra words to make it more descriptive.

size += StringUtils.estimatedBinaryLengthAsUTF8(value) *
((long) bitmapIndex.getBitmapForValue(value).size());
ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier();
if (indexSupplier != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is a != on an if with an else, invert them to remove the negation please.

final DictionaryEncodedStringValueIndex valueIndex = indexSupplier.getIndex(
DictionaryEncodedStringValueIndex.class
);
Preconditions.checkNotNull(valueIndex, "Column [%s] does not have a value index", columnName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be equivalent to capabilities.hasBitmapIndexes(). How about we remove that check above and replace it with the 2-phased checks for the ColumnIndexSupplier and then that it can return a DictionaryEncodedStringValueIndex?

cardinality = valueIndex.getCardinality();
if (analyzingSize()) {
for (int i = 0; i < cardinality; ++i) {
String value = valueIndex.getValue(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are here, could we like... use the getUTF8Bytes and just take .remaining() instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I completely agree this is silly, but I have bigger plans with plumbing through the utf8 bytes into filters, so do you mind if I save this for a follow-up PR?

indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector);
} else {
final ColumnHolder columnHolder = columnSelector.getColumnHolder(column);
if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the isFilterable check? If the ColumnHolder believes that it is not filterable, wouldn't it just return a null IndexSupplier? Or do we want a not filterable column to act like it doesn't even exist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do we want a not filterable column to act like it doesn't even exist?

This is the way it is currently intended to behave, if it is not filterable the column is meant to be treated as only nulls. I lost some comments about what was going on along the way, which i will try to add back.

      // for missing columns and columns with types that do not support filtering,
      // treat the column as if it were a String column full of nulls.
      // Create a BitmapIndex so that filters applied to null columns can use
      // bitmap indexes. Filters check for the presence of a bitmap index, this is used to determine
      // whether the filter is applied in the pre or post filtering stage.

To be fair, i'm not completely certain this is the best way to handle this, since it does create odd situations where unfilterable IS NULL matches every row for a column that is not filterable when amusingly the value is not null, so I will be revisiting this in the future I think. It should probably never match anything ever if the column is not filterable, but that probably needs discussion since would be a behavior change, so i'm only going to make sure the comments documenting existing behavior are there and save logic changes for future work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...

@@ -383,6 +391,9 @@ public BitmapValues getBitmapValues(String dimension, int dictId)
}
}

/**
* this method is only for testing, don't use it for stuff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really need to exist? Could it be a helper method in a class in the test package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a getQueryableIndex method so that this method could be moved to the only test that uses it

@@ -146,8 +148,9 @@ public Comparable getMinValue(String dimension)
{
ColumnHolder columnHolder = index.getColumnHolder(dimension);
if (columnHolder != null && columnHolder.getCapabilities().hasBitmapIndexes()) {
BitmapIndex bitmap = columnHolder.getBitmapIndex();
return bitmap.getCardinality() > 0 ? bitmap.getValue(0) : null;
ColumnIndexSupplier indexSupplier = columnHolder.getIndexSupplier();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total drive-by comment as it's not really related to this PR, but this getMinValue seems like a weird method to have on this API. It looks like it's used for some join logic? I wonder if that logic couldn't just specialize to the concrete class it expects and get the method that way instead...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree it is strange, and is used only by SegmentAnalyzer so agree it should be reworked though i might save it for a follow-up PR

@Override
public double estimateSelectivity(int totalRows)
{
return 1.;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal style thing. I like return 1D; better, mostly mentioning just to make sure that you are aware that 1D; is also an option so that you can evaluate according to your own aesthetics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i think i was just moving the existing code around, the lineage of this is from

      if (bitmapIndex == null || bitmapIndex.getCardinality() == 0) {
        return doesMatchNull() ? 1. : 0.;
      }

if i remember, just using 1 and 0 seems probably fine too. I'll change it to something else because i also find this to be among the stranger ways of writing it

/**
* Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary
*/
ImmutableBitmap getBitmap(int idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this shouldn't return BitmapColumnIndex instead of ImmutableBitmap?

Copy link
Member Author

@clintropolis clintropolis May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think it is necessary; BitmapColumnIndex is specifically for filter processing, while DictionaryEncodedStringValueIndex should never be used by filters because it makes assumptions about how the column is represented that filters shouldn't concern themselves with. It is intended to be a 'leaky' abstraction to the way dictionary encoded columns with bitmap value indexes store things since it basically provides raw access. This class is only used by SegmentAnalyzer, ListFilteredVirtualColumn, search query strategies, and segment merging code, all of which are doing lots of invasive things in general to know things about the segments.

Part of me wants to detach this from the ColumnIndexSupplier and make it its own thing so that it can't be for filtering, but on the other hand, it needs all of the same things that ColumnIndexSupplier has, and all existing methods were using the same machinery as filters, so its a bit disruptive. I will maybe consider this as a follow up, change though, so that way we can tie this class more strongly to dictionary encoded string columns, since it isn't really valid for any other type of column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, makes sense.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed only the interface design. The latest change LGTM.

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more minor comments, but overall, looks good.

indexSupplier = virtualColumns.getIndexSupplier(column, columnSelector);
} else {
final ColumnHolder columnHolder = columnSelector.getColumnHolder(column);
if (columnHolder == null || !columnHolder.getCapabilities().isFilterable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a great thing to resolve with bundles. Everything is technically filterable with a value predicate...

/**
* Get the {@link ImmutableBitmap} for dictionary id of the underlying dictionary
*/
ImmutableBitmap getBitmap(int idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, makes sense.

// Short-circuit.
return bitmapResultFactory.wrapAllFalse(Filters.allFalse(selector));
final BitmapColumnIndex index = filter.getBitmapColumnIndex(selector);
if (index == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will index be null? I'm trying to think of what that semantic could be and the dots are not really connecting for me? It seems like this should never be null and the filter should always do something to return a non-null response?

I'm thinking it might be related to the whole "not filterable columns" thingie? But, it seems like the logic that we are keeping is that not-filterable-columns are filterable, just all null. Alternatively, if this is an error case, I find myself wondering why there isn't an exception thrown deeper down that gives a meaningful error message about why the error is occurring...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is the method on Filter, index here will be null for any sub-filter which was unable to find an index to use, like how numeric columns currently are since they lack indexes.

At the lower level, if ColumnIndexSupplier, then the column doesn't exist (or is not filterable, which as mentioned in another comment I would sort of like to rework since not filterable doesn't really mean not filterable right now, it means all null and matches like that). All columns which exist have a non-null ColumnIndexSupplier. If the ColumnIndexSupplier is not null, then a null response from its as method means that the column doesn't have the type of index it was asked for.

Back to the filter getBitmapColumnIndex method, this method might return a non-null value even if the underlying column doesn't exist (ColumnIndexSupplier was null so it can match the matcher against null to produce an all true or all false), or the column does exist and supports one of the types of indexes that the filter understands.

Copy link
Contributor

@cheddar cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for green checkmark. Review was done as imply-cheddar

@gianm
Copy link
Contributor

gianm commented May 29, 2022

The changes to NullColumnPartSerde in this patch introduce a compatibility problem: prior to this patch, bitmapSerdeFactory was a required parameter, and now it's gone. So that means segments with explicit null columns written by servers on this patch cannot be read by a server that has #12279. I noticed this when upgrading a dev cluster.

We haven't done a release with #12279 yet, but it's slated to be in 0.23.0. So we should either change the code in 0.23.0 to not require bitmapSerdeFactory, or we should change master to write it out once again, lest we create an accidental storage format incompatibility.

Any thoughts on the best approach @clintropolis, @jihoonson, @abhishekagarwal87, or @imply-cheddar?

@clintropolis
Copy link
Member Author

clintropolis commented May 29, 2022

Ah good catch 👍

Hmm, I'm not really sure how to remove bitmapSerdeFactory without the changes here but will think about it, I guess in the worst case we could just leave it there even though it isn't used.

@clintropolis
Copy link
Member Author

hmm, I wonder if we could just allow ColumnHolder to be null from the NullColumnPartSerde instead of making a 'null' column holder implementation, then we wouldn't need to serialize rows or the serde factory, and it can behave just like any other actually null column... going to poke around and do some tests to see if this is feasible or if it doesn't work for some reason I'm not thinking of.

@clintropolis
Copy link
Member Author

created #12585 to add back bitmapSerdeFactory since that is the easy solution, if a bit ugly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants