fix nested column range index range computation #13297

clintropolis · 2022-11-02T09:25:35Z

Description

Fixes a bug in the value range computation for nested column range indexes that can cause incorrect values to match a filter when the bounds of the range are not present in the nested field's local dictionary.

This PR fixes the issue by checking that the global ids mapped from the computed local start and end indexes do not fall outside the originally computed global id range whenever the actual global ids are not present in the local dictionary. That's a lot of words that are probably hard to follow, but it's basically an off-by-one error when mapping the global range to the local range in certain cases.

This wrong behavior was accidentally encoded in a test I wrote, which would have caught the issue if it had been expecting the correct answer. I went through and manually verified that the tests are actually expecting the correct thing this time, and also tried to get 100% coverage on everything that matters in practice; coverage is now pretty high.

The remaining uncovered lines are largely safety checks that are not hit in regular operation, such as calling next on an iterator before calling hasNext, which I cannot hit during normal usage of the index supplier.
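
The global-to-local range mapping described above can be sketched roughly as follows. This is an illustration only, not the code in NestedFieldLiteralColumnIndexSupplier: the class name LocalRangeSketch, its method names, and the use of a plain sorted int[] of global ids as the local dictionary are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch; names and types are not taken from the Druid code base.
final class LocalRangeSketch
{
  /**
   * Maps a half-open range of global ids [globalStart, globalEnd) to a half-open
   * range of positions [localStart, localEnd) in a sorted local dictionary.
   */
  static int[] localRange(int[] localDictionary, int globalStart, int globalEnd)
  {
    int localStart = firstPositionAtOrAfter(localDictionary, globalStart);
    int localEnd = firstPositionAtOrAfter(localDictionary, globalEnd);
    // Guard against inverted input so the result is never a negative-length range.
    return new int[]{localStart, Math.max(localStart, localEnd)};
  }

  /**
   * Position of the first entry that is greater than or equal to the key. When the
   * key is missing, Arrays.binarySearch returns (-(insertion point) - 1), and the
   * insertion point is already the correct position for both bounds.
   */
  static int firstPositionAtOrAfter(int[] sorted, int key)
  {
    int found = Arrays.binarySearch(sorted, key);
    return found >= 0 ? found : -(found + 1);
  }
}
```

With a local dictionary of {2, 5, 9} and a global range of [3, 8), neither bound is present locally, and the sketch yields local positions [1, 2), which covers only global id 5. Shifting the missing end bound up by one, as if it had been found and needed adjusting for exclusivity, would extend the range to include 9, which lies outside the requested global range.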

Release note

Fixes a bug with nested columns when using filters that use range indexes such as greater/less than or like filters.

This PR has:

- been self-reviewed.
- a release note entry in the PR description.
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- been tested in a test Druid cluster.

kfaraz · 2022-11-02T14:59:25Z

...ing/src/main/java/org/apache/druid/segment/nested/NestedFieldLiteralColumnIndexSupplier.java

+      localStartIndex = -(localFound + 1);
+      // if the computed local start index violates the global range, shift up by 1
+      int actualGlobalStartIndex = localDictionary.get(localStartIndex);
+      if (actualGlobalStartIndex < globalStartIndex) {


Should this be done in a loop or can we be sure that the global index at the next localStartIndex would be valid?

OK, so I looked a lot closer in the debugger and worked things out by hand: this code on the start index is pointless, since it can only be hit in cases where the range will effectively be empty anyway, so it can be dropped. It also uncovered a missing bounds check on the FixedIndexed get method, which I have fixed and added tests for 😅

I've re-arranged some things so the logic is simplified. It was really only the end index that needed adjustment when it was missing from the local dictionary, and that was because I was doing the shifting to account for the end index being exclusive in a funny manner. I think it's clearer now. Thanks for asking this question!
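
For readers unfamiliar with why a missing bounds check matters here, a minimal sketch follows. It is not the FixedIndexed implementation: FixedWidthIntsSketch is a hypothetical class, and the assumption that unrelated bytes follow the values in the backing buffer is made only to show how an unchecked read could silently return garbage, for instance when an insertion point equal to the dictionary size is passed to get.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-width int reader; not Druid's FixedIndexed.
final class FixedWidthIntsSketch
{
  private final ByteBuffer buffer;
  private final int size;

  FixedWidthIntsSketch(int[] values)
  {
    // Assumption: one extra slot stands in for unrelated bytes that follow the
    // values inside a larger shared buffer.
    this.buffer = ByteBuffer.allocate((values.length + 1) * Integer.BYTES);
    for (int value : values) {
      buffer.putInt(value);
    }
    buffer.putInt(Integer.MIN_VALUE);
    this.size = values.length;
  }

  int size()
  {
    return size;
  }

  int get(int index)
  {
    // Without this check, get(size) would read the trailing unrelated bytes
    // instead of failing.
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException(
          "Index[" + index + "] is out of bounds for size[" + size + "]"
      );
    }
    return buffer.getInt(index * Integer.BYTES);
  }
}
```

In this sketch, calling get with an index equal to size() throws instead of returning the trailing value, so a caller that computes an insertion point past the end of the dictionary fails loudly rather than using a bogus global id.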

Thanks! Simpler to follow now.

I also wonder if the last value returned from FixedIndexed.indexOf() should always be -(minIndex + 1).
My concern is that if we haven't found the value by that point, the insertion point could be either before or after the currIndex that we are evaluating. I think we should incorporate that fact into the returned value.

Say at the last step, we found that our value is bigger than the currValue, so minIndex becomes currIndex + 1, then we break out of the loop and then we return -(minIndex + 1) = -(currIndex + 2).

Doesn't this mean that we have actually skipped the insertion point? Or maybe I am missing something.
(Although, I guess this should be okay because the Indexed interface doesn't make any promises about the returned negative value.)

FixedIndexed.indexOf is the same-ish implementation as GenericIndexed.indexOf, both of which are basically the same as Arrays.binarySearch, so I think it is OK. By the way, the range finder method is used for both FixedIndexed and GenericIndexed global dictionaries, depending on whether it is making a LexicographicalRangeIndex for nested strings or a NumericRangeIndex for nested numbers, though the local dictionary is always a FixedIndexed.

Or did you mean something about how the range calculation is using indexOf to compute the ranges?

By the way, I did recently strengthen the contract of indexOf to be (-(insertion point) - 1) for missing values if isSorted() returns true, since these index classes basically require the (-(insertion point) - 1) behavior to work correctly.
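
The question about returning -(minIndex + 1) can be checked against a small sketch of the binary search. This is a simplified stand-in over a sorted int[], written in the style of Arrays.binarySearch; it is not the FixedIndexed or GenericIndexed code, and IndexOfSketch is a hypothetical name.

```java
// Hypothetical sketch; a simplified binary search over a sorted int[].
final class IndexOfSketch
{
  /**
   * Returns the index of the value if present, otherwise (-(insertion point) - 1).
   */
  static int indexOf(int[] sorted, int value)
  {
    int minIndex = 0;
    int maxIndex = sorted.length - 1;
    while (minIndex <= maxIndex) {
      int currIndex = (minIndex + maxIndex) >>> 1;
      int currValue = sorted[currIndex];
      if (currValue == value) {
        return currIndex;
      }
      if (currValue < value) {
        // The value belongs somewhere after currIndex.
        minIndex = currIndex + 1;
      } else {
        // The value belongs at or before currIndex; minIndex is left unchanged.
        maxIndex = currIndex - 1;
      }
    }
    // Every entry before minIndex is smaller than the value and every entry from
    // minIndex onward is larger, so minIndex is the insertion point.
    return -(minIndex + 1);
  }
}
```

For {10, 20, 30}, searching for 35 ends with minIndex one past the last examined currIndex and returns -4, which decodes to insertion point 3. Searching for 25 last examines index 2, leaves minIndex at 2, and returns -3. In both cases the insertion point is where the value would go, so it is not skipped.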

Makes sense, thanks for the clarification!

kfaraz

Minor query, otherwise looks good.

* fix nested column range index range computation
* simplify, add missing bounds check for FixedIndexed

fix nested column range index range computation

cf32e19

clintropolis added Bug Area - Querying labels Nov 2, 2022

kfaraz reviewed Nov 2, 2022

kfaraz approved these changes Nov 2, 2022

simplify, add missing bounds check for FixedIndexed

43ea322

clintropolis merged commit 018f984 into apache:master Nov 3, 2022

clintropolis deleted the fix-nested-column-range-filter branch November 3, 2022 04:37

clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 3, 2022

fix nested column range index range computation (apache#13297)

dd965e5

* fix nested column range index range computation
* simplify, add missing bounds check for FixedIndexed

clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 3, 2022

fix nested column range index range computation (apache#13297)

0b9e106

* fix nested column range index range computation
* simplify, add missing bounds check for FixedIndexed

kfaraz added this to the 24.0.1 milestone Nov 3, 2022

kfaraz pushed a commit that referenced this pull request Nov 3, 2022

fix nested column range index range computation (#13297) (#13300)

7b9dd1c

* fix nested column range index range computation
* simplify, add missing bounds check for FixedIndexed

kfaraz mentioned this pull request Nov 7, 2022

[Draft] 24.0.1 Release Notes #13320

Closed

clintropolis mentioned this pull request Nov 22, 2022

fix off by one error in nested column range index #13405

Merged

3 tasks
