-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix nested column range index range computation #13297
fix nested column range index range computation #13297
Conversation
localStartIndex = -(localFound + 1); | ||
// if the computed local start index violates the global range, shift up by 1 | ||
int actualGlobalStartIndex = localDictionary.get(localStartIndex); | ||
if (actualGlobalStartIndex < globalStartIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be done in a loop or can we be sure that the global index at the next localStartIndex
would be valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so i looked a lot closer in debugger and working things out by hand, and this code on the start index is pointless and only can be hit in cases where the range will effectively be empty anyway so can be dropped. it also uncovered a missing bounds check on FixedIndexed
get method, which I have fixed and added tests for 😅
I've re-arranged some things so the logic is simplified, it was really only the end index that needed adjustment when it was missing from the local dictionary, and that was because I was doing the shifting to account for the end index being exclusive in a funny manner. I think its clearer now. Thanks for asking this question!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Simpler to follow now.
I also wonder if the last value returned from FixedIndex.indexOf()
should always be -(minIndex + 1)
.
My concern is that if we haven't found the value until now, then the insertion point could either be before or after the currIndex
that we are evaluating. I think we should incorporate that fact into the returned value.
Say at the last step, we found that our value
is bigger than the currValue
, so minIndex
becomes currIndex + 1
, then we break out of the loop and then we return -(minIndex + 1) = -(currIndex + 2)
.
Doesn't this mean that we have actually skipped the insertion point? Or maybe I am missing something.
(Although, I guess this should be okay because the Indexed
interface doesn't make any promises about the returned negative value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FixedIndexed.indexOf
is the same-ish implementation as GenericIndexed.indexOf
, both of which are basically the same as Arrays.binarySearch
, so I think it is ok. Btw, range finder method is used for both FixedIndexed
and GenericIndexed
global dictionaries depending on if it is making LexicographicalRangeIndex
for nested strings or NumericRangeIndex
for nested numbers, though the local dictionary is always a FixedIndexed
.
Or did you mean something about how the range calculation is using indexOf
to compute the ranges?
btw, I did recently strengthen the contract of indexOf
to be (-(insertion point) - 1)
for missing values if isSorted()
returns true, since these index things basically require the (-(insertion point) - 1)
behavior to work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the clarification!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor query, otherwise looks good.
* fix nested column range index range computation * simplify, add missing bounds check for FixedIndexed
* fix nested column range index range computation * simplify, add missing bounds check for FixedIndexed
Description
Fixes a bug with nested column range index value range computation that can lead to incorrect values matching filters when the range is not present in the nested fields local dictionary.
This PR fixes the issue by checking that the computed local start and end index values global id mapping does not violate the computed original global id range whenever the actual global ids are not present in the local dictionary. That's a lot of words thats probably hard to follow, but its basically an off by 1 error when mapping the global range to the local range in certain cases.
This wrong behavior was accidentally encoded in a test I wrote which would have caught the issue if the test had been expecting the correct answer, so I went through and manually verified that the tests are actually expecting the correct thing this time, as well as tried to get 100% coverage on everything which matters in practice, which is now pretty high
The remaining uncovered lines are largely things like safety checks that are not hit in regular operation such as calling next on an iterator before calling hasNext, etc, which I cannot hit during normal usage of the index supplier.
Release note
Fixes a bug with nested columns when using filters that use range indexes such as greater/less than or like filters.
This PR has: