Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559
nice thanks for taking this on, have wanted this to happen for a while 🤘
public class ComparisonUtils
{

  public static Comparator<Object> getComparatorForType(TypeSignature<ValueType> type)
I don't think this is necessary, can just use type.getNullableStrategy()
implements GroupByColumnSelectorStrategy
{
  protected static final int GROUP_BY_MISSING_VALUE = -1;

  // TODO(laksh): Keep the dictionary types as List<T> instead of Object[] to allow for equality comparisons
alternatively, i think you could also use a collection that takes a comparator and use the comparator of the array type? I think that would let this work with any type of array (including nested arrays, if we also switch to sourcing the comparator from nullable typestrategy) to make this the 'default' dictionary building array strategy (leaving the string thing as an optimization)
return addToIndexedDictionary(Arrays.asList((Double[]) object));
return addToIndexedDictionary(Arrays.stream((Double[]) object).toArray());
i wonder if we need to handle this case anymore (we used to not homogenize arrays to Object[] prior to #12914). Same comment for other types.
That said, it probably doesn't hurt much to be here... just might also be able to get by with removing it
Since there are ad-hoc checks elsewhere, I kept it for now just in case we do encounter a rogue selector (possibly in some custom extension). If we do shift to ArrayColumnValueSelectors, we can get rid of it safely. I have added a comment about the redundancy of the branch though.
@@ -160,32 +146,33 @@ public Grouper.BufferComparator bufferComparator(int keyBufferPosition, @Nullabl
{
  StringComparator comparator = stringComparator == null ? StringComparators.NUMERIC : stringComparator;
hmm, why are we using the string comparator here? it seems like maybe we should just switch to using the native array type comparator? I guess what I mean is that I feel like we should drop support for using any string comparator options unless we are grouping on actual strings. Everywhere else it adds a bunch of complexity that shouldn't exist imo. I know this isn't new here, just using it as an opportunity to start a discussion and maybe simplify some things.
@@ -437,19 +437,22 @@ private Ordering<ResultRow> dimensionOrdering(
{
  Comparator arrayComparator = null;
  if (columnType.isArray()) {
    final ValueType elementType = columnType.getElementType().getType();
    final TypeSignature<ValueType> elementType = columnType.getElementType();
    if (columnType.getElementType().isNumeric()) {
hmm, it feels like we should ignore the funny string comparator options for numeric arrays at least, but preferably all arrays. I wonder if we should just remove the option to use these comparators completely... SQL doesn't use them, but i suppose they are ok to use with actual string columns, but any other type feels like it should just be an error. I don't see what the use of sorting numbers or arrays lexicographically is...
Adding to this and the previous comment, it is also ambiguous to the person writing the query whether the custom StringComparators coerce the entire array to a string before comparing, or compare the elements piecewise with the custom comparator (the latter happens).
I guess it should be fine to drop the support for "unnatural" comparators for the non-string arrays entirely, since that's the most obscure use case. I wonder if we should discuss removing the custom comparators for the string types and the numeric types on the mailing list before committing to it.
yea, i will try to start a dev list thread soon since i think this is super strange and not very useful
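To make the ambiguity discussed above concrete, here is a minimal plain-Java sketch, not Druid code, of what a piecewise comparison looks like. The class `ArrayComparatorSketch` and the helper `elementWise` are hypothetical names introduced only for illustration, and the "lexicographic" element comparator is a stand-in for a string-style comparator, not Druid's `StringComparators`.

```java
import java.util.Comparator;

final class ArrayComparatorSketch
{
  // Hypothetical helper: compares two arrays element by element,
  // falling back to length when one array is a prefix of the other.
  static <T> Comparator<T[]> elementWise(Comparator<T> elementComparator)
  {
    return (a, b) -> {
      int n = Math.min(a.length, b.length);
      for (int i = 0; i < n; i++) {
        int c = elementComparator.compare(a[i], b[i]);
        if (c != 0) {
          return c;
        }
      }
      return Integer.compare(a.length, b.length);
    };
  }

  public static void main(String[] args)
  {
    Long[] x = {2L, 10L};
    Long[] y = {2L, 9L};

    // Natural numeric ordering of the elements.
    Comparator<Long> natural = (a, b) -> Long.compare(a, b);
    // Stand-in for a string-style comparator applied to each element.
    Comparator<Long> lexicographic = (a, b) -> a.toString().compareTo(b.toString());

    // Numeric: 10 > 9, so x sorts after y.
    System.out.println(Integer.signum(elementWise(natural).compare(x, y)));
    // Lexicographic per element: "10" < "9", so x sorts before y.
    System.out.println(Integer.signum(elementWise(lexicographic).compare(x, y)));
  }
}
```

The same two arrays order differently depending on the element comparator, which is the surprise a query author would run into when a string comparator is applied to a numeric array.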
// TODO(laksh): Get this change vetted
return Arrays.toString((Object[]) valObj);
when does this happen?
It happened in one of the funny cases that you alluded to before, where we try to compare arrays as strings. Since I wasn't able to find the test case directly, I have pushed a change without this method to see where it fails. Before the patch, when we used ComparableStringArrays and ComparableList, the toString was implicitly returning the required representation; however, with the change, it started returning Object@..., which doesn't sit well with the test.
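For readers unfamiliar with the behaviour described here, the following plain-Java sketch shows why a raw `Object[]` needs `Arrays.toString` to get a readable representation. The class `ArrayToStringSketch` and its `render` method are hypothetical and only mirror the shape of the snippet under review; they are not the PR's code.

```java
import java.util.Arrays;

final class ArrayToStringSketch
{
  // Hypothetical helper: renders arrays by content and everything else via String.valueOf.
  static String render(Object value)
  {
    if (value instanceof Object[]) {
      return Arrays.toString((Object[]) value);
    }
    return String.valueOf(value);
  }

  public static void main(String[] args)
  {
    Object[] arr = {1L, 2L, null};

    // Arrays inherit Object#toString, so this prints a type tag plus an identity hash.
    System.out.println(arr.toString());
    // Content-based rendering of the same array.
    System.out.println(render(arr));
  }
}
```

A wrapper type with its own `toString` hides this difference, which is consistent with the tests only starting to fail once the wrappers were removed.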
Tests like the following require this change:
ClientQuerySegmentWalkerTest#testGroupByOnArraysLongsAsString
ClientQuerySegmentWalkerTest#testGroupByOnArraysStrings
ClientQuerySegmentWalkerTest#testGroupByOnArraysStringsasString
i feel like we could probably consider just making this an error case too, maybe also worth a dev list thread along with dropping support for funny string comparators on non-string types. i'll try to start a dev list thread soon
@@ -53,6 +63,39 @@ public static <T> Object2IntMap<T> createReverseDictionary()
  return m;
}

private static <T> Object2IntMap<T> createReverseDictionary(final Hash.Strategy<T> hashStrategy)
{
  final Object2IntOpenCustomHashMap<T> m = new Object2IntOpenCustomHashMap<>(hashStrategy);
i think this is fine for now, but i wonder if it is worth exploring/measuring using a sorted map instead like avl or rb tree map. if the cost difference isn't too dramatic we could instead just get by using the comparator instead of needing equals and hashcode to be implemented, which seems like it would simplify some things
Since this PR has been open for some time, let's do this work as part of a follow-up PR, since we would require some benchmark code as well.
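As a rough illustration of the sorted-map alternative raised in this thread, the sketch below builds a dictionary whose reverse lookup relies only on a comparator, with no `equals` or `hashCode` for array keys. It uses `java.util.TreeMap` (a red-black tree) purely as a stand-in; the class `SortedReverseDictionarySketch` and its methods are hypothetical, and nothing here measures the cost difference the thread says would need benchmarking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class SortedReverseDictionarySketch
{
  private final List<Object[]> forward = new ArrayList<>();
  private final TreeMap<Object[], Integer> reverse;

  SortedReverseDictionarySketch(Comparator<Object[]> comparator)
  {
    // Key identity is decided by the comparator: compare(a, b) == 0 means "same key".
    this.reverse = new TreeMap<>(comparator);
  }

  // Returns the existing id for an equivalent array, or assigns the next id.
  int add(Object[] key)
  {
    Integer existing = reverse.get(key);
    if (existing != null) {
      return existing;
    }
    int id = forward.size();
    forward.add(key);
    reverse.put(key, id);
    return id;
  }

  Object[] lookup(int id)
  {
    return forward.get(id);
  }

  public static void main(String[] args)
  {
    // Element comparator for this example only: assumes non-null Long elements.
    Comparator<Object> longs = (x, y) -> Long.compare((Long) x, (Long) y);
    // Arrays.compare with a comparator is available since Java 9.
    Comparator<Object[]> arrays = (a, b) -> Arrays.compare(a, b, longs);

    SortedReverseDictionarySketch dict = new SortedReverseDictionarySketch(arrays);
    System.out.println(dict.add(new Object[]{1L, 2L}));
    System.out.println(dict.add(new Object[]{3L}));
    // A distinct array instance with the same contents maps to the first id.
    System.out.println(dict.add(new Object[]{1L, 2L}));
  }
}
```

The trade-off is the usual one: lookups become O(log n) comparisons instead of an expected O(1) hash probe, in exchange for needing only a comparator per type.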
Description
The code in the groupBy engine and the topN engine assumes that the dimensions are comparable and can call dimA.compareTo(dimB) to sort the dimensions and group them together. This works well for the primitive dimensions, because they are Comparable; however, it falls apart when the dimensions can be arrays (or, in future scenarios, complex columns). In cases when the dimensions are not comparable, Druid resorts to the wrapper types ComparableStringArray and ComparableList, which are Comparable, based on the list comparator.
A cleaner approach would be to assume that each type is associated with a comparator (for primitives, it would be the natural comparator, and for arrays it would be the list comparator) and to use that comparator to compare the dimensions for sorting. This code achieves the same, each dimension is comparable if:
Since this is a refactor, there's no user-facing impact of the same.
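The idea in the description, that each type carries its own comparator so values need not implement Comparable, can be sketched in plain Java as follows. This is an illustration under assumptions, not the PR's implementation: `TypeComparatorSketch`, `SketchType`, and `sorted` are hypothetical names, and only three toy types are modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TypeComparatorSketch
{
  // Hypothetical type tag: each type knows how to compare its own values.
  enum SketchType
  {
    LONG, STRING, LONG_ARRAY;

    Comparator<Object> comparator()
    {
      final Comparator<Object> base;
      switch (this) {
        case LONG:
          base = (a, b) -> Long.compare((Long) a, (Long) b);
          break;
        case STRING:
          base = (a, b) -> ((String) a).compareTo((String) b);
          break;
        default:
          // Arrays use a list-style comparator built from the element comparator.
          final Comparator<Object> element = LONG.comparator();
          base = (a, b) -> compareArrays((Object[]) a, (Object[]) b, element);
          break;
      }
      return Comparator.nullsFirst(base);
    }
  }

  static int compareArrays(Object[] a, Object[] b, Comparator<Object> element)
  {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int c = element.compare(a[i], b[i]);
      if (c != 0) {
        return c;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  // Sorts dimension values using the type's comparator instead of compareTo.
  static List<Object> sorted(SketchType type, List<Object> values)
  {
    List<Object> copy = new ArrayList<>(values);
    copy.sort(type.comparator());
    return copy;
  }

  public static void main(String[] args)
  {
    List<Object> dims = new ArrayList<>();
    dims.add(new Object[]{2L, 10L});
    dims.add(null);
    dims.add(new Object[]{2L, 9L});
    dims.add(new Object[]{1L});

    for (Object dim : sorted(SketchType.LONG_ARRAY, dims)) {
      System.out.println(dim == null ? "null" : java.util.Arrays.toString((Object[]) dim));
    }
  }
}
```

With this shape, raw `Object[]` values can be sorted directly, so no Comparable wrapper object is needed per row.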
Key changed/added classes in this PR
This PR has: