Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

LakshSingla · 2023-12-14T08:24:52Z

Description

The code in the groupBy engine and the topN engine assume that the dimensions are comparable and can call dimA.compareTo(dimB) to sort the dimensions and group them together.
This works well for the primitive dimensions, because they are Comparable, however falls apart when the dimensions can be arrays (or in future scenarios complex columns). In cases when the dimensions are not comparable, Druid resorts to having a wrapper type ComparableStringArray and ComparableList, which is a Comparable, based on the list comparator.

A cleaner approach would be to assume that each type is associated with a comparator - for primitives, it would be the natural comparator, and for arrays it would be the list comparator, and use the comparator to compare the dimensions for sorting. This code achieves the same, each dimension is comparable if:

We know its type -> We can use the ComparisonUtil to fetch the comparator associated with the type
We know the dimension is an Object in the desired type -> We resort to coercion using the DimensionHandlerUtils to coerce the dimension (from the column value selectors) into the necessary type (interchange numerics to their desired type, convert List to Object[] for arrays, parse strings if permitted etc). This coercion is necessary to allow rogue selectors (perhaps from extensions or expressions) to work well with the engine. Druid has been lenient in what selector's #getObject() can return, because it casts such cases in multiple places. Till we have strictly typed selectors, this coercion will be needed. The good thing about refactoring is that it is separate from 1, so once we feel coercion is not required, we can get rid of it easily.

Since this is a refactor, there's no user-facing impact of the same.

Key changed/added classes in this PR

This PR has:

processing/src/test/java/org/apache/druid/segment/DimensionHandlerUtilsTest.java

processing/src/main/java/org/apache/druid/query/ComparisonUtils.java

clintropolis

nice thanks for taking this on, have wanted this to happen for a while 🤘

clintropolis · 2024-01-11T23:46:34Z

processing/src/main/java/org/apache/druid/query/ComparisonUtils.java

+public class ComparisonUtils
+{
+
+  public static Comparator<Object> getComparatorForType(TypeSignature<ValueType> type)


I don't think this is necessary, can just use type.getNullableStrategy()

clintropolis · 2024-01-11T23:53:13Z

...ache/druid/query/groupby/epinephelinae/column/ArrayNumericGroupByColumnSelectorStrategy.java

    implements GroupByColumnSelectorStrategy
 {
  protected static final int GROUP_BY_MISSING_VALUE = -1;

+  // TODO(laksh): Keep the dictionary types as List<T> instead of Object[] to allow for equality comparisons


alternatively, i think you could also use a collection that takes a comparator and use the comparator of the array type? I think that would let this work with any type of array (including nested arrays, if we also switch to sourcing the comparator from nullable typestrategy) to make this the 'default' dictionary building array strategy (leaving the string thing as an optimization)

processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java

clintropolis · 2024-02-07T07:10:01Z

...pache/druid/query/groupby/epinephelinae/column/ArrayDoubleGroupByColumnSelectorStrategy.java

-      return addToIndexedDictionary(Arrays.asList((Double[]) object));
+      return addToIndexedDictionary(Arrays.stream((Double[]) object).toArray());


i wonder if we need to handle this case anymore (we used to not homogenize arrays to Object[] prior to #12914). Same comment for other types

That said, it probably doesn't hurt much to be here... just might also be able to get by with removing it

Since there are ad-hoc checks elsewhere, I kept it for now just in case we do encounter a rogue selector (possibly in some custom extension). If we do shift to ArrayColumnValueSelectors, we can get rid of it safely. I have added a comment about the redundancy of the branch though.

clintropolis · 2024-02-07T07:16:00Z

...ache/druid/query/groupby/epinephelinae/column/ArrayNumericGroupByColumnSelectorStrategy.java

@@ -160,32 +146,33 @@ public Grouper.BufferComparator bufferComparator(int keyBufferPosition, @Nullabl
  {
    StringComparator comparator = stringComparator == null ? StringComparators.NUMERIC : stringComparator;


hmm, why are we using the string comparator here? it seems like maybe we should just switch to using the native array type comparator? I guess what I mean is that I feel like we should drop support for using any string comparator options unless we are grouping on actual strings. Everywhere else it adds a bunch of complexity that shouldn't exist imo. I know this isn't new here, just using it as an opportunity to start a discussion and maybe simplify some things.

clintropolis · 2024-02-07T07:19:24Z

processing/src/main/java/org/apache/druid/query/groupby/orderby/DefaultLimitSpec.java

@@ -437,19 +437,22 @@ private Ordering<ResultRow> dimensionOrdering(
  {
    Comparator arrayComparator = null;
    if (columnType.isArray()) {
-      final ValueType elementType = columnType.getElementType().getType();
+      final TypeSignature<ValueType> elementType = columnType.getElementType();
      if (columnType.getElementType().isNumeric()) {


hmm, it feels like we should ignore the funny string comparator options for numeric arrays at least, but preferably all arrays. I wonder if we should just remove the option to use these comparators completely... SQL doesn't use them, but i suppose they are ok to use with actual string columns, but any other type feels like it should just be an error. I don't see what the use of sorting numbers or arrays lexicographically is...

Adding to this and the previous comment, it is also ambiguous to the person writing the query, that the custom StringComparators coerce the entire array before comparing, or if it compares the elements piecewise, with the custom comparator (the latter happens).

I guess it should be fine to drop the support for "unnatural" comparators for the non-string arrays entirely since that's the most obscure use case. I wonder if we should discuss removing the custom comparators for the string types, and the numeric types in the mailing list before committing to it.

yea, i will try to start a dev list thread soon since i think this is super strange and not very useful

clintropolis · 2024-02-07T07:20:20Z

processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java

+      // TODO(laksh): Get this change vetted
+      return Arrays.toString((Object[]) valObj);


when does this happen?

It happened in one of the funny cases that you alluded to before, where we try to compare arrays as strings. Since I wasn't able to find the test case directly, I have pushed a change without this method to see where it fails. Before the patch, when we used ComparableStringArrays and ComparableList, the toString was implicitly returning the required representation, however with the change, it started returning Object@..., which doesn't sit well with the test.

Tests like the following require this change:
ClientQuerySegmentWalkerTest#testGroupByOnArraysLongsAsString
ClientQuerySegmentWalkerTest#testGroupByOnArraysStrings
ClientQuerySegmentWalkerTest#testGroupByOnArraysStringsasString

i feel like we could probably consider just make this an error case too, maybe also worth a dev list thread along with dropping support for funny string comparators on non-string types. i'll try to start a dev list thread soon

clintropolis · 2024-02-23T11:26:53Z

processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/DictionaryBuilding.java

@@ -53,6 +63,39 @@ public static <T> Object2IntMap<T> createReverseDictionary()
    return m;
  }

+  private static <T> Object2IntMap<T> createReverseDictionary(final Hash.Strategy<T> hashStrategy)
+  {
+    final Object2IntOpenCustomHashMap<T> m = new Object2IntOpenCustomHashMap<>(hashStrategy);


i think this is fine for now, but i wonder if it is worth exploring/measuring using a sorted map instead like avl or rb tree map. if the cost difference isn't too dramatic we could instead just get by using the comparator instead of needing equals and hashcode to be implemented, which seems like it would simplify some things

Since this PR is open for sometime, lets do this work as part of a follow up PR since we would require a some benchmark code as well.

LakshSingla added 2 commits December 7, 2023 10:59

init

d254742

more refactor, calciteqtest working

0c5b719

github-actions bot added the Area - Segment Format and Ser/De label Dec 14, 2023

github-advanced-security bot found potential problems Dec 14, 2023

View reviewed changes

processing/src/test/java/org/apache/druid/segment/DimensionHandlerUtilsTest.java Fixed Show fixed Hide fixed

processing/src/main/java/org/apache/druid/query/ComparisonUtils.java Fixed Show fixed Hide fixed

remove wrappers

c13551e

github-actions bot added the Area - Querying label Dec 14, 2023

LakshSingla added 3 commits December 14, 2023 22:55

checkstyle, passing tests

6e5cfae

Merge branch 'master' into unified-group-by-strategies

b686706

compilation

c384eab

clintropolis reviewed Jan 12, 2024

View reviewed changes

clintropolis added the Refactoring label Jan 12, 2024

LakshSingla mentioned this pull request Jan 18, 2024

Sql Single Value Aggregator for scalar queries #15700

Merged

10 tasks

LakshSingla added 3 commits February 1, 2024 02:09

Merge branch 'master' into unified-group-by-strategies

24a252c

more changes

fa814da

fix tests

7c3d67a

clintropolis reviewed Feb 7, 2024

View reviewed changes

LakshSingla added 3 commits February 19, 2024 11:02

Merge branch 'master' into unified-group-by-strategies

11fa768

address comments

f8f4084

add tests, comments

e4adb0e

github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 19, 2024

clintropolis approved these changes Feb 23, 2024

View reviewed changes

cryptoe merged commit 17e4f3a into apache:master Feb 27, 2024
82 of 83 checks passed

adarshsanjeev added this to the 30.0.0 milestone May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

LakshSingla commented Dec 14, 2023 •

edited

Loading

clintropolis left a comment

clintropolis Jan 11, 2024

clintropolis Jan 11, 2024

clintropolis Feb 7, 2024

LakshSingla Feb 19, 2024

clintropolis Feb 7, 2024

clintropolis Feb 7, 2024

LakshSingla Feb 19, 2024

clintropolis Feb 23, 2024

clintropolis Feb 7, 2024

LakshSingla Feb 19, 2024

LakshSingla Feb 19, 2024

clintropolis Feb 23, 2024

clintropolis Feb 23, 2024 •

edited

Loading

cryptoe Feb 27, 2024

		return addToIndexedDictionary(Arrays.asList((Double[]) object));
		return addToIndexedDictionary(Arrays.stream((Double[]) object).toArray());

		@@ -160,32 +146,33 @@ public Grouper.BufferComparator bufferComparator(int keyBufferPosition, @Nullabl
		{
		StringComparator comparator = stringComparator == null ? StringComparators.NUMERIC : stringComparator;

		// TODO(laksh): Get this change vetted
		return Arrays.toString((Object[]) valObj);

Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

Conversation

LakshSingla commented Dec 14, 2023 • edited Loading

Description

Key changed/added classes in this PR

clintropolis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clintropolis Feb 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LakshSingla commented Dec 14, 2023 •

edited

Loading

clintropolis Feb 23, 2024 •

edited

Loading