Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable #15559

Merged
merged 12 commits into from
Feb 27, 2024

Conversation

LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Dec 14, 2023

Description

The code in the groupBy engine and the topN engine assume that the dimensions are comparable and can call dimA.compareTo(dimB) to sort the dimensions and group them together.
This works well for the primitive dimensions, because they are Comparable, however falls apart when the dimensions can be arrays (or in future scenarios complex columns). In cases when the dimensions are not comparable, Druid resorts to having a wrapper type ComparableStringArray and ComparableList, which is a Comparable, based on the list comparator.

A cleaner approach would be to assume that each type is associated with a comparator - for primitives, it would be the natural comparator, and for arrays it would be the list comparator, and use the comparator to compare the dimensions for sorting. This code achieves the same, each dimension is comparable if:

  1. We know its type -> We can use the ComparisonUtil to fetch the comparator associated with the type
  2. We know the dimension is an Object in the desired type -> We resort to coercion using the DimensionHandlerUtils to coerce the dimension (from the column value selectors) into the necessary type (interchange numerics to their desired type, convert List to Object[] for arrays, parse strings if permitted etc). This coercion is necessary to allow rogue selectors (perhaps from extensions or expressions) to work well with the engine. Druid has been lenient in what selector's #getObject() can return, because it casts such cases in multiple places. Till we have strictly typed selectors, this coercion will be needed. The good thing about refactoring is that it is separate from 1, so once we feel coercion is not required, we can get rid of it easily.

Since this is a refactor, there's no user-facing impact of the same.


Key changed/added classes in this PR

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice thanks for taking this on, have wanted this to happen for a while 🤘

public class ComparisonUtils
{

public static Comparator<Object> getComparatorForType(TypeSignature<ValueType> type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary, can just use type.getNullableStrategy()

implements GroupByColumnSelectorStrategy
{
protected static final int GROUP_BY_MISSING_VALUE = -1;

// TODO(laksh): Keep the dictionary types as List<T> instead of Object[] to allow for equality comparisons
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, i think you could also use a collection that takes a comparator and use the comparator of the array type? I think that would let this work with any type of array (including nested arrays, if we also switch to sourcing the comparator from nullable typestrategy) to make this the 'default' dictionary building array strategy (leaving the string thing as an optimization)

Comment on lines 60 to 48
return addToIndexedDictionary(Arrays.asList((Double[]) object));
return addToIndexedDictionary(Arrays.stream((Double[]) object).toArray());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if we need to handle this case anymore (we used to not homogenize arrays to Object[] prior to #12914). Same comment for other types

That said, it probably doesn't hurt much to be here... just might also be able to get by with removing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are ad-hoc checks elsewhere, I kept it for now just in case we do encounter a rogue selector (possibly in some custom extension). If we do shift to ArrayColumnValueSelectors, we can get rid of it safely. I have added a comment about the redundancy of the branch though.

@@ -160,32 +146,33 @@ public Grouper.BufferComparator bufferComparator(int keyBufferPosition, @Nullabl
{
StringComparator comparator = stringComparator == null ? StringComparators.NUMERIC : stringComparator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, why are we using the string comparator here? it seems like maybe we should just switch to using the native array type comparator? I guess what I mean is that I feel like we should drop support for using any string comparator options unless we are grouping on actual strings. Everywhere else it adds a bunch of complexity that shouldn't exist imo. I know this isn't new here, just using it as an opportunity to start a discussion and maybe simplify some things.

@@ -437,19 +437,22 @@ private Ordering<ResultRow> dimensionOrdering(
{
Comparator arrayComparator = null;
if (columnType.isArray()) {
final ValueType elementType = columnType.getElementType().getType();
final TypeSignature<ValueType> elementType = columnType.getElementType();
if (columnType.getElementType().isNumeric()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, it feels like we should ignore the funny string comparator options for numeric arrays at least, but preferably all arrays. I wonder if we should just remove the option to use these comparators completely... SQL doesn't use them, but i suppose they are ok to use with actual string columns, but any other type feels like it should just be an error. I don't see what the use of sorting numbers or arrays lexicographically is...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to this and the previous comment, it is also ambiguous to the person writing the query, that the custom StringComparators coerce the entire array before comparing, or if it compares the elements piecewise, with the custom comparator (the latter happens).

I guess it should be fine to drop the support for "unnatural" comparators for the non-string arrays entirely since that's the most obscure use case. I wonder if we should discuss removing the custom comparators for the string types, and the numeric types in the mailing list before committing to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i will try to start a dev list thread soon since i think this is super strange and not very useful

Comment on lines 302 to 303
// TODO(laksh): Get this change vetted
return Arrays.toString((Object[]) valObj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does this happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happened in one of the funny cases that you alluded to before, where we try to compare arrays as strings. Since I wasn't able to find the test case directly, I have pushed a change without this method to see where it fails. Before the patch, when we used ComparableStringArrays and ComparableList, the toString was implicitly returning the required representation, however with the change, it started returning Object@..., which doesn't sit well with the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests like the following require this change:
ClientQuerySegmentWalkerTest#testGroupByOnArraysLongsAsString
ClientQuerySegmentWalkerTest#testGroupByOnArraysStrings
ClientQuerySegmentWalkerTest#testGroupByOnArraysStringsasString

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel like we could probably consider just make this an error case too, maybe also worth a dev list thread along with dropping support for funny string comparators on non-string types. i'll try to start a dev list thread soon

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 19, 2024
@@ -53,6 +63,39 @@ public static <T> Object2IntMap<T> createReverseDictionary()
return m;
}

private static <T> Object2IntMap<T> createReverseDictionary(final Hash.Strategy<T> hashStrategy)
{
final Object2IntOpenCustomHashMap<T> m = new Object2IntOpenCustomHashMap<>(hashStrategy);
Copy link
Member

@clintropolis clintropolis Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is fine for now, but i wonder if it is worth exploring/measuring using a sorted map instead like avl or rb tree map. if the cost difference isn't too dramatic we could instead just get by using the comparator instead of needing equals and hashcode to be implemented, which seems like it would simplify some things

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is open for sometime, lets do this work as part of a follow up PR since we would require a some benchmark code as well.

@cryptoe cryptoe merged commit 17e4f3a into apache:master Feb 27, 2024
82 of 83 checks passed
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Area - Segment Format and Ser/De Refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants