Deduce type from the aggregators when materializing subquery results #16703

Merged
merged 3 commits into from
Jul 23, 2024

Conversation

LakshSingla
Contributor

Description

For aggregators such as StringFirst/Last, whose intermediate type differs from the final type, using them in GroupBy, TopN, or Timeseries subqueries causes a fallback when maxSubqueryBytes is set. The cause is that the finalization is assumed to be unknown, so the row signature cannot determine whether to use the intermediate or the final type and sets the column type to null. This PR derives the finalization from the query context and uses the intermediate or the final type accordingly.
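To make the description above concrete, here is a minimal sketch of how a column type might resolve under each finalization mode. It is a simplified model, not Druid's code: `FinalizationSketch`, `Aggregator`, `columnType`, and `fromContext` are hypothetical names, and the type strings `COMPLEX` and `STRING` are placeholders for an aggregator whose intermediate and final types differ.

```java
import java.util.Optional;

// Hypothetical, simplified model for illustration; these are not Druid's classes.
final class FinalizationSketch {
    enum Finalization { YES, NO, UNKNOWN }

    // Illustrative aggregator: an output name plus its two possible column types.
    static final class Aggregator {
        final String name;
        final String intermediateType;
        final String finalizedType;

        Aggregator(String name, String intermediateType, String finalizedType) {
            this.name = name;
            this.intermediateType = intermediateType;
            this.finalizedType = finalizedType;
        }
    }

    // Returns the column type, or empty when it cannot be determined.
    static Optional<String> columnType(Aggregator agg, Finalization finalization) {
        switch (finalization) {
            case YES:
                return Optional.of(agg.finalizedType);
            case NO:
                return Optional.of(agg.intermediateType);
            default:
                // With UNKNOWN, the type is only known when both types agree.
                return agg.intermediateType.equals(agg.finalizedType)
                    ? Optional.of(agg.finalizedType)
                    : Optional.empty();
        }
    }

    // Assumption: the mode is derived from a boolean "finalize" context flag.
    static Finalization fromContext(boolean finalizeFlag) {
        return finalizeFlag ? Finalization.YES : Finalization.NO;
    }

    public static void main(String[] args) {
        Aggregator stringFirst = new Aggregator("a0", "COMPLEX", "STRING");
        System.out.println(columnType(stringFirst, Finalization.UNKNOWN));
        System.out.println(columnType(stringFirst, fromContext(true)));
    }
}
```

In this model the unknown mode yields no type for such an aggregator, which corresponds to the null type that triggers the fallback, while a mode derived from the context flag always yields a concrete type.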

Release note


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines +814 to +818
RowSignature rowSignature = query.getResultRowSignature(
    query.context().isFinalize(true)
    ? RowSignature.Finalization.YES
    : RowSignature.Finalization.NO
);
Member

@kgyrtkirk kgyrtkirk Jul 15, 2024

I believe this logic is supposed to live inside query.getResultRowSignature; the query already knows its context(), so why should we tell it the value of Finalization from the outside?

doesn't that work?

Contributor Author

It should work, but I am scared to make that change given that it will affect everything from native to SQL queries. Lemme try making the change and see if any tests fail.

Contributor Author

I realised that it shouldn't work. For example, look at GroupByPreShuffleFrameProcessor and GroupByPostShuffleFrameProcessor. The same query requires different finalization modes, since one partially aggregates and needs the intermediate type, while the other fully aggregates and finalizes. This information isn't fully captured by the query, so the caller has to specify which finalization mode to use. Therefore we can't reliably determine it from the query context alone.
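The point about one query needing two finalization modes can be sketched as follows. This is a simplified model under stated assumptions, not Druid's implementation: `StageFinalizationSketch`, `Query`, `resultRowSignature`, `preShuffleSignature`, and `postShuffleSignature` are hypothetical names, and the two helper methods only stand in for a partially aggregating step and a fully aggregating step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model for illustration; these are not Druid's classes.
final class StageFinalizationSketch {
    enum Finalization { YES, NO }

    // Illustrative aggregator: an output name plus its two possible column types.
    static final class Aggregator {
        final String name;
        final String intermediateType;
        final String finalizedType;

        Aggregator(String name, String intermediateType, String finalizedType) {
            this.name = name;
            this.intermediateType = intermediateType;
            this.finalizedType = finalizedType;
        }
    }

    // The query knows its aggregators, but not which processing step is asking.
    static final class Query {
        final List<Aggregator> aggregators;

        Query(List<Aggregator> aggregators) {
            this.aggregators = aggregators;
        }

        // The caller supplies the finalization mode; the query cannot infer it.
        List<String> resultRowSignature(Finalization finalization) {
            List<String> columns = new ArrayList<>();
            for (Aggregator agg : aggregators) {
                String type = finalization == Finalization.YES
                    ? agg.finalizedType
                    : agg.intermediateType;
                columns.add(agg.name + ":" + type);
            }
            return columns;
        }
    }

    // Assumed: a partially aggregating step keeps intermediate values.
    static List<String> preShuffleSignature(Query query) {
        return query.resultRowSignature(Finalization.NO);
    }

    // Assumed: the fully aggregating step emits finalized values.
    static List<String> postShuffleSignature(Query query) {
        return query.resultRowSignature(Finalization.YES);
    }

    public static void main(String[] args) {
        Query query = new Query(Arrays.asList(new Aggregator("a0", "COMPLEX", "STRING")));
        System.out.println(preShuffleSignature(query));  // prints [a0:COMPLEX]
        System.out.println(postShuffleSignature(query)); // prints [a0:STRING]
    }
}
```

Both helpers receive the same `Query` object yet produce different signatures, which is why the mode is passed in as a parameter here instead of being computed inside the query.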

@@ -535,4 +532,16 @@ private Function<Result<TimeseriesResultValue>, Result<TimeseriesResultValue>> m
);
};
}

private RowSignature resultSignature(final TimeseriesQuery query, final RowSignature.Finalization finalization)
Member

Can this method be moved to TimeseriesQuery#getResultSignature (like for GroupByQuery)
or TimeseriesQuery#getRowSignature (like for ScanQuery)?

@@ -558,7 +558,16 @@ public Optional<Sequence<FrameSignaturePair>> resultsAsFrames(
boolean useNestedForUnknownTypes
)
{
-final RowSignature rowSignature = resultArraySignature(query);
+final RowSignature rowSignature =
Member

Similarly to Timeseries: shouldn't this be TopNQuery#getResultRowSignature?
Please also update resultArraySignature to use that method so we are not duplicating logic.

@LakshSingla
Contributor Author

Thanks for the review @kgyrtkirk.

@LakshSingla LakshSingla merged commit 11bb409 into apache:master Jul 23, 2024
83 of 88 checks passed
@LakshSingla LakshSingla deleted the groupby-subquery branch July 23, 2024 06:22
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
…pache#16703)

@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024