-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding new config for disabling group by on multiValue column #12253
Adding new config for disabling group by on multiValue column #12253
Conversation
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryConfig.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryEngine.java
Outdated
Show resolved
Hide resolved
query.getDimensions()); | ||
if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true)) | ||
&& !allSingleValueDims) { | ||
throw new ISE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should name the dimension so this error message is more actionable. People are going to need to know which dimension is causing the problem, so they can either remove it, set it to array mode, or process it in some other way.
How about:
- Change hasNoExplodingDimensions to findExplodingDimensions (i.e., return the names of the exploding dimensions).
- Change the error message to:
Encountered multi-value dimensions [%s] that cannot be processed with %s set to false. Consider changing these dimensions to arrays or setting %s to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acked. Agreed such messaging would be better for the users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @gianm . Please have a look at findAllProbableExplodingDimensions
query.getDimensions()); | ||
if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true)) | ||
&& !allSingleValueDims) { | ||
throw new ISE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acked. Agreed such messaging would be better for the users
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryEngine.java
Outdated
Show resolved
Hide resolved
## Disable GroupBy on multivalue columns | ||
|
||
As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting | ||
`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should leave this out of the docs until we the array-based story is complete. Then we can document this, array types, array functions, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we can keep this documented but change the error message to not mention array-based dimensions. In that case, we can change the error message to this:
Encountered multi-value dimension [%s] that cannot be processed with %s set to false. Consider setting %s to true for unnesting behavior, or using an expression to create a scalar from the multi-value dimension.
For the docs, a couple style points:
- The rest of this page uses second person ("you can…") rather than third ("users can…") so we should stick to that.
- We usually use US spelling in documentation (e.g. behavior instead of behaviour).
So I'd go with:
You can disable the implicit unnesting behavior for groupBy by setting
groupByEnableMultiValueUnnesting: false
in your query context. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included.
Also, all documented groupBy parameters should be included in the groupbyquery.md document as well, under "GroupBy v2 configurations". So if you mention this here it should be mentioned in the main doc too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest draft I have kept the documentation and changed the error messaging a bit.
Update the "GroupBy v2 configurations" as well.
*/ | ||
public static boolean hasNoExplodingDimensions( | ||
public static List<String> findAllProbableExplodingDimensions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, looking through this I just realized that this function can't tell if dimensions are definitely multi-valued or not. It can only tell if they might be multi-valued. (But they also might not be!)
So I think we should move the check; instead of checking here, we should check row by row as we actually run the query. If a multi-value row is ever encountered, then at that point we should throw the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch. Updated the check while doing row by row iteration.
@@ -764,6 +783,9 @@ protected void aggregateMultiValueDims(Grouper<ByteBuffer> grouper) | |||
); | |||
|
|||
if (doAggregate) { | |||
// this check is done during the row aggregation as a dimension can become multi-value col if | |||
// {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: javadoc links don't work here, maybe just say "column capabilities", also "unkown" -> "unknown"
if (multiValuesSize == 0) { | ||
if (!grouper.aggregate(GroupByColumnSelectorStrategy.GROUP_BY_MISSING_VALUE).isOk()) { | ||
return; | ||
} | ||
} else { | ||
if (multiValuesSize > 1) { | ||
// this check is done during the row aggregation as a dimension can become multi-value col if | ||
// {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same nit and typo
Looks good after the changes. Thanks! |
Description
As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config
groupByEnableMultiValueUnnesting
which can be set in the query context.The default value of
groupByEnableMultiValueUnnesting
istrue
, therefore it does not change the current engine behavior.If
groupByEnableMultiValueUnnesting
is set tofalse
, the query will fail if it encounters a multi-value column in the grouping key.Key changed/added classes in this PR
GroupByQueryConfig
GroupByQueryEngineV2
GroupByQueryEngine
This PR has: