Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add grouping_id function #10518

Merged
merged 8 commits into from
Dec 7, 2020
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 23 additions & 0 deletions docs/querying/aggregations.md
Original file line number Diff line number Diff line change
Expand Up @@ -426,3 +426,26 @@ This makes it possible to compute the results of a filtered and an unfiltered ag
"aggregator" : <aggregation>
}
```

### Grouping Aggregator

A grouping aggregator can only be used as part of GroupBy queries which have a subtotal spec. It returns a number for
each output row that lets you infer whether a particular dimension is included in the sub-grouping used for that row. You can pass
a *non-empty* list of dimensions to this aggregator which *must* be a subset of dimensions that you are grouping on.
E.g if the aggregator has `["dim1", "dim2"]` as input dimensions and `[["dim1", "dim2"], ["dim1"], ["dim2"], []]` as subtotals,
following can be the possible output of the aggregator

| subtotal used in query | Output | (bits representation) |
|------------------------|--------|-----------------------|
| `["dim1", "dim2"]` | 0 | (00) |
| `["dim1"]` | 1 | (01) |
| `["dim2"]` | 2 | (10) |
| `[]` | 3 | (11) |

As illustrated in above example, output number can be thought of as an unsigned n bit number where n is the number of dimensions passed to the aggregator.
The bit at position X is set in this number to 0 if a dimension at position X in input to aggregators is included in the sub-grouping. Otherwise, this bit
is set to 1.

```json
{ "type" : "grouping", "name" : <output_name>, "groupings" : [<dimension>] }
```
4 changes: 3 additions & 1 deletion docs/querying/groupbyquery.md
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,9 @@ The response for the query above would look something like:
]
```

> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0. In release 0.17.0 and earlier, such dimensions were entirely excluded from the result.
> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0.
> In release 0.17.0 and earlier, such dimensions were entirely excluded from the result. If you were relying on this old behavior to determine whether a particular dimension was not part of
> a subtotal grouping, you can now use [Grouping aggregator](aggregations.md#Grouping Aggregator) instead.


## Implementation details
Expand Down
4 changes: 3 additions & 1 deletion docs/querying/sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,8 @@ total. Finally, GROUP BY CUBE computes a grouping set for each combination of gr
`GROUP BY CUBE (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), (city), () )`.
Grouping columns that do not apply to a particular row will contain `NULL`. For example, when computing
`GROUP BY GROUPING SETS ( (country, city), () )`, the grand total row corresponding to `()` will have `NULL` for the
"country" and "city" columns.
"country" and "city" columns. Column may also be `NULL` if it was `NULL` in the data itself. To differentiate such rows
, you can use `GROUPING` aggregation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhishekagarwal87 Extraneous whitespace before the , might interfere with rendering. (I'm not sure, but it looks suspicious.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-worked this a bit in #10654


When using GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE, be aware that results may not be generated in the
order that you specify your grouping sets in the query. If you need results to be generated in a particular order, use
Expand Down Expand Up @@ -337,6 +338,7 @@ Only the COUNT aggregation can accept DISTINCT.
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#Grouping Aggregator) on how to infer this number.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhishekagarwal87 I don't think this anchor will work (our doc generators don't include spaces in anchors). Could you please double-check it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Even github doesn't render it correctly though intellij does it. Fixed in PR
#10654


For advice on choosing approximate aggregation functions, check out our [approximate aggregations documentation](aggregations.html#approx).

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
import org.apache.druid.query.aggregation.FloatMaxAggregatorFactory;
import org.apache.druid.query.aggregation.FloatMinAggregatorFactory;
import org.apache.druid.query.aggregation.FloatSumAggregatorFactory;
import org.apache.druid.query.aggregation.GroupingAggregatorFactory;
import org.apache.druid.query.aggregation.HistogramAggregatorFactory;
import org.apache.druid.query.aggregation.JavaScriptAggregatorFactory;
import org.apache.druid.query.aggregation.LongMaxAggregatorFactory;
Expand Down Expand Up @@ -118,7 +119,8 @@ public AggregatorsModule()
@JsonSubTypes.Type(name = "longAny", value = LongAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "floatAny", value = FloatAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "doubleAny", value = DoubleAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class)
@JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "grouping", value = GroupingAggregatorFactory.class)
})
public interface AggregatorFactoryMixin
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,10 @@ public class AggregatorUtil
public static final byte FLOAT_ANY_CACHE_TYPE_ID = 0x44;
public static final byte STRING_ANY_CACHE_TYPE_ID = 0x45;

// GROUPING aggregator
public static final byte GROUPING_CACHE_TYPE_ID = 0x46;


/**
* returns the list of dependent postAggregators that should be calculated in order to calculate given postAgg
*
Expand Down
Loading