Add grouping_id function (#10518)
* First draft of grouping_id function

* Add more tests and documentation

* Add calcite tests

* Fix travis failures

* bit of a change

* Add documentation

* Fix typos

* typo fix
abhishekagarwal87 authored Dec 7, 2020
1 parent b681861 commit 26d74b3
Showing 19 changed files with 1,248 additions and 27 deletions.
23 changes: 23 additions & 0 deletions docs/querying/aggregations.md
@@ -426,3 +426,26 @@ This makes it possible to compute the results of a filtered and an unfiltered ag
"aggregator" : <aggregation>
}
```

### Grouping Aggregator

A grouping aggregator can only be used as part of GroupBy queries that have a subtotal spec. It returns a number for
each output row that lets you infer whether a particular dimension is included in the sub-grouping used for that row. You can pass
a *non-empty* list of dimensions to this aggregator, which *must* be a subset of the dimensions that you are grouping on.
For example, if the aggregator has `["dim1", "dim2"]` as input dimensions and `[["dim1", "dim2"], ["dim1"], ["dim2"], []]` as subtotals,
the possible outputs of the aggregator are:

| subtotal used in query | Output | (bits representation) |
|------------------------|--------|-----------------------|
| `["dim1", "dim2"]` | 0 | (00) |
| `["dim1"]` | 1 | (01) |
| `["dim2"]` | 2 | (10) |
| `[]` | 3 | (11) |

As illustrated in the example above, the output number can be thought of as an unsigned n-bit number, where n is the number of dimensions passed to the aggregator.
The bit at position X is set to 0 if the dimension at position X in the aggregator's input is included in the sub-grouping. Otherwise, this bit
is set to 1. Positions are counted from the most significant bit, so in the example `dim1` maps to the left bit and `dim2` to the right bit.

```json
{ "type" : "grouping", "name" : <output_name>, "groupings" : [<dimension>] }
```
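
The bit layout described above can be sketched in a few lines of Java. This is only an illustration of the documented encoding, not the code in `GroupingAggregatorFactory`; the class `GroupingIdSketch` and the method `groupingId` are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the documented bit layout; not Druid's implementation.
public class GroupingIdSketch
{
  /**
   * Builds the grouping number for one subtotal. The first dimension in
   * `groupings` ends up in the most significant bit. A bit is 0 when the
   * dimension is part of the subtotal and 1 when it is not.
   */
  public static long groupingId(List<String> groupings, List<String> subtotal)
  {
    Set<String> included = new HashSet<>(subtotal);
    long id = 0;
    for (String dimension : groupings) {
      id <<= 1;                       // make room for this dimension's bit
      if (!included.contains(dimension)) {
        id |= 1;                      // dimension is excluded from the sub-grouping
      }
    }
    return id;
  }

  public static void main(String[] args)
  {
    List<String> groupings = Arrays.asList("dim1", "dim2");
    System.out.println(groupingId(groupings, Arrays.asList("dim1", "dim2"))); // 0
    System.out.println(groupingId(groupings, Arrays.asList("dim1")));         // 1
    System.out.println(groupingId(groupings, Arrays.asList("dim2")));         // 2
    System.out.println(groupingId(groupings, Arrays.<String>asList()));       // 3
  }
}
```

The four printed values match the rows of the table above.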
4 changes: 3 additions & 1 deletion docs/querying/groupbyquery.md
@@ -226,7 +226,9 @@ The response for the query above would look something like:
]
```

> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0. In release 0.17.0 and earlier, such dimensions were entirely excluded from the result.
> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0.
> In release 0.17.0 and earlier, such dimensions were entirely excluded from the result. If you were relying on this old behavior to determine whether a particular dimension was not part of
> a subtotal grouping, you can now use the [grouping aggregator](aggregations.md#grouping-aggregator) instead.

## Implementation details
4 changes: 3 additions & 1 deletion docs/querying/sql.md
@@ -99,7 +99,8 @@ total. Finally, GROUP BY CUBE computes a grouping set for each combination of gr
`GROUP BY CUBE (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), (city), () )`.
Grouping columns that do not apply to a particular row will contain `NULL`. For example, when computing
`GROUP BY GROUPING SETS ( (country, city), () )`, the grand total row corresponding to `()` will have `NULL` for the
"country" and "city" columns.
"country" and "city" columns. A column may also be `NULL` if it was `NULL` in the data itself. To differentiate such rows,
you can use the `GROUPING` aggregation.
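
As a rough sketch of how a client might read the `GROUPING` value to tell a rolled-up `NULL` from a `NULL` that was present in the data, the Java below tests a single bit. It assumes the bit layout documented for the grouping aggregator, where the first argument maps to the most significant bit and a bit of 1 means the column is not part of the grouping set. `GroupingBits` and `isRolledUp` are hypothetical names, not part of Druid.

```java
// Hypothetical client-side helper; assumes the documented bit layout of GROUPING.
public class GroupingBits
{
  /**
   * Returns true when the argument at `position` (0-based, left to right in
   * the GROUPING(...) call) was rolled up in the row, meaning a NULL in that
   * column comes from the grouping set rather than from the data.
   */
  public static boolean isRolledUp(long groupingValue, int argumentCount, int position)
  {
    if (position < 0 || position >= argumentCount) {
      throw new IllegalArgumentException("position out of range: " + position);
    }
    int shift = argumentCount - 1 - position; // first argument is the highest bit
    return ((groupingValue >> shift) & 1L) == 1L;
  }

  public static void main(String[] args)
  {
    // GROUPING(country, city) = 1 (binary 01): country is grouped, city is rolled up.
    System.out.println(isRolledUp(1L, 2, 0)); // false
    System.out.println(isRolledUp(1L, 2, 1)); // true
  }
}
```

With this check, a row where "city" is `NULL` and `isRolledUp` returns false for "city" holds a `NULL` that came from the data itself.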

When using GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE, be aware that results may not be generated in the
order that you specify your grouping sets in the query. If you need results to be generated in a particular order, use
@@ -337,6 +338,7 @@ Only the COUNT aggregation can accept DISTINCT.
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`GROUPING(expr, expr...)`|Returns a number that indicates which groupBy dimensions are included in a row when using `GROUPING SETS`. Refer to the [grouping aggregator documentation](aggregations.md#grouping-aggregator) for how to interpret this number.|

For advice on choosing approximate aggregation functions, check out our [approximate aggregations documentation](aggregations.html#approx).

Original file line number Diff line number Diff line change
@@ -31,6 +31,7 @@
import org.apache.druid.query.aggregation.FloatMaxAggregatorFactory;
import org.apache.druid.query.aggregation.FloatMinAggregatorFactory;
import org.apache.druid.query.aggregation.FloatSumAggregatorFactory;
import org.apache.druid.query.aggregation.GroupingAggregatorFactory;
import org.apache.druid.query.aggregation.HistogramAggregatorFactory;
import org.apache.druid.query.aggregation.JavaScriptAggregatorFactory;
import org.apache.druid.query.aggregation.LongMaxAggregatorFactory;
@@ -118,7 +119,8 @@ public AggregatorsModule()
@JsonSubTypes.Type(name = "longAny", value = LongAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "floatAny", value = FloatAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "doubleAny", value = DoubleAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class)
@JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class),
@JsonSubTypes.Type(name = "grouping", value = GroupingAggregatorFactory.class)
})
public interface AggregatorFactoryMixin
{
Original file line number Diff line number Diff line change
@@ -134,6 +134,10 @@ public class AggregatorUtil
public static final byte FLOAT_ANY_CACHE_TYPE_ID = 0x44;
public static final byte STRING_ANY_CACHE_TYPE_ID = 0x45;

// GROUPING aggregator
public static final byte GROUPING_CACHE_TYPE_ID = 0x46;


/**
* returns the list of dependent postAggregators that should be calculated in order to calculate given postAgg
*