Add grouping_id function (#10518)

* First draft of grouping_id function
* Add more tests and documentation
* Add calcite tests
* Fix travis failures
* bit of a change
* Add documentation
* Fix typos
* typo fix
apache · Dec 7, 2020 · 26d74b3
1 parent b681861
commit 26d74b3

Showing 19 changed files with 1,248 additions and 27 deletions.
diff --git a/docs/querying/aggregations.md b/docs/querying/aggregations.md
@@ -426,3 +426,26 @@ This makes it possible to compute the results of a filtered and an unfiltered ag
   "aggregator" : <aggregation>
 }
 ```
+
+### Grouping Aggregator
+
+A grouping aggregator can only be used in GroupBy queries that have a `subtotalsSpec`. For each output row, it returns
+a number that lets you infer whether a particular dimension is included in the sub-grouping used for that row. You must
+pass this aggregator a *non-empty* list of dimensions, which *must* be a subset of the dimensions you are grouping on.
+For example, if the aggregator has `["dim1", "dim2"]` as input dimensions and the subtotals are
+`[["dim1", "dim2"], ["dim1"], ["dim2"], []]`, the aggregator produces the following output:
+
+| subtotal used in query | Output | (bits representation) |
+|------------------------|--------|-----------------------|
+| `["dim1", "dim2"]`       | 0      | (00)                  |
+| `["dim1"]`               | 1      | (01)                  |
+| `["dim2"]`               | 2      | (10)                  |
+| `[]`                     | 3      | (11)                  |  
+
+As the example above illustrates, the output can be read as an unsigned n-bit number, where n is the number of dimensions passed to the aggregator.
+The first dimension in the list maps to the most significant bit. The bit for a dimension is 0 if that dimension is included in the sub-grouping;
+otherwise, the bit is 1.
+
+```json
+{ "type" : "grouping", "name" : <output_name>, "groupings" : [<dimension>] }
+```
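The bit rule documented above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the `GroupingAggregatorFactory` added in this commit: the class `GroupingIdSketch`, the method `groupingId`, and the size check are hypothetical names and choices made for the example. It reproduces the table from the documentation hunk.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not the class added in this commit.
final class GroupingIdSketch
{
  private GroupingIdSketch() {}

  /**
   * @param groupings dimensions passed to the aggregator, in order (must be non-empty)
   * @param subtotal  dimensions of the sub-grouping that produced the row
   * @return a number with one bit per entry of groupings: 0 if included, 1 if not
   */
  static long groupingId(List<String> groupings, Set<String> subtotal)
  {
    if (groupings.isEmpty()) {
      throw new IllegalArgumentException("groupings must be non-empty");
    }
    // Assumption of this sketch: keep the result within the non-negative range of a long.
    if (groupings.size() >= Long.SIZE) {
      throw new IllegalArgumentException("too many dimensions for a long");
    }
    long id = 0L;
    for (String dimension : groupings) {
      // Shift first, so the first dimension ends up in the most significant bit.
      id <<= 1;
      if (!subtotal.contains(dimension)) {
        id |= 1L;
      }
    }
    return id;
  }

  public static void main(String[] args)
  {
    List<String> groupings = Arrays.asList("dim1", "dim2");
    System.out.println(groupingId(groupings, new HashSet<>(Arrays.asList("dim1", "dim2")))); // 0
    System.out.println(groupingId(groupings, Collections.singleton("dim1")));                // 1
    System.out.println(groupingId(groupings, Collections.singleton("dim2")));                // 2
    System.out.println(groupingId(groupings, Collections.<String>emptySet()));               // 3
  }
}
```

Shifting before setting the bit is what places the first dimension in the leftmost position, which matches the `(01)` and `(10)` rows of the table.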
diff --git a/docs/querying/groupbyquery.md b/docs/querying/groupbyquery.md
@@ -226,7 +226,9 @@ The response for the query above would look something like:
 ]
 ```
 
-> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0. In release 0.17.0 and earlier, such dimensions were entirely excluded from the result.   
+> Notice that dimensions that are not included in an individual subtotalsSpec grouping are returned with a `null` value. This response format represents a behavior change as of Apache Druid 0.18.0. 
+> In release 0.17.0 and earlier, such dimensions were entirely excluded from the result. If you were relying on this old behavior to determine whether a particular dimension was not part of
+> a subtotal grouping, you can now use the [grouping aggregator](aggregations.md#grouping-aggregator) instead.
 
 
 ## Implementation details

diff --git a/docs/querying/sql.md b/docs/querying/sql.md
@@ -99,7 +99,8 @@ total. Finally, GROUP BY CUBE computes a grouping set for each combination of gr
 `GROUP BY CUBE (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), (city), () )`.
 Grouping columns that do not apply to a particular row will contain `NULL`. For example, when computing
 `GROUP BY GROUPING SETS ( (country, city), () )`, the grand total row corresponding to `()` will have `NULL` for the
-"country" and "city" columns.
+"country" and "city" columns. A column may also be `NULL` if it was `NULL` in the data itself. To differentiate such
+rows, you can use the `GROUPING` aggregation.
 
 When using GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE, be aware that results may not be generated in the
 order that you specify your grouping sets in the query. If you need results to be generated in a particular order, use
@@ -337,6 +338,7 @@ Only the COUNT aggregation can accept DISTINCT.
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
 |`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|
 |`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
+|`GROUPING(expr, expr...)`|Returns a number that indicates which GROUP BY dimensions are included in a row when using `GROUPING SETS`. Refer to the [grouping aggregator documentation](aggregations.md#grouping-aggregator) for how to interpret this number.|
 
 For advice on choosing approximate aggregation functions, check out our [approximate aggregations documentation](aggregations.html#approx).
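
On the client side, a `GROUPING` value can be decoded to tell a `NULL` that comes from the data apart from a `NULL` produced by a grouping set that omits the column. The sketch below assumes the SQL function follows the same bit layout as the native grouping aggregator documented in `aggregations.md` (first argument in the most significant bit, 1 meaning "not included"). `GroupingDecodeSketch` and `isRolledUp` are hypothetical names for this example.

```java
// Hypothetical decoder for illustration only; not part of this commit.
final class GroupingDecodeSketch
{
  private GroupingDecodeSketch() {}

  /**
   * @param groupingValue  value returned by GROUPING(...) for a row
   * @param dimensionCount number of arguments passed to GROUPING(...)
   * @param position       zero-based index of the argument to test
   * @return true if that column is NOT part of the grouping set for the row,
   *         so a NULL in it was introduced by the grouping set rather than the data
   */
  static boolean isRolledUp(long groupingValue, int dimensionCount, int position)
  {
    if (position < 0 || position >= dimensionCount) {
      throw new IllegalArgumentException("position out of range");
    }
    // Argument 0 sits in the most significant of the dimensionCount bits.
    int shift = dimensionCount - 1 - position;
    return ((groupingValue >>> shift) & 1L) == 1L;
  }

  public static void main(String[] args)
  {
    // GROUPING(country, city) = 1: country is grouped on, city is not.
    System.out.println(isRolledUp(1L, 2, 0)); // false
    System.out.println(isRolledUp(1L, 2, 1)); // true
  }
}
```

With `GROUPING(country, city)`, a value of 3 marks the grand total row for `()`, while a row with value 0 and a `NULL` city carries a `NULL` that was present in the data.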
 

diff --git a/processing/src/main/java/org/apache/druid/jackson/AggregatorsModule.java b/processing/src/main/java/org/apache/druid/jackson/AggregatorsModule.java
@@ -31,6 +31,7 @@
 import org.apache.druid.query.aggregation.FloatMaxAggregatorFactory;
 import org.apache.druid.query.aggregation.FloatMinAggregatorFactory;
 import org.apache.druid.query.aggregation.FloatSumAggregatorFactory;
+import org.apache.druid.query.aggregation.GroupingAggregatorFactory;
 import org.apache.druid.query.aggregation.HistogramAggregatorFactory;
 import org.apache.druid.query.aggregation.JavaScriptAggregatorFactory;
 import org.apache.druid.query.aggregation.LongMaxAggregatorFactory;
@@ -118,7 +119,8 @@ public AggregatorsModule()
       @JsonSubTypes.Type(name = "longAny", value = LongAnyAggregatorFactory.class),
       @JsonSubTypes.Type(name = "floatAny", value = FloatAnyAggregatorFactory.class),
       @JsonSubTypes.Type(name = "doubleAny", value = DoubleAnyAggregatorFactory.class),
-      @JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class)
+      @JsonSubTypes.Type(name = "stringAny", value = StringAnyAggregatorFactory.class),
+      @JsonSubTypes.Type(name = "grouping", value = GroupingAggregatorFactory.class)
   })
   public interface AggregatorFactoryMixin
   {

diff --git a/processing/src/main/java/org/apache/druid/query/aggregation/AggregatorUtil.java b/processing/src/main/java/org/apache/druid/query/aggregation/AggregatorUtil.java
@@ -134,6 +134,10 @@ public class AggregatorUtil
   public static final byte FLOAT_ANY_CACHE_TYPE_ID = 0x44;
   public static final byte STRING_ANY_CACHE_TYPE_ID = 0x45;
 
+  // GROUPING aggregator
+  public static final byte GROUPING_CACHE_TYPE_ID = 0x46;
+
+
   /**
    * returns the list of dependent postAggregators that should be calculated in order to calculate given postAgg
    *