Merge pull request #4 from yahoo/group-all-operations

Added support for GROUP all operations MAX, MIN, SUM and AVG
bullet-db · Dec 22, 2016 · e00c68c
2 parents 1a2c6ab + b416b26
commit e00c68c

Showing 12 changed files with 766 additions and 184 deletions.
diff --git a/README.md b/README.md
@@ -151,14 +151,14 @@ for raw records, use projections to help reduce the load on the system and netwo
 Aggregations allow you to perform some operation on the collected records. They take an optional size to restrict
 the size of the aggregation (this applies for aggregations high cardinality aggregations and raw records).
 
-The current aggregations supported are:
+The current aggregation types that are supported are:
 
 | Aggregation | Meaning |
 | ----------- | ------- |
-| COUNT       | The resulting output would be a single record that contains the number of records seen that matched the filters. |
+| GROUP       | The resulting output would be a record containing the result of an operation for each unique group in the specified fields (only supported with no fields at this time, which groups all records) |
 | LIMIT       | The resulting output would be at most the number specified in size. |
 
-We currently only support COUNT if there is no fields being grouped on. In other words, you get a count of the number of records that matched your filters.
+We currently only support GROUP operations if there are no fields being grouped on. In other words, you get the results of the operation on all records that matched your filters.
 
 The current format for an aggregation is (**note see above for what is supported at the moment**):
 
@@ -183,48 +183,16 @@ The current format for an aggregation is (**note see above for what is supported
 
 You can also use LIMIT as an alias for RAW. DISTINCT is also an alias for GROUP. These exist to make some queries read a bit better.
 
-See the [examples section](#examples) below to see how to perform the aggregations supported at the moment, as well as the attributes for them.
+Currently we support GROUP aggregations on the following operations:
 
-#### Coming Soon
-
-It is often intractable to perform aggregations on an unbounded stream of data and support arbitrary queries. However, it is possible
-if an exact answer is not required as long as the error is quantifiable. There are stochastic algorithms and data structures that let us
-support these aggregations. We will be using [Data Sketches](https://datasketches.github.io/) to solve aggregations such as counting
-uniques, getting distributions, approximating top k etc. Sketches let us be exact in our computation up to configured thresholds
-and approximate after. The error is very controllable and mathematically provable. This lets us address otherwise hard to solve problems in
-sublinear space. We will also use Sketches as a way to control high cardinality grouping (group by a natural key column or related) and rely on
-the Sketching data structure to drop excess groups. It is up to the user launching Bullet to determine to set Sketch sizes large or
-small enough for to satisfy the queries that will be performed on that instance of Bullet.
-
-Using Sketches, we are working on other aggregations including but not limited to:
-
-| Aggregation    | Meaning |
+| Operation      | Meaning |
 | -------------- | ------- |
 | SUM            | Computes the sum of the elements in the group |
 | MIN            | Returns the minimum of the elements in the group |
 | MAX            | Returns the maximum of the elements in the group |
 | AVG            | Computes the average of the elements in the group |
-| COUNT DISTINCT | Computes the number of distinct elements in the column |
-| TOP K          | Returns the top K most freqently appearing values in the column |
-| DISTRIBUTION   | Computes distributions of the elements in the column |
-
-The following attributes are planned to be supported for the different aggregations:
-
-Attributes for TOP K:
-
-```javascript
-    "attributes": {
-        "k": 15,
-    }
-```
 
-Attributes for COUNT DISTINCT:
-
-```javascript
-    "attributes": {
-        "newName": "the name of the resulting count column"
-    }
-```
+The following attributes are supported for GROUP:
 
 Attributes for GROUP:
 ```javascript
@@ -258,6 +226,46 @@ Attributes for GROUP:
     }
 ```
 
+See the [examples section](#examples) for a detailed description of how to perform these aggregations.
+
+#### Coming Soon
+
+It is often intractable to perform aggregations on an unbounded stream of data and support arbitrary queries. However, it is possible
+if an exact answer is not required as long as the error is quantifiable. There are stochastic algorithms and data structures that let us
+support these aggregations. We will be using [Data Sketches](https://datasketches.github.io/) to solve aggregations such as counting
+uniques, getting distributions, approximating top k etc. Sketches let us be exact in our computation up to configured thresholds
+and approximate after. The error is very controllable and mathematically provable. This lets us address otherwise hard to solve problems in
+sublinear space. We will also use Sketches as a way to control high cardinality grouping (group by a natural key column or related) and rely on
+the Sketching data structure to drop excess groups. It is up to the user launching Bullet to set Sketch sizes large or
+small enough to satisfy the queries that will be performed on that instance of Bullet.
+
+Using Sketches, we are working on other aggregations including but not limited to:
+
+| Aggregation    | Meaning |
+| -------------- | ------- |
+| GROUP          | We currently support GROUP with no fields (group all); grouping on specific fields will be supported soon |
+| COUNT DISTINCT | Computes the number of distinct elements in the column |
+| TOP K          | Returns the top K most frequently appearing values in the column |
+| DISTRIBUTION   | Computes distributions of the elements in the column |
+
+The following attributes are planned to be supported for the different aggregations:
+
+Attributes for TOP K:
+
+```javascript
+    "attributes": {
+        "k": 15,
+    }
+```
+
+Attributes for COUNT DISTINCT:
+
+```javascript
+    "attributes": {
+        "newName": "the name of the resulting count column"
+    }
+```
+
 The attributes for the DISTRIBUTION aggregation haven't been decided yet.
 
 ### Constraints
@@ -584,6 +592,71 @@ A sample result would look like:
 
 This result indicates that 363,201 records were counted with demographics.age > 65 during the 20 seconds the query was running.
 
+COUNT is the only GROUP operation for which you can omit a "field". A more complicated example with multiple group operations would look like:
+
+```javascript
+{
+   "filters":[
+      {
+         "field": "demographics.state",
+         "operation": "==",
+         "values": ["california"]
+      }
+   ],
+   "aggregation":{
+      "type": "GROUP",
+      "attributes": {
+         "operations": [
+            {
+               "type": "COUNT",
+               "newName": "numCalifornians"
+            },
+            {
+               "type": "AVG",
+               "field": "demographics.age",
+               "newName": "avgAge"
+            },
+            {
+               "type": "MIN",
+               "field": "demographics.age",
+               "newName": "minAge"
+            },
+            {
+               "type": "MAX",
+               "field": "demographics.age",
+               "newName": "maxAge"
+            }
+         ]
+      }
+   },
+   "duration": 20000
+}
+```
+
+A sample result would look like:
+
+```javascript
+{
+    "records": [
+        {
+            "maxAge": 94.0,
+            "numCalifornians": 188451,
+            "minAge": 6.0,
+            "avgAge": 33.71828
+        }
+    ],
+    "meta": {
+        "rule_id": 8051040987827161000,
+        "rule_body": "<RULE BODY HERE>",
+        "rule_finish_time": 1482371927435,
+        "rule_receive_time": 1482371916625
+    }
+}
+```
+
+This result indicates that, among the records observed during the 20 seconds this query ran, there were 188,451 users with demographics.state == "california". Among these users the average age was 33.71828, the max
+age observed was 94, and the minimum age observed was 6.
+
 ## Configuration
 
 Bullet is configured at run-time using settings defined in a file. Settings not overridden will default to the values in [src/main/resources/bullet_defaults.yaml](src/main/resources/bullet_defaults.yaml).

diff --git a/pom.xml b/pom.xml
@@ -38,7 +38,7 @@
         <maven.compiler.target>1.8</maven.compiler.target>
         <storm.version>1.0.2</storm.version>
         <storm.metrics.version>1.0.2</storm.metrics.version>
-        <bullet.record.version>0.0.2</bullet.record.version>
+        <bullet.record.version>0.0.3</bullet.record.version>
     </properties>
 
     <dependencies>
@@ -145,6 +145,7 @@
                 <artifactId>maven-clover2-plugin</artifactId>
                 <version>4.0.5</version>
                 <configuration>
+                    <licenseLocation>${user.home}/clover.license</licenseLocation>
                     <excludes>
                         <exclude>**/Client.java</exclude>
                         <exclude>**/Topology.java</exclude>

diff --git a/src/main/java/com/yahoo/bullet/operations/AggregationOperations.java b/src/main/java/com/yahoo/bullet/operations/AggregationOperations.java
@@ -17,6 +17,7 @@
 import java.util.Map;
 import java.util.Objects;
 import java.util.Set;
+import java.util.function.BiFunction;
 
 public class AggregationOperations {
     public enum AggregationType {
@@ -41,7 +42,9 @@ public enum GroupOperationType {
         SUM("SUM"),
         MIN("MIN"),
         MAX("MAX"),
-        AVG("AVG");
+        AVG("AVG"),
+        // COUNT_FIELD operation is only used internally in conjunction with AVG and won't be returned.
+        COUNT_FIELD("COUNT_FIELD");
 
         private String name;
 
@@ -60,6 +63,15 @@ public boolean isMe(String name) {
         }
     }
 
+    public interface AggregationOperator extends BiFunction<Number, Number, Number> {
+    }
+
+    // If either argument is null, a NullPointerException will be thrown.
+    public static final AggregationOperator MIN = (x, y) -> x.doubleValue() <  y.doubleValue() ? x : y;
+    public static final AggregationOperator MAX = (x, y) -> x.doubleValue() >  y.doubleValue() ? x : y;
+    public static final AggregationOperator SUM = (x, y) -> x.doubleValue() + y.doubleValue();
+    public static final AggregationOperator COUNT = (x, y) -> x.longValue() + y.longValue();
+
     /**
      * Checks to see if a {@link Map} contains items.
      *

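Note on the `AggregationOperator` hunk above: the following is a minimal Java sketch of how these pairwise operators might be folded over a sequence of values, and how AVG could be derived from a running SUM and a running count, which appears to be the role of the internal `COUNT_FIELD` operation. The class name `GroupOperatorsSketch` and the helpers `fold` and `average` are hypothetical and are not part of this commit; only the interface and the four operator lambdas are taken from the diff.

```java
import java.util.function.BiFunction;

public class GroupOperatorsSketch {
    // Same shape as the interface added in AggregationOperations.java.
    public interface AggregationOperator extends BiFunction<Number, Number, Number> {
    }

    // Copied from the diff. As noted there, a null argument throws a NullPointerException.
    public static final AggregationOperator MIN = (x, y) -> x.doubleValue() < y.doubleValue() ? x : y;
    public static final AggregationOperator MAX = (x, y) -> x.doubleValue() > y.doubleValue() ? x : y;
    public static final AggregationOperator SUM = (x, y) -> x.doubleValue() + y.doubleValue();
    public static final AggregationOperator COUNT = (x, y) -> x.longValue() + y.longValue();

    // Hypothetical helper: applies an operator pairwise across the values.
    // Returns null when no values are given.
    public static Number fold(AggregationOperator operator, Number... values) {
        Number result = null;
        for (Number value : values) {
            result = (result == null) ? value : operator.apply(result, value);
        }
        return result;
    }

    // Hypothetical helper: AVG expressed as a running SUM divided by a running count.
    public static double average(Number... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("average needs at least one value");
        }
        Number sum = fold(SUM, values);
        long count = 0;
        for (Number ignored : values) {
            count = COUNT.apply(count, 1L).longValue();
        }
        return sum.doubleValue() / count;
    }

    public static void main(String[] args) {
        System.out.println("min = " + fold(MIN, 6, 94, 33));
        System.out.println("max = " + fold(MAX, 6, 94, 33));
        System.out.println("sum = " + fold(SUM, 6, 94, 33));
        System.out.println("avg = " + average(10, 20, 30));
    }
}
```

Because SUM and the count are each associative, partial results computed on separate workers could be combined with the same operators before the final division. That is one plausible reason to track the count as its own operation rather than averaging averages.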
diff --git a/src/main/java/com/yahoo/bullet/operations/aggregations/GroupAll.java b/src/main/java/com/yahoo/bullet/operations/aggregations/GroupAll.java
@@ -30,12 +30,12 @@ public GroupAll(Aggregation aggregation) {
 
     @Override
     public void consume(BulletRecord data) {
-        this.data.compute(data);
+        this.data.consume(data);
     }
 
     @Override
     public void combine(byte[] serializedAggregation) {
-        data.merge(serializedAggregation);
+        data.combine(serializedAggregation);
     }
 
     @Override