Merge pull request #4 from yahoo/group-all-operations
Added support for GROUP all operations MAX, MIN, SUM and AVG
akshaisarma authored Dec 22, 2016
2 parents 1a2c6ab + b416b26 commit e00c68c
Showing 12 changed files with 766 additions and 184 deletions.
149 changes: 111 additions & 38 deletions README.md
@@ -151,14 +151,14 @@ for raw records, use projections to help reduce the load on the system and netwo
Aggregations allow you to perform some operation on the collected records. They take an optional size to restrict
the size of the aggregation (this applies to high cardinality aggregations and raw records).

The current aggregations supported are:
The current aggregation types that are supported are:

| Aggregation | Meaning |
| ----------- | ------- |
| COUNT | The resulting output would be a single record that contains the number of records seen that matched the filters. |
| GROUP | The resulting output would be a record containing the result of an operation for each unique group in the specified fields (only supported with no fields at this time, which groups all records) |
| LIMIT | The resulting output would be at most the number specified in size. |

We currently only support COUNT if there are no fields being grouped on. In other words, you get a count of the number of records that matched your filters.
We currently only support GROUP operations if there are no fields being grouped on. In other words, you get the results of the operation on all records that matched your filters.

The current format for an aggregation is (**note see above for what is supported at the moment**):

@@ -183,48 +183,16 @@ The current format for an aggregation is (**note see above for what is supported

You can also use LIMIT as an alias for RAW. DISTINCT is also an alias for GROUP. These exist to make some queries read a bit better.

See the [examples section](#examples) below to see how to perform the aggregations supported at the moment, as well as the attributes for them.
Currently we support GROUP aggregations on the following operations:

#### Coming Soon

It is often intractable to perform aggregations on an unbounded stream of data and support arbitrary queries. However, it is possible
if an exact answer is not required as long as the error is quantifiable. There are stochastic algorithms and data structures that let us
support these aggregations. We will be using [Data Sketches](https://datasketches.github.io/) to solve aggregations such as counting
uniques, getting distributions, approximating top k etc. Sketches let us be exact in our computation up to configured thresholds
and approximate after. The error is very controllable and mathematically provable. This lets us address otherwise hard to solve problems in
sublinear space. We will also use Sketches as a way to control high cardinality grouping (group by a natural key column or related) and rely on
the Sketching data structure to drop excess groups. It is up to the user launching Bullet to set Sketch sizes large or
small enough to satisfy the queries that will be performed on that instance of Bullet.

Using Sketches, we are working on other aggregations including but not limited to:

| Aggregation | Meaning |
| Operation | Meaning |
| -------------- | ------- |
| SUM | Computes the sum of the elements in the group |
| MIN | Returns the minimum of the elements in the group |
| MAX | Returns the maximum of the elements in the group |
| AVG | Computes the average of the elements in the group |
| COUNT DISTINCT | Computes the number of distinct elements in the column |
| TOP K | Returns the top K most frequently appearing values in the column |
| DISTRIBUTION | Computes distributions of the elements in the column |

The following attributes are planned to be supported for the different aggregations:

Attributes for TOP K:

```javascript
"attributes": {
"k": 15
}
```

Attributes for COUNT DISTINCT:

```javascript
"attributes": {
"newName": "the name of the resulting count column"
}
```
The following attributes are supported for GROUP:

Attributes for GROUP:
```javascript
@@ -258,6 +226,46 @@ Attributes for GROUP:
}
```

See the [examples section](#examples) for a detailed description of how to perform these aggregations.

#### Coming Soon

It is often intractable to perform aggregations on an unbounded stream of data and support arbitrary queries. However, it is possible
if an exact answer is not required as long as the error is quantifiable. There are stochastic algorithms and data structures that let us
support these aggregations. We will be using [Data Sketches](https://datasketches.github.io/) to solve aggregations such as counting
uniques, getting distributions, approximating top k etc. Sketches let us be exact in our computation up to configured thresholds
and approximate after. The error is very controllable and mathematically provable. This lets us address otherwise hard to solve problems in
sublinear space. We will also use Sketches as a way to control high cardinality grouping (group by a natural key column or related) and rely on
the Sketching data structure to drop excess groups. It is up to the user launching Bullet to set Sketch sizes large or
small enough to satisfy the queries that will be performed on that instance of Bullet.

Using Sketches, we are working on other aggregations including but not limited to:

| Aggregation | Meaning |
| -------------- | ------- |
| GROUP | We currently support GROUP with no fields (group all); grouping on specific fields will be supported soon |
| COUNT DISTINCT | Computes the number of distinct elements in the column |
| TOP K | Returns the top K most frequently appearing values in the column |
| DISTRIBUTION | Computes distributions of the elements in the column |

The following attributes are planned to be supported for the different aggregations:

Attributes for TOP K:

```javascript
"attributes": {
"k": 15
}
```

Attributes for COUNT DISTINCT:

```javascript
"attributes": {
"newName": "the name of the resulting count column"
}
```

The attributes for the DISTRIBUTION aggregation haven't been decided yet.

### Constraints
@@ -584,6 +592,71 @@ A sample result would look like:

This result indicates that 363,201 records were counted with demographics.age > 65 during the 20 seconds the query was running.

COUNT is the only GROUP operation for which you can omit a "field". A more complicated example with multiple group operations would look like:

```javascript
{
"filters":[
{
"field": "demographics.state",
"operation": "==",
"values": ["california"]
}
],
"aggregation":{
"type": "GROUP",
"attributes": {
"operations": [
{
"type": "COUNT",
"newName": "numCalifornians"
},
{
"type": "AVG",
"field": "demographics.age",
"newName": "avgAge"
},
{
"type": "MIN",
"field": "demographics.age",
"newName": "minAge"
},
{
"type": "MAX",
"field": "demographics.age",
"newName": "maxAge"
}
]
}
},
"duration": 20000
}
```

A sample result would look like:

```javascript
{
"records": [
{
"maxAge": 94.0,
"numCalifornians": 188451,
"minAge": 6.0,
"avgAge": 33.71828
}
],
"meta": {
"rule_id": 8051040987827161000,
"rule_body": "<RULE BODY HERE>}",
"rule_finish_time": 1482371927435,
"rule_receive_time": 1482371916625
}
}
```

This result indicates that, among the records observed during the 20 seconds this query ran, there were 188,451 users with demographics.state == "california". Among these users the average age was 33.71828, the max
age observed was 94, and the minimum age observed was 6.
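
As a rough illustration of how these numbers relate to each other, here is a minimal Java sketch of a "group all" computation. It assumes pairwise operators shaped like the ones this change introduces; the `GroupAllExample` class, its fields and its methods are hypothetical names made up for this example and are not part of Bullet. It keeps a separate count of records that actually had the field, so that AVG can be derived from a sum and a count when the result is read.

```java
import java.util.function.BinaryOperator;

// Hypothetical illustration only: GroupAllExample and its methods are not Bullet classes.
class GroupAllExample {
    // Pairwise operators mirroring the MIN, MAX, SUM and COUNT operations.
    static final BinaryOperator<Number> MIN = (x, y) -> x.doubleValue() < y.doubleValue() ? x : y;
    static final BinaryOperator<Number> MAX = (x, y) -> x.doubleValue() > y.doubleValue() ? x : y;
    static final BinaryOperator<Number> SUM = (x, y) -> x.doubleValue() + y.doubleValue();
    static final BinaryOperator<Number> COUNT = (x, y) -> x.longValue() + y.longValue();

    private Number records = 0L;    // COUNT: every record that matched the filters
    private Number fieldCount = 0L; // records that had the field; only used to derive AVG
    private Number sum = 0.0;
    private Number min = null;      // stays null until a value has been seen
    private Number max = null;

    // Fold one record into the running result; a null value means the field was missing.
    void consume(Number value) {
        records = COUNT.apply(records, 1L);
        if (value == null) {
            return;
        }
        fieldCount = COUNT.apply(fieldCount, 1L);
        sum = SUM.apply(sum, value);
        min = (min == null) ? value : MIN.apply(min, value);
        max = (max == null) ? value : MAX.apply(max, value);
    }

    // Merge a partial result that was computed elsewhere into this one.
    void combine(GroupAllExample other) {
        records = COUNT.apply(records, other.records);
        fieldCount = COUNT.apply(fieldCount, other.fieldCount);
        sum = SUM.apply(sum, other.sum);
        if (other.min != null) {
            min = (min == null) ? other.min : MIN.apply(min, other.min);
        }
        if (other.max != null) {
            max = (max == null) ? other.max : MAX.apply(max, other.max);
        }
    }

    long count() { return records.longValue(); }
    Number min() { return min; }
    Number max() { return max; }

    // AVG is derived when the result is read; null if no record had the field.
    Double avg() {
        return fieldCount.longValue() == 0 ? null : sum.doubleValue() / fieldCount.longValue();
    }

    public static void main(String[] args) {
        GroupAllExample ages = new GroupAllExample();
        for (Number age : new Number[] {6.0, 94.0, 30.0}) {
            ages.consume(age);
        }
        System.out.println("count=" + ages.count() + " min=" + ages.min()
                + " max=" + ages.max() + " avg=" + ages.avg());
    }
}
```

Keeping the sum and the field count separate, rather than a running average, is what lets two partial results be merged by simply applying the same operators again.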

## Configuration

Bullet is configured at run-time using settings defined in a file. Settings not overridden will default to the values in [src/main/resources/bullet_defaults.yaml](src/main/resources/bullet_defaults.yaml).
3 changes: 2 additions & 1 deletion pom.xml
@@ -38,7 +38,7 @@
<maven.compiler.target>1.8</maven.compiler.target>
<storm.version>1.0.2</storm.version>
<storm.metrics.version>1.0.2</storm.metrics.version>
<bullet.record.version>0.0.2</bullet.record.version>
<bullet.record.version>0.0.3</bullet.record.version>
</properties>

<dependencies>
@@ -145,6 +145,7 @@
<artifactId>maven-clover2-plugin</artifactId>
<version>4.0.5</version>
<configuration>
<licenseLocation>${user.home}/clover.license</licenseLocation>
<excludes>
<exclude>**/Client.java</exclude>
<exclude>**/Topology.java</exclude>
@@ -17,6 +17,7 @@
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.BiFunction;

public class AggregationOperations {
public enum AggregationType {
@@ -41,7 +42,9 @@ public enum GroupOperationType {
SUM("SUM"),
MIN("MIN"),
MAX("MAX"),
AVG("AVG");
AVG("AVG"),
// COUNT_FIELD operation is only used internally in conjunction with AVG and won't be returned.
COUNT_FIELD("COUNT_FIELD");

private String name;

@@ -60,6 +63,15 @@ public boolean isMe(String name) {
}
}

public interface AggregationOperator extends BiFunction<Number, Number, Number> {
}

// If either argument is null, a NullPointerException will be thrown.
public static final AggregationOperator MIN = (x, y) -> x.doubleValue() < y.doubleValue() ? x : y;
public static final AggregationOperator MAX = (x, y) -> x.doubleValue() > y.doubleValue() ? x : y;
public static final AggregationOperator SUM = (x, y) -> x.doubleValue() + y.doubleValue();
public static final AggregationOperator COUNT = (x, y) -> x.longValue() + y.longValue();

/**
* Checks to see if a {@link Map} contains items.
*
@@ -30,12 +30,12 @@ public GroupAll(Aggregation aggregation) {

@Override
public void consume(BulletRecord data) {
this.data.compute(data);
this.data.consume(data);
}

@Override
public void combine(byte[] serializedAggregation) {
data.merge(serializedAggregation);
data.combine(serializedAggregation);
}

@Override