Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query clauses for grouping and aggregation #441

Closed
jclark opened this issue Feb 28, 2020 · 20 comments
Closed

Add query clauses for grouping and aggregation #441

jclark opened this issue Feb 28, 2020 · 20 comments
Assignees
Labels
Area/Lang Relates to the Ballerina language specification sl-update-priority Priority for Swan Lake Updates Type/Improvement Enhancement to language design
Milestone

Comments

@jclark
Copy link
Collaborator

jclark commented Feb 28, 2020

@jclark jclark added Type/Improvement Enhancement to language design Area/Lang Relates to the Ballerina language specification labels Feb 28, 2020
@jclark jclark added this to the 2020R2 milestone Feb 28, 2020
@jclark jclark self-assigned this Feb 28, 2020
@jclark
Copy link
Collaborator Author

jclark commented Feb 28, 2020

Also need to consider how this will work with event streaming #440.

@lasinicl
Copy link
Contributor

lasinicl commented Jul 20, 2020

@jclark
To implement the group by clause should we refer Query v2 document or are there any other documents or issues that should be referred?

@jclark
Copy link
Collaborator Author

jclark commented Jul 20, 2020

This is the relevant section of the doc you mentioned (we should avoid links to non-public docs here):

group-by-clause := "group" "by" grouping-key ["," grouping-key]*
grouping-key :=
   variable-name
   | inferable-type-descriptor variable-name "=" expression

Groups give rise to frames that have indexed variable bindings and a special binding for the number of incoming frames in the group. An indexed variable binding is a variable that has a separate value for each index in the group. Variables bound by preceding clauses other than those used as grouping keys become indexed variables.

The syntax

   group by var x1 = E1, var x2 = E2

is short for

   let x1 = E1, x2 = E2 group by x1, x2

A group by clause is executed as follows

  • Iterate over each input frame f:
    • partition the input frames into groups, where two frames are in the same group if they have the same (==) value for all grouping-key variables
    • within a group, frames are in incoming order
    • groups are ordered by earliest member in incoming order
  • for each group
    • emit a frame that has
      • each grouping key variable bound to the value common to the group
      • other variables become indexed
aggregate-function-call-expr :=
   "aggregate" aggregate-function-reference "(" [ expression ] ")"
aggregate-function-reference := variable-reference
aggregate-list-constructor-expr := "aggregate" "[" expression "]"

An indexed variable binding can only be referenced in an indexed context. The expression in an aggregate-function-call-expr or aggregate-list-constructor-expr is an indexed context. An aggregate-function-call-expr is evaluated as follows:

  • the expression is evaluated once for each index giving N values, where N is the group size
  • an unqualified aggregate-function-reference is looked up using the basic type of the expression in a similar way to a method call on a non-object
    • the result must be a function that allows any number of arguments >= 1 of that basic type
  • the function is called with all the N expression results as arguments
  • this can make use of existing min, max, sum lang library functions
  • we will need to add average to lang int, decimal, float modules
  • if the expression is omitted, then the aggregate-function-reference must be the identifier count, and the result is the group size (number of frames in the group) (hardcoding like this is a bit ugly, but I haven't been able to think of anything better)
  • aggregate-list-constructor-expr is similar except that it makes a list of the N values
  • these expressions can only be used in a query following a group-by clause (compile time error if used otherwise)

Example

var summary =
   from var { buyer, price } in orders
      group by buyer // price is now an indexed variable
      select { buyer, totalSpend: aggregate sum(price) };

Note that you can use a where clause after group by, which provides functionality similar to an SQL HAVING clause.

var bigSpenders =
   from var { buyer, price } in orders
      group by buyer
      let totalSpend = aggregate sum(price) 
      where totalSpend > 1000
      select { buyer, totalSpend };

@jclark
Copy link
Collaborator Author

jclark commented Jul 20, 2020

This design above isn't final yet, because I haven't yet worked through how it would be extended to the unbounded stream case. Some sort of window concept might be needed, and that might affect the design in the bounded case.

@jclark jclark modified the milestones: Later, Swan Lake + 1 Aug 6, 2020
@jclark
Copy link
Collaborator Author

jclark commented Jan 22, 2021

Google Charts query language has a pivot clause, which looks quite useful. It has similar functionality to pivot tables, which are a key feature in the spreadsheet world. We should consider whether this can fit into our approach to aggregation.

@jclark
Copy link
Collaborator Author

jclark commented Jan 27, 2021

The design in #441 (comment) requires aggregate in front of calls to functions like sum following a group by cause. This is ugly, not SQL-like, and easy to forget. I think we can avoid it.

The aggregate keyword is serving two purposes:

  1. For functions on numbers, it allows different functions to be called depending on whether the number is int, float or decimal.
  2. A single function argument containing an indexed variable gets turned into multiple arguments, one for each index.

For point 1, we are now predeclaring int, float, and decimal prefixes, so I think it is easy enough to say e.g.

let totalSpend = decimal:sum(price)

We can make

let totalSpend = price.sum();

expand into decimal:sum(price) when price is an indexed variable, just as we do usually.

For point 2, we can instead put aggregate in the function definition, more precisely on the relevant parameter, rather than on each call. An expression for a function argument declared as aggregate would count as an indexed context, and work the same as what I previously proposed when aggregate was specified before the function call.

Still to do: find a clean way to do count(). Since group is a keyword, perhaps we could make group.count() work.

@jclark
Copy link
Collaborator Author

jclark commented Nov 22, 2021

See

https://blog.brownplt.org/2021/11/21/b2t2.html

for some interesting design benchmarks.

@jclark
Copy link
Collaborator Author

jclark commented Dec 8, 2021

We need to make sure we can aggregate over the whole table. Instead of ending with select, we want to be able to combine all the values in an easy way. This should do not just things like sum but also things like every/some.

@jclark
Copy link
Collaborator Author

jclark commented Feb 20, 2022

I now think it's better to have an explicit syntax for an aggregated call expression. Then you can aggregate over the whole table just by using an aggregated function call expression in the select e.g.

var total = from var { cost } in orders select sum(cost,...);

Here the ,... is marking sum as being a special kind of function call. The semantics are to evaluate the select expression once, but evaluate the expression preceding ,... once for each binding and then use a langlib function to combine the values. So if cost is bound to 2 and then 3, this will evaluate to m:sum(2, 3) where m is a lang library module determined by the type of cost.

This is designed to be similar to how in SQL you would say SELECT SUM(cost); we cannot make that exact syntax work, because it already has a meaning in Ballerina.

A more complicated example:

string nameList =
  from var { firstName, lastName } in persons
  select ", ".join(firstName + " " + lastName,...);

We could provide extensibility by alllowing a qualified name for the aggregation function. This would refer to a function defined in a module with a stream parameter or a normal varargs parameter.

The syntax has a , before the ... to make things clearer when the argument is a complex expression. For example, in the last case join(firstName + " " + lastName...) would make it seem that just the lastName is being repeated rather than the whole expression.

@jclark
Copy link
Collaborator Author

jclark commented Feb 20, 2022

Notes on specific aggregation functions

  • add avg to lang.int, lang.float, lang.decimal (could use average, but we have min, max and both SQL and XQuery use avg)
  • add count function to lang.value function count(any|error...) returns int (or, since this is not useful generally, we could have a special langlib module used only for aggregations instead of lang.value)
  • need to allow min/max when there are no values: this should return (); user can also do min(0, x, ....) to avoid () return
  • every, some need functions in lang.boolean e.g. function some(boolean...) returns boolean (these would be in addition to functions in lang.array for Add array:some/every langlib function #1005)
  • maybe useful to add first and last functions (which could return ())

@jclark
Copy link
Collaborator Author

jclark commented Feb 20, 2022

We could simplify things to start with by allowing an aggregated call only for the entire select expression, but the semantics make sense without this. For example, if you didn't have an avg function you could do it like

var average = from var { cost } in order select sum(cost,...)/count(cost,...);

and we will need this with group by

@jclark
Copy link
Collaborator Author

jclark commented Feb 20, 2022

The other way to approach this is to use an alternative keyword/operator instead of select, which results in the frame variables being bound to lists e.g.

var total =
  from var { cost } in order
  combine int:sum(...cost);

In this case, int:sum(...cost) is a normal expression evaluated with cost bound to a list of numbers.

This is semantically simpler and cleaner, but the user has to explicitly specify a qualified function name, which would be particularly awkward for count, since lang.value is not automically imported. Plus we need an extra keyword.

It also doesn't handle expressions as nicely:

string nameList =
  from var { firstName, lastName } in persons
  let string tmp = firstName + " " + lastName
  combine ", ".join(...tmp);

So maybe the way to do things is to desugar into something like this.

@jclark
Copy link
Collaborator Author

jclark commented Feb 21, 2022

The two basic approaches are:

  • have a different clause that changes the type that a variable is bound to
  • have a special query-specific kind of variable binding, which can be used in a special query-specific kind of expression

The first is simpler and so preferable, other things being equal. The challenge is to make the ergonomics of the first approach good.

Another idea for the first approach is to leverage streams. The idea is that

  • the clause has the effect of changing the type of each frame variable from T to stream<T,()>.
  • the aggregation functions become functions on lang.stream

So this would look something like:

var total = from var { cost } in orders => cost.sum();

In this example, use of => would cause all the bindings for cost from the LHS or => to be collected into a stream which would be bound to cost on the right hand side of =>.

One challenge is langlib typing isn't strong enough currently to make sum just work for a subtype of int or float or decimal (this is overloading more than parameterized typing).

@jclark
Copy link
Collaborator Author

jclark commented Feb 21, 2022

My current preferred solution is as follows:

  • there is a new clause that is an alternative to select
  • syntax for new clause does not use a new keyword, but instead uses either => or select with a following modifier (e.g. select...)
  • the body of new clause would initially be limited to a call expression (method or function call)
  • variables bound in the query pipeline can occur only in the last argument to the call expression
  • an unqualified function name is mapped onto a langlib function using the argument types
  • a method call or qualified function name is resolved in the usual way; the function must be either varargs or accept a stream argument

So a simple example would look like:

var total = from var { cost } in orders => sum(cost);

More complex example:

string nameList =
  from var { firstName, lastName } in persons
  => ", ".join(firstName + " " + lastName);

From a user perspective, effect is that the last argument is repeated once for each variable binding.

This can be generalized to work with group by and to allow full expressions after =>:

  • the right hand side of => is an expression
  • there is special kind of variable binding (let's call it a sequence-binding), in which a variable is associated with a sequence of one or more values
  • using => causes all variable bindings to have sequence-bindings
  • using group by causes the non-grouped variables to have sequence-bindings
  • a variable with a sequence binding can be used in two places
    • in the last argument of a function call
    • in a list constructor
  • the use of a variable with a sequence binding in a function argument makes a function call special
  • one effect is that the expression for the function argument or list member that contains one or more variables with sequence-bindings is evaluated once for each index in the sequence, and each resulting value is used as a function argument or list member in a similar way to ...x
  • the other effect is that the name of the unqualified function argument is mapped to a langlib function in a similar way to x.foo() when x is not an object

The group by example would look like:

var summary =
   from var { buyer, price } in orders
      group by buyer // price now has a sequence-binding
      select { buyer, totalSpend: sum(price) };

@jclark jclark modified the milestones: Giselle, Swan Lake updates Feb 22, 2022
@jclark
Copy link
Collaborator Author

jclark commented Feb 22, 2022

Note that we could potentially do this with any incompatibilities (and so in Swan Lake Updates) by having some cleverness on how we recognize "group by".

@hasithaa
Copy link
Contributor

I think => func-call() syntax works better, it is syntactic sugar for saying both group and select. Also we can extend this for do... by adding braces. i.e. => { stmt* }

@jclark
Copy link
Collaborator Author

jclark commented Feb 22, 2022

I think => func-call() syntax works better, it is syntactic sugar for saying both group and select. Also we can extend this for do... by adding braces. i.e. => { stmt* }

I agree. I updated #441 (comment) to use =>.

As you observe, with this => is sugar for group by true select (true is just a convenient expression that returns the same value for every frame).

We could allow => stmt-block, but I prefer to keep => for expressions as with expression-bodied functions and anonymous functions. Users can still do group by true do if they really need this (I don't see it being a commonly needed thing).

@jclark
Copy link
Collaborator Author

jclark commented Mar 4, 2022

I think a better way to explain "sequenced-binding" or "indexed-binding" is to say that

  • a variable bound to T becomes bound T[], but
  • a variable that is bound in this way is "implicitly spread", in that when it is referenced, a ... is automatically added before it
  • accordingly such a variable is only allowed where ...x (with x having list type) is allowed

This connects to #52.

@jclark
Copy link
Collaborator Author

jclark commented Mar 4, 2022

We can split this up into a number of separate points:

  • add group by clause; this includes special kind of variable binding
  • add => clause
  • don't treat identifiers that are significant for query expressions as reserved words generally (we don't have a design for this yet)
  • add additional langlib functions for aggregation
    • avg (at least to lang.float and lang.decimal); probably int:avg should return a decimal (as in XQuery)
    • count (probably to lang.value)
    • some, every (to lang.boolean)

@jclark
Copy link
Collaborator Author

jclark commented Jun 15, 2023

All done now.

@jclark jclark closed this as completed Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area/Lang Relates to the Ballerina language specification sl-update-priority Priority for Swan Lake Updates Type/Improvement Enhancement to language design
Projects
None yet
Development

No branches or pull requests

3 participants