Allow query expressions to have group by clause #1134

Closed
jclark opened this issue Jul 5, 2022 · 10 comments
Labels
Area/Lang Relates to the Ballerina language specification sl-update-priority Priority for Swan Lake Updates

Comments

@jclark
Collaborator

jclark commented Jul 5, 2022

This is part of #441.

@jclark
Collaborator Author

jclark commented Jul 5, 2022

The group by clause requires the concept that frames can have variable bindings that are aggregated; see #1144.

The syntax of the group by clause is as follows.

group-by-clause := "group" "by" grouping-key ("," grouping-key)*
grouping-key :=
   variable-name
   | inferable-type-descriptor variable-name "=" expression

The syntax

   group by var x1 = E1, var x2 = E2

is short for

   let var x1 = E1, var x2 = E2 group by x1, x2

A group by clause is executed as follows:

  • Iterate over each input frame f:
    • partition the input frames into groups, where two frames are in the same group if they have the same (==) value for all grouping-key variables
    • within a group, frames are in incoming order
    • groups are ordered by earliest member in incoming order
  • for each group
    • emit a frame that has
      • each grouping key variable bound to the value common to the group
      • each variable v of type T that was bound in a preceding clause and is not used as a grouping key rebound as an aggregated variable of type T[], whose value is a list of the values of v for each frame in the group.
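
The execution steps above can be sketched as a small executable model. This is written in Python rather than Ballerina so that it can be run without a compiler that supports the clause. It is only a sketch under stated assumptions: frames are modelled as plain dicts, an aggregated variable as a plain list, and the names `group_by`, `Frame` and the sample data are hypothetical, not part of the spec.

```python
from typing import Any, Dict, List

Frame = Dict[str, Any]  # assumption: a frame is a mapping from variable name to value


def group_by(frames: List[Frame], keys: List[str]) -> List[Frame]:
    # Partition: frames with equal (==) values for every grouping key share a group.
    # A linear scan keeps groups ordered by their earliest member and does not
    # require the key values to be hashable.
    groups = []  # list of (key values, member frames)
    for f in frames:
        key = [f[k] for k in keys]
        for existing_key, members in groups:
            if existing_key == key:
                members.append(f)  # incoming order is kept within the group
                break
        else:
            groups.append((key, [f]))

    # Emit one frame per group.
    emitted = []
    for key, members in groups:
        out = dict(zip(keys, key))  # grouping keys keep their common value
        for name in members[0]:
            if name not in keys:
                # every other variable becomes the list of its per-frame values
                out[name] = [m[name] for m in members]
        emitted.append(out)
    return emitted


# Hypothetical sample data. "price" models a key bound by
# "group by var price = price1 + price2", i.e. a let followed by group by price.
orders = [
    {"price1": 1, "price2": 2, "name": "X1"},
    {"price1": 2, "price2": 1, "name": "X2"},
    {"price1": 2, "price2": 2, "name": "X3"},
]
frames = [{**o, "price": o["price1"] + o["price2"]} for o in orders]
result = group_by(frames, ["price"])
for frame in result:
    print(frame)
```

In this model price1 and price2 end up as lists just like name, even though they were used to compute the grouping key. The model does not capture the restriction that an aggregated variable can only be referenced in specific contexts.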

@jclark
Collaborator Author

jclark commented Jul 6, 2022

This has now been moved into a separate issue #1144.

The semantics of aggregated variables needs some more detailed specification.

When a variable x is aggregated, it is similar to each reference to x being treated as ...x, except that:

  • when the reference occurs in a function call and the function name is unqualified, the function is looked up in langlib using the type of the variable, similarly to a method call
  • when an argument to a function call contains a reference to such a variable, the entire argument expression is repeated.

For example:

decimal wageBill =
  from var { salary, bonus } in persons
  collect sum(salary + bonus);

Here sum will be resolved to decimal:sum.

The expression salary + bonus will be evaluated once for each index (they are guaranteed to have the same length). So it is equivalent to:

decimal wageBill =
  from var { salary, bonus } in persons
  let decimal total = salary + bonus
  collect sum(total);

which is in turn equivalent to

decimal[] totals =
  from var { salary, bonus } in persons
  let decimal total = salary + bonus
  select total;
decimal wageBill = decimal:sum(...totals);

If you want to get at the value of an aggregated variable as a list, you can do [x], which will work the same as [...x] when x is an array.
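
The equivalences above can be checked with a small model. It is written in Python rather than Ballerina, purely as an illustration. The sample persons data and the helper `decimal_sum` (a stand-in for decimal:sum, taking rest arguments) are assumptions, and an aggregated variable is modelled as a plain list.

```python
from decimal import Decimal

# Hypothetical sample data.
persons = [
    {"salary": Decimal("1000"), "bonus": Decimal("100")},
    {"salary": Decimal("2000"), "bonus": Decimal("250")},
    {"salary": Decimal("1500"), "bonus": Decimal("0")},
]


def decimal_sum(*xs: Decimal) -> Decimal:
    """Stand-in for decimal:sum: adds up its rest arguments."""
    return sum(xs, Decimal(0))


# After collect, salary and bonus are aggregated: one value per input frame.
salary = [p["salary"] for p in persons]
bonus = [p["bonus"] for p in persons]

# sum(salary + bonus): the whole argument expression is repeated once per
# index, and the results are passed as if spread with "...".
wage_bill = decimal_sum(*(s + b for s, b in zip(salary, bonus)))

# Equivalent form: build the list of totals first, then spread it.
totals = [p["salary"] + p["bonus"] for p in persons]
wage_bill_via_totals = decimal_sum(*totals)

# [x] on an aggregated variable behaves like [...x]: it yields the values as a list.
salary_list = [*salary]

print(wage_bill, wage_bill_via_totals, salary_list)
```

Both forms give the same total because the two aggregated variables always have the same length, so pairing them up by index is well defined.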

@suleka96
Contributor

suleka96 commented Oct 3, 2022

Regarding the group by clause, consider the example below:

Order[] orders = [{price1: 1, price2: 2, name: "X1"}, {price1: 2, price2: 1, name: "X2"}, {price1: 2, price2: 2, name: "X3"}];

  _ = from var {price1, price2, name} in orders
                group by var price = price1 + price2
                // Output of group by clause :
                // [
                // {price: 3, name: [X1, X2]},
                // {price: 4, name: [X3]}
                // ]
                select name;
                // Output of select clause :
                // [[X1, X2], [X3]]

After price = price1 + price2, should price1 and price2 still be accessible? Since fields that are not used as grouping keys are to be represented as sequences, should price1 and price2 follow that rule too? Or, since those two fields contribute to the grouping key, should we exclude them from the sequences, as in the example above?

@jclark
Collaborator Author

jclark commented Oct 3, 2022

@suleka96 After the group by clause price1 and price2 become aggregated variables just like name. The fact that they were used in the expression to initialize price makes no difference to this.

@suleka96
Contributor

suleka96 commented Oct 3, 2022

Ack. Just to confirm: for a code sample like the one below, the output will be [[1, 2], [2]], right?

Order[] orders = [{price1: 1, price2: 2, name: "X1"}, {price1: 2, price2: 1, name: "X2"}, {price1: 2, price2: 2, name: "X3"}];

  _ = from var {price1, price2, name} in orders
                group by var price = price1 + price2
                // Output of group by clause :
                // [
                // {price: 3, price1: [1, 2], price2: [2, 1], name: [X1, X2]},
                // {price: 4, price1: [2], price2: [2], name: [X3]}
                // ]
                select price1;
                // Output of select clause :
                // [[1, 2], [2]]

@jclark
Collaborator Author

jclark commented Oct 3, 2022

No. You can only reference aggregated variables in specific contexts. See #1144. The basic idea is that a reference to an aggregated variable x is treated like ...x.

@suleka96
Contributor

As of now, the ... operator is modeled differently in different contexts, as shown below:

function foo(int... arr) {
    
}

public function main() {
    int[] aa = [1, 2, 3];
    foo(...aa); // BLangRestArgExpr
    var _ = [...aa]; // BLangListConstructorSpreadOpExpr
}

Is it okay to reuse one of these representations for the non-grouping keys, or do we need a new way of representing this in the type checker?

@jclark
Collaborator Author

jclark commented Nov 2, 2022

#1137 (comment) raises the issue of whether we can have multiple group by clauses.

One way to handle this would be to say that variables that were aggregated before the group by cannot be referenced within or after the group by. If you want to use them, then you must instead first create a regular variable from the aggregated variable using an appropriate function (or a list constructor).

jclark added the Area/Lang and sl-update-priority labels Dec 20, 2022
jclark added this to the 2023R1 milestone Dec 20, 2022
jclark modified the milestones: 2023R1, 2013R2 Apr 25, 2023
@KavinduZoysa
Contributor

@jclark, please consider the following example.

    var input = [{name: "Saman", price1: 11, price2: 12}, 
                    {name: "Saman", price1: 11, price2: 14}, 
                    {name: "Kamal", price1: 12, price2: 12}, 
                    {name: "Kamal", price1: 12, price2: 14}, 
                    {name: "Saman", price1: 19, price2: 20}];

    var y = from var {name, price1, price2} in input
                    group by price1
                    group by var p2 = [price2] 
                    select [name]; // name is aggregated twice.

In this example, name is aggregated twice. Is it correct to use the aggregated variable as shown in the example?

Instead, we could do the following.

    var input = [{name: "Saman", price1: 11, price2: 12}, 
                    {name: "Saman", price1: 11, price2: 14}, 
                    {name: "Kamal", price1: 12, price2: 12}, 
                    {name: "Kamal", price1: 12, price2: 14}, 
                    {name: "Saman", price1: 19, price2: 20}];

    var y = from var {name, price1, price2} in input
                    group by price1
                    let var n = [name]
                    group by var p2 = [price2] 
                    select [n];

@jclark
Collaborator Author

jclark commented May 9, 2023

@KavinduZoysa That's a good question, which the spec needs to answer. I don't think your first example should be allowed. Your second example is fine. For now I think we should say that attempting to reference a doubly-aggregated variable is an error: we keep them in the frame, so that they cannot be shadowed.

In the future, we could consider allowing access to them, so that something like this would work:

var y = from var {name, price1, price2} in input
                    group by price1
                    group by var p2 = [price2] 
                    select [[name]]; 

But that's something that can be left to later.
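
To make the nesting concrete, here is a rough model of KavinduZoysa's second example (the one that is fine), with two group by clauses and an intermediate let. It is written in Python rather than Ballerina and is only a sketch: frames are dicts, aggregated variables are plain lists, and `group_by` is a hypothetical helper. Because the model uses plain lists, it cannot express the rule that referencing a doubly-aggregated variable is an error.

```python
from typing import Any, Dict, List

Frame = Dict[str, Any]  # assumption: a frame maps variable names to values


def group_by(frames: List[Frame], keys: List[str]) -> List[Frame]:
    # Group frames whose grouping-key values are all equal (==), keeping
    # incoming order within groups and ordering groups by earliest member.
    groups = []
    for f in frames:
        key = [f[k] for k in keys]
        for existing_key, members in groups:
            if existing_key == key:
                members.append(f)
                break
        else:
            groups.append((key, [f]))
    emitted = []
    for key, members in groups:
        out = dict(zip(keys, key))
        for name in members[0]:
            if name not in keys:
                out[name] = [m[name] for m in members]
        emitted.append(out)
    return emitted


rows = [
    {"name": "Saman", "price1": 11, "price2": 12},
    {"name": "Saman", "price1": 11, "price2": 14},
    {"name": "Kamal", "price1": 12, "price2": 12},
    {"name": "Kamal", "price1": 12, "price2": 14},
    {"name": "Saman", "price1": 19, "price2": 20},
]

# group by price1
step1 = group_by(rows, ["price1"])

# let var n = [name]; the key of the second clause is bound as var p2 = [price2]
with_lets = [{**f, "n": list(f["name"]), "p2": list(f["price2"])} for f in step1]

# group by p2: n is an ordinary list here, so it is aggregated only once
step2 = group_by(with_lets, ["p2"])

# select [n]
y = [f["n"] for f in step2]
print(y)
```

In this model the name entry of each final frame is also a list of lists. That is the doubly-aggregated value that a later extension such as select [[name]] would expose directly.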

@jclark jclark closed this as completed in 5e76414 Jun 14, 2023