Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy.aggregate #274

Closed
MarcoGorelli opened this issue Oct 3, 2023 · 2 comments · Fixed by #286
Closed

GroupBy.aggregate #274

MarcoGorelli opened this issue Oct 3, 2023 · 2 comments · Fixed by #286

Comments

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Oct 3, 2023

I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write

SELECT l_returnflag,
       l_linestatus,
       Sum(l_extendedprice * ( 1 - l_discount )) AS sum_disc_price,
       Avg(l_quantity)                           AS avg_qty,
       Count(*)                                  AS count_order
FROM   lineitem
GROUP  BY l_returnflag,
          l_linestatus  

?

There's two problems with the current API:

  1. we can't mix and match sum and avg within the same groupby
  2. we can't transform columns within the group by aggregation. For example, tpc-h would require l_extendedprice*(1-l_discount) to be written inside the group by, rather than outside it

There's a couple of solutions I can think of

1. Expressions

Once again, the expressions API may help us here (#249)

We could introduce GroupBy.aggregate, which you can pass an expression to and which must reduce itself to a 1-row column per group

So, for example, the above could be expressed with:

lineitem: DataFrame
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    (col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price'),
    col('l_quantity').mean().rename('avg_qty'),
    col('l_return_flag').count().rename('count_order'),
)

In the pandas implementation:

  • col('l_quantity').mean().rename('avg_qty') could get mapped to avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
  • (col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price') would probably require a custom lambda function

This would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Alternatively, .aggregate could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then (col('l_extendedprice')*(1-col('discount'))) would have had to be computed before entering the group_by. E.g.:

lineitem: DataFrame
lineitem = lineitem.assign(
    (col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    ('sum_disc_price', 'disc_price', 'sum', {}),
    ('avg_qty', 'l_quantity', 'mean', {}),
    ('count_order', 'l_return_flag', 'count', {}),
)

Maybe we could also define our own dataclass to make the above more readable

Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation


Which solution is best?

  • if we're going for "how should the API of a new dataframe library look like?", then I prefer option 1
  • if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2

All things considered, I might be leaning more towards option 2

@MarcoGorelli
Copy link
Contributor Author

There could even be

3. top-level aggregation functions

Introduce namespace.mean, namespace.count, etc. which only take a single string as input. Then the call above becomes:

lineitem: DataFrame
namespace = lineitem.__dataframe_namespace__()
lineitem = lineitem.assign(
    (col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    namespace.sum('disc_price').rename('sum_disc_price'),
    namespace.mean('l_quantity').rename('avg_qty'),
    namespace.count('l_return_flag').rename('count_order'),
)

I think I like this one the most - it's as readable as option 1, but with the same restrictions as option 2

@MarcoGorelli
Copy link
Contributor Author

#286 proposes:

    result = (
        lineitem.filter(mask)
        .group_by(["l_returnflag", "l_linestatus"])
        .aggregate(
            namespace.Aggregation.sum("l_quantity").rename("sum_qty"),
            namespace.Aggregation.sum("l_extendedprice").rename("sum_base_price"),
            namespace.Aggregation.sum("l_disc_price").rename("sum_disc_price"),
            namespace.Aggregation.sum("change").rename("sum_charge"),
            namespace.Aggregation.mean("l_quantity").rename("avg_qty"),
            namespace.Aggregation.mean("l_discount").rename("avg_disc"),
            namespace.Aggregation.size().rename("count_order"),
        )
        .sort(["l_returnflag", "l_linestatus"])
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant