GroupBy.aggregate #274

MarcoGorelli · 2023-10-03T14:34:27Z

I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write

SELECT l_returnflag,
       l_linestatus,
       Sum(l_extendedprice * ( 1 - l_discount )) AS sum_disc_price,
       Avg(l_quantity)                           AS avg_qty,
       Count(*)                                  AS count_order
FROM   lineitem
GROUP  BY l_returnflag,
          l_linestatus

?

There's two problems with the current API:

we can't mix and match sum and avg within the same groupby
we can't transform columns within the group by aggregation. For example, tpc-h would require l_extendedprice*(1-l_discount) to be written inside the group by, rather than outside it

There's a couple of solutions I can think of

1. Expressions

Once again, the expressions API may help us here (#249)

We could introduce GroupBy.aggregate, which you can pass an expression to and which must reduce itself to a 1-row column per group

So, for example, the above could be expressed with:

lineitem: DataFrame
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    (col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price'),
    col('l_quantity').mean().rename('avg_qty'),
    col('l_return_flag').count().rename('count_order'),
)

In the pandas implementation:

col('l_quantity').mean().rename('avg_qty') could get mapped to avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price') would probably require a custom lambda function

This would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Alternatively, .aggregate could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then (col('l_extendedprice')*(1-col('discount'))) would have had to be computed before entering the group_by. E.g.:

lineitem: DataFrame
lineitem = lineitem.assign(
    (col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    ('sum_disc_price', 'disc_price', 'sum', {}),
    ('avg_qty', 'l_quantity', 'mean', {}),
    ('count_order', 'l_return_flag', 'count', {}),
)

Maybe we could also define our own dataclass to make the above more readable

Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation

Which solution is best?

if we're going for "how should the API of a new dataframe library look like?", then I prefer option 1
if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2

All things considered, I might be leaning more towards option 2

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2023-10-13T17:45:18Z

There could even be

3. top-level aggregation functions

Introduce namespace.mean, namespace.count, etc. which only take a single string as input. Then the call above becomes:

lineitem: DataFrame
namespace = lineitem.__dataframe_namespace__()
lineitem = lineitem.assign(
    (col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    namespace.sum('disc_price').rename('sum_disc_price'),
    namespace.mean('l_quantity').rename('avg_qty'),
    namespace.count('l_return_flag').rename('count_order'),
)

I think I like this one the most - it's as readable as option 1, but with the same restrictions as option 2

MarcoGorelli · 2023-10-23T16:02:29Z

#286 proposes:

    result = (
        lineitem.filter(mask)
        .group_by(["l_returnflag", "l_linestatus"])
        .aggregate(
            namespace.Aggregation.sum("l_quantity").rename("sum_qty"),
            namespace.Aggregation.sum("l_extendedprice").rename("sum_base_price"),
            namespace.Aggregation.sum("l_disc_price").rename("sum_disc_price"),
            namespace.Aggregation.sum("change").rename("sum_charge"),
            namespace.Aggregation.mean("l_quantity").rename("avg_qty"),
            namespace.Aggregation.mean("l_discount").rename("avg_disc"),
            namespace.Aggregation.size().rename("count_order"),
        )
        .sort(["l_returnflag", "l_linestatus"])
    )

MarcoGorelli added the API design label Oct 3, 2023

MarcoGorelli mentioned this issue Oct 17, 2023

Add GroupBy.aggregate (and tpch-1 query to examples) #286

Merged

MarcoGorelli closed this as completed in #286 Oct 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GroupBy.aggregate #274

GroupBy.aggregate #274

MarcoGorelli commented Oct 3, 2023 •

edited

Loading

MarcoGorelli commented Oct 13, 2023

MarcoGorelli commented Oct 23, 2023

GroupBy.aggregate #274

GroupBy.aggregate #274

Comments

MarcoGorelli commented Oct 3, 2023 • edited Loading

1. Expressions

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

MarcoGorelli commented Oct 13, 2023

3. top-level aggregation functions

MarcoGorelli commented Oct 23, 2023

MarcoGorelli commented Oct 3, 2023 •

edited

Loading