-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GroupBy.aggregate #274
Labels
Comments
There could even be 3. top-level aggregation functionsIntroduce lineitem: DataFrame
namespace = lineitem.__dataframe_namespace__()
lineitem = lineitem.assign(
(col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
namespace.sum('disc_price').rename('sum_disc_price'),
namespace.mean('l_quantity').rename('avg_qty'),
namespace.count('l_return_flag').rename('count_order'),
) I think I like this one the most - it's as readable as option 1, but with the same restrictions as option 2 |
#286 proposes: result = (
lineitem.filter(mask)
.group_by(["l_returnflag", "l_linestatus"])
.aggregate(
namespace.Aggregation.sum("l_quantity").rename("sum_qty"),
namespace.Aggregation.sum("l_extendedprice").rename("sum_base_price"),
namespace.Aggregation.sum("l_disc_price").rename("sum_disc_price"),
namespace.Aggregation.sum("change").rename("sum_charge"),
namespace.Aggregation.mean("l_quantity").rename("avg_qty"),
namespace.Aggregation.mean("l_discount").rename("avg_disc"),
namespace.Aggregation.size().rename("count_order"),
)
.sort(["l_returnflag", "l_linestatus"])
) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write
?
There's two problems with the current API:
sum
andavg
within the same groupbyl_extendedprice*(1-l_discount)
to be written inside the group by, rather than outside itThere's a couple of solutions I can think of
1. Expressions
Once again, the expressions API may help us here (#249)
We could introduce
GroupBy.aggregate
, which you can pass an expression to and which must reduce itself to a 1-row column per groupSo, for example, the above could be expressed with:
In the pandas implementation:
col('l_quantity').mean().rename('avg_qty')
could get mapped toavg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price')
would probably require a custom lambda functionThis would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation
2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples
Alternatively,
.aggregate
could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then(col('l_extendedprice')*(1-col('discount')))
would have had to be computed before entering thegroup_by
. E.g.:Maybe we could also define our own dataclass to make the above more readable
Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation
Which solution is best?
All things considered, I might be leaning more towards option 2
The text was updated successfully, but these errors were encountered: