[FEAT] Add support for exogenous variables in utils.aggregate #297
Conversation
…exogenous variables within the hierarchical forecast aggregate function. This supports either a single aggregation function or multiple aggregation functions being applied to the same column. The output column will have the format column_aggfunc.
Examples:
aggregate(df, spec, exog_vars = {'avg_price':'mean'})
aggregate(df, spec, exog_vars = {'avg_price':['mean','sum']})
This is a great and valuable contribution. I'd like to first merge #296 so that we can have all the correct tests running. In the meantime, I'll have a look at the PR.
Thanks - nice and simple!
Could you include a unit test at the bottom of utils.ipynb that verifies correct behaviour? For example, checking that it returns the correct aggregations for a small toy dataset over a set of inputs.
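A minimal sketch of such a test, using plain pandas to emulate the expected total-level output of `aggregate` with `exog_vars` (the column names and the `groupby` emulation here are assumptions for illustration, not the PR's actual implementation):

```python
import pandas as pd

# Toy dataset: two series, two dates, one exogenous variable.
df = pd.DataFrame({
    "store": ["s1", "s1", "s2", "s2"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02"] * 2),
    "y": [1.0, 2.0, 3.0, 4.0],
    "avg_price": [10.0, 20.0, 30.0, 40.0],
})

# Expected behaviour at the total level: y is summed across series per
# timestamp, while avg_price gets the requested 'mean' aggregation and
# is renamed to avg_price_mean (the column_aggfunc format).
total = (df.groupby("ds")
           .agg(y=("y", "sum"), avg_price_mean=("avg_price", "mean"))
           .reset_index())

assert list(total.columns) == ["ds", "y", "avg_price_mean"]
assert total["y"].tolist() == [4.0, 6.0]
assert total["avg_price_mean"].tolist() == [20.0, 30.0]
```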
Thank you for implementing the recommendations from the old PR.
As @elephaint said, a simple unit test is all it needs.
Going forward, considering the potential scale of hierarchical forecasting problems, we should probably consider swapping out pandas in these sorts of tasks for a library like Polars in single-node environments, or for Ibis, which would give us the flexibility to switch easily between execution engines such as Polars or DuckDB on a single node and PySpark in distributed environments.
Thanks @elephaint and @christophertitchen, I'll have some time later this week to wrap this up. On Polars/DuckDB/PySpark, this would have to be a repository-level change, right? Or would we want to consider modular, self-contained conversions for tasks that are likely to require a fair amount of compute, e.g. aggregations?
I'd first like to move towards Polars support on a repo level, that should be relatively doable. Distributed support seems harder because of all the computations involved with Numpy arrays at the moment.
…ring if you only want to apply one agg_func. Added a unit test to verify aggregate functionality
Apologies, the commit somehow ended up being very messy after merging and syncing all the changes that had happened since. All I was resolving were:
Let me know if this inadvertently caused other issues.
Thanks, I think the PR is fine, but I observed some merge conflict issues in
@KuriaMaingi Thank you for your contribution! We appreciate it very much.
This change to the utility function will assist in instances where you need to generate your summing matrix and Y_df but also want to retain any exogenous variables required for your forecast.
You will need to pass in a dictionary mapping your exogenous variables to the pandas agg functions you want applied to them. The latter can be either a single string or a list of strings if you want to apply multiple agg functions to the same column.
Each output column will have the format column_aggfunc.
Examples:
`aggregate(df, spec, exog_vars = {'avg_price':'mean'})` → returns a new column called "avg_price_mean"
`aggregate(df, spec, exog_vars = {'avg_price':['mean','sum']})` → returns 2 new columns called "avg_price_mean" & "avg_price_sum"
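The column_aggfunc naming described above can be emulated with plain pandas; this sketch is an illustration of the intended behaviour (the `exog_vars` normalisation and MultiIndex flattening shown here are assumptions, not necessarily how the PR implements it):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["s1", "s1", "s2", "s2"],
    "y": [1.0, 2.0, 3.0, 4.0],
    "avg_price": [10.0, 20.0, 30.0, 40.0],
})

# exog_vars maps each exogenous column to one agg function or a list of them.
exog_vars = {"avg_price": ["mean", "sum"]}

# Normalise single strings to lists so both call styles are handled uniformly.
agg_spec = {col: (fns if isinstance(fns, list) else [fns])
            for col, fns in exog_vars.items()}

# Aggregate per hierarchy node; pandas returns MultiIndex columns, which we
# flatten into the column_aggfunc format.
out = df.groupby("store").agg(agg_spec)
out.columns = [f"{col}_{fn}" for col, fn in out.columns]

print(list(out.columns))  # ['avg_price_mean', 'avg_price_sum']
```

Passing `{'avg_price': 'mean'}` through the same normalisation would yield the single column `avg_price_mean`, matching the first example above.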