rolling functions, rolling aggregates, sliding window, moving average #2778

Open
16 of 28 tasks
jangorecki opened this issue Apr 21, 2018 · 51 comments
Labels: feature request, froll, top request

jangorecki commented Apr 21, 2018

To gather requirements in a single place and refresh ~4-year-old discussions, I am creating this issue to cover the rolling functions feature (also known as rolling aggregates, sliding window or moving average/moving aggregates).

rolling functions

features

jangorecki commented Apr 27, 2018

@mattdowle answering questions from PR

Why are we doing this inside data.table? Why are we integrating it instead of contributing to existing packages and using them from data.table?

  1. Three separate issues were opened asking for this functionality in data.table, along with multiple SO questions tagged data.table. Users expect it to be in scope for data.table.
  2. data.table is a perfect fit for time-series data, and rolling aggregates are a very useful statistic there.

my guess is it comes down to syntax (features only possible or convenient if built into data.table; e.g. inside [...] and optimized) and building data.table internals into the rolling function at C level; e.g. froll* should be aware and use data.table indices and key. If so, more specifics on that are needed; e.g. a simple short example.

For me personally it is about speed and the lack of a chain of dependencies, which is not easy to achieve nowadays.
Keys/indices could be useful for frollmin/frollmax, but it is unlikely that a user will create an index on a measure variable; also, we haven't made this optimization for min/max yet. I don't see much sense in GForce optimization, because the memory allocated by a roll* call is not released afterwards but returned as the answer (as opposed to non-rolling mean, sum, etc.).

If there is no convincing argument for integrating, then we should contribute to the other packages instead.

I listed some above. If you are not convinced, I recommend putting the question to data.table users, asking on Twitter, etc., to check the response. This feature has been requested for a long time and by many users. If the response doesn't convince you, then you can close this issue.

jangorecki added a commit that referenced this issue May 19, 2018
jangorecki added a commit that referenced this issue May 29, 2018

jangorecki commented Aug 30, 2023

rollcor
rollcov
rollrank
rollunqn
rolllm

went out of scope for now. All of them can be done with frollapply (not on the master branch but in PRs), just not super fast. We could consider bringing them into scope in the future. For now, the following set feels fine and complete to me: sum, mean, prod, min, max, sd, var, median.
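
As a rough illustration of the frollapply route for one of the out-of-scope functions, a rolling count of unique values could be sketched as below. This uses the released `frollapply(x, n, FUN)` and `uniqueN` from data.table; the helper name `roll_unqn` is made up for this example, and the multi-column cases (rollcor, rollcov, rolllm) are not covered by this sketch.

```r
library(data.table)

x <- c(1, 1, 2, 3, 3, 3)

# Hypothetical helper: number of distinct values in one window.
# as.numeric() keeps the result a double scalar.
roll_unqn <- function(w) as.numeric(uniqueN(w))

# Window of 3, right-aligned (the default); incomplete windows give NA.
res <- frollapply(x, 3, roll_unqn)
print(res)
# [1] NA NA  2  3  2  1
```

This calls an R function once per window, so it is expected to be much slower than a dedicated C implementation.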

MichaelChirico added the top request label Apr 14, 2024

roaldarbol commented Sep 30, 2024

@jangorecki just following up here based on your comment in {roll}. I was happy to see that frollmedian and friends will be available in {data.table}! What is the status of frollmedian - do you have a rough ETA? I can see that the PR has not been worked on since January and currently fails checks.

@jangorecki

No ETA (it requires multiple other branches to be merged first). I recommend using the rollmedian branch directly. It was branched from a very stable point in master (cascading through the other rolling-related branches). I know it is being used in production.

@roaldarbol

Sounds good, I'll try that. Which rolling functions are available on that branch? Just frollmedian or also others? (I'm doing some benchmarking, so I just want to make sure I get as many of your implementations as possible) 😊

jangorecki commented Sep 30, 2024

Others as well; rollmedian is the most recent of all the rolling branches, so it includes the rest. There is also a rewritten frollapply to apply any function, which is multithreaded and memory-optimized.

@MichaelChirico

@roaldarbol if you're keen, the blocker for merging the existing PR is a lack of reviewer+author bandwidth. We could go for someone to either:

@roaldarbol

Others as well, rollmedian is the most recent branch of all rolling branches so includes the rest as well. There is also rewritten frollapply to apply any function, which is multi threaded and memory optimized.

That's great, I'll give it a spin! @MichaelChirico I unfortunately don't have the time currently, but if the need is still there a few months from now I might have a look. 😊

@roaldarbol

I've started benchmarking the various rolling stats across a bunch of packages, and the new data.table implementations are sweeping the floor! Hope we can get those PRs over the line!

PS As someone new here, would such a benchmark be worth adding to the Articles?

jangorecki commented Oct 7, 2024

Thanks for benchmarking.

Note that readers will not really know how those functions scale, which reduces the utility of the benchmark. It is always good to present multiple input vector sizes as well as multiple window sizes. With at least three different sizes it is possible to conclude whether a function scales linearly, or worse (or better) than linearly.
This scaling effect gives a bit more insight than a single fixed input size and window size.
To give a practical example, you can have a situation where one tool is fastest at window size 10 but slowest at window size 1000.
And here the median result will probably be different if you increase the window size. I could add the RollingWindow package to this benchmark: https://github.com/jangorecki/rollbench so it will be easily visible.
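
The multi-size, multi-window layout described above could be sketched roughly as follows. This is only a minimal illustration, not the rollbench setup: `bench_grid`, the `funs` list, and the chosen sizes are assumptions made up for this example, and each cell is timed only once, so the numbers will be noisy.

```r
library(data.table)

# Hypothetical helper: time every function in `funs` on every
# combination of input size `n` and window size `k`.
bench_grid <- function(sizes, windows, funs) {
  grid <- CJ(n = sizes, k = windows, fun = names(funs))
  elapsed <- mapply(function(n, k, fun) {
    x <- rnorm(n)
    system.time(funs[[fun]](x, k))[["elapsed"]]
  }, grid$n, grid$k, grid$fun)
  set(grid, j = "elapsed", value = elapsed)
  grid
}

# Illustrative candidates: an optimized rolling mean and a generic
# rolling apply of mean().
funs <- list(
  frollmean  = function(x, k) frollmean(x, k),
  frollapply = function(x, k) frollapply(x, k, mean)
)

set.seed(1)
res <- bench_grid(sizes = c(1e3, 1e4, 1e5),
                  windows = c(10, 100, 1000),
                  funs = funs)

# One row per (n, k), one column per function, for side-by-side reading.
print(dcast(res, n + k ~ fun, value.var = "elapsed"))
```

Reading down a column for fixed `k` shows how a function scales with input size; reading across `k` for fixed `n` shows how it scales with window size.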

When benchmarking a custom function with rollapply, I would go for a real custom function, as there may be an optimization that detects "sum" and switches to an optimized sum.
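
For instance, something like the window range below is unlikely to be matched by any name-based shortcut. This is just a sketch using data.table's `frollapply`; `window_range` is a made-up name, and whether a given package actually special-cases "sum" is not verified here.

```r
library(data.table)

x <- c(1, 3, 2, 5, 4)

# A "real" custom function: the range (max minus min) of each window.
window_range <- function(w) max(w) - min(w)

# Window of 3, right-aligned (the default); incomplete windows give NA.
res <- frollapply(x, 3, window_range)
print(res)
# [1] NA NA  2  3  3
```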

Definitely makes sense to add to articles; this is what articles are for on the data.table wiki page.

@roaldarbol

Oh yeah, absolutely, I'm quite aware of the scaling dimension - I started out with those here: jasonjfoster/roll#44. But I also find that a lot of the nuance in the smaller values disappears (e.g. when data.table is 5x faster than another fast function), and I want to show both, so I'm currently experimenting with better ways (or combinations of ways) to visualize benchmarks. Hope that makes sense. :-)

Thanks for the note on custom functions, I'll change that. And then I'll add it to articles once I've found a preferred way of visualising the scaling.
