Benchmark on bigger data #184
Comments
Hello, and thanks for the benchmarks. The package is completely serial at the C/C++ level; there is no parallelism yet. That explains the benchmark results.
Thanks for the update. Sounds like you have a lot on your plate. If you're open to suggestions, I made one to the data.table and datatable folks yesterday that you might find interesting.
Thanks, but I don't think I'll do rolling functions, also because the roll package is an absolute blast (way faster than …).
Not sure about your comment. I'm seeing the opposite performance-wise. For single-variable / single-period column creation, data.table comes out 12.7% faster. For multiple columns and multiple periods, I'm getting a compounding effect: in the example below, data.table shows a 22x run-time advantage.
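To make the "multiple columns x multiple periods" compounding effect concrete, here is a minimal Python/pandas sketch of that benchmark shape (the thread's actual code was in R with data.table; the column names, sizes, and window lengths below are assumptions, not the original benchmark):

```python
import time
import numpy as np
import pandas as pd

# Assumed benchmark shape: 4 columns, 4 rolling windows -> 16 output columns.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((n, 4)), columns=list("abcd"))
windows = [5, 21, 63, 252]

start = time.perf_counter()
results = {
    f"{col}_roll{w}": df[col].rolling(w).mean()
    for col in df.columns
    for w in windows
}
elapsed = time.perf_counter() - start
print(f"{len(results)} rolling columns in {elapsed:.3f}s")
```

The point is that cost scales with columns x windows, so a constant-factor difference on one rolling column compounds into a large gap on a wide grid of them.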
You are right, they are actually quite comparable. The …
I'd argue that the use case I described to the data.table folks is pretty unique. I built a version for myself, but it utilizes data.table, not C/C++. While creating rolling stats quickly over a full data set is important, the partial-data use case is much more critical from a model-scoring perspective, where milliseconds matter most. I'm okay if you don't want to pursue this; I figured I'd run it by you just in case. Thanks for your time.
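The partial-data scoring idea above can be sketched as an incremental rolling statistic: at scoring time you keep only the last `window` values and update in O(1) per new observation, instead of recomputing over the full history. This is a minimal Python illustration of the concept; the class and method names are hypothetical and not from any package mentioned in the thread:

```python
from collections import deque

class RollingMean:
    """Incremental rolling mean over a fixed-size window.

    Keeps a running sum plus a bounded buffer, so scoring a new
    observation costs O(1) regardless of total history length.
    """

    def __init__(self, window):
        self.window = window
        self.buf = deque(maxlen=window)
        self.total = 0.0

    def update(self, x):
        # Evict the oldest value from the running sum once the window is full.
        if len(self.buf) == self.window:
            self.total -= self.buf[0]
        self.buf.append(x)
        self.total += x
        return self.total / len(self.buf)
```

A usage example: feeding 1, 2, 3, 4 into a window of 3 yields means 1.0, 1.5, 2.0, and then 3.0 once the window slides past the first value.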
This is a pretty cool package and I'm starting to test some of the functions available. I already updated a function inside my R package because of it.
I saw you mention that, as data grows, data.table by itself may become faster. I just reran one of the benchmarks on a bigger data set and pasted the code / results below. Are your C/C++ operations parallelized? If so, any idea why the benchmark results would reverse for bigger data sets?
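For context on the parallelism question, one common way a rolling computation gets parallelized is across columns, since each column's result is independent. Here is a hedged Python sketch of that idea using a cumulative-sum rolling mean; the function name, shapes, and window size are assumptions for illustration, not the package's actual implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def roll_mean(col, window):
    # O(n) rolling mean via cumulative sums; output has n - window + 1 values.
    c = np.cumsum(np.insert(col, 0, 0.0))
    return (c[window:] - c[:-window]) / window

# Assumed shape: 8 independent columns, parallelized across a thread pool.
rng = np.random.default_rng(0)
data = rng.standard_normal((8, 100_000))
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda col: roll_mean(col, 21), data))
```

Whether this helps depends on data size: for small inputs the threading overhead can dominate, which is one plausible reason benchmark rankings flip between small and large data sets.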