-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement COLLECT rolling window aggregation #7189
Conversation
No min-periods check yet.
Also, fixed null/empty list representation.
... at the beginning of the output.
There are now 4 fairly large tests that pertain to
I intended to combine them all under the |
Codecov Report
@@ Coverage Diff @@
## branch-0.18 #7189 +/- ##
===============================================
+ Coverage 82.09% 82.19% +0.09%
===============================================
Files 97 99 +2
Lines 16474 16813 +339
===============================================
+ Hits 13524 13819 +295
- Misses 2950 2994 +44
Continue to review full report at Codecov.
|
Thank you for the reviews, @vuule, @rgsl888prabhu! 👏 |
@gpucibot merge |
I have filed #7258 for optional null-filtering, to support Spark use-cases. |
Fixes #7265. `cudf::detail::get_num_child_rows()` is currently defined in `cudf/lists/detail/utilities.cuh`. The build pipelines for #7189 are fine, but there seem to be build failures in dependent projects such as `spark-rapids`: ``` [2021-01-31T08:12:10.611Z] /.../workspace/spark/cudf18_nightly/cpp/include/cudf/lists/detail/utilities.cuh:31:18: error: 'cudf::size_type cudf::detail::get_num_child_rows(const cudf::column_view&, rmm::cuda_stream_view)' defined but not used [-Werror=unused-function] [2021-01-31T08:12:10.611Z] static cudf::size_type get_num_child_rows(cudf::column_view const& list_offsets, [2021-01-31T08:12:10.611Z] ^~~~~~~~~~~~~~~~~~ [2021-01-31T08:12:11.981Z] cc1plus: all warnings being treated as errors [2021-01-31T08:12:12.238Z] make[2]: *** [CMakeFiles/cudf_hash.dir/build.make:82: CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o] Error 1 [2021-01-31T08:12:12.238Z] make[1]: *** [CMakeFiles/Makefile2:220: CMakeFiles/cudf_hash.dir/all] Error 2 ``` In any case, it is less than ideal for the function to be completely defined in the header, especially given that the likes of `hashing.cu` are exposed to it (by way of `scatter.cuh`). This commit moves the function definition to a separate translation unit, without changing implementation or interface. Authors: - MithunR (@mythrocks) Approvers: - @nvdbaranec - Mike Wilson (@hyperbolic2346) - David (@davidwendt) URL: #7266
Add unit tests for aggregate 'collect' with windowing. This PR depends on the PR #7189 . Signed-off-by: Liangcai Li <[email protected]> Authors: - Liangcai Li (@firestarman) Approvers: - MithunR (@mythrocks) - Robert (Bobby) Evans (@revans2) URL: #7121
Closes #7258. #7189 implements `COLLECT` aggregations to be done from window functions. The semantics of how null input rows are handled are consistent with CUDF semantics. E.g. ```c++ auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74}; auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr); // == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ] ``` Note that the null element (`∅`) is replicated in the first 3 rows of the output. SparkSQL (and Hive, and other big data SQL systems) have different semantics, in that all null elements are purged. The output for the same operation should yield the following: ```c++ auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr); // == [ [70], [70,72], [72,73], [72,73,74], [73,74] ] ``` CUDF should allow the `COLLECT` aggregation to be constructed with an optional `null_policy` argument (with default `INCLUDE`). The `COLLECT` window function should check the policy, and filter out null list-elements _a posteriori_. Authors: - MithunR (@mythrocks) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - AJ Schmidt (@ajschmidt8) - Vukasin Milovanovic (@vuule) - Jake Hemstad (@jrhemstad) URL: #7264
Closes #7133.
This is an implementation of the
COLLECT
aggregation in the context of rolling window functions. This enables the collection of rows (of typeT
) within specified window boundaries into a list column (containing elements of typeT
). In this context, one list row would be generated per input row. E.g. Consider the following example:Calling
rolling_window()
withpreceding=2
,following=1
,min_periods=1
produces the following:COLLECT
is supported withrolling_window()
,grouped_rolling_window()
, andgrouped_time_range_rolling_window()
, across primitive types and arbitrarily nested lists and structs.min_periods
is also honoured: If the number of observations is fewer than min_periods, the resulting list row is null.