Implement COLLECT rolling window aggregation #7189

mythrocks · 2021-01-22T04:16:57Z

Closes #7133.

This PR implements the COLLECT aggregation for rolling window functions. It collects the rows (of type T) that fall within the specified window boundaries into a list column (with elements of type T), producing one list row per input row. For example, consider the following input:

auto input_col = fixed_width_column_wrapper<int32_t>{70, 71, 72, 73, 74};

Calling rolling_window() with preceding=2, following=1, min_periods=1 produces the following:

auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,71], [70,71,72], [71,72,73], [72,73,74], [73,74] ]

COLLECT is supported with rolling_window(), grouped_rolling_window(), and grouped_time_range_rolling_window(), across primitive types and arbitrarily nested lists and structs.

min_periods is also honoured: if a window contains fewer than min_periods observations, the resulting list row is null.
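The window semantics described above can be modelled on the host with plain C++. This is only a reference sketch, not the GPU implementation in this PR: `list_rows` and `rolling_collect_reference` are hypothetical names, and the sketch assumes that `preceding` counts the current row (which is what the example output above implies).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for a list column:
// one list per input row, plus a validity flag per row.
struct list_rows {
  std::vector<std::vector<int>> lists;
  std::vector<bool> valid;
};

// Reference model of COLLECT over a fixed rolling window.
// Assumption: `preceding` includes the current row, so preceding=2,
// following=1 covers rows [i-1, i+1], clamped to the column bounds.
list_rows rolling_collect_reference(const std::vector<int>& input,
                                    int preceding,
                                    int following,
                                    int min_periods)
{
  assert(preceding >= 1 && following >= 0);
  const int n = static_cast<int>(input.size());
  list_rows out;
  for (int i = 0; i < n; ++i) {
    const int begin = std::max(0, i - preceding + 1);
    const int end   = std::min(n, i + following + 1);
    // Windows with fewer than min_periods observations yield a null row.
    const bool ok = (end - begin) >= min_periods;
    out.valid.push_back(ok);
    out.lists.push_back(ok ? std::vector<int>(input.begin() + begin, input.begin() + end)
                           : std::vector<int>());
  }
  return out;
}
```

Calling `rolling_collect_reference({70, 71, 72, 73, 74}, 2, 1, 1)` reproduces the lists in the example; raising `min_periods` to 3 makes the first and last rows null, because their windows hold only two observations.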

mythrocks · 2021-01-22T04:21:05Z

Part of the algorithm employed here is a refinement of @harrism's suggestion on #6791 (in a different context). I have tagged @harrism on this PR to confirm that this works as he intended.
(The function in question is get_list_child_to_list_row_mapping(), in rolling_detail.cuh.)
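For readers unfamiliar with the idea, a child-to-list-row mapping can be sketched on the host as a scatter followed by an inclusive prefix sum over the list offsets. This is an illustration under assumptions, not the code in rolling_detail.cuh: `child_to_list_row_mapping` is a hypothetical name, and the real function runs on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given list offsets (size == num_lists + 1, starting at 0), return for each
// child element the index of the list row that owns it.
std::vector<int> child_to_list_row_mapping(const std::vector<int>& offsets)
{
  assert(!offsets.empty() && offsets.front() == 0);
  const int num_children = offsets.back();
  std::vector<int> mapping(num_children, 0);

  // Step 1: scatter a marker at the start offset of every list except the
  // first. Empty lists share an offset with their successor, so markers
  // accumulate there. Trailing empty lists start at num_children and own no
  // child rows, so they are skipped.
  for (std::size_t row = 1; row + 1 < offsets.size(); ++row) {
    if (offsets[row] < num_children) { mapping[offsets[row]] += 1; }
  }

  // Step 2: an inclusive prefix sum turns the markers into list-row indices.
  for (int i = 1; i < num_children; ++i) { mapping[i] += mapping[i - 1]; }
  return mapping;
}
```

For the offsets `{0, 2, 5, 8, 11, 13}` (the list sizes from the example in the description), the mapping is `{0,0,1,1,1,2,2,2,3,3,3,4,4}`. Both steps are data-parallel, which is the reason this shape suits a GPU.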

mythrocks · 2021-01-22T04:35:32Z

There are now 4 fairly large tests that pertain to *rolling_window() functions:

ROLLING_TEST
GROUPED_ROLLING_TEST
LEAD_LAG_TEST
COLLECT_LIST_TEST

I intend to combine them all under the tests/rolling/ directory, but will do so in a subsequent PR so as not to detract from this one.

codecov · 2021-01-22T08:20:09Z

Codecov Report

Merging #7189 (7115460) into branch-0.18 (8860baf) will increase coverage by 0.09%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.18    #7189      +/-   ##
===============================================
+ Coverage        82.09%   82.19%   +0.09%     
===============================================
  Files               97       99       +2     
  Lines            16474    16813     +339     
===============================================
+ Hits             13524    13819     +295     
- Misses            2950     2994      +44

Impacted Files	Coverage Δ
python/cudf/cudf/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/_fuzz_testing/parquet.py	`0.00% <ø> (ø)`
python/cudf/cudf/_lib/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/_typing.py	`91.66% <ø> (ø)`
python/cudf/cudf/core/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/core/abc.py	`91.48% <ø> (+4.25%)`	⬆️
python/cudf/cudf/core/buffer.py	`80.00% <ø> (+0.95%)`	⬆️
python/cudf/cudf/core/column/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/core/column/categorical.py	`92.73% <ø> (-0.62%)`	⬇️
python/cudf/cudf/core/column/column.py	`87.75% <ø> (-0.39%)`	⬇️
... and 70 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mythrocks · 2021-01-29T23:46:54Z

Thank you for the reviews, @vuule, @rgsl888prabhu! 👏

kkraus14 · 2021-01-30T00:21:11Z

@gpucibot merge

mythrocks · 2021-01-30T01:26:08Z

I have filed #7258 for optional null-filtering, to support Spark use-cases.

@mythrocks

Fixes #7265.

`cudf::detail::get_num_child_rows()` is currently defined in `cudf/lists/detail/utilities.cuh`. The build pipelines for #7189 are fine, but there seem to be build failures in dependent projects such as `spark-rapids`:

```
[2021-01-31T08:12:10.611Z] /.../workspace/spark/cudf18_nightly/cpp/include/cudf/lists/detail/utilities.cuh:31:18: error: 'cudf::size_type cudf::detail::get_num_child_rows(const cudf::column_view&, rmm::cuda_stream_view)' defined but not used [-Werror=unused-function]
[2021-01-31T08:12:10.611Z] static cudf::size_type get_num_child_rows(cudf::column_view const& list_offsets,
[2021-01-31T08:12:10.611Z] ^~~~~~~~~~~~~~~~~~
[2021-01-31T08:12:11.981Z] cc1plus: all warnings being treated as errors
[2021-01-31T08:12:12.238Z] make[2]: *** [CMakeFiles/cudf_hash.dir/build.make:82: CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o] Error 1
[2021-01-31T08:12:12.238Z] make[1]: *** [CMakeFiles/Makefile2:220: CMakeFiles/cudf_hash.dir/all] Error 2
```

In any case, it is less than ideal for the function to be completely defined in the header, especially given that the likes of `hashing.cu` are exposed to it (by way of `scatter.cuh`). This commit moves the function definition to a separate translation unit, without changing implementation or interface.

Authors:
- MithunR (@mythrocks)

Approvers:
- @nvdbaranec
- Mike Wilson (@hyperbolic2346)
- David (@davidwendt)

URL: #7266
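The shape of that fix can be illustrated in a single file. A `static` function defined in a header gives every including translation unit its own private copy, and any unit that never calls it trips `-Wunused-function`. Declaring the function in the header and defining it once avoids that. The sketch below is hypothetical: the `sketch` namespace, the file names in the comments, and the `std::vector` signature are stand-ins, and it assumes the function simply returns the final offset.

```cpp
#include <cassert>
#include <vector>

// ---- would live in a header (hypothetical utilities.hpp) ----
// Declaration only: nothing is defined here, so including the header from a
// translation unit that never calls the function cannot leave an unused
// definition behind.
namespace sketch {
namespace detail {
int get_num_child_rows(const std::vector<int>& list_offsets);
}  // namespace detail
}  // namespace sketch

// ---- would live in exactly one translation unit (hypothetical utilities.cpp) ----
namespace sketch {
namespace detail {
// Assumption for illustration: the number of child rows equals the last offset.
int get_num_child_rows(const std::vector<int>& list_offsets)
{
  assert(!list_offsets.empty());
  return list_offsets.back();
}
}  // namespace detail
}  // namespace sketch
```

Callers see the same interface before and after such a move; only the location of the definition changes.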

@firestarman

Add unit tests for aggregate 'collect' with windowing. This PR depends on PR #7189.

Signed-off-by: Liangcai Li <[email protected]>

Authors:
- Liangcai Li (@firestarman)

Approvers:
- MithunR (@mythrocks)
- Robert (Bobby) Evans (@revans2)

URL: #7121

@mythrocks

Closes #7258.

#7189 implements `COLLECT` aggregations to be done from window functions. The semantics of how null input rows are handled are consistent with CUDF semantics. E.g.

```c++
auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74};
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
// == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
```

Note that the null element (`∅`) is replicated in the first 3 rows of the output. SparkSQL (and Hive, and other big data SQL systems) have different semantics, in that all null elements are purged. The output for the same operation should yield the following:

```c++
auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
// == [ [70], [70,72], [72,73], [72,73,74], [73,74] ]
```

CUDF should allow the `COLLECT` aggregation to be constructed with an optional `null_policy` argument (with default `INCLUDE`). The `COLLECT` window function should check the policy, and filter out null list-elements _a posteriori_.

Authors:
- MithunR (@mythrocks)

Approvers:
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
- AJ Schmidt (@ajschmidt8)
- Vukasin Milovanovic (@vuule)
- Jake Hemstad (@jrhemstad)

URL: #7264
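The difference between the two null policies can be modelled on the host as collect-then-filter. This is a sketch under assumptions rather than the cudf implementation: `null_policy_sketch`, `nullable_int`, and `rolling_collect_with_nulls` are hypothetical names, `preceding` is assumed to include the current row, and `min_periods` handling is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only.
enum class null_policy_sketch { INCLUDE, EXCLUDE };

struct nullable_int {
  int value;
  bool valid;  // false models a null element
};

// Collect each rolling window into a list, then drop null elements
// afterwards if the policy is EXCLUDE.
std::vector<std::vector<nullable_int>> rolling_collect_with_nulls(
  const std::vector<nullable_int>& input,
  int preceding,
  int following,
  null_policy_sketch policy)
{
  assert(preceding >= 1 && following >= 0);
  const int n = static_cast<int>(input.size());
  std::vector<std::vector<nullable_int>> out;
  for (int i = 0; i < n; ++i) {
    const int begin = std::max(0, i - preceding + 1);
    const int end   = std::min(n, i + following + 1);
    std::vector<nullable_int> window(input.begin() + begin, input.begin() + end);
    if (policy == null_policy_sketch::EXCLUDE) {
      // Purge null elements from the already-collected window.
      window.erase(std::remove_if(window.begin(),
                                  window.end(),
                                  [](const nullable_int& e) { return !e.valid; }),
                   window.end());
    }
    out.push_back(std::move(window));
  }
  return out;
}
```

With the input `{70, ∅, 72, 73, 74}` and a window of preceding=2, following=1, the `INCLUDE` policy keeps the null in the first three lists, while `EXCLUDE` yields lists of sizes 1, 2, 2, 3, 2, matching the Spark-style output shown above.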

mythrocks added 14 commits January 17, 2021 11:54

JUNCO: Working prototype:

4229fd6

No min-periods check yet.

WIP: Got offsets.

ca0f651

WIP: Got child/input mapping.

5b639c5

WIP: Got gather map working.

3170c18

Working.

f0a208e

No min_periods handling yet. Minimal test.

WIP: Added fixup for rolling_window() iterators.

d8b834b

WIP: Added support for empty lists in result.

2568341

WIP: Switch for_each_n() to transform()

db73208

WIP: Support for min_periods checks.

871a4ae

Also, fixed null/empty list representation.

WIP: Clarify how empty lists are handled...

3fbaf3e

... at the beginning of the output.

WIP: Tests!

219cb5d

WIP: Moved get_num_child_rows() to utilities

507239c

Merge remote-tracking branch 'origin/branch-0.18' into collect_list

095e911

Code formatting.

aec5ae1

mythrocks requested review from a team as code owners January 22, 2021 04:16

mythrocks requested review from vuule, rgsl888prabhu and harrism January 22, 2021 04:17

mythrocks self-assigned this Jan 22, 2021

mythrocks added the 3 - Ready for Review, 4 - Needs Review, feature request, and non-breaking labels Jan 22, 2021

mythrocks mentioned this pull request Jan 22, 2021

[FEA] COLLECT aggregation for rolling windows #7133

Closed

mythrocks commented Jan 22, 2021

cpp/tests/collect_list/collect_list_test.cu (outdated, resolved)

Moved collect_list_test to .cpp

3aa029b

Fixed copyrights. Refactored null mask construction.

76926d4

mythrocks requested a review from rgsl888prabhu January 26, 2021 20:59

rgsl888prabhu reviewed Jan 26, 2021

cpp/tests/collect_list/collect_list_test.cpp (2 resolved threads)

firestarman mentioned this pull request Jan 27, 2021

[FEA] Support GPU accelerated UDF alternative for higher order function "aggregate" over window NVIDIA/spark-rapids#1419

Closed

12 tasks

mythrocks added 4 commits January 27, 2021 15:37

Test for Input columns with nulls.

4bcd852

Merge remote-tracking branch 'origin/branch-0.18' into collect_list

6558ac9

More tests for nulled inputs.

95a3d49

Merge remote-tracking branch 'origin/branch-0.18' into collect_list

7115460

vuule approved these changes Jan 29, 2021

mythrocks removed the request for review from harrism January 29, 2021 21:37

rgsl888prabhu approved these changes Jan 29, 2021

mythrocks added the 5 - Ready to Merge label Jan 29, 2021

kkraus14 removed the 3 - Ready for Review and 4 - Needs Review labels Jan 30, 2021

rapids-bot bot merged commit 14b0900 into rapidsai:branch-0.18 Jan 30, 2021

mythrocks mentioned this pull request Jan 30, 2021

[FEA] COLLECT window aggregation should support null_policy::EXCLUDE #7258

Closed

sameerz mentioned this pull request Jan 30, 2021

[FEA] Support window operations on Decimal NVIDIA/spark-rapids#1333

Closed

6 tasks

This was referenced Feb 1, 2021

Support null_policy::EXCLUDE for COLLECT rolling aggregation #7264

Merged

Move lists utility function definition out of header #7266

Merged

mythrocks mentioned this pull request Feb 18, 2021

Parallelize child column construction in scatter() for lists columns #6791

Closed

ttnghia mentioned this pull request Feb 22, 2021

Implement groupby collect_set #7420

Merged
