Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] grouped rolling produces wrong answer for structs #8887

Closed
revans2 opened this issue Jul 28, 2021 · 4 comments · Fixed by #9228
Closed

[BUG] grouped rolling produces wrong answer for structs #8887

revans2 opened this issue Jul 28, 2021 · 4 comments · Fixed by #9228
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Jul 28, 2021

Describe the bug
The grouped rolling API lets structs in as a group by key, but produces incorrect answers.

Steps/Code to reproduce bug

diff --git a/cpp/tests/rolling/grouped_rolling_test.cpp b/cpp/tests/rolling/grouped_rolling_test.cpp
index cb123114fd..d2b75832cd 100644
--- a/cpp/tests/rolling/grouped_rolling_test.cpp
+++ b/cpp/tests/rolling/grouped_rolling_test.cpp
@@ -2439,3 +2439,32 @@ TYPED_TEST(TypedUnboundedWindowTest, UnboundedPrecedingAndFollowingWindowMultiGr
                                  fixed_width_column_wrapper<cudf::size_type>{
                                    {3, 3, 3, 3, 3, 4, 4, 4, 4, 4}, {1, 1, 1, 1, 1, 1, 1, 1, 1, 1}});
 }
+
+TYPED_TEST(TypedUnboundedWindowTest, UnboundedPrecedingAndFollowingWindowStructGroup)
+{
+  using namespace cudf::test;
+  using T = TypeParam;
+
+  auto grp_col_inner = fixed_width_column_wrapper<T>{0, 0, 0, 0, 0, 1, 1, 1, 1, 1};
+  //std::vector<std::unique_ptr<cudf::column>> children{grp_col_inner.release()};
+  auto const grp_col = structs_column_wrapper{{grp_col_inner}};
+  
+  auto const agg_col =
+    fixed_width_column_wrapper<T>{{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 1, 1, 0, 1, 0, 1, 1, 1, 1}};
+
+  auto const grouping_keys       = cudf::table_view{std::vector<cudf::column_view>{grp_col}};
+  auto const unbounded_preceding = cudf::window_bounds::unbounded();
+  auto const unbounded_following = cudf::window_bounds::unbounded();
+  auto const min_periods         = 1L;
+  auto const output =
+    cudf::grouped_rolling_window(grouping_keys,
+                                 agg_col,
+                                 unbounded_preceding,
+                                 unbounded_following,
+                                 min_periods,
+                                 *cudf::make_count_aggregation<cudf::rolling_aggregation>());
+
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(output->view(),
+                                 fixed_width_column_wrapper<cudf::size_type>{
+                                   {3, 3, 3, 3, 3, 4, 4, 4, 4, 4}, {1, 1, 1, 1, 1, 1, 1, 1, 1, 1}});
+}

This is almost exactly the same as the test above it, but grp_col is a struct column wrapping the original grp_col. It now fails with an incorrect result.

Expected behavior
I would love it if it could produce a correct result, but I am willing to put up with it throwing an exception for group by keys it cannot support.

Environment overview (please complete the following information)
I tested this on both 20.08 and 20.10. I think think it is a blocker in any way for us. I can disable it in the Spark plugin.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jul 28, 2021
@mythrocks
Copy link
Contributor

mythrocks commented Jul 28, 2021

I ran into a related failure, while working on groupby on struct columns. Grouping on struct columns is currently busted: it returns the input column, unmodified. This causes each row to be deemed its own group. (!!)

Since the groupby happens upstream from the window function, it is understandable that the window function's output is borked. My optimistic guess is that fixing the groupby should fix this failure as well. I'm working on this right now.

@mythrocks mythrocks self-assigned this Jul 28, 2021
@revans2
Copy link
Contributor Author

revans2 commented Jul 29, 2021

That is great. It would be good to also understand if there is type checking that needs to be added in somewhere? Is this a problem for list columns too?

@mythrocks
Copy link
Contributor

It would be good to also understand if there is type checking that needs to be added in somewhere?

Yes, I should think so. groupby's checks to cause grouping on struct input to fail, rather than pass silently.

Is this a problem for list columns too?

Yes, this appears to be borked on list input as well. :/ I will raise a cudf issue for it, and link this current PR as a dependent.

@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Aug 12, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Aug 23, 2021
@mythrocks
Copy link
Contributor

I have verified that #9194 fixes this.

To be safe, we should probably add a test for grouping on STRUCT columns for window functions.

rapids-bot bot pushed a commit that referenced this issue Sep 15, 2021
Fixes #8887.

#9024 added support for `STRUCT` groupby keys.
This commit adds a test for `grouped_rolling_window()` where the groupby keys are `STRUCT`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #9228
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants