Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Incorrect result for COUNT_VALID/COUNT_ALL in rolling_window #6343

Closed
mythrocks opened this issue Sep 29, 2020 · 0 comments · Fixed by #6344
Closed

[BUG] Incorrect result for COUNT_VALID/COUNT_ALL in rolling_window #6343

mythrocks opened this issue Sep 29, 2020 · 0 comments · Fixed by #6344
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@mythrocks
Copy link
Contributor

mythrocks commented Sep 29, 2020

In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.

As it currently stands, the COUNT* rolling_window() operation returns null if even one of its input rows is null. That behaviour, while correct for aggregations like SUM, is incorrect for COUNT.

E.g. Consider a vector with all nulls:

[null, null, null, null, null]
  1. COUNT_ALL with window (preceding=2, following=1, min_periods=1) should yield [2, 3, 3, 3, 2], not [null, null, null, null, null].
  2. COUNT_ALL with window (preceding=2, following=1, min_periods=3) should yield [null, 3, 3, 3, null], not [null, null, null, null, null], because min_periods is not met at either end of the vector.
  3. COUNT_VALID with window (preceding=2, following=1, min_periods=1) should yield [0, 0, 0, 0, 0], not [null, null, null, null, null].
  4. COUNT_VALID with window (preceding=2, following=1, min_periods=3) should yield [null, 0, 0, 0, null], not [null, null, null, null, null], because min_periods is not met at either end of the vector.

The following test should illustrate the expected results, and serve to reproduce the erroneous output:

TEST_F(BorkedCountWindowTest, TestCount)
{
  using T = int32_t;
  using namespace cudf::test;

  auto const input_col = fixed_width_column_wrapper<T>{
    {0, 1, 2, 3, 4, 5}, 
    {1, 0, 1, 1, 1, 1}}.release();

  auto const input_size    = input_col->size();

  auto const preceding   = 2;
  auto const following   = 1;
  auto const min_periods = 3;

  expect_columns_equal(
    cudf::rolling_window(input_col->view(),
                         preceding,
                         following,
                         min_periods,
                         cudf::make_count_aggregation(cudf::null_policy::INCLUDE))->view(),
    fixed_width_column_wrapper<T>{
      {-1, 3, 3, 3, 3, -1},
      { 0, 1, 1, 1, 1,  0}
    });

  expect_columns_equal(
    cudf::rolling_window(input_col->view(),
                         preceding,
                         following,
                         min_periods,
                         cudf::make_count_aggregation(cudf::null_policy::EXCLUDE))->view(),
    fixed_width_column_wrapper<T>{
      {-1, 2, 2, 3, 3, -1},
      { 0, 1, 1, 1, 1,  0}
    });
}

It would be good to have a Pandas perspective on this assessment.

@mythrocks mythrocks added bug Something isn't working Needs Triage Need team to review and classify labels Sep 29, 2020
@mythrocks mythrocks self-assigned this Sep 29, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Oct 8, 2020
harrism pushed a commit that referenced this issue Oct 13, 2020
Closes #6343. Fixes COUNT_ALL, COUNT_VALID for window functions. In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants