Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] collect_list (rolling_window, groupby) fails for empty input #7611

Closed
ttnghia opened this issue Mar 16, 2021 · 8 comments · Fixed by #8274
Closed

[BUG] collect_list (rolling_window, groupby) fails for empty input #7611

ttnghia opened this issue Mar 16, 2021 · 8 comments · Fixed by #8274
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. tests Unit testing for project

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Mar 16, 2021

Currently, collect_list is implemented for rolling_window and groupby. However, those implementations do not have tests for the cases when the input column(s) are empty.

For instance, below are some tests in which exception is thrown:

  • groupby test:
#define COL_K cudf::test::fixed_width_column_wrapper<int32_t, int32_t>
#define COL_V cudf::test::fixed_width_column_wrapper<TypeParam, int32_t>
#define LCL_V cudf::test::lists_column_wrapper<TypeParam, int32_t>

TYPED_TEST(CollectSetTest, EmptyLists)
{
  test_single_agg(COL_K{}, COL_V{}, COL_K{}, COL_V{}, cudf::make_collect_list_aggregation());
}
  • rolling_window test:
TYPED_TEST(TypedCollectListTest, EmptyTests)
{
  using T                 = TypeParam;
  auto const input_column = fixed_width_column_wrapper<T, int32_t>{};
  auto const prev_column  = fixed_width_column_wrapper<size_type>{};
  auto const foll_column  = fixed_width_column_wrapper<size_type>{};

  auto const result_column_based_window =
    rolling_window(input_column, prev_column, foll_column, 1, make_collect_list_aggregation());

  auto const expected_result = lists_column_wrapper<T, int32_t>{}.release();
  CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(expected_result->view(), result_column_based_window->view());
}
@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify and removed Needs Triage Need team to review and classify labels Mar 16, 2021
@ttnghia ttnghia added libcudf Affects libcudf (C++/CUDA) code. tests Unit testing for project labels Mar 16, 2021
@mythrocks
Copy link
Contributor

What does the failure manifest as?

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 17, 2021

For the first example above, here is the error message:

C++ exception with description "cuDF failure at: .../cpp/src/column/column_factories.cpp:67: 
make_empty_column is invalid to call on nested types" thrown in the test body.

@mythrocks
Copy link
Contributor

Hmm. Looks like make_empty_column() might need an update for nested types. I'm not up to date on our approach for persisting child type information all the way down. This might need some discussion on interfaces.

@mythrocks
Copy link
Contributor

Ok, here are my findings regarding rolling_window(). (I can't speak for the groupby cases just yet.)

  1. The suggestion was made that calling grouped_rolling_window() produces a single large list when called with a grouping of completely unique (non-matching) keys. I modified UnboundedPrecedingWindowMultiGroup from grouped_rolling_test to do a collect, and to use unique grouping keys. Here are the printed results:
Input gby: 0,1,2,3,4,5,6,7,8,9
Input agg: NULL,1,2,NULL,4,NULL,6,7,8,9

collect_list(agg, prec=1, foll=1, min_periods=1) yields:
List<int32_t>:
Length : 10
Offsets : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Children :
1111010110
   NULL, 1, 2, NULL, 4, NULL, 6, 7, 8, 9

The results are as expected. The assertion that this produces the wrong result is patently false.

  1. The suggestion was made that collect aggregations on empty columns in rolling window will crash the program. This isn't the case. An empty column is returned. Also, the rolling_window() code does not use make_empty_column, as far as I can see.
    There is, however, a related issue that the deep type of the empty column isn't set correctly. I've been meaning to fix that for some time. It might be a good time to address that in during 0.19 burndown.

In summary: both assertions regarding window functions are incorrect. I will raise a PR for fixing the result column type, so that this wasn't a complete waste of time.
Alas, I don't have the time to investigate the reports regarding groupby at the moment.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 17, 2021

  1. The suggestion was made that calling grouped_rolling_window() produces a single large list when called with a grouping of completely unique (non-matching) keys.

How is this apply to groupby? I observed that groupby also behaves exactly the same: produces a single list for non-repeated keys.

@ttnghia ttnghia changed the title [BUG] collect_list (rolling_window, groupby) fails for empty and trivial cases [BUG] collect_list (rolling_window, groupby) fails for empty input Mar 17, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 17, 2021

Updated the issue, only empty input actually causes an issue.

@mythrocks
Copy link
Contributor

The description is still not correct:

For instance, below are some tests in which exception is thrown:

This implies that rolling_window() throws an exception when input is empty. That is patently untrue. The resultant empty column needs its type set based on input/aggregation.

For completeness: The issue regarding non-repeated keys was pilot error. The expected result in your test case was constructed incorrectly. Both groupby and grouped_rollling_window() produce the right results.

I will raise a PR for the rolling-window's result type after higher priority tasks are addressed.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@mythrocks mythrocks self-assigned this Apr 27, 2021
rapids-bot bot pushed a commit that referenced this issue May 21, 2021
Fixes the group-by portion of #7611.

When `COLLECT_LIST()` or `COLLECT_SET()` aggregations are called on a grouped input, if the input column is empty, then one sees the following failure:
```
C++ exception with description "cuDF failure at: .../cpp/src/column/column_factories.cpp:67: 
make_empty_column is invalid to call on nested types" thrown in the test body.
```
The operation should have resulted in an empty `LIST` column. `make_empty_column()` does not support `LIST` types (in part because the `data_type` parameter does not capture the types of the child columns).

This commit fixes this by constructing the output column from the specified `values` input, but only for `COLLECT_LIST()` and `COLLECT_SET()`; other aggregation types are unchanged.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec

URL: #8279
@mythrocks mythrocks linked a pull request May 26, 2021 that will close this issue
rapids-bot bot pushed a commit that referenced this issue May 28, 2021
Fixes the rolling-window part of #7611.

All the rolling window functions return empty results when the input aggregation column is empty, just as they should. But the column types are incorrectly set to match the input type. While this is alright for `[MIN(), MAX(), LEAD(), LAG()]`, it is incorrect for some aggregations:
Aggregation   |     Input Types      |           Output Type             |
--------------|----------------------|-----------------------------------|
COUNT_VALID   | All types            | INT32                             |
COUNT_ALL     | All types            | INT32                             |
ROW_NUMBER    | All types            | INT32                             |
SUM           | Numerics (e.g. INT8) | 64-bit promoted type (e.g. INT64) |
SUM           | Chrono               | Same as input type                |
SUM           | All else             | Unsupported                       |
MEAN          | Numerics             | FLOAT64                           |
MEAN          | Chrono               | FLOAT64                           |
MEAN          | All else             | Unsupported                       |
COLLECT_LIST  | All types T          | LIST with child of type T         |

This mapping is congruent with `cudf::target_type_t` from `<cudf/detail/aggregation/aggregation.hpp>`.

This commit corrects the type of the output column that results from an empty input. It adds test for all the combinations listed above.

Note: This is dependent on #8158, and should be merged after that is committed.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #8274
nvdbaranec added a commit to nvdbaranec/cudf that referenced this issue Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. tests Unit testing for project
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants