Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan #9864

Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Dec 8, 2021

When finding min, max, arg_min and arg_max for structs in groupby, groupby scan, reduction and inclusive_scan operations, null struct rows should be excluded from the operation (but the null rows of its children column are not). The current implementation for structs wrongly includes nulls at all levels, producing wrong results for min and arg_min operations.

This PR fixes that. In particular, null rows at the children levels are still being handled by the old way (nulls are smaller than non-null elements), but handling nulls at the top parent column level is modified such that:

  • nulls are considered as larger than all other non-null rows, if finding for min and arg_min, or
  • nulls are considered as smaller than all other non-null rows, if finding for max and arg_max.

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Dec 8, 2021
@ttnghia ttnghia self-assigned this Dec 8, 2021
@ttnghia ttnghia changed the title Fix null order for structs min, max, arg_min and arg_max in groupby, groupby scan, reduction, and inclusive_scan Fix null order for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan Dec 8, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@ttnghia ttnghia marked this pull request as ready for review December 8, 2021 03:29
@ttnghia ttnghia requested a review from a team as a code owner December 8, 2021 03:29
@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@jrhemstad
Copy link
Contributor

However, when finding min struct elements, null are considered last after all non-null elements.

Says who?

@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 8, 2021

However, when finding min struct elements, null are considered last after all non-null elements.

Says who?

Says Spark:

scala> df.show
+-----+
|value|
+-----+
| null|
|    1|
| null|
|    2|
+-----+


scala> df.sort("value").show
+-----+
|value|
+-----+
| null|
| null|
|    1|
|    2|
+-----+


scala> df.selectExpr("min(value)").show
+----------+
|min(value)|
+----------+
|         1|
+----------+

And struct:

scala> df.show
+---+------------+                                                              
|  b|           a|
+---+------------+
|  1|        null|
|  1|{null, null}|
|  1|      {1, 2}|
|  1|   {null, 1}|
|  1|   {1, null}|
|  1|      {1, 1}|
|  2|        null|
+---+------------+

scala> df.selectExpr("min(a)").show()
+------------+
|      min(a)|
+------------+
|{null, null}|
+------------+

@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@jrhemstad
Copy link
Contributor

Says Spark:

libcudf is not Spark.

Here's what the docs say happen with nulls in reduce and scan:

* The null values are skipped for the operation.

* The null values are skipped for the operation, and if an input element
* at `i` is null, then the output element at `i` will also be null.

@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 8, 2021

Here's what the docs say happen with nulls in reduce and scan:

Got it. I'll update the PR. Thanks.
The same results will be achieved, but we will look at the situation differently.

Update: Done.

@ttnghia ttnghia changed the title Fix null order for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan Dec 8, 2021
@ttnghia ttnghia changed the title Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan Fix null handling for structs min, max, arg_min and arg_max in groupby, groupby scan, reduction, and inclusive_scan Dec 8, 2021
@ttnghia ttnghia changed the title Fix null handling for structs min, max, arg_min and arg_max in groupby, groupby scan, reduction, and inclusive_scan Fix null handling for structs min and arg_min in groupby, groupby scan, reduction, and inclusive_scan Dec 8, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 8, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 9, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 9, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 14, 2021
Comment on lines 36 to 37
row_arg_minmax_fn(size_type const num_rows,
table_device_view const& table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is num_rows here the same as table.num_rows(). table_device_view::num_rows() is callable on the host, so you could drop the parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cpp/src/groupby/sort/group_scan_util.cuh Outdated Show resolved Hide resolved
Comment on lines 193 to 195
auto constexpr is_min_op = K == aggregation::MIN;
auto const binop_generator =
cudf::reduction::detail::comparison_binop_generator(values, is_min_op, stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a little cleaner to pass the aggregation to the comparison_binop_generator and have it do the is_min_op logic internally. That way the user doesn't have to type it out every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@rapidsai rapidsai deleted a comment from codecov bot Dec 15, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 15, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 15, 2021
"" /*NULL*/,
"" /*NULL*/,
"" /*NULL*/},
null_at(2)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The null_at(2) does not seem to match the NULL comments above. Perhaps I'm reading this wrong.

Copy link
Contributor Author

@ttnghia ttnghia Dec 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That child column has null at idx 2. The following nulls are at the parent column which are specified below (nulls_at({3, 4, ...})). Sorry for the confusion.
Edit: I have added more comments to clarify it.

@codecov

This comment has been minimized.

@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 16, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b8f812a into rapidsai:branch-22.02 Dec 16, 2021
@ttnghia ttnghia deleted the fix_null_order_in_structs_min_max branch December 16, 2021 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants