[FEA] Groupby MIN/MAX with NaN values does not match what Spark expects #4753

Open
razajafri opened this issue Mar 31, 2020 · 19 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@razajafri
Contributor

Describe the bug
Running a min aggregate on a table returns the NaN value as a long numeric value instead of the literal "NaN", as it does for the other aggregates. I haven't gotten around to writing a unit test for this yet, but I can if required.

Steps/Code to reproduce bug
Create the following table

scala> spark.sql("select * from floatsAndDoubles").show
+-----+------+
|float|double|
+-----+------+
|  NaN|   NaN|
| 1.02|   NaN|
|  NaN|   4.5|
+-----+------+

Running an aggregate(min) op on the double column results in the following table:

+----------+-----------+
| float    |min(double)|
+----------+-----------+
| 1.020000 | 179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN      | 4.500000  |
+----------+-----------+

Expected behavior
It should output this

scala> spark.sql("select float, min(double) from floatsAndDoubles group by float").show
+-----+-----------+
|float|min(double)|
+-----+-----------+
| 1.02|        NaN|
|  NaN|        4.5|
+-----+-----------+

Additional context
For context, here is what aggregate(sum) does in cudf:

+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN  | NaN |
+------+-----+
@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Mar 31, 2020
@jrhemstad jrhemstad changed the title [BUG] Aggregate MIN is mucking up the NaN value on return [BUG] Groupby MIN with NaN values does not match what Spark expects Mar 31, 2020
@jrhemstad jrhemstad added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 31, 2020
@jrhemstad
Contributor

The example was very confusing until I realized you were describing a groupby aggregation.

@jrhemstad
Contributor

From #4754

scala> spark.sql("select * from floatsAndDoubles").show
+-----+------+
|float|double|
+-----+------+
|  NaN|   NaN|
| 1.02|   NaN|
|  NaN|   4.5|
+-----+------+

Running an aggregate(max) op on the double column results in the following table:

+----------+-----------+
| float    |max(double)|
+----------+-----------+
| 1.020000 | -179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN      | 4.500000  |
+----------+-----------+

Expected behavior
It should output this

"scala> spark.sql(""select float, max(double) from floatsAndDoubles group by float"").show
+-----+-----------+
|float|max(double)|
+-----+-----------+
| 1.02|        NaN|
|  NaN|        NaN|
+-----+-----------+

Additional context
For context, here is what aggregate(sum) does in cudf:

+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN  | NaN |
+------+-----+

@jrhemstad jrhemstad changed the title [BUG] Groupby MIN with NaN values does not match what Spark expects [BUG] Groupby MIN/MAX with NaN values does not match what Spark expects Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

There's no way to accomplish this behavior in a hash based groupby without significant performance loss. For min/max, we rely on CUDA atomicMin/atomicMax. There's no way to inject custom logic into these comparisons, so we would instead have to specialize floating point min/max operations to use an atomicCAS loop, which is much slower than a native min/max atomic.

Sort-based groupby uses the same aggregate_row functionality for min/max, so it has the same problem.
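
As a minimal, hypothetical sketch (not cudf's actual device code; the function name atomic_max_nan_aware and the tiny driver are made up for illustration), this is roughly what a NaN-aware atomicMax for float looks like when it has to be built on an atomicCAS retry loop, assuming Spark's ordering where NaN compares greater than every other value:

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// A NaN-aware atomic max for float, built on an atomicCAS retry loop.
// Spark orders NaN above every other value, so the comparison has to be
// hand-written; a plain hardware atomicMax on the raw value cannot express it.
__device__ float atomic_max_nan_aware(float* address, float val) {
  unsigned int* address_as_uint = reinterpret_cast<unsigned int*>(address);
  unsigned int old = *address_as_uint;
  unsigned int assumed;
  do {
    assumed = old;
    float current = __uint_as_float(assumed);
    bool const val_is_nan     = (val != val);          // NaN != NaN
    bool const current_is_nan = (current != current);
    // Spark ordering: NaN compares greater than every non-NaN value
    bool const val_wins = val_is_nan ? !current_is_nan
                                     : (!current_is_nan && val > current);
    if (!val_wins) break;                              // current winner stays
    old = atomicCAS(address_as_uint, assumed, __float_as_uint(val));
  } while (assumed != old);                            // lost a race; retry
  return __uint_as_float(old);
}

__global__ void max_kernel(float const* in, int n, float* out) {
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomic_max_nan_aware(out, in[i]);
}

int main() {
  float h_in[] = {NAN, 1.02f, 4.5f};   // the float column from the example
  float h_out  = -INFINITY;            // identity value for max
  float *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, &h_out, sizeof(float), cudaMemcpyHostToDevice);
  max_kernel<<<1, 32>>>(d_in, 3, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("max = %f\n", h_out);         // prints nan under the Spark ordering
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

Every update becomes a read/compare/CAS loop that can retry under contention, instead of a single hardware min/max atomic, which is where the performance loss comes from.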

@jrhemstad
Contributor

Similar to #4752, I'm marking this as a feature request rather than a bug.

@jrhemstad jrhemstad added feature request New feature or request and removed bug Something isn't working labels Mar 31, 2020
@jrhemstad jrhemstad changed the title [BUG] Groupby MIN/MAX with NaN values does not match what Spark expects [FEA] Groupby MIN/MAX with NaN values does not match what Spark expects Mar 31, 2020
@OlivierNV
Contributor

There's no way to accomplish this behavior in a hash based groupby without significant performance loss. For min/max, we rely on CUDA atomicMin/atomicMax.

I think the first part of that statement is only a consequence of the second part. A reduction not based on atomics (like in ORC/Parquet stats) should be just as fast as, if not faster than, an atomic-based implementation; it may just require a small second, low-utilization pass to aggregate the partial results.

@jrhemstad
Contributor

I think the first part of that statement is only a consequence of the second part. A reduction not based on atomics (like in ORC/Parquet stats) should be just as fast as, if not faster than, an atomic-based implementation; it may just require a small second, low-utilization pass to aggregate the partial results.

A groupby reduction (reduce by key) and a column-level reduction are very different things.

The implementation of groupby that uses a hash table requires the use of atomics.

@kuhushukla
Contributor

kuhushukla commented Jun 2, 2020

To add here (and possibly this is already known): if an aggregation does not have a grouping key, operations like max() still give a different result than Spark's implementation.

@jrhemstad
Contributor

if an aggregation does not have a grouping key

What does that mean? Is that just a column-level reduction?

@kuhushukla
Contributor

Is that just a column-level reduction?

Yes

@jrhemstad
Contributor

Is that just a column-level reduction?

Yes

Based on discussion in #4760, I believe that is a situation where Spark will need to do additional pre-processing to satisfy Spark's requirements.

@kuhushukla
Contributor

Spark will need to do additional pre-processing to satisfy Spark's requirements.

Other than not running on the GPU, I am not sure how aggregates/reductions can be pre-processed to handle NaNs. @revans2, am I missing something here?

@jlowe
Member

jlowe commented Jun 2, 2020

Some simple reductions like min/max/sum/etc. should be straightforward. We would just need to replace NaNs with the desired value if the reduction with NaNs doesn't result in NaN. For example, if we're doing a max reduction, we can check whether there's a NaN anywhere and, if so, just return NaN; otherwise do the reduction (via copy_if_else). If we're doing a min, we can just replace NaNs with null or filter them out completely.

Aggregations might be handled similarly, but I suspect there are some aggregations we won't be able to pre-process properly.
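
As a rough illustration of the pre-processing idea for a max reduction, here is a hypothetical sketch in raw Thrust rather than the cudf reduce/copy_if_else APIs, assuming Spark's rule that NaN is larger than any other value; the predicate name is_nan_pred and the example column are made up for illustration:

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>
#include <thrust/extrema.h>
#include <cmath>
#include <cstdio>
#include <vector>

// NaN is the only value that is not equal to itself
struct is_nan_pred {
  __host__ __device__ bool operator()(double x) const { return x != x; }
};

int main() {
  // the double column from the example above: {NaN, NaN, 4.5}
  std::vector<double> h = {std::nan(""), std::nan(""), 4.5};
  thrust::device_vector<double> col(h.begin(), h.end());

  // pre-processing step: does the column contain any NaN at all?
  bool const has_nan =
      thrust::any_of(thrust::device, col.begin(), col.end(), is_nan_pred{});

  // Spark treats NaN as the largest value, so any NaN makes the max NaN;
  // otherwise an ordinary max reduction gives the right answer
  double const result = has_nan
      ? std::nan("")
      : *thrust::max_element(thrust::device, col.begin(), col.end());

  printf("max = %f\n", result);  // prints nan for this input
  return 0;
}
```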

@kuhushukla
Contributor

Thanks @jlowe, sounds like a plan. I will follow up on the changes we need on our side. Thanks @jrhemstad.

@razajafri
Contributor Author

@kuhushukla any update on this?

@kuhushukla
Contributor

No update at the moment

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor

revans2 commented Aug 11, 2021

I believe that this is still needed. We already have NaN equality configs for collect_set and merge_set operations; this feels similar. To be clear, we could make the proposal in #4753 (comment) work for some cases, but not all, and in the cases where we can make it work it feels unreasonable to do so.

Instead of doing a single reduction we would have to:

  1. Call is_nan.
  2. Do an any reduction to see if we have to go any further.
  3. If there are any NaNs, do an all reduction.
  4. If they are all NaNs, stop, because the result will be NaN.
  5. If they are not all NaNs, call replace_nans to replace the NaNs with a min or max value, depending on the operation being done.

This makes the code a little more complicated, but we can isolate it and it is not likely to cause too many issues.
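
A minimal sketch of those five steps for a min reduction, written with raw Thrust standing in for the libcudf calls the list refers to (is_nan, the any/all reductions, replace_nans); the use of std::numeric_limits<double>::max() as the replacement value and the example column are illustrative assumptions:

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>
#include <thrust/replace.h>
#include <thrust/extrema.h>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

// NaN is the only value that is not equal to itself
struct nan_pred {
  __host__ __device__ bool operator()(double x) const { return x != x; }
};

int main() {
  // example column: {NaN, 1.02, NaN}
  std::vector<double> h = {std::nan(""), 1.02, std::nan("")};
  thrust::device_vector<double> col(h.begin(), h.end());

  double result;
  // steps 1-2: is_nan plus an any reduction
  if (!thrust::any_of(thrust::device, col.begin(), col.end(), nan_pred{})) {
    // fast path: no NaNs, plain min reduction
    result = *thrust::min_element(thrust::device, col.begin(), col.end());
  } else if (thrust::all_of(thrust::device, col.begin(), col.end(), nan_pred{})) {
    // steps 3-4: every value is NaN, so the min is NaN
    result = std::nan("");
  } else {
    // step 5: replace NaNs with +max so they can never win the min,
    // then run the ordinary min reduction
    thrust::replace_if(thrust::device, col.begin(), col.end(), nan_pred{},
                       std::numeric_limits<double>::max());
    result = *thrust::min_element(thrust::device, col.begin(), col.end());
  }
  printf("min = %f\n", result);  // prints 1.020000 for this input
  return 0;
}
```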

For group by aggregations it would be much more complicated. We could do it, but telling on a per-group basis whether we need to replace a NaN is going to require a group by aggregation to check the same any/all conditions and then a way to put that result back with the original input data. That means either doing a join with the original input data on the grouping keys or sorting the data. Both of those would have very large performance penalties, even in the common case where there are no NaN values.
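
To make that cost concrete, here is a hypothetical Thrust sketch of just the extra per-group bookkeeping: sort the rows by key, then run reduce_by_key passes to learn, for every group, whether it contains any NaNs and whether it is all NaNs, before the real min/max aggregation can even start. Integer keys and the small example column are assumptions made to keep the sketch simple:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/iterator/constant_iterator.h>
#include <cmath>
#include <cstdio>
#include <vector>

// 1 if the value is NaN, 0 otherwise, so the flags can be summed per group
struct nan_flag {
  __host__ __device__ int operator()(double x) const { return x != x ? 1 : 0; }
};

int main() {
  // small example: group 1 = {NaN, 4.5}, group 2 = {NaN}
  std::vector<int>    hk = {1, 2, 1};
  std::vector<double> hv = {std::nan(""), std::nan(""), 4.5};
  thrust::device_vector<int>    keys(hk.begin(), hk.end());
  thrust::device_vector<double> vals(hv.begin(), hv.end());
  int const n = static_cast<int>(keys.size());

  // extra pass 1: sort rows by key so each group is contiguous
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

  // extra pass 2: per-group NaN counts and row counts
  thrust::device_vector<int> flags(n);
  thrust::transform(vals.begin(), vals.end(), flags.begin(), nan_flag{});

  thrust::device_vector<int> out_keys(n), nan_counts(n), row_counts(n);
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), flags.begin(),
                                    out_keys.begin(), nan_counts.begin());
  thrust::reduce_by_key(keys.begin(), keys.end(),
                        thrust::constant_iterator<int>(1),
                        out_keys.begin(), row_counts.begin());

  int const num_groups = static_cast<int>(ends.first - out_keys.begin());
  for (int g = 0; g < num_groups; ++g) {
    int const key   = out_keys[g];
    int const nan_c = nan_counts[g];
    int const row_c = row_counts[g];
    // groups that are all NaN must produce NaN; groups with some NaNs need
    // their NaNs replaced before the real min/max aggregation runs
    printf("key=%d has_nan=%d all_nan=%d\n", key, nan_c > 0, nan_c == row_c);
  }
  return 0;
}
```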

For windowing, which we now also need to support, there is no way to do this. A single row can contribute to multiple separate aggregations: some of them may have all NaN values, while others may have only a few NaNs. There is no way for us to fix up the input data to work around this.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Nov 29, 2021
@GregoryKimball
Contributor

GregoryKimball commented Feb 13, 2023

Based on some offline discussion, we agreed to keep this issue open. The NaN value issue impacts hash-based groupby aggregations with a float key type. We expect float types to be rarely used as groupby keys, and Spark-RAPIDS has workarounds that are adequate for now.
