[FEA] Groupby MIN/MAX with NaN values does not match what Spark expects #4753

Open
razajafri opened this issue Mar 31, 2020 · 19 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@razajafri
Contributor

Describe the bug
Running a min aggregate on a table returns the NaN value as a long numeric value instead of the literal "NaN", as it does for the other aggregates. I haven't gotten around to writing a unit test for this yet, but I can if required.

Steps/Code to reproduce bug
Create the following table

scala> spark.sql("select * from floatsAndDoubles").show
+-----+------+
|float|double|
+-----+------+
|  NaN|   NaN|
| 1.02|   NaN|
|  NaN|   4.5|
+-----+------+

Running an aggregate(min) op on the double column results in the following table:

+----------+-----------+
| float    |min(double)|
+----------+-----------+
| 1.020000 | 179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN      | 4.500000  |
+----------+-----------+

Expected behavior
It should output this

scala> spark.sql("select float, min(double) from floatsAndDoubles group by float").show
+-----+-----------+
|float|min(double)|
+-----+-----------+
| 1.02|        NaN|
|  NaN|        4.5|
+-----+-----------+

Additional context
For context, here is what aggregate(sum) does in cudf:

+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN  | NaN |
+------+-----+
@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Mar 31, 2020
@jrhemstad jrhemstad changed the title [BUG] Aggregate MIN is mucking up the NaN value on return [BUG] Groupby MIN with NaN values does not match what Spark expects Mar 31, 2020
@jrhemstad jrhemstad added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 31, 2020
@jrhemstad
Contributor

The example was very confusing until I realized you were describing a groupby aggregation.

@jrhemstad
Contributor

From #4754

scala> spark.sql("select * from floatsAndDoubles").show
+-----+------+
|float|double|
+-----+------+
|  NaN|   NaN|
| 1.02|   NaN|
|  NaN|   4.5|
+-----+------+

Running an aggregate(max) op on the double column results in the following table:

+----------+-----------+
| float    |max(double)|
+----------+-----------+
| 1.020000 | -179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 |
| NaN      | 4.500000  |
+----------+-----------+

Expected behavior
It should output this

"scala> spark.sql(""select float, max(double) from floatsAndDoubles group by float"").show
+-----+-----------+
|float|max(double)|
+-----+-----------+
| 1.02|        NaN|
|  NaN|        NaN|
+-----+-----------+

Additional context
For context, here is what aggregate(sum) does in cudf:

+------+-----+
|float | sum |
+------+-----+
| 1.02 | NaN |
| NaN  | NaN |
+------+-----+

@jrhemstad jrhemstad changed the title [BUG] Groupby MIN with NaN values does not match what Spark expects [BUG] Groupby MIN/MAX with NaN values does not match what Spark expects Mar 31, 2020
@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

There's no way to accomplish this behavior in a hash based groupby without significant performance loss. For min/max, we rely on CUDA atomicMin/atomicMax. There's no way to inject custom logic into these comparisons, so we would instead have to specialize floating point min/max operations to use an atomicCAS loop, which is much slower than a native min/max atomic.

Sort-based groupby uses the same aggregate_row functionality for min/max, so it has the same problem.
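
As a minimal, hypothetical sketch (not cudf's actual device code; the function name atomic_max_nan_aware and the tiny driver are made up for illustration), this is roughly what a NaN-aware atomicMax for float looks like when it has to be built on an atomicCAS retry loop, assuming Spark's ordering where NaN compares greater than every other value:

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// A NaN-aware atomic max for float, built on an atomicCAS retry loop.
// Spark orders NaN above every other value, so the comparison has to be
// hand-written; a plain hardware atomicMax on the raw value cannot express it.
__device__ float atomic_max_nan_aware(float* address, float val) {
  unsigned int* address_as_uint = reinterpret_cast<unsigned int*>(address);
  unsigned int old = *address_as_uint;
  unsigned int assumed;
  do {
    assumed = old;
    float current = __uint_as_float(assumed);
    bool const val_is_nan     = (val != val);          // NaN != NaN
    bool const current_is_nan = (current != current);
    // Spark ordering: NaN compares greater than every non-NaN value
    bool const val_wins = val_is_nan ? !current_is_nan
                                     : (!current_is_nan && val > current);
    if (!val_wins) break;                              // current winner stays
    old = atomicCAS(address_as_uint, assumed, __float_as_uint(val));
  } while (assumed != old);                            // lost a race; retry
  return __uint_as_float(old);
}

__global__ void max_kernel(float const* in, int n, float* out) {
  int const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomic_max_nan_aware(out, in[i]);
}

int main() {
  float h_in[] = {NAN, 1.02f, 4.5f};   // the float column from the example
  float h_out  = -INFINITY;            // identity value for max
  float *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, &h_out, sizeof(float), cudaMemcpyHostToDevice);
  max_kernel<<<1, 32>>>(d_in, 3, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("max = %f\n", h_out);         // prints nan under the Spark ordering
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

Every update becomes a read/compare/CAS loop that can retry under contention, instead of a single hardware min/max atomic, which is where the performance loss comes from.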

@jrhemstad
Contributor

Similar to #4752, I'm marking this as a feature request rather than a bug.

@jrhemstad jrhemstad added feature request New feature or request and removed bug Something isn't working labels Mar 31, 2020
@jrhemstad jrhemstad changed the title [BUG] Groupby MIN/MAX with NaN values does not match what Spark expects [FEA] Groupby MIN/MAX with NaN values does not match what Spark expects Mar 31, 2020
@OlivierNV
Contributor

There's no way to accomplish this behavior in a hash based groupby without significant performance loss. For min/max, we rely on CUDA atomicMin/atomicMax.

I think the first part of that statement is only a consequence of the second part. A reduction not based on atomics (like in ORC/Parquet stats) should be just as fast as, if not faster than, an atomic-based implementation; it may just require a small second, low-utilization pass to aggregate the partial results.

@jrhemstad
Contributor

I think the first part of that statement is only a consequence of the second part. A reduction not based on atomics (like in ORC/Parquet stats) should be just as fast as, if not faster than, an atomic-based implementation; it may just require a small second, low-utilization pass to aggregate the partial results.

A groupby reduction (reduce by key) and a column-level reduction are very different things.

The implementation of groupby that uses a hash table requires the use of atomics.

@kuhushukla
Contributor

kuhushukla commented Jun 2, 2020

To add here (and possibly this is already known): if an aggregation does not have a grouping key, operations like max() still give a different result than Spark's implementation.

@jrhemstad
Contributor

if an aggregation does not have a grouping key

What does that mean? Is that just a column-level reduction?

@kuhushukla
Contributor

Is that just a column-level reduction?

Yes

@jrhemstad
Contributor

Is that just a column-level reduction?

Yes

Based on discussion in #4760, I believe that is a situation where Spark will need to do additional pre-processing to satisfy Spark's requirements.

@kuhushukla
Contributor

Spark will need to do additional pre-processing to satisfy Spark's requirements.

Other than not running on the GPU, I am not sure how aggregates/reductions can be pre-processed to handle NaNs. @revans2, am I missing something here?

@jlowe
Member

jlowe commented Jun 2, 2020

Some simple reductions like min/max/sum/etc. should be straightforward. We would just need to replace NaNs with the desired value if the reduction with NaNs doesn't result in NaN. For example, if we're doing a max reduction, we can check whether there's a NaN anywhere and, if so, just return NaN; otherwise do the reduction (via copy_if_else). If we're doing a min, we can just replace NaNs with null or filter them out completely.

Aggregations might be handled similarly, but I suspect there are some aggregations we won't be able to pre-process properly.
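
As a rough illustration of the pre-processing idea for a max reduction, here is a hypothetical sketch in raw Thrust rather than the cudf reduce/copy_if_else APIs, assuming Spark's rule that NaN is larger than any other value; the predicate name is_nan_pred and the example column are made up for illustration:

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>
#include <thrust/extrema.h>
#include <cmath>
#include <cstdio>
#include <vector>

// NaN is the only value that is not equal to itself
struct is_nan_pred {
  __host__ __device__ bool operator()(double x) const { return x != x; }
};

int main() {
  // the double column from the example above: {NaN, NaN, 4.5}
  std::vector<double> h = {std::nan(""), std::nan(""), 4.5};
  thrust::device_vector<double> col(h.begin(), h.end());

  // pre-processing step: does the column contain any NaN at all?
  bool const has_nan =
      thrust::any_of(thrust::device, col.begin(), col.end(), is_nan_pred{});

  // Spark treats NaN as the largest value, so any NaN makes the max NaN;
  // otherwise an ordinary max reduction gives the right answer
  double const result = has_nan
      ? std::nan("")
      : *thrust::max_element(thrust::device, col.begin(), col.end());

  printf("max = %f\n", result);  // prints nan for this input
  return 0;
}
```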

@kuhushukla
Contributor

Thanks @jlowe, sounds like a plan. I will follow up on the changes we need on our side. Thanks @jrhemstad.

@razajafri
Contributor Author

@kuhushukla any update on this?

@kuhushukla
Contributor

No update at the moment

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor

revans2 commented Aug 11, 2021

I believe that this is still needed. We already have NaN equality configs for collect_set and merge_set operations; this feels similar. To be clear, we could make the proposal in #4753 (comment) work for some cases, but not all, and in the cases where we can make it work it feels unreasonable to do so.

Instead of doing a single reduction we would have to:

  1. Call is_nan.
  2. Do an any reduction to see if we have to go any further.
  3. If there are any NaNs, do an all reduction.
  4. If they are all NaNs, stop, because the result will be NaN.
  5. If they are not all NaNs, call replace_nans to replace the NaNs with a min or max value, depending on the operation being done.

This makes the code a little more complicated, but we can isolate it and it is not likely to cause too many issues.
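
A minimal sketch of those five steps for a min reduction, written with raw Thrust standing in for the libcudf calls the list refers to (is_nan, the any/all reductions, replace_nans); the use of std::numeric_limits<double>::max() as the replacement value and the example column are illustrative assumptions:

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>
#include <thrust/replace.h>
#include <thrust/extrema.h>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

// NaN is the only value that is not equal to itself
struct nan_pred {
  __host__ __device__ bool operator()(double x) const { return x != x; }
};

int main() {
  // example column: {NaN, 1.02, NaN}
  std::vector<double> h = {std::nan(""), 1.02, std::nan("")};
  thrust::device_vector<double> col(h.begin(), h.end());

  double result;
  // steps 1-2: is_nan plus an any reduction
  if (!thrust::any_of(thrust::device, col.begin(), col.end(), nan_pred{})) {
    // fast path: no NaNs, plain min reduction
    result = *thrust::min_element(thrust::device, col.begin(), col.end());
  } else if (thrust::all_of(thrust::device, col.begin(), col.end(), nan_pred{})) {
    // steps 3-4: every value is NaN, so the min is NaN
    result = std::nan("");
  } else {
    // step 5: replace NaNs with +max so they can never win the min,
    // then run the ordinary min reduction
    thrust::replace_if(thrust::device, col.begin(), col.end(), nan_pred{},
                       std::numeric_limits<double>::max());
    result = *thrust::min_element(thrust::device, col.begin(), col.end());
  }
  printf("min = %f\n", result);  // prints 1.020000 for this input
  return 0;
}
```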

For group by aggregations it would be much more complicated. We could do it, but telling on a per-group basis whether we need to replace a NaN is going to require a group by aggregation to check the same any/all conditions and then a way to put that result back with the original input data. That means either doing a join with the original input data on the grouping keys or sorting the data. Both of those would have very large performance penalties, even in the common case where there are no NaN values.
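
To make that cost concrete, here is a hypothetical Thrust sketch of just the extra per-group bookkeeping: sort the rows by key, then run reduce_by_key passes to learn, for every group, whether it contains any NaNs and whether it is all NaNs, before the real min/max aggregation can even start. Integer keys and the small example column are assumptions made to keep the sketch simple:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/iterator/constant_iterator.h>
#include <cmath>
#include <cstdio>
#include <vector>

// 1 if the value is NaN, 0 otherwise, so the flags can be summed per group
struct nan_flag {
  __host__ __device__ int operator()(double x) const { return x != x ? 1 : 0; }
};

int main() {
  // small example: group 1 = {NaN, 4.5}, group 2 = {NaN}
  std::vector<int>    hk = {1, 2, 1};
  std::vector<double> hv = {std::nan(""), std::nan(""), 4.5};
  thrust::device_vector<int>    keys(hk.begin(), hk.end());
  thrust::device_vector<double> vals(hv.begin(), hv.end());
  int const n = static_cast<int>(keys.size());

  // extra pass 1: sort rows by key so each group is contiguous
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

  // extra pass 2: per-group NaN counts and row counts
  thrust::device_vector<int> flags(n);
  thrust::transform(vals.begin(), vals.end(), flags.begin(), nan_flag{});

  thrust::device_vector<int> out_keys(n), nan_counts(n), row_counts(n);
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), flags.begin(),
                                    out_keys.begin(), nan_counts.begin());
  thrust::reduce_by_key(keys.begin(), keys.end(),
                        thrust::constant_iterator<int>(1),
                        out_keys.begin(), row_counts.begin());

  int const num_groups = static_cast<int>(ends.first - out_keys.begin());
  for (int g = 0; g < num_groups; ++g) {
    int const key   = out_keys[g];
    int const nan_c = nan_counts[g];
    int const row_c = row_counts[g];
    // groups that are all NaN must produce NaN; groups with some NaNs need
    // their NaNs replaced before the real min/max aggregation runs
    printf("key=%d has_nan=%d all_nan=%d\n", key, nan_c > 0, nan_c == row_c);
  }
  return 0;
}
```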

For windowing, which we now also need to support, there is no way to do this. A single row can contribute to multiple separate aggregations: some of them may have all NaN values, while others may have only a few NaNs. There is no way for us to fix up the input data to work around this.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Nov 29, 2021
@GregoryKimball
Contributor

GregoryKimball commented Feb 13, 2023

Based on some offline discussion, we agreed to keep this issue open. The NaN value issue impacts hash-based groupby aggregations with a float key type. We expect float types to be rarely used as groupby keys, and Spark-RAPIDS has workarounds that are adequate for now.
