[FEA] Groupby MIN/MAX with NaN values does not match what Spark expects #4753
The example was very confusing until I realized you were describing a groupby aggregation.
From #4754
Running an aggregate(max) op on the double column will result in the following table
Expected behavior
Additional context
There's no way to accomplish this behavior in a hash-based groupby without significant performance loss. For min/max we rely on CUDA atomics, and sort-based groupby uses the same comparison semantics.
Similar to #4752, I'm marking this as a feature request rather than a bug.
I think the first part of that statement is only a consequence of the second part. A reduction not based on atomics (like in ORC/Parquet stats) should be just as fast, if not faster, than an atomic-based implementation; it may just require a tiny second, low-utilization pass to aggregate the multiple partial results.
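Purely as a hypothetical illustration of that two-pass shape (this is not cudf or plugin code, and it ignores NaN handling entirely; the function name and block size are invented):

```scala
// Hypothetical sketch of a non-atomic, two-pass max reduction:
// each "block" computes an independent partial max, and a tiny second
// pass reduces the handful of partial results.
def twoPassMax(values: Array[Double], blockSize: Int = 4096): Double = {
  require(values.nonEmpty, "cannot reduce an empty input")
  // First pass: independent partial reductions, no atomics needed.
  val partials = values.grouped(blockSize).map(_.max).toArray
  // Second, low-utilization pass over the partial results.
  partials.max
}
```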
A groupby reduction (reduce by key) and a column-level reduction are very different things. The implementation of groupby that uses a hash table requires the use of atomics.
To add here, and possibly this is already known: if an aggregation does not have a grouping key, operations like max() still give a different result than Spark's implementation.
What does that mean? Is that just a column-level reduction?
Yes
Based on discussion in #4760, I believe that is a situation where Spark will need to do additional pre-processing to satisfy Spark's requirements.
Other than not running on the GPU, I am not sure how aggregates/reductions can be pre-processed to handle NaNs. @revans2, am I missing something here?
Some simple reductions like min/max/sum/etc. should be straightforward. We would just need to replace NaNs with the desired value whenever the reduction with NaNs would not result in NaN. For example, if we're doing a max reduction then we can check if there's a NaN anywhere and, if so, just return NaN; otherwise do the reduction (via copy_if_else). If we're doing a min then we can just replace NaNs with null or filter them out completely. Aggregations might be able to be handled similarly, but I suspect there are some aggregations we won't be able to pre-process properly.
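As a rough, hypothetical sketch of the control flow being proposed for column-level reductions (plain Scala collections stand in for cudf columns; the function names are invented, and a real implementation would use cudf kernels such as the NaN check and copy_if_else mentioned above):

```scala
// Hypothetical host-side sketch of the proposed pre-processing.
def maxWithSparkNanSemantics(values: Seq[Double]): Option[Double] = {
  if (values.isEmpty) None
  else if (values.exists(_.isNaN)) Some(Double.NaN) // any NaN wins a Spark max
  else Some(values.max)
}

def minWithSparkNanSemantics(values: Seq[Double]): Option[Double] = {
  // NaNs never win a Spark min, so they can be filtered out up front;
  // an all-NaN input is an edge case that still yields NaN.
  val nonNan = values.filterNot(_.isNaN)
  if (values.isEmpty) None
  else if (nonNan.isEmpty) Some(Double.NaN)
  else Some(nonNan.min)
}
```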
Thanks @jlowe, sounds like a plan. I will follow up on the changes we need on our side. Thanks @jrhemstad.
@kuhushukla any update on this?
No update at the moment.
This issue has been labeled
This issue has been labeled
I believe that this is still needed. We already have NaN equality configs for the collect_set and merge_set operations, and this feels similar. To be clear, we could make the proposal in #4753 (comment) work for some cases, but not all, and in the cases where we can make it work it feels unreasonable to do so. Instead of doing a single reduction we would have to do several operations: check whether any/all of the values are NaN, do the reduction itself while ignoring NaNs, and then choose between the two results.
This makes the code a little more complicated, but we can isolate it and it is not likely to cause too many issues.

For group-by aggregations it would be much more complicated. We could do it, but being able to tell on a per-group basis whether we need to replace a NaN would require a group-by aggregation to check the same any/all conditions and then a way to put that result back with the original input data. That means either doing a join with the original input data on the grouping keys or sorting the data. Both of those would carry very large performance penalties, even in the common case where there are no NaN values.

For windowing, which we now also need to support, there is no way to do this. A single row can contribute to multiple separate aggregations, some of which may have all of their values be NaN and some of which may have only a few NaNs. There is no way for us to fix up the input data to work around this.
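To make the per-group condition concrete, here is a hypothetical DataFrame-level sketch of the kind of "does this group contain a NaN" flag combined with a NaN-ignoring max. The column names and data are made up, and this is not what the plugin or cudf actually does; the point above is that doing the equivalent fix-up on the raw input at the cudf level would require the join or sort described.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical illustration only.
val spark = SparkSession.builder().master("local[*]").appName("nan-groupby-sketch").getOrCreate()
import spark.implicits._

val input = Seq(("a", 1.0), ("a", Double.NaN), ("b", 2.0)).toDF("key", "value")

val result = input
  .groupBy("key")
  .agg(
    // Did any row in this group contain a NaN?
    max(when(isnan(col("value")), lit(1)).otherwise(lit(0))).as("has_nan"),
    // Max over the non-NaN values only.
    max(when(isnan(col("value")), lit(null)).otherwise(col("value"))).as("max_non_nan"))
  .select(
    col("key"),
    // If the group saw a NaN, the Spark answer for max is NaN.
    when(col("has_nan") === 1, lit(Double.NaN)).otherwise(col("max_non_nan")).as("max_value"))

result.show()
```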
Based on some offline discussion, we agreed to keep this issue open. The NaN value issue impacts hash-based groupby aggregations with a
Describe the bug
Running a min aggregate on a table returns the NaN value as its long value instead of the literal "nan", as it does for the other aggregates. I haven't gotten around to writing a unit test for this, but can do so if required.
Steps/Code to reproduce bug
Create the following table
Running an aggregate(min) op on the double column will result in the following table
Expected behavior
It should output this
Additional context
For context, here is what aggregate(sum) does in cudf
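As a point of reference (not taken from the original report), a small, hypothetical Spark snippet illustrating the semantics Spark itself expects for min/max over doubles containing NaN; the data and session setup are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Spark orders NaN greater than any other double value.
val spark = SparkSession.builder().master("local[*]").appName("nan-minmax").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", Double.NaN), ("b", 2.0), ("b", 3.0)).toDF("key", "value")

df.groupBy("key").agg(min("value"), max("value")).show()
// Expected by Spark:
//   key=a -> min(value) = 1.0, max(value) = NaN  (NaN is the largest)
//   key=b -> min(value) = 2.0, max(value) = 3.0
```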