
[DISCUSSION] Behavior for NaN comparisons in libcudf #4760

Closed
jrhemstad opened this issue Mar 31, 2020 · 45 comments
Labels
libcudf — Affects libcudf (C++/CUDA) code.
Python — Affects Python cuDF API.
Spark — Functionality that helps Spark RAPIDS.

Comments

@jrhemstad
Contributor

jrhemstad commented Mar 31, 2020

Recent issues (#4753, #4752) have called into question how libcudf handles NaN floating point values. We've only ever addressed this on an ad hoc basis, as opposed to having a larger conversation about the issue.

C++

C++ follows the IEEE 754 standard for floating point values, which for comparisons with NaN has the following behavior:

| Comparison | NaN ≥ x | NaN ≤ x | NaN > x | NaN < x | NaN = x | NaN ≠ x |
| --- | --- | --- | --- | --- | --- | --- |
| Result | Always False | Always False | Always False | Always False | Always False | Always True |

https://en.wikipedia.org/wiki/NaN
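
To make the table concrete, here is a minimal, standalone C++ check of those semantics (plain standard C++, not libcudf code):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

int main() {
  double const nan = std::numeric_limits<double>::quiet_NaN();
  double const x   = 42.0;

  assert(!(nan >= x) && !(nan <= x));   // all ordered comparisons are false
  assert(!(nan > x)  && !(nan < x));
  assert(!(nan == x) && !(nan == nan)); // NaN is not equal to anything, itself included
  assert((nan != x)  && (nan != nan));  // only != is always true
  assert(std::isnan(nan));              // the reliable way to detect NaN
}
```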

Spark

Spark is non-conforming with the IEEE 754 standard:

| Comparison | NaN ≥ x | NaN ≤ x | NaN > x | NaN < x | NaN = x | NaN ≠ x |
| --- | --- | --- | --- | --- | --- | --- |
| Result | Always True | False unless x is NaN | Always True | Always False | True only if x is NaN | True unless x is NaN |

See https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics

Python/Pandas

Python is a bit of a grey area because prior to 1.0, Pandas did not have the concept of "null" values and used NaNs in their stead.

In most regards, Python does respect IEEE 754. For example, see how numpy conforms with the expected IEEE754 behavior in binary ops #4752 (comment) (where Spark does not).

However, there are some cases where Pandas is non-conforming due to the pseudo-null behavior. For example, in sort_values there is a na_position argument to control where NaN values are placed. This requires specializing the libcudf comparator used for sorting to special case floating point values and deviate from the IEEE 754 behavior of NaN < x == false and NaN > x == false. See #2191 and #3226 where this was done previously.
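
For illustration, here is a plain C++ sketch of the kind of comparator specialization being described, assuming na_position="last"-style behavior (NaNs sort after everything else); this is not the actual libcudf comparator:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical NaN-aware less-than: restores a strict weak ordering by always
// treating NaN as the greatest value, deviating from IEEE 754 where both
// NaN < x and x < NaN are false.
bool nan_last_less(double lhs, double rhs) {
  if (std::isnan(lhs)) return false;  // NaN is never "less", so it sinks to the end
  if (std::isnan(rhs)) return true;   // any real number is less than NaN
  return lhs < rhs;                   // ordinary IEEE comparison otherwise
}

int main() {
  std::vector<double> col{2.0, std::nan(""), 0.0};
  std::sort(col.begin(), col.end(), nan_last_less);  // -> {0.0, 2.0, NaN}
}
```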

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls, but @shwina @kkraus14 will need to confirm. Prior to Pandas 1.0, it wasn't possible to have both NaN and NULL values in a floating point column. We should see what the expected behavior of NaNs vs. nulls will be in 1.0.

Discussion

We need to have a conversation and make decisions on what libcudf will and will not do with respect to NaN behavior.

My stance is that libcudf should adhere to IEEE 754. Spark's semantics redefine a core concept of the C++ language/IEEE standard and satisfying those semantics would require extremely invasive changes that negatively impact both performance and code maintainability.

Even worse, because Spark differs from C++/Pandas, we need to provide separate code paths for all comparison based operations: a "Spark" path, and a "C++/Pandas" path. This further increases code bloat and maintenance costs.

Furthermore, for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.

In conclusion, we already have special logic for handling NULLs everywhere in libcudf. Users should leverage that logic by converting NaNs to NULLs. I understand that vanilla Spark treats NaNs and NULLs independently, but I believe trying to imitate that behavior in libcudf comes at too high a cost.

@jrhemstad added the libcudf, Python, and Spark labels on Mar 31, 2020
@kkraus14
Collaborator

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls, but @shwina @kkraus14 will need to confirm. Prior to Pandas 1.0, it wasn't possible to have both NaN and NULL values in a floating point column. We should see what the expected behavior of NaNs vs. nulls will be in 1.0.

They only have nullable integer and boolean columns as of 1.0. For floating point columns they only have NaNs, and there's an ongoing discussion about whether they want to support nulls vs. NaNs for floating point columns. pandas-dev/pandas#32265

@shwina
Contributor

shwina commented Mar 31, 2020

I'm strongly in favour of libcudf following IEEE 754 for NaN behaviour, while offering a good deal of flexibility for null behaviour. There should be no question about what the behaviour of NaN should be - it's defined by the standard. However, null can be interpreted differently in different contexts and in different platforms.

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls

This is definitely accurate for Pandas < 1.0, i.e., before the introduction of pd.NA. To match Pandas behaviour, we can convert NaNs in the input to null, and use libcudf's flexibility to accommodate Pandas' quirks with regard to NaNs.

For Pandas >= 1.0, it's still not clear (to me, at least), how Pandas treats NaN differently from pd.NA in algorithms like join, groupby, etc., but I strongly believe that should not influence libcudf handling of NaN.

@OlivierNV
Contributor

ORC and Parquet both have very clearly defined behavior regarding NaNs, so IO readers/writers will presumably stick with following the corresponding specification in that regard (forcing NaNs to be NULLs would be a problem).

@jrhemstad
Contributor Author

ORC and Parquet both have very clearly defined behavior regarding NaNs, so IO readers/writers will presumably stick with following the corresponding specification in that regard (forcing NaNs to be NULLs would be a problem).

Do ORC/Parquet have any behavior that doesn't conform with IEEE 754? If not, then there's no issue.

To be clear, I'm not suggesting NaN's be always converted to Nulls. Instead, if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

@jlowe
Member

jlowe commented Mar 31, 2020

if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

This has come up a number of times, and as discussed previously it is not a workable solution. NaNs are distinct from nulls in Spark, and therefore NaNs cannot be converted to nulls without losing information. Consider a column that contains both NaNs and nulls, and I need it to be sorted, aggregated, whatever. As soon as I change the NaNs to nulls and operate on it, I have lost the ability to recover which nulls are really proxies for NaNs so they can be converted back to produce a proper answer. This also makes the assumption that how nulls are handled matches the behavior desired for NaNs, which is incorrect in many cases (e.g., how Spark sorts nulls vs. NaNs).

@kkraus14
Collaborator

if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

This has come up a number of times, and as discussed previously it is not a workable solution. NaNs are distinct from nulls in Spark, and therefore NaNs cannot be converted to nulls without losing information. Consider a column that contains both NaNs and nulls, and I need it to be sorted, aggregated, whatever. As soon as I change the NaNs to nulls and operate on it, I have lost the ability to recover which nulls are really proxies for NaNs so they can be converted back to produce a proper answer. This also makes the assumption that how nulls are handled matches the behavior desired for NaNs, which is incorrect in many cases (e.g., how Spark sorts nulls vs. NaNs).

Is the fact that Spark NaN semantics differs from the IEEE 754 standard something that you think could be reasonably raised upstream in the Spark community? I understand that isn't a short term path, but from the perspective of libcudf being a C++ library, this feels out of scope and will just bloat the library.

@jrhemstad any thoughts on how hard it would be to allow someone to provide their own unary / binary ops via an additional shared library? I.e., I could imagine in the future that any Spark or Pandas or other library-specific behaviors could be defined within a helper shared library that depends on libcudf and extends things as needed where it doesn't make sense to include them in libcudf directly.

@jrhemstad
Contributor Author

jrhemstad commented Mar 31, 2020

@jrhemstad any thoughts on how hard it would be to allow someone to provide their own unary / binary ops via an additional shared library?

If someone is only interfacing with the precompiled libcudf.so, then this would basically be impossible unless using the PTX UDF kind of stuff we've already experimented with in the past.

Where it would be possible is if a user interfaces directly with libcudf's header-only implementations (currently only in detail::, but that could change). In that case, we could expose function templates that allow the user to specify a custom comparator at compile time.

None of our detail:: APIs are really configured to allow customizing that comparator, but that could change. It would definitely require some decent refactoring work, but it's possible.

However, this solution would require a user to build their own library binary.
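
As a purely illustrative sketch of what such an extension point might look like (the names here are hypothetical, not an existing libcudf API): a header-only function template takes the comparator as a compile-time parameter, and the application instantiates it in code it builds itself:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical header-only extension point: the ordering policy is a template
// parameter supplied by the caller at compile time.
template <typename T, typename Comparator>
void sort_column(std::vector<T>& column, Comparator comp) {
  std::sort(column.begin(), column.end(), comp);
}

int main() {
  std::vector<double> col{1.0, std::nan(""), -3.0};
  // An application (e.g. a Spark plugin) plugs in whatever NaN ordering it
  // wants -- here, NaNs last -- without libcudf shipping that behavior.
  sort_column(col, [](double a, double b) {
    if (std::isnan(a)) return false;
    if (std::isnan(b)) return true;
    return a < b;
  });
}
```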

@jlowe
Member

jlowe commented Mar 31, 2020

Is the fact that Spark NaN semantics differs from the IEEE 754 standard something that you think could be reasonably raised upstream in the Spark community?

No, definitely not. Users do not want sorts of columns that happen to contain NaNs to sort NaNs semi-randomly (as occurs in practice with IEEE 754 NaN comparison behavior). Other databases have similar sort behaviors to Spark wrt. NaNs, and asking a very long-standing and well-documented database behavior to change is a non-starter.

I understand the desire to not juggle a bunch of conflicting requirements, but we need to find some way to meet them. As I understand it, that means we have to choose between:

  1. libcudf directly provides a way to perform this behavior
  2. libcudf provides extension points so applications can reuse most of the libcudf code but override the necessary behavior in an application-specific way
  3. application forks a custom libcudf

I really, really don't want to choose that last one. 😅

@jrhemstad
Contributor Author

jrhemstad commented Mar 31, 2020

2. libcudf provides extension points so applications can reuse most of the libcudf code but override the necessary behavior in an application-specific way

I think what you're suggesting here is the solution @kkraus14 was hinting at and what I outlined above.

This would take a bit of work to expose the necessary function templates, but still probably less than the complexity of providing two separate code paths that libcudf would currently need to provide in option 1.

@kkraus14
Collaborator

No, definitely not. Users do not want sorts of columns that happen to contain NaNs to sort NaNs semi-randomly (as occurs in practice with IEEE 754 NaN comparison behavior).

I'm all for allowing control of NaNs in sorting, because that amounts to defining otherwise undefined behavior, but these binary operations are well defined for NaN values.

@jlowe
Member

jlowe commented Mar 31, 2020

The problem is that sorting and comparison are expected to be related. If the sort operation orders NaNs as the largest values, then IMO comparing a NaN against other numbers should reflect that sort order. Since IEEE 754 NaNs never compare, I don't see how you can ever sort them. If we want to sort them in a consistent way then there needs to be a way to consistently compare them.

@jrhemstad
Contributor Author

The problem is that sorting and comparison are expected to be related.

I agree, which is why I actually suggested removing the specialization for NaNs in the sort comparator :) That way we're consistently conformant rather than a mix of non-conformance with conformance (as we've seen in Raza's issues).

@jlowe
Member

jlowe commented Mar 31, 2020

Note that removing the sort specialization doesn't really solve anything other than allowing us to say, "we're just as useful as IEEE 754 when it comes to sorting with data that may contain NaNs. Make sure you don't have NaNs in your data before calling sort/orderby or it's worthless" 😉

@jrhemstad
Contributor Author

jrhemstad commented Mar 31, 2020

Note that removing the sort specialization doesn't really solve anything other than allowing us to say, "we're just as useful as IEEE 754 when it comes to sorting with data that may contain NaNs. Make sure you don't have NaNs in your data before calling sort/orderby or it's worthless"

Like I always say, "Better to be consistently wrong than inconsistently right." 😄

@jlowe
Member

jlowe commented Mar 31, 2020

Besides Spark and Pandas, what do other apps on cudf want with respect to NaN handling? For example, curious what the Blazing group's thoughts are and if they have a proposal we haven't considered. cc: @felipeblazing

@felipeblazing

felipeblazing commented Mar 31, 2020

For Blazing it's actually pretty important that NaN comparisons be deterministic. Non-deterministic sorting could cause all kinds of issues for us. I think it's very valid for a user to want to do something like have NaN be included as one of the groups when performing a group by.

Also, how would it look to a user if they sort the list [2, 0, NaN] and see different results from one run to the next:

    [ 0 , NaN, 2 ]
    [ 0 , 2, NaN ]
    [ NaN, 0 , 2 ] 

I agree with @jrhemstad when he says it's better to be consistently wrong as opposed to following the spec. The spec seems to suck in this case where users can care if something is NaN. Right now we aren't doing anything different from what cudf is doing and we have not tested what happens when we put NaN numbers in columns. Something we will have to rectify :).

We care about consistency. We don't really care if they come before or after. I imagine that it might be most convenient to put them at the end assuming people are most likely to sort things in ascending order.

@jrhemstad
Contributor Author

I agree with @jrhemstad when he says it's better to be consistently wrong as opposed to following the spec. The spec seems to suck in this case where users can care if something is NaN

@felipeblazing I think you're gravely misunderstanding my statement 😅

I'm saying we should consistently follow the IEEE 754 standard, which we don't currently do (as evidenced by the special handling of NaNs for sorting). This is the opposite of what you seem to be advocating for.

NaNs are, by definition, objects that cannot be ordered. The expectation is that if you have NaNs in your data, you should do something about that (like filtering them out or replacing them with a sentinel value) before trying to do any operation like a sort.

@jlowe
Member

jlowe commented Apr 1, 2020

For Blazing it's actually pretty important that NaN comparisons be deterministic.

I think this is going to be true for any cudf application that needs to deal directly with NaNs instead of treating them as equivalent to nulls. By adhering to IEEE 754, NaNs become explosive values in the data, ruining sorts, aggregations, lookups, etc. in which they appear. If libcudf strictly adheres to IEEE 754 for NaNs and only exposes function templates, then I would argue most cudf applications needing to wield NaNs will have to build their own libcudf with custom function templates. That's not ideal.

I think function templates are a good way to factor the problem and minimize duplicated code paths in libcudf. However I don't think libcudf's solution should stop there when it comes to NaNs because the NaN issue is likely not limited to a small subset of cudf applications. To me this would be akin to libcudf only sorting nulls first and saying, "Do you need to sort nulls last instead sometimes? Sorry, we don't support that directly. Here's a function template interface to help when you build your own from source."

IMO it comes down to how useful libcudf's build artifacts will be for applications that need to deal with NaNs. If almost everything in libcudf doesn't behave in a useful way when NaNs appear, then there's going to be a lot of libcudf 'flavors' in the wild that are essentially identical just to solve the NaN problem. At that point it would be tempting to fork libcudf, add the necessary function templates to wield NaNs in a useful manner, and publish it as an out-of-the-box solution for apps that must deal with NaNs.

@shwina
Contributor

shwina commented Apr 1, 2020

This has been stated implicitly a couple of times, but just to be explicit: applications can still get predictable NaN behaviour with libcudf, just by converting NaNs to nulls.

For applications that need to deal with a combination of NaN and null, e.g., Spark (and also perhaps Pandas in the future), it's not clear to me what the behaviour should be. How does Spark handle sorting a column containing both NaN and null?

@kkraus14
Collaborator

kkraus14 commented Apr 1, 2020

This has been stated implicitly a couple of times, but just to be explicit: applications can still get predictable NaN behaviour with libcudf, just by converting NaNs to nulls.

The problem with this is it then becomes impossible to differentiate between NaNs and nulls.

For applications that need to deal with a combination of NaN and null, e.g., Spark (and also perhaps Pandas in the future), it's not clear to me what the behaviour should be. How does Spark handle sorting a column containing both NaN and null?

Spark currently handles it; Pandas hasn't crossed that road yet because it doesn't have nullable float columns, but it does handle sorting NaNs.

@jlowe
Member

jlowe commented Apr 1, 2020

How does Spark handle sorting a column containing both NaN and null?

NaNs are sorted relative to numeric values as explained in the PR description and documented at https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics. Null sorting relative to non-null values is treated orthogonally and can be controlled by the user per-column (nulls first or nulls last).

For example, if you want to sort the column [NaN, 5, null, +Inf, -Inf, 7, NaN, 2] in ascending order with nulls first, the result would be [null, -Inf, 2, 5, 7, +Inf, NaN, NaN].

@jlowe
Member

jlowe commented Apr 1, 2020

PostgreSQL matches Spark's behavior with respect to NaN handling, see https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-FLOAT.

I'm not sure if Blazing will choose to align, but it's clear they'll need to put NaNs somewhere deterministically and separate from nulls when it comes to sorting and grouping.

@harrism
Member

harrism commented Apr 2, 2020

If Pandas is going to start looking more like Spark for NaNs, I have to agree that the custom header version is not very useful.

Seems we should look at whether all of our clients might potentially want the same (or close to the same) thing.

And if not, we should look at how JIT might be put to use.

@jrhemstad
Contributor Author

jrhemstad commented Apr 2, 2020

PostgreSQL matches Spark's behavior with respect to NaN handling
Seems we should look at whether all of our clients might potentially want the same (or close to the same) thing.

Java follows IEEE 754

Scala follows IEEE 754.

(While researching Scala, came across a particularly nice quote: "NaN is more an abomination than a number. As far as I recall, the IEEE standard intended it as an error value that's sticky in computations. It certainly was never intended as a real value -- the only thing you should ever do on a NaN is ask it with isNaN.")

(It's wild to me that Spark deviates from its native language's behavior here ⬆️ )

R follows IEEE 754

Julia follows IEEE 754

Matlab follows IEEE 754

MySQL doesn't even allow NaNs.

It's clear that the majority of languages follow IEEE 754. For those that do not, there is no consensus on what the rules are.

Furthermore, considering the fact that all modern hardware architectures adhere to IEEE 754 (mostly; GPUs don't have signaling NaNs), I can't see Spark's NaN behavior as anything other than abhorrent.

If Pandas is going to start looking more like Spark for NaNs

If anything, I believe Pandas is moving further from Spark's NaN behavior, not closer. By adding a proper concept of a NULL, they no longer have to abuse NaN for nulls.

Pandas today is inconsistent in how it follows IEEE 754. For things like sorting, it does break from the standard and allows sorting NaNs to a consistent location (so it agrees with Spark here).

For Join, Pandas treats NaN == NaN (same as Spark).

For element-wise comparisons with NaN (like binary ops), it follows 754 whereas Spark does not.

For GroupBy, Pandas will drop NaNs in both the keys/values column. Spark does not.

It's becoming clear to me that if libcudf were to natively and wholly support Spark's NaN behavior, it would require extremely invasive changes that touch nearly every feature in libcudf. In addition, not all libcudf consumers want Spark's behavior, so we'd have to maintain redundant code paths: one for IEEE 754 behavior and one for Spark behavior.

Consider if someone decided to redefine what 0.0 meant. This means we have to remember to explicitly check for 0.0 in any/all libcudf features. Not only would that make code more complex (and slower), but there will always be situations where C++ developers will forget to check for 0.0 because it's contrary to the usual way of thinking and developing C++ code. As such, it will be a constant source of bugs and ongoing maintenance.

Spark's redefinition of NaN is just as bad as this hypothetical redefinition of 0.0.

Just as a tiny example of what I'm talking about, in C++, casting from NaN is undefined, but Spark expects it to be zero. So now any time we do a cast, we have to remember to check for NaN!

need to put NaNs somewhere deterministically and separate from nulls when it comes to sorting and grouping.

I think there's a middle ground here that minimizes the invasiveness of special casing for NaN while allowing Spark to do what it needs to do.

For example, we've identified that to Spark/Pandas/Blazing, deterministically sorting and matching NaNs in join/groupby keys is important (and libcudf already does this today). There's really no other way to emulate this behavior without libcudf natively supporting it.

In contrast, in the casting example, Spark could achieve its desired behavior by first replacing NaNs with 0.0 and then performing the cast. Is this slower? Certainly. However, since Spark's expected behavior is unique to Spark, it makes sense for Spark to pay for the cost of the added complexity and decreased performance rather than making all consumers/developers of libcudf pay this cost.

Here's my proposal.

We proceed on a case-by-case basis and identify situations where it is literally impossible for Spark to otherwise achieve their desired behavior via composition of other libcudf features, e.g., sort/join/groupby. However, in cases like casting, Spark can achieve their desired behavior by first replacing NaN with 0.

Binary ops look like another situation where Spark can do extra work to add support for its NaN behavior. One way to do it is to generate a boolean mask from is_nan to indicate where NaN values are, and then use that mask to generate the answer you expect. Alternatively, binops support JIT, so Spark can define custom binary operations as CUDA C functions (PTX not required) in the form of a string and pass those in.
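
A minimal element-wise sketch of that mask-and-patch approach in plain C++ (a column version could be composed from an is_nan mask, a normal comparison binop, and copy_if_else; the "NaN is greater than everything and equal to itself" rule is Spark's documented semantic):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: emulate Spark's NaN ordering for an element-wise '>' by computing
// the ordinary IEEE result and patching the entries where either side is NaN.
std::vector<bool> spark_greater(std::vector<double> const& lhs,
                                std::vector<double> const& rhs) {
  std::vector<bool> out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    bool const l_nan = std::isnan(lhs[i]);  // would come from an is_nan mask
    bool const r_nan = std::isnan(rhs[i]);
    out[i] = (l_nan || r_nan) ? (l_nan && !r_nan)   // NaN > anything except another NaN
                              : (lhs[i] > rhs[i]);  // ordinary IEEE comparison
  }
  return out;
}
```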

As is true of all good compromises, I think all parties will find things to be unhappy about with this solution, but I don't see another way to proceed.

@jlowe
Member

jlowe commented Apr 2, 2020

It's wild to me that Spark deviates from its native language's behavior here

Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting is comparing. PostgreSQL and Spark are self-consistent in their treatment of NaNs, and I suspect that's more useful to end-users in practice. Consistent with IEEE 754? No, but IEEE 754 is not the beginning and end of how to deal with NaNs. And I hope we can all agree IEEE 754 is very unhelpful when trying to deal with them in ETL queries. That's why it sounds like we're leaning towards libcudf breaking with that standard at least when it comes to sorting and grouping.

I'm totally cool if we can implement the required behaviors with libcudf primitives even if it takes multiple steps in some cases. For the casting issue, totally agree the Spark plugin should simply scan for NaNs and replace them ahead of the cast operation. We already do something similar for division, where Spark expects division by zero to generate null. All zero divisor values are replaced with null before calling the divide op.

So big +1 from me if there's a way to move forward without resorting to building a separate libcudf.

@kkraus14
Collaborator

kkraus14 commented Apr 2, 2020

Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting is comparing.

Pandas exposes a parameter to control NaN positioning of either first or last. This operates separately from ascending / descending sort order, which implies that the positioning of NaN values is decoupled from comparisons for this very reason.

From my perspective, a vectorized comparison operation is entirely different than an internal implementation detail of sorting comparison where it doesn't make sense to compare the two, especially when NaN handling in sorting is handled differently by different parties.

@devavret
Contributor

devavret commented Apr 2, 2020

Also, I don't know how we would enable Spark behaviours in libcudf for casting. We'd probably have to go through the same steps:

For the casting issue, totally agree the Spark plugin should simply scan for NaNs and replace them ahead of the cast operation

So it's the same operations, same performance penalty even when done from inside libcudf.

@jrhemstad
Contributor Author

jrhemstad commented Apr 2, 2020

Also, I don't know how we would enable Spark behaviours in libcudf for casting. We'd probably have to go through the same steps

You could define a custom unary operator that has a specialization for floating point types that does something like:

// Requires <cmath> and <type_traits>.
template <typename From, typename To>
std::enable_if_t<std::is_floating_point_v<From>, To> operator()(From v) {
  if (std::isnan(v))
    return To{0};
  else
    return static_cast<To>(v);
}

But this is the kind of over specialization that I really don't want libcudf to have to do.

@wmalpica

wmalpica commented Apr 3, 2020

I feel like we can have our cake and eat it too. The IEEE standard can apply to the standard comparison operators, and it should, for example when you do a WHERE clause in SQL. But it should not apply to the comparison operator that we use in something like a sort. The comparison operator in a sort does not have to be a standard <; it just needs to be a strict weak ordering comparator, which it seems the IEEE standard < does not satisfy.

So when I say we can have our cake and eat it too: we have IEEE-compliant comparison operators, and then we have sort comparison operators that are different

@shwina
Contributor

shwina commented Apr 3, 2020

One idea I have been playing with along the same lines as @williamBlazing's comment is that it's possible to keep libcudf IEEE-754 compliant and still achieve desired behaviour for sort/merge/groupby. A sort on a column containing NaNs can be reinterpreted as a sort on two columns, the column itself, and a boolean mask col == NaN (or col != NaN). This will place NaNs before/after other values as desired.
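
A small plain-C++ sketch of that two-key idea, assuming NaNs should land at the end: the primary key is the NaN mask, the secondary key is the value, and no individual comparison ever orders a NaN against a number:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

int main() {
  std::vector<double> col{2.0, std::nan(""), 0.0};

  // Equivalent to sorting on the key pair (is_nan(value), value): the boolean
  // key pushes NaNs to the back, and the value comparison stays IEEE-compliant
  // because it is only reached when neither side is NaN.
  std::sort(col.begin(), col.end(), [](double a, double b) {
    bool const a_nan = std::isnan(a);
    bool const b_nan = std::isnan(b);
    if (a_nan != b_nan) return !a_nan;  // non-NaN sorts before NaN
    if (a_nan) return false;            // all NaNs are equivalent to each other
    return a < b;
  });
  // col is now {0.0, 2.0, NaN}
}
```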

@cwharris
Contributor

cwharris commented Apr 3, 2020

One idea I have been playing with along the same lines as @williamBlazing's comment is that it's possible to keep libcudf IEEE-754 compliant and still achieve desired behaviour for sort/merge/groupby. A sort on a column containing NaNs can be reinterpreted as a sort on two columns, the column itself, and a boolean mask col == NaN (or col != NaN). This will place NaNs before/after other values as desired.

Moreover, this is a generalization that could apply to more than just NaNs. It's an override mask for forcing elements to the back/front of a sort. This could solve the sorting scenario of @jrhemstad's hypothetical case where someone changes the meaning of 0.0.

@jrhemstad
Contributor Author

So when I say we can have our cake and eat it too: we have IEEE-compliant comparison operators, and then we have sort comparison operators that are different

It may not have been explicit enough, but in my proposal here I called out that we can keep the custom sort/groupby/join comparators libcudf currently has, even though they strictly aren't IEEE 754 compliant.

Though @shwina brings up a good point that perhaps you can emulate this behavior with an appropriate boolean mask.

@jlowe
Member

jlowe commented Apr 3, 2020

The IEEE standard can apply to the standard comparison operators, and it should, for example when you do a WHERE clause in SQL.

Note that in PostgreSQL and Spark, NaN is a value that can compare against itself, so you can select rows that are NaNs by just selecting them directly like any other value.
For example, here's a quick Spark shell session showing that behavior:

scala> val df = Seq(Float.NaN, 0.0, 1.0, Float.NaN).toDF("floats")
df: org.apache.spark.sql.DataFrame = [floats: double]

scala> df.selectExpr("floats").where("floats = 'NaN'").collect
res10: Array[org.apache.spark.sql.Row] = Array([NaN], [NaN])

scala> df.selectExpr("floats = 'NaN'").collect
res11: Array[org.apache.spark.sql.Row] = Array([true], [false], [false], [true])

SQL engines that need this type of behavior can still implement this specific case via libcudf by translating the operation into an is_nan call instead. It gets a bit trickier when it's not a query literal but rather a computed scalar or an entire column-to-column compare where either side could contain NaNs. That's when it'd be nice to be able to override the binary comparison operators rather than force a series of "what if NaN" intermediate results and combine those results with is_nan masks and copy_if_else calls to get the desired result. But even those are achievable today with some effort.

Aggregations and joins are where things get really messy, because those perform comparisons internally that are not controllable externally. I think we will need some libcudf help to make those work properly for Spark when NaNs are in the input. Note that this might dovetail nicely with some other asks like conditional join support, where the join condition could be pluggable and provided by the application. We eventually will have similar pluggable asks for aggregation functions.

@jrhemstad
Contributor Author

jrhemstad commented Apr 3, 2020

Aggregations and joins are where things get really messy, because those perform comparisons internally that are not controllable externally. I think we will need some libcudf help to make those work properly for Spark when NaNs are in the input.

Agreed. Concrete example in hash join/groupby: there needs to be a specialization for the hashing function to make sure NaNs are hashed consistently. There's no way I can think of for the user to really emulate this.
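
For the hashing half, a hedged illustration of what "hashed consistently" could mean (plain C++, not libcudf's actual hash function): canonicalize every NaN to a single bit pattern before hashing, so rows that differ only in NaN payload still land in the same bucket:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <limits>

// Illustrative only: hash a double so that every NaN (regardless of payload
// or sign bit) hashes identically, letting {1,2,3,NaN} and {1,2,3,NaN} meet
// in the same hash bucket during a join/groupby.
std::size_t hash_double_nan_aware(double v) {
  if (std::isnan(v)) { v = std::numeric_limits<double>::quiet_NaN(); }  // canonicalize NaN
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);  // hash the bit pattern, not the value
  return std::hash<std::uint64_t>{}(bits);
}
```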

Likewise, the comparator needs to be specialized to treat two rows {1,2,3, NaN} and {1, 2, 3, NaN} as equal.

I don't see how a boolean mask could help you here because you'd have {1, 2, 3, NaN, true} and {1, 2, 3, NaN, true}. Those two rows will still compare as not equal without specialization.

@shwina pointed out that you technically can make the above work by generating the boolean mask and then converting NaNs to nulls, so the example becomes {1, 2, 3, null, true} and {1, 2, 3, null, true}, which can compare equal. The boolean mask ensures you can keep NaNs distinct from other nulls.

I'm not actually suggesting anyone do this, but it is an interesting exercise to show that it is possible.

@shwina
Contributor

shwina commented Apr 3, 2020

As discussed offline with Jake, in theory, you could pre-process to convert NaNs to nulls, in which case you'd be comparing {1, 2, 3, null, true} and {1, 2, 3, null, true}. No information has been lost, as the boolean mask still lets you track where your NaNs have been.

@jlowe
Member

jlowe commented Apr 3, 2020

As discussed offline with Jake, in theory, you could pre-process to convert NaNs to nulls

Again, NaNs and nulls are separate. As soon as I convert NaNs to nulls, I risk thinking a NaN is a null. If I compare a NaN-now-a-null to another always-was-a-null, that's going to be true, not false as it should be. Now I need to fix up the comparison result. Might as well just do the is_nan check on both sides, compare those masks directly, and OR the results of that into the regular column comparison. Less messy IMO.
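
Written out as a plain C++ element-wise sketch (null handling omitted for brevity; a column version would OR an "is_nan on both sides" mask into the regular equality result):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the recipe above: compute the ordinary IEEE equality, then OR in
// the rows where both sides are NaN, yielding Spark-style NaN == NaN -> true.
std::vector<bool> spark_equal(std::vector<double> const& lhs,
                              std::vector<double> const& rhs) {
  std::vector<bool> out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    bool const both_nan = std::isnan(lhs[i]) && std::isnan(rhs[i]);
    out[i] = (lhs[i] == rhs[i]) || both_nan;
  }
  return out;
}
```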

@shwina
Contributor

shwina commented Apr 3, 2020

@jlowe Agreed. The suggested approach was just for comparing entire rows as equal, not a true elementwise compare.

@jrhemstad
Contributor Author

Again, NaNs and nulls are separate. As soon as I convert NaNs to nulls, I risk thinking a NaN is a null. If I compare a NaN-now-a-null to another always-was-a-null, that's going to be true, not false as it should be. Now I need to fix up the comparison result. Might as well just do the is_nan check on both sides, compare those masks directly, and OR the results of that into the regular column comparison. Less messy IMO.

@shwina was suggesting that you can preserve the independence of NaNs vs. nulls by using the boolean mask.

Example:
{1, 2, 3, NaN} -> {1, 2, 3, null, true}
{1, 2, 3, null} -> {1, 2, 3, null, false}

The two rows will still compare as not equal.

@jlowe
Member

jlowe commented Apr 3, 2020

Ah yes, sorry I misread. That technique could be used to fix equivalence comparisons for joins.

@kkraus14
Collaborator

kkraus14 commented Apr 3, 2020

@shwina was suggesting that you can preserve the independence of NaNs vs. nulls by using the boolean mask.

Example:
{1, 2, 3, NaN} -> {1, 2, 3, null, true}
{1, 2, 3, null} -> {1, 2, 3, null, false}

The two rows will still compare as not equal.

Is removing NaN handling from groupby / join something that's on the table?

@jrhemstad
Contributor Author

Is removing NaN handling from groupby / join something that's on the table?

Not any more, no. Though @shwina has pointed out that we technically could and still preserve the semantics that Spark/Pandas expect via boolean masks.

I originally proposed removing the NaN specialization for consistency here: #4760 (comment)

Furthermore, for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.

But I've since rescinded that statement several times:

#4760 (comment)

For example, we've identified that to Spark/Pandas/Blazing, deterministically sorting and matching NaNs in join/groupby keys is important (and libcudf already does this today). There's really no other way to emulate this behavior without libcudf natively supporting it.

#4760 (comment)

Yep, I'm not suggesting we do that anymore.

#4760 (comment)

It may not have been explicit enough, but in my proposal here I called out that we can keep the custom sort/groupby/join comparators libcudf currently has, even though they strictly aren't IEEE 754 compliant.

@jrhemstad
Contributor Author

This has been resolved.
