[DISCUSSION] Behavior for NaN comparisons in libcudf #4760
Comments
They only have nullable integer and boolean columns as of 1.0. For floating point columns they only have NaNs, and there's an ongoing discussion about whether they want to support nulls vs. NaNs for floating point columns. pandas-dev/pandas#32265
I'm strongly in favour of libcudf following IEEE 754 for NaN behaviour, while offering a good deal of flexibility for consumers that need different NaN handling.
This is definitely accurate for Pandas < 1.0, i.e., before the introduction of nullable dtypes. For Pandas >= 1.0, it's still not clear (to me, at least) how Pandas treats NaNs vs. nulls.
ORC and Parquet both have very clearly defined behavior regarding NaNs, so IO readers/writers will presumably stick with following the corresponding specification in that regard (forcing NaNs to be NULLs would be a problem).
Do ORC/Parquet have any behavior that doesn't conform with IEEE 754? If not, then there's no issue. To be clear, I'm not suggesting NaNs always be converted to nulls. Instead, if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.
This has come up a number of times, and as discussed previously it is not a workable solution. NaNs are distinct from nulls in Spark, and therefore NaNs cannot be converted to nulls without losing information. Consider a column that contains both NaNs and nulls, and I need it to be sorted, aggregated, whatever. As soon as I change the NaNs to nulls and operate on it, I have lost the ability to recover which nulls are really proxies for NaNs so they can be converted back to produce a proper answer. This also makes the assumption that how nulls are handled matches the behavior desired for NaNs, which is incorrect in many cases (e.g., how Spark sorts nulls vs. NaNs).
Is the fact that Spark NaN semantics differ from the IEEE 754 standard something that you think could be reasonably raised upstream in the Spark community? I understand that isn't a short term path, but from the perspective of libcudf being a C++ library, this feels out of scope and will just bloat the library. @jrhemstad any thoughts on how hard it would be to allow someone to provide their own unary / binary ops via an additional shared library? I.e., I could imagine in the future that any Spark or Pandas or other library specific behaviors could be defined within a helper shared library that depends on libcudf and extends things as needed where it doesn't make sense to include them in libcudf directly.
If someone is only interfacing with the precompiled binary, then no. Where it would be possible is if a user interfaces directly with libcudf's header-only implementations. However, this solution would require a user to build their own library binary.
No, definitely not. Users do not want sorts of columns that happen to contain NaNs to sort NaNs semi-randomly (which is what occurs in practice with IEEE 754 NaN comparison behavior). Other databases have sort behaviors similar to Spark's with respect to NaNs, and asking a very long standing and well documented database behavior to change is a non-starter. I understand the desire to not juggle a bunch of conflicting requirements, but we need to find some way to meet them. As I understand it, that means we have to choose between:
I really, really don't want to choose that last one. 😅
I think what you're suggesting here is the solution @kkraus14 was hinting at and what I outlined above. This would take a bit of work to expose the necessary function templates, but still probably less than the complexity of providing two separate code paths that libcudf would currently need to provide in option 1.
I'm all for allowing control of NaNs in sorting because that's allowing the ability to define otherwise undefined behavior, but these binary operations are well defined for NaN values.
The problem is that sorting and comparison are expected to be related. If the sort operation orders NaNs as the largest values, then IMO comparing a NaN against other numbers should reflect that sort order. Since IEEE 754 NaNs never compare, I don't see how you can ever sort them. If we want to sort them in a consistent way then there needs to be a way to consistently compare them.
I agree, which is why I actually suggested removing the specialization for NaNs in the sort comparator :) That way we're consistently conformant rather than a mix of non-conformance with conformance (as we've seen in Raza's issues).
Note that removing the sort specialization doesn't really solve anything other than allowing us to say, "we're just as useful as IEEE 754 when it comes to sorting with data that may contain NaNs. Make sure you don't have NaNs in your data before calling sort/orderby or it's worthless" 😉
Like I always say, "Better to be consistently wrong than inconsistently right." 😄
Besides Spark and Pandas, what do other apps on cudf want with respect to NaN handling? For example, curious what the Blazing group's thoughts are and if they have a proposal we haven't considered. cc: @felipeblazing
For Blazing it's actually pretty important that NaN comparisons be deterministic. Non-deterministic sorting could cause all kinds of issues for us. I think it's very valid for a user to want to do something like have NaN be included as one of the groups when performing a group by. Also, how would it look to a user if they sort a list like [2, 0, NaN] and see different results each time?
I agree with @jrhemstad when he says it's better to be consistently wrong as opposed to following the spec. The spec seems to suck in this case where users can care if something is NaN. Right now we aren't doing anything different from what cudf is doing, and we have not tested what happens when we put NaN numbers in columns. Something we will have to rectify :). We care about consistency. We don't really care if they come before or after. I imagine that it might be most convenient to put them at the end, assuming people are most likely to sort things in ascending order.
@felipeblazing I think you're gravely misunderstanding my statement 😅 I'm saying we should consistently follow the IEEE 754 standard, which we don't currently do (as evidenced by the special handling of NaNs for sorting). This is the opposite of what you seem to be advocating for. NaNs are, by definition, objects that cannot be ordered. The expectation is that if you have NaNs in your data, you should do something about that (like filtering them out or replacing them with a sentinel value) before trying to do any operation like a sort.
I think this is going to be true for any cudf application that needs to deal directly with NaNs instead of treating them as equivalent to nulls. By adhering to IEEE 754, NaNs become explosive values in the data, ruining sorts, aggregations, lookups, etc. in which they appear. If libcudf strictly adheres to IEEE 754 for NaNs and only exposes function templates, then I would argue most cudf applications needing to wield NaNs will have to build their own libcudf with custom function templates. That's not ideal. I think function templates are a good way to factor the problem and minimize duplicated code paths in libcudf. However, I don't think libcudf's solution should stop there when it comes to NaNs, because the NaN issue is likely not limited to a small subset of cudf applications. To me this would be akin to libcudf only sorting nulls first and saying, "Do you need to sort nulls last instead sometimes? Sorry, we don't support that directly. Here's a function template interface to help when you build your own from source." IMO it comes down to how useful libcudf's build artifacts will be for applications that need to deal with NaNs. If almost nothing in libcudf behaves in a useful way when NaNs appear, then there's going to be a lot of libcudf 'flavors' in the wild that are essentially identical, just to solve the NaN problem. At that point it would be tempting to fork libcudf, add the necessary function templates to wield NaNs in a useful manner, and publish it as an out-of-the-box solution for apps that must deal with NaNs.
This has been stated implicitly a couple of times, but just to be explicit: applications can still get predictable behavior in the presence of NaNs by converting them to nulls first. For applications that need to deal with a combination of NaNs and nulls, that conversion can happen up front.
The problem with this is it then becomes impossible to differentiate between NaNs and nulls.
Spark currently handles it; Pandas hasn't crossed that road as of yet because it doesn't yet have nullable float columns, but it does handle sorting NaNs to a consistent position.
NaNs are sorted relative to numeric values as explained in the PR description and documented at https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics. Null sorting relative to non-null values is treated orthogonally and can be controlled by the user per-column (nulls first or nulls last). For example, if you want to sort the column…
PostgreSQL matches Spark's behavior with respect to NaN handling; see https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-FLOAT. I'm not sure if Blazing will choose to align, but it's clear they'll need to put NaNs somewhere deterministically and separate from nulls when it comes to sorting and grouping.
If Pandas is going to start looking more like Spark for NaNs, I have to agree that the custom header version is not very useful. It seems we should look at whether all of our clients might potentially want the same (or close to the same) thing. And if not, we should look at how JIT might be put to use.
Scala follows IEEE 754. (While researching Scala, I came across a particularly nice quote; it's wild to me that Spark deviates from its native language's behavior here.) MySQL doesn't even allow NaNs. It's clear that the majority of languages follow IEEE 754. For those that do not, there is no consensus on what the rules are. Furthermore, considering that all modern hardware architectures adhere to IEEE 754 (mostly; GPUs don't have signaling NaNs), I can't see Spark's NaN behavior as anything other than abhorrent.
If anything, I believe Pandas is moving further from Spark's NaN behavior, not closer. By adding a proper concept of a NULL, they no longer have to abuse NaN for nulls. Pandas today is inconsistent in how it follows IEEE 754. For things like sorting, it does break from the standard and allows sorting NaNs to a consistent location (so it agrees with Spark here). For Join, Pandas treats NaN == NaN (same as Spark). For element-wise comparisons with NaN (like binary ops), it follows 754 whereas Spark does not. For GroupBy, Pandas will drop NaNs in both the key and value columns. Spark does not. It's becoming clear to me that if libcudf were to natively and wholly support Spark's NaN behavior, it would require extremely invasive changes that touch nearly every feature in libcudf. In addition, not all libcudf consumers want Spark's behavior, so we'd have to maintain redundant code paths: one for IEEE 754 behavior and one for Spark behavior. And consider if some other consumer decided to redefine another fundamental operation the way Spark redefines NaN comparison; we can't special-case libcudf for every deviation from the standard. Just as a tiny example of what I'm talking about: in C++, casting from NaN is undefined, but Spark expects it to be zero. So now any time we do a cast, we have to remember to check for NaN!
I think there's a middle ground here that minimizes the invasiveness of special casing for NaN while allowing Spark to do what it needs to do. For example, we've identified that to Spark/Pandas/Blazing, deterministically sorting and matching NaNs in join/groupby keys is important (and libcudf already does this today). There's really no other way to emulate this behavior without libcudf natively supporting it. In contrast, in the casting example, Spark could achieve its desired behavior by first replacing NaNs with 0 before performing the cast. Here's my proposal. We proceed on a case-by-case basis and identify situations where it is literally impossible for Spark to otherwise achieve their desired behavior via composition of other libcudf features, e.g., sort/join/groupby. However, in cases like casting, Spark can achieve their desired behavior by first replacing NaN with 0. Binary ops look like another situation where Spark can do extra work to add support for its NaN behavior. One way to do it is to generate a boolean mask from an is_nan operation and use it to post-process the comparison results. As is true of all good compromises, I think all parties will find things to be unhappy about with this solution, but I don't see another way to proceed.
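To make that binary-op composition concrete, here is a hypothetical host-side sketch (plain std::vector in place of libcudf columns; the name spark_equal is made up for illustration) of emulating Spark's "NaN equals NaN" element-wise equality by combining an IEEE 754 compare with an is-NaN mask:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Spark-style element-wise equality: the IEEE 754 compare handles ordinary
// values, and an is-NaN mask patches in the "NaN equals NaN" cases.
std::vector<bool> spark_equal(std::vector<double> const& a,
                              std::vector<double> const& b)
{
  std::vector<bool> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    bool const both_nan = std::isnan(a[i]) && std::isnan(b[i]);
    out[i] = both_nan || (a[i] == b[i]);  // IEEE 754 alone: NaN == NaN is false
  }
  return out;
}
```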
Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting is comparing. PostgreSQL and Spark are self-consistent in their treatment of NaNs, and I suspect that's more useful to end-users in practice. Consistent with IEEE 754? No, but IEEE 754 is not the beginning and end of how to deal with NaNs. And I hope we can all agree IEEE 754 is very unhelpful when trying to deal with them in ETL queries. That's why it sounds like we're leaning towards libcudf breaking with that standard at least when it comes to sorting and grouping. I'm totally cool if we can implement the required behaviors with libcudf primitives, even if it takes multiple steps in some cases. For the casting issue, totally agree the Spark plugin should simply scan for NaNs and replace them ahead of the cast operation. We already do something similar for division, where Spark expects division by zero to generate null: all zero divisors are replaced with null before calling the divide op. So big +1 from me if there's a way to move forward without resorting to building a separate libcudf.
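A sketch of that divide pre-processing under similar assumptions (hypothetical host-side code; std::optional stands in for a real null mask):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Spark expects x / 0 to yield null, so zero divisors are nulled out up
// front rather than special-casing the divide kernel itself.
std::vector<std::optional<double>>
divide_with_null_on_zero(std::vector<double> const& num,
                         std::vector<double> const& den)
{
  std::vector<std::optional<double>> out(num.size());
  for (std::size_t i = 0; i < num.size(); ++i) {
    out[i] = (den[i] == 0.0) ? std::nullopt
                             : std::optional<double>{num[i] / den[i]};
  }
  return out;
}
```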
Pandas exposes a parameter to control NaN positioning of either first or last. From my perspective, a vectorized comparison operation is entirely different from an internal implementation detail of the sorting comparison, and it doesn't make sense to compare the two.
Also, I don't know how we would enable Spark behaviours in libcudf for casting. We'd probably have to go through the same steps: find the NaNs, replace them with zero, then perform the cast. So it's the same operations, same performance penalty even when done from inside libcudf.
You could define a custom unary operator that has a specialization for floating point types that does something like:

```cpp
#include <cmath>
#include <type_traits>

// Call operator of a cast functor: maps NaN to zero instead of invoking
// undefined behavior when casting from a floating point NaN.
template <typename From, typename To>
std::enable_if_t<std::is_floating_point_v<From>, To> operator()(From v)
{
  if (std::isnan(v))
    return To{0};
  else
    return static_cast<To>(v);
}
```

But this is the kind of over-specialization that I really don't want libcudf to have to do.
I feel like we can have our cake and eat it too. The IEEE standard can, and should, apply to standard comparison operators, for example when you do a WHERE clause in SQL. But it should not apply to the comparison operator that we use in something like a sort. The comparator in a sort does not have to be a standard <; it just needs to be a strict weak ordering, which IEEE 754's < is not (NaN never compares less than, greater than, or equal to anything). So when I say we can have our cake and eat it too: we have IEEE-compliant comparison operators, and then we have sort comparison operators that are different.
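For illustration, a minimal sketch of such a sort comparator (hypothetical, not libcudf's actual implementation): it forms a strict weak ordering over doubles by treating all NaNs as equivalent to each other and greater than every number.

```cpp
#include <cmath>

// Strict weak ordering over doubles: all NaNs compare equivalent to each
// other and sort after every numeric value; numbers use the ordinary '<'.
struct nan_last_less {
  bool operator()(double lhs, double rhs) const
  {
    bool const lhs_nan = std::isnan(lhs);
    bool const rhs_nan = std::isnan(rhs);
    if (lhs_nan || rhs_nan) { return !lhs_nan && rhs_nan; }  // number < NaN
    return lhs < rhs;
  }
};
```

With this, std::sort(v.begin(), v.end(), nan_last_less{}) places NaNs at the end deterministically, while WHERE-style element-wise comparisons can keep their IEEE 754 semantics.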
One idea I have been playing with along the same lines as @williamBlazing's comment is that it's possible to keep libcudf IEEE 754 compliant and still achieve the desired behaviour for sort/merge/groupby. A sort on a column containing NaNs can be expressed as a sort keyed first on an is_nan boolean mask and then on the values themselves.
Moreover, this is a generalization that could apply to more than just NaNs. It's an override mask for forcing elements to the back/front of a sort. This could also solve the sorting scenario of @jrhemstad's hypothetical case where someone changes the meaning of some other fundamental operation. A sketch of the idea follows.
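Here is a hypothetical sketch of that mask-keyed decomposition (standard-library stand-ins for libcudf columns): sorting on the pair (is_nan(value), value) forces NaNs to the back while the numeric portion still uses the plain IEEE 754 <.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Sort keyed on (is_nan, value): the boolean key pushes NaNs after all
// numbers, and within the numeric portion the ordinary IEEE 754 '<' applies.
void sort_nans_to_back(std::vector<double>& v)
{
  std::sort(v.begin(), v.end(), [](double a, double b) {
    return std::make_pair(std::isnan(a), a) < std::make_pair(std::isnan(b), b);
  });
}
```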
It may not have been explicit enough, but in my proposal here I called out that we can keep the custom sort/groupby/join comparators libcudf currently has, even though they strictly aren't IEEE 754 compliant. Though @shwina brings up a good point that perhaps you can emulate this behavior with an appropriate boolean mask.
Note that in PostgreSQL and Spark, NaN is a value that can compare against itself, so you can select rows that are NaNs by just selecting them directly like any other value.
SQL engines that need this type of behavior can still implement this specific case via libcudf by translating the operation into an is_nan check. Aggregations and joins are where things get really messy, because those perform comparisons internally that are not controllable externally. I think we will need some libcudf help to make those work properly for Spark when NaNs are in the input. Note that this might dovetail nicely with some other asks like conditional join support, where the join condition could be pluggable and provided by the application. We eventually will have similar pluggable asks for aggregation functions.
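For instance (a hypothetical sketch, not any engine's actual rewrite rule), a predicate like WHERE x = CAST('NaN' AS DOUBLE) can be translated into an explicit NaN test, since the literal equality would match nothing under IEEE 754:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rewrite of "x = NaN" into an explicit NaN test: produces the selection
// mask the SQL equality is intended to mean under PostgreSQL/Spark semantics.
std::vector<bool> select_nan_rows(std::vector<double> const& x)
{
  std::vector<bool> mask(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) { mask[i] = std::isnan(x[i]); }
  return mask;
}
```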
Agreed. Concrete example in hash join/groupby: there needs to be a specialization of the hashing function to make sure NaNs are hashed consistently, and likewise the comparator needs to be specialized to treat two rows containing NaNs in the same key positions as equal. There's no way I can think of for the user to really emulate this, and I don't see how a boolean mask alone could help you here. @shwina pointed out that you technically can make the above work by generating the boolean mask and then converting NaNs to nulls, so the example becomes comparing nulls against nulls. I'm not actually suggesting anyone do this, but it is an interesting exercise to show that it is possible.
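A sketch of what that hashing specialization could look like (hypothetical, not libcudf's actual hasher): collapse every NaN bit pattern to one canonical representative before hashing, so all NaN keys land in the same bucket.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <limits>

// Hash doubles by bit pattern, first collapsing every NaN (there are many
// distinct NaN bit patterns) to a single canonical quiet NaN so that all
// NaN keys hash identically for join/groupby.
std::size_t hash_double(double v)
{
  if (std::isnan(v)) { v = std::numeric_limits<double>::quiet_NaN(); }
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return std::hash<std::uint64_t>{}(bits);
}
```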
As discussed offline with Jake, in theory you could pre-process to convert NaNs to nulls, in which case you'd be comparing nulls against nulls.
Again, NaNs and nulls are separate. As soon as I convert NaNs to nulls, I risk thinking a NaN is a null. If I compare a NaN-now-a-null to another always-was-a-null, that's going to be true, not false as it should be. Now I need to fix up the comparison result. Might as well just do the NaN-aware comparison directly.
@jlowe Agreed. The suggested approach was just for comparing entire rows as equal, not a true elementwise compare.
@shwina was suggesting that you can preserve the independence of NaNs vs. nulls by using the boolean mask. Example: a row containing a NaN and a row containing a null will still compare as not equal.
Ah yes, sorry I misread. That technique could be used to fix equivalence comparisons for joins.
Is removing the NaN sorting specialization still on the table?
Not any more, no. Though @shwina has pointed out that we technically could and still preserve the semantics that Spark/Pandas expect via boolean masks. I originally proposed removing the NaN specialization for consistency here: #4760 (comment)
But I've since rescinded that statement several times.
This has been resolved. |
Further splitting up #9452 -- split off at the suggestion of @bdice. Related to #10781 and #4760 -- issues and discussions related to NaN comparison behavior.
Authors:
- Ryan Lee (https://github.com/rwlee)
- Bradley Dice (https://github.com/bdice)
Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
URL: #10870
Recent issues (#4753 #4752) have called into question how libcudf handles NaN floating point values. We've only ever addressed this issue on an ad hoc basis as opposed to having a larger conversation about the issue.
C++
C++ follows the IEEE 754 standard for floating point values, which for comparisons with NaN has the following behavior:
- NaN == x is false for every x, including NaN
- NaN != x is true for every x, including NaN
- NaN < x, NaN > x, NaN <= x, and NaN >= x are all false
https://en.wikipedia.org/wiki/NaN
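A quick self-contained demonstration of this behavior in C++:

```cpp
#include <cassert>
#include <cmath>

int main()
{
  double const nan = std::nan("");
  assert(!(nan == nan));                 // equality with NaN is always false
  assert(nan != nan);                    // inequality with NaN is always true
  assert(!(nan < 1.0) && !(nan > 1.0));  // ordered comparisons are all false
  return 0;
}
```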
Spark
Spark is non-conforming with the IEEE 754 standard:
- NaN = NaN returns true
- In aggregations, all NaN values are grouped together
- NaN is treated as a normal value in join keys
- NaN values go last when sorting in ascending order, larger than any other numeric value
See https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics
Python/Pandas
Python is a bit of a grey area because prior to 1.0, Pandas did not have the concept of "null" values and used NaNs in their stead.
In most regards, Python does respect IEEE 754. For example, see how numpy conforms with the expected IEEE 754 behavior in binary ops in #4752 (comment) (where Spark does not).
However, there are some cases where Pandas is non-conforming due to the pseudo-null behavior. For example, in `sort_values` there is a `na_position` argument to control where NaN values are placed. This requires specializing the libcudf comparator used for sorting to special-case floating point values and deviate from the IEEE 754 behavior of `NaN < x == false` and `NaN > x == false`. See #2191 and #3226 where this was done previously.
That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls, but @shwina @kkraus14 will need to confirm. Prior to Pandas 1.0, it wasn't possible to have both NaN and NULL values in a floating point column. We should see what the expected behavior of NaNs vs. nulls will be in 1.0.
Discussion
We need to have a conversation and make decisions on what libcudf will and will not do with respect to NaN behavior.
My stance is that libcudf should adhere to IEEE 754. Spark's semantics redefine a core concept of the C++ language/IEEE standard and satisfying those semantics would require extremely invasive changes that negatively impact both performance and code maintainability.
Even worse, because Spark differs from C++/Pandas, we need to provide separate code paths for all comparison based operations: a "Spark" path, and a "C++/Pandas" path. This further increases code bloat and maintenance costs.
Furthermore, for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.
In conclusion, we already have special logic for handling NULLs everywhere in libcudf. Users should leverage that logic by converting NaNs to NULLs. I understand that vanilla Spark treats NaNs and NULLs independently, but I believe trying to imitate that behavior in libcudf comes at too high a cost.