Allow operations mixing Nullable and scalar #85
Makes op(x::Nullable, y) equivalent to op(x, Nullable(y)), but more efficient.
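A minimal sketch of the idea, assuming a hypothetical stand-in type `Opt{T}` in place of `Nullable{T}` so that it runs without any package. The names `Opt` and `isnull` are illustrative only and this is not the code in the PR. It shows a mixed method written directly rather than by wrapping the scalar first.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null: value is left uninitialized
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)

isnull(x::Opt) = !x.hasvalue

# Lifted addition on two wrapped values: null if either operand is null.
function Base.:+(x::Opt{S}, y::Opt{T}) where {S,T}
    R = promote_type(S, T)
    (isnull(x) || isnull(y)) ? Opt{R}() : Opt{R}(x.value + y.value)
end

# Mixed methods: same result as x + Opt(y), without building the wrapper.
function Base.:+(x::Opt{S}, y::T) where {S,T<:Number}
    R = promote_type(S, T)
    isnull(x) ? Opt{R}() : Opt{R}(x.value + y)
end
Base.:+(x::Number, y::Opt) = y + x

r = Opt(1) + 2          # wrapped 3
n = Opt{Int}() + 2      # still null
```

Each operator needs both argument orders, which is part of why the real method list grows quickly.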
Current coverage is
|
That coverage check is really great. I'll improve it if you agree with the general idea. |
This scheme does go against the picture of function lifting that has hitherto informed development of the package. I worry that we might condition users to expect in general that operators ought to be able to take a mix of |
Well, the principle that This isn't useful only to allow comparing |
For example, with this PR, |
Shouldn't If the goal is to follow the C# convention here, then the lifted versions of the equality and relational operators should actually just return a |
I don't think the goal is to follow exactly the C# convention, otherwise we wouldn't have had so many design discussions. Three-valued logic is really needed for database support (see SQL semantics), so I think it makes sense for That's indeed a bit annoying since Julia does not accept non- |
A case could probably be made for allowing |
I wonder if we should just flip |
I freely admit to not knowing much about this stuff, but it seems weird to me for |
Think about |
I think the C# spec essentially gets this right by not following the SQL model. 3VL is really confusing and I think keeping it out as much as possible from the core language and base (as C# does) makes a lot more sense than emulating SQL. For those that haven't read the C# spec, here is what they essentially do:
Is this for things like jplyr and LINQ like things? If so, I don't agree. I think there are two alternatives that I would prefer: 1) just have different semantics when you run a query against a SQL database (I think that is what LINQ to SQL does, and I was not able to find anyone complaining about that on the web) or 2) translate your query statements into SQL that actually follows the C# semantics. I don't think it is worth the hassle to introduce 3VL into Base for a use case that is one of very many use cases of
It is hard for me to see when one wouldn't consider this a bug, so I would much prefer that if the compiler detects a
I've read a fair bit, but I'm clearly still missing some of the discussion. If anyone could point me to a discussion of whether equality and relational operators should result in 3VL, I would greatly appreciate a link! |
C# may not be the best model here as it's designed for a quite different purpose than Julia. In Julia, data with missings is going to be extremely common, which could require different design choices. For example, in R, three-valued logic is built into the core language (non-nullable values don't even exist!). So far, we have followed an intermediate path in which operations between nullables preserve nullability in all cases, forcing the user to be explicit when crossing the barrier between nullables and non-nullables. The advantage of this choice is that it allows us to decide what's the best behavior: use The discussions have been scattered all around the place, but see #74, JuliaLang/julia#13207, |
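For readers less familiar with three-valued logic, here is a small sketch of what "preserving nullability" means for comparisons and boolean connectives. It assumes a hypothetical stand-in type `Opt{T}` instead of `Nullable{T}`. The function names `gt3`, `and3` and `or3` are made up for illustration and do not come from any package.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)
isnull(x::Opt) = !x.hasvalue

# Comparison that preserves nullability: null in, null out.
gt3(x::Opt, y::Opt) =
    (isnull(x) || isnull(y)) ? Opt{Bool}() : Opt(x.value > y.value)

# Three-valued AND: a known false decides the result even if the other side is null.
function and3(x::Opt{Bool}, y::Opt{Bool})
    (!isnull(x) && !x.value) && return Opt(false)
    (!isnull(y) && !y.value) && return Opt(false)
    (isnull(x) || isnull(y)) ? Opt{Bool}() : Opt(true)
end

# Three-valued OR: a known true decides the result even if the other side is null.
function or3(x::Opt{Bool}, y::Opt{Bool})
    (!isnull(x) && x.value) && return Opt(true)
    (!isnull(y) && y.value) && return Opt(true)
    (isnull(x) || isnull(y)) ? Opt{Bool}() : Opt(false)
end
```

Under these rules `and3(Opt(false), Opt{Bool}())` is a known `false`, while `and3(Opt(true), Opt{Bool}())` stays null.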
Alright, I've read through all of these issues, and I still think the C# way of handling this is a really reasonable approach.
Well, the way they handle this works extremely well for LINQ, which is all about data from databases that can have NULL values... In general, it seems to me there are actually at least four distinct questions to think about:
|
Having said all of this, I think @johnmyleswhite got it right when he said we need to try out different implementations and especially see how things actually work in situations like dplyr, LINQ etc. So write code, not have long philosophical debates ;) On that front, it would really help me if we could move the definitions of these lifted operators out of |
Could you clarify what you mean by this? I wasn't aware that R has
+1. |
@StefanKarpinski Isn't that kind of the current behavior?
julia> a = Nullable(5, true); b = Nullable(6, true)
Nullable{Int64}()
julia> D = Dict{Nullable{Int}, Any}()
Dict{Nullable{Int64},Any} with 0 entries
julia> D[a] = 1
1
julia> D[b] = 2
2
julia> D[a]
2 |
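One way to get the dictionary behavior shown in that session is to have `isequal` and `hash` ignore the stored payload of a null. The sketch below uses a hypothetical stand-in type `Opt{T}` and is an assumption about how such behavior can be produced, not a copy of the Base implementation.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null: payload is arbitrary
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)
isnull(x::Opt) = !x.hasvalue

# Two nulls are the same key, whatever bits happen to sit in `value`.
Base.isequal(x::Opt, y::Opt) =
    isnull(x) ? isnull(y) : (!isnull(y) && isequal(x.value, y.value))

# hash must agree with isequal, so nulls hash without looking at the payload.
Base.hash(x::Opt, h::UInt) =
    isnull(x) ? hash(:null_opt, h) : hash(x.value, hash(:opt, h))

a = Opt{Int}(); b = Opt{Int}()
D = Dict{Opt{Int},Any}()
D[a] = 1
D[b] = 2      # overwrites the entry stored under `a`
```

Note that `isequal` returning `true` for two nulls is a separate decision from what a lifted `==` should return.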
AFAICT they respect 3VL. What do you mean exactly?
As you noted yourself, this is simply a matter of deciding on specific semantics inside We could move lifted operators outside of NullableArrays, but the existence of competing definitions of these methods in the wild will make things confusing. Why not experiment with different semantics inside LINQ macros only as a first step?
Yes, for now the plan is to automatically lift operators only, and lift arbitrary functions inside macros only. The macro semantics would be in line with SQL, which is the only one to make sense IMHO: return NULL when any of the inputs is NULL (cf. JuliaData/DataFrames.jl#1025 (comment)). Other choices are totally arbitrary. @davidagold R has |
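The "return NULL when any of the inputs is NULL" rule can be sketched as a generic helper. This assumes a hypothetical stand-in type `Opt{T}` and a hypothetical function `lift` that takes the result type explicitly. It is an illustration of the rule, not the macro machinery discussed in the thread.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)
isnull(x::Opt) = !x.hasvalue

# Apply f to the unwrapped values, or return a null of type R if any input is null.
# R is passed explicitly so the sketch does not rely on type inference.
function lift(f, ::Type{R}, xs::Opt...) where {R}
    any(isnull, xs) && return Opt{R}()
    Opt{R}(f(map(x -> x.value, xs)...))
end

s = lift(+, Int, Opt(1), Opt(2))             # wrapped 3
t = lift(+, Int, Opt(1), Opt{Int}())         # null
u = lift(hypot, Float64, Opt(3.0), Opt(4.0)) # wrapped 5.0
```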
Ah, I was not precise. In R |
|
This is precisely why I was getting confused. It seems, in some of the discussion above, that there's a conflation between (i) allowing EDIT: Not so much a conflation, but using 3VL to refer to both things. |
That is really, really difficult. The whole idea of Query.jl/LINQ is that it doesn't mess with the expressions that you pass in, it just picks up whatever semantics are defined for all the operators somewhere else. In fact, I don't even know how I could implement a different semantics for these operators in Query.jl for the Enumerable side of things if they are already defined somewhere else, e.g. in |
Hmm. What if |
I think people are forgetting something about SQL semantics: the way |
I assume you meant |
Err, I guess I meant |
Actually, I take that back, I guess we do need that method defined... I think it should just return |
Here's the discussion @nalimilan mentioned about providing separate |
@davidanthoff Can you provide examples/pointers illustrating why 3VL can be problematic, and why WHERE can be confusing? FWIW, I've just found out that Wikipedia has a great summary of NULL handling in SQL, including criticism: https://en.wikipedia.org/wiki/Null_(SQL) |
That Wikipedia article is a pretty good indication of what I dislike about 3VL: you end up with an endless list of special cases/rules that you just need to memorize in order to understand what is going on. I think control-flow-like constructs like In Query.jl, I really don't want the Here is another example. Assume for a second that the But if you accept the idea that control flow like language constructs shouldn't handle null values, you only have two choices:
|
Apart from these discussions, here is what I'm going to do for Query.jl in the meantime: I will define my own versions of the comparison operators that follow the C# semantics. I think that if I first import I would prefer if we could move these lifting things out of both In fact, I think the only thing we still disagree on is the comparison operators, right? I'm on board with the lifting for arithmetic operators here, and if the definition of |
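As a reference point, the C# rules for lifted comparisons can be sketched as follows: relational operators return a plain `Bool` that is `false` whenever an operand is null, and equality treats two nulls as equal. The type `Opt{T}` and the names `cs_gt` and `cs_eq` are hypothetical, and this is not the code defined in Query.jl.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)
isnull(x::Opt) = !x.hasvalue

# C#-style relational operator: always a Bool, false if either side is null.
cs_gt(x::Opt, y::Opt) = !isnull(x) && !isnull(y) && x.value > y.value

# C#-style equality: two nulls are equal, a null never equals a value.
cs_eq(x::Opt, y::Opt) =
    isnull(x) ? isnull(y) : (!isnull(y) && x.value == y.value)
```

One consequence, raised elsewhere in this thread as an inconsistency, is that with a null operand both `cs_gt(x, y)` and `cs_gt(y, x)` are `false` while `cs_eq(x, y)` can also be `false`.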
Right, that's why we suggested above that
Right.
I guess we can do that, but switching DataFrames to use NullableArrays without providing comparison operators either in Base, in NullableArrays or in DataFrames sounds like an impractical path for users. So either we should wait and port DataFrames to NullableArrays only when we have decided on semantics, or we should provide comparison operators in one of these packages.
Yes, AFAICT the only remaining issue is with comparison operators. |
And what should happen if a null value is passed to these? The one concrete suggestion I've seen is from @StefanKarpinski to throw an error in the case of For now I've defined various lifting things and operators in Query.jl here. In my mind those definitions are the most practical for a LINQ like framework. Having said that I'm happy to be convinced otherwise and hopefully we can just sort out one set of semantics that then can go into Base. |
I would suggest raising an error with all operations (i.e. I think I'd still like How does that sound? |
Cool, so we also agree on that! So in your proposal (control-flow-like-things throw an error if a null value comes up, comparison operators return a a = Nullable{Int}[1,2,Nullable{Int}(),4]
collect(i for i in a if i>2) Note that this kind of filtering works both in dplyr and in SQL without an error. (*) I'm using generator expressions as a proxy for a more general query facility. I think all the issues can be discussed in the context of generator syntax, so no need to rope in Query.jl or jplyr syntax. The above query would be equivalent to the following Query code: @from i in a begin
@where i>2
@select i
@collect
end
And yes, I should probably think about a somewhat shorter syntax for that ;)
If both of these operators are guaranteed to never return a null value, why not make things simple and have them return a
So this is something I definitely don't want to do in Query: define different semantics for things like
@from i in df begin
@where i.age>18 && i.age<65
@select i
@collect
end
and
middleage(age) = age>18 && age<65
@from i in df begin
@where middleage(i.age)
@select i
@collect
end |
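To make the trade-off concrete, the sketch below runs the generator filter from this thread under two hypothetical comparison functions: `gt_bool` returns a plain `Bool` (C# style, null compares as `false`), and `gt_lifted` returns a nullable boolean. The type `Opt{T}` is a stand-in for `Nullable{T}` and all names are illustrative.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # null
    Opt{T}(v) where {T} = new{T}(true, v)
end
Opt(v::T) where {T} = Opt{T}(v)
isnull(x::Opt) = !x.hasvalue

a = [Opt(1), Opt(2), Opt{Int}(), Opt(4)]

gt_bool(x::Opt, y) = !isnull(x) && x.value > y                        # Bool result
gt_lifted(x::Opt, y) = isnull(x) ? Opt{Bool}() : Opt(x.value > y)     # nullable result

# With a Bool-returning comparison the filter runs and the null row is dropped.
kept = collect(i for i in a if gt_bool(i, 2))

# With a nullable-returning comparison the filter receives a non-Bool
# and Julia raises a TypeError on the first element.
threw = try
    collect(i for i in a if gt_lifted(i, 2))
    false
catch err
    err isa TypeError
end
```

The same applies to `&&` inside a helper such as `middleage`, since `&&` also requires a `Bool` on its left-hand side.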
Yeah, unless you write it inside a LINQ-like macro. If at some point we agree on semantics, we would still be able to allow this later without breaking anything.
Well, obviously we have to choose between making things more convenient inside a macro, and not altering in any way the standard Julia semantics. Anyway, it's fine to have several packages with different approaches. The point of making |
Ok, so let's assume for a second that collect(i for i in a if i>3) throws an error, while
doesn't, but presumably drops rows where i has a null value, right? I'm not a fan because I think those two things should do exactly the same thing. But for argument's sake, let's assume that is the solution for a moment. What would you suggest for the following query for null values?
I disagree, there is a third choice: we can pick the C# semantics for comparison operators and a) not alter standard julia semantics and b) have a convenient solution in LINQ like macros. And in fact in that case we would have the same semantics for all of these things inside LINQ macros and outside, which seems a huge win in terms of consistency and simplicity of the language/library.
I agree with that for a certain time, but at some point we should have one approach in Base. After all, both NullableArrays and Query are now defining lots of methods for types that are all defined in Base and override each other, which does not seem like a good long-term situation... |
That's really up to you. My preferred choice would be to return null, but I know you don't like that. From what you said above, I guess it would treat nulls as
We're going over the discussion from above once again. C# semantics are a possibility, but opting for them now in Base would make it harder to experiment with other choices in macros.
If we were to keep diverging semantics in different packages, we would have to implement these operations without defining methods which leak to other packages (i.e. let macros replace some calls with calls to internal functions). |
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which can be tackled later.
Let's assume for a moment that we change With these semantics, comparing two |
I think here is an argument for the C# implementation that is not based on history for Plus, I guess the semantics for comparing We would lose consistency with Right now I also have the following in Query: const null = Nullable{Union{}}() and then all the operators that allow you to essentially write null checks like I think the real inconsistency in the C# spec stems from |
In Julia, you already need to use At least, returning a |
This is not about the mental burden for package developers but for "normal" users. If
I think what it comes down to is this: yes, the rules for |
I've done a small review of choices made by other languages, and indeed they don't seem to bother about the inconsistency between Only SQL, R and SAS return the equivalent of So I guess we should go that way, given that it will be the least surprising choice for users (unless of course we decide to return a |
Makes op(y::Nullable, x) equivalent to op(y, Nullable(x)), but more efficient.
I have played with the idea of actually defining op(x::Nullable, y) = op(x, Nullable(y)), hoping that the compiler would get rid of everything, but the generated code is awful.
The presence of ambiguities makes the code much longer than it needs to be. In particular, the ambiguous << and >> in Base are only there to raise an error, and define something like an interface. We could imagine that with an improved support for traits they would no longer happen in some future release...