Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Allow operations mixing Nullable and scalar #85

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nalimilan
Copy link
Member

Makes op(y::Nullable, x) equivalent to op(y, Nullable(x)), but more efficient.

I have played with the idea of actually defining op(x::Nullable, y) = op(x, Nullable(y)), hoping that the compiler would get rid of everything, but the generated code is awful.

The presence of ambiguities makes the code much longer than it needs to be. In particular, the ambiguous << and >> in Base are only there to raise an error, and define something like an interface. We could imagine that with an improved support for traits they would no longer happen in some future release...

Makes op(x::Nullable, y) equivalent to op(x, Nullable(y)),
but more efficient.
@codecov-io
Copy link

Current coverage is 88.65%

Merging #85 into master will decrease coverage by -1.62% as of 7fe4d13

@@            master     #85   diff @@
======================================
  Files           12      12       
  Stmts          586     608    +22
  Branches         0       0       
  Methods          0       0       
======================================
+ Hit            529     539    +10
  Partial          0       0       
- Missed          57      69    +12

Review entire Coverage Diff as of 7fe4d13

Powered by Codecov. Updated on successful CI builds.

@nalimilan
Copy link
Member Author

That coverage check is really great. I'll improve it if you agree with the general idea.

@davidagold
Copy link
Contributor

This scheme does go against the picture of function lifting that has hereto informed development of the package. I worry that we might condition users to expect in general that operators ought to be able to take a mix of Nullable and non-Nullable arguments -- an expectation I don't think we or anybody else should cater to. Where other than comparing a NullableArray to an Array does this issue come up?

@nalimilan
Copy link
Member Author

Well, the principle that Nullable op scalar = Nullable op Nullable(scalar) doesn't sound too hard to implement -- as can be seen from the fact that this simple PR already supports all common operators (thanks to the clean organization of your code, BTW).

This isn't useful only to allow comparing NullableArray and Array. I see e.g. Nullable(1) == 1 returning Nullable(true) instead of false as a more interesting improvement: the latter is quite confusing IMHO, and can go unnoticed, giving incorrect results, while the former will either behave correctly or raise an explicit error if passed to if.

@nalimilan
Copy link
Member Author

For example, with this PR, find([Nullable(true), Nullable(false)]) raises an error instead of returning the incorrect [1, 2] (#76).

@davidanthoff
Copy link

Shouldn't Nullable(1)==1 just return true? Nullable(1)===1 could still return false.

If the goal is to follow the C# convention here, then the lifted versions of the equality and relational operators should actually just return a Bool, not a Nullable{Bool}. See section 7.3.7 in the C# spec.

@nalimilan
Copy link
Member Author

I don't think the goal is to follow exactly the C# convention, otherwise we wouldn't have had so many design discussions. Three-valued logic is really needed for database support (see SQL semantics), so I think it makes sense for Nullable(1) == 1 to return a Nullable{Bool}, since it's equivalent (via promotion) to Nullable(1) == Nullable(1).

That's indeed a bit annoying since Julia does not accept non-Bool with if conditions, but that's the only way to preserve information regarding missingness (this is e.g. what R does). There is isequal for when you want a boolean result.

@StefanKarpinski
Copy link

A case could probably be made for allowing Nullable{Bool} values in boolean contexts when not null – and raising an error in case they are null.

@quinnj
Copy link
Member

quinnj commented Aug 18, 2016

I wonder if we should just flip isequal and ==. I.e. have Nullable(1) == 1 return true, and isequal can be used for full three-valued logic. I admit I'm not fully informed on the ramifications of doing so though.

@ararslan
Copy link
Member

ararslan commented Aug 18, 2016

I freely admit to not know much about this stuff, but it seems weird to me for == and isequal to behave differently.

@StefanKarpinski
Copy link

Think about isequal as being for hashing things. I don't think you want to be able to stick an infinite number of nullables into a Dict and never get them out.

@davidanthoff
Copy link

I think the C# spec essentially gets this right by not following the SQL model. 3VL is really confusing and I think keeping it out as much as possible from the core language and base (as C# does) makes a lot more sense than emulating SQL. For those that haven't read the C# spec, here is what they essentially do:

  • The actually do support 3VL logic because !, & and | are all defined for nullable values and follow the standard 3VL rules (as an side note, those of course aren't what you would get if you were to just apply the lifting semantics that are discussed as a general lifting semantics in various issues here).
  • The equality and relational operators provide lifted versions that do not throw you into 3VL land but always return a standard bool value.

Three-valued logic is really needed for database support (see SQL semantics)

Is this for things like jplyr and LINQ like things? If so, I don't agree. I think there are two alternatives that I would prefer: 1) just have different semantics when you run a query against a SQL database (I think that is what LINQ to SQL does, and I was not able to find anyone complaining about that on the web) or 2) translate your query statements into SQL that actually follows the C# semantics. I don't think it is worth the hassle to introduce 3VL into Base for a use case that is one of very many use cases of Nullable.

A case could probably be made for allowing Nullable{Bool} values in boolean contexts when not null – and raising an error in case they are null.

It is hard for me to see when one wouldn't consider this a bug, so I would much prefer that if the compiler detects a Nullable{Bool} in an if statement that it complains at compile time.

otherwise we wouldn't have had so many design discussions.

I've read a fair bit, but I'm clearling still missing some of the discussion. If anyone could point me to a discussion of whether equality and relational operators should result in 3VL, I would greatly appreciate a link!

@nalimilan
Copy link
Member Author

C# may not be the best model here as it's designed for a quite different purpose as Julia. In Julia, data with missings is going to be extremely common, which could require different design choices. For example, in R, three-valued logic is built inside the core language (non-nullable values don't even exist!). So far, we have followed an intermediate path in which operations between nullables preserve nullability in all cases, forcing the user to be explicit when crossing the barrier between nullables and non-nullables.

The advantage of this choice is that it allows to decide what's the best behavior: use == (or get) when you want to get a Nullable{Bool}, use isequal when you want to get a Bool. I also find it useful to catch potential bugs: if you get a null value when the code wasn't prepared to handle it, it raises a NullException instead of silently returning false. So it's useful even outside LINQ-like code blocks.

The discussions have been scattered all around the place, but see #74, JuliaLang/julia#13207,
#95 and linked issues.

@davidanthoff
Copy link

Alright, I've read through all of these issues, and I still think the C# way of handling this is a really reasonable approach.

C# may not be the best model here as it's designed for a quite different purpose as Julia.

Well, the way they handle this works extremely well for LINQ, which is all about data from databases that can have NULL values...

In general, it seems to me there are actually at least four distinct questions to think about:

  1. Do we want 3VL semantics or not? Note that SQL, R and C# follow the 3VL semantics, but the current implementation here in NullableArrays does not. I have not read any good reason to not follow the standard 3VL semantics, so I'm a bit confused about the NullableArrays choice here. Note that this only concerns how & | and ! are implemented for Nullable{Bool} arguments but doesn't touch any of the other lifting issues.
  2. How should normal arithmetic operations be lifted? I.e. +, -, * and friends for numbers. This seems the easy case where most people seem to agree that a) the return type should be a Nullable if one (or all) arguments are a Nullable, and null values should propagate.
  3. The most tricky one is probably whether comparison operators like == and friends should return a Nullable{Bool} or just a Bool. SQL and R opt for the former, C# for the latter. I much prefer the latter approach, because it means that you are not thrown into 3VL land when you do comparison operations. In my mind, comparison operations in data heavy situations are by far the most often used for some sort of filtering or if statements. In both cases it seems to me you really don't want to end up in 3VL land... If you do define comparison operators so that they return Nullable{Bool}, I think one has two choices for things like filtering and if statements: a) you can define semantics for if and filtering on how to handle Nullable{Bool}. SQL for example does this for its WHERE clause. I'm not a fan at all of that, it makes things complicated and messy and in the case of SQL constantly trips people up. Or b), you have to explicitly convert your Nullable{Bool} to a Bool before it enters the filter or if clause. That works, but I think it makes things way, way too verbose. In my LINQ implementation for julia (Query.jl) this would really mess things up, even simple things like @where i.age==30. would have to be rewritten in a really ugly way if age was a nullable column in the data source. I think the semantics in the C# spec are especially useful for situations like LINQ in this case.
  4. I'd be very hesitant to define common lifting semantics for other things at this point, nothing seems clear to me on that front. For example, string concatenation: in SQL concatenating a NULL and a string results in a NULL, in C# it results in a string (where the NULL value is handled as if it was an empty string) and in R it results in a string "FIRSTSTRING NA". While the R way of handling that seems truly bad to me, I'm not at all sure whether the SQL or C# approach make more sense. I think that suggests that neither Base nor NullableArrays should do anything about this case right now.

@davidanthoff
Copy link

Having said all of this, I think @johnmyleswhite got it right when he said we need to try out different implementations and especially see how things actually work in situations like dplyr, LINQ etc. So write code, not have long philosophical debates ;) On that front, it would really help me if we could move the definitions of these lifted operators out of NullableArrays. For Query.jl the current choices in NullableArray around lifting are not ideal (to put it mildly), but I still need to use NullableArrays for arrays that can have nulls. So if we could move the lifting out of NullableArrays into its own package, I could use NullableArrays in Query.jl, but use my own lifting semantics that actually work better in a LINQ like context (essentially the C# spec). Other packages could use the lifting semantics that are moved from NullableArrays into its own package, and then we could just try and see for a while which ones work better.

@davidagold
Copy link
Contributor

The most tricky one is probably whether comparison operators like == and friends should return a Nullable{Bool} or just a Bool. SQL and R opt for the former, C# for the latter.

Could you clarify what you mean by this? I wasn't aware that R has Nullables.

On that front, it would really help me if we could move the definitions of these lifted operators out of NullableArrays.

+1.

@davidagold
Copy link
Contributor

Think about isequal as being for hashing things. I don't think you want to be able to stick an infinite number of nullables into a Dict and never get them out.

@StefanKarpinski Isn't that kind of the current behavior?

julia> a = Nullable(5, true); b = Nullable(6, true)
Nullable{Int64}()

julia> D = Dict{Nullable{Int}, Any}()
Dict{Nullable{Int64},Any} with 0 entries

julia> D[a] = 1
1

julia> D[b] = 2
2

julia> D[a]
2

@nalimilan
Copy link
Member Author

Do we want 3VL semantics or not? Note that SQL, R and C# follow the 3VL semantics, but the current implementation here in NullableArrays does not. I have not read any good reason to not follow the standard 3VL semantics, so I'm a bit confused about the NullableArrays choice here. Note that this only concerns how & | and ! are implemented for Nullable{Bool} arguments but doesn't touch any of the other lifting issues.

AFAICT they respect 3VL. What do you mean exactly?

In my LINQ implementation for julia (Query.jl) this would really mess things up, even simple things like @where i.age==30. would have to be rewritten in a really ugly way if age was a nullable column in the data source. I think the semantics in the C# spec are especially useful for situations like LINQ in this case.

As you noted yourself, this is simply a matter of deciding on specific semantics inside @where, for example by considering any nulls as false. We discussed somewhere in AbstractTables the possibility of providing both filter and where with different semantics to allow for both behaviours.

We could move lifted operators outside of NullableArrays, but the existence of competing definitions of these methods in the wild will make things confusing. Why not experiment with different semantics inside LINQ macros only as a first step?

I'd be very hesitant to define common lifting semantics for other things at this point, nothing seems clear to me on that front. For example, string concatenation: in SQL concatenating a NULL and a string results in a NULL, in C# it results in a string (where the NULL value is handled as if it was an empty string) and in R it results in a string "FIRSTSTRING NA". While the R way of handling that seems truly bad to me, I'm not at all sure whether the SQL or C# approach make more sense. I think that suggests that neither Base nor NullableArrays should do anything about this case right now.

Yes, for now the plan is to automatically lift operators only, and lift arbitrary functions inside macros only. The macro semantics would be in line with SQL, which is the only one to make sense IMHO: return NULL when any of the inputs is NULL (cf. JuliaData/DataFrames.jl#1025 (comment)). Other choices are totally arbitrary.

@davidagold R has Nullable in the sense that any boolean value can be NA. So in effect everything is 3VL there.

@davidanthoff
Copy link

The most tricky one is probably whether comparison operators like == and friends should return a Nullable{Bool} or just a Bool. SQL and R opt for the former, C# for the latter.

Could you clarify what you mean by this? I wasn't aware that R has Nullables.

Ah, I was not precise. In R 3==NA returns NA. In C# 3==null returns false (and similiar things hold for <= etc.). For Query.jl, I really want C# like semantics so that my filter clauses stay simple and sane.

@davidanthoff
Copy link

AFAICT they respect 3VL. What do you mean exactly?

Nullable(true) | Nullable{Bool}() returns Nullable{Bool}() right now, and that is not 3VL semantics, it should return Nullable{Bool}(true). Same for the other operators I mentioned, they don't follow the standard value tables of a 3VL.

@davidagold
Copy link
Contributor

davidagold commented Aug 19, 2016

Nullable(true) | Nullable{Bool}() returns Nullable{Bool}() right now, and that is not 3VL semantics, it should return Nullable{Bool}(true). Same for the other operators I mentioned, they don't follow the standard value tables of a 3VL.

This is precisely why I was getting confused. It seems, in some of the discussion above, that there's a conflation between (i) allowing Nullable{Bool}() to be returned from logical operations and (ii) employing 3VL semantics (as codified in truth tables that propagate certain invariants in the face of null values). I think people have been using 3VL to refer to both (i) and (ii) above.

EDIT: Not so much a conflation, but using 3VL to refer to both things.

@davidanthoff
Copy link

We could move lifted operators outside of NullableArrays, but the existence of competing definitions of these methods in the wild will make things confusing. Why not experiment with different semantics inside LINQ macros only as a first step?

That is really, really difficult. The whole idea of Query.jl/LINQ is that it doesn't mess with the expressions that you pass in, it just picks up whatever semantics are defined for all the operators somewhere else. In fact, I don't even know how I could implement a different semantics for these operators in Query.jl for the Enumerable side of things if they are already defined somewhere else, e.g. in NullableArrays. So, short answer, I don't know how I could do that...

@davidagold
Copy link
Contributor

Ah, I was not precise. In R 3==NA returns NA. In C# 3==null returns false (and similiar things hold for <= etc.). For Query.jl, I really want C# like semantics so that my filter clauses stay simple and sane.

Hmm. What if 3 == Nullable{False}() returned Nullable(false)? Does that allow for filter clauses to stay simple and sane? I ask because I'm trying to tease apart whether or not it is the returned value (false versus some sort of NA) or the actually Nullable wrapper that is getting in the way for you.

@johnmyleswhite
Copy link
Member

I think people are forgetting something about SQL semantics: the way WHERE clauses work doesn't intrinsically depend on 3VL, it just depends on including only the rows where the predicate evaluates to true. Whether a given predicate evaluates to true depends on 3VL.

@davidanthoff
Copy link

What if 3 == Nullable{False}() returned Nullable(false)?

I assume you meant 3 == Nullable{Bool}()? That should just be an error, i.e. there shouldn't be a method defined for these type input combinations.

@davidagold
Copy link
Contributor

I assume you meant 3 == Nullable{Bool}()? That should just be an error, i.e. there shouldn't be a method defined for these type input combinations.

Err, I guess I meant 3 == Nullable{Int}() or whatever type the original NA was supposed to be of in your original example.

@davidanthoff
Copy link

Actually, I take that back, I guess we do need that method defined... I think it should just return false, not a Nullable(false).

@garborg
Copy link
Member

garborg commented Aug 20, 2016

Here's the discussion @nalimilan mentioned about providing separate filter and where behaviors: JuliaData/DataFramesMeta.jl#58

@nalimilan
Copy link
Member Author

@davidanthoff Can you provide examples/pointers illustrating why 3VL can be problematic, and why WHERE can be confusing? FWIW, I've just found out that Wikipedia has a great summary of NULL handling in SQL, including criticism: https://en.wikipedia.org/wiki/Null_(SQL)

@davidanthoff
Copy link

That Wikipedia article is a pretty good indication of what I dislike about 3VL: you end up with an endless list of special cases/rules that you just need to memorize in order to understand what is going on. I think control-flow-like constructs like if, while, ?:, &&, || etc. should throw errors when they encounter a null value or a Nullable{Bool}, otherwise you need to define behavior for each of these for null that people need to remember. In my mind filtering also falls into that category, in fact in the julia base syntax for filtering it actually uses a control flow keyword: the if clause in a generator.

In Query.jl, I really don't want the @where clause to have different semantics from the if clause in a julia generator, and I also don't want the if clause in a generator to silently drop rows with null values, but the control flow if to throw an error. I think that kind of inconsistency is a sign of a bad design, I think it is much cleaner and easier if things that ultimately require a boolean value only accept one and error for anything else.

Here is another example. Assume for a second that the @where clause in Query.jl would just drop rows when the filter expression returns a null value, like in SQL. Now assume someone is using the ternary operator on the same column that was used in the filter to do something, i.e. a nullable column. Why should the @where clause just work, but the ternary operator in a @select clause not? But if the ternary operator works in a @select clause for a Nullable, why should it not work in the normal language? I think once you start to handle nulls with anything other than an error in control flow like situations, you either have to handle them everywhere in the language or you just end up with some arbitrary boundaries of where they can be used.

But if you accept the idea that control flow like language constructs shouldn't handle null values, you only have two choices:

  1. you can still have comparison operators return Nullable, but then you need to explicitly convert any Nullable into a Bool before it enters a control flow like construct, i.e. a filter clause. That works and is consistent, but in my mind completely impractible because simple things like a where clause can't be written like i.age > 30 anymore, you have to sprinkle things with isnull etc.
  2. you take the super simple approach of C# and just have comparison operators return Bool. End of story, users have to learn one thing only, namely the semantics of the comparison operators, and then they can deduce what a comparison of Nullables will do in any control flow like situation. And the semantics of the comparison operators as defined by C# are actually quite intuitive, so probably most people don't even have to study them but would just assume them in any case.

@davidanthoff
Copy link

Apart from these discussions, here is what I'm going to do for Query.jl in the meantime: I will define my own versions of the comparison operators that follow the C# semantics. I think that if I first import NullableArrays, and then define them, I should be overwriting the definition in NullableArrays, right?

I would prefer if we could move these lifting things out of both NullableArrays and Query and for the time being have one package that defines the semantics that @nalimilan prefers and one that has the C# semantics. That would enable people to test out both approaches, without packages overriding each other, i.e. one could load one or the other package and see which version works better for Query or jplyr.

In fact, I think the only thing where we still have a disagreement are the comparison operators, right? I'm on board with the lifting for arithmetic operators here, and if the definition of & and | is changed to the standard 3VL semantics, that should also be good from my point of view.

@nalimilan
Copy link
Member Author

I think once you start to handle nulls with anything other than an error in control flow like situations, you either have to handle them everywhere in the language or you just end up with some arbitrary boundaries of where they can be used.

Right, that's why we suggested above that if, && and || should accept Nullable{Bool} too if we choose to return that from comparison operators.

Apart from these discussions, here is what I'm going to do for Query.jl in the meantime: I will define my own versions of the comparison operators that follow the C# semantics. I think that if I first import NullableArrays, and then define them, I should be overwriting the definition in NullableArrays, right?

Right.

I would prefer if we could move these lifting things out of both NullableArrays and Query and for the time being have one package that defines the semantics that @nalimilan prefers and one that has the C# semantics. That would enable people to test out both approaches, without packages overriding each other, i.e. one could load one or the other package and see which version works better for Query or jplyr.

I guess we can do that, but switching DataFrames to use NullableArrays without providing comparison operators either in Base, in NullableArrays or in DataFrames sounds like a impractical path for users. So either we should wait and port DataFrames to NullableArrays only when we have decided on semantcs, or we should provide comparison operators in one of these packages.

In fact, I think the only thing where we still have a disagreement are the comparison operators, right? I'm on board with the lifting for arithmetic operators here, and if the definition of & and | is changed to the standard 3VL semantics, that should also be good from my point of view.

Yes, AFAICT the only remain issue is with comparison operators.

@davidanthoff
Copy link

Right, that's why we suggested above that if, && and || should accept Nullable{Bool} too if we choose to return that from comparison operators.

And what should happen if a null value is passed to these? The one concrete suggestion I've seen is from @StefanKarpinski to throw an error in the case of if. That sounds good to me (I would also throw one for any Nullable{Bool}, but that is actually just a minor side debate). What would you recommend for the others, if a null value is passed? And for while, if in generator statements, the ternary operator and Base.ifelse? I think it would be really helpful to have a concrete proposal on the table on how Nullable{Bool} should interact with all of these.

For now I've defined various lifting things and operators in Query.jl here. In my mind those definitions are the most practical for a LINQ like framework. Having said that I'm happy to be convinced otherwise and hopefully we can just sort out one set of semantics that then can go into Base.

@nalimilan
Copy link
Member Author

I would suggest raising an error with all operations (i.e. if, while, ?, ifelse, && and ||) when a null value is encountered. Indeed, the only other solution would be to adjust the short-circuiting behaviour of && and || to 3VL, but it would mean that e.g. cond() && action() would be different from if cond(); action(); end. Since both idioms are used as equivalents in most of the code base, making them diverge does not sound like a great idea (though not a disaster either...).

I think I'd still like && and ||to return Nullable{Bool}, even if they would never return null values (only Nullable(true) and Nullable(false)). This would allow LINQ-like macros which would want to accept nulls with these operations instead of raising an error to remain very close to standard Julia semantics: the only change would be that what is an error in standard Julia (passing null) would be accepted there (and return null where applicable).

How does that sound?

@davidanthoff
Copy link

I would suggest raising an error with all operations (i.e. if, while, ?, ifelse, && and ||) when a null value is encountered.

Cool, so we also agree on that!

So in your proposal (control-flow-like-things throw an error if a null value comes up, comparison operators return a Nullable{Bool}), this would throw an error (*)?

a = Nullable{Int}[1,2,Nullable{Int}(),4]
collect(i for i in a if i>2)

Note that this kind of filtering works both in dplyr and in SQL without an error.

(*) I'm using generator expressions as a proxy for a more general query facility. I think all the issues can be discussed in the context of generator syntax, so no need to rope in Query.jl or jplyr syntax. The above query would be equivalent to the following Query code:

@from i in a begin
@where i>2
@select i
@collect

And yes, I should probably think about a somewhat shorter syntax for that ;)

I think I'd still like && and || to return Nullable{Bool}, even if they would never return null values (only Nullable(true) and Nullable(false)).

If both of these operators are guaranteed to never return a null value, why not make things simple and have them return a Bool? Returning Nullable{Bool} just makes things complicated: now the downstream code of these operators needs to still be able to deal with null values, even though by definition they will never see one. That seems a really unnecessary complication of code...

This would allow LINQ-like macros which would want to accept nulls with these operations instead of raising an error to remain very close to standard Julia semantics: the only change would be that what is an error in standard Julia (passing null) would be accepted there (and return null where applicable).

So this is something I definitely don't want to do in Query: define different semantics for things like || or && inside queries from the normal language semantics. The reason is very simple: you can call an arbitrary function in a filter clause, and I would want the following two code constructs to have exactly the same behavior:

@from in in df begin
@where i.age>18 && i.age<65
@select i
@collect
end

and

middleage(age) = age>18 && age<65

@from in in df begin
@where middleage(age)
@select i
@collect
end

@nalimilan
Copy link
Member Author

So in your proposal (control-flow-like-things throw an error if a null value comes up, comparison operators return a Nullable{Bool}), this would throw an error (*)?

Yeah, unless you write it inside a LINQ-like macro. If at some point we agree on semantics, we would still be able to allow this later without breaking anything.

So this is something I definitely don't want to do in Query: define different semantics for things like || or && inside queries from the normal language semantics. The reason is very simple: you can call an arbitrary function in a filter clause, and I would want the following two code constructs to have exactly the same behavior:

Well, obviously we have to choose between making things more convenient inside a macro, and not altering in any way the standard Julia semantics. Anyway, it's fine to have several packages with different approaches. The point of making && and || return Nullable{Bool} is precisely to allow different behaviours in macros while still keeping the same common basis when no null is present.

@davidanthoff
Copy link

Yeah, unless you write it inside a LINQ-like macro. If at some point we agree on semantics, we would still be able to allow this later without breaking anything.

Ok, so lets assume for a second that

collect(i for i in a if i>3)

throws an error, while

@from i in a begin
@where i>3
@select i
@collect

doesn't, but presumably drops rows where i has a null value, right? I'm not a fan because I think those two things should do exactly the same thing. But for arguments sake, lets assume that is the solution for a moment.

What would you suggest for the following query for null values?

@from i in a begin
@select i > 3 ? "Something" : "Something else"
@collect

Well, obviously we have to choose between making things more convenient inside a macro, and not altering in any way the standard Julia semantics.

I disagree, there is a third choice: we can pick the C# semantics for comparison operators and a) not alter standard julia semantics and b) have a convenient solution in LINQ like macros. And in fact in that case we would have the same semantics for all of these things inside LINQ macros and outside, which seems a huge win in terms of consistency and simplicity of the language/library.

Anyway, it's fine to have several packages with different approaches.

I agree with that for a certain time, but at some point we should have one approach in Base. After all, both NullableArrays and Query are now defining lots of methods for types that are all defined in Base and override each other, which seems not like a good long term situation...

@nalimilan
Copy link
Member Author

What would you suggest for the following query for null values?

That's really up to you. My preferred choice would be to return null, but I know you don't like that. From what you said above, I guess it would treat nulls as false, and return "Something else".

I disagree, there is a third choice: we can pick the C# semantics for comparison operators and a) not alter standard julia semantics and b) have a convenient solution in LINQ like macros. And in fact in that case we would have the same semantics for all of these things inside LINQ macros and outside, which seems a huge win in terms of consistency and simplicity of the language/library.

We're going over the discussion from above once again. C# semantics are a possibility, but opting for them now in Base would make it harder to experiment with other choices in macros.

I agree with that for a certain time, but at some point we should have one approach in Base. After all, both NullableArrays and Query are now defining lots of methods for types that are all defined in Base and override each other, which seems not like a good long term situation...

If we were to keep diverging semantics in different packages, we would have to implement these operations without defining methods which leak to other packages (i.e. let macros replace some calls with calls to internal functions).

nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Aug 29, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Aug 31, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Aug 31, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Sep 1, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Sep 1, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Sep 1, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
@nalimilan
Copy link
Member Author

Let's assume for a moment that we change ==(::Nullable, ::Nullable) to return a Bool. Then, for consistency with the treatment of NaN, we need to have Nullable() == Nullable() -> false. This is different from C#, but some advocate that this choice would have been more consistent even there as it is probably justified by history rather than logic.

With these semantics, comparing two NullableArrays will return false if they contain at least one null value (just like what happens when an array contains a NaN). This behavior will also propagate to comparing DataFrames. Is this what we want? Just trying to derive the consequences of all possible choices.

@davidanthoff
Copy link

I think here is an argument for the C# implementation that is not based on history for ==: I think ideally we would have == and isequal have the same semantics so that users don't have to flip back and forth between two different types of equality. And then one way to think about this is in the context of @group and @join clauses in something like Query. I think ideally groupings would occur using the same semantics that is used for == and isequal, i.e. if that comparison returns true, rows are grouped. If == were to return false for two null values, each row with a null value would become its own group, which is probably not what we would want (caveat, I didn't actually check what SQL or dplyr do on this...). For joins I guess a similar logic might apply, although I'm also not sure right now how SQL and dplyr handle that case. But at least things would be relatively predicable, i.e. the semantics of == and isequal would be the same and one could essentially figure out what @group and @join would do without necessarily resorting to a doc (i.e. they would just pick up the comparison semantics).

Plus, I guess the semantics for comparing NullableArrays and DataFrames would also make more sense?

We would lose consistency with NaN, but that seems less important to me.

Right now I also have the following in Query:

const null = Nullable{Union{}}()

and then all the operators that allow you to essentially write null checks like
a==null. I'm not really certain that is a good idea, but it is a pretty simple syntax and really very clear...

I think the real inconsistency in the C# spec stems from null>=null being false, butnull==nullbeingtrue``... I think Eric Lippert pointed out quite rightly in the link you posted that there are some compromises one will have to make, certainly I don't have a better idea for that...

@nalimilan
Copy link
Member Author

In Julia, you already need to use isequal to test for containment; any function operating on arbitrary Julia vectors (including macros to work with data) needs to use it too: else any NaN would get its own group. So doing the same with Nullable wouldn't impose a big cost on packages.

At least, returning a Nullable{Bool} would have the advantage of not making semi-arbitrary decisions like these without giving the user a chance of choosing the most appropriate semantics.

@davidanthoff
Copy link

This is not about the mental burden for package developers but for "normal" users. If isequal and == have different semantics for a given type, users need to know and understand that and know and understand which is used when in packages like Query. A system where the semantics for both just coincide more often seems much easier to understand.

At least, returning a Nullable{Bool} would have the advantage of not making semi-arbitrary decisions like these without giving the user a chance of choosing the most appropriate semantics.

I think what it comes down to is this: yes, the rules for == would be somewhat arbitrary if it returns a Bool. But if it returns a Nullable{Bool} we need an arbitrary rule in many, many more places downstream in packages like Query. I'd prefer to have only one arbitrary choice around ==, but then at least the behavior of pretty much the entire rest of the system can just be deduced by users. Plus, the C# semantics for == are actually not that bad...

@nalimilan
Copy link
Member Author

nalimilan commented Sep 4, 2016

I've done a small review of choices made by other languages, and indeed they don't seem to bother about the inconsistency between NaN and NULL. All languages I surveyed which support == for the equivalent of NULL consider two NULL values as equal. In Python, None == None -> True (though using is is recommended). In Ruby and Swift, nil == nil -> true. In C++, C#, Java, JavaScript and Kotlin, null == null -> true.

Only SQL, R and SAS return the equivalent of Nullable{Bool}, and there NULL == NULL -> NULL.

So I guess we should go that way, given that it will be the less surprising choice for users (unless of course we decide to return a Nullable{Bool} à la R).

nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Sep 22, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
maximerischard pushed a commit to maximerischard/DataFrames.jl that referenced this pull request Sep 28, 2016
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
nalimilan added a commit to JuliaData/DataFrames.jl that referenced this pull request Jul 8, 2017
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants