We need more Nullable types for promotion and conversion #95
Comments
I don't really think this is the right way to go. One of the benefits of Nullables is that they force explicit handling of missing values by preventing things like This restriction, of course, means that Nullables are much less abstract than R's solution to missing values, but I see that as a strength rather than a weakness. R's approach makes sense when you're working at a much higher level of abstraction than I wanted NullableArrays to live at. For me, something like |
That's the eternal debate regarding the semantics of |
@tshort can you give a specific use case where this is giving you grief? |
@davidagold, my main concerns are clumsiness associated with use of individual Nullables. This could arise when indexing into a NullableVector or operating on a DataFrame row by row. Here are some examples that I think are clumsy:
x = NullableArray([1,2,3])
x[1] = 9 # works
## None of the following work
x[1] += 1
2 * x
log(x)
log(x[1])
f(x) = x + 1
f(x[1])
g(x,y) = x + y + 1
g(x[1], x[2])
g(1, x[2])
g(x[1], 2)
Here's what I think I need to get the broken statements to work:
x = NullableArray([1,2,3])
x[1] = 9 # works
## None of the following work
x[1] += Nullable(1)
# 2 * x ## I can't figure this one out
## Nullable(2) * x # doesn't work
map(z -> 2 * z, x, lift=true) # best way?
map(log, x, lift=true)
isnull(x[1]) ? Nullable{Float64}() : Nullable(log(get(x[1])))
f(x) = x + 1
f(x::Nullable) = x + Nullable(1)
f(x[1])
g(x,y) = x + y + 1
g(x::Nullable,y) = g(x,Nullable(y))
g(x,y::Nullable) = g(Nullable(x),y)
g(x::Nullable,y::Nullable) = x + y + Nullable(1)
g(x[1], x[2])
g(1, x[2])
g(x[1], 2)
Having to write |
You can also do a lot just by defining promotions and conversions. Here is a bit of code that gets almost all of my first set of examples to work:
Base.promote_rule{T <: Number}(::Type{Nullable{T}}, ::Type{T}) = Nullable{T}
Base.promote_rule{T <: Number, V <: Number}(::Type{Nullable{T}}, ::Type{V}) = Nullable{promote_type(T,V)}
# Convert a value of type T (not the type itself) to a Nullable{T}
Base.convert{T}(::Type{Nullable{T}}, x::T) = Nullable(x)
# Lift binary operators: a null operand gives a null result
for op in (:+, :-, :*, :/, :^, :%, :.*)
    @eval function Base.$op(x::Nullable,y::Nullable)
        if isnull(x) || isnull(y)
            # promote_type takes types, so pass typeof(x) and typeof(y)
            return promote_type(typeof(x), typeof(y))()
        else
            return Nullable($op(get(x), get(y)))
        end
    end
    @eval Base.$op(x::Nullable,y::Number) = $op(x, Nullable(y))
    @eval Base.$op(x::Number,y::Nullable) = $op(Nullable(x), y)
end
# Lift unary functions the same way
for op in (:sin, :log)
    @eval function Base.$op{T <: Number}(x::Nullable{T})
        if isnull(x)
            # Nullable{T}() is the empty Nullable; Nullable(T) would wrap the type itself
            return Nullable{T}()
        else
            return Nullable($op(get(x)))
        end
    end
end
The downside is that a Anyway, I get the pushback, so feel free to close with a "wontfix" label. |
Using promotion more is an interesting approach. It just feels like it opens you up to lots of missing cases. Why should My personal feeling is that we should nail the low-level semantics and then get some syntactic sugar for doing stuff like |
It seems like you've already started down the path with math operations in https://github.com/JuliaStats/NullableArrays.jl/blob/master/src/operators.jl. |
Also, #85 addresses similar issues (missed that). |
Okay, finally have a bit of time to address these points. Over the summer I developed a "lift" macro
julia> macroexpand(:( @^ f(x, y) + g(x, h(z)) Int ))
quote # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 24:
if (isnull(y) || isnull(z)) || isnull(x) # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 25:
Nullable{Int}()
else # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 27:
Nullable(f(get(x),get(y)) + g(get(x),h(get(z))))
end
end
A very simple performance test shows that the macro performs comparably to defining a "semi-lifted method" -- i.e. a method designed to handle a mixture of
using NullableArrays
srand(1)
A = rand(5_000_000)
B = rand(5_000_000)
M = rand(Bool, 5_000_000)
X = NullableArray(A)
Y = NullableArray(A, M)
Z = similar(X)
@inline function _g(b::Float64, y::Nullable{Float64})
if y.isnull
return Nullable{Float64}()
else
return Nullable(b * y.value)
end
end
function f(Z, B, Y)
for i in eachindex(Z)
Z[i] = @^ B[i] * Y[i] Float64
end
end
function g(Z, B, Y)
for i in eachindex(Z)
Z[i] = _g(B[i], Y[i])
end
end
f(Z, B, Y);
f(Z, X, Y);
g(Z, B, Y);
@time f(Z, B, Y)
@time f(Z, X, Y)
@time g(Z, B, Y)
yields
0.042598 seconds (4 allocations: 160 bytes)
0.064166 seconds (4 allocations: 160 bytes)
0.050128 seconds (4 allocations: 160 bytes)
If this sort of thing would be useful, I'll continue to flesh it out and make it available via something like a NullableUtilities package. Right now there's rudimentary support for control flow, but it's a bit buggy and questions remain about what should be the specified behavior. Should we assume that calls in the condition and the body ought to be lifted? Just the body? Also, there's currently only support for lifting functions over variable arguments and no support yet for doing something like |
I like the idea of There are cases where lifting may be confusing. For some code, the generic lifting won't work right. Consider this:
f(x,y) = x > 3 ? 4x : -y
When lifted, if either x or y is null, it'll return null, but if x is greater than three, it shouldn't return null even when y is null. Another issue is that because lifting is done as a macro, you can't use code specialized for Nullable numbers. Well, you can, but you have to make sure you don't lift expressions that include methods specialized for Nullable numbers. The combination of these issues may make it hard to write generic code that works with regular numbers and Nullable numbers (and their array equivalents). |
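To make the conditional concern concrete, here is a minimal sketch. It is written for a current Julia (1.x syntax) rather than the 0.4/0.5 syntax used elsewhere in this thread, and it does not use the real Nullable type or the lift macro: `Opt`, `isnull`, `unwrap`, `lift` and `f_branchwise` are all hypothetical names defined only for this illustration.

```julia
# `Opt` is a hypothetical stand-in for Nullable, defined here so the sketch is self-contained.
struct Opt{T}
    hasvalue::Bool
    value::T
    Opt{T}() where {T} = new{T}(false)       # empty (null) value
    Opt{T}(v) where {T} = new{T}(true, v)    # wrapped value
end
Opt(v::T) where {T} = Opt{T}(v)

isnull(x::Opt) = !x.hasvalue
unwrap(x::Opt) = x.value   # only meaningful when !isnull(x)

# Generic lifting: null if any argument is null, regardless of what f does with it.
lift(f, ::Type{R}, args::Opt...) where {R} =
    any(isnull, args) ? Opt{R}() : Opt{R}(f(map(unwrap, args)...))

f(x, y) = x > 3 ? 4x : -y

# Hand-written version that only looks at y on the branch that actually uses it.
function f_branchwise(x::Opt{Int}, y::Opt{Int})
    isnull(x) && return Opt{Int}()
    unwrap(x) > 3 && return Opt(4 * unwrap(x))
    return isnull(y) ? Opt{Int}() : Opt(-unwrap(y))
end

x, y = Opt(5), Opt{Int}()
isnull(lift(f, Int, x, y))     # true: generic lifting returns null because y is null
unwrap(f_branchwise(x, y))     # 20: y is never consulted when x > 3
```

The two versions disagree exactly when x is greater than three and y is null, which is the case described above.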
These are definitely good details to point out. I don't expect to produce a macro that can single-handedly lift every possible expression to every possible end. But I do think it should be possible to cover a range of the most common patterns. I think the macro can be made to lift smartly over conditionals such as in your first concern. It seems the second concern requires community standards to minimize the sorts of situations in which one needs to lift Nullable-specialized methods over mixed or entirely non-Nullable arguments. I'm having trouble envisioning the cases in which this would really come up. It would have to be a case in which one only has a Nullable-specialized method. When does that occur? |
A Nullable-only use case is likely to be rare. One I can think of is a method that does sampling-type replacements on a Nullable number or array. I'm curious how you plan to support the conditional case. I can envision how it could work if the conditional is in the lifted expression. If it's in a function called by the lifted expression, I don't see how that would work. |
There would be no general way to make this work if the conditional were hidden inside a method body. I don't think this is a drawback. The case you've described seems essentially like that of three-valued logic. It's not clear that being able to call on such semantics hidden within a method body would be a good thing. The standard semantics for Nullables that we've adopted are that if an expression depends on Nullable objects, then nullity in any of those objects returns an empty Nullable. We probably oughtn't encourage people to rely on alternative semantics unless they are plainly visible. I think discussions about how best to provide lifting facilities need to be accompanied by discussions about what we expect/intend users actually to be doing with data structures built on top of NullableArrays. And this in turn requires some insight into the direction that DataFrames, DataStreams and friends will be taking in the near future. @quinnj @simonbyrne |
@tshort In case you're still interested, I've thrown up a working prototype for the lift macro: https://github.com/davidagold/NullableUtils.jl. Please do let me know if you find it helpful, or if you can think of other utilities/features for |
Good to see some move in this area! I wonder about the return type though: since not all functions return the same type as their inputs (which might even be of different types), maybe return |
The user specifies the type parameter
julia> @lift f(x, y)
ERROR: MethodError: no method matching @lift(::Expr)
Closest candidates are:
@lift(::ANY, ::ANY)
in eval(::Module, ::Any) at ./boot.jl:267
However, I could make it so that |
The consequences are still pretty bad inside of a tight loop. More changes have to happen in the compiler to get that stuff to work well. |
Aye, that's what I thought. So we can add this to the list of decisions about what Julia statistics functionality ought to do automatically for the user (removal of nulls, etc.). Speaking of compiler changes, how far away is |
Sadly I have no idea. |
Is that sort of optimization even possible given the fundamental ambiguity of each element's type? |
@datnamer Simon Byrne's latest post in https://groups.google.com/forum/#!topic/julia-stats/29l5yA87Qss suggests there are strategies that may be able to handle this. |
Interesting, thanks for the pointer. |
Nullable{Float64} doesn't act like a proper number. You can't do 3.5 + Nullable(4). You also can't do log(Nullable(4.0)). The rest of the Julia ecosystem has great conversion and promotion, and Nullables should, too. The existing lift mechanism for NullableArrays is also kludgey because we don't have promotion.
The main problem is that you can't have both of the following. I haven't heard of any planned enhancements to the type system that would allow that sort of relationship.
Given that, I think we need more Nullable types, so each type can fit in at appropriate places in the type hierarchy to allow promotion and conversion, so 3.5 + NullableInt(4) == NullableFloat{Float64}(7.5).
I'm running into the same quandary in the PooledElements package. But there, I think there's less expectation that the user needs to perform arbitrary numerics on a PooledElementArray.
The big drawback is the amount of code needed to implement this. The other is deciding how far to take this concept.
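As a rough illustration of the proposal, here is a minimal sketch of a nullable type that sits inside the numeric hierarchy. It is written for a current Julia (1.x syntax), and `NullableReal` and `nullreal` are hypothetical names invented for this example; they are not the NullableInt or NullableFloat types mentioned above, and nothing here is part of NullableArrays. Only `+` is covered, which hints at how much code a full version would need.

```julia
# Hypothetical type for illustration only: a nullable number that is itself a Real,
# so the generic Number fallbacks in Base can promote it.
struct NullableReal{T<:Real} <: Real
    hasvalue::Bool
    value::T
end
NullableReal(v::T) where {T<:Real} = NullableReal{T}(true, v)
# Null value; the stored zero is a placeholder and carries no meaning.
nullreal(::Type{T}) where {T<:Real} = NullableReal{T}(false, zero(T))

# Promotion: mixing with a plain Real (or another NullableReal) widens the element type.
Base.promote_rule(::Type{NullableReal{T}}, ::Type{S}) where {T<:Real,S<:Real} =
    NullableReal{promote_type(T, S)}
Base.promote_rule(::Type{NullableReal{T}}, ::Type{NullableReal{S}}) where {T<:Real,S<:Real} =
    NullableReal{promote_type(T, S)}

# Conversion: wrap plain numbers, re-wrap NullableReals with a new element type.
Base.convert(::Type{NullableReal{T}}, x::NullableReal{T}) where {T<:Real} = x
Base.convert(::Type{NullableReal{T}}, x::NullableReal) where {T<:Real} =
    NullableReal{T}(x.hasvalue, convert(T, x.value))
Base.convert(::Type{NullableReal{T}}, x::Real) where {T<:Real} =
    NullableReal{T}(true, convert(T, x))

# Same-type addition; mixed-type calls reach this method through promotion.
Base.:+(x::NullableReal{T}, y::NullableReal{T}) where {T<:Real} =
    NullableReal{T}(x.hasvalue && y.hasvalue, x.value + y.value)

r = 3.5 + NullableReal(4)     # r.hasvalue == true, r.value == 7.5
n = 3.5 + nullreal(Int)       # n.hasvalue == false
```

With the type declared as a subtype of Real, the mixed call 3.5 + NullableReal(4) is handled by Base's generic promoting fallback for numbers, so no per-operator methods for mixed arguments are needed in this sketch.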