Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Arithmetic operators on Nullable #111

Closed
bkamins opened this issue May 28, 2016 · 16 comments
Closed

Arithmetic operators on Nullable #111

bkamins opened this issue May 28, 2016 · 16 comments

Comments

@bkamins
Copy link
Contributor

bkamins commented May 28, 2016

The line:
https://github.com/JuliaStats/NullableArrays.jl/blob/master/src/operators.jl#L37
has the following consequences:

a) types that are not bits throw error when operated on:

x = Nullable(1)
x+x # works
x = Nullable(BigInt(1))
x+x # error

This probably should be fixed.

b) operation on a missing of type that is not bits fails

x = Nullable(1.0)
y = Nullable()
x+y # error
x+convert(Nullable{Int64}, y) # works as addition of Int64 and Float64 is possible

I am not sure if this should be fixed. Implementation convert for Nullable implies that missing value of Nullable{T} can be always converted to Nullable{S} independent on what T and S are but I am not sure if this conversion should be automatically performed in operations on nullables.

@nalimilan
Copy link
Member

Thanks for looking into this. We should fix it in Julia Base as part of JuliaLang/julia#16080.

I guess the isbits restriction was made to get rid of a branch, as it allows using + even when the value is missing. But we should definitely support other types as well. Automatic promotion should also be supported.

@johnmyleswhite
Copy link
Member

This kind of change needs to be made with some caution. As is, the code is already unsafe since we're implicitly assuming that it's always safe to operate on the raw bits associated with a value: you can create nonsensical immutables this way and it's not impossible to imagine scenarios in which our current strategy introduces bugs. We will almost surely introduce segfaults if we remove this restriction without substantial code changes -- but those code changes could prohibit SIMD optimizations, so care will be required.

The current approach is a way of working around the lack of return types. Making these things both fast and safe requires that we know what type of nullable we should return even when the arguments are nulls.

@nalimilan
Copy link
Member

This kind of change needs to be made with some caution. As is, the code is already unsafe since we're implicitly assuming that it's always safe to operate on the raw bits associated with a value: you can create nonsensical immutables this way and it's not impossible to imagine scenarios in which our current strategy introduces bugs. We will almost surely introduce segfaults if we remove this restriction without substantial code changes -- but those code changes could prohibit SIMD optimizations, so care will be required.

You mean that adding anything to the else branch would break SIMD? I wonder whether LLVM shouldn't be able to perform loop unswitching here.

The current approach is a way of working around the lack of return types. Making these things both fast and safe requires that we know what type of nullable we should return even when the arguments are nulls.

Isn't this fixed with JuliaLang/julia#16432? Anyway, I don't understand why this is needed: can't we just return Nullable{Base.promote_op($op, S1, S2)}()?

@johnmyleswhite
Copy link
Member

You mean that adding anything to the else branch would break SIMD?

I mean that this line (

Nullable($(f)(x.value, y.value), x.isnull | y.isnull)
) is unsafe in general because you can't safely apply a function to arbitrary bits the way we're doing. To be safe, you really need to do something that looks more like:

if !(x.isnull || y.isnull)
    return Nullable(do_something(x.value, y.value))
end

But the introduction of that branch will often prevent SIMD operations. (It definitely prevented SIMD when we wrote that code, although it's conceivable that some compiler magic would resolve that.)

Isn't this fixed with JuliaLang/julia#16432?

I'm hopeful that we'll be able to do a better job now that 16432 has landed. I haven't had time to try, but this is one of the things I hope David and I will look into this summer.

Anyway, I don't understand why this is needed: can't we just return Nullable{Base.promote_op($op, S1, S2)}()?

That's a very, very good point. I was trying to push for maximum generality, but for the core arithmetic operations promote_op should be sufficient.

@nalimilan
Copy link
Member

nalimilan commented Jun 2, 2016

I tried to benchmark the various solutions, but hit JuliaLang/julia#16709 on 0.5. On 0.4.5, the safest/most general version branching on isnull (f3) is only slightly (though noticeably) slower, and a version with a fast path on isbits (f2) is as fast as the current one (since type inference does its job). So it looks like we could at least get a general version working.

In all cases, SIMD instructions appear to be enabled (though my CPU is a bit old, so I may not be able to see more dramatic speed gains offered by e.g. AVX). EDIT: actually it's never enabled. I only get SIMD instructions in testref when removing the | operation and only on 0.5 -- we need to investigate this.

See https://gist.github.com/nalimilan/94a3dc790bc592e8b2b561d9dcca9a9b

@nalimilan
Copy link
Member

@johnmyleswhite @davidagold In which cases is SIMD enabled with the current definitions? In my tests it isn't, which makes it hard to check for possible regressions.

Anyway, a reasonable strategy seems to be f2 from my Gist: keep the current SIMD-friendly path for isbits types, and add a slower branch for other types. This should offer the best of both worlds. We could also increase safety by checking for a (dummy but explicit name) can_run_op_even_when_missing(T, op) trait instead of isbits; that trait would be defined by default for all standard numeric types.

How does that sound? I'd like to get as many functions as possible into Julia 0.5 so that we're not blocked for another cycle.

@davidagold
Copy link
Contributor

I don't have any immediate objections to the above, but I also haven't done much in-depth thinking about this issue. I'm landing at MIT next Monday, so that should be the start of when I can dedicate my full time and energy to JuliaStats projects.

@nalimilan
Copy link
Member

Great to hear that, I had no idea. What are your plans? Would you be willing to try moving the few essential Nullable features from NullableArrays to Base before 0.5?

@nalimilan
Copy link
Member

OK, JuliaLang/julia#16709 is fixed now, so the most general strategy with a fast path for isbits types (f2) works with only a 5% penalty compared with the current implementation from NullableArrays (f1, limited to isbits types).

@davidagold
Copy link
Contributor

@nalimilan Plans are still evolving, but they seem to be settling on developing data manipulation facilities for DataFrames. Which features in particular are you thinking of? And when is 0.5 expected to land?

@nalimilan
Copy link
Member

I don't know when 0.5 is supposed to be released, but I thought it was going to happen around JuliaCon. As regards improving the data management facilities, I think getting arithmetic operators for Nullable into Base is a priority. Support for automatic lifting via a macro would also be very useful: in particular, it could be used inside DataFramesMeta macros to make it easy to work with data.

@davidagold
Copy link
Contributor

I'd like to work on the automatic lifting. Still not sure where it should live. But that's a discussion for elsewhere.

For now, do you envision moving everything in operators.jl to Base and modifying the definitions a la your f2 in the gist you provided?

@nalimilan
Copy link
Member

I'd like to work on the automatic lifting. Still not sure where it should live. But that's a discussion for elsewhere.

Probably in a package for experimentation (possibly in DataFramesMeta), but IMHO it could live in Base once we've settled on a design. That sounds like essential functionality.

For now, do you envision moving everything in operators.jl to Base and modifying the definitions a la your f2 in the gist you provided?

Yes, that's my plan, modulo what I noted at #116 (comment). What do you think of it?

@davidagold
Copy link
Contributor

I'm still working to grok all the relevant issues, including SIMD-ness. In general, I am assessing proposed changes in terms of the amount of progress they would allow and the difficulty in reverting them later. These operations should be in Base, and should be defined for a (limited) set of non bits types. If we decide that this particular implementation is suboptimal, then it can probably be patched in a sub-release of 0.5, as long as doing so doesn't change anything user-facing --- at least that's my understanding. Given these considerations, I'm in favor of this plan. Pinging @johnmyleswhite for his thoughts on such reasoning.

@nalimilan
Copy link
Member

Yes, I think the semantics are well defined now, so we can always improve the implementation later.

@nalimilan
Copy link
Member

Fixed by #119.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants