-
Notifications
You must be signed in to change notification settings - Fork 50
Port to Nulls.jl #288
Port to Nulls.jl #288
Conversation
This replaces NA with Nulls.null and NAtype with Nulls.Null. The only breaking change is that null == null gives true, while NA == NA gave NA (same for other comparison operators).
Base.broadcast{T}(::typeof(isna), a::AbstractArray{T}) = | ||
NAtype <: T ? BitArray(map(x->isa(x, NAtype), a)) : falses(size(a)) # -> BitArray | ||
|
||
# FIXME: type piracy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a way to remove this without hurting performance, and since the point of porting DataArrays is to preserve performance... Maybe it's OK to keep this for now.
Another solution would be to move this definition to Nulls.jl, and only implement here the DataArray
method. But I'm not sure we want to support this kind of API in the long term.
end | ||
|
||
# All elementary functions return NA when evaluating NA | ||
# All elementary functions return null when evaluating null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the most significant behavior change. See also below.
Turns out we'd better use the current |
This has just been changed in Nulls.jl.
I've added a commit to keep the current logic in which |
test/nas.jl
Outdated
@test collect(each_dropna(dv)) == a | ||
@test collect(each_replacena(dv, 4)) == [4, 4, a..., 4] | ||
|
||
@testset "promotion" for (T1, T2) in ((Int, Float64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed these because I couldn't get them to work in the general case, and these should live in Nulls anyway. Cf. JuliaData/Missings.jl#23 and #280 (comment).
@andreasnoack Opinions? I've tried a few operations mixing DataArrays from this PR and DataFrames master, and everything seems to work fine. |
At this point, this is just a renaming and migrating out the package, right? No semantical changes. Then it should be fine. |
Yes, mostly. The only significant change is the More systematic testing with DataFrames master wouldn't hurt though (once this PR has been merged). |
Hm. Do tests pass locally? I'd expect them to fail without the promotion definitions. Also, do you need to add |
Yes, they pass locally, but you need Nulls.jl master. I hope we can tag a release soon. |
You'll also need to add it to |
Yes but we need to depend on an unreleased version anyway. |
Why can't we make a new release of |
I'm working on it. The problem is that it breaks DataStreams and CategoricalArrays. So better tag compatible releases first (there's one for the former, I'm preparing one for the latter). |
Now works with Nulls 0.1.0.
Tests passed on 0.6 now that Nulls 0.1.0 has been tagged. Good to merge? |
Do you know why it is failing on nightly? It looks related to this change. |
Actually I hadn't even looked at nightlies since they are so broken. The package doesn't load in a more recent version. But looks like as easy fix, so let's try while it's fresh in my mind. I've added a commit. |
src/operators.jl
Outdated
@@ -213,7 +208,7 @@ end | |||
# Treat ctranspose and * in a special way | |||
for (f, elf) in ((:(Base.ctranspose), :conj), (:(Base.transpose), :identity)) | |||
@eval begin | |||
$(f)(::NAtype) = NA | |||
$(f)(::Null) = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, I've missed this one apparently.
Rounding and transpose operations have been moved to Nulls, functions from SpecialFunctions will have to be handled manually as we don't want Nulls to depend on SpecialFunctions and keeping them in DataArrays would be type piracy.
Nighties have been updated, so now we get a crash there. :-/ That's an "Illegal instruction" just like in DataFrames master, so probably not worth worrying about for now. I've also removed a few lifted methods which should be defined in Nulls instead, see JuliaData/Missings.jl#42. That means we'll have to tag another Nulls release... |
Now that Nulls 0.1.1 has been tagged and that CI passes on 0.6, should be ready to merge? |
…hDropNull) This allows collect(Nulls.skip(x)) to be equivalent to the old dropna(x) for DataArray, but more generic. Also unexport the iterators, which are an implementation detail.
After doing a bit more cleanup and adding another deprecation, tests from current master are passing with the implementation from that PR (with a handful of changes were needed). So breakage should be minimal. The only functions I removed without a deprecation are those which would be type piracy: I haven't renamed the |
Sounds good. What were the problematic |
|
Ah, got it. |
Previously, convert() would create a DataArray{Union{T, Null}} containing nulls which were not considered as missing, giving weird results in particular with equality tests. Constructing a DataArray{T} from an Array{Union{T, Null}} is tricky because the field must be an Array{T}, yet conversion will fail in the presence of nulls. Therefore, we need to allocate a copy and leave the null entries uninitialized.
m[i] = true | ||
# if input array can contain null values, we need to mark corresponding entries as missing | ||
if eltype(d) >: Null | ||
# If the original eltype is wider than the target eltype T, conversion may fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj This is a situation where it would be useful to be able to access the underlying data of a Union{T, Null}
array, unsafely reinterpreting it as an Array{T}
. Since DataArray
stores the missingness mask separately, we are certain we won't access elements which are null
. Is there a hacky way to do that currently? Would it make sense to provide an API to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's just a type-assert (using the eltype parameter from DataArrays)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how can I convert an Array{Union{T, Null}}
to an Array{T}
without copying in the presence of null
entries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like something reinterpret should be allowed to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it returns a ReinterpretArray
now, so that would require allowing the field to be any `AbstractArray, right? Not the end of the world, probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vtjnash Am I right?
how are things looking @nalimilan? |
I think that's mostly good. I'd like to define I'm also checking that old DataArrays-based DataFrames tests pass with the new Nulls-based DataFrames (modulo a few reasonable changes where needed). So far it mostly works, but that process allowed me to catch some bugs and problematic changes (like JuliaData/DataFrames.jl#1249), so I'd rather finish it before merging this PR. |
Sounds great; thanks for doing all this. |
This prevents a conflict with CategoricalArrays.
I think this should be ready for a final review. With this branch, DataFrames tests with |
""" | ||
function levels(da::DataArray) # -> DataVector{T} | ||
unique_values, firstna = finduniques(da) | ||
function Nulls.levels(da::DataArray) # -> DataVector{T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guess the old docs here got moved somewhere else? Or they just use the docs in Nulls.jl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this method just does the same thing as the fallback in Nulls.
end | ||
|
||
# ambiguity | ||
@swappable (==)(::NAtype, ::WeakRef) = NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have this in Nulls, I'm guess it's not that important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice I doubt it's useful. These were probably implemented at the time ambiguity warnings were printed when loading the package. Though we could enable ambiguity checks in Nulls and fix this kind of situation (there might be others).
DataArray(Array{T,N}(size(b)), trues(size(b))) | ||
@dataarray_binary_scalar(/, /, nothing, false) | ||
|
||
for f in [:(Base.maximum), :(Base.minimum)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem like weird definitions on scalars anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's because numbers are iterable. We could define these in Nulls since that would be consistent with the goal of accepting null
everywhere Number
is, but we can also wait until somebody complains so that we're certain it's useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally had a chance to review all this. Looks great!
NAException, | ||
NAtype, | ||
padna, | ||
padnull, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left this method here, but we could define it in Nulls if that's really useful. No hurry, though.
This replaces
NA
withNulls.null
andNAtype
withNulls.Null
.The only breaking change is that
null == null
givestrue
, whileNA == NA
gaveNA
(same for other comparison operators).The goal of this PR is to make DataArrays compatible with the new Nulls-based DataFrames, so that they can continue to be used until
Array{Union{T, Null}}
is efficient enough, as discussed in JuliaData/DataFrames.jl#1209. The port should be relatively easy for users, since everything goes though relatively simple deprecations.The most/only breaking change concerns comparison operators (as noted above). However it shouldn't be too painful either since in general it means that code which would previously fail in the presence of
NA
s will now work. The only common situation in which there will be a difference is when user code replacedNA
s resulting from a comparison manually. Nevertheless, I think we should discuss (in Nulls rather than here) these semantics carefully before committing to them and forcing users to adapt their code.Tests will fail until we have a release of Nulls.jl including JuliaData/Missings.jl#30, JuliaData/Missings.jl#31 and JuliaData/Missings.jl#32.