Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Port to Nulls.jl #288

Merged
merged 19 commits into from
Oct 19, 2017
Merged

Port to Nulls.jl #288

merged 19 commits into from
Oct 19, 2017

Conversation

nalimilan
Copy link
Member

This replaces NA with Nulls.null and NAtype with Nulls.Null.
The only breaking change is that null == null gives true, while NA == NAgave NA (same for other comparison operators).


The goal of this PR is to make DataArrays compatible with the new Nulls-based DataFrames, so that they can continue to be used until Array{Union{T, Null}} is efficient enough, as discussed in JuliaData/DataFrames.jl#1209. The port should be relatively easy for users, since everything goes though relatively simple deprecations.

The most/only breaking change concerns comparison operators (as noted above). However it shouldn't be too painful either since in general it means that code which would previously fail in the presence of NAs will now work. The only common situation in which there will be a difference is when user code replaced NAs resulting from a comparison manually. Nevertheless, I think we should discuss (in Nulls rather than here) these semantics carefully before committing to them and forcing users to adapt their code.

Tests will fail until we have a release of Nulls.jl including JuliaData/Missings.jl#30, JuliaData/Missings.jl#31 and JuliaData/Missings.jl#32.

This replaces NA with Nulls.null and NAtype with Nulls.Null.
The only breaking change is that null == null gives true, while NA == NA
gave NA (same for other comparison operators).
Base.broadcast{T}(::typeof(isna), a::AbstractArray{T}) =
NAtype <: T ? BitArray(map(x->isa(x, NAtype), a)) : falses(size(a)) # -> BitArray

# FIXME: type piracy
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a way to remove this without hurting performance, and since the point of porting DataArrays is to preserve performance... Maybe it's OK to keep this for now.

Another solution would be to move this definition to Nulls.jl, and only implement here the DataArray method. But I'm not sure we want to support this kind of API in the long term.

end

# All elementary functions return NA when evaluating NA
# All elementary functions return null when evaluating null
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most significant behavior change. See also below.

@nalimilan
Copy link
Member Author

The most/only breaking change concerns comparison operators (as noted above). However it shouldn't be too painful either since in general it means that code which would previously fail in the presence of NAs will now work. The only common situation in which there will be a difference is when user code replaced NAs resulting from a comparison manually. Nevertheless, I think we should discuss (in Nulls rather than here) these semantics carefully before committing to them and forcing users to adapt their code.

Turns out we'd better use the current NA behavior for null. See JuliaData/Missings.jl#33.

This has just been changed in Nulls.jl.
@nalimilan
Copy link
Member Author

I've added a commit to keep the current logic in which NA == NA -> NA, since JuliaData/Missings.jl#33 just implemented null == null -> null.

test/nas.jl Outdated
@test collect(each_dropna(dv)) == a
@test collect(each_replacena(dv, 4)) == [4, 4, a..., 4]

@testset "promotion" for (T1, T2) in ((Int, Float64),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed these because I couldn't get them to work in the general case, and these should live in Nulls anyway. Cf. JuliaData/Missings.jl#23 and #280 (comment).

@nalimilan
Copy link
Member Author

@andreasnoack Opinions? I've tried a few operations mixing DataArrays from this PR and DataFrames master, and everything seems to work fine.

@andreasnoack
Copy link
Member

andreasnoack commented Oct 3, 2017

At this point, this is just a renaming and migrating out the package, right? No semantical changes. Then it should be fine.

@nalimilan
Copy link
Member Author

nalimilan commented Oct 3, 2017

Yes, mostly. The only significant change is the promote issue, but I think it's OK now (see comment above and its links). There's also the probable removal of iteration, but that's really small (JuliaData/Missings.jl#39). Finally, comparison operators have been made consistent with NA, so I should be able to reinstate have reinstated part of the removed tests.

More systematic testing with DataFrames master wouldn't hurt though (once this PR has been merged).

@andreasnoack
Copy link
Member

Hm. Do tests pass locally? I'd expect them to fail without the promotion definitions. Also, do you need to add Nulls to REQUIRE?

@nalimilan
Copy link
Member Author

Yes, they pass locally, but you need Nulls.jl master. I hope we can tag a release soon.

@andreasnoack
Copy link
Member

You'll also need to add it to REQUIRE. See https://github.com/JuliaStats/DataArrays.jl/blob/9169e9f0a2bfab04cc115428e1c19aa513a9455e/REQUIRE`. That is the current reason why the tests fail.

@nalimilan
Copy link
Member Author

Yes but we need to depend on an unreleased version anyway.

@andreasnoack
Copy link
Member

Why can't we make a new release of Nulls right now?

@nalimilan
Copy link
Member Author

I'm working on it. The problem is that it breaks DataStreams and CategoricalArrays. So better tag compatible releases first (there's one for the former, I'm preparing one for the latter).

@nalimilan
Copy link
Member Author

Tests passed on 0.6 now that Nulls 0.1.0 has been tagged. Good to merge?

@andreasnoack
Copy link
Member

Do you know why it is failing on nightly? It looks related to this change.

@nalimilan
Copy link
Member Author

Actually I hadn't even looked at nightlies since they are so broken. The package doesn't load in a more recent version. But looks like as easy fix, so let's try while it's fresh in my mind. I've added a commit.

src/operators.jl Outdated
@@ -213,7 +208,7 @@ end
# Treat ctranspose and * in a special way
for (f, elf) in ((:(Base.ctranspose), :conj), (:(Base.transpose), :identity))
@eval begin
$(f)(::NAtype) = NA
$(f)(::Null) = null
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, I've missed this one apparently.

Rounding and transpose operations have been moved to Nulls,
functions from SpecialFunctions will have to be handled manually as we don't
want Nulls to depend on SpecialFunctions and keeping them in DataArrays
would be type piracy.
@nalimilan
Copy link
Member Author

Nighties have been updated, so now we get a crash there. :-/ That's an "Illegal instruction" just like in DataFrames master, so probably not worth worrying about for now.

I've also removed a few lifted methods which should be defined in Nulls instead, see JuliaData/Missings.jl#42. That means we'll have to tag another Nulls release...

@nalimilan
Copy link
Member Author

Now that Nulls 0.1.1 has been tagged and that CI passes on 0.6, should be ready to merge?

…hDropNull)

This allows collect(Nulls.skip(x)) to be equivalent to the old dropna(x)
for DataArray, but more generic. Also unexport the iterators, which are
an implementation detail.
@nalimilan
Copy link
Member Author

After doing a bit more cleanup and adding another deprecation, tests from current master are passing with the implementation from that PR (with a handful of changes were needed). So breakage should be minimal. The only functions I removed without a deprecation are those which would be type piracy: length, size, SpecialFunctions.erf, SpecialFunctions.erfc, SpecialFunctions.digamma. The cause of the error should be quite easy to identify anyway.

I haven't renamed the na field of DataArray, because it would be tedious to do for no real gain, and it could break code relying on the internals. Does that make sense?

@quinnj
Copy link
Member

quinnj commented Oct 6, 2017

Sounds good. What were the problematic length size definitions?

@nalimilan
Copy link
Member Author

length(::Null) = 1 and size(::Null) = (). That is, the ones we just removed from Nulls.jl.

@quinnj
Copy link
Member

quinnj commented Oct 6, 2017

Ah, got it.

Previously, convert() would create a DataArray{Union{T, Null}} containing
nulls which were not considered as missing, giving weird results in particular
with equality tests.

Constructing a DataArray{T} from an Array{Union{T, Null}} is tricky because
the field must be an Array{T}, yet conversion will fail in the presence of
nulls. Therefore, we need to allocate a copy and leave the null entries
uninitialized.
m[i] = true
# if input array can contain null values, we need to mark corresponding entries as missing
if eltype(d) >: Null
# If the original eltype is wider than the target eltype T, conversion may fail
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinnj This is a situation where it would be useful to be able to access the underlying data of a Union{T, Null} array, unsafely reinterpreting it as an Array{T}. Since DataArray stores the missingness mask separately, we are certain we won't access elements which are null. Is there a hacky way to do that currently? Would it make sense to provide an API to do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to figure out the right api for that. It was in the original isbits Union optimizations PR, but @vtjnash and @yuyichao had concerns w/ how it was implemented, so I just took it out for the initial PR. I'll open an issue so we can push forward on it now that everything's in Base.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's just a type-assert (using the eltype parameter from DataArrays)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how can I convert an Array{Union{T, Null}} to an Array{T} without copying in the presence of null entries?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like something reinterpret should be allowed to do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it returns a ReinterpretArray now, so that would require allowing the field to be any `AbstractArray, right? Not the end of the world, probably.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtjnash Am I right?

@quinnj
Copy link
Member

quinnj commented Oct 13, 2017

how are things looking @nalimilan?

@nalimilan
Copy link
Member Author

I think that's mostly good. I'd like to define levels in Nulls (JuliaData/Missings.jl#46) and override it here, so that DataArrays and CategoricalArrays can be used at the same time without conflicts.

I'm also checking that old DataArrays-based DataFrames tests pass with the new Nulls-based DataFrames (modulo a few reasonable changes where needed). So far it mostly works, but that process allowed me to catch some bugs and problematic changes (like JuliaData/DataFrames.jl#1249), so I'd rather finish it before merging this PR.

@quinnj
Copy link
Member

quinnj commented Oct 13, 2017

Sounds great; thanks for doing all this.

@nalimilan
Copy link
Member Author

I think this should be ready for a final review. With this branch, DataFrames tests with DataArray columns pass (JuliaData/DataFrames.jl#1260). We should also deprecate PooledDataArray in favor of CategoricalArray (or PooledArray depending on the case), but we can do that later.

"""
function levels(da::DataArray) # -> DataVector{T}
unique_values, firstna = finduniques(da)
function Nulls.levels(da::DataArray) # -> DataVector{T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guess the old docs here got moved somewhere else? Or they just use the docs in Nulls.jl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this method just does the same thing as the fallback in Nulls.

end

# ambiguity
@swappable (==)(::NAtype, ::WeakRef) = NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have this in Nulls, I'm guess it's not that important?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice I doubt it's useful. These were probably implemented at the time ambiguity warnings were printed when loading the package. Though we could enable ambiguity checks in Nulls and fix this kind of situation (there might be others).

DataArray(Array{T,N}(size(b)), trues(size(b)))
@dataarray_binary_scalar(/, /, nothing, false)

for f in [:(Base.maximum), :(Base.minimum)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like weird definitions on scalars anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's because numbers are iterable. We could define these in Nulls since that would be consistent with the goal of accepting null everywhere Number is, but we can also wait until somebody complains so that we're certain it's useful.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally had a chance to review all this. Looks great!

NAException,
NAtype,
padna,
padnull,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left this method here, but we could define it in Nulls if that's really useful. No hurry, though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants