Port to Nulls.jl #288

nalimilan · 2017-09-27T12:37:09Z

This replaces NA with Nulls.null and NAtype with Nulls.Null.
The only breaking change is that null == null gives true, while NA == NA gave NA (same for other comparison operators).

The goal of this PR is to make DataArrays compatible with the new Nulls-based DataFrames, so that they can continue to be used until Array{Union{T, Null}} is efficient enough, as discussed in JuliaData/DataFrames.jl#1209. The port should be relatively easy for users, since everything goes through relatively simple deprecations.

The most/only breaking change concerns comparison operators (as noted above). However it shouldn't be too painful either since in general it means that code which would previously fail in the presence of NAs will now work. The only common situation in which there will be a difference is when user code replaced NAs resulting from a comparison manually. Nevertheless, I think we should discuss (in Nulls rather than here) these semantics carefully before committing to them and forcing users to adapt their code.

Tests will fail until we have a release of Nulls.jl including JuliaData/Missings.jl#30, JuliaData/Missings.jl#31 and JuliaData/Missings.jl#32.

nalimilan · 2017-09-27T12:39:04Z

src/abstractdataarray.jl

-Base.broadcast{T}(::typeof(isna), a::AbstractArray{T}) =
-    NAtype <: T ? BitArray(map(x->isa(x, NAtype), a)) : falses(size(a)) # -> BitArray
-
+# FIXME: type piracy


I couldn't find a way to remove this without hurting performance, and since the point of porting DataArrays is to preserve performance... Maybe it's OK to keep this for now.

Another solution would be to move this definition to Nulls.jl, and only implement here the DataArray method. But I'm not sure we want to support this kind of API in the long term.

nalimilan · 2017-09-27T12:42:38Z

test/operators.jl

    end

-    # All elementary functions return NA when evaluating NA
+    # All elementary functions return null when evaluating null


This is the most significant behavior change. See also below.

nalimilan · 2017-09-27T16:55:53Z

> The most/only breaking change concerns comparison operators (as noted above). However it shouldn't be too painful either since in general it means that code which would previously fail in the presence of NAs will now work. The only common situation in which there will be a difference is when user code replaced NAs resulting from a comparison manually. Nevertheless, I think we should discuss (in Nulls rather than here) these semantics carefully before committing to them and forcing users to adapt their code.

Turns out we'd better use the current NA behavior for null. See JuliaData/Missings.jl#33.

This has just been changed in Nulls.jl.

nalimilan · 2017-09-27T18:49:36Z

I've added a commit to keep the current logic in which NA == NA -> NA, since JuliaData/Missings.jl#33 just implemented null == null -> null.
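For reference, the comparison semantics discussed here can be sketched with Base's `missing` (Julia 1.x), used as a stand-in for `Nulls.null`. That substitution is an assumption for illustration only; the code in this PR targets Nulls.jl on Julia 0.6. The last lines show the "replace the result of a comparison manually" pattern mentioned in the description.

```julia
# Assumption: Base's `missing` (Julia >= 1.0) plays the role of Nulls.null.
x = [1, missing, 3]

# Element-wise comparison propagates the missing value (three-valued logic).
cmp = x .== 1

# Comparing two missing values gives missing, not true.
both_missing = (missing == missing)

# isequal is the two-valued alternative: it always returns a Bool.
same = isequal(missing, missing)

# Manually replacing missing comparison results with false.
filled = coalesce.(cmp, false)
```

With these semantics, code that already replaced NA comparison results by hand should keep working unchanged.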

nalimilan · 2017-09-27T19:08:13Z

test/nas.jl

-    @test collect(each_dropna(dv)) == a
-    @test collect(each_replacena(dv, 4)) == [4, 4, a..., 4]
-
-    @testset "promotion" for (T1, T2) in ((Int, Float64),


I removed these because I couldn't get them to work in the general case, and these should live in Nulls anyway. Cf. JuliaData/Missings.jl#23 and #280 (comment).

nalimilan · 2017-10-03T16:49:28Z

@andreasnoack Opinions? I've tried a few operations mixing DataArrays from this PR and DataFrames master, and everything seems to work fine.

andreasnoack · 2017-10-03T17:12:07Z

At this point, this is just a renaming and migrating out the package, right? No semantical changes. Then it should be fine.

nalimilan · 2017-10-03T17:18:04Z

Yes, mostly. The only significant change is the promote issue, but I think it's OK now (see comment above and its links). There's also the probable removal of iteration, but that's really small (JuliaData/Missings.jl#39). Finally, comparison operators have been made consistent with NA, so I ~~should be able to reinstate~~ have reinstated part of the removed tests.

More systematic testing with DataFrames master wouldn't hurt though (once this PR has been merged).

andreasnoack · 2017-10-03T17:27:21Z

Hm. Do tests pass locally? I'd expect them to fail without the promotion definitions. Also, do you need to add Nulls to REQUIRE?

nalimilan · 2017-10-03T19:27:12Z

Yes, they pass locally, but you need Nulls.jl master. I hope we can tag a release soon.

andreasnoack · 2017-10-03T19:50:02Z

You'll also need to add it to REQUIRE. See https://github.com/JuliaStats/DataArrays.jl/blob/9169e9f0a2bfab04cc115428e1c19aa513a9455e/REQUIRE. That is the current reason why the tests fail.

nalimilan · 2017-10-03T19:50:45Z

Yes but we need to depend on an unreleased version anyway.

andreasnoack · 2017-10-03T19:52:07Z

Why can't we make a new release of Nulls right now?

nalimilan · 2017-10-03T19:56:50Z

I'm working on it. The problem is that it breaks DataStreams and CategoricalArrays. So better tag compatible releases first (there's one for the former, I'm preparing one for the latter).

Now works with Nulls 0.1.0.

nalimilan · 2017-10-05T08:33:40Z

Tests passed on 0.6 now that Nulls 0.1.0 has been tagged. Good to merge?

andreasnoack · 2017-10-05T08:45:07Z

Do you know why it is failing on nightly? It looks related to this change.

nalimilan · 2017-10-05T08:54:23Z

Actually I hadn't even looked at nightlies since they are so broken. The package doesn't load in a more recent version. But it looks like an easy fix, so let's try while it's fresh in my mind. I've added a commit.

nalimilan · 2017-10-05T08:56:15Z

src/operators.jl

@@ -213,7 +208,7 @@ end
 # Treat ctranspose and * in a special way
 for (f, elf) in ((:(Base.ctranspose), :conj), (:(Base.transpose), :identity))
    @eval begin
-        $(f)(::NAtype) = NA
+        $(f)(::Null) = null


Woops, I've missed this one apparently.

Rounding and transpose operations have been moved to Nulls, functions from SpecialFunctions will have to be handled manually as we don't want Nulls to depend on SpecialFunctions and keeping them in DataArrays would be type piracy.
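One way users could handle such functions manually is a small lifting wrapper defined in their own code, which avoids adding methods to functions and types they do not own. This is only a sketch: `lift` is a hypothetical helper name, Base's `missing` (Julia 1.x) stands in for `Nulls.null`, and `sqrt` stands in for functions such as `SpecialFunctions.erf` to keep the example free of dependencies.

```julia
# Hypothetical helper, not part of DataArrays or Nulls.
# Returns a function that passes missing through and applies f otherwise.
lift(f) = x -> ismissing(x) ? missing : f(x)

# sqrt is a stand-in for e.g. SpecialFunctions.erf.
lifted_sqrt = lift(sqrt)

vals = Union{Float64, Missing}[4.0, missing, 9.0]
results = map(lifted_sqrt, vals)
```

Since the wrapper is an ordinary closure owned by the caller, no method is added to `sqrt` itself, so there is no type piracy.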

nalimilan · 2017-10-05T12:57:22Z

Nightlies have been updated, so now we get a crash there. :-/ That's an "Illegal instruction" just like in DataFrames master, so probably not worth worrying about for now.

I've also removed a few lifted methods which should be defined in Nulls instead, see JuliaData/Missings.jl#42. That means we'll have to tag another Nulls release...

nalimilan · 2017-10-05T19:48:33Z

Now that Nulls 0.1.1 has been tagged and that CI passes on 0.6, should be ready to merge?

nalimilan · 2017-10-06T12:38:50Z

After doing a bit more cleanup and adding another deprecation, tests from current master are passing with the implementation from that PR (only a handful of changes were needed). So breakage should be minimal. The only functions I removed without a deprecation are those which would be type piracy: length, size, SpecialFunctions.erf, SpecialFunctions.erfc, SpecialFunctions.digamma. The cause of the error should be quite easy to identify anyway.

I haven't renamed the na field of DataArray, because it would be tedious to do for no real gain, and it could break code relying on the internals. Does that make sense?

quinnj · 2017-10-06T12:40:10Z

Sounds good. What were the problematic length and size definitions?

nalimilan · 2017-10-06T12:43:09Z

length(::Null) = 1 and size(::Null) = (). That is, the ones we just removed from Nulls.jl.

quinnj · 2017-10-06T12:45:28Z

Ah, got it.

Previously, convert() would create a DataArray{Union{T, Null}} containing nulls which were not considered as missing, giving weird results in particular with equality tests. Constructing a DataArray{T} from an Array{Union{T, Null}} is tricky because the field must be an Array{T}, yet conversion will fail in the presence of nulls. Therefore, we need to allocate a copy and leave the null entries uninitialized.
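The copy-and-mask approach described above can be sketched roughly as follows. This is not the actual constructor from this PR: `split_data_and_mask` is a hypothetical name, Base's `missing` stands in for `Nulls.null`, and `nonmissingtype` assumes Julia 1.3 or later.

```julia
# Hypothetical helper illustrating the idea; not DataArrays' implementation.
# Splits an array that may contain missing values into a plain data array
# and a separate Boolean mask, as DataArray stores them.
function split_data_and_mask(a::AbstractArray)
    T = nonmissingtype(eltype(a))      # element type with Missing stripped
    data = Array{T}(undef, size(a))    # slots for missing entries stay uninitialized
    mask = falses(size(a))             # true marks a missing entry
    for i in eachindex(a)
        v = a[i]
        if ismissing(v)
            mask[i] = true
        else
            data[i] = v
        end
    end
    return data, mask
end

data, mask = split_data_and_mask(Union{Int, Missing}[1, missing, 3])
```

The uninitialized slots are never read as long as every access checks the mask first, which is why the extra copy is safe even though it is not free.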

nalimilan · 2017-10-08T13:04:07Z

src/dataarray.jl

-                    m[i] = true
+        # if input array can contain null values, we need to mark corresponding entries as missing
+        if eltype(d) >: Null
+            # If the original eltype is wider than the target eltype T, conversion may fail


@quinnj This is a situation where it would be useful to be able to access the underlying data of a Union{T, Null} array, unsafely reinterpreting it as an Array{T}. Since DataArray stores the missingness mask separately, we are certain we won't access elements which are null. Is there a hacky way to do that currently? Would it make sense to provide an API to do that?

Yeah, we need to figure out the right api for that. It was in the original isbits Union optimizations PR, but @vtjnash and @yuyichao had concerns w/ how it was implemented, so I just took it out for the initial PR. I'll open an issue so we can push forward on it now that everything's in Base.

I think that's just a type-assert (using the eltype parameter from DataArrays)?

But how can I convert an Array{Union{T, Null}} to an Array{T} without copying in the presence of null entries?

Seems like something reinterpret should be allowed to do

But it returns a ReinterpretArray now, so that would require allowing the field to be any `AbstractArray`, right? Not the end of the world, probably.

@vtjnash Am I right?

quinnj · 2017-10-13T02:31:36Z

how are things looking @nalimilan?

nalimilan · 2017-10-13T07:06:35Z

I think that's mostly good. I'd like to define levels in Nulls (JuliaData/Missings.jl#46) and override it here, so that DataArrays and CategoricalArrays can be used at the same time without conflicts.

I'm also checking that old DataArrays-based DataFrames tests pass with the new Nulls-based DataFrames (modulo a few reasonable changes where needed). So far it mostly works, but that process allowed me to catch some bugs and problematic changes (like JuliaData/DataFrames.jl#1249), so I'd rather finish it before merging this PR.

quinnj · 2017-10-13T13:23:22Z

Sounds great; thanks for doing all this.

nalimilan · 2017-10-14T15:25:40Z

I think this should be ready for a final review. With this branch, DataFrames tests with DataArray columns pass (JuliaData/DataFrames.jl#1260). We should also deprecate PooledDataArray in favor of CategoricalArray (or PooledArray depending on the case), but we can do that later.

quinnj · 2017-10-17T04:14:44Z

src/dataarray.jl

-"""
-function levels(da::DataArray) # -> DataVector{T}
-    unique_values, firstna = finduniques(da)
+function Nulls.levels(da::DataArray) # -> DataVector{T}


I'm guessing the old docs here got moved somewhere else? Or they just use the docs in Nulls.jl?

Yes, this method just does the same thing as the fallback in Nulls.

quinnj · 2017-10-17T04:34:30Z

src/operators.jl

 end

 # ambiguity
-@swappable (==)(::NAtype, ::WeakRef) = NA


I don't think we have this in Nulls, I'm guessing it's not that important?

In practice I doubt it's useful. These were probably implemented at the time ambiguity warnings were printed when loading the package. Though we could enable ambiguity checks in Nulls and fix this kind of situation (there might be others).

quinnj · 2017-10-17T04:37:14Z

src/operators.jl

    DataArray(Array{T,N}(size(b)), trues(size(b)))
 @dataarray_binary_scalar(/, /, nothing, false)

-for f in [:(Base.maximum), :(Base.minimum)]


These seem like weird definitions on scalars anyway

Yeah, I think that's because numbers are iterable. We could define these in Nulls since that would be consistent with the goal of accepting null everywhere Number is, but we can also wait until somebody complains so that we're certain it's useful.

quinnj

Finally had a chance to review all this. Looks great!

nalimilan · 2017-10-17T06:35:13Z

src/DataArrays.jl

-           NAException,
-           NAtype,
-           padna,
+           padnull,


I've left this method here, but we could define it in Nulls if that's really useful. No hurry, though.

Port to Nulls.jl

a1c2c1a

This replaces NA with Nulls.null and NAtype with Nulls.Null. The only breaking change is that null == null gives true, while NA == NA gave NA (same for other comparison operators).

nalimilan mentioned this pull request Sep 27, 2017

RFC: Adjust to eltype changes in DataArrays where the function now returns Union{T,NAtype} JuliaData/DataFrames.jl#1209

Closed

Revert behavior of == in presence of null to match that of NA

9169e9f

This has just been changed in Nulls.jl.

quinnj mentioned this pull request Oct 3, 2017

length and iteration? JuliaData/Missings.jl#39

Closed

nalimilan added 2 commits October 4, 2017 17:20

Require Nulls 0.1.0

be02c30

Add back promotion tests

f5534d5

Now works with Nulls 0.1.0.

Fix rep() test on 0.7

19c79a8

Remove more lifted operations on Null

4ccfaf3

Rounding and transpose operations have been moved to Nulls, functions from SpecialFunctions will have to be handled manually as we don't want Nulls to depend on SpecialFunctions and keeping them in DataArrays would be type piracy.

nalimilan mentioned this pull request Oct 5, 2017

Add rounding and transposition methods JuliaData/Missings.jl#42

Merged

Require Nulls 0.1.1

c3f7a50

nalimilan force-pushed the nl/null branch from af73b95 to 3bff91c Compare October 5, 2017 20:39

nalimilan added 6 commits October 6, 2017 11:07

Remove dropnull() thanks to efficient specialization of collect(::Eac…

122fa56

…hDropNull) This allows collect(Nulls.skip(x)) to be equivalent to the old dropna(x) for DataArray, but more generic. Also unexport the iterators, which are an implementation detail.

Remove remaining occurrences of NA

e3776ba

Deprecate skipna argument in favor of skipnull

b9999ee

Remove even more uses of na

8859847

Fix use NA with @data and @pdata

920f92c

Add deprecation for NAException

22ebdb0

nalimilan force-pushed the nl/null branch from 2978767 to 02bb120 Compare October 6, 2017 21:44

Stop exporting nonexistent head() and tail()

f3a7af6

nalimilan added 2 commits October 14, 2017 15:51

Override Nulls.levels() instead of defining custom function

96f7b05

This prevents a conflict with CategoricalArrays.

Remove method redundant with ==(::AbstractArray{>:Null, ::AbstractArr…

622613a

…ay{>:Null})

nalimilan mentioned this pull request Oct 14, 2017

WIP: reintroduce DataArrays-based tests from 0.10.1 JuliaData/DataFrames.jl#1260

Closed

quinnj reviewed Oct 17, 2017

quinnj approved these changes Oct 17, 2017

nalimilan merged commit 8a7003b into master Oct 19, 2017

nalimilan deleted the nl/null branch October 19, 2017 16:53

nalimilan mentioned this pull request Oct 21, 2017

Avoid copying input Array{Union{T, Null}} in DataArray{T} constructor #291

Open
