[RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns #1118

gustafsson · 2016-11-01T21:02:27Z

This creates unnecessary copies of columns that don't exist in all concatenated data frames, but it uses Base.vcat to let each array type decide what vcat means.

Related #990

nalimilan

Thanks, it would be good to remove special-casing and rely only on vcat.

nalimilan · 2016-11-03T14:05:47Z

src/abstractdataframe/abstractdataframe.jl

-            if haskey(df, colnam)
-                copy!(col, i, df[colnam])
+    nrows = sum(nrow, dfs)
+    for colnam in unique([(names(e) for e in dfs)...;])


Avoid splatting on an arbitrary number of arguments when possible. Here, it's better to use unique(Base.flatten(names.(dfs))).

Oh I didn't know about Base.flatten. And I get to use the new cool f.(x)-notation!
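
For readers following along, here is a minimal sketch of the difference between the splatting form and the lazy form. It is not the PR's code: `names_per_table` is a hypothetical stand-in for `names.(dfs)`, and it assumes a current Julia release, where `flatten` is reached as `Iterators.flatten` rather than `Base.flatten`.

```julia
# Hypothetical stand-in for `names.(dfs)`: one vector of column names per table.
names_per_table = [[:a, :b], [:b, :c], [:a, :d]]

# Splatting form (equivalent in spirit to the `[...;]` in the diff):
# every inner vector becomes a separate argument to vcat.
splatted = unique(vcat(names_per_table...))

# Lazy form suggested in the review: no splat, the names are visited
# one table at a time and de-duplicated in order of first appearance.
flattened = unique(Iterators.flatten(names_per_table))
```

Both forms give the same names in the same order; the lazy one just avoids building an argument list whose length depends on the number of data frames.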

nalimilan · 2016-11-03T14:09:51Z

src/abstractdataframe/abstractdataframe.jl

+    nrows = sum(nrow, dfs)
+    for colnam in unique([(names(e) for e in dfs)...;])
+        k = Bool[haskey(e, colnam) for e in dfs]
+        c = vcat((e[colnam] for e in view(dfs, k))...)


view is probably not worth it, as dfs is just a short vector of references. You could use an if guard instead.

I don't follow, could you show me?

Something like vcat((dfs[i][colnam] for i in 1:length(dfs) if k[i])...).

Oh, I didn't know you could use if in a for-generator
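
A small sketch of the guarded generator discussed above, under the assumption that plain `Dict`s can stand in for data frames (the `dfs` contents and the name `has_col` are illustrative only, not taken from the PR):

```julia
# Hypothetical stand-ins for data frames: Dicts mapping column name => column.
dfs = [Dict(:x => [1, 2]), Dict(:y => [3]), Dict(:x => [4])]
colnam = :x

# One flag per table: does it contain the column?
has_col = Bool[haskey(df, colnam) for df in dfs]

# The `if` clause filters inside the generator, so no view or
# intermediate filtered vector of tables is needed.
col = vcat((dfs[i][colnam] for i in 1:length(dfs) if has_col[i])...)
```

The second table has no `:x` column, so it is skipped and only the first and third contribute rows.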

nalimilan · 2016-11-03T14:11:35Z

src/abstractdataframe/abstractdataframe.jl

+        if all(k)
+            col = c
+        else
+            col = if _isnullable(c)


Better assign to col separately within each branch. Also, wrong indentation below.

Do you mean that I should assign to res[colnam] separately within each branch?

That was just a stylistic comment to avoid col = if.

nalimilan · 2016-11-03T14:12:22Z

test/grouping.jl

+    a[:x] = compact(a[:x])
+    b[:x] = compact(b[:x])
+    r = vcat(a,b)
+    @test isequal(r, DataFrame(x=[categorical(1:200);categorical(100:300)]))


Please add spaces after commas and semicolons here and below.

nalimilan · 2016-11-03T14:12:50Z

src/abstractdataframe/abstractdataframe.jl

-                copy!(col, i, df[colnam])
+    nrows = sum(nrow, dfs)
+    for colnam in unique([(names(e) for e in dfs)...;])
+        k = Bool[haskey(e, colnam) for e in dfs]


Could you find more explicit/logical names than k, e and c?

Still could use more explicit names.

nalimilan · 2016-11-03T14:17:27Z

src/abstractdataframe/abstractdataframe.jl

+            i = 1
+            j = 1
+            for df in dfs
+                if haskey(df, colnam)


Could reuse k here.

nalimilan · 2016-11-03T14:18:21Z

src/abstractdataframe/abstractdataframe.jl

+            j = 1
+            for df in dfs
+                if haskey(df, colnam)
+                    copy!(col, i, view(c, j:j+nrow(df)-1))


You don't need view as copy! can take offsets and number of elements to copy directly.
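
To illustrate the offset form: in the Julia 0.5 used by this PR the function is `copy!`; in Julia 1.x the same five-argument method is named `copyto!`. The sketch below assumes Julia 1.x, and the array contents are made up for the example.

```julia
c = [10, 20, 30, 40, 50]     # source: data already concatenated by vcat
col = zeros(Int, 7)          # destination: the larger final column

# copyto!(dest, dest_start, src, src_start, n) copies n elements
# without allocating a view or a temporary slice.
copyto!(col, 3, c, 2, 3)     # c[2:4] lands in col[3:5]
```

Only the three requested elements are written; the rest of `col` keeps its initial zeros.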

nalimilan · 2016-11-03T14:29:11Z

src/abstractdataframe/abstractdataframe.jl

+            col = c
+        else
+            col = if _isnullable(c)
+              similar(c, nrows)


Choosing the column type that way isn't correct. Until JuliaLang/julia#18472 is fixed, I think we should be able to find out what's the most appropriate return type by calling Base.return_types on vcat and the types of the input columns. If inference fails, we could fall back to allocating an empty Array of the required length when the column is missing, and calling vcat on all columns; it would be wasteful, but that would only happen with broken types (and until we have a better mechanism).

I don't follow. vcat decides the column type, I'm just allocating a larger array. If vcat settled on something that couldn't cope with nulls I add that. Maybe I'm missing the point and you're actually disagreeing with the whole wasteful approach of allocating columns "twice"?

Ah, I misread the code. So indeed that's correct, just suboptimal. You could use Base.return_types as I described to avoid doing this when inference works, and only fall back to it when it fails.

Is this what you're suggesting?

    cols = (dfs[i][colnam] for i in 1:length(dfs) if k[i])
    T = Base.return_types(vcat, Base.typesof(c...))
    if T <: Union
        # implementation so far in this PR
        r = vcat(cols)
        ...
    else
        T = makeTsupportnull(T)
        r = T(N)
        # don't use vcat and just do `copy!` to fill r?
        ...
    end

Yeah, more or less, though in the second branch, I think I'd call vcat on existing columns as you do now for consistency. Also, you'll have to take the first result from return_types, and T<:Union should be isleaftype(T).

If you do r = vcat(cols) in both branches of if isleaftype(T), they become identical, so you don't need the branch in the first place. No?

I thought the problem was that you wanted some other way to allocate an array with a nullable eltype? Is that related? What benefit does return_types have over similar?

I don't know what you want to achieve so my questions become rather random.

Sorry if I wasn't clear. The goal is to avoid an unneeded allocation. My idea was that if isleaftype(T), then you can create the (empty) final array, and fill it directly with copy!. To be sure you can have missing values, add NullableArray{promote_type(eltypes.(c)...)} to the types of existing columns.

If !isleaftype(T), then the only way to find out the type of that array is to create it by actually calling vcat. Since you need to do that, you can as well create empty NullableArrays of the needed length and add them to the vcat call so that it gives the final array directly. But honestly, I don't expect this branch to happen often in practice; we could almost raise an error.

Does that make sense?
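
A rough sketch of the two-branch idea described above, written for a current Julia release rather than the 0.5-era code in this PR. Several things here are assumptions made for illustration: `vcat_with_gaps` is a hypothetical helper, `nothing` marks a table that lacks the column, `missing` stands in for the Nullable machinery, and `isconcretetype` replaces the older `isleaftype`. What `Base.return_types` reports depends on inference, which is why both branches are written to produce the same values.

```julia
# Hypothetical helper, not the PR's implementation.
# cols[i] is the column from table i, or `nothing` if that table lacks it;
# lengths[i] is the number of rows in table i.
function vcat_with_gaps(cols::Vector, lengths::Vector{Int})
    present = [c for c in cols if c !== nothing]
    total = sum(lengths)

    # Ask inference what vcat would return for the columns that exist.
    inferred = Base.return_types(vcat, Tuple(typeof.(present)))

    if length(inferred) == 1 && isconcretetype(inferred[1]) &&
       inferred[1] <: AbstractVector
        # Stable case: allocate the final array once and fill it in place.
        T = eltype(inferred[1])
        dest = Vector{Union{Missing, T}}(missing, total)
        pos = 1
        for (col, n) in zip(cols, lengths)
            col === nothing || copyto!(dest, pos, col, 1, n)
            pos += n
        end
        return dest
    else
        # Unstable case: the only way to learn the type is to call vcat,
        # so pad the gaps with filler columns and let vcat decide.
        padded = [col === nothing ? fill(missing, n) : col
                  for (col, n) in zip(cols, lengths)]
        return vcat(padded...)
    end
end

result = vcat_with_gaps([[1, 2], nothing, [3]], [2, 1, 1])
```

The stable branch avoids the intermediate concatenated array entirely, which is the allocation the review comment is trying to save; the fallback is wasteful but should only be reached when inference cannot pin down the result type.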

gustafsson · 2016-11-07T18:12:48Z

Implemented your comments, thanks. You're right that the return type is mostly stable, meaning we'll only use vcat to find the array type, not to concatenate the data.

I found only one example in the unit tests where this fails and return_types(vcat, c) doesn't give a leaf type. From cat.jl:121, Base.return_types gets called like this:

Base.return_types(vcat, (NullableCategoricalArray{Int64,1,UInt32}, NullableArray{Int64,1}))
1-element Array{Any,1}:
 CategoricalArrays.NullableCategoricalArray{T,N,R<:Integer}

This is an example where the return type of vcat depends on the order of arguments. In this case it returns a NullableCategoricalArray{Int64,1,UInt32}.

gustafsson · 2016-11-07T18:28:47Z

This PR depends on JuliaStats/NullableArrays.jl#152

nalimilan · 2016-11-07T22:12:18Z

Thanks, will give a longer look later.

This is an example where the return type of vcat depends on the order of arguments. In this case it returns a NullableCategoricalArray{Int64,1,UInt32}.

Sounds like we should change this to be type-stable. I guess the problem is in the method defined in CategoricalArrays.jl?

gustafsson · 2016-11-08T18:33:54Z

@nalimilan The example where the return type of vcat depends on the order of arguments comes from Base.typed_vcat, which uses similar on its first argument.

gustafsson · 2016-11-08T18:41:24Z

Related: CategoricalArrays might need a special vcat(V1::CategoricalArray, V::AbstractVector...) that either creates a regular Array or arranges for a[pos:p1] = Vk in typed_vcat to use the equivalent of CategoricalArrays.copy! somehow. But that's really an issue for CategoricalArrays, not DataFrames.

nalimilan · 2016-11-13T10:51:35Z

src/abstractdataframe/abstractdataframe.jl

-    end
-    res
-end
+        c = ((typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i])...,)


Trailing comma isn't needed, right?

nalimilan · 2016-11-13T10:52:17Z

src/abstractdataframe/abstractdataframe.jl

-                    end
+        if length(C)==1 && isleaftype(C[1])
+            if _isnullable(C[1])
+              NC = C[1]


Four-space indent.

nalimilan · 2016-11-13T11:08:07Z

src/abstractdataframe/abstractdataframe.jl

 end

+_isnullable{T}(::AbstractArray{T}) = T <: Nullable


Better define these above since they're used there.

nalimilan · 2016-11-13T11:20:45Z

@nalimilan The example where the return type of vcat depends on the order of arguments comes from Base.typed_vcat, which uses similar on its first argument.

OK, I guess that's good enough for now. Looks like it's going to be changed pretty soon by https://github.com/JuliaLang/julia/pull/16740/files#diff-2264bb51acec4e7e2219a3cb1c733651R1186.

Though we'll still need a mechanism to promote the input types (JuliaLang/julia#18472). Since you're now familiar with vcat, would you experiment with a PR against Base? One approach would be to use promote for that; you could add a promote_rule method for AbstractArray (similar to what I recently did for Pair at JuliaLang/julia#19171).

Related: CategoricalArrays might need a special vcat(V1::CategoricalArray, V::AbstractVector...) that either creates a regular Array or arranges for a[pos:p1] = Vk in typed_vcat to use the equivalent of CategoricalArrays.copy! somehow. But that's really an issue for CategoricalArrays, not DataFrames.

It seems that indeed a[pos:p1] = Vk should be equivalent to copy!, shouldn't it? But could you elaborate on how it's different? As regards handling the levels?

CategoricalArray likely also needs specific methods so that concatenation with AbstractArray{<:Nullable} gives a NullableCategoricalArray. Ideally it would be done via the promote mechanism, but currently it seems we need a vcat method.

gustafsson · 2016-11-15T20:11:31Z

Since we don't handle setindex!(::CategoricalArray, ::AbstractArray, ::AbstractArray), the call to a[pos:p1] = Vk hits the default implementation in Base, which in turn calls setindex! for each element. This doesn't use the efficient concatenation implemented in copy! and vcat for CategoricalArray. It might be a simple fix: implement a setindex! that just calls copy!. I'm not sure if there are any side effects to consider.

nalimilan · 2016-11-16T09:57:19Z

src/abstractdataframe/abstractdataframe.jl

-            end
-            i += size(df, 1)
+    nrows = sum(nrow, dfs)
+    for colnam in unique(Base.flatten(names.(dfs)))


flatten lives under Base.Iterators on Julia 0.6. You can just test this using isdefined and do using Base: flatten or using Base.Iterators: flatten at the top of the file.

nalimilan · 2016-11-16T09:58:00Z

src/abstractdataframe/abstractdataframe.jl

-        end
-    end
+        else
+            # warn("Unstable return types: ", C, " from vcat of ", [typeof(dfs[i][colnam]) for i in 1:length(dfs) if k[i]])


Can you remove this once it's ready?

nalimilan · 2016-11-16T10:00:53Z

OK. I'd rather implement copy! in terms of setindex!, though it doesn't really matter.

There's still a failure which appears to be real (besides the flatten problem on 0.6).

nalimilan · 2016-11-28T09:09:22Z

Is there anything blocking here?

ararslan · 2016-11-28T18:37:33Z

@nalimilan If you mean blocking a merge, your review still says "requested changes" and all checks failed on all platforms...

nalimilan · 2016-11-28T18:59:00Z

I just wondered whether there were hard issues to tackle to get this to a mergeable state.

nalimilan · 2017-01-15T21:07:30Z

@gustafsson Can you revise this so that we merge it?

nalimilan · 2017-03-26T15:50:58Z

Actually, this cannot be merged until Base provides a mechanism for cat promotion (JuliaLang/julia#20815). See also discussion at https://github.com/JuliaData/DataTables.jl/pull/30/files#r105499372.

nalimilan · 2017-09-11T21:42:40Z

@cjprybol Do you think we still need this, or did you implement it in DataTables? (See the last four commits.)

cjprybol · 2017-09-12T00:08:12Z

I think everything here should be covered by what we merged in JuliaData/DataTables.jl#45. And until something like JuliaLang/julia#20815 is implemented in Base, we can't rely on Base.vcat anyway, so this looks like it follows similar logic to what we have implemented currently.

cjprybol · 2017-09-12T00:11:45Z

Thanks for your efforts here @gustafsson! Sorry it got left behind while we were experimenting in the DataTables package

Gord Stephen and others added 30 commits September 14, 2016 10:13

RFC: Add compatibility with pre-contrasts ModelFrame constructor (Jul…

968e980

…iaData#1042) Add compatibility with pre-contrasts ModelFrame constructor

Reindex transposed sparse contrast matrix into modelmat_cols column-w…

d4ad15b

…ise for speed improvement (JuliaData#1070)

Fill existing arrays with scalars (JuliaData#1057)

2931693

Port to NullableArrays and CategoricalArrays

e4662fd

Completely remove support for DataArrays.

Get rid of custom Nullable operators and functions

9de5c08

This depends on PRs moving these into NullableArrays.jl. Also use isequal() instead of ==, as the latter is in Base and unlikely to change its semantics.

Fix grouping

6ac7549

groupby() did not follow the order of levels, and wasn't robust to reordering levels. Add tests for corner cases.

Remove custom isnull() definition

653fc1d

Remove optimized sorting methods

a17f264

Use the fallbacks for now, should be added back after JuliaData/CategoricalArrays.jl#12 is fixed.

Remove inscrutable FIXME

9a71705

Not sure what I meant by this. If it was really serious, we'll discover it sooner or later.

More Julia 0.4 compatibility

9f1e5e6

Remove another FIXME

a75a4a4

This is a much more general issue (JuliaStats/NullableArrays.jl#85) which can be tackled later.

Remove FIXME about insert!()

1b44ffe

For now, preserve the current semantics: conversion to NullableArray does not happen via insert!().

Remove FIXME about +(::NullableArray{Int}, ::Int)

0ff4dc8

Again a broader issue which doesn't particularly affect DataFrames. Cf. JuliaStats/NullableArrays.jl#143

Remove FIXME about test/indexing.jl

110deac

Better handle that separately.

Remove FIXME about map()

0ff6373

Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.

Fix sortperm() tests

431d135

This depends on a CategoricalArrays change by which levels are sorted when creating the array.

Remove FIXME about predict()

cc87f46

There's no inconsistency here: when the input is a Matrix, there's no point in returning a NullableArray. Anyway, these are test methods.

Remove FIXME about head() and tail()

ec9b706

We don't have to handle this right now.

Remove FIXME about PooledDataVecs

e9a1c8c

Keep this in DataFrames for now, renaming it to the more explicit sharepools(). Also relax signatures to accept non-Nullable categorical arrays.

Remove unused NominalArray methods

bf16c5f

These were not exercised by the tests, and the use case for them isn't obvious. (They were formerly methods of DataArrays.PooledDataArray().)

Mention Julia bug in FIXME

3bb1323

Bump dependencies on NullableArrays and CategoricalArrays

95789bf

For NullableArrays, even current git master is not enough at this time.

Require NullableArrays 0.0.8

5c33249

Bump CategoricalArrays requirement

e1df391

Fix tests on Julia 0.4

63c1d96

Tests pass, but the Nullable{Any} results could be annoying for users.

Use CategoricalArray instead of NominalArray

2ec131e

New type merging NominalArray and OrdinalArray in 0.0.5.

Remove DataArrays benchmarks

ad75f67

These shouldn't live in DataFrames.

Update docs

492351c

Fix failures introduced when rebasing

d48d7f8

Update docs to remove references to DataArrays and fully qualify a fe…

f8dc8c6

…w references

gustafsson mentioned this pull request Nov 1, 2016

vcat should expand pooled columns when needed #990

Closed

gustafsson changed the title ~~Base.vcat AbstractDataFrame should rely on Base.vcat for columns~~ [RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns Nov 1, 2016

nalimilan requested changes Nov 3, 2016

Johan Gustafsson added 3 commits November 7, 2016 18:04

Use vcat to find column type but not for concatenation

1a96917

spaces

ce33c83

Use promote_eltype to find eltype

f10ae4a

nalimilan reviewed Nov 13, 2016

fixed review comments

4dcc24b

nalimilan reviewed Nov 16, 2016

ararslan force-pushed the master branch from 1013694 to e5347cf on February 11, 2017 18:48

cjprybol closed this Sep 12, 2017
