
vcat should expand pooled columns when needed #990

Closed

Conversation

gustafsson
Contributor

pool! creates compact pooled dataframes that don't have room for more levels than needed.

When vcat-ing such pooled dataframes together, the resulting pool might need to be larger. This PR checks which levels are needed prior to copying the data.

Canonical example

a = DataFrame(x=pool(1:200))
b = DataFrame(x=pool(100:300))
vcat(a,b) # <- error

A similar problem with DataArrays occurs for vcat(pool(1:200), pool(100:300)), but that is not addressed by this PR.

@nalimilan
Member

nalimilan commented Jun 9, 2016

Good catch. Could you put that function into DataArrays.jl instead? It really belongs there. Shouldn't it even be done by the vcat method automatically?

EDIT: actually, this could be a method of similar, taking a keyword argument giving the new levels. We'd need to differentiate it from the standard similar method, though, to avoid the type-instability there.

@gustafsson
Contributor Author

DataArrays.vcat doesn't exist; it's just deferred to the default Base.vcat in abstractvector.jl, but I guess it could be specialised for pooled arrays. DataFrames.vcat, however, doesn't rely on Base.vcat, so fixing this for DataArrays wouldn't necessarily fix it for DataFrames. DataFrames.vcat and Base.vcat do both use similar though, which I assume your edited comment was referring to.

If similar took the new levels as a keyword argument, those levels would need to be computed by the caller. That is what I implemented for DataFrames.vcat, except the new levels are passed in with the first argument to similar instead of as an additional keyword argument. A new version of similar could also take the entire vector of arrays and from that figure out a common type with room for all, which is essentially what DataFrames._colinfo already does.

Or how would you propose that the new levels be created before passing them to similar? Something like this?

function Base.similararray{T,R<:Integer,N}(a::PooledDataArray{T,R,N}, S, n; array=[])
    pool = levels([map(levels, array)...;])  # merged levels of all arrays in `array`
    norefs = DataArrays.RefArray(Array{DataArrays.DEFAULT_POOLED_REF_TYPE,N}(n))
    compact(PooledDataArray(norefs, pool))   # narrow the ref type to fit the pool
end

# default
Base.similararray(a, T, n; kwargs...) = similar(a, T, n)

# Use like so:
a = similararray(full(V[1]), T, n; array=V)

And then change for instance https://github.com/JuliaLang/julia/blob/master/base/abstractarray.jl#L638 as well as related calls in the same file?

@nalimilan
Member

I'm not sure what the best strategy is, but my general idea was to move all PooledDataArray implementation details to DataArrays.jl and use only a relatively high-level API from DataFrames.jl. The plan is to move away from DataArrays to something like CategoricalData.jl, which will be easier if we don't mess with internals (which is never a good idea anyway).

While similar[array] can be passed a list of arrays, I would find it better to pass it the merged list of levels directly: that would be a natural extension of similar, since levels are a property of PDAs just like element type and dimensions are for all arrays. As you say, this is quite "similar" to what you do here, except that no unexported DataArrays functions would be used: those bits would be moved to that package instead.
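
For concreteness, a sketch of what such an extended similar might look like on the DataArrays side; the name similar_pooled and its signature below are assumptions for illustration, not an existing API:

using DataArrays  # hypothetical sketch only, nothing here exists in the package
function similar_pooled{T,R<:Integer,N}(a::PooledDataArray{T,R,N}, n, newlevels::AbstractVector)
    # start with the widest default ref type, then let compact narrow it
    refs = DataArrays.RefArray(zeros(DataArrays.DEFAULT_POOLED_REF_TYPE, n))
    compact(PooledDataArray(refs, convert(Vector{T}, newlevels)))
end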

@gustafsson
Contributor Author

gustafsson commented Jun 22, 2016

The number of levels that a PDA can hold is inferred from its integer type parameter. The actual levels do not define the type, as an instance of a PDA can be expanded with more levels, but only up to a limit defined by the type parameter.

If we call similar[array] and "pass it the merged list of levels directly", who would create that merged list of levels? The caller who is outside of the DataArrays package?
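
For reference, computing the merged list itself is a one-liner on the caller side; a sketch using DataArrays' exported pool and levels:

using DataArrays
cols = Any[pool(1:200), pool(100:300)]
merged = unique(vcat(map(levels, cols)...))  # union of both pools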

Another approach would be to rewrite the DataFrames.vcat to rely on Base.vcat, something like

function Base.vcat(a::Array{DataFrame})
  r = DataFrame()
  # union of all column names, in order of first appearance
  for n in unique([map(names, a)...;])
    # frames lacking column `n` contribute NAs of matching length
    r[n] = vcat([haskey(e, n) ? e[n] : DataArray(NAtype, nrow(e)) for e in a]...)
  end
  r
end

vcat of DataArrays would then properly handle PooledDataArrays internally. (DataArray(NAtype, nrow(e)) can already be promoted to any type, e.g. vcat(DataArray(NAtype,1), [1]) |> typeof == DataArrays.DataArray{Int64,1}.)
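
For illustration, under the sketch above a column missing from one of the frames would come back filled with NAs promoted to the other frame's element type:

a = DataFrame(x=1:2)
b = DataFrame(x=3:4, y=["a","b"])
vcat([a, b])  # :y is NA for the rows that came from `a`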

@nalimilan
Member

Yes, using the highest-level API sounds like a good idea, as it allows each type to choose the most appropriate behaviour.

@gustafsson
Contributor Author

Waiting for review on JuliaStats/DataArrays.jl#213

@nalimilan
Member

Fixes landed in both DataArrays and CategoricalArrays (with JuliaData/CategoricalArrays.jl#18). Do we still need to do something here?

@gustafsson
Contributor Author

Yes, because DataFrames doesn't use Base.vcat on its columns: abstractdataframe.jl#691

Any comments on the approach suggested above? It may need something more complicated than DataArray(NAtype, nrow(e)).

@nalimilan
Member

Honestly, I'm not familiar enough with this code to be completely sure. Your proposal to rely on vcat sounds good, and it means we can support any types without reimplementing the concatenation logic here. Though it has the drawback that we need to allocate arrays full of null values just to call vcat on them and destroy them afterwards. That said, if that turns out to be an issue for performance, I guess we could create an internal NullArray type which would return only nulls for a specified length.

So unless we can find other drawbacks to it, I'd say go ahead. DataArray(NAtype, nrow(e)) can be replaced with NullableArray(nrow(e)), though we may have to implement a vcat method to efficiently concatenate it with CategoricalArray/NullableCategoricalArray (together with promote rules). These specialized methods are probably needed anyway.
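
A minimal sketch of that internal NullArray idea, assuming DataArrays' NA and NAtype; no such type exists in either package:

using DataArrays  # for NA and NAtype
# Hypothetical: a zero-storage vector that is NA at every index, so vcat
# could consume "missing" columns without allocating a real array.
immutable NullArray <: AbstractVector{NAtype}
    len::Int
end
Base.size(a::NullArray) = (a.len,)
Base.getindex(a::NullArray, i::Int) = NA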

@gustafsson
Contributor Author

Not needed as of JuliaData/CategoricalArrays.jl#37

@gustafsson gustafsson closed this Nov 1, 2016
@gustafsson gustafsson reopened this Nov 1, 2016
@gustafsson
Contributor Author

gustafsson commented Nov 1, 2016

Correction: JuliaData/CategoricalArrays.jl#37 solves the issue of efficient concatenation discussed above, but not the issue of concatenating compact pooled arrays, which this PR was originally intended to fix. This was more of an issue with the old PooledDataArray, which created a compact pool by default when using pool(a,b,c). The new categorical(a,b,c), on the other hand, isn't compacted. That is, the issue doesn't occur unless you explicitly call compact.

The old example using PooledDataArray:

a = DataFrame(x=pool(1:200))
b = DataFrame(x=pool(100:300))
vcat(a,b) # <- error

With CategoricalArrays the same issue occurs like this:

a = DataFrame(x=categorical(1:200))
b = DataFrame(x=categorical(100:300))
a[:x] = compact(a[:x])
b[:x] = compact(b[:x])
vcat(a,b) # <- error

which can be fixed by simply not calling compact:

a = DataFrame(x=categorical(1:200))
b = DataFrame(x=categorical(100:300))
vcat(a,b) # <- works

That's fine by me. Besides, CategoricalArrays gives a fairly clear error message about what to do when this happens, i.e. use a larger reference type. So I don't need this PR anymore.

Related (even though the patch wouldn't belong here): it could perhaps be nice to have a function to uncompact like this:

uncompact{T, D, R}(ca::NullableCategoricalArray{T, D, R}) =
    convert(NullableCategoricalArray{T, D, CategoricalArrays.DefaultRefType}, ca)
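
Usage would then mirror the failing compact example above, assuming an analogous method were also defined for plain CategoricalArray:

a[:x] = uncompact(a[:x])  # widen the refs back to the default type
b[:x] = uncompact(b[:x])
vcat(a, b)                # <- works again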

@gustafsson gustafsson closed this Nov 1, 2016
@gustafsson gustafsson deleted the expand_vcat_pool branch November 1, 2016 18:03
@nalimilan
Member

Yet this failure is inconsistent with the fact that your vcat implementation for CategoricalArray always uses DefaultRefType (i.e. UInt32) to avoid throwing an error. Either raising an error is fine, and vcat should preserve the reference type, or it's not, and data frames should work the same as vcat.

Whether or not to compact columns will depend on the input method, e.g. CSV.jl, Query.jl...

@gustafsson
Contributor Author

Yes, it is inconsistent.

I fiddled a bit with using vcat for columns in DataFrames, but I couldn't get around either allocating arrays twice, allocating arrays filled with nulls just to discard them, or resorting to the current implementation, which is probably the most efficient of those alternatives, albeit inconsistent.

An example: #1118

@nalimilan
Member

Even if we manage to rely only on vcat thanks to #1118, we could include a special case for CategoricalArray in DataFrames, which would widen the reference type if there are too many levels. Indeed, that's possible only in DataFrames, where concatenation is already type-unstable anyway.
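
A sketch of what that special case might look like; the helper name is hypothetical, and the convert mirrors the uncompact example above:

using CategoricalArrays
# Hypothetical: widen any CategoricalArray column to the default (UInt32)
# reference type before concatenating, so the merged pool always fits.
widen{T,N,R}(v::CategoricalArray{T,N,R}) =
    convert(CategoricalArray{T,N,CategoricalArrays.DefaultRefType}, v)
widen(v::AbstractVector) = v  # other column types pass through unchanged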
