
vcat should expand pooled columns when needed #990

Closed

Conversation

gustafsson
Contributor

pool! creates compact pooled dataframes that don't have room for more levels than needed.

When vcat-ing such pooled dataframes together, the resulting pool might need to be larger. This PR checks which levels are needed prior to copying the data.

Canonical example

a = DataFrame(x=pool(1:200))
b = DataFrame(x=pool(100:300))
vcat(a,b) # <- error

A similar problem with DataArrays occurs for vcat(pool(1:200), pool(100:300)), but that is not addressed by this PR.

@nalimilan
Member

nalimilan commented Jun 9, 2016

Good catch. Could you put that function into DataArrays.jl instead? It really belongs there. Shouldn't it even be done by the vcat method automatically?

EDIT: actually, this could be a method of similar, taking a keyword argument giving the new levels. We'd need to differentiate it from the standard similar method, though, to avoid the type-instability there.

@gustafsson
Contributor Author

DataArrays.vcat doesn't exist; it's just deferred to the default Base.vcat in abstractvector.jl, but I guess it could be specialised for pooled arrays. DataFrames.vcat, however, doesn't rely on Base.vcat, so fixing this for DataArrays wouldn't necessarily fix it for DataFrames. DataFrames.vcat and Base.vcat do both use similar though, which I assume your edited comment was referring to.

If similar took the new levels as a keyword argument, those levels would need to be computed by the caller. That is what I implemented for DataFrames.vcat, except the new levels are passed in with the first argument to similar instead of as an additional keyword argument. A new version of similar could also take the entire vector of arrays and from that figure out a common type with room for all, which is essentially what DataFrames._colinfo already does.

Or how would you propose that the new levels be created before passing them to similar? Something like this?

function Base.similararray{T,R<:Integer,N}(a::PooledDataArray{T,R,N}, S, n; array=[])
    pool = levels([map(levels, array)...;])  # merged levels of all arrays in `array`
    norefs = DataArrays.RefArray(Array{DataArrays.DEFAULT_POOLED_REF_TYPE,N}(n))
    compact(PooledDataArray(norefs, pool))   # narrow the ref type to fit the pool
end

# default
Base.similararray(a, T, n; kwargs...) = similar(a, T, n)

# Use like so:
a = similararray(full(V[1]), T, n; array=V)

And then change for instance https://github.com/JuliaLang/julia/blob/master/base/abstractarray.jl#L638 as well as related calls in the same file?

@nalimilan
Member

I'm not sure what the best strategy is, but my general idea was to move all PooledDataArray implementation details to DataArrays.jl and use only a relatively high-level API from DataFrames.jl. The plan is to move away from DataArrays to something like CategoricalData.jl, which will be easier if we don't mess with internals (which is never a good idea anyway).

While similar[array] can be passed a list of arrays, I would find it better to pass it the merged list of levels directly: that would be a natural extension of similar, since levels are a property of PDAs just like element type and dimensions are for all arrays. As you say, this is quite "similar" to what you do here, except that no unexported DataArrays functions would be used: those bits would be moved to that package instead.
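
For concreteness, a sketch of what such an extended similar might look like on the DataArrays side; the name similar_pooled and its signature below are assumptions for illustration, not an existing API:

using DataArrays  # hypothetical sketch only, nothing here exists in the package
function similar_pooled{T,R<:Integer,N}(a::PooledDataArray{T,R,N}, n, newlevels::AbstractVector)
    # start with the widest default ref type, then let compact narrow it
    refs = DataArrays.RefArray(zeros(DataArrays.DEFAULT_POOLED_REF_TYPE, n))
    compact(PooledDataArray(refs, convert(Vector{T}, newlevels)))
end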

@gustafsson
Contributor Author

gustafsson commented Jun 22, 2016

The number of levels that a PDA can hold is inferred from its integer type parameter. The actual levels do not define the type, as an instance of a PDA can be expanded with more levels, but only up to a limit defined by the type parameter.

If we call similar[array] and "pass it the merged list of levels directly", who would create that merged list of levels? The caller who is outside of the DataArrays package?
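
For reference, computing the merged list itself is a one-liner on the caller side; a sketch using DataArrays' exported pool and levels:

using DataArrays
cols = Any[pool(1:200), pool(100:300)]
merged = unique(vcat(map(levels, cols)...))  # union of both pools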

Another approach would be to rewrite the DataFrames.vcat to rely on Base.vcat, something like

function Base.vcat(a::Array{DataFrame})
  r = DataFrame()
  # union of all column names, in order of first appearance
  for n in unique([map(names, a)...;])
    # frames lacking column `n` contribute NAs of matching length
    r[n] = vcat([haskey(e, n) ? e[n] : DataArray(NAtype, nrow(e)) for e in a]...)
  end
  r
end

vcat of DataArrays would then properly handle PooledDataArrays internally. (DataArray(NAtype, nrow(e)) can already be promoted to any type, e.g. vcat(DataArray(NAtype,1), [1]) |> typeof == DataArrays.DataArray{Int64,1}.)
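
For illustration, under the sketch above a column missing from one of the frames would come back filled with NAs promoted to the other frame's element type:

a = DataFrame(x=1:2)
b = DataFrame(x=3:4, y=["a","b"])
vcat([a, b])  # :y is NA for the rows that came from `a`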

@nalimilan
Member

Yes, using the highest-level API sounds like a good idea, as it allows each type to choose the most appropriate behaviour.

@gustafsson
Contributor Author

Waiting for review on JuliaStats/DataArrays.jl#213

@nalimilan
Member

Fixes landed in both DataArrays and CategoricalArrays (with JuliaData/CategoricalArrays.jl#18). Do we still need to do something here?

@gustafsson
Contributor Author

Yes, because DataFrames doesn't use Base.vcat on its columns: abstractdataframe.jl#691

Any comments on the approach suggested above? It may need something more complicated than DataArray(NAtype, nrow(e)).

@nalimilan
Member

Honestly, I'm not familiar enough with this code to be completely sure. Your proposal to rely on vcat sounds good, and it means we can support any types without reimplementing the concatenation logic here. Though it has the drawback that we need to allocate arrays full of null values just to call vcat on them and destroy them afterwards. That said, if that turns out to be an issue for performance, I guess we could create an internal NullArray type which would return only nulls for a specified length.

So unless we can find other drawbacks to it, I'd say go ahead. DataArray(NAtype, nrow(e)) can be replaced with NullableArray(nrow(e)), though we may have to implement a vcat method to efficiently concatenate it with CategoricalArray/NullableCategoricalArray (together with promote rules). These specialized methods are probably needed anyway.
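
A minimal sketch of that internal NullArray idea, assuming DataArrays' NA and NAtype; no such type exists in either package:

using DataArrays  # for NA and NAtype
# Hypothetical: a zero-storage vector that is NA at every index, so vcat
# could consume "missing" columns without allocating a real array.
immutable NullArray <: AbstractVector{NAtype}
    len::Int
end
Base.size(a::NullArray) = (a.len,)
Base.getindex(a::NullArray, i::Int) = NA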

@gustafsson
Contributor Author

Not needed as of JuliaData/CategoricalArrays.jl#37

@gustafsson gustafsson closed this Nov 1, 2016
@gustafsson gustafsson reopened this Nov 1, 2016
@gustafsson
Contributor Author

gustafsson commented Nov 1, 2016

Correction: JuliaData/CategoricalArrays.jl#37 solves the issue of efficient concatenation discussed above, but not the issue of concatenating compact pooled arrays, which this PR was originally intended to fix. This was more of an issue with the old PooledDataArray, which created a compact pool by default when using pool(a,b,c). The new categorical(a,b,c), on the other hand, isn't compacted. That is, the issue doesn't occur unless you explicitly call compact.

The old example using PooledDataArray:

a = DataFrame(x=pool(1:200))
b = DataFrame(x=pool(100:300))
vcat(a,b) # <- error

With CategoricalArrays the same issue occurs like this:

a = DataFrame(x=categorical(1:200))
b = DataFrame(x=categorical(100:300))
a[:x] = compact(a[:x])
b[:x] = compact(b[:x])
vcat(a,b) # <- error

which can be fixed by simply not calling compact:

a = DataFrame(x=categorical(1:200))
b = DataFrame(x=categorical(100:300))
vcat(a,b) # <- works

That's fine by me. Besides, CategoricalArrays gives a fairly clear error message about what to do when this happens, i.e. use a larger reference type. So I don't need this PR anymore.

Related (even though the patch wouldn't belong here): it could perhaps be nice to have a function to uncompact like this:

uncompact{T, D, R}(ca::NullableCategoricalArray{T, D, R}) =
    convert(NullableCategoricalArray{T, D, CategoricalArrays.DefaultRefType}, ca)
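
Usage would then mirror the failing compact example above, assuming an analogous method were also defined for plain CategoricalArray:

a[:x] = uncompact(a[:x])  # widen the refs back to the default type
b[:x] = uncompact(b[:x])
vcat(a, b)                # <- works again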

@gustafsson gustafsson closed this Nov 1, 2016
@gustafsson gustafsson deleted the expand_vcat_pool branch November 1, 2016 18:03
@nalimilan
Member

Yet this failure is inconsistent with the fact that your vcat implementation for CategoricalArray always uses DefaultRefType (i.e. UInt32) to avoid throwing an error. Either raising an error is fine, and vcat should preserve the reference type, or it's not, and data frames should work the same as vcat.

Whether or not to compact columns will depend on the input method, e.g. CSV.jl, Query.jl...

@gustafsson
Contributor Author

Yes, it is inconsistent.

I fiddled a bit with using vcat for columns in DataFrames, but I couldn't get around either allocating arrays twice, allocating arrays filled with nulls just to discard them, or resorting to the current implementation, which is probably the most efficient of those alternatives, albeit inconsistent.

An example: #1118

@nalimilan
Member

Even if we manage to rely only on vcat thanks to #1118, we could include a special case for CategoricalArray in DataFrames, which would widen the reference type if there are too many levels. Indeed, that's possible only in DataFrames, where concatenation is already type-unstable anyway.
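
A sketch of what that special case might look like; the helper name is hypothetical, and the convert mirrors the uncompact example above:

using CategoricalArrays
# Hypothetical: widen any CategoricalArray column to the default (UInt32)
# reference type before concatenating, so the merged pool always fits.
widen{T,N,R}(v::CategoricalArray{T,N,R}) =
    convert(CategoricalArray{T,N,CategoricalArrays.DefaultRefType}, v)
widen(v::AbstractVector) = v  # other column types pass through unchanged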
