vcat should expand pooled columns when needed #990
Good catch. Could you put that function into DataArrays.jl instead? It really belongs there. Shouldn't it even be done by the EDIT: actually, this could be a method of |
If Or how would you propose that the new levels are created before passing them to
And then change for instance https://github.com/JuliaLang/julia/blob/master/base/abstractarray.jl#L638 as well as related calls in the same file? |
I'm not sure what's the best strategy, but my general idea was to move all While |
The number of levels that a PDA can hold is inferred from its integer type parameter. The actual levels do not define the type, as an instance of a PDA can be expanded with more levels, but only up to a limit defined by the type parameter. If we call similar[array] and "pass it the merged list of levels directly", who would create that merged list of levels? The caller who is outside of the DataArrays package? Another approach would be to rewrite the DataFrames.vcat to rely on Base.vcat, something like
vcat of DataArrays would then properly handle PooledDataArrays internally. ( |
Yes, using the highest-level API sounds like a good idea, as it allows each type to choose the most appropriate behaviour. |
Waiting for review on JuliaStats/DataArrays.jl#213 |
Fixes landed in both DataArrays and CategoricalArrays (with JuliaData/CategoricalArrays.jl#18). Do we still need to do something here? |
Yes because DataFrames doesn't use Any comments on the approach suggested above? It may need something more complicated than |
Honestly, I'm not familiar enough with this code to be completely sure. Your proposal to rely on So unless we can find other drawbacks to it, I'd say go ahead. |
Not needed as of JuliaData/CategoricalArrays.jl#37 |
Correction: JuliaData/CategoricalArrays.jl#37 solves the issue of efficient concatenation discussed above, but not the issue of concatenating compact pooled arrays, which this PR was originally intended to fix. This was more of an issue with the old PooledDataArray which created a compact pool by default when using The old example using PooledDataArray:
With CategoricalArrays the same issue occurs like this:
which can be fixed by simply not calling
That's fine by me. Besides, Related (even though the patch wouldn't belong here): it could perhaps be nice to have a function to
|
Yet this failure is inconsistent with the fact that your Whether or not to compact columns will depend on the input method, e.g. CSV.jl, Query.jl... |
Yes it is inconsistent. I fiddled a bit with using An example: #1118 |
Even if we manage to rely only on |
pool! creates compact pooled dataframes that have no room for more levels than needed. When vcat-ing such pooled dataframes together, the resulting pool might need to be larger. This PR checks which levels are needed prior to copying the data.
Canonical example
A similar problem with DataArrays occurs for
vcat(pool(1:200), pool(100:300))
but that is not addressed by this PR.
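To make the 1:200 / 100:300 case concrete, here is a minimal sketch of the general idea (merge the levels first, pick a reference type wide enough for the merged pool, then copy the data). It uses only Base Julia. `ToyPooled`, `smallest_reftype`, `toy_pool` and `toy_vcat` are hypothetical names invented for this illustration; this is not the DataArrays.jl or CategoricalArrays.jl implementation, nor the code in this PR.

```julia
# Toy model of a pooled array (hypothetical, for illustration only).
struct ToyPooled{T,R<:Unsigned}
    refs::Vector{R}   # for each element, its index into `pool`
    pool::Vector{T}   # the distinct levels
end

# Smallest unsigned integer type that can index `n` levels.
function smallest_reftype(n::Integer)
    n <= typemax(UInt8)  && return UInt8
    n <= typemax(UInt16) && return UInt16
    n <= typemax(UInt32) && return UInt32
    return UInt64
end

# Build a compact pooled array: the reference type is only as wide as needed.
function toy_pool(values)
    levels = collect(unique(values))
    index = Dict(level => i for (i, level) in enumerate(levels))
    R = smallest_reftype(length(levels))
    refs = R[index[v] for v in values]
    return ToyPooled{eltype(levels),R}(refs, levels)
end

# Concatenate: merge the levels, choose a reference type that fits the
# merged pool, and only then copy the data over.
function toy_vcat(a::ToyPooled, b::ToyPooled)
    levels = union(a.pool, b.pool)
    index = Dict(level => i for (i, level) in enumerate(levels))
    R = smallest_reftype(length(levels))
    refs = Vector{R}(undef, length(a.refs) + length(b.refs))
    for (i, r) in enumerate(a.refs)
        refs[i] = index[a.pool[r]]
    end
    offset = length(a.refs)
    for (i, r) in enumerate(b.refs)
        refs[offset + i] = index[b.pool[r]]
    end
    return ToyPooled{eltype(levels),R}(refs, levels)
end

a = toy_pool(1:200)     # 200 levels, fits in UInt8
b = toy_pool(100:300)   # 201 levels, fits in UInt8
c = toy_vcat(a, b)      # 300 merged levels, needs a wider reference type
```

In this toy model each input fits in `UInt8` references on its own, but the merged pool has 300 levels, so the result has to use `UInt16`. Copying references into an array of the first input's reference type would overflow, which is the kind of failure the thread describes.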