Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide if we want to copy levels of CategoricalValue if we do Tables.allocatecolumn #2104

Closed
bkamins opened this issue Feb 9, 2020 · 5 comments
Labels
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Feb 9, 2020

Given JuliaData/Tables.jl#99 (comment) should we then add after Tables.allocatecolumn a special check for CategoricalValue passed and set levels properly (there like 5 to 10 places in DataFrames.jl where this happens; I can do this if we agree on this).

CC @nalimilan @quinnj

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2020

OK. I have investigated the cases when we might have problems with Tables.allocatecolumn and we have only one with vcat:

julia> x = categorical(String[])
0-element CategoricalArray{String,1,UInt32}

julia> levels!(x, ["a", "b"])
0-element CategoricalArray{String,1,UInt32}

julia> df = DataFrame(x=x)
0×1 DataFrame


julia> levels(df.x)
2-element Array{String,1}:
 "a"
 "b"

julia> levels([df;df].x)
0-element Array{String,1}

(for non-zero row operations this is OK, as we perform assignment)

So we should decide if we are actually OK with this behavior or not, i.e. if 0-length categorical vector should carry over its levels to a vcat-ted vector if the rules say that the resulting vector should also be categorical.

@bkamins bkamins added the breaking The proposed change is breaking. label Feb 12, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 12, 2020
@nalimilan
Copy link
Member

That's definitely a bug. That's indeed the limit of the approach in which we merge levels on setindex!: for empty vectors it has no effect.

Though in this particular case, we could change copyto! to always copy levels, even when not copying any values (the line to change is here). That would be inconsistent with calling setindex! repeatedly (which is what the AbstractArray fallback does), but that would make more sense and probably fix a lot of cases. But maybe that's OK, as we can't fix setindex! in general: it covers situations where you may really not know what are the possible levels when the input is empty (e.g. if the categorical value is returned by a user-provided function which cannot be called for empty inputs).

(All of this would not happen if we had a good promotion mechanism in vcat (JuliaLang/julia#20815), as then we would just call vcat on columns and CategoricalArrays provide the right method. But the copyto! behavior would still have to be decided.)

@bkamins bkamins added bug and removed breaking The proposed change is breaking. labels Feb 14, 2020
@bkamins
Copy link
Member Author

bkamins commented Feb 14, 2020

I would go for the the fix of copyto! then.
Shall I open an issue in CategoricalArrays.jl or you will make a PR without it (as usual - I think it is safer if you make the changes in CategoricalArrays.jl 😄). Thank you!

We should release CategoricalArrays.jl soon anyway given the JuliaData/CategoricalArrays.jl#250 fix (maybe also JuliaData/CategoricalArrays.jl#244 could be fixed by then - but it is less priority).

@nalimilan
Copy link
Member

JuliaData/CategoricalArrays.jl#253 fixes copyto!.

@bkamins
Copy link
Member Author

bkamins commented Apr 15, 2020

OK - this can be closed then.

@bkamins bkamins closed this as completed Apr 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants