Strange behaviour with non-ASCII column names #2901
Comments
This is a consequence of Julia's design, which canonicalizes Unicode identifiers. An explanation is in JuliaLang/julia#5434. CC @stevengj, as he was the OP there, so that he is aware of this consequence. A more verbose example showing the problem and how to work around it:
and here is how it works for custom structs:
|
Realize that this pre-dates JuliaLang/julia#5434, because it applies equally well to canonically equivalent Unicode strings:
julia> a, b = "no\u00EBl", "Noe\u0308l"
("noël", "Noël")
julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ noël   Noël
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
julia> d.noël # autocompleted
1-element Vector{Int64}:
1
julia> d.Noël # autocompleted
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël
Stacktrace:
[1] lookupname
@ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:291 [inlined]
[2] getindex
@ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:297 [inlined]
[3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/dataframe/dataframe.jl:440
[4] getproperty(df::DataFrame, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/abstractdataframe.jl:348
[5] top-level scope
@ REPL[9]:1
Recommendation: you should probably normalize strings before checking that they are equal (e.g. normalize all of the column names when they are stored). (See also JuliaLang/julia/pull/42561 if you want to perform the Julia-identifier normalization.) |
@nalimilan - what do you think?
I am not sure we want to do this. I think it is important for programmatic use cases to store column names as the user asks them to be stored. If we normalized - two columns having distinct names e.g. The problem is only when the user wants to manually pass a
If we go with my proposal, I will add documentation explaining this case. |
In Unicode, these are considered "canonically equivalent" strings. According to the Unicode standard:
Of course, Julia itself does not compare two strings as Basically, I think you should apply at least NFC normalization when the string ceases to be "data" (bytes/codepoints) and becomes symbolic information for the user (e.g. a column label). If you are comparing a string to a |
(Alternatively, you could store both the normalized and the original non-normalized versions, and only use the former for comparing to |
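A minimal sketch of the normalization idea discussed above, assuming only Julia's `Unicode` standard library. The `stored_names` vector and the `nfc_to_stored` dictionary are hypothetical illustrations, not something DataFrames.jl is known to do: names are kept exactly as given, and an NFC-normalized key is used only for comparison.

```julia
using Unicode

# Two canonically equivalent spellings of the same text.
precomposed = "no\u00EBl"    # ë as a single code point (U+00EB)
decomposed  = "noe\u0308l"   # e followed by a combining diaeresis (U+0308)

# Julia compares strings code point by code point, so these differ...
@assert precomposed != decomposed
# ...but they agree once both are brought to NFC.
@assert Unicode.normalize(precomposed, :NFC) == Unicode.normalize(decomposed, :NFC)

# Hypothetical lookup table: NFC form => name exactly as stored.
stored_names = [decomposed, "x"]
nfc_to_stored = Dict(Unicode.normalize(n, :NFC) => n for n in stored_names)

# A query in either form resolves to the original, unmodified name.
query = precomposed
found = nfc_to_stored[Unicode.normalize(query, :NFC)]
@assert found == decomposed
```

Keeping both forms means that writing the data back out reproduces the original names, at the cost of having to detect collisions between names that only differ before normalization.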
I think this is a crucial thing. In my opinion column name in a The reason why we allow to refer to columns via programmer identifiers is:
Also note that Julia Base does not apply canonical equivalence not only for strings but also for
it is only when the symbol is passed as a programmer identifier that it gets normalized. And this is the reason for the problem, as
No we do not compare So in other words the problem is that:
Note though that even if we did not allow users to pass string literals as column names, such column names could still be read into a |
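The distinction drawn in this comment can be checked without DataFrames. The sketch below assumes the parser's identifier normalization is NFC for this input (the behavior introduced by JuliaLang/julia#5434); the variable names are illustrative only.

```julia
using Unicode

name = "noe\u0308l"   # decomposed form, as it might arrive from a file

# Constructing a Symbol from a string keeps the code points untouched.
from_string = Symbol(name)
@assert String(from_string) == name

# Parsing the same text as source code goes through identifier normalization.
from_parser = Meta.parse(name)
@assert from_parser isa Symbol
@assert from_parser == Symbol(Unicode.normalize(name, :NFC))

# So the two symbols differ, although they print identically.
@assert from_parser != from_string
```

This is why a name typed after `d.` can fail to match a column whose name was created from a string, even when tab completion produced the text.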
This is a good point. Actually we could compute it on the fly if the initial lookup fails so that we would not introduce an overhead for standard situations. |
Here is a thing you can do in Julia Base now because of what we have discussed:
Ah - we could not, because duplicate detection would have to be done at creation time. So we would have to discuss if we want to have such a feature (given the example above clearly |
Yes, we decided to keep spaces and special characters when reading CSV files precisely to ensure that writing the data back gives the same column names. And Julia does that when reading strings. So it wouldn't be super logical to normalize column names automatically. Maybe when a name isn't found we could try to check whether there's a column which would match after normalization, and if so explain that in the error? We could provide a function or a hint to a short syntax to normalize column names. |
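A rough sketch of the "hint on failure" idea, assuming column names are held as a `Vector{Symbol}`. The function `find_column` and the wording of its message are hypothetical and are not the DataFrames.jl implementation; normalization is only computed after the exact lookup has already failed, so the common path pays nothing extra.

```julia
using Unicode

# Hypothetical helper, not the DataFrames.jl implementation.
function find_column(names::Vector{Symbol}, col::Symbol)
    idx = findfirst(==(col), names)
    idx === nothing || return idx

    # Exact lookup failed: only now pay for normalization.
    nfc(s) = Unicode.normalize(String(s), :NFC)
    target = nfc(col)
    matches = [n for n in names if nfc(n) == target]

    msg = "column name :$col not found"
    if !isempty(matches)
        msg *= "; it matches " * join(repr.(matches), ", ") *
               " after Unicode normalization"
    end
    throw(ArgumentError(msg))
end

cols = [Symbol("noe\u0308l"), :x]
@assert find_column(cols, :x) == 2

# Looking up the precomposed spelling fails, but the error carries the hint.
err = try
    find_column(cols, Symbol("no\u00EBl"))
catch e
    e
end
@assert err isa ArgumentError
@assert occursin("after Unicode normalization", err.msg)
```

The stored names are never modified here, which is in line with the goal of writing data back with the same column names it was read with.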
@quinnj - does https://csv.juliadata.org/latest/reading.html#normalizenames in CSV.jl also provide this kind of normalization? |
I will make a PR for this
I will add this hint in the error message:
|
I have implemented it in #2904. The only problem, as commented there, is that |
It was decided not to perform normalization on symbols constructed with |
When a DataFrame contains a non-ASCII column name like here:
access to the column leads to an "interesting" error:
This is weird, because typing d. and pressing TAB completes exactly the above command.
When using the following syntax, everything works as expected:
Happens with DataFrames 1.2.2 as well as with [a93c6f00] DataFrames v1.2.2 https://github.com/JuliaData/DataFrames.jl.git#main.