Strange behaviour with non-ASCII column names #2901

wentasah · 2021-10-08T13:50:35Z

When a DataFrame contains a non-ASCII column name, as here:

d = DataFrame("power_µW" => [1,2,3])

access to the column leads to an "interesting" error:

d.power_µW
# ERROR: ArgumentError: column name :power_μW not found in the data frame; 
# existing most similar names are: :power_µW

This is weird, because typing d. and pressing TAB completes exactly the above command.

When using the following syntax, everything works as expected:

d."power_µW"
# 3-element Vector{Int64}:
#  1
#  2
#  3

This happens with DataFrames 1.2.2 as well as with the main branch ([a93c6f00] DataFrames v1.2.2 https://github.com/JuliaData/DataFrames.jl.git#main).

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)


bkamins · 2021-10-08T15:00:11Z

This is a consequence of Julia's design decision to canonicalize Unicode identifiers; see JuliaLang/julia#5434 for an explanation. CC @stevengj, as he was the OP there, so that he is aware of this consequence.

A more verbose example showing the problem and how to work around it:

julia> a, b = "xμ", "yµ"
("xμ", "yµ")

julia> collect(a)
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)

julia> collect(b)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'µ': Unicode U+00B5 (category Ll: Letter, lowercase)

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ xμ     yµ    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2

julia> d.xμ # autocompleted
1-element Vector{Int64}:
 1

julia> d.yµ # autocompleted
ERROR: ArgumentError: column name :yμ not found in the data frame; existing most similar names are: :yµ and :xμ
Stacktrace:
 [1] lookupname
   @ C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\other\index.jl:291 [inlined]
 [2] getindex
   @ C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\other\index.jl:297 [inlined]
 [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
   @ DataFrames C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\dataframe\dataframe.jl:513
 [4] getproperty(df::DataFrame, col_ind::Symbol)
   @ DataFrames C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\abstractdataframe\abstractdataframe.jl:356
 [5] top-level scope
   @ REPL[97]:1

julia> foreach(x -> display(collect(x)), names(d))
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'µ': Unicode U+00B5 (category Ll: Letter, lowercase)

julia> d[:, a] # works
1-element Vector{Int64}:
 1

julia> d[:, b] # works
1-element Vector{Int64}:
 2

and here is how it works for custom structs:

julia> struct X
       xμ # note that this mu
       yµ # and this mu are not the same characters when we define the X struct; you need to use a proper font to see the difference
       end

julia> foreach(d -> display(collect(d)), string.(fieldnames(X))) # but they end up being the same after it gets defined
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
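To see the mechanism in isolation, without DataFrames, here is a minimal sketch that compares a Symbol built from a string with the result of parsing the same characters as source code. It assumes that Meta.parse applies the same identifier normalization as code typed at the REPL; the variable names are illustrative only.

```julia
# U+00B5 (micro sign) vs U+03BC (Greek small letter mu), written with escapes
# so the two cannot be confused in an editor.
micro = "y\u00B5"
mu    = "y\u03BC"

# Symbol(...) keeps the codepoints exactly as given.
sym_raw = Symbol(micro)

# Meta.parse treats the text as source code, so the identifier is normalized.
sym_parsed = Meta.parse(micro)

println(sym_raw == Symbol(mu))     # false: the micro sign is preserved
println(sym_parsed == Symbol(mu))  # true: the parser mapped micro sign to mu
println(sym_raw == sym_parsed)     # false: the mismatch that d.yµ runs into
```

The column is stored under the equivalent of sym_raw, while property access written as an identifier looks up the equivalent of sym_parsed.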

stevengj · 2021-10-08T18:31:16Z

Realize that this pre-dates JuliaLang/julia#5434, because it applies equally well to canonically equivalent Unicode strings:

julia> a, b = "no\u00EBl", "Noe\u0308l"
("noël", "Noël")

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ noël   Noël  
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2

julia> d.noël # autocompleted
1-element Vector{Int64}:
 1

julia> d.Noël # autocompleted
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël
Stacktrace:
 [1] lookupname
   @ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:291 [inlined]
 [2] getindex
   @ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:297 [inlined]
 [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/dataframe/dataframe.jl:440
 [4] getproperty(df::DataFrame, col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/abstractdataframe.jl:348
 [5] top-level scope
   @ REPL[9]:1

Recommendation: you should probably normalize strings before checking that they are equal (e.g. normalize all of the column names when they are stored). (See also JuliaLang/julia/pull/42561 if you want to perform the Julia-identifier normalization.)
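As a minimal sketch of what "normalize before comparing" means, the following uses only the Unicode standard library. The two strings are the precomposed and decomposed spellings of the same word (both lowercase here, so that they are canonically equivalent); the variable names are illustrative.

```julia
using Unicode

a = "no\u00EBl"    # precomposed ë (U+00EB)
b = "noe\u0308l"   # e followed by a combining diaeresis (U+0308)

println(a == b)                      # false: different codepoints
println(length(a), " ", length(b))   # 4 5

# After NFC normalization both strings have the same codepoints.
na = Unicode.normalize(a, :NFC)
nb = Unicode.normalize(b, :NFC)
println(na == nb)                    # true
```

Note that NFC covers canonical equivalence only; it leaves the micro sign (U+00B5) as it is.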

bkamins · 2021-10-08T19:39:22Z

@nalimilan - what do you think?

normalize all of the column names when they are stored

I am not sure we want to do this. I think it is important for programmatic use cases to store column names as the user asks them to be stored. If we normalized, two columns with distinct names, e.g. "no\u00EBl" and "noe\u0308l", would be considered duplicates, which I think we do not want.

The problem arises only when the user wants to manually pass a Symbol using a literal. In all other cases things work as expected:

julia> a, b = "no\u00EBl", "Noe\u0308l"
("noël", "Noël")

julia> sa, sb = Symbol.((a, b))
(:noël, :Noël)

julia> using DataFrames

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ noël   Noël
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> d[:, sa]
1-element Vector{Int64}:
 1

julia> d[:, sb]
1-element Vector{Int64}:
 2

julia> d[:, a]
1-element Vector{Int64}:
 1

julia> d[:, b]
1-element Vector{Int64}:
 2

julia> d."no\u00EBl"
1-element Vector{Int64}:
 1

julia> d."Noe\u0308l"
1-element Vector{Int64}:
 2

julia> d.noël
1-element Vector{Int64}:
 1

julia> d.Noël
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël

julia> d[:, :noël]
1-element Vector{Int64}:
 1

julia> d[:, :Noël]
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël

If we go along with my proposal, I will add documentation explaining this case.

stevengj · 2021-10-08T20:01:50Z

two columns having distinct names e.g. "no\u00EBl" and "noe\u0308l"

In Unicode, these are considered "canonically equivalent" strings. According to the Unicode standard:

Programs should always compare canonical-equivalent Unicode strings as equal

Of course, Julia itself does not compare two strings as == if they contain different codepoints, even if they are canonically equivalent, so it becomes a judgement call at what point you want to start applying canonical equivalence to comparisons. But Julia does apply canonical equivalence to programmer identifiers, so there is an argument for DataFrames to do so as well.

Basically, I think you should apply at least NFC normalization when the string ceases to be "data" (bytes/codepoints) and becomes symbolic information for the user (e.g. a column label). If you are comparing a string to a Symbol, you have probably hit that point.

stevengj · 2021-10-08T20:14:23Z

(Alternatively, you could store both the normalized and the original non-normalized versions, and only use the former for comparing to Symbol, throwing an error if it is ambiguous.)
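A rough sketch of that alternative, assuming NFC as the normalization and using a hypothetical NameIndex type and lookup function (neither is part of DataFrames.jl):

```julia
using Unicode

# Hypothetical helper type: keeps the names as given plus their NFC forms.
struct NameIndex
    original::Vector{String}
    normalized::Vector{String}
end

NameIndex(names::Vector{String}) =
    NameIndex(names, [Unicode.normalize(n, :NFC) for n in names])

# Look up a Symbol through the normalized names; refuse to guess when
# several stored names normalize to the same string.
function lookup(idx::NameIndex, sym::Symbol)
    key = Unicode.normalize(String(sym), :NFC)
    hits = findall(==(key), idx.normalized)
    isempty(hits) && throw(ArgumentError("column name :$sym not found"))
    length(hits) > 1 &&
        throw(ArgumentError("column name :$sym is ambiguous after normalization"))
    return only(hits)
end

idx = NameIndex(["noe\u0308l", "x"])
println(lookup(idx, Symbol("no\u00EBl")))  # 1: found despite the different spelling

ambiguous = NameIndex(["no\u00EBl", "noe\u0308l"])
# lookup(ambiguous, Symbol("no\u00EBl")) would throw an ArgumentError here
```

The original names stay untouched, so writing the data back out would still produce the names the user supplied.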

bkamins · 2021-10-08T20:18:44Z

I think you should apply normalization when the string ceases to be "data"

I think this is a crucial thing. In my opinion, a column name in a DataFrame is data (e.g. in a CSV file it is typically the first row of data). But this is debatable, so let us wait for @nalimilan's opinion.

The reasons why we allow referring to columns via programmer identifiers are:

historical - originally Symbols were used for the design so we stick to this as an alternative;
for convenience, so that you can write df.column and do not have to write df."column".

Also note that Julia Base applies canonical equivalence neither to strings nor to Symbols:

julia> a, b = "no\u00EBl", "noe\u0308l"
("noël", "noël")

julia> sa, sb = Symbol.((a, b))
(:noël, :noël)

julia> sa == sb
false

It is only when the symbol is written as a programmer identifier that it gets normalized. And this is the cause of the problem. As for:

If you are comparing a string to a Symbol, you have probably hit that point.

No, we do not compare Symbols to strings. We only ever compare Symbols (if the user passes a string, it is converted to a Symbol; the problem, as noted above, is that conversion to a Symbol does not ensure that the Symbol is in canonical form).

So in other words the problem is that:

julia> Symbol("noe\u0308l") == :noël # here the ë was typed in as \u0308l
false

Note, though, that even if we did not allow users to pass string literals as column names, such column names could still be read into a DataFrame via e.g. CSV.jl.

bkamins · 2021-10-08T20:22:13Z

(Alternatively, you could store both the normalized and the original non-normalized versions, and only use the former for comparing to Symbol, throwing an error if it is ambiguous.)

This is a good point. Actually we could compute it on the fly if the initial lookup fails so that we would not introduce an overhead for standard situations.

bkamins · 2021-10-08T20:34:40Z

Here is a thing you can do in Julia Base now because of what we have discussed:

julia> nt = (; Symbol("no\u00EBl") => 1, Symbol("noe\u0308l") => 2)
(noël = 1, noël = 2)

Actually we could compute it on the fly if the initial lookup fails

Ah - we could not, because duplicate detection would have to be done at creation time. So we would have to discuss if we want to have such a feature (given the example above clearly NamedTuple does not care and allows such field names 😄).

nalimilan · 2021-10-08T21:17:11Z

Yes, we decided to keep spaces and special characters when reading CSV files precisely to ensure that writing the data back gives the same column names. And Julia does that when reading strings. So it wouldn't be super logical to normalize column names automatically. Maybe when a name isn't found we could try to check whether there's a column which would match after normalization, and if so explain that in the error? We could provide a function or a hint to a short syntax to normalize column names.
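A minimal sketch of that idea, with names stored unchanged and normalization used only to improve the error message. The function find_column and the message wording are hypothetical, and NFC is assumed as the normalization:

```julia
using Unicode

# `colnames` stands in for the names stored in a data frame.
function find_column(colnames::Vector{String}, name::AbstractString)
    i = findfirst(==(name), colnames)
    i === nothing || return i

    # Exact lookup failed: check for names that match after normalization.
    key = Unicode.normalize(name, :NFC)
    candidates = [n for n in colnames if Unicode.normalize(n, :NFC) == key]

    msg = "column name \"$name\" not found"
    if !isempty(candidates)
        msg *= "; however, these names match after Unicode normalization: " *
               join(repr.(candidates), ", ")
    end
    throw(ArgumentError(msg))
end

println(find_column(["noe\u0308l", "x"], "x"))  # 2
# find_column(["noe\u0308l", "x"], "no\u00EBl") throws, and the message
# points at the stored name that differs only by normalization.
```

The extra work happens only on the failure path, so successful lookups pay nothing for it.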

bkamins · 2021-10-08T21:20:54Z

@quinnj - does https://csv.juliadata.org/latest/reading.html#normalizenames in CSV.jl also provide this kind of normalization?

nalimilan · 2021-10-08T21:28:49Z

Yes it does: https://github.com/JuliaData/CSV.jl/blob/0f97e66e998e0a7918b73ab6d54460c253c86d67/src/utils.jl#L333-L338

bkamins · 2021-10-08T21:35:43Z

Maybe when a name isn't found we could try to check whether there's a column which would match after normalization, and if so explain that in the error?

I will make a PR for this

We could provide a function or a hint to a short syntax to normalize column names.

I will add this hint in the error message:

using Unicode
rename!(Unicode.normalize, df)

bkamins · 2021-10-08T22:08:51Z

I have implemented it in #2904. The only problem, as commented there, is that Unicode.normalize does not handle the \mu case. Is there a function that would also handle this?
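For illustration, here are two ways the micro sign could be handled, as a sketch under stated assumptions. NFKC is a standard normalization form that maps the micro sign to mu, but it also rewrites other compatibility characters (for example superscript digits), so it is more aggressive than what the parser does. Unicode.julia_chartransform, as far as I know, is only available from Julia 1.8, which is newer than the 1.6.2 reported in this issue, so that part is guarded by a version check.

```julia
using Unicode

s = "power_\u00B5W"   # contains the micro sign U+00B5

nfc  = Unicode.normalize(s, :NFC)
nfkc = Unicode.normalize(s, :NFKC)
println('\u00B5' in nfc)    # true: NFC leaves the micro sign alone
println('\u03BC' in nfkc)   # true: NFKC maps it to Greek mu

# Julia 1.8+ only: mimic the parser's identifier normalization.
if VERSION >= v"1.8"
    jl = Unicode.normalize(s; compose=true, stable=true,
                           chartransform=Unicode.julia_chartransform)
    println(jl == string(Meta.parse(s)))   # true
end
```

Presumably a closure wrapping either call could be passed to rename! in the same way as Unicode.normalize, but that is an assumption on my part rather than something tested here.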

stevengj · 2021-10-09T03:18:12Z

It was decided not to perform normalization on symbols constructed with Symbol("…") because those are typically programmatically constructed (rather than human identifiers) and are not required to be valid Julia identifiers.

bkamins closed this as completed Oct 8, 2021

bkamins reopened this Oct 8, 2021

bkamins added the doc label Oct 8, 2021

bkamins added this to the 1.3 milestone Oct 8, 2021

bkamins mentioned this issue Oct 8, 2021

Try to detect unicode normalization issues in column names #2904

Merged

bkamins closed this as completed in #2904 Nov 15, 2021
