Strange behaviour with non-ASCII column names #2901

Closed
wentasah opened this issue Oct 8, 2021 · 14 comments · Fixed by #2904

@wentasah

wentasah commented Oct 8, 2021

When a DataFrame contains a non-ASCII column name, as here:

d = DataFrame("power_µW" => [1,2,3])

access to the column leads to an "interesting" error:

d.power_µW
# ERROR: ArgumentError: column name :power_μW not found in the data frame; 
# existing most similar names are: :power_µW

This is weird, because typing d. and pressing TAB completes exactly the above command.

When using the following syntax, everything works as expected:

d."power_µW"
# 3-element Vector{Int64}:
#  1
#  2
#  3

Happens with DataFrames 1.2.2 as well as with [a93c6f00] DataFrames v1.2.2 https://github.com/JuliaData/DataFrames.jl.git#main.

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
@bkamins
Member

bkamins commented Oct 8, 2021

This is a consequence of Julia's design decision to canonicalize Unicode identifiers. An explanation is in JuliaLang/julia#5434. CC @stevengj, as he was the OP there, so that he is aware of this consequence.

A more verbose example showing the problem and how to work around it:

julia> a, b = "xμ", "yµ"
("xμ", "yµ")

julia> collect(a)
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)

julia> collect(b)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'µ': Unicode U+00B5 (category Ll: Letter, lowercase)

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ xμ     yµ    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2

julia> d.xμ # autocompleted
1-element Vector{Int64}:
 1

julia> d.yµ # autocompleted
ERROR: ArgumentError: column name :yμ not found in the data frame; existing most similar names are: :yµ and :xμ
Stacktrace:
 [1] lookupname
   @ C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\other\index.jl:291 [inlined]
 [2] getindex
   @ C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\other\index.jl:297 [inlined]
 [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
   @ DataFrames C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\dataframe\dataframe.jl:513
 [4] getproperty(df::DataFrame, col_ind::Symbol)
   @ DataFrames C:\Users\bogum\.julia\packages\DataFrames\vuMM8\src\abstractdataframe\abstractdataframe.jl:356
 [5] top-level scope
   @ REPL[97]:1

julia> foreach(x -> display(collect(x)), names(d))
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'µ': Unicode U+00B5 (category Ll: Letter, lowercase)

julia> d[:, a] # works
1-element Vector{Int64}:
 1

julia> d[:, b] # works
1-element Vector{Int64}:
 2

and here is how it works for custom structs:

julia> struct X
       xμ # note that this mu
       yµ # and this mu are not the same characters when we define the X struct; you need to use a proper font to see the difference
       end

julia> foreach(d -> display(collect(d)), string.(fieldnames(X))) # but they end up being the same after it gets defined
2-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
2-element Vector{Char}:
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
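
The same effect can be reproduced without DataFrames or a struct definition. The following is a minimal sketch, assuming (as the outputs above suggest) that `Meta.parse` applies the same identifier normalization as code typed at the REPL, while `Symbol(...)` stores the characters unchanged. The variable names are only illustrative.

```julia
# 'y' followed by MICRO SIGN (U+00B5), as in the column name "yµ"
micro = "y\u00b5"
# 'y' followed by GREEK SMALL LETTER MU (U+03BC)
greek = "y\u03bc"

# Goes through the parser, like the literal in `d.yµ` does
parsed = Meta.parse(micro)
# Built programmatically, so no normalization is applied
constructed = Symbol(micro)

parsed_is_greek = parsed === Symbol(greek)
constructed_is_greek = constructed === Symbol(greek)

println("parsed identifier uses U+03BC:  ", parsed_is_greek)
println("Symbol(...) result uses U+03BC: ", constructed_is_greek)
```

This is why `d."yµ"` and `d[:, b]` find the column while `d.yµ` does not: only the last form passes the name through the parser.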

@bkamins bkamins closed this as completed Oct 8, 2021
@stevengj

stevengj commented Oct 8, 2021

Realize that this pre-dates JuliaLang/julia#5434, because it applies equally well to canonically equivalent Unicode strings:

julia> a, b = "no\u00EBl", "Noe\u0308l"
("noël", "Noël")

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ noël   Noël  
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2

julia> d.noël # autocompleted
1-element Vector{Int64}:
 1

julia> d.Noël # autocompleted
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël
Stacktrace:
 [1] lookupname
   @ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:291 [inlined]
 [2] getindex
   @ ~/.julia/packages/DataFrames/3mEXm/src/other/index.jl:297 [inlined]
 [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/dataframe/dataframe.jl:440
 [4] getproperty(df::DataFrame, col_ind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/abstractdataframe.jl:348
 [5] top-level scope
   @ REPL[9]:1

Recommendation: you should probably normalize strings before checking that they are equal (e.g. normalize all of the column names when they are stored). (See also JuliaLang/julia/pull/42561 if you want to perform the Julia-identifier normalization.)
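
As a minimal sketch of that recommendation, the two spellings of "noël" compare as different strings until both are brought to the same normal form. NFC is used here as one reasonable choice; the variable names are illustrative.

```julia
using Unicode

precomposed = "no\u00EBl"   # ë as the single code point U+00EB
decomposed  = "noe\u0308l"  # e followed by COMBINING DIAERESIS U+0308

# Plain == compares code points, so these differ
raw_equal = precomposed == decomposed

# After NFC normalization both become the precomposed form
nfc_equal = Unicode.normalize(precomposed, :NFC) ==
            Unicode.normalize(decomposed, :NFC)

println("equal as stored: ", raw_equal)
println("equal after NFC: ", nfc_equal)
```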

@bkamins bkamins reopened this Oct 8, 2021
@bkamins
Member

bkamins commented Oct 8, 2021

@nalimilan - what do you think?

normalize all of the column names when they are stored

I am not sure we want to do this. I think it is important for programmatic use cases to store column names exactly as the user asks them to be stored. If we normalized, two columns having distinct names, e.g. "no\u00EBl" and "noe\u0308l", would be considered duplicates, which I think we do not want.

The problem arises only when the user passes a Symbol manually as a literal. In all other cases things work as expected:

julia> a, b = "no\u00EBl", "Noe\u0308l"
("noël", "Noël")

julia> sa, sb = Symbol.((a, b))
(:noël, :Noël)

julia> using DataFrames

julia> d = DataFrame(a => 1, b => 2)
1×2 DataFrame
 Row │ noël   Noël
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> d[:, sa]
1-element Vector{Int64}:
 1

julia> d[:, sb]
1-element Vector{Int64}:
 2

julia> d[:, a]
1-element Vector{Int64}:
 1

julia> d[:, b]
1-element Vector{Int64}:
 2

julia> d."no\u00EBl"
1-element Vector{Int64}:
 1

julia> d."Noe\u0308l"
1-element Vector{Int64}:
 2

julia> d.noël
1-element Vector{Int64}:
 1

julia> d.Noël
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël

julia> d[:, :noël]
1-element Vector{Int64}:
 1

julia> d[:, :Noël]
ERROR: ArgumentError: column name :Noël not found in the data frame; existing most similar names are: :noël and :Noël

If we go along with my proposal, I will add documentation explaining this case.

@bkamins bkamins added the doc label Oct 8, 2021
@bkamins bkamins added this to the 1.3 milestone Oct 8, 2021
@stevengj

stevengj commented Oct 8, 2021

two columns having distinct names e.g. "no\u00EBl" and "noe\u0308l"

In Unicode, these are considered "canonically equivalent" strings. According to the Unicode standard:

Programs should always compare canonical-equivalent Unicode strings as equal

Of course, Julia itself does not compare two strings as == if they contain different codepoints, even if they are canonically equivalent, so it becomes a judgement call at what point you want to start applying canonical equivalence to comparisons. But Julia does apply canonical equivalence to programmer identifiers, so there is an argument for DataFrames to do so as well.

Basically, I think you should apply at least NFC normalization when the string ceases to be "data" (bytes/codepoints) and becomes symbolic information for the user (e.g. a column label). If you are comparing a string to a Symbol, you have probably hit that point.

@stevengj

stevengj commented Oct 8, 2021

(Alternatively, you could store both the normalized and the original non-normalized versions, and only use the former for comparing to Symbol, throwing an error if it is ambiguous.)

@bkamins
Member

bkamins commented Oct 8, 2021

I think you should apply normalization when the string ceases to be "data"

I think this is the crucial point. In my opinion a column name in a DataFrame is data (e.g. in a CSV file it is typically the first row of data). But this is debatable, so let us wait for @nalimilan's opinion.

The reasons why we allow referring to columns via programmer identifiers are:

  • historical - originally Symbols were used for the design so we stick to this as an alternative;
  • for convenience, so that you can write df.column and do not have to write df."column".

Also note that Julia Base applies canonical equivalence neither to strings nor to Symbols:

julia> a, b = "no\u00EBl", "noe\u0308l"
("noël", "noël")

julia> sa, sb = Symbol.((a, b))
(:noël, :noël)

julia> sa == sb
false

It is only when the symbol is written as a programmer identifier that it gets normalized. And this is the reason for the problem, as

If you are comparing a string to a Symbol, you have probably hit that point.

No, we do not compare Symbols to strings. We always compare only Symbols (if the user passes a string, it is converted to a Symbol; the problem, as noted above, is that this conversion does not ensure that the Symbol is in canonical form).

So in other words the problem is that:

julia> Symbol("noe\u0308l") == :noël # here the ë was typed in as \u0308l
false

Note though that even if we did not allow users to pass string literals as column names, such column names could still be read into a DataFrame via e.g. CSV.jl.

@bkamins
Member

bkamins commented Oct 8, 2021

(Alternatively, you could store both the normalized and the original non-normalized versions, and only use the former for comparing to Symbol, throwing an error if it is ambiguous.)

This is a good point. Actually we could compute it on the fly if the initial lookup fails so that we would not introduce an overhead for standard situations.
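
A rough sketch of such an on-the-fly fallback, written against a plain vector of names rather than the real DataFrames.jl index: `lookup_column` is a hypothetical helper, not part of any package, and NFC is assumed as the normal form. The exact match stays on the fast path, and normalization is only computed after it fails.

```julia
using Unicode

# Hypothetical helper (not DataFrames.jl API): return the position of `key`
# in `names`, falling back to an NFC-normalized comparison on a miss.
function lookup_column(names::Vector{Symbol}, key::Symbol)
    # Fast path: exact match, no normalization cost
    idx = findfirst(==(key), names)
    idx === nothing || return idx

    # Slow path: compare NFC-normalized forms
    nkey = Unicode.normalize(String(key), :NFC)
    hits = findall(n -> Unicode.normalize(String(n), :NFC) == nkey, names)

    length(hits) == 1 && return hits[1]
    isempty(hits) && throw(ArgumentError("column name :$key not found"))
    throw(ArgumentError("column name :$key is ambiguous after normalization"))
end

# The stored name is decomposed; the lookup key is precomposed
stored = [Symbol("noe\u0308l"), :x]
println(lookup_column(stored, Symbol("no\u00EBl")))
```

The ambiguity branch is reached only when several stored names share one normalized form and none of them matches the key exactly.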

@bkamins
Member

bkamins commented Oct 8, 2021

Here is a thing you can do in Julia Base now because of what we have discussed:

julia> nt = (; Symbol("no\u00EBl") => 1, Symbol("noe\u0308l") => 2)
(noël = 1, noël = 2)

Actually we could compute it on the fly if the initial lookup fails

Ah - we could not, because duplicate detection would have to be done at creation time. So we would have to discuss if we want to have such a feature (given the example above clearly NamedTuple does not care and allows such field names 😄).
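
To illustrate what such a creation-time check would involve, here is a minimal sketch. `has_normalized_duplicates` is a hypothetical function and NFC is an assumed choice of normal form; nothing like this is claimed to exist in DataFrames.jl or Base.

```julia
using Unicode

# Hypothetical check: do any two names collapse to the same string
# once they are NFC-normalized?
function has_normalized_duplicates(names::Vector{Symbol})
    normalized = [Unicode.normalize(String(n), :NFC) for n in names]
    return !allunique(normalized)
end

names = [Symbol("no\u00EBl"), Symbol("noe\u0308l")]

# As Symbols the two names are distinct, which is why NamedTuple accepts them
distinct_as_symbols = allunique(names)
# After normalization they clash
clash_after_nfc = has_normalized_duplicates(names)

println("distinct as Symbols: ", distinct_as_symbols)
println("clash after NFC:     ", clash_after_nfc)
```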

@nalimilan
Member

Yes, we decided to keep spaces and special characters when reading CSV files precisely to ensure that writing the data back gives the same column names. And Julia does that when reading strings. So it wouldn't be super logical to normalize column names automatically. Maybe when a name isn't found we could try to check whether there's a column which would match after normalization, and if so explain that in the error? We could provide a function or a hint to a short syntax to normalize column names.

@bkamins
Member

bkamins commented Oct 8, 2021

@quinnj - does https://csv.juliadata.org/latest/reading.html#normalizenames in CSV.jl also provide this kind of normalization?

@nalimilan
Member

Yes it does: https://github.com/JuliaData/CSV.jl/blob/0f97e66e998e0a7918b73ab6d54460c253c86d67/src/utils.jl#L333-L338

@bkamins
Member

bkamins commented Oct 8, 2021

Maybe when a name isn't found we could try to check whether there's a column which would match after normalization, and if so explain that in the error?

I will make a PR for this.

We could provide a function or a hint to a short syntax to normalize column names.

I will add this hint in the error message:

using Unicode
rename!(Unicode.normalize, df)

@bkamins
Member

bkamins commented Oct 8, 2021

I have implemented it in #2904. The only problem, as commented there, is that Unicode.normalize does not handle the \mu case. Is there a function that would also handle this?
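
For reference, a sketch of one possible answer. It assumes Julia 1.8 or later, where the identifier normalization from the pull request mentioned earlier in the thread (JuliaLang/julia#42561) is, to my understanding, exposed as `Unicode.julia_chartransform`. It is not available on the Julia 1.6.2 used in the original report.

```julia
using Unicode

s = "power_\u00b5W"   # contains MICRO SIGN (U+00B5)

# Plain NFC leaves the micro sign alone
nfc_only = Unicode.normalize(s, :NFC)

# Assumption: Julia >= 1.8. Adds the parser's extra character mappings
# (such as U+00B5 -> U+03BC) on top of composition.
julia_style = Unicode.normalize(s; compose=true, stable=true,
                                chartransform=Unicode.julia_chartransform)

println(collect(nfc_only)[7])     # still the micro sign
println(collect(julia_style)[7])  # Greek small letter mu
```

Under that assumption, `julia_style` should match the name the parser produces for the literal, i.e. `string(Meta.parse(s))`.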

@stevengj

stevengj commented Oct 9, 2021

It was decided not to perform normalization on symbols constructed with Symbol("…") because those are typically programmatically constructed (rather than human identifiers) and are not required to be valid Julia identifiers.
