Try to detect unicode normalization issues in column names #2904

Merged on Nov 15, 2021 (14 commits)

@bkamins bkamins commented Oct 8, 2021

Fixes #2901

@bkamins bkamins requested a review from nalimilan October 8, 2021 22:05
@bkamins
Member Author

bkamins commented Oct 8, 2021

The only problem is that this fix does not resolve the OP's issue, as we still get:

julia> d = DataFrame("power_µW" => [1,2,3])
3×1 DataFrame   
 Row │ power_µW 
     │ Int64    
─────┼──────────
   1 │        1 
   2 │        2 
   3 │        3 

julia> d.power_µW
ERROR: ArgumentError: column name :power_μW not found in the data frame; existing most similar names are: :power_µW

as this kind of change is not covered by Unicode.normalize.
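
To see why `Unicode.normalize` alone does not help here, a minimal sketch using only the `Unicode` standard library may be useful. It assumes the default call performs canonical composition, which leaves the micro sign untouched; the variable names are illustrative only and are not part of the PR.

```julia
using Unicode

micro = "\u00B5"  # MICRO SIGN, the character used in the column name
mu    = "\u03BC"  # GREEK SMALL LETTER MU, the character the parser produces

# Default (canonical) normalization keeps the micro sign as it is,
canonical = Unicode.normalize(micro)
# while compatibility normalization (NFKC) folds it into mu.
compat = Unicode.normalize(micro, :NFKC)

println(canonical == micro)  # true
println(compat == mu)        # true
```

NFKC is shown only for contrast: it also rewrites many other characters, so it is not equivalent to the Julia parser's identifier normalization.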

@bkamins bkamins added the feature label Oct 8, 2021
@bkamins bkamins added this to the 1.3 milestone Oct 8, 2021
@stevengj

stevengj commented Oct 9, 2021

For the mu fix, currently the only way to apply the Julia-specific normalization is to call Meta.parse, which is a bit slow and will fail on names that aren’t valid Julia identifiers. Or you can wait for JuliaLang/julia#42561.
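
A rough sketch of the `Meta.parse` route mentioned above, with its limitation made explicit. The helper name `parser_normalized` is hypothetical, and returning `nothing` for anything that does not parse to a single identifier is an assumption made for illustration.

```julia
# Hypothetical helper: apply the parser's identifier normalization to a name.
# Returns `nothing` when the string does not parse to a plain Symbol
# (for example because it contains a space).
function parser_normalized(name::AbstractString)
    ex = Meta.parse(name; raise=false)
    return ex isa Symbol ? ex : nothing
end

# The micro sign (U+00B5) is replaced by Greek mu (U+03BC) by the parser.
parser_normalized("power_\u00B5W") === Symbol("power_\u03BCW")

# Not a valid identifier, so this approach gives no answer.
parser_normalized("column one") === nothing
```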

@@ -288,9 +288,28 @@ function fuzzymatch(l::Dict{Symbol, Int}, idx::Symbol)
    return [s for (d, s) in dist if d <= maxd]
end

function normalizedmatch(l::Dict{Symbol, Int}, idx::Symbol)
    idxs = Unicode.normalize(string(idx))

Member:

Why normalize the passed index? Even if users normalize the data frame's column names, we won't normalize the passed index, so they will get the same error again. We could have two different checks and two error messages.

Member Author:

The issue is that both column name and passed index can have a mixture of normalized and unnormalized codeunits. I have improved the error message indicating that it is recommended to normalize both.

Member:

Right. But it's interesting for users to know whether the problem (i.e. non-normalized name) comes from the data frame, from the index, or both.

Member Author:

OK - I will add this information to the message.

Member Author:

Now the error messages are as follows:

julia>     d1 = DataFrame("no\u00EBl" => 1)
1×1 DataFrame
 Row │ noël  
     │ Int64 
─────┼───────
   1 │     1 

julia>     d1[:, "noe\u0308l"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data passed column name was not normalized. It is recommended to use normalized
column names and then refer to them using normalized names to avoid ambiguity. In order to
normalize column names in an existing data frame `df` do `using Unicode;
rename!(Unicode.normalize, df)`.

julia>     d2 = DataFrame("noe\u0308l" => 1)
1×1 DataFrame
 Row │ noël  
     │ Int64 
─────┼───────
   1 │     1

julia>     d2[:, "no\u00EBl"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data column name found in the data frame was not normalized. It is recommended
to use normalized column names and then refer to them using normalized names to avoid ambiguity.
In order to normalize column names in an existing data frame `df` do `using Unicode;
rename!(Unicode.normalize, df)`.

julia>     d3 = DataFrame("noe\u0308\u00EBl" => 1)
1×1 DataFrame
 Row │ noëël 
     │ Int64 
─────┼───────
   1 │     1

julia>     d3[:, "no\u00EBe\u0308l"]
ERROR: ArgumentError: column name :noëël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data both names were not normalized. It is recommended to use normalized column
names and then refer to them using normalized names to avoid ambiguity. In order to normalize
column names in an existing data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.
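
The three cases in these messages can be told apart with a small check. The following is only a sketch of that idea, not the code from the PR; the function name and the returned symbols are hypothetical, and it assumes the exact lookup has already failed.

```julia
using Unicode

# Hypothetical helper: given a stored column name and a passed name,
# report which of them was not in normalized form.
function unnormalized_side(colname::AbstractString, passed::AbstractString)
    ncol = Unicode.normalize(colname)
    npassed = Unicode.normalize(passed)
    ncol == npassed || return :no_normalized_match
    col_ok = colname == ncol        # stored name already normalized?
    passed_ok = passed == npassed   # passed name already normalized?
    if col_ok && passed_ok
        return :exact_match         # both normalized and equal, so identical
    elseif col_ok
        return :passed_name
    elseif passed_ok
        return :column_name
    else
        return :both
    end
end

unnormalized_side("no\u00EBl", "noe\u0308l")               # :passed_name
unnormalized_side("noe\u0308l", "no\u00EBl")               # :column_name
unnormalized_side("noe\u0308\u00EBl", "no\u00EBe\u0308l")  # :both
```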

@bkamins
Member Author

bkamins commented Oct 9, 2021

Following the suggestion by @stevengj I have hard-coded also the non-standard mappings, so now we have:

julia> d = DataFrame("power_µW" => [1,2,3])
3×1 DataFrame   
 Row │ power_µW 
     │ Int64    
─────┼──────────
   1 │        1 
   2 │        2 
   3 │        3 

julia> d.power_µW
ERROR: ArgumentError: column name :power_μW not found in the data frame. However there is a similar column name in the
data frame where character μ is used is instead of µ. Note that these characters are displayed very similarly but are
different as their normalized codepoints are 956 and 181 respectively. The error is most likely caused by the Julia
parser which normalizes `Symbol` literals containing such characters. In order to avoid such problems use only μ
(codepoint: 956) character when naming columns.

and

julia> d1 = DataFrame("no\u00EBl" => 1)
1×1 DataFrame
 Row │ noël  
     │ Int64 
─────┼───────
   1 │     1

julia> d1[:, "noe\u0308l"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of Unicode normalized
passed column name with a normalized column name found in the data frame. It is recommended to use normalized column
names and then refer to them using normalized names to avoid ambiguity. In order to normalize column names in an existing
data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.

@nalimilan
Member

What's annoying is that we don't give any concrete solution when the parser replaced some chars with others during normalization. We should probably wait until @stevengj's PR is merged.

src/other/index.jl
Comment on lines 300 to 303
s1 = iterate(a1)
s2 = iterate(a2)

while !(s1 === nothing || s2 === nothing)

Member:

Can't you use zip here?

Member Author:

No - because zip does not check that its arguments have the same length and running length on strings is expensive.

Member Author:

I have re-implemented it to provide a simpler logic.
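
To illustrate the point about `zip`, here is a sketch of walking two strings in lockstep with the `iterate` protocol, so that a length mismatch is noticed without calling `length`. It is not the merged implementation; `first_mismatch` and its return values are hypothetical.

```julia
# Hypothetical helper: return the first pair of differing characters,
# `:length_differs` if one string is a strict prefix of the other,
# or `nothing` if the strings are identical.
function first_mismatch(a1::AbstractString, a2::AbstractString)
    s1 = iterate(a1)
    s2 = iterate(a2)
    while !(s1 === nothing || s2 === nothing)
        c1, st1 = s1
        c2, st2 = s2
        c1 == c2 || return (c1, c2)
        s1 = iterate(a1, st1)
        s2 = iterate(a2, st2)
    end
    # At least one string is exhausted; lengths match only if both are.
    return (s1 === nothing && s2 === nothing) ? nothing : :length_differs
end

first_mismatch("power_\u03BCW", "power_\u00B5W")  # ('μ', 'µ'), i.e. U+03BC vs U+00B5
first_mismatch("ab", "abc")                       # :length_differs

# By contrast, zip silently stops at the shorter argument.
length(collect(zip("ab", "abc")))                 # 2
```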

@stevengj

stevengj commented Oct 9, 2021

If there is no ambiguity, I'm not quite clear on why you are throwing an error rather than giving the equivalent (modulo normalization) column?

@bkamins
Member Author

bkamins commented Oct 9, 2021

I'm not quite clear on why you are throwing an error rather than giving the equivalent (modulo normalization) column?

This is always a hard decision to make. First off - it would be (mildly) breaking.

A deeper reason is that:

  • in programmatic mode - I would prefer to stick to the rule that we are "strict" (i.e. we do verbatim what the user asked for) and otherwise throw an error
  • in interactive mode - we could indeed add a small amount of convenience, but that is why I have written verbose error information so that:
    • the user would know exactly what the problem is and how to fix it
    • the user would not be surprised when moving from interactive work to production

Of course I am open to discussing this as usual, but this is my thinking about the issue (i.e. that it is safer to be conservative).

@nalimilan
Member

Also performance will be worse when we need to check all column names, so better signal the problem to the user than silently being slower.

@bkamins
Member Author

bkamins commented Oct 9, 2021

What's annoying is that we don't give any concrete solution when the parser replaced some chars with others during normalization.

I believe we do. This is the reason I made such verbose error messages, printing both the proper character that should be used and its codepoint so that the user can fix it by copy-paste in their code (of course it would be nicer to be able to give an "easy fix" command, but we have to wait until Julia Base has it to add this - and even then it will be conditional on the Unicode module having this part, as I assume Julia LTS will not include it any time soon).

@stevengj

Also performance will be worse when we need to check all column names, so better signal the problem to the user than silently being slower.

If you precompute the normalization of the column names, and look up symbols assuming that they have already been normalized by Julia, then it seems like the performance hit could be made quite small.

Comment on lines 347 to 356
"data frame. However there is a match of " *
"Unicode normalized passed column name with " *
"a normalized column name found in the " *
"data frame. In the passed data $case not " *
"normalized. It is recommended to use " *
"normalized column names and then refer to them " *
"using normalized names to avoid ambiguity. " *
"In order to normalize column names in " *
"an existing data frame `df` do " *
"`using Unicode; rename!(Unicode.normalize, df)`."))

Member:

The last advice doesn't apply if case == "passed column name was", as the name is already normalized. Likewise, "a normalized column name found in the data frame" is potentially confusing. Given that these issues are tricky and that the error message is super long it would be good to simplify it as much as possible depending on the situation.

Member Author:

I have made three separate error messages (tests cover all options).

@nalimilan
Member

I believe we do. This is the reason I made such verbose error messages, printing both the proper character that should be used and its codepoint so that the user can fix it by copy-paste in their code (of course it would be nicer to be able to give an "easy fix" command, but we have to wait until Julia Base has it to add this - and even then it will be conditional on the Unicode module having this part, as I assume Julia LTS will not include it any time soon).

@bkamins As you prefer, but given there's no real hurry it could be simpler to wait until the PR is merged (hopefully soon) so that we can print the command to use conditional on VERSION. Otherwise we'll probably forget about this. :-)

If you precompute the normalization of the column names, and look up symbols assuming that they have already been normalized by Julia, then it seems like the performance hit could be made quite small.

@stevengj So that would mean doing first a lookup of raw names, and if that fails a lookup on normalized names? Probably acceptable in terms of performance, though that would make the code more complex for a corner case. (We still need to keep the raw names to avoid the cost of normalizing on each lookup and because we want to preserve them exactly as they were.)
If we followed that approach we should also disallow having multiple columns whose normalized names are equal, which is currently allowed by NamedTuple as @bkamins noted ((; Symbol("no\u00EBl") => 1, Symbol("noe\u0308l") => 2) has two entries).
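
The NamedTuple behaviour referred to here can be checked with the standard library alone. This is just a small illustration; the variable names are arbitrary.

```julia
using Unicode

# Two field names that render identically but differ in code points:
# precomposed U+00EB versus "e" followed by combining diaeresis U+0308.
nt = (; Symbol("no\u00EBl") => 1, Symbol("noe\u0308l") => 2)

length(nt)                             # 2
getproperty(nt, Symbol("no\u00EBl"))   # 1
getproperty(nt, Symbol("noe\u0308l"))  # 2

# After normalization both field names collapse to a single string.
normalized = unique(Unicode.normalize.(string.(keys(nt))))
length(normalized)                     # 1
```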

@stevengj

stevengj commented Oct 10, 2021

So that would mean doing first a lookup of raw names, and if that fails a lookup on normalized names?

Yes, but furthermore in getproperty I think you should only look up normalized names, and should assume that the Symbol argument has already been normalized by Julia, so there would be zero overhead.

If we followed that approach we should also disallow having multiple columns whose normalized names are equal

No, you could still allow such names in the non-normalized lookup, but you would omit names with normalization conflicts from the normalized lookup table—hence only a normalized lookup of an ambiguous normalized name (e.g. via getproperty) would be an error.
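
One way to picture the proposal above is a second, precomputed table keyed by normalized names, from which conflicting names are left out. The sketch below is an assumption about how that could look, not code from DataFrames.jl; `normalized_lookup` is a hypothetical name.

```julia
using Unicode

# Hypothetical helper: map normalized column names to column positions,
# omitting any normalized name shared by more than one column.
function normalized_lookup(names::Vector{Symbol})
    table = Dict{Symbol, Int}()
    ambiguous = Set{Symbol}()
    for (i, n) in enumerate(names)
        key = Symbol(Unicode.normalize(string(n)))
        if key in ambiguous
            continue                # already known to be a conflict
        elseif haskey(table, key)
            delete!(table, key)     # second column with this normalized name
            push!(ambiguous, key)
        else
            table[key] = i
        end
    end
    return table
end

# The two spellings of "noël" conflict, so only :x can be found
# through the normalized table.
normalized_lookup([Symbol("no\u00EBl"), Symbol("noe\u0308l"), :x])
```

Under this design the raw-name table would still hold all three columns, so exact lookups keep working; only a lookup through the normalized table for the conflicting name would fail.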

@bkamins
Member Author

bkamins commented Oct 10, 2021

@nalimilan - yes; there is no rush; let us first have a good understanding of what we want.

@stevengj I think that a good starting point is to decide what should happen in this case in Julia Base:

julia> nt = (; Symbol("noe\u0308l") => "noel")
(noël = "noel",)

julia> nt.noël # autocompleted
ERROR: type NamedTuple has no field noël

Essentially - as everywhere in DataFrames.jl (e.g. when designing broadcasting support for DataFrame object) I prefer to provide the functionality that is consistent with Julia Base.

In this case the reason is that in:

julia> using DataFrames

julia> table = (; Symbol("noe\u0308l") => ["noel"])
(noël = ["noel"],)

julia> df = DataFrame(table)
1×1 DataFrame
 Row │ noël
     │ String
─────┼────────
   1 │ noel

julia> table2 = Tables.columntable(df)
(noël = ["noel"],)

the three objects table, df, and table2 should work the same way if queried for their columns in generic code (i.e. in the code that is only aware that "some table" was passed to it).

@stevengj

@bkamins, it's already been decided that Symbol("...") constructs a programmatic symbol that may not be type-able by the user, and is not normalized.

@bkamins
Member Author

bkamins commented Oct 11, 2021

it's already been decided that Symbol("...") constructs a programmatic symbol that may not be type-able by the user, and is not normalized.

Yes. I understand this. What I do not understand is why you think that:

julia> nt = (; Symbol("noe\u0308l") => ["noel"])
(noël = ["noel"],)

julia> nt.noël # autocompleted
ERROR: type NamedTuple has no field noël

is OK (as this is what I understand your comment implies).

While you think that this behavior should be changed:

julia> df = DataFrame(nt)
1×1 DataFrame  
 Row │ noël   
     │ String  
─────┼──────── 
   1 │ noel    

julia> df.noël
ERROR: ArgumentError: column name :noël not found in the data frame; existing most similar names are: :noël

Or putting it in different words why:

you could still allow such names in the non-normalized lookup, but you would omit names with normalization conflicts from the normalized lookup table—hence only a normalized lookup of an ambiguous normalized name (e.g. via getproperty) would be an error.

should apply to DataFrame, but should not apply to NamedTuple?

The reason I am pressing to clarify this (rather than just implementing it) is that there is an intended duality between these two types. A NamedTuple of vectors of equal length and a DataFrame should be usable in exactly the same way, with the exception that NamedTuple is type stable and DataFrame is not. One consequence is that internally in DataFrames.jl we construct such NamedTuples to improve performance of some operations via a function barrier, and having such a discrepancy would be problematic.

@stevengj

stevengj commented Oct 11, 2021

The reason for the difference is that I was under the impression that d = DataFrame("power_µW" => [1,2,3]) — users entering unnormalized strings (or reading them from a file) — was the usual way to create a DataFrame by human input, whereas with a NamedTuple the usual mechanism for human input is to employ the Julia parser ala nt = (power_µW = [1,2,3],), which performs normalization, in which case DataFrame(nt) is also normalized and d.power_μW works as expected. nt = (Symbol("...") = ..., ...) is very atypical, and would normally only be for program-generated named tuples whose field names are not intended to be typed by humans.

i.e. for typical usage a NamedTuple user never needs to think about normalization, whereas for typical DataFrame usage the user must be aware of normalization if Unicode names are employed.

You get quite different behavior with d = DataFrame(:power_µW => [1,2,3]) vs. d = DataFrame("power_µW" => [1,2,3]) … is that what you want?

@bkamins
Member Author

bkamins commented Oct 11, 2021

You get quite different behavior with d = DataFrame(:power_µW => [1,2,3]) vs. d = DataFrame("power_µW" => [1,2,3]) … is that what you want?

For me it is not a question if I want it. I accept it as this is a consequence of Julia design.

What I would normally assume for DataFrames.jl - if someone uses Symbols for accessing column names then most likely these columns are created using Symbol literals. If someone creates columns using strings then I would expect that these columns are also accessed using strings. Mixing the two, i.e. creating a column using a string and then trying to access it using a Symbol, is not typical. The reason is that the most common case of using strings (and that is why we started to accept them) is when you want column names that are not valid identifiers, like DataFrame("column one" => [1]). The OP's case is kind of a corner case in that the column name is Unicode but at the same time a valid identifier. If the OP had e.g. written:

julia> using DataFrames

julia> d = DataFrame("power µW" => [1,2,3])
3×1 DataFrame
 Row │ power µW
     │ Int64
─────┼──────────
   1 │        1
   2 │        2
   3 │        3

julia> d."power µW" # autocompleted, but have to add quotes for this to work
3-element Vector{Int64}:
 1
 2
 3

all would work and no one would notice that there is a problem.

Actually, this is the first time anyone has reported the issue we are discussing here in the several years I have been maintaining this package, so it is super rare. Note that it is even super hard to type in "power_µW" in Julia - it was most likely a programmatically captured string.

That is why @nalimilan and I tend to think that, while we want to handle this somehow, we do not want to change anything in the core of the package that would add overhead (even a small one) in normal processing. @nalimilan - do I get the right impression of your preferences?

@bkamins
Member Author

bkamins commented Nov 14, 2021

@nalimilan - I have updated the PR. The current approach is not to make any breaking changes, but just give more informative error messages.

In the future if we get reports from users that what we do now is problematic we could go back to discuss changes that @stevengj proposed (however, maybe adding special errors will be enough and this way we do not make breaking changes).

@stevengj

stevengj commented Nov 14, 2021

Note that it is even super hard to type in "power_µW" in Julia

Not on a Mac, where option-m is µ (micro). That's actually why we included this in Julia's custom normalization: people kept confusing µ (micro) and μ (mu) in real code.

@bkamins
Member Author

bkamins commented Nov 14, 2021

people kept confusing µ (micro) and μ (mu) in real code.

This is indeed a valid issue (it is interesting how such things influence experience; in my region Mac is quite rare).

However, I would still recommend that we take the path:

  • first improve the error message in 1.3 release and wait for the feedback;
  • based on the feedback decide if we want to make a breaking change.

src/other/index.jl
"error is most likely caused by the Julia parser which " *
"normalizes `Symbol` literals containing such " *
"characters. In order to avoid such problems use only " *
"$refc1 (codepoint: $(UInt32(refc1))) in column names."))

Member:

Now that https://github.com/JuliaLang/julia/pull/42561 has been merged, could we print the command that fixes column names on Julia ≥ 1.8?

Member Author:

OK - I have changed the error message under v"1.8-".
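
For readers who want to apply the same version split in their own code, a hedged sketch follows. It assumes, as discussed in this thread, that the `chartransform` keyword and `Unicode.julia_chartransform` are available from Julia 1.8 on; the helper name `julia_normalize` is hypothetical and is not what the PR adds.

```julia
using Unicode

# Hypothetical helper: use the Julia-specific normalization where it exists
# and fall back to standard normalization on older Julia versions.
function julia_normalize(name::AbstractString)
    @static if VERSION >= v"1.8-"
        return Unicode.normalize(name; chartransform=Unicode.julia_chartransform)
    else
        return Unicode.normalize(name)
    end
end

# Composes "e" + U+0308 into U+00EB on every version; on Julia 1.8 or later
# it additionally maps the micro sign (U+00B5) to Greek mu (U+03BC).
julia_normalize("noe\u0308l")
julia_normalize("power_\u00B5W")
```

The `@static` branch is resolved when the function is defined, so older Julia versions never need to resolve `Unicode.julia_chartransform`.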

@nalimilan
Member

people kept confusing µ (micro) and μ (mu) in real code.

This is indeed a valid issue (it is interesting how such things influence experience; in my region Mac is quite rare).

FWIW, on French PC keyboards, we also have micro available (with Shift+a key close to Return). Apparently some variants of US and Polish keyboards also support it via AltGr+M. That's indeed probably the main reason why people get confused.

Comment on lines 317 to 318
suffix = "use the Unicode.normalize function setting its " *
"chartransform keyword argument to Unicode.julia_chartransform."

Member:

How about a more direct suggestion like this?

Suggested change
- suffix = "use the Unicode.normalize function setting its " *
- "chartransform keyword argument to Unicode.julia_chartransform."
+ suffix = "normalize column names of the data frame by calling " *
+ "`using Unicode; rename!(n -> Unicode.normalize(n, chartransform=Unicode.julia_chartransform), df)`."

Member Author:

This is a half-measure (just like the earlier one below). I will propose a better error message, as we need to tell the user whether the passed name or the name in the data frame was incorrect.

Member Author:

@nalimilan In the end I have reverted the simpler error message. The issue is that the number of potential cases we would have to handle is too large to warrant covering all of them given how rare the case is. I now just write which character should be replaced by which.

Just to give you a flavor of what happens:

"\u0387" after standard normalization via Unicode.normalize becomes "\u00B7" which after Julia-specific normalization (chartransform=Unicode.julia_chartransform) becomes "\u22C5".

I think that the problematic case is rare enough that we can just write what characters do not match and let the user decide how to handle this.
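
The two-step chain described above can be reproduced as follows. This is a sketch that assumes Julia 1.8 or later for the second step; the variable names are illustrative.

```julia
using Unicode

ano_teleia = "\u0387"  # GREEK ANO TELEIA

# Step 1: standard normalization turns it into MIDDLE DOT (U+00B7).
step1 = Unicode.normalize(ano_teleia)

# Step 2: the Julia-specific transform then maps it to DOT OPERATOR (U+22C5).
# Guarded because the chartransform keyword needs Julia 1.8 or later.
@static if VERSION >= v"1.8-"
    step2 = Unicode.normalize(step1; chartransform=Unicode.julia_chartransform)
end
```

So a single unnormalized character can end up as either of two different characters depending on which normalization was applied, which is part of why listing the mismatching characters is simpler than suggesting one fix.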

Member:

OK, I've lost track of all the possible cases to handle...

@bkamins bkamins merged commit e243f72 into main Nov 15, 2021
@bkamins bkamins deleted the bk/unicode_normalization branch November 15, 2021 17:58
@bkamins
Member Author

bkamins commented Nov 15, 2021

Thank you!

Successfully merging this pull request may close these issues.

Strange behaviour with non-ASCII column names
3 participants