Try to detect unicode normalization issues in column names #2904

bkamins · 2021-10-08T22:05:19Z

bkamins · 2021-10-08T22:07:35Z

The only problem is that this fix does not resolve the OP's issue, as we still get:

julia> d = DataFrame("power_µW" => [1,2,3])
3×1 DataFrame   
 Row │ power_µW 
     │ Int64    
─────┼──────────
   1 │        1 
   2 │        2 
   3 │        3 

julia> d.power_µW
ERROR: ArgumentError: column name :power_μW not found in the data frame; existing most similar names are: :power_µW

as this kind of change is not covered by Unicode.normalize.
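To make this concrete, here is a small check using only the Unicode standard library. It is a sketch for illustration (the variable names are not from the PR): default normalization leaves the micro sign alone, and only compatibility normalization maps it to the Greek mu.

```julia
using Unicode

micro = "\u00B5"   # MICRO SIGN, codepoint 181
mu    = "\u03BC"   # GREEK SMALL LETTER MU, codepoint 956

# Default (canonical) normalization does not touch the micro sign,
# so a normalized column name can never match a name containing mu.
canonical = Unicode.normalize(micro)

# Compatibility normalization (NFKC) is what maps micro to mu.
compat = Unicode.normalize(micro, :NFKC)

println(canonical == micro)   # true
println(compat == mu)         # true
```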

stevengj · 2021-10-09T03:14:20Z

For the mu fix, currently the only way to apply the Julia-specific normalization is to call Meta.parse, which is a bit slow and will fail on names that aren’t valid Julia identifiers. Or you can wait for JuliaLang/julia#42561.
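A minimal sketch of the Meta.parse route follows. The helper name `parser_normalized` is hypothetical, and returning `nothing` for names that are not plain identifiers is an assumption made for this example.

```julia
# Hypothetical helper: return the Symbol the Julia parser would produce for
# `name`, or `nothing` when `name` does not parse to a single identifier.
function parser_normalized(name::AbstractString)
    parsed = try
        Meta.parse(name)
    catch err
        # Names such as "power µW" are not valid Julia code.
        err isa Meta.ParseError || rethrow()
        return nothing
    end
    # Inputs like "a+b" parse fine but are expressions, not identifiers.
    return parsed isa Symbol ? parsed : nothing
end

println(parser_normalized("power_\u00B5W") === Symbol("power_\u03BCW"))  # true
println(parser_normalized("power \u00B5W") === nothing)                  # true
```

This shows both limitations mentioned above: the full parser runs for every name, and any name containing a space or operator cannot be handled this way.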

nalimilan · 2021-10-09T09:06:09Z

src/other/index.jl

@@ -288,9 +288,28 @@ function fuzzymatch(l::Dict{Symbol, Int}, idx::Symbol)
        return [s for (d, s) in dist if d <= maxd]
 end

+function normalizedmatch(l::Dict{Symbol, Int}, idx::Symbol)
+    idxs = Unicode.normalize(string(idx))


Why normalize the passed index? Even if users normalize the data frame's column names, we won't normalize the passed index, so they will get the same error again. We could have two different checks and two error messages.

The issue is that both column name and passed index can have a mixture of normalized and unnormalized codeunits. I have improved the error message indicating that it is recommended to normalize both.

Right. But it's interesting for users to know whether the problem (i.e. non-normalized name) comes from the data frame, from the index, or both.

OK - I will add this information to the message.

Now the error messages are as follows:

julia> d1 = DataFrame("no\u00EBl" => 1)
1×1 DataFrame
 Row │ noël
     │ Int64
─────┼───────
   1 │     1

julia> d1[:, "noe\u0308l"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of Unicode normalized passed column name with a normalized column name found in the data frame. In the passed data passed column name was not normalized. It is recommended to use normalized column names and then refer to them using normalized names to avoid ambiguity. In order to normalize column names in an existing data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.

julia> d2 = DataFrame("noe\u0308l" => 1)
1×1 DataFrame
 Row │ noël
     │ Int64
─────┼───────
   1 │     1

julia> d2[:, "no\u00EBl"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of Unicode normalized passed column name with a normalized column name found in the data frame. In the passed data column name found in the data frame was not normalized. It is recommended to use normalized column names and then refer to them using normalized names to avoid ambiguity. In order to normalize column names in an existing data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.

julia> d3 = DataFrame("noe\u0308\u00EBl" => 1)
1×1 DataFrame
 Row │ noëël
     │ Int64
─────┼───────
   1 │     1

julia> d3[:, "no\u00EBe\u0308l"]
ERROR: ArgumentError: column name :noëël not found in the data frame. However there is a match of Unicode normalized passed column name with a normalized column name found in the data frame. In the passed data both names were not normalized. It is recommended to use normalized column names and then refer to them using normalized names to avoid ambiguity. In order to normalize column names in an existing data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.
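The three cases can be told apart by comparing each name with its own normalized form. The function below is a hypothetical sketch of that classification, not the code in src/other/index.jl; the function name and the returned symbols are invented for this example.

```julia
using Unicode

# Hypothetical helper: report which side of a failed lookup was not normalized.
function normalization_case(colname::AbstractString, passed::AbstractString)
    colname == passed && return :exact_match
    ncol = Unicode.normalize(colname)
    npassed = Unicode.normalize(passed)
    ncol == npassed || return :no_match
    if ncol == colname
        # The column name is already normalized, so the passed name is not.
        return :passed_not_normalized
    elseif npassed == passed
        return :column_not_normalized
    else
        return :both_not_normalized
    end
end

println(normalization_case("no\u00EBl", "noe\u0308l"))               # passed_not_normalized
println(normalization_case("noe\u0308l", "no\u00EBl"))               # column_not_normalized
println(normalization_case("noe\u0308\u00EBl", "no\u00EBe\u0308l"))  # both_not_normalized
```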

bkamins · 2021-10-09T15:03:32Z

Following the suggestion by @stevengj, I have also hard-coded the non-standard mappings, so now we have:

julia> d = DataFrame("power_µW" => [1,2,3])
3×1 DataFrame   
 Row │ power_µW 
     │ Int64    
─────┼──────────
   1 │        1 
   2 │        2 
   3 │        3 

julia> d.power_µW
ERROR: ArgumentError: column name :power_μW not found in the data frame. However there is a similar column name in the
data frame where character μ is used is instead of µ. Note that these characters are displayed very similarly but are
different as their normalized codepoints are 956 and 181 respectively. The error is most likely caused by the Julia
parser which normalizes `Symbol` literals containing such characters. In order to avoid such problems use only μ
(codepoint: 956) character when naming columns.

and

julia> d1 = DataFrame("no\u00EBl" => 1)
1×1 DataFrame
 Row │ noël  
     │ Int64 
─────┼───────
   1 │     1

julia> d1[:, "noe\u0308l"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of Unicode normalized
passed column name with a normalized column name found in the data frame. It is recommended to use normalized column
names and then refer to them using normalized names to avoid ambiguity. In order to normalize column names in an existing
data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.

nalimilan · 2021-10-09T15:54:40Z

What's annoying is that we don't give any concrete solution when the parser replaced some chars with others during normalization. We should probably wait until @stevengj's PR is merged.

nalimilan · 2021-10-09T15:47:36Z

src/other/index.jl

+    s1 = iterate(a1)
+    s2 = iterate(a2)
+
+    while !(s1 === nothing || s2 === nothing)


Can't you use zip here?

No - because zip does not check that its arguments have the same length and running length on strings is expensive.

I have re-implemented it to provide a simpler logic.
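The point about zip can be illustrated as follows. This is a sketch with a hypothetical function name (`same_chars`), and it only mirrors the loop shape from the diff fragment above; the actual function in the PR may do something different with the characters it visits.

```julia
# Walk two strings in lockstep with `iterate`, so a length mismatch is
# detected without calling `length` on either string.
function same_chars(a::AbstractString, b::AbstractString)
    s1 = iterate(a)
    s2 = iterate(b)
    while !(s1 === nothing || s2 === nothing)
        (c1, st1), (c2, st2) = s1, s2
        c1 == c2 || return false
        s1 = iterate(a, st1)
        s2 = iterate(b, st2)
    end
    # Equal only if both iterators are exhausted at the same time.
    return s1 === nothing && s2 === nothing
end

# `zip` stops at the shorter input, so the extra character goes unnoticed.
zip_says_equal = all(x == y for (x, y) in zip("abc", "ab"))

println(zip_says_equal)            # true
println(same_chars("abc", "ab"))   # false
```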

stevengj · 2021-10-09T18:19:03Z

If there is no ambiguity, I'm not quite clear on why you are throwing an error rather than giving the equivalent (modulo normalization) column?

bkamins · 2021-10-09T19:50:13Z

I'm not quite clear on why you are throwing an error rather than giving the equivalent (modulo normalization) column?

This is always a hard decision to make. First off - it would be (mildly) breaking.

A deeper reason is that:

- in programmatic mode, I would prefer to stick to the rule that we are "strict" (i.e. we do verbatim what the user asked for) and otherwise throw an error;
- in interactive mode, we could indeed add a small amount of convenience, but that is why I have written verbose error information, so that:
  - the user would know exactly what the problem is and how to fix it;
  - the user would not be surprised when moving from interactive work to production.

Of course I am open to discuss this as usual, but this is my thinking about the issue (i.e. that it is safer to be conservative).

nalimilan · 2021-10-09T19:54:00Z

Also performance will be worse when we need to check all column names, so better signal the problem to the user than silently being slower.

bkamins · 2021-10-09T19:54:02Z

What's annoying is that we don't give any concrete solution when the parser replaced some chars with others during normalization.

I believe we do. This is the reason I made the error messages so verbose, printing both the proper character that should be used and its codepoint, so that the user can fix it by copy-paste in their code (of course it would be nicer to be able to give an "easy fix" command, but we have to wait until Julia Base has it to add this, and even then it will be conditional on the Unicode module having this part, as I assume the Julia LTS will not include it any time soon).

stevengj · 2021-10-10T00:24:04Z

Also performance will be worse when we need to check all column names, so better signal the problem to the user than silently being slower.

If you precompute the normalization of the column names, and look up symbols assuming that they have already been normalized by Julia, then it seems like the performance hit could be made quite small.

nalimilan · 2021-10-10T14:31:24Z

src/other/index.jl

+                                "data frame. However there is a match of " *
+                                "Unicode normalized passed column name with " *
+                                "a normalized column name found in the " *
+                                "data frame. In the passed data $case not " *
+                                "normalized. It is recommended to use " *
+                                "normalized column names and then refer to them " *
+                                "using normalized names to avoid ambiguity. " *
+                                "In order to normalize column names in " *
+                                "an existing data frame `df` do " *
+                                "`using Unicode; rename!(Unicode.normalize, df)`."))


The last advice doesn't apply if case == "passed column name was", as the name is already normalized. Likewise, "a normalized column name found in the data frame" is potentially confusing. Given that these issues are tricky and that the error message is super long it would be good to simplify it as much as possible depending on the situation.

I have made three separate error messages (tests cover all options).

nalimilan · 2021-10-10T14:49:35Z

I believe we do. This is the reason I made the error messages so verbose, printing both the proper character that should be used and its codepoint, so that the user can fix it by copy-paste in their code (of course it would be nicer to be able to give an "easy fix" command, but we have to wait until Julia Base has it to add this, and even then it will be conditional on the Unicode module having this part, as I assume the Julia LTS will not include it any time soon).

@bkamins As you prefer, but given there's no real hurry it could be simpler to wait until the PR is merged (hopefully soon) so that we can print the command to use conditional on VERSION. Otherwise we'll probably forget about this. :-)

If you precompute the normalization of the column names, and look up symbols assuming that they have already been normalized by Julia, then it seems like the performance hit could be made quite small.

@stevengj So that would mean doing first a lookup of raw names, and if that fails a lookup on normalized names? Probably acceptable in terms of performance, though that would make the code more complex for a corner case. (We still need to keep the raw names to avoid the cost of normalizing on each lookup and because we want to preserve them exactly as they were.)
If we followed that approach we should also disallow having multiple columns whose normalized names are equal, which is currently allowed by NamedTuple as @bkamins noted ((; Symbol("no\u00EBl") => 1, Symbol("noe\u0308l") => 2) has two entries).

stevengj · 2021-10-10T19:37:52Z

So that would mean doing first a lookup of raw names, and if that fails a lookup on normalized names?

Yes, but furthermore in getproperty I think you should only look up normalized names, and should assume that the Symbol argument has already been normalized by Julia, so there would be zero overhead.

If we followed that approach we should also disallow having multiple columns whose normalized names are equal

No, you could still allow such names in the non-normalized lookup, but you would omit names with normalization conflicts from the normalized lookup table—hence only a normalized lookup of an ambiguous normalized name (e.g. via getproperty) would be an error.
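A rough sketch of such a two-table index follows. Everything here is hypothetical (the `TwoLevelIndex` type, its fields, and `lookup` are invented for illustration and are not part of DataFrames.jl), and it uses plain `Unicode.normalize` rather than the Julia-specific normalization. It follows the raw-first, normalized-fallback variant; a `getproperty` path as proposed above would consult only the `normalized` table.

```julia
using Unicode

# Hypothetical two-table index: exact names plus a precomputed normalized table.
struct TwoLevelIndex
    raw::Dict{Symbol,Int}          # names exactly as stored
    normalized::Dict{Symbol,Int}   # normalized names, conflicts omitted
end

function TwoLevelIndex(names::Vector{Symbol})
    raw = Dict{Symbol,Int}(n => i for (i, n) in enumerate(names))
    normalized = Dict{Symbol,Int}()
    conflicts = Set{Symbol}()
    for (i, n) in enumerate(names)
        key = Symbol(Unicode.normalize(string(n)))
        if haskey(normalized, key) || key in conflicts
            # Two columns normalize to the same name: drop the key entirely.
            delete!(normalized, key)
            push!(conflicts, key)
        else
            normalized[key] = i
        end
    end
    return TwoLevelIndex(raw, normalized)
end

# Exact lookup first; fall back to the normalized table only on a miss.
function lookup(idx::TwoLevelIndex, name::Symbol)
    i = get(idx.raw, name, nothing)
    i === nothing || return i
    key = Symbol(Unicode.normalize(string(name)))
    return get(idx.normalized, key, nothing)
end

idx = TwoLevelIndex([:x, Symbol("noe\u0308l")])
println(lookup(idx, Symbol("no\u00EBl")))   # 2

ambiguous = TwoLevelIndex([Symbol("no\u00EBl"), Symbol("noe\u0308l")])
println(isempty(ambiguous.normalized))      # true
```

The normalization cost is paid once at construction, and an exact hit never touches the second table.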

bkamins · 2021-10-10T20:11:39Z

@nalimilan - yes; there is no rush; let us first have a good understanding of what we want.

@stevengj I think that a good starting point is to decide what should happen in this case in Julia Base:

julia> nt = (; Symbol("noe\u0308l") => "noel")
(noël = "noel",)

julia> nt.noël # autocompleted
ERROR: type NamedTuple has no field noël

Essentially - as everywhere in DataFrames.jl (e.g. when designing broadcasting support for DataFrame object) I prefer to provide the functionality that is consistent with Julia Base.

In this case the reason is that in:

julia> using DataFrames

julia> table = (; Symbol("noe\u0308l") => ["noel"])
(noël = ["noel"],)

julia> df = DataFrame(table)
1×1 DataFrame
 Row │ noël
     │ String
─────┼────────
   1 │ noel

julia> table2 = Tables.columntable(df)
(noël = ["noel"],)

the three objects table, df, and table2 should work the same way if queried for their columns in generic code (i.e. in the code that is only aware that "some table" was passed to it).

stevengj · 2021-10-11T00:47:43Z

@bkamins, it's already been decided that Symbol("...") constructs a programmatic symbol that may not be type-able by the user, and is not normalized.
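The difference between the two ways of obtaining a symbol can be checked directly. This is a small illustrative snippet; the variable names are not from the thread.

```julia
composed   = Symbol("no\u00EBl")    # ë as a single codepoint
decomposed = Symbol("noe\u0308l")   # e followed by a combining diaeresis

# Symbol("...") keeps the string exactly as given, so these are distinct names.
println(composed == decomposed)                 # false

# The parser normalizes identifiers, so the decomposed spelling
# typed as source code yields the composed symbol.
println(Meta.parse("noe\u0308l") == composed)   # true
```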

bkamins · 2021-10-11T06:32:01Z

it's already been decided that Symbol("...") constructs a programmatic symbol that may not be type-able by the user, and is not normalized.

Yes. I understand this. What I do not understand is why you think that:

julia> nt = (; Symbol("noe\u0308l") => ["noel"])
(noël = ["noel"],)

julia> nt.noël # autocompleted
ERROR: type NamedTuple has no field noël

is OK (as this is what I understand your comment implies).

While you think that this behavior should be changed:

julia> df = DataFrame(nt)
1×1 DataFrame  
 Row │ noël   
     │ String  
─────┼──────── 
   1 │ noel    

julia> df.noël
ERROR: ArgumentError: column name :noël not found in the data frame; existing most similar names are: :noël

Or putting it in different words why:

you could still allow such names in the non-normalized lookup, but you would omit names with normalization conflicts from the normalized lookup table—hence only a normalized lookup of an ambiguous normalized name (e.g. via getproperty) would be an error.

should apply to DataFrame, but should not apply to NamedTuple?

The reason I am pressing for clarifying this (and avoid just implementing it) is that there is an intended duality between these two types. NamedTuple of vectors of equal length and DataFrame should be useable in exactly the same way with the exception that NamedTuple is type stable and DataFrame is not type stable. This e.g. has a consequence that internally in DataFrames.jl we construct such NamedTuples to improve performance of some operations via a function barrier and having such discrepancy would be problematic.

stevengj · 2021-10-11T13:01:02Z

The reason for the difference is that I was under the impression that d = DataFrame("power_µW" => [1,2,3]) — users entering unnormalized strings (or reading them from a file) — was the usual way to create a DataFrame by human input, whereas with a NamedTuple the usual mechanism for human input is to employ the Julia parser ala nt = (power_µW = [1,2,3],), which performs normalization, in which case DataFrame(nt) is also normalized and d.power_μW works as expected. nt = (Symbol("...") = ..., ...) is very atypical, and would normally only be for program-generated named tuples whose field names are not intended to be typed by humans.

i.e. for typical usage a NamedTuple user never needs to think about normalization, whereas for typical DataFrame usage the user must be aware of normalization if Unicode names are employed.

You get quite different behavior with d = DataFrame(:power_µW => [1,2,3]) vs. d = DataFrame("power_µW" => [1,2,3]) … is that what you want?

bkamins · 2021-10-11T14:32:09Z

You get quite different behavior with d = DataFrame(:power_µW => [1,2,3]) vs. d = DataFrame("power_µW" => [1,2,3]) … is that what you want?

For me it is not a question if I want it. I accept it as this is a consequence of Julia design.

What I would normally assume for DataFrames.jl is this: if someone uses Symbols for accessing column names, then most likely these columns were created using Symbol literals. If someone creates columns using strings, then I would expect that these columns are also accessed using strings. Creating a column using a string and then trying to access it using a Symbol is not typical. The reason is that the most common case for using strings (and that is why we started to accept them) is when you want column names that are not valid identifiers, like DataFrame("column one" => [1]). The OP's case is something of a corner case, in that the column name is Unicode but at the same time a valid identifier. If the OP had, e.g., written:

julia> using DataFrames

julia> d = DataFrame("power µW" => [1,2,3])
3×1 DataFrame
 Row │ power µW
     │ Int64
─────┼──────────
   1 │        1
   2 │        2
   3 │        3

julia> d."power µW" # autocompleted, but have to add quotes for this to work
3-element Vector{Int64}:
 1
 2
 3

all would work and no one would notice that there is a problem.

Actually, in the several years I have been maintaining this package, this is the first time anyone has reported the issue we are discussing here, so it is super rare. Note that it is even super hard to type in "power_µW" in Julia - it was most likely a programmatically captured string.

That is why I think @nalimilan and I tend to agree that, while we want to handle this somehow, we do not want to change anything in the core of the package that would add overhead (even a small one) in normal processing. @nalimilan - do I get the right impression of your preferences?

bkamins · 2021-11-14T00:18:06Z

@nalimilan - I have updated the PR. The current approach is not to make any breaking changes, but just give more informative error messages.

In the future if we get reports from users that what we do now is problematic we could go back to discuss changes that @stevengj proposed (however, maybe adding special errors will be enough and this way we do not make breaking changes).

stevengj · 2021-11-14T02:28:39Z

Note that it is even super hard to type in "power_µW" in Julia

Not on a Mac, where option-m is µ (micro). That's actually why we included this in Julia's custom normalization: people kept confusing µ (micro) and μ (mu) in real code.

bkamins · 2021-11-14T09:04:49Z

people kept confusing µ (micro) and μ (mu) in real code.

This is indeed a valid issue (it is interesting how such things influence experience; in my region Mac is quite rare).

However, I would still recommend that we take the path:

1. first improve the error message in the 1.3 release and wait for the feedback;
2. based on the feedback, decide if we want to make a breaking change.

nalimilan · 2021-11-14T11:42:24Z

src/other/index.jl

+                                "error is most likely caused by the Julia parser which " *
+                                "normalizes `Symbol` literals containing such " *
+                                "characters. In order to avoid such problems use only " *
+                                "$refc1 (codepoint: $(UInt32(refc1))) in column names."))


Now that https://github.com/JuliaLang/julia/pull/42561 has been merged, could we print the command that fixes column names on Julia ≥ 1.8?

OK - I have changed the error message under v"1.8-".
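A version-conditional helper along these lines might look as follows. This is a sketch only: the name `parser_like_normalize` is hypothetical, and the fallback to plain normalization on older Julia versions is an assumption made for this example.

```julia
using Unicode

# Hypothetical helper: use the Julia-specific character mapping where it
# exists (Julia 1.8 and later), otherwise fall back to plain normalization.
@static if VERSION >= v"1.8-"
    parser_like_normalize(s::AbstractString) =
        Unicode.normalize(s, chartransform=Unicode.julia_chartransform)
else
    parser_like_normalize(s::AbstractString) = Unicode.normalize(s)
end

println(parser_like_normalize("noe\u0308l") == "no\u00EBl")   # true
```

On Julia 1.8 or later this also maps the micro sign in "power_µW" to the Greek mu, which is the case plain `Unicode.normalize` cannot fix.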

nalimilan · 2021-11-14T14:07:04Z

people kept confusing µ (micro) and μ (mu) in real code.

This is indeed a valid issue (it is interesting how such things influence experience; in my region Mac is quite rare).

FWIW, on French PC keyboards, we also have micro available (with Shift+a key close to Return). Apparently some variants of US and Polish keyboards also support it via AltGr+M. That's indeed probably the main reason why people get confused.

nalimilan · 2021-11-14T19:00:05Z

src/other/index.jl

+                suffix = "use the Unicode.normalize function setting its " *
+                         "chartransform keyword argument to Unicode.julia_chartransform."


How about a more direct suggestion like this?

Suggested change

-                suffix = "use the Unicode.normalize function setting its " *
-                         "chartransform keyword argument to Unicode.julia_chartransform."
+                suffix = "normalize column names of the data frame by calling " *
+                         "`using Unicode; rename!(n -> Unicode.normalize(n, chartransform=Unicode.julia_chartransform), df)`."

This is a half-measure solution (just like the earlier one). I will propose a better error message, as we need to signal to the user whether the passed name or the name in the data frame was incorrect.

@nalimilan In the end I have reverted the simpler error message. The issue is that the number of potential cases we would have to handle is too large to warrant covering all of them given how rare the case is. I now just write which character should be replaced by which.

Just to give you a flavor of what happens:

"\u0387" after standard normalization via Unicode.normalize becomes "\u00B7" which after Julia-specific normalization (chartransform=Unicode.julia_chartransform) becomes "\u22C5".

I think that the problematic case is rare enough that we can just write what characters do not match and let the user decide how to handle this.
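The chain described above can be reproduced step by step. This is an illustrative sketch; the second step needs Julia 1.8 or later, where `Unicode.julia_chartransform` is available, so it is guarded by a version check.

```julia
using Unicode

ano_teleia = "\u0387"                    # GREEK ANO TELEIA

# Step 1: standard normalization replaces it with MIDDLE DOT.
step1 = Unicode.normalize(ano_teleia)
println(step1 == "\u00B7")               # true

# Step 2: the Julia-specific mapping turns MIDDLE DOT into DOT OPERATOR.
@static if VERSION >= v"1.8-"
    step2 = Unicode.normalize(step1, chartransform=Unicode.julia_chartransform)
    println(step2 == "\u22C5")           # true
end
```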

OK, I've lost track of all the possible cases to handle...

bkamins · 2021-11-15T17:58:30Z

Thank you!

try to detect unicode normalization issues in column names

aff4ecc

bkamins requested a review from nalimilan October 8, 2021 22:05

bkamins added the feature label Oct 8, 2021

bkamins added this to the 1.3 milestone Oct 8, 2021

bkamins mentioned this pull request Oct 8, 2021

Strange behaviour with non-ASCII column names #2901

Closed

nalimilan reviewed Oct 9, 2021

bkamins added 3 commits October 9, 2021 13:05

improve error message

8065298

try to hanlde all corner cases informatively

a6c6a0f

fix error message

c968081

nalimilan reviewed Oct 9, 2021

bkamins and others added 2 commits October 9, 2021 22:13

Apply suggestions from code review

73ea391

Co-authored-by: Milan Bouchet-Valat <[email protected]>

improve error messages

24a77b6

nalimilan reviewed Oct 10, 2021

Merge branch 'main' into bk/unicode_normalization

4136e68

bkamins commented Nov 13, 2021

bkamins added 2 commits November 13, 2021 17:25

Apply suggestions from code review

802537f

updates after code review

0061a7c

correct alignment

2c713a0

nalimilan reviewed Nov 14, 2021

bkamins and others added 2 commits November 14, 2021 18:44

Apply suggestions from code review

44db64b

Co-authored-by: Milan Bouchet-Valat <[email protected]>

fix error message

53bc8f3

nalimilan reviewed Nov 14, 2021

bkamins added 2 commits November 14, 2021 23:39

simplify eror message

33f3d4b

add dot to the sentence

98bf94a

nalimilan approved these changes Nov 15, 2021

bkamins merged commit e243f72 into main Nov 15, 2021

bkamins deleted the bk/unicode_normalization branch November 15, 2021 17:58
