-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to detect unicode normalization issues in column names #2904
Conversation
The only problem is that this fix does not resolve the OP issue as still:
as this kind of change is not covered by |
For the mu fix, currently the only way to apply the Julia-specific normalization is to call |
src/other/index.jl
Outdated
@@ -288,9 +288,28 @@ function fuzzymatch(l::Dict{Symbol, Int}, idx::Symbol) | |||
return [s for (d, s) in dist if d <= maxd] | |||
end | |||
|
|||
function normalizedmatch(l::Dict{Symbol, Int}, idx::Symbol) | |||
idxs = Unicode.normalize(string(idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why normalize the passed index? Even if users normalize the data frame's column names, we won't normalize the passed index, so they will get the same error again. We could have two different checks and two error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that both column name and passed index can have a mixture of normalized and unnormalized codeunits. I have improved the error message indicating that it is recommended to normalize both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But it's interesting for users to know whether the problem (i.e. non-normalized name) comes from the data frame, from the index, or both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will add this information to the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the error messages are as follows:
julia> d1 = DataFrame("no\u00EBl" => 1)
1×1 DataFrame
Row │ noël
│ Int64
─────┼───────
1 │ 1
julia> d1[:, "noe\u0308l"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data passed column name was not normalized. It is recommended to use normalized
column names and then refer to them using normalized names to avoid ambiguity. In order to
normalize column names in an existing data frame `df` do `using Unicode;
rename!(Unicode.normalize, df)`.
julia> d2 = DataFrame("noe\u0308l" => 1)
1×1 DataFrame
Row │ noël
│ Int64
─────┼───────
1 │ 1
julia> d2[:, "no\u00EBl"]
ERROR: ArgumentError: column name :noël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data column name found in the data frame was not normalized. It is recommended
to use normalized column names and then refer to them using normalized names to avoid ambiguity.
In order to normalize column names in an existing data frame `df` do `using Unicode;
rename!(Unicode.normalize, df)`.
julia> d3 = DataFrame("noe\u0308\u00EBl" => 1)
1×1 DataFrame
Row │ noëël
│ Int64
─────┼───────
1 │ 1
julia> d3[:, "no\u00EBe\u0308l"]
ERROR: ArgumentError: column name :noëël not found in the data frame. However there is a match of
Unicode normalized passed column name with a normalized column name found in the data frame.
In the passed data both names were not normalized. It is recommended to use normalized column
names and then refer to them using normalized names to avoid ambiguity. In order to normalize
column names in an existing data frame `df` do `using Unicode; rename!(Unicode.normalize, df)`.
Following the suggestion by @stevengj I have hard-coded also the non-standard mappings, so now we have:
and
|
What's annoying is that we don't give any concrete solution when the parser replaced some chars with others during normalization. We should probably wait until @stevengj's PR is merged. |
src/other/index.jl
Outdated
s1 = iterate(a1) | ||
s2 = iterate(a2) | ||
|
||
while !(s1 === nothing || s2 === nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use zip
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - because zip
does not check that its arguments have the same length and running length
on strings is expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have re-implemented it to provide a simpler logic.
src/other/index.jl
Outdated
@@ -288,9 +288,28 @@ function fuzzymatch(l::Dict{Symbol, Int}, idx::Symbol) | |||
return [s for (d, s) in dist if d <= maxd] | |||
end | |||
|
|||
function normalizedmatch(l::Dict{Symbol, Int}, idx::Symbol) | |||
idxs = Unicode.normalize(string(idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But it's interesting for users to know whether the problem (i.e. non-normalized name) comes from the data frame, from the index, or both.
If there is no ambiguity, I'm not quite clear on why you are throwing an error rather than giving the equivalent (modulo normalization) column? |
This is always a hard decision to make. First off - it would be (mildly) breaking. A deeper reason is that:
Of course I am open to discuss this as usual, but this is my thinking about the issue (i.e. that it is safer to be conservative). |
Also performance will be worse when we need to check all column names, so better signal the problem to the user than silently being slower. |
I believe we do. This is the reason I made such a verbose error messages both printing the proper character that should be used and its codeunit so that the user can fix it by copy-paste in their code (of course it would be nicer to be able to give an "easy fix" command, but we have to wait till Julia Base has it to add this - and also then it will be conditional on |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
If you precompute the normalization of the column names, and look up symbols assuming that they have already been normalized by Julia, then it seems like the performance hit could be made quite small. |
src/other/index.jl
Outdated
"data frame. However there is a match of " * | ||
"Unicode normalized passed column name with " * | ||
"a normalized column name found in the " * | ||
"data frame. In the passed data $case not " * | ||
"normalized. It is recommended to use " * | ||
"normalized column names and then refer to them " * | ||
"using normalized names to avoid ambiguity. " * | ||
"In order to normalize column names in " * | ||
"an existing data frame `df` do " * | ||
"`using Unicode; rename!(Unicode.normalize, df)`.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last advice doesn't apply if case == "passed column name was"
, as the name is already normalized. Likewise, "a normalized column name found in the data frame" is potentially confusing. Given that these issues are tricky and that the error message is super long it would be good to simplify it as much as possible depending on the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made three separate error messages (tests cover all options).
@bkamins As you prefer, but given there's no real hurry it could be simpler to wait until the PR is merged (hopefully soon) so that we can print the command to use conditional on
@stevengj So that would mean doing first a lookup of raw names, and if that fails a lookup on normalized names? Probably acceptable in terms of performance, though that would make the code more complex for a corner case. (We still need to keep the raw names to avoid the cost of normalizing on each lookup and because we want to to preserve them exactly as they were.) |
Yes, but furthermore in
No, you could still allow such names in the non-normalized lookup, but you would omit names with normalization conflicts from the normalized lookup table—hence only a normalized lookup of an ambiguous normalized name (e.g. via getproperty) would be an error. |
@nalimilan - yes; there is no rush; let us first have a good understanding of what we want. @stevengj I think that a good starting point is to decide what should happen in this case in Julia Base:
Essentially - as everywhere in DataFrames.jl (e.g. when designing broadcasting support for In this case the reason is that in:
the three objects |
@bkamins, it's already been decided that |
Yes. I understand this. What I do not understand is why you think that:
is OK (as this is what I understand your comment implies). While you think that this behavior should be changed:
Or putting it in different words why:
should apply to The reason I am pressing for clarifying this (and avoid just implementing it) is that there is an intended duality between these two types. |
The reason for the difference is that I was under the impression that i.e. for typical usage a You get quite different behavior with |
For me it is not a question if I want it. I accept it as this is a consequence of Julia design. What I would normally assume for DataFrames.jl - if someone uses
all would work and no one would notice that there is a problem. Actually this is the first time ever someone reports the issue we are discussing here, so it is super rare. I have been maintaining this package for several years and this is the first time someone reports this problem. Note that it is even super hard to type in That is why I think @nalimilan and I tend to think that while we want to handle this somehow, we do not want to change anything in the core of the package that would add overhead (even a small one) in normal processing. @nalimilan - do I get the right impression of your preferences? |
@nalimilan - I have updated the PR. The current approach is not to make any breaking changes, but just give more informative error messages. In the future if we get reports from users that what we do now is problematic we could go back to discuss changes that @stevengj proposed (however, maybe adding special errors will be enough and this way we do not make breaking changes). |
Not on a Mac, where option-m is |
This is indeed a valid issue (it is interesting how such things influence experience; in my region Mac is quite rare). However, I would still recommend that we take the path:
|
src/other/index.jl
Outdated
"error is most likely caused by the Julia parser which " * | ||
"normalizes `Symbol` literals containing such " * | ||
"characters. In order to avoid such problems use only " * | ||
"$refc1 (codepoint: $(UInt32(refc1))) in column names.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that https://github.com/JuliaLang/julia/pull/42561has been merged, could we print the command that fixes column names on Julia ≥ 1.8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I have changed the error message under v"1.8-"
.
FWIW, on French PC keyboards, we also have micro available (with Shift+a key close to Return). Apparently some variants of US and Polish keyboards also support it via AltGr+M. That's indeed probably the main reason why people get confused. |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
src/other/index.jl
Outdated
suffix = "use the Unicode.normalize function setting its " * | ||
"chartransform keyword argument to Unicode.julia_chartransform." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a more direct suggestion like this?
suffix = "use the Unicode.normalize function setting its " * | |
"chartransform keyword argument to Unicode.julia_chartransform." | |
suffix = "normalize column names of the data frame by calling " * | |
"`using Unicode; rename!(n -> Unicode.normalize(n, chartransform=Unicode.julia_chartransform), df)`." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a half-measure solution (just like the one below earlier). I will propose a better error (as we need to signal the user if the passed name or the name in data frame was incorrect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan In the end I have reverted the simpler error message. The issue is that the number of potential cases we would have to handle is too large to warrant covering all of them given how rare the case is. I now just write which character should be replaced by which.
Just to give you a flavor of what happens:
"\u0387"
after standard normalization via Unicode.normalize
becomes "\u00B7"
which after Julia-specific normalization (chartransform=Unicode.julia_chartransform
) becomes "\u22C5".
I think that the problematic case is rare enough that we can just write what characters do not match and let the user decide how to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've lost track of all the possible cases to handle...
Thank you! |
Fixes #2901