Better identification of invalid Unicode characters #9127
Comments
relevant JuliaStrings/utf8proc#11 |
Uses JuliaParser.jl to detect code points that are denoted as unary or binary operators in Unicode data but are not parseable as such in Julia. Ref: JuliaLang/julia#9127
I wrote a notebook some time ago to programmatically inspect the parsability of operators and only just got round to cleaning it up. The full results can be seen at JIN 9127. Here are the summary results, broken down by Unicode category and type of expression, in v0.3.6. (Note that not all of these make sense to parse.)
Unary operators (do expressions of the form "+a" parse?)
Infix binary operators (do expressions of the form "a+a" parse?)
Prefix binary operators (do expressions of the form "+(a,a)" parse?)
n-ary operators (do expressions of the form "+(a, a, a)" parse?)
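The four expression forms above can be checked mechanically. The following is only a minimal sketch, assuming a Julia 1.x session where `Meta.parse` is available; the helper names `parses` and `operator_forms` are hypothetical and are not taken from the notebook, which used JuliaParser.jl rather than the built-in parser.

```julia
# Hypothetical helper: true if `src` parses to a complete expression.
function parses(src::AbstractString)
    try
        ex = Meta.parse(src)
        # Some failures come back as a marker Expr instead of an exception.
        return !(ex isa Expr && ex.head in (:error, :incomplete))
    catch
        # Syntax errors are otherwise thrown as exceptions.
        return false
    end
end

# Try a candidate operator in each of the four forms listed above.
function operator_forms(op::AbstractString)
    return (
        unary = parses("$(op)a"),
        infix_binary = parses("a$(op)a"),
        prefix_binary = parses("$(op)(a,a)"),
        nary = parses("$(op)(a, a, a)"),
    )
end

operator_forms("+")
```

Running `operator_forms` over every code point in a given Unicode category would give a per-category summary of the kind described above. The results depend on the Julia version, since the set of accepted operator characters has changed over time.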
|
If there's any interest, there shouldn't be too much difficulty to put into
julia> printcharinfo('⤴')
U+02934 (⤴) ARROW POINTING RIGHTWARDS THEN CURVING UPWARDS
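A rough approximation of this kind of output is possible with what ships in Base. This is a sketch only: `charinfo` is a hypothetical name, not the `printcharinfo` function shown above, and it leaves out the character name because the sketch does not assume any name database is available.

```julia
# Hypothetical stand-in for the output above, without the character name.
function charinfo(c::Char)
    # Five hex digits, to match the "U+02934" style shown above.
    hex = uppercase(string(codepoint(c), base = 16, pad = 5))
    # Two-letter general category abbreviation, e.g. "Sm" for math symbols.
    abbrev = Base.Unicode.category_abbrev(c)
    return "U+$hex ($c) category $abbrev"
end

charinfo('⤴')
```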
|
Can't we just extend When we get
'⤴'.name == "ARROW POINTING RIGHTWARDS THEN CURVING UPWARDS"
'⤴'.block == "Supplemental Arrows-B"
'⤴'.category == :Sm
'⤴'.combine == False
# and so on. |
We do now have that information, but it's not in the parser where this particular error is thrown.
julia> −1
ERROR: syntax: invalid character "−" near column 1
Stacktrace:
[1] top-level scope
@ none:1
julia> '−'
'−': Unicode U+2212 (category Sm: Symbol, math)
help?> −
"−" can be typed by \minus<tab> |
Starting with version 1.7,
|
I just happened to see http://stackoverflow.com/questions/27092738/defining-tables-strange-bug, where the problem is that the user used minus instead of hyphen-minus. The error has improved tremendously since 0.2, but there is still room for improvement. Do we have any way of providing the Unicode name, HTML entity, or other information to help the user see that he has an unexpected code point? The error would then be something like
ERROR: syntax: invalid character "−" (U+2212 - "minus")
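To make the proposed format concrete, here is a small sketch of how such a message could be assembled. Everything in it is illustrative: `CHAR_NAMES` and `invalid_char_message` are hypothetical names, and the hand-written table stands in for a real source of Unicode names, which this sketch does not assume exists in the parser.

```julia
# Hypothetical name table; a real implementation would need a proper
# source of Unicode character names.
const CHAR_NAMES = Dict('−' => "minus", '–' => "en dash")

# Build a message in the format proposed above.
function invalid_char_message(c::Char)
    hex = uppercase(string(codepoint(c), base = 16, pad = 4))
    name = get(CHAR_NAMES, c, nothing)
    # Fall back to the bare code point when no name is known.
    detail = name === nothing ? "U+$hex" : "U+$hex - \"$name\""
    return "syntax: invalid character \"$c\" ($detail)"
end

println(invalid_char_message('−'))
```

Even without a name, the code point alone would let the user tell the character apart from the visually similar hyphen-minus.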
PS: Any reason why I shouldn't suggest normalizing minus to hyphen-minus? Being pedantic seems somewhat useless here.
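For illustration, the normalization asked about here could look like the following sketch. It is a plain text substitution applied to source before parsing, not how any parser is known to handle it, and `normalize_minus` is a hypothetical name.

```julia
# Hypothetical preprocessing step: map MINUS SIGN (U+2212) to
# HYPHEN-MINUS (U+002D) before the text reaches the parser.
normalize_minus(src::AbstractString) = replace(src, '\u2212' => '-')

normalize_minus("a \u2212 1")
```

A blanket substitution like this would also rewrite minus signs inside string literals and comments, so a real implementation would more likely treat the two characters as equivalent at the tokenizer level.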