Better identification of invalid Unicode characters #9127

Closed
ivarne opened this issue Nov 23, 2014 · 6 comments
Labels
parser (Language parsing and surface syntax), unicode (Related to unicode characters and encodings)

Comments

@ivarne
Member

ivarne commented Nov 23, 2014

I just happened to see http://stackoverflow.com/questions/27092738/defining-tables-strange-bug, where the problem is that the user typed a minus sign instead of a hyphen-minus. The error has improved tremendously since 0.2, but there is still room for improvement. Do we have any way of providing the Unicode name, HTML entity, or other information to help the user see that they have an unexpected code point?

The error will then be something like

ERROR: syntax: invalid character "−" (U+2212 - "minus")
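
As a rough illustration of how such a message could be assembled, here is a minimal Julia sketch. `describe_invalid_char` is a hypothetical helper, not an existing Base function, and the character name is passed in by hand because the sketch assumes no name database is available.

```julia
# Hypothetical helper, not part of Base: formats a message in the style proposed above.
# The `name` argument is supplied by the caller; no Unicode name lookup happens here.
function describe_invalid_char(c::Char, name::AbstractString)
    # Code point as uppercase hexadecimal, padded to at least four digits.
    hex = uppercase(string(codepoint(c), base = 16, pad = 4))
    return "invalid character \"$c\" (U+$hex - \"$name\")"
end

println(describe_invalid_char('\u2212', "minus"))
```

The code point part can be derived from the character alone; the name part is the piece that would need extra data from somewhere.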

PS: Any reason why I shouldn't suggest normalizing minus to hyphen-minus? Being pedantic seems somewhat useless here.
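
To make the normalization idea concrete, here is a small sketch that rewrites source text before it reaches the parser. `normalize_minus` is a hypothetical name, and doing this as a blanket text substitution is an assumption for illustration only; it would also rewrite minus signs inside string literals and comments, which a real parser-level change would need to avoid.

```julia
# Hypothetical preprocessing step (an illustration, not how the parser works):
# replace U+2212 MINUS SIGN with U+002D HYPHEN-MINUS everywhere in the text.
normalize_minus(src::AbstractString) = replace(src, '\u2212' => '-')

src = "x = 3 \u2212 1"
println(normalize_minus(src))
```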

@tonyhffong

relevant JuliaStrings/utf8proc#11

@jiahao jiahao added the unicode Related to unicode characters and encodings label Dec 1, 2014
jiahao added a commit to jiahao/jin that referenced this issue Feb 20, 2015
Uses JuliaParser.jl to detect code points that are denoted as unary or
binary operators in Unicode data but are not parseable as such in Julia.

Ref: JuliaLang/julia#9127
@jiahao
Member

jiahao commented Feb 20, 2015

I wrote a notebook some time ago to programmatically inspect the parsability of operators and only just got round to cleaning it up.

The full results can be seen at JIN 9127.

Here are the summary results, broken down by Unicode category and type of expression, in v0.3.6. (Note that not all of these make sense to parse.)

Unary operators (do expressions of the form "+a" parse?)

Category: Unary - operators that are only unary
Valid: 3 Invalid: 4

Category: Vary - operators that can be unary or binary depending on context
Valid: 1 Invalid: 3

Infix binary operators (do expressions of the form "a+a" parse?)

Category: Binary
Valid: 153 Invalid: 48

Category: Relation - includes arrows
Valid: 528 Invalid: 129

Category: Vary - operators that can be unary or binary depending on context
Valid: 3 Invalid: 1

Category: Special - characters not covered by other classes
Valid: 9 Invalid: 19

Prefix binary operators (do expressions of the form "+(a,a)" parse?)

Category: Binary
Valid: 152 Invalid: 49

Category: Relation - includes arrows
Valid: 527 Invalid: 130

Category: Vary - operators that can be unary or binary depending on context
Valid: 3 Invalid: 1

Category: Special - characters not covered by other classes
Valid: 7 Invalid: 21

n-ary operators (do expressions of the form "+(a, a, a)" parse?)

Category: Glyph_Part - piece of large operator
Valid: 2 Invalid: 25

Category: Large - n-ary or large operator, often takes limits
Valid: 47 Invalid: 19

@jiahao
Member

jiahao commented Feb 20, 2015

If there's any interest, it shouldn't be too difficult to put something like printcharinfo into Base:

julia> printcharinfo('⤴')
U+02934 (⤴) ARROW POINTING RIGHTWARDS THEN CURVING UPWARDS

printcharinfo can also be much more detailed, similar to sites like this.

@ivarne
Member Author

ivarne commented Feb 21, 2015

Can't we just extend show(io::IO, c::Char) (or whatever writemime variant the REPL uses) to print more information?

When we get . overloading, a natural API might be to define

'⤴'.name == "ARROW POINTING RIGHTWARDS THEN CURVING UPWARDS"
'⤴'.block == "Supplemental Arrows-B"
'⤴'.category == :Sm
'⤴'.combine == false
# and so on.
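
For illustration, property-style access of this kind can be sketched with `Base.getproperty`. The `CharInfo` wrapper and its property names are hypothetical, introduced so that `getproperty` is not overloaded on `Char` itself. The sketch also assumes that the unexported internal `Base.Unicode.category_abbrev` is available, and it covers only the code point and category, since no name or block data is assumed to exist.

```julia
# Hypothetical wrapper type, used so that getproperty is not overloaded on Char.
struct CharInfo
    c::Char
end

function Base.getproperty(ci::CharInfo, s::Symbol)
    c = getfield(ci, :c)
    if s === :codepoint
        return codepoint(c)
    elseif s === :category
        # Assumption: Base.Unicode.category_abbrev is an unexported internal
        # that returns the two-letter general category as a String.
        return Symbol(Base.Unicode.category_abbrev(c))
    else
        # Fall back to ordinary field access (e.g. `ci.c`).
        return getfield(ci, s)
    end
end

info = CharInfo('\u2212')
println(info.codepoint, " ", info.category)
```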

@JeffBezanson JeffBezanson added the parser Language parsing and surface syntax label Jan 28, 2016
@mbauman
Member

mbauman commented Apr 9, 2021

We do now have that information, but it's not in the parser where this particular error is thrown.

julia> −1
ERROR: syntax: invalid character "−" near column 1
Stacktrace:
 [1] top-level scope
   @ none:1

julia> '−'
'−': Unicode U+2212 (category Sm: Symbol, math)

help?> −
"−" can be typed by \minus<tab>

@inkydragon
Member

Starting with version 1.7, −1 === -1.

$ hyperfine --show-output -L VER 1.0,1.6,1.7,1.8,1.9,1.10 --runs 1 -i 'julia +{VER} -E "−1 === -1"'
Benchmark 1: julia +1.0 -E "−1 === -1"
ERROR: syntax: invalid character "−"
  Time (abs ≡):        351.9 ms               [User: 512.4 ms, System: 612.3 ms]

  Warning: Ignoring non-zero exit code.

Benchmark 2: julia +1.6 -E "−1 === -1"
ERROR: syntax: invalid character "−" near column 1
Stacktrace:
 [1] top-level scope
   @ none:1
  Time (abs ≡):        173.5 ms               [User: 280.3 ms, System: 476.6 ms]

  Warning: Ignoring non-zero exit code.

Benchmark 3: julia +1.7 -E "−1 === -1"
true
  Time (abs ≡):        103.5 ms               [User: 76.4 ms, System: 106.7 ms]

Benchmark 4: julia +1.8 -E "−1 === -1"
true
  Time (abs ≡):        113.5 ms               [User: 256.1 ms, System: 628.7 ms]

Benchmark 5: julia +1.9 -E "−1 === -1"
true
  Time (abs ≡):        116.3 ms               [User: 125.7 ms, System: 268.5 ms]

Benchmark 6: julia +1.10 -E "−1 === -1"
true
  Time (abs ≡):        107.0 ms               [User: 90.2 ms, System: 178.4 ms]

@vtjnash vtjnash closed this as completed Feb 3, 2024