Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'\xc0\x80' should either error or make an overlong Char #25072

Closed
StefanKarpinski opened this issue Dec 14, 2017 · 2 comments · Fixed by #44989
Closed

'\xc0\x80' should either error or make an overlong Char #25072

StefanKarpinski opened this issue Dec 14, 2017 · 2 comments · Fixed by #44989
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request strings "Strings!"

Comments

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Dec 14, 2017

Currently we have this:

julia> '\xc0\x80'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

julia> '\xc0\x80' == '\0'
true

julia> String(b"\xc0\x80")[1] == '\0'
false

In other words, you can write an overlong Char literal as bytes and you get the standard character representing that code point. This should either be an error or, preferably IMO, now that we actually have Char values for overlong encodings of characters, it should produce that Char value (i.e. the one I've produced via byte string literal syntax). This should technically be done before the 1.0 feature freeze, but I really doubt anyone is relying on this, so I think we can sneak it in whenever.

@StefanKarpinski StefanKarpinski added the strings "Strings!" label Dec 14, 2017
@StefanKarpinski StefanKarpinski changed the title '\xc0\x80' should either be an error or an overlong Char '\xc0\x80' should either error or make an overlong Char Dec 14, 2017
StefanKarpinski added a commit that referenced this issue Dec 14, 2017
also print invalid UTF-8 characters correctly

related: #25072
StefanKarpinski added a commit that referenced this issue Dec 14, 2017
also print invalid UTF-8 characters correctly

related: #25072
StefanKarpinski added a commit that referenced this issue Dec 14, 2017
also print invalid UTF-8 characters correctly

related: #25072
StefanKarpinski added a commit that referenced this issue Dec 14, 2017
also print invalid UTF-8 characters correctly

related: #25072
@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Jan 11, 2019

Jeff, what's the best approach to fix this? The complication as I understand it is that we use femtolisp chars to represent char literals when parsing and lowering and femtolisp can't represent invalid chars like Julia can. Behaviorally, I think that if "<...>" creates a one-character string, even if that character is invalid, we should have '<...>' == "<...>"[1]. This violates that since we have this discrepancy:

julia> '\xc0\x80'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

julia> "\xc0\x80"[1]
'\xc0\x80': [overlong] ASCII/Unicode U+0000 (category Cc: Other, control)

What I'm proposing is that '\xc0\x80' should produce the same overlong Char as "\xc0\x80"[1]. Similarly, I think that these invalid char literals should be allowed to work:

julia> '\x80'
ERROR: syntax: invalid character literal

julia> '\xff'
ERROR: syntax: malformed expression

(I'm not sure why these give different errors, they probably shouldn't.) This change would bring them into line with the corresponding invalid string syntax:

julia> "\x80"[1]
'\x80': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> "\xff"[1]
'\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

I believe such a change would bring char and string literals into full syntactic agreement.

@JeffBezanson
Copy link
Member

I think the most direct way to do it is to replace the call (string.char str 0) near the end of the code for char literals in parse-atom with a call that behaves like julia getindex. At that point, str is exactly what you'd get if you wrote double quotes instead of single quotes. Flisp aref can be used to get bytes from a string.

@StefanKarpinski StefanKarpinski added the help wanted Indicates that a maintainer wants help on an issue or pull request label Jul 29, 2019
simeonschaub added a commit that referenced this issue Mar 27, 2022
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried
to replicate the behavior of `getindex` on a string in Julia here, but
then I noticed that we probably also want to support cases like
`'\xff\xff'`, which would give two characters in a Julia string. Now
this supports any combination of characters as long as they are between
1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this
makes sense because otherwise we wouldn't be able to reparse every valid
char in Julia, but I could see Python users being confused about why
Julia only supports strings up to length 4... 😄

fixes #25072
simeonschaub added a commit that referenced this issue Mar 27, 2022
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried
to replicate the behavior of `getindex` on a string in Julia here, but
then I noticed that we probably also want to support cases like
`'\xff\xff'`, which would give two characters in a Julia string. Now
this supports any combination of characters as long as they are between
1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this
makes sense because otherwise we wouldn't be able to reparse every valid
char in Julia, but I could see Python users being confused about why
Julia only supports strings up to length 4... 😄

fixes #25072
simeonschaub added a commit that referenced this issue Apr 15, 2022
Alternative to #44765. This disallows character literals that can not be
created from iterating a UTF-8 string.

fixes #25072
JeffBezanson pushed a commit that referenced this issue May 19, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
JeffBezanson pushed a commit that referenced this issue May 24, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
KristofferC pushed a commit that referenced this issue May 25, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request strings "Strings!"
Projects
None yet
2 participants