'\xc0\x80' should either error or make an overlong Char #25072
also print invalid UTF-8 characters correctly related: #25072
Jeff, what's the best approach to fix this? The complication as I understand it is that we use femtolisp chars to represent char literals when parsing and lowering, and femtolisp can't represent invalid chars like Julia can.

Behaviorally, I think that if

    julia> '\xc0\x80'
    '\0': ASCII/Unicode U+0000 (category Cc: Other, control)

    julia> "\xc0\x80"[1]
    '\xc0\x80': [overlong] ASCII/Unicode U+0000 (category Cc: Other, control)

What I'm proposing is that

    julia> '\x80'
    ERROR: syntax: invalid character literal

    julia> '\xff'
    ERROR: syntax: malformed expression

(I'm not sure why these give different errors; they probably shouldn't.) This change would bring them into line with the corresponding invalid string syntax:

    julia> "\x80"[1]
    '\x80': Malformed UTF-8 (category Ma: Malformed, bad data)

    julia> "\xff"[1]
    '\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

I believe such a change would bring char and string literals into full syntactic agreement.
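For anyone who wants to check the string-indexing side of this comparison programmatically, here is a minimal sketch. It assumes Julia 1.1 or later (for `ncodeunits` on a `Char`) and uses the unexported helpers `Base.isoverlong` and `Base.ismalformed`. The variable names are only illustrative.

```julia
# Indexing a String yields a Char that preserves the original bytes,
# even when those bytes are not valid UTF-8.
overlong  = "\xc0\x80"[1]   # two-byte (overlong) encoding of U+0000
malformed = "\x80"[1]       # lone continuation byte

@assert ncodeunits(overlong) == 2        # both bytes are kept
@assert Base.isoverlong(overlong)        # flagged as an overlong encoding
@assert !isvalid(overlong)               # overlong chars are not valid
@assert overlong != '\0'                 # distinct from the standard NUL char
@assert Base.ismalformed(malformed)      # a bare continuation byte is malformed

# Converting back to a String reproduces the original bytes.
@assert codeunits(string(overlong)) == [0xc0, 0x80]
```

The comparison `overlong != '\0'` is the crux of the issue: the value obtained through string indexing is not the same `Char` as the standard NUL character.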
I think the most direct way to do it is to replace the call
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried to replicate the behavior of `getindex` on a Julia string here, but then I noticed that we probably also want to support cases like `'\xff\xff'`, which would yield two characters in a Julia string. The parser now accepts any combination of characters as long as the literal is between 1 and 4 bytes long, so even literals like `'abcd'` are allowed. I think this makes sense because otherwise we wouldn't be able to reparse every valid char in Julia, but I could see Python users being confused about why Julia only supports strings up to length 4... 😄

Fixes #25072
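To illustrate the "1 to 4 bytes" rule, here is a rough sketch of how a byte sequence can be packed into a `Char`. The helper name `char_from_bytes` is hypothetical. That the parser change packs bytes in exactly this way is an assumption; the sketch relies only on `Char` being a 32-bit primitive type whose leading bytes hold the encoded data.

```julia
# Hypothetical helper (not part of Base or of the actual patch):
# pack 1 to 4 raw bytes into the high-order bytes of a Char.
function char_from_bytes(bytes::AbstractVector{UInt8})
    1 <= length(bytes) <= 4 ||
        throw(ArgumentError("a Char holds between 1 and 4 bytes"))
    u = UInt32(0)
    for (i, b) in enumerate(bytes)
        # The first byte goes into the most significant position.
        u |= UInt32(b) << (32 - 8i)
    end
    return reinterpret(Char, u)
end

# Matches what string indexing produces for the overlong NUL encoding.
@assert char_from_bytes([0xc0, 0x80]) == "\xc0\x80"[1]

# An ordinary ASCII character round-trips as expected.
@assert char_from_bytes([0x61]) == 'a'

# Four arbitrary bytes fit as well, which is the `'abcd'` case.
@assert reinterpret(UInt32, char_from_bytes(b"abcd")) == 0x61626364
```

Under this assumed layout, any sequence of up to four bytes maps to some `Char` value, which is why a literal such as `'abcd'` could be accepted while anything longer could not.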
Currently we have this:
In other words, you can write an overlong `Char` literal as bytes and you get the standard character representing that code point. This should either be an error or, preferably IMO, now that we actually have `Char` values for overlong encodings of characters, it should produce that `Char` value (i.e. the one I've produced via byte string literal syntax). This should technically be done before the 1.0 feature freeze, but I really doubt anyone is relying on this, so I think we can sneak it in whenever.