'\xc0\x80' should either error or make an overlong Char #25072

StefanKarpinski · 2017-12-14T03:32:59Z

Currently we have this:

julia> '\xc0\x80'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

julia> '\xc0\x80' == '\0'
true

julia> String(b"\xc0\x80")[1] == '\0'
false

In other words, a Char literal written as an overlong byte sequence yields the standard character for that code point rather than the overlong one. This should either be an error or, preferably IMO, now that we actually have Char values for overlong encodings of characters, it should produce that overlong Char value (i.e. the one I produced above by indexing a string built from a byte string literal). Technically this should be done before the 1.0 feature freeze, but I really doubt anyone is relying on the current behavior, so I think we can sneak it in whenever.
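The difference between the two values can be made concrete by looking at the raw bits of each Char. The following is a minimal sketch, assuming Julia 0.7 or later, where a Char stores its UTF-8 bytes left-aligned in 32 bits. `rawbits` is a hypothetical helper name, `Base.isoverlong` is an unexported Base function, and the overlong value is obtained by string indexing so that the result does not depend on how the char literal itself is parsed.

```julia
# Obtain the overlong encoding of U+0000 by indexing a string,
# which sidesteps the char-literal parsing discussed in this issue.
overlong = "\xc0\x80"[1]
nul = '\0'

# Hypothetical helper: a Char stores its UTF-8 bytes left-aligned
# in a 32-bit value, so reinterpreting exposes them directly.
rawbits(c::Char) = reinterpret(UInt32, c)

println(string(rawbits(nul), base = 16, pad = 8))       # 00000000
println(string(rawbits(overlong), base = 16, pad = 8))  # c0800000

println(nul == overlong)             # false: different bits
println(Base.isoverlong(overlong))   # true
println(isvalid(overlong))           # false
```

Both values refer to the same code point, but they compare as unequal because Char equality looks at the stored bytes.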

StefanKarpinski · 2019-01-11T15:50:50Z

Jeff, what's the best approach to fix this? The complication, as I understand it, is that we use femtolisp chars to represent char literals when parsing and lowering, and femtolisp can't represent invalid chars the way Julia can. Behaviorally, I think that if "<...>" creates a one-character string, even if that character is invalid, we should have '<...>' == "<...>"[1]. The current behavior violates that, as this discrepancy shows:

julia> '\xc0\x80'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

julia> "\xc0\x80"[1]
'\xc0\x80': [overlong] ASCII/Unicode U+0000 (category Cc: Other, control)

What I'm proposing is that '\xc0\x80' should produce the same overlong Char as "\xc0\x80"[1]. Similarly, I think that these invalid char literals should be allowed to work:

julia> '\x80'
ERROR: syntax: invalid character literal

julia> '\xff'
ERROR: syntax: malformed expression

(I'm not sure why these give different errors; they probably shouldn't.) This change would bring them into line with the corresponding invalid string syntax:

julia> "\x80"[1]
'\x80': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> "\xff"[1]
'\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

I believe such a change would bring char and string literals into full syntactic agreement.
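The proposed invariant, '<...>' == "<...>"[1], can be checked mechanically. The sketch below assumes that `Meta.parse` with `raise = false` returns a `Char` or `String` for a successfully parsed literal and an error expression otherwise; `literals_agree` is a hypothetical helper, not part of Base.

```julia
# Hypothetical helper: does the char literal '<body>' parse to the same
# value as the first character of the string literal "<body>"?
# `body` is Julia source text, so pass escapes through a raw string.
function literals_agree(body::AbstractString)
    from_char = Meta.parse("'" * body * "'"; raise = false)
    from_string = Meta.parse("\"" * body * "\""; raise = false)
    return from_char isa Char && from_string isa String &&
           length(from_string) == 1 && from_char == from_string[1]
end

println(literals_agree("a"))            # true
println(literals_agree(raw"\u2200"))    # true
# The results below depend on the Julia version in use.
println(literals_agree(raw"\xc0\x80"))
println(literals_agree(raw"\x80"))
println(literals_agree(raw"\xff"))
```

On a version showing the behavior reported in this issue, the last three calls would be expected to return false; full syntactic agreement would make them return true.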

JeffBezanson · 2019-01-11T17:16:34Z

I think the most direct way to do it is to replace the call (string.char str 0) near the end of the code for char literals in parse-atom with a call that behaves like Julia's getindex. At that point, str is exactly what you'd get if you wrote double quotes instead of single quotes. The flisp function aref can be used to get bytes from a string.
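What "behaves like Julia's getindex" means for the first character can be modeled in Julia itself. This is only an illustrative sketch of the byte-packing logic, not the femtolisp code from the parser; `first_char_from_bytes` is a hypothetical name, and the model covers only the character that starts at the first byte.

```julia
# Hypothetical model of decoding the first character of a byte sequence
# the way string indexing does: pack the lead byte and any continuation
# bytes that follow it, left-aligned, into a 32-bit Char.
function first_char_from_bytes(bytes::AbstractVector{UInt8})
    isempty(bytes) && throw(ArgumentError("empty input"))
    b = bytes[1]
    u = UInt32(b) << 24
    # Lead bytes 0xc0..0xf7 announce 1 to 3 continuation bytes;
    # any other first byte forms a one-byte character on its own.
    ncont = 0xc0 <= b <= 0xf7 ? leading_ones(b) - 1 : 0
    for k in 1:ncont
        k + 1 > length(bytes) && break   # input ended early
        c = bytes[k + 1]
        c & 0xc0 == 0x80 || break        # not a continuation byte
        u |= UInt32(c) << (24 - 8k)
    end
    return reinterpret(Char, u)
end

for s in ["a", "\xc0\x80", "\x80", "\xff"]
    println(first_char_from_bytes(codeunits(s)) == s[1])  # true each time
end
```

The model never rejects its input: overlong, truncated, and malformed sequences all become Char values, which is what lets a char literal match the corresponding string literal.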

Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried to replicate the behavior of `getindex` on a string in Julia here, but then I noticed that we probably also want to support cases like `'\xff\xff'`, which would give two characters in a Julia string. Now this supports any combination of characters as long as they are between 1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this makes sense because otherwise we wouldn't be able to reparse every valid char in Julia, but I could see Python users being confused about why Julia only supports strings up to length 4... 😄

Fixes #25072.

Alternative to #44765. This disallows character literals that cannot be created by iterating a UTF-8 string. Fixes #25072.

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765. Fixes #25072.
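The two approaches differ on Char values that hold up to four bytes but can never come out of iterating a string. A minimal sketch of that criterion, assuming Julia 1.0 or later; `producible_by_iteration` is a hypothetical helper, and the values are built with `reinterpret` so the example does not rely on char-literal parsing.

```julia
# Hypothetical helper: can `c` come out of iterating a String?
# Writing the Char out as bytes and reading them back must yield
# exactly one character, equal to `c`.
function producible_by_iteration(c::Char)
    s = string(c)
    return length(s) == 1 && s[1] == c
end

overlong = reinterpret(Char, 0xc0800000)  # bytes c0 80
lone_ff  = reinterpret(Char, 0xff000000)  # byte  ff
two_ff   = reinterpret(Char, 0xffff0000)  # bytes ff ff
abcd     = reinterpret(Char, 0x61626364)  # bytes 61 62 63 64

println(producible_by_iteration(overlong))  # true
println(producible_by_iteration(lone_ff))   # true
println(producible_by_iteration(two_ff))    # false
println(producible_by_iteration(abcd))      # false
println(length("\xff\xff"))                 # 2
```

Under the "1 to 4 bytes" rule described for #44765, all four values would have a literal; under the alternative, only the first two would.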

StefanKarpinski added the strings label Dec 14, 2017

StefanKarpinski changed the title ~~'\xc0\x80' should either be an error or an overlong Char~~ '\xc0\x80' should either error or make an overlong Char Dec 14, 2017

StefanKarpinski added a commit that referenced this issue Dec 14, 2017

invalid UTF-8 in string literals: allow it and print it

682f51f

also print invalid UTF-8 characters correctly; related: #25072

StefanKarpinski added a commit that referenced this issue Dec 14, 2017

invalid UTF-8 in string literals: allow it and print it

a632530

also print invalid UTF-8 characters correctly; related: #25072

StefanKarpinski mentioned this issue Dec 14, 2017

allow invalid UTF-8 string literals, deprecate b"..." #25073

Merged

StefanKarpinski added a commit that referenced this issue Dec 14, 2017

invalid UTF-8 in string literals: allow it and print it

ccc3352

also print invalid UTF-8 characters correctly; related: #25072

StefanKarpinski added a commit that referenced this issue Dec 14, 2017

invalid UTF-8 in string literals: allow it and print it

0ab2c19

also print invalid UTF-8 characters correctly; related: #25072

JeffBezanson mentioned this issue Jan 20, 2018

Handling of overlong character literals #25645

Closed

StefanKarpinski added the help wanted label Jul 29, 2019

simeonschaub mentioned this issue Mar 16, 2022

literals for invalid characters? #44646

Closed

simeonschaub mentioned this issue Mar 27, 2022

properly support malformed char literals #44765

Closed

simeonschaub added a commit that referenced this issue Apr 15, 2022

support malformed characters, the second

71683df

Alternative to #44765. This disallows character literals that cannot be created by iterating a UTF-8 string. Fixes #25072.

simeonschaub mentioned this issue Apr 15, 2022

support malformed char literals, the second #44989

Merged

JeffBezanson pushed a commit that referenced this issue May 19, 2022

support malformed chars in char literal syntax

16e37c2

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765. Fixes #25072.

JeffBezanson pushed a commit that referenced this issue May 24, 2022

support malformed chars in char literal syntax

a1ce793

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765. Fixes #25072.

KristofferC closed this as completed in #44989 May 25, 2022

KristofferC pushed a commit that referenced this issue May 25, 2022

support malformed chars in char literal syntax (#44989)

991190f

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765. Fixes #25072.
