Do not accept unicode escape characters in byte strings or as byte #23625
(rust_highfive has picked a reviewer for you, use r? to override)
@@ -799,7 +799,15 @@ impl<'a> StringReader<'a> {
             'n' | 'r' | 't' | '\\' | '\'' | '"' | '0' => true,
             'x' => self.scan_byte_escape(delim, !ascii_only),
             'u' if self.curr_is('{') => {
-                self.scan_unicode_escape(delim)
+                if self.scan_unicode_escape(delim) && ascii_only {
If `scan_unicode_escape` returns `false`, doesn't this whole branch return `true`? I think that only happens when the unicode escape is not a valid character (e.g. `\u{ffffff}`). Could you make sure that still emits an error?
I think it's even supposed to be |
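The concern raised in this review can be illustrated with a small standalone sketch. This is not the `StringReader` code from the PR: `EscapeError` and `check_escape` are hypothetical names, and only the simple escapes and `\u{...}` are handled. The point it shows is that validity is checked first, so an invalid escape such as `\u{ffffff}` is reported whether or not `ascii_only` is set, and a valid unicode escape is rejected only in byte context.

```rust
/// Hypothetical error type for this sketch (not part of rustc).
#[derive(Debug, PartialEq)]
enum EscapeError {
    /// Malformed `\u{...}` or a value that is not a valid `char`.
    InvalidUnicodeEscape,
    /// A well-formed `\u{...}` used inside a byte or byte string literal.
    UnicodeEscapeInByte,
    /// Any escape this sketch does not model (including `\x`).
    UnknownEscape,
}

/// Validates one escape sequence. `rest` is the text right after the
/// backslash; `ascii_only` is true for byte and byte string literals.
fn check_escape(rest: &str, ascii_only: bool) -> Result<(), EscapeError> {
    match rest.chars().next() {
        Some('n' | 'r' | 't' | '\\' | '\'' | '"' | '0') => Ok(()),
        Some('u') => {
            // Extract the hex digits between `{` and `}`.
            let hex = rest[1..]
                .strip_prefix('{')
                .and_then(|s| s.split_once('}'))
                .map(|(hex, _)| hex)
                .ok_or(EscapeError::InvalidUnicodeEscape)?;
            let is_hex = !hex.is_empty() && hex.chars().all(|c| c.is_ascii_hexdigit());
            let valid = is_hex
                && u32::from_str_radix(hex, 16)
                    .ok()
                    .and_then(char::from_u32)
                    .is_some();
            if !valid {
                // Reported regardless of `ascii_only`.
                Err(EscapeError::InvalidUnicodeEscape)
            } else if ascii_only {
                Err(EscapeError::UnicodeEscapeInByte)
            } else {
                Ok(())
            }
        }
        _ => Err(EscapeError::UnknownEscape),
    }
}
```

With this ordering, neither failure mode can fall through to a success result, which is the behavior the reviewer asks to preserve.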
59e8c62 to fe64595
I have updated the pull request and found other escape sequences that lead to ICEs. The problem seems to be that the parser expects the lexer to reject invalid byte literals, but the lexer just returns In this PR, I just abort as soon as an invalid literal is encountered. Note that using Unfortunately, this affects the following tests:
They fail because they contain invalid string literals, which cause |
Hm would it be possible to intern a valid token instead of |
fe64595 to d43aebc
@alexcrichton Yes, |
escaped_pos,
self.last_pos,
"Unicode escape sequences cannot be used as byte or in \
byte string."
Error messages in Rust conventionally do not start with an uppercase letter and also don't end with a period
Also, perhaps "as a byte" instead of "as byte"?
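The two style points above (no leading uppercase letter, no trailing period) can be expressed as a small check. This is only an illustrative sketch: `follows_message_style` is a hypothetical helper, not something rustc or its tidy script provides, and the reworded message in the comment is one possible way of applying both review suggestions.

```rust
/// Hypothetical helper: returns true if a diagnostic message follows the
/// convention described in the review (does not start with an uppercase
/// letter and does not end with a period).
fn follows_message_style(msg: &str) -> bool {
    let starts_ok = msg.chars().next().map_or(false, |c| !c.is_uppercase());
    let ends_ok = !msg.ends_with('.');
    starts_ok && ends_ok
}

// The message from the diff above fails the check:
//   "Unicode escape sequences cannot be used as byte or in byte string."
// One possible rewording that passes, also using "as a byte":
//   "unicode escape sequences cannot be used as a byte or in a byte string"
```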
Looks great to me, thanks @fhahn! Just a small nit about the error message and a tidy failure on travis and I think this is otherwise good to go.
Thanks for the feedback, I've updated the PR.
Perhaps the push was forgotten? (looks the same)
d43aebc to 5ced095
5ced095 to afaa3b6
Damn, I pushed it to the wrong branch. It should be updated now.
…tring, r=alexcrichton closes rust-lang#23620 This PR patches the issue mentioned in rust-lang#23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the `scan_byte` function returns ` token::intern(\"??\") ` for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't `scan_byte` fail when it encounters an invalid byte? And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains `string_body *`, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by `byte_body *`. If this is valid, I can add this fix to this PR.
closes #23620
This PR patches the issue mentioned in #23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the `scan_byte` function returns `token::intern("??")` for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't `scan_byte` fail when it encounters an invalid byte?
And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains `string_body *`, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by `byte_body *`. If this is valid, I can add this fix to this PR.
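The question about `scan_byte` can be made concrete with a sketch of the alternative: returning an error for an invalid byte literal instead of a placeholder value that a later stage has to trip over. The function below is a simplified, hypothetical stand-in (it shares only its name with the lexer method and takes the text between the quotes of `b'...'`), and the error strings are made up for illustration.

```rust
/// Hypothetical sketch: parse the body of a byte literal (the text between
/// the quotes of `b'...'`) and fail explicitly on invalid input.
fn scan_byte(body: &str) -> Result<u8, String> {
    let mut chars = body.chars();
    let first = chars.next().ok_or_else(|| "empty byte literal".to_string())?;
    let value = if first == '\\' {
        match chars.next() {
            Some('n') => b'\n',
            Some('r') => b'\r',
            Some('t') => b'\t',
            Some('\\') => b'\\',
            Some('\'') => b'\'',
            Some('"') => b'"',
            Some('0') => 0,
            Some('x') => {
                // Exactly two hex digits must follow `\x`.
                let hex: String = chars.by_ref().take(2).collect();
                if hex.len() != 2 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
                    return Err(format!("invalid hex escape: \\x{}", hex));
                }
                u8::from_str_radix(&hex, 16).map_err(|e| e.to_string())?
            }
            // Unicode escapes are rejected outright in byte context.
            Some('u') => {
                return Err(
                    "unicode escape sequences cannot be used as a byte or in a byte string"
                        .to_string(),
                )
            }
            Some(c) => return Err(format!("unknown byte escape: {}", c)),
            None => return Err("unterminated escape".to_string()),
        }
    } else if first.is_ascii() {
        first as u8
    } else {
        return Err("non-ASCII character in byte literal".to_string());
    };
    // Anything left over means the literal held more than one byte.
    if chars.next().is_some() {
        return Err("byte literal must contain exactly one byte".to_string());
    }
    Ok(value)
}
```

Because every invalid input produces an `Err`, a caller cannot accidentally continue with a placeholder token, which is the failure mode described above.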