Do not accept unicode escape characters in byte strings or as byte #23625

fhahn · 2015-03-22T23:30:12Z

This PR patches the issue mentioned in #23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the scan_byte function returns token::intern("??") for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't scan_byte fail when it encounters an invalid byte?

And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains string_body *, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by byte_body *. If this is valid, I can add this fix to this PR.
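For reference, a minimal sketch of the distinction described above, assuming a current Rust toolchain: byte and byte string literals accept the simple escapes and `\x` escapes, while `\u{...}` is only meaningful in `char` and `str` literals. The function names are hypothetical and exist only for this illustration.

```rust
// Byte literals accept simple escapes and `\xHH` escapes (up to 0xff).
fn valid_byte_escapes() -> Vec<u8> {
    vec![b'a', b'\n', b'\\', b'\x7f', b'\xff']
}

// Byte string literals accept the same escapes as byte literals.
fn byte_string() -> &'static [u8] {
    b"A\x42\0"
}

// Unicode escapes are fine in `char` and `str` literals.
fn unicode_char_value() -> u32 {
    '\u{e9}' as u32
}

fn unicode_str_len() -> usize {
    // U+00E9 is encoded as two bytes in UTF-8.
    "\u{e9}".len()
}

fn main() {
    println!("{:?}", valid_byte_escapes());
    println!("{:?}", byte_string());
    println!("{:#x} {}", unicode_char_value(), unicode_str_len());
    // By contrast, `b'\u{e9}'` and `b"\u{e9}"` are rejected at compile time,
    // so they cannot appear in a program that builds.
}
```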

rust-highfive · 2015-03-22T23:30:24Z

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

alexcrichton · 2015-03-23T16:44:46Z

src/libsyntax/parse/lexer/mod.rs

@@ -799,7 +799,15 @@ impl<'a> StringReader<'a> {
                            'n' | 'r' | 't' | '\\' | '\'' | '"' | '0' => true,
                            'x' => self.scan_byte_escape(delim, !ascii_only),
                            'u' if self.curr_is('{') => {
-                                self.scan_unicode_escape(delim)
+                                if self.scan_unicode_escape(delim) && ascii_only {


If scan_unicode_escape returns false, doesn't this whole branch return true? I think that only happens when the unicode escape is not a valid character (e.g. \u{ffffff}). Could you make sure that still emits an error?
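To make the concern concrete, here is a standalone sketch of the intended control flow. It is not the libsyntax code: `unicode_escape_is_valid`, `check_unicode_escape`, the error vector, and the message strings are all hypothetical stand-ins. The point is that an invalid escape has to report an error and yield `false` whether or not `ascii_only` is set.

```rust
use std::char;

// Hypothetical stand-in for the lexer's validity check: the hex digits must
// name a Unicode scalar value (so "ffffff" and surrogates are rejected).
fn unicode_escape_is_valid(hex: &str) -> bool {
    u32::from_str_radix(hex, 16)
        .ok()
        .and_then(char::from_u32)
        .is_some()
}

// Returns true only if the escape is acceptable in the current literal.
// Every rejected escape pushes one message onto `errors`.
fn check_unicode_escape(hex: &str, ascii_only: bool, errors: &mut Vec<String>) -> bool {
    let valid = unicode_escape_is_valid(hex);
    if !valid {
        errors.push(format!("illegal unicode character escape: {}", hex));
    }
    if valid && ascii_only {
        // Well-formed escape, but not allowed in a byte or byte string.
        errors.push(
            "unicode escape sequences cannot be used as a byte or in a byte string".to_string(),
        );
        false
    } else {
        // Propagate the validity result instead of returning `true` blindly.
        valid
    }
}

fn main() {
    let mut errors = Vec::new();
    for &(hex, ascii_only) in &[("41", false), ("41", true), ("ffffff", true)] {
        let ok = check_unicode_escape(hex, ascii_only, &mut errors);
        println!("\\u{{{}}} ascii_only={} accepted={}", hex, ascii_only, ok);
    }
    println!("{} error(s) reported", errors.len());
}
```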

alexcrichton · 2015-03-23T16:46:01Z

> If this is valid, I can add this fix to this PR.

I think it's even supposed to be byte_string_body, and sure, feel free to include it!

fhahn · 2015-03-24T23:20:33Z

I have updated the pull request and found other escape sequences that lead to ICEs. The problem seems to be that the parser expects the lexer to reject invalid byte literals, but the lexer just returns ?? tokens (e.g. https://github.com/rust-lang/rust/pull/23625/files#diff-d06ad31c547d2d94c8b8770006977767L1333), which leads to the ICEs.

In this PR, I just abort as soon as an invalid literal is encountered. Note that using fatal_span_ would also be possible, but the current version displays error messages for the complete literal before it aborts. Does this sound like a reasonable approach?

Unfortunately, this affects the following tests:

[parse-fail] parse-fail/ascii-only-character-escape.rs
[parse-fail] parse-fail/bad-char-literals.rs
[parse-fail] parse-fail/byte-literals.rs
[parse-fail] parse-fail/byte-string-literals.rs
[parse-fail] parse-fail/lex-bad-char-literals.rs
[parse-fail] parse-fail/lex-bare-cr-string-literal-doc-comment.rs

They fail because they contain invalid string literals, which cause rustc to abort sooner than it did without the changes in this PR.

alexcrichton · 2015-03-25T00:12:16Z

Hm, would it be possible to intern a valid token instead of `??`? If an error has already been emitted then we're guaranteed that it won't finish compiling, so it could perhaps help paper over the later ICEs?

fhahn · 2015-03-26T09:50:15Z

@alexcrichton Yes, scan_byte now returns token::intern("?") instead of token::intern("??"), which is an invalid byte. I also changed some fatal_span_ calls to err_span_ in scan_hex_digits so that parsing can continue.
Now it only uses fatal_span_ when the escape sequence is not terminated (https://github.com/rust-lang/rust/pull/23625/files#diff-d06ad31c547d2d94c8b8770006977767R749), but maybe we should read until the literal is terminated in order to continue parsing in this case?
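The recovery strategy described here (report the error, then hand back a token that is itself valid so later stages do not trip over it) can be sketched roughly as follows. This is a simplified model, not the libsyntax implementation: `ByteToken`, `is_valid_byte_body`, and this `scan_byte` are hypothetical, and the validity check covers only a subset of the real rules.

```rust
// Hypothetical token type holding the text between the quotes of a byte literal.
#[derive(Debug, PartialEq)]
struct ByteToken(String);

// Simplified validity check: one plain ASCII character, a simple escape,
// or a `\xHH` escape with exactly two hex digits.
fn is_valid_byte_body(body: &str) -> bool {
    let chars: Vec<char> = body.chars().collect();
    match chars.as_slice() {
        [c] => c.is_ascii() && *c != '\\' && *c != '\'',
        ['\\', e] => matches!(*e, 'n' | 'r' | 't' | '\\' | '\'' | '"' | '0'),
        ['\\', 'x', hi, lo] => hi.is_ascii_hexdigit() && lo.is_ascii_hexdigit(),
        _ => false,
    }
}

// Records an error for an invalid body, then returns the placeholder `?`,
// which is a valid byte, so the caller can keep going and report more errors.
fn scan_byte(body: &str, errors: &mut Vec<String>) -> ByteToken {
    if is_valid_byte_body(body) {
        ByteToken(body.to_string())
    } else {
        errors.push(format!("invalid byte literal: b'{}'", body));
        ByteToken("?".to_string())
    }
}

fn main() {
    let mut errors = Vec::new();
    for body in &["a", "\\n", "\\xff", "\\u{41}", "\\q"] {
        let token = scan_byte(body, &mut errors);
        println!("{:>8} -> {:?}", body, token);
    }
    // Compilation would still fail because errors were recorded,
    // but no invalid token ever reaches the later stages.
    println!("{} error(s) reported", errors.len());
}
```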

alexcrichton · 2015-03-26T20:37:55Z

src/libsyntax/parse/lexer/mod.rs

+                                    escaped_pos,
+                                    self.last_pos,
+                                    "Unicode escape sequences cannot be used as byte or in \
+                                    byte string."


Error messages in Rust conventionally do not start with an uppercase letter and also don't end with a period

Also, perhaps "as a byte" instead of "as byte"?

alexcrichton · 2015-03-26T20:40:10Z

Looks great to me, thanks @fhahn! Just a small nit about the error message and a tidy failure on travis and I think this is otherwise good to go.

fhahn · 2015-03-26T21:09:32Z

Thanks for the feedback, I've updated the PR.

alexcrichton · 2015-03-26T23:46:23Z

Perhaps the push was forgotten? (looks the same)

fhahn · 2015-03-27T16:47:55Z

Damn, I pushed it to the wrong branch. It should be updated now

alexcrichton · 2015-03-27T16:56:37Z

@bors: r+ afaa3b6

rust-highfive assigned nikomatsakis Mar 22, 2015

alexcrichton reviewed Mar 23, 2015

fhahn force-pushed the issue-23620-ice-unicode-bytestring branch 3 times, most recently from 59e8c62 to fe64595 Compare March 24, 2015 23:03

fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from fe64595 to d43aebc Compare March 26, 2015 08:44

alexcrichton reviewed Mar 26, 2015

fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from d43aebc to 5ced095 Compare March 27, 2015 16:46

Prevent ICEs when parsing invalid escapes, closes rust-lang#23620

afaa3b6

fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from 5ced095 to afaa3b6 Compare March 27, 2015 16:47

bors merged commit afaa3b6 into rust-lang:master Mar 28, 2015

fhahn deleted the issue-23620-ice-unicode-bytestring branch March 30, 2015 11:20

nibags mentioned this pull request Jun 25, 2018

Add unicode escapes, allow non-ASCII identifiers & others improvements zargony/atom-language-rust#136

Open
