Panic with surrogate code points in EscapedUnicode #608

SimonSapin · 2023-08-03T10:28:04Z

Description

apollo-parser 0.5.3 provides impl From<&'_ ast::StringValue> for String to extract the value represented by a string literal by resolving escape sequences. This part of the spec:

https://spec.graphql.org/October2021/#sec-String-Value.Semantics

StringCharacter :: \u EscapedUnicode

Let value be the 16-bit hexadecimal value represented by the sequence of hexadecimal digits within EscapedUnicode.

Return the code point value.

Is implemented here:

apollo-rs/crates/apollo-parser/src/ast/node_ext.rs

Lines 169 to 176 in 126816f

    
           // 1. Let value be the 16-bit hexadecimal value represented 
        
           // by the sequence of hexadecimal digits within EscapedUnicode. 
        
           let value = iter.by_ref().take(4).fold(0, |acc, c| { 
        
               let digit = c.to_digit(16).unwrap(); 
        
               (acc << 4) + digit 
        
           }); 
        
           // 2. Return the code point value. 
        
           char::from_u32(value).unwrap()

The spec is written assuming JavaScript-like strings made of 16-bit code units, but Rust’s char represents a Unicode scalar value which excludes the range of surrogate code units, U+D800 to U+DFFF. (Surrogates are reserved to leave space for UTF-16 to encode code points beyond U+FFFF as a pair of leading of leading and trailing surrogates.) As a result, this .unwrap() call can panic.

The draft spec fixes this (and adds a new bit of syntax):

https://spec.graphql.org/draft/#sec-String-Value

two forms of Unicode escape sequences: one with a fixed-width of 4 hexadecimal digits (for example \u000A) and one with a variable-width most useful for representing a supplementary character such as an Emoji (for example \u{1F4A9}).
[…]
For legacy reasons, a supplementary character may be escaped by two fixed-width unicode escape sequences forming a surrogate pair. For example the input "\uD83D\uDCA9" is a valid StringValue which represents the same Unicode text as "\u{1F4A9}". While this legacy form is allowed, it should be avoided as a variable-width unicode escape sequence is a clearer way to encode such code points.
[…]
These semantics describe how to apply the StringValue grammar to a source text to evaluate a Unicode text. Errors encountered during this evaluation are considered a failure to apply the StringValue grammar to a source and must result in a parsing error.
[…]
StringCharacter :: \u EscapedUnicode

Let value be the hexadecimal value represented by the sequence of HexDigit within EscapedUnicode.

Assert value is a within the Unicode scalar value range (≥ 0x0000 and ≤ 0xD7FF or ≥ 0xE000 and ≤ 0x10FFFF).

Return the Unicode scalar value value.

StringCharacter :: \u HexDigit HexDigit HexDigit HexDigit

Let leadingValue be the hexadecimal value represented by the first sequence of HexDigit.

Let trailingValue be the hexadecimal value represented by the second sequence of HexDigit.

If leadingValue is ≥ 0xD800 and ≤ 0xDBFF (a Leading Surrogate):

Assert trailingValue is ≥ 0xDC00 and ≤ 0xDFFF (a Trailing Surrogate).

Return (leadingValue - 0xD800) × 0x400 + (trailingValue - 0xDC00) + 0x10000.

Otherwise:

Assert leadingValue is within the Unicode scalar value range.

Assert trailingValue is within the Unicode scalar value range.

Return the sequence of the Unicode scalar value leadingValue followed by the Unicode scalar value trailingValue.

In particular:

The conversion should be fallible (TryFrom instead of From)
Conversions errors “are considered a failure to apply the grammar and must result in a parsing error”. In apollo-parser this is SyntaxTree::errors(), but that is completed without actually extracting the unescaped value of string literals. So the lexer is likely the right place to check whether that conversion would succeed.

Steps to reproduce

(Short version, see test case without apollo-compiler below.)

let mut compiler = apollo_compiler::ApolloCompiler::new();
let input = r#"{ field(string: "\uD83E\uDD80") }"#;
compiler.add_document(input, "");
compiler.validate();

Expected result

Either:

Validation includes a syntax error
(Per draft spec) no syntax error, and HIR contains Value::String("🦀") (the UTF-16 decoding of 0xD83E 0xDD80)

Actual result

thread 'database::hir::tests::surrogates' panicked at 'called `Option::unwrap()` on a `None` value', crates/apollo-parser/src/ast/node_ext.rs:176:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Test case

Fixing this per the draft spec should make this test pass:

crates/apollo-parser/src/ast/node_ext.rs

#[cfg(test)]
mod tests {
    use crate::ast;
    use crate::Parser;

    #[test]
    fn test_value_conversions() {
        let input = r#"
            query ($arg: Int!) {
                field(
                    float: 1.234,
                    int: 1234,
                    bool: true,
                    variable: $arg,
                    string: "some \"text\"\n\uD83E\uDD80",
                )
            }
        "#;
        let values = extract_argument_values(input);
        let ast::Value::FloatValue(float) = &values[0] else { panic!() };
        let ast::Value::IntValue(int) = &values[1] else { panic!() };
        let ast::Value::BooleanValue(bool) = &values[2] else { panic!() };
        let ast::Value::Variable(var) = &values[3] else { panic!() };
        let ast::Value::StringValue(string) = &values[4] else { panic!() };
        assert_eq!(f64::try_from(float), Ok(1.234));
        assert_eq!(i32::try_from(int), Ok(1234));
        assert_eq!(bool::try_from(bool), Ok(true));
        assert_eq!(var.name().unwrap().text().as_str(), "arg");
        assert_eq!(String::try_from(string).as_deref(), Ok("some \"text\"\n🦀"));
    }

    fn extract_argument_values(input: &str) -> Vec<ast::Value> {
        let ast = Parser::new(input).parse();
        assert_eq!(ast.errors().as_slice(), []);
        let ast::Definition::OperationDefinition(operation) = ast
            .document()
            .definitions()
            .next()
            .unwrap()
        else { panic!("expected an operation") };
        let ast::Selection::Field(field) = operation
            .selection_set()
            .unwrap()
            .selections()
            .next()
            .unwrap()
        else { panic!("expected a field") };
        field
            .arguments()
            .unwrap()
            .arguments()
            .map(|arg| arg.value().unwrap())
            .collect()
    }
}

The text was updated successfully, but these errors were encountered:

SimonSapin · 2023-09-25T17:32:12Z

The new feature of decoding surragate pairs is split off to a new issue: #657, leaving this one about the panic.

Fixes #608

* fix(parser): emit lexer error on escaped surrogate to avoid panic Fixes #608 * Changelog

SimonSapin added bug Something isn't working apollo-parser triage labels Aug 3, 2023

goto-bus-stop self-assigned this Aug 28, 2023

goto-bus-stop removed the triage label Aug 28, 2023

goto-bus-stop mentioned this issue Aug 31, 2023

fix(parser): fix some escape sequences in strings and indentation in block strings #633

Merged

3 tasks

SimonSapin changed the title ~~Parsing surrogate code points in EscapedUnicode~~ Panic with surrogate code points in EscapedUnicode Sep 25, 2023

SimonSapin mentioned this issue Sep 25, 2023

Decode paired surrogates in unicode escapes #657

Open

SimonSapin added a commit that referenced this issue Sep 25, 2023

fix(parser): emit lexer error on escaped surrogate to avoid panic

c891016

Fixes #608

SimonSapin added a commit that referenced this issue Sep 25, 2023

fix(parser): emit lexer error on escaped surrogate to avoid panic

2b43296

Fixes #608

SimonSapin assigned SimonSapin and unassigned goto-bus-stop Sep 25, 2023

SimonSapin mentioned this issue Sep 25, 2023

fix(parser): emit lexer error on escaped surrogate to avoid panic #658

Merged

SimonSapin added a commit that referenced this issue Sep 28, 2023

fix(parser): emit lexer error on escaped surrogate to avoid panic

ef6d7c6

Fixes #608

SimonSapin closed this as completed in #658 Sep 28, 2023

SimonSapin added a commit that referenced this issue Sep 28, 2023

fix(parser): emit lexer error on escaped surrogate to avoid panic (#658)

7719507

* fix(parser): emit lexer error on escaped surrogate to avoid panic Fixes #608 * Changelog

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic with surrogate code points in EscapedUnicode #608

Panic with surrogate code points in EscapedUnicode #608

SimonSapin commented Aug 3, 2023 •

edited

Loading

SimonSapin commented Sep 25, 2023

Panic with surrogate code points in EscapedUnicode #608

Panic with surrogate code points in EscapedUnicode #608

Comments

SimonSapin commented Aug 3, 2023 • edited Loading

Description

Steps to reproduce

Expected result

Actual result

Test case

SimonSapin commented Sep 25, 2023

SimonSapin commented Aug 3, 2023 •

edited

Loading