Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic with surrogate code points in EscapedUnicode #608

Closed
SimonSapin opened this issue Aug 3, 2023 · 1 comment · Fixed by #658
Closed

Panic with surrogate code points in EscapedUnicode #608

SimonSapin opened this issue Aug 3, 2023 · 1 comment · Fixed by #658
Assignees
Labels
apollo-parser bug Something isn't working

Comments

@SimonSapin
Copy link
Contributor

SimonSapin commented Aug 3, 2023

Description

apollo-parser 0.5.3 provides impl From<&'_ ast::StringValue> for String to extract the value represented by a string literal by resolving escape sequences. This part of the spec:

https://spec.graphql.org/October2021/#sec-String-Value.Semantics

StringCharacter :: \u EscapedUnicode

  1. Let value be the 16-bit hexadecimal value represented by the sequence of hexadecimal digits within EscapedUnicode.
  2. Return the code point value.

Is implemented here:

// 1. Let value be the 16-bit hexadecimal value represented
// by the sequence of hexadecimal digits within EscapedUnicode.
let value = iter.by_ref().take(4).fold(0, |acc, c| {
let digit = c.to_digit(16).unwrap();
(acc << 4) + digit
});
// 2. Return the code point value.
char::from_u32(value).unwrap()

The spec is written assuming JavaScript-like strings made of 16-bit code units, but Rust’s char represents a Unicode scalar value which excludes the range of surrogate code units, U+D800 to U+DFFF. (Surrogates are reserved to leave space for UTF-16 to encode code points beyond U+FFFF as a pair of leading of leading and trailing surrogates.) As a result, this .unwrap() call can panic.

The draft spec fixes this (and adds a new bit of syntax):

https://spec.graphql.org/draft/#sec-String-Value

two forms of Unicode escape sequences: one with a fixed-width of 4 hexadecimal digits (for example \u000A) and one with a variable-width most useful for representing a supplementary character such as an Emoji (for example \u{1F4A9}).
[…]
For legacy reasons, a supplementary character may be escaped by two fixed-width unicode escape sequences forming a surrogate pair. For example the input "\uD83D\uDCA9" is a valid StringValue which represents the same Unicode text as "\u{1F4A9}". While this legacy form is allowed, it should be avoided as a variable-width unicode escape sequence is a clearer way to encode such code points.
[…]
These semantics describe how to apply the StringValue grammar to a source text to evaluate a Unicode text. Errors encountered during this evaluation are considered a failure to apply the StringValue grammar to a source and must result in a parsing error.
[…]
StringCharacter :: \u EscapedUnicode

  1. Let value be the hexadecimal value represented by the sequence of HexDigit within EscapedUnicode.
  2. Assert value is a within the Unicode scalar value range (≥ 0x0000 and ≤ 0xD7FF or ≥ 0xE000 and ≤ 0x10FFFF).
  3. Return the Unicode scalar value value.

StringCharacter :: \u HexDigit HexDigit HexDigit HexDigit

  1. Let leadingValue be the hexadecimal value represented by the first sequence of HexDigit.
  2. Let trailingValue be the hexadecimal value represented by the second sequence of HexDigit.
  3. If leadingValue is ≥ 0xD800 and ≤ 0xDBFF (a Leading Surrogate):
    1. Assert trailingValue is ≥ 0xDC00 and ≤ 0xDFFF (a Trailing Surrogate).
    2. Return (leadingValue - 0xD800) × 0x400 + (trailingValue - 0xDC00) + 0x10000.
  4. Otherwise:
    1. Assert leadingValue is within the Unicode scalar value range.
    2. Assert trailingValue is within the Unicode scalar value range.
    3. Return the sequence of the Unicode scalar value leadingValue followed by the Unicode scalar value trailingValue.

In particular:

  • The conversion should be fallible (TryFrom instead of From)
  • Conversions errors “are considered a failure to apply the grammar and must result in a parsing error”. In apollo-parser this is SyntaxTree::errors(), but that is completed without actually extracting the unescaped value of string literals. So the lexer is likely the right place to check whether that conversion would succeed.

Steps to reproduce

(Short version, see test case without apollo-compiler below.)

let mut compiler = apollo_compiler::ApolloCompiler::new();
let input = r#"{ field(string: "\uD83E\uDD80") }"#;
compiler.add_document(input, "");
compiler.validate();

Expected result

Either:

  • Validation includes a syntax error
  • (Per draft spec) no syntax error, and HIR contains Value::String("🦀") (the UTF-16 decoding of 0xD83E 0xDD80)

Actual result

thread 'database::hir::tests::surrogates' panicked at 'called `Option::unwrap()` on a `None` value', crates/apollo-parser/src/ast/node_ext.rs:176:43
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Test case

Fixing this per the draft spec should make this test pass:

crates/apollo-parser/src/ast/node_ext.rs

#[cfg(test)]
mod tests {
    use crate::ast;
    use crate::Parser;

    #[test]
    fn test_value_conversions() {
        let input = r#"
            query ($arg: Int!) {
                field(
                    float: 1.234,
                    int: 1234,
                    bool: true,
                    variable: $arg,
                    string: "some \"text\"\n\uD83E\uDD80",
                )
            }
        "#;
        let values = extract_argument_values(input);
        let ast::Value::FloatValue(float) = &values[0] else { panic!() };
        let ast::Value::IntValue(int) = &values[1] else { panic!() };
        let ast::Value::BooleanValue(bool) = &values[2] else { panic!() };
        let ast::Value::Variable(var) = &values[3] else { panic!() };
        let ast::Value::StringValue(string) = &values[4] else { panic!() };
        assert_eq!(f64::try_from(float), Ok(1.234));
        assert_eq!(i32::try_from(int), Ok(1234));
        assert_eq!(bool::try_from(bool), Ok(true));
        assert_eq!(var.name().unwrap().text().as_str(), "arg");
        assert_eq!(String::try_from(string).as_deref(), Ok("some \"text\"\n🦀"));
    }

    fn extract_argument_values(input: &str) -> Vec<ast::Value> {
        let ast = Parser::new(input).parse();
        assert_eq!(ast.errors().as_slice(), []);
        let ast::Definition::OperationDefinition(operation) = ast
            .document()
            .definitions()
            .next()
            .unwrap()
        else { panic!("expected an operation") };
        let ast::Selection::Field(field) = operation
            .selection_set()
            .unwrap()
            .selections()
            .next()
            .unwrap()
        else { panic!("expected a field") };
        field
            .arguments()
            .unwrap()
            .arguments()
            .map(|arg| arg.value().unwrap())
            .collect()
    }
}
@SimonSapin SimonSapin added bug Something isn't working apollo-parser triage labels Aug 3, 2023
@goto-bus-stop goto-bus-stop self-assigned this Aug 28, 2023
@SimonSapin SimonSapin changed the title Parsing surrogate code points in EscapedUnicode Panic with surrogate code points in EscapedUnicode Sep 25, 2023
@SimonSapin
Copy link
Contributor Author

The new feature of decoding surragate pairs is split off to a new issue: #657, leaving this one about the panic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apollo-parser bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants