Lexing fails for string containing Unicode escape sequence #55

wahajenius · 2024-10-23T13:43:20Z

The lexer fails with a token recognition error on input strings containing a Unicode escape sequence such as 'Fran\u00E7ois'. Wrapping the input stream in a CaseInsensitiveInputStream makes it work, though.

Here is a unit test demo:

    @Test
    void testLexerUnicodeEscapes() {
        String s = "'Fran\\u00E7ois'";

        // Using a plain CodePointCharStream fails
        IllegalStateException exc = assertThrows(IllegalStateException.class, () -> {
            tryLexing(CharStreams.fromString(s));
        });
        assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.", exc.getMessage());

        // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
        CommonTokenStream tokens = tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
        assertEquals(2, tokens.size());
    }

    private CommonTokenStream tryLexing(CharStream stream) {
        ApexLexer lexer = new ApexLexer(stream);
        lexer.removeErrorListeners(); // Avoid distracting "token recognition error" stderr output
        lexer.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
                throw new IllegalStateException(String.format("Syntax error on line %d:%d: %s.",
                    line, charPositionInLine, msg));
            }
        });
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        return tokens;
    }

Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect string values like these.

Notes:

- Upgrading ANTLR from 4.9.1 to 4.13.2 does not solve it, but it's still good practice.
- Lexing with CommonTokenStream works correctly for literal non-ASCII Unicode characters like 'François'.


adangel · 2024-10-24T08:36:43Z

ok, the problem would be that the HexCharacters in the grammar only allow lowercase characters:

apex-parser/antlr/ApexLexer.g4, line 313 in eb37247:

    : Digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

That's why String s = "'Fran\\u00E7ois'"; produces this syntax error, but String s = "'Fran\\u00e7ois'"; should work though...
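To illustrate the point outside of ANTLR, here is a minimal sketch that uses java.util.regex as a stand-in for the grammar rule. The class name LowercaseHexDemo and its methods are made up for this example, and the case-folding step only assumes that a case-insensitive input stream presents lower-cased characters to the lexer; it is not taken from the actual CaseInsensitiveInputStream source.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LowercaseHexDemo {
    // Stand-in for the lowercase-only hex rule: a backslash, 'u', then four
    // characters out of 0-9 and a-f. This is an illustration, not the grammar.
    private static final Pattern LOWERCASE_ONLY = Pattern.compile("\\\\u[0-9a-f]{4}");

    // Matches the escape exactly as written in the source text.
    static boolean acceptsRaw(String escape) {
        return LOWERCASE_ONLY.matcher(escape).matches();
    }

    // Assumption: a case-insensitive stream lower-cases what the lexer sees,
    // so an uppercase hex digit looks like its lowercase form.
    static boolean acceptsCaseFolded(String escape) {
        return LOWERCASE_ONLY.matcher(escape.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        System.out.println(acceptsRaw("\\u00E7"));        // uppercase hex digit, as written
        System.out.println(acceptsRaw("\\u00e7"));        // lowercase hex digit, as written
        System.out.println(acceptsCaseFolded("\\u00E7")); // uppercase hex digit, case-folded first
    }
}
```

Under that assumption, the uppercase escape is rejected as written but accepted once folded to lowercase, which would explain why wrapping the stream hides the problem.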

wahajenius · 2024-11-04T09:01:25Z

That makes sense. Then I would suggest changing it to [0-9a-fA-F], as done by ANTLR's own Apex and Java grammars.
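As a rough check of the suggested character class, the sketch below again uses java.util.regex rather than the ANTLR grammar itself. WidenedHexDemo and its methods are hypothetical names for this example only. It shows that with [0-9a-fA-F] both spellings are accepted and decode to the same code point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WidenedHexDemo {
    // Stand-in for a hex rule widened to both cases: a backslash, 'u', then
    // four hex digits in either case (captured as group 1).
    private static final Pattern ANY_CASE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static boolean accepts(String escape) {
        return ANY_CASE.matcher(escape).matches();
    }

    // Decodes the four hex digits to a code point; Integer.parseInt with
    // radix 16 accepts upper- and lowercase digits alike.
    static int codePoint(String escape) {
        Matcher m = ANY_CASE.matcher(escape);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a Unicode escape: " + escape);
        }
        return Integer.parseInt(m.group(1), 16);
    }

    public static void main(String[] args) {
        System.out.println(accepts("\\u00E7") && accepts("\\u00e7"));
        System.out.println(codePoint("\\u00E7") == codePoint("\\u00e7"));
    }
}
```

With the widened class, the string values no longer depend on case folding of the input stream.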

adangel mentioned this issue Oct 24, 2024

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences pmd/pmd#5284

Merged


pwrightcertinia mentioned this issue Nov 12, 2024

Fix lexer hex characters #56

Merged

pwrightcertinia closed this as completed in #56 Nov 14, 2024

adangel mentioned this issue Nov 14, 2024

[apex] Token recognition errors for string containing unicode escape sequence pmd/pmd#5333

Closed
