Ignored token still being captured #180

stoneRdev · 2021-07-26T03:34:56Z

Hello, I have been working on string parsing, and can't seem to get the token ignore operator to work.

Here's my grammar:

String <- StringStart < StringContent > StringEnd
StringContent <- StringInnerContent*
StringInnerContent 
    <- !($tag / "\\") < . >
    / Escape
    / FullEscapeSequence
~StringStart <- $tag< StringTag >
~StringEnd <- $tag
StringTag <- "'" / '"'
FullEscapeSequence
    <- "\\n" | "\\r" | "\\t" | "\\b" | "\\f" | "\\v" | "\\0"
Escape 
    <- NonCapturingEscape < (StringTag / ("x" [0-9a-fA-F]{2} ) / "\\") >
~NonCapturingEscape <- "\\"

Test input:
"a \n \r \t \b \f \v \0 \' \" \\ \xff"
Output (from PEG Playground):

- String (a \n \r \t \b \f \v \0 \' \" \\ \xff)

Expected output:

- String (a \n \r \t \b \f \v \0 ' " \ xff)

I have tried a few variations, but this one seems to me to be the one most indicative of what I need.

I'm trying to parse strings with escape sequences, where some sequences keep the backslash and others drop it. However, Escape still includes the backslashes, even though I'm using token boundaries and the ignore operator to exclude NonCapturingEscape.

Am I doing something wrong? Or is this a bug in the parser?

Thank you for your time!

-- As an aside, once I get this figured out I'm planning to make my grammar also accept escaped escape sequences, where the input \\n becomes \n. If you have any pointers there, I would greatly appreciate them! Thank you


yhirose · 2021-07-31T21:20:07Z

@stoneRdev, in short, cpp-peglib can't do what you are trying to do with AstBase::token, because the token is a string view over a region of the original parsed text rather than a separate string, as shown below:

cpp-peglib/peglib.h, lines 3852 to 3853 in 4109480:

    const bool is_token;
    const std::string_view token;

In other words, resolving escape sequences isn't the parser's job...
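
To illustrate the point: a std::string_view is only a pointer and a length into the original input, so it cannot leave out bytes in the middle of the matched region. Dropping a backslash means building a new string after parsing. The sketch below is one way to do that for the input in this issue. The name resolve_escapes is hypothetical and not part of cpp-peglib, and the rules (drop the backslash of \' \" \\ and \xHH, keep everything else) are taken from the expected output above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of cpp-peglib. Builds a new string from the
// raw token text: the backslash of \' \" \\ and \xHH is dropped, while other
// sequences such as \n are copied through unchanged.
std::string resolve_escapes(std::string_view raw) {
  auto is_hex = [](char c) {
    return std::isxdigit(static_cast<unsigned char>(c)) != 0;
  };
  std::string out;
  out.reserve(raw.size());
  for (std::size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] == '\\' && i + 1 < raw.size()) {
      const char next = raw[i + 1];
      const bool quote_or_backslash =
          next == '\'' || next == '"' || next == '\\';
      const bool hex_escape = next == 'x' && i + 3 < raw.size() &&
                              is_hex(raw[i + 2]) && is_hex(raw[i + 3]);
      if (quote_or_backslash) {
        out.push_back(next);  // drop the backslash, keep the escaped char
        ++i;
        continue;
      }
      if (hex_escape) {
        out.append(raw.substr(i + 1, 3));  // keep "xHH" without the backslash
        i += 3;
        continue;
      }
    }
    out.push_back(raw[i]);  // ordinary character, or part of a kept escape
  }
  return out;
}
```

This would be called on the token text once parsing has finished, for example from a semantic action or while walking the AST. With these rules the input \\n also comes out as \n, which may be close to what the side question asks for.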

If I were doing something similar to what you want, String wouldn't be a token rule but a regular rule that contains child character nodes. I have actually done something similar in my interpreter to implement interpolated strings. An interpolated string is expressed as a list of EXPRESSION and INTERPOLATED_CONTENT nodes.
https://github.com/yhirose/culebra/blob/cb52dd66a7a205ace5296d09a6ec27764d658d04/include/interpreter.h#L75-L76

The interpolated content nodes will be properly resolved later in the interpreter.
https://github.com/yhirose/culebra/blob/cb52dd66a7a205ace5296d09a6ec27764d658d04/include/interpreter.h#L1044-L1056
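
The same idea applied to escape sequences might look roughly like the sketch below: the parser only produces a list of child nodes, and a later pass decides what each node contributes to the final string. Piece, PieceKind and resolve are hypothetical names for illustration only. They are not cpp-peglib's AST types and not the culebra code linked above.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical node type for illustration; cpp-peglib's own AST type differs.
enum class PieceKind { Literal, DroppedEscape };

struct Piece {
  PieceKind kind;
  std::string_view text;  // still just a view into the original input
};

// Resolution happens after parsing: each child decides what it contributes.
std::string resolve(const std::vector<Piece>& pieces) {
  std::string out;
  for (const Piece& p : pieces) {
    if (p.kind == PieceKind::DroppedEscape && !p.text.empty()) {
      out.append(p.text.substr(1));  // skip the leading backslash
    } else {
      out.append(p.text);  // literals and kept escapes are copied as-is
    }
  }
  return out;
}
```

Each node still holds an unmodified view of the source text, so nothing has to change in the parser. Only the resolving pass allocates a new string.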

You may want to check #44 as well.

Hope the above explanation helps!

yhirose closed this as completed Jul 31, 2021
