Locally disable %whitespace #44

olivren · 2018-08-02T16:28:15Z

I have a grammar for a programming language. It defines %whitespace, because whitespace is not significant.

Now, I want to parse string literals with a rule like this:

StrQuot   <- '"' (StrEscape / StrChars)* '"'
StrEscape <- < '\\' any >
StrChars  <- < (!'"' !'\\' any)+ >

StrEscape and StrChars both have actions that produce a std::string, and I combine those strings in the action for StrQuot. The problem is that whitespace inside the string literal is skipped, so the resulting string has all of its whitespace filtered out.

Is there a way to locally deactivate the %whitespace rule?

yhirose · 2018-08-03T02:03:20Z

@olivren, thanks for the report. I tried to handle this situation without making any code change in peglib.h, and found the following solution.

TEST_CASE("WHITESPACE test3", "[general]") {
    peg::parser parser(R"(
        StrQuot      <- < '"' < (StrEscape / StrChars)* > '"' > # Nested token operators
        StrEscape    <- '\\' any
        StrChars     <- (!'"' !'\\' any)+
        any          <- .
        %whitespace  <- [ \t]*
    )");

    parser["StrQuot"] = [](const SemanticValues& sv) {
        REQUIRE(sv.token() == R"(  aaa \" bbb  )"); // Get text in the inner token operator
    };

    auto ret = parser.parse(R"( "  aaa \" bbb  " )");
    REQUIRE(ret == true);
}

The key to the solution is to use token operators effectively: peglib does not apply %whitespace skipping to text enclosed in a token operator (< ... >).

I also discovered that when a rule uses nested token operators, calling sv.token() in the corresponding action handler returns only the text matched by the inner token operator, here (  aaa \" bbb  ).

Please let me know if the above solution can work for you. Thanks!

olivren · 2018-08-03T09:04:20Z

Thanks for your answer. I played a bit with the placement of the token operators in my grammar, and I found a combination that works for my case. I am not sure I understand exactly when whitespace is ignored, though.

Here is the full code for reference. It may be useful for other users. It implements Python-style string literals (no string prefix, no raw strings, no \o \x \N \u escape codes, and no handling of a backslash followed by a newline).

StrDblQuot      <- < '"' < (StrEscape / StrDblQuotChars)* > '"' >
StrEscape       <- '\\' any
StrDblQuotChars <- (!'"' !'\\' any)+
any             <- !'\n' !'\r' .
%whitespace     <- [ \t]*

rule["StrDblQuot"] = [](const SemanticValues& sv) -> string {
  ostringstream ss;
  for(string& e: sv.transform<string>())
    ss << e;
  return ss.str();
};

rule["StrEscape"] = [](const SemanticValues& sv) -> string {
  string tok = sv.token();
  assert (tok.size() == 2);
  switch(tok.back()) {
  case '\\': return "\\";
  case '\'': return "'";
  case '"': return "\"";
  case 'a': return "\a";
  case 'b': return "\b";
  case 'f': return "\f";
  case 'n': return "\n";
  case 'r': return "\r";
  case 't': return "\t";
  case 'v': return "\v";
  default: return tok;
  }
};

rule["StrDblQuotChars"] = [](const SemanticValues& sv) -> string {
  return sv.token();
};

The input ("   a  \"  \\   b   ") yields the string (   a  "  \   b   )
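For anyone who wants to check the expected result without pulling in peglib, here is a minimal hand-written sketch of the same decoding that the actions above perform on the text between the quotes. The names decode_escape and decode_string_body are hypothetical helpers made up for this illustration; they are not part of cpp-peglib, and this is not how the parser itself works internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a cpp-peglib API): maps a two-character escape,
// i.e. a backslash plus one character, to the text it denotes. This mirrors
// the StrEscape action above; unknown escapes are returned unchanged.
std::string decode_escape(const std::string& tok) {
  assert(tok.size() == 2 && tok.front() == '\\');
  switch (tok.back()) {
    case '\\': return "\\";
    case '\'': return "'";
    case '"':  return "\"";
    case 'a':  return "\a";
    case 'b':  return "\b";
    case 'f':  return "\f";
    case 'n':  return "\n";
    case 'r':  return "\r";
    case 't':  return "\t";
    case 'v':  return "\v";
    default:   return tok;
  }
}

// Hypothetical helper: decodes the text found between the two double quotes.
// Spaces and tabs are copied verbatim, which is the behavior the nested
// token operators are meant to give.
std::string decode_string_body(const std::string& body) {
  std::string out;
  for (std::size_t i = 0; i < body.size(); ++i) {
    if (body[i] == '\\' && i + 1 < body.size()) {
      out += decode_escape(body.substr(i, 2));
      ++i;  // skip the escaped character as well
    } else {
      out += body[i];
    }
  }
  return out;
}
```

Calling decode_string_body on the body (   a  \"  \\   b   ) returns (   a  "  \   b   ), the same string as above, so it can serve as a reference when testing the grammar.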

yhirose · 2018-08-03T23:17:18Z

Looks great! I labeled it as 'information', so that users could benefit from it.

Beedeebee · 2020-08-19T12:54:11Z

Hello, I have a similar problem with whitespace. I wasn't sure whether to add to this issue or create a new one, so I apologize in advance if I chose the wrong option.

I'm trying to write a grammar where whitespace matters at only one point: between an identifier and a '(', to distinguish between a function call (no space allowed) and an identifier followed by a grouped expression. Something like:

EXPR    <- GROUP+
GROUP   <- '(' EXPR ')' / PRIMARY
PRIMARY <- IDENT '(' EXPR (',' EXPR)* ')' # function call - no space allowed after IDENT
           / IDENT                        # variable name
           / LITERAL                      # literal

I'd like to parse x(x) as a function call and x (x) as an IDENT followed by a GROUP.

I believe that cpp-peglib's syntax works exactly like this as well.

Is it possible to use Tokens or any other solution to say that no whitespace is allowed between IDENT and '(' for function calls, without having to give up %whitespace altogether?

yhirose · 2020-08-19T16:05:51Z

@Beedeebee, thanks for the feedback. I think there is no easy solution for that unless you give up using %whitespace...

A workaround that I can come up with for this situation is as below:

I know this is not a super beautiful solution, but it works with %whitespace. Hope it helps!

Beedeebee · 2020-08-19T16:51:11Z

Thanks, the workaround you're suggesting works perfectly for me! :)

Details: yhirose/cpp-peglib#44

yhirose added the information label Aug 3, 2018

VasiliyRyabtsev added a commit to VasiliyRyabtsev/quirrel-peg that referenced this issue Feb 5, 2021

Fix whitespace handing in string literals

b15aca3

yhirose mentioned this issue Aug 3, 2021

Ignored token still being captured #180

Closed

fdedinec mentioned this issue Dec 10, 2022

Various issue with recursive tokens #257

Closed
