Skip to content

Latest commit

 

History

History

ParserGen

Command line

Output of parsergen -h
USAGE: parsergen <Grammar> <Filename> <TokensOutput> [-I ARG] [-a] [--with-bom] [-h]

DESCRIPTION:
  LR(1)/GLR-Parser Generator command line tool v1.7.4

  Where:
  1. <Grammar> is a grammar definition file.
  2. Generated C++ source files are named as:
     <Filename>IdDef.h - Lexical token enumerations
     <Filename>.h      - Header of parser class
     <Filename>.cpp    - Implementation of parser class
  3. Generated token definitions are written to <TokensOutput> to feed scannergen

VALID FLAGS:
  -I, --include-dir ARG
	Search path of #include directive within <Grammar>
  -a, --yes-to-all
	Quietly overwrite all existing output files
  --with-bom
	Prefix BOM to all output files
  -h, --help
	Display this help and exit

Grammar

Grammar definition file consists of lines. There are 7 types of lines which can be mixed up almost in no particular order. No line type is mandatory. The cases where line order does matter will be highlighted by 💣

1. Empty line & Comment

Empty line together with

/*
 *      Multi-lined Comment
 */
C++-style Comment
// Single-lined Comment

They are all helpful to enhance readability. Put as many as you like where you like.

💡
Comments can also disable other types of lines (and later re-enable them just as quickly):
//%SHOW_UNDEFINED

2. Production rule

Syntax

<NID> ::= (<NID>|AnythingElse)*
<NID> ::= (<NID>|AnythingElse)* [[
? Multi-lined reduction code in C++
]]

💡
NID is short for Non-terminal ID.

Example

<All> ::= <Line>                        // (1)
<All> ::= <All> "\n" <Line>
<Line> ::= <Production> <Semantics> [[  // (2)
    auto &c = $c;
    if (!c.testCond())
        return;

    auto &prod = dynamic_cast<C_Production&>(*$1);
    if (!c.addProduction(prod, bux::tryUnlex<C_Semantic>($2))) [[unlikely]] // (3)
        $p.onError($1, "Production re-defined:\n"
                             "\t" + prod.str());
]]
  1. No [[ ]], no reduction.

  2. Doubly-bracketed [[reduction code]] may contain the following mnemonics:

    • $p - Reference to the base parser class, either of type bux::LR1::C_Parser & or of type bux::GLR::C_Parser &

    • $P - Reference to the generated parser class

    • $c - Reference to the context instance of the generated parser class if context type is defined by option %CONTEXT

    • $r - The result token buffer which can be freely assigned

    • $1 $2 $3 …​ denote the 1st, 2nd, 3rd, .. operand to the right of ::= respectively, terminal or non-terminal.

  3. C++ Attributes can still be used in an reduction block

Notes

  1. parsergen deals only context-free grammars. Therefore exactly one non-terminal is allowed to the left of ::= per production.

  2. A reduced non-terminal operand combines with zero or more terminal/ non-terminal neighbors, reduces again into 'upper' non-terminal, …​ and eventually reduces into <@>, the root non-terminal aka the start symbol. Thus the whole input string parsed is deemed accepted.

  3. When there is no production rule at all, the grammar defined a language only accepting empty string, demonstrated by MinLang

🔥
💣 parsergen has to know the start symbol before calculation. If there is a production like <@> ::= …​, then <@> is the start symbol; otherwise, the left side of ::= in the first parsed production, say <All>, becomes the start symbol, and an extra production <@> ::= <All> is added implicitly.

3. % Option definition

Syntax

%Id
%Id [[Single-lined contents]]
%Id [[
? Multi-lined contents
]]

Example
%SHOW_UNDEFINED
%CONTEXT            [[C_BNFContext]]
%HEADERS_FOR_HEADER [[
#include "BNFContext.h"     // C_BNFContext
]]

Known Options

Known Option Output To Action / Meaning

%IDDEF_SOURCE

ParserIdDef.h

Let "Path/To/IdDef.h" be value of %IDDEF_SOURCE
ParserIdDef.h will have one line:

#include "Path/To/IdDef.h"

ℹ️ Defining this option means the parser will work with an existing scanner. "Path/To/IdDef.h" should have all token ids of the scanner and also happens to have all token ids needed by the target parser.
User is on his own to ensure this.

%ERROR_TOKEN

Parser.cpp

If %ERROR_TOKEN is either defined valuelessly or with value [[Error]] and $Error is found in productions, the underlying error recovery mechanism of the base parser class will be awakened by telling I_ParserPolicy the error token id is TID_LEX_Error, which will be defined in ParserIdDef.h (to be explained)

%UPCAST_TOKEN

Parser.cpp

Implement the following policy method with valid mnemonics $token $attr

bool C_ParserPolicy::changeToken(T_LexID &token, C_LexPtr &attr) const

A try to break down a scanned token input and take its first char as new input to resume parsing.

Example
%UPCAST_TOKEN [[
    if (isascii($token) &&
       !iscntrl($token) &&
       !isalnum($token) &&
       !isspace($token))
    {
        $attr.assign(bux::createLex<std::string>(1,char($token)), true);
        $token = TID_LEX_Operator;
        return true;
    }
    return false;
]]

%ON_ERROR

Parser.cpp

Implement the following policy method with valid mnemonics $p $P $c $pos $message

void C_ParserPolicy::onError(
     bux::LR1::C_Parser     &,
     const bux::C_SourcePos &pos,
     const std::string      &message) const
Example 1
%CONTEXT  [[C_Context]]
%ON_ERROR [[
    $c.issueError(LL_ERROR, $pos, $message);
]]
Example 2
%CONTEXT  [[std::ostream &]]
%ON_ERROR [[
    $c <<'(' <<$pos.m_Line <<',' <<$pos.m_Col <<"): " <<$message <<'\n';
]]

%SHOW_UNDEFINED

Parser.cpp
Parser.h
tokens.txt

When defined, for every other known option not defined, say %FOO, and where output should be spared, output

 // %FOO undefined (expanded here otherwise)

Read all 3 output files of MinLang to find exact locations of such comment lines for various known options.

%CONTEXT

Parser.cpp
Parser.h

Type of public member data m_context of the generated parser class. This becomes necessary when user needs more tailored controls within code blocks either for reduction or defined by some of these known options thru mnemonic $c

%IGNORE_KEYWORD_CASE

ParserIdDef.h
tokens.txt

This option tells parsergen to treat keywords case-insensitively. Convenient when you define a case-insensitive language, e.g. SQL

%HEADERS_FOR_HEADER

Parser.h

Output before entering namespace scope of the target parser class:

 // %HEADERS_FOR_HEADER expanded BEGIN
...(your code)...
 // %HEADERS_FOR_HEADER expanded END

%PRECLASSDECL

Parser.h

Output within namespace scope of the target parser class and before the class is defined:

 // %PRECLASSDECL expanded BEGIN
...(your code)...
 // %PRECLASSDECL expanded END

%INCLASSDECL

Parser.h

Output within the definition of target parser class and right after the common members are declared:

 // %INCLASSDECL expanded BEGIN
...(your code)...
 // %INCLASSDECL expanded END

ℹ️ If %CONTEXT is not defined, the embedding block starts with public access; otherwise, the embedding block starts with private access. The starting access can be explicitly changed within to whichever access you want, of course.

%HEADERS_FOR_CPP

Parser.cpp

Output after the banner comment and before any non-comment code:

 // %HEADERS_FOR_CPP expanded BEGIN
...(your code)...
 // %HEADERS_FOR_CPP expanded END

%LOCAL_CPP

Parser.cpp

Output within anonymous namespace scope and between common using namespace declarations and in-module constant definitions:

 // %LOCAL_CPP expanded BEGIN
...(your code)...
 // %LOCAL_CPP expanded END

%SCOPED_CPP_HEAD

Parser.cpp

Output within namespace scope of the target parser class and before ctor/method bodies of the class:

 // %SCOPED_CPP_HEAD expanded BEGIN
...(your code)...
 // %SCOPED_CPP_HEAD expanded END

%SCOPED_CPP_TAIL

Parser.cpp

Output within namespace scope of the target parser class and after ctor/method bodies of the class:

 // %SCOPED_CPP_TAIL expanded BEGIN
...(your code)...
 // %SCOPED_CPP_TAIL expanded END

%SCANNEROPTION

tokens.txt

Output as the first part of tokens.txt

%EXTRA_TOKENS

tokens.txt

|-separated token identifiers which again | with parsergen-generated keywords & compound operators to for the final token definition for scannergen. The very last token is the mandated initial state of the underlying finite state machine.

ℹ️ Multiple %EXTRA_TOKENS are allowed. The result token will |-concatenate all of them.

Input

%EXTRA_TOKENS [[dec_num|hex_num|identifier|c_char|c_str|spaces]]
%EXTRA_TOKENS [[bracketed|c_comment|line_comment]]
%EXTRA_TOKENS [[LexSymbol|Nonterminal|CompoundSymbol]]

Output

the_very_last = …​(generated keywords & compound operators)…​ | dec_num|hex_num|identifier|c_char|c_str|spaces|bracketed| …​(the rest)…​

%HEADERS_FOR_SCANNER_CPP

tokens.txt

Output as part of %HEADERS_FOR_CPP option value for scannergen like

%HEADERS_FOR_CPP     [[
#include "ParserIdDef.h"

 // %HEADERS_FOR_SCANNER_CPP expanded BEGIN
#include "BracketBalance.h"
 // %HEADERS_FOR_SCANNER_CPP expanded END
using namespace Main;
]]

%LOCALS_FOR_SCANNER_CPP

tokens.txt

Output as %LOCAL_ACTION_DEFS option value for scannergen like

%LOCAL_ACTION_DEFS     [[
 // %LOCALS_FOR_SCANNER_CPP expanded BEGIN
...(your code)...
 // %LOCALS_FOR_SCANNER_CPP expanded END
]]

%SUPPRESS_GLR_CONFLICTS

When defined, all conflicted actions turning the target parser a GLR will not be printed to console thruout parser generation.

4. $ New lexid

Syntax

lexid Id1 Id2 …​

Example

lexid Spaces

Notes
  1. If you lexid an identifier, say foo, and you also use $foo in production rules, then the lexid line is completely redundant.

  2. Currently the only recurring use case is the example above where the ready-made "RE_Suite.txt" defines continuous space chars, C_style comment, and C++-style comment to be created into a Spaces token (specifically a lexical token with id TID_LEX_Spaces), and the target language(parser) tries to ignore all spaces. This is when the screener comes in handy.

C_Parser                            parser;
bux::C_ScreenerNo<TID_LEX_Spaces>   screener{parser};
C_Scanner                           scanner{screener};
bux::scanFile(">", in, scanner);

// Test acceptance
if (!parser.accepted())
{
   std::cerr <<"Incomplete expression!\n";
   continue; // or break or return
}

// Apply the result
// ... parser.getFinalLex()

5. # Directives

Seriously, these are not preprocessor directives but processed in the same pass as other type of lines. They just happen to use same old syntaxes:

Directive

Meaning

#include "Foo.txt"

Replace this line with lines read from file "Foo.txt"
The grammar definition file of parsergen is pretty much a POC of this directive.

#ifdef Bar

💣 If option %Bar is defined, include subsequent lines until whichever the paired #else or #endif is reached first; otherwise, include lines between #else and #endif if #else is present.

#ifndef Bar

💣 If option %Bar is not defined, include subsequent lines until whichever the paired #else or #endif is reached first; otherwise, include lines between #else and #endif if #else is present.

#else

💣

#endif

💣

💣 Pairing rules of #ifdef, #ifndef, #else, #endif comply with C++ preprocessor counterparts
💡
No #if (expr) and #elif (expr) because relevant scenarios are yet to be seen and the implementing effort is estimated high.

6. Parser class naming

Syntax

class (<namespace> ::)* <class_name>

Example

class Main::C_BNFParser

Notes
  1. At most one such line is allowed.

  2. When absent, the parser class has the default name formatted from the base name of the 2nd commandline argument, i.e. <Filename>, except every char which is neither letter nor digit will be replaced by '_'. For example:

    • If <Filename> is Parser, the class name will be C_Parser.

    • If <Filename> is Script/Parser, the class name will still be C_Parser.

    • If <Filename> is Parser-2nd, the class name will be C_Parser_2nd.

  3. This will become a problem only when an application uses multiple parsergen-generated parsers.

  4. Use of namespace(s) is encouraged when the generated parser is part of a library.

7. Operator precedence

Syntax

(left|right|prec) op1 op2 op3 …​

ℹ️
left: Left-associative, left operator first
right: Right-associative, right operator first
prec: No associativity, conflict leads error directly.
Example

left + -
left * / %
right ( )

ℹ️
Lines parsed later get higher precedence.

Error Recovery & Parser Logger

Token $Error which is assured to never be generated by scanner is used in some of productions. Parser always matches those productions not using $Error first to shift or reduce. Only if that attempt fails, parser starts to rollback the process (or state stack) seeking the first doable point to insert $Error, i.e. matching one of those productions using $Error so that parsing can move on. That’s all for the current error recovery, folks!

A supported way to have parser logger is by declaring user’s context type which supports methods to do so, illustrated below:

Typical example (Use both $Error & $c.log())

Routine options
%ERROR_TOKEN                              // (1)
%CONTEXT    [[bux::C_ParserOStreamCount]] // (2)
%ON_ERROR   [[                            // (3)
    $c.log(LL_ERROR, $pos, $message);
]]
  1. Awaken the target parser’s error recovery.
    If grammar token $Error, which has C++ token id TID_LEX_Error, is not possibly produced by scanner, $Error appears in right halves of productions to indicate the context & position where the parsing goes wrong with C++ code annotations to issue parser logs and/or to make parsing move on (to catch more errors in one run);
    Otherwise, simply assign the error token a new name, say
    %ERROR_TOKEN MyErr
    and thus we have token $MyErr and corresponding token id TID_LEX_MyErr to replace $Error and TID_LEX_Error. Use $Error ro represent real inputs like any other normal tokens, e.g. $Num, $Id, …​

  2. The current support to log parser messages in chronological order while counting them in 5 error levels, i.e. LL_FATAL, LL_ERROR, LL_WARNING, LL_INFO, LL_VERBOSE. The class is defined in ParserBase.h (implicitly included by every generated parser header). Surely you can still have your own context class either deriving from bux::C_ParserOStreamCount or having it as a member data.

  3. Implement policy method onError() by calling bux::C_ParserOStreamCount::log()

Identify specific errors/warnings & log them
<value> ::= { <members> }   [[ // (1)
    $r = bux::createLex<json::value>(bux::unlex<json::object>($2));
]]

<members> ::= <member>              [[
    json::object t;
    auto &src = bux::unlex<std::pair<std::string,json::value>>($1);
    t.try_emplace(std::move(src.first), std::move(src.second));
    $r = bux::createLex(std::move(t));
]]
<members> ::= <members> , <member>  [[          // (2)
    auto &src = bux::unlex<std::pair<std::string,json::value>>($3);
    bux::unlex<json::object>($1).try_emplace(std::move(src.first), std::move(src.second));
    $r = $1;
]]
<members> ::= <members> , $Error    [[          // (3)
    $c.log(LL_WARNING, $2, "Superfluous ','");  // (4)
    $r = $1;                                    // (5)
]]

<member> ::= $String : <value>          [[      // (6)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), bux::unlex<json::value>($3)});
]]
<member> ::= $String : $Error           [[      // (7)
    $p.onError($3, "Expect <value>");           // (8)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (9)
]]
<member> ::= $String $Error             [[      // (10)
    $p.onError($2, "Expect ':'");               // (11)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (12)
]]
<member> ::= $Error <value> : <value>   [[      // (13)
    $p.onError($1, "Only string key allowed");  // (14)
    $r = bux::createLex(std::pair{std::string{"NonStrKey__"}, bux::unlex<json::value>($4)}); // (15)
]]
  1. In JSON doc, an object consists of key:value pairs (members) which as a whole is braced by { }

  2. Members are comma(,)-separated.

  3. Trailing comma is not legal, but acceptable (negligible).

  4. Treat a trailing comma as warning rather than error. Warning count incremented.

  5. Just move on the parsing (recover it as nothing happened).

  6. Legit key:value pair.

  7. No value after ':'

  8. Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $3, "Expect <value>");

  9. Pair the key with null value and move on (recover it with a fake value)

  10. No ':' after key

  11. Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $2, "Expect ':'");

  12. Pair the key with null value and move on (recover it with a fake value)

  13. Non-string key

  14. Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $1, "Only string key allowed");

  15. Use "NonStrKey__" as key to pair with the value after ':' and move on (recover it with a fake key)

Boilerplate code to parse JSON stream (source)
    C_Parser            parser{*log};
    bux::C_Screener     preparser(parser, [](auto token){ return token == TID_LEX_Spaces || token == '\n'; });
    C_JSONScanner       scanner(preparser);
    bux::scanFile({}, in, scanner);

    // Check if parsing is ok
    if (const auto n_errs =
        parser.m_context.getCount(LL_FATAL) +
        parser.m_context.getCount(LL_ERROR))      // (1)
        RUNTIME_ERROR("Total {} errors", n_errs);

    // Acceptance
    if (!parser.accepted())
        RUNTIME_ERROR("Incomplete expression!");

    return bux::unlex<value>(parser.getFinalLex());;
  1. Any fatal or error fails the parsing. IOW, parsing is ok with any number of warning, info, verbose messages. But it is totally fine to have different criteria to be deemed ok with.

Parsing error/warning without $Error

Example 1
<members> ::= <members> ,   [[  // (1)
    $c.log(LL_WARNING, $2, "Superfluous ','");
    $r = $1;
]]
  1. The almost same production issues a warning already exemplified above except this one is $Error-free. The effect is completely identical.

Example 2
<value> ::= ( <elements> )  [[
    $p.onError($1, "Tuple (...) not allowed, use array [...] instead");
    $r = bux::createLex<json::value>(bux::unlex<json::array>($2));
]]

$Error not error

<member> ::= $String : $Error           [[  // (1)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}});
]]
  1. Extend JSON syntax by allowing default value null (and not issuing anything)

Generated files

(To be explained)