parsergen/ParserGen at main · buck-yeh/parsergen

Table of Contents

Command line
Grammar
Error Recovery & Parser Logger
Generated files

Command line

Output of parsergen -h

USAGE: parsergen <Grammar> <Filename> <TokensOutput> [-I ARG] [-a] [--with-bom] [-h]

DESCRIPTION:
  LR(1)/GLR-Parser Generator command line tool v1.7.4

  Where:
  1. <Grammar> is a grammar definition file.
  2. Generated C++ source files are named as:
     <Filename>IdDef.h - Lexical token enumerations
     <Filename>.h      - Header of parser class
     <Filename>.cpp    - Implementation of parser class
  3. Generated token definitions are written to <TokensOutput> to feed scannergen

VALID FLAGS:
  -I, --include-dir ARG
	Search path of #include directive within <Grammar>
  -a, --yes-to-all
	Quietly overwrite all existing output files
  --with-bom
	Prefix BOM to all output files
  -h, --help
	Display this help and exit

Grammar

A grammar definition file consists of lines. There are 7 types of lines, which can be mixed in almost any order. No line type is mandatory. The cases where line order does matter are marked with 💣

1. Empty line & Comment

Empty lines, together with

C-style Comment

/*
 *      Multi-lined Comment
 */

C++-style Comment

// Single-lined Comment

They all help readability. Put as many as you like, wherever you like.

💡	Comments can also disable other types of lines (and later re-enable them just as quickly): `//%SHOW_UNDEFINED`

2. Production rule

Syntax

<NID> ::= (<NID>|AnythingElse)*
<NID> ::= (<NID>|AnythingElse)* [[
    Multi-lined reduction code in C++
]]

💡	NID is short for Non-terminal ID.

Example

<All> ::= <Line>                        // (1)
<All> ::= <All> "\n" <Line>
<Line> ::= <Production> <Semantics> [[  // (2)
    auto &c = $c;
    if (!c.testCond())
        return;

    auto &prod = dynamic_cast<C_Production&>(*$1);
    if (!c.addProduction(prod, bux::tryUnlex<C_Semantic>($2))) [[unlikely]] // (3)
        $p.onError($1, "Production re-defined:\n"
                             "\t" + prod.str());
]]

(1) No [[ ]], no reduction.
(2) Doubly-bracketed [[reduction code]] may contain the following mnemonics:
- $p - Reference to the base parser class, either of type bux::LR1::C_Parser & or of type bux::GLR::C_Parser &
- $P - Reference to the generated parser class
- $c - Reference to the context instance of the generated parser class if the context type is defined by option %CONTEXT
- $r - The result token buffer, which can be freely assigned
- $1 $2 $3 … denote the 1st, 2nd, 3rd, … operand to the right of ::= respectively, terminal or non-terminal.
(3) C++ attributes can still be used in a reduction block.

Notes

parsergen deals only with context-free grammars. Therefore exactly one non-terminal is allowed to the left of ::= in each production.
A reduced non-terminal operand combines with zero or more terminal/non-terminal neighbors, reduces again into an 'upper' non-terminal, … and eventually reduces into <@>, the root non-terminal, aka the start symbol. At that point the whole parsed input string is deemed accepted.
When there is no production rule at all, the grammar defines a language that accepts only the empty string, as demonstrated by MinLang.

🔥	💣 parsergen has to know the start symbol before calculation. If there is a production like <@> ::= …, then <@> is the start symbol; otherwise, the left side of ::= in the first parsed production, say <All>, becomes the start symbol, and an extra production <@> ::= <All> is added implicitly.
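
The start-symbol rule can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not parsergen's actual code: `Production` and `resolveStartSymbol` are hypothetical names, a production is reduced to plain strings, and the handling of an empty grammar is a guess.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified production: left side and right-side symbols as plain strings.
struct Production
{
    std::string              lhs;
    std::vector<std::string> rhs;
};

// Returns the start symbol chosen by the rule described above.
// If no production defines <@>, the implicit production <@> ::= <first lhs> is appended.
// Assumption: an empty grammar is left untouched and yields an empty string.
std::string resolveStartSymbol(std::vector<Production> &prods)
{
    if (prods.empty())
        return {};

    for (const auto &p : prods)
        if (p.lhs == "<@>")
            return "<@>";               // explicitly defined root

    // Copy the name first: push_back may reallocate and invalidate references.
    const std::string first = prods.front().lhs;
    prods.push_back({"<@>", {first}});  // implicit <@> ::= <first>
    return first;
}
```

With the example grammar above, whose first parsed production is `<All> ::= <Line>`, the sketch returns `<All>` and appends `<@> ::= <All>`. This is also why the position of the first production matters when no `<@>` production is written.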

3. % Option definition

Syntax

%Id
%Id [[Single-lined contents]]
%Id [[
    Multi-lined contents
]]

Example

%SHOW_UNDEFINED
%CONTEXT            [[C_BNFContext]]
%HEADERS_FOR_HEADER [[
#include "BNFContext.h"     // C_BNFContext
]]

Known Options

Known Option	Output To	Action / Meaning
`%IDDEF_SOURCE`	ParserIdDef.h	Let "Path/To/IdDef.h" be the value of `%IDDEF_SOURCE`. ParserIdDef.h will then have one line: #include "Path/To/IdDef.h" ℹ️ Defining this option means the parser will work with an existing scanner. "Path/To/IdDef.h" should contain all token ids of the scanner, and these must also cover all token ids needed by the target parser. It is up to the user to ensure this.
`%ERROR_TOKEN`	Parser.cpp	If `%ERROR_TOKEN` is defined either without a value or with value `[[Error]]`, and `$Error` is found in productions, the underlying error recovery mechanism of the base parser class is awakened by telling `I_ParserPolicy` that the error token id is `TID_LEX_Error`, which will be defined in ParserIdDef.h (to be explained)
`%UPCAST_TOKEN`	Parser.cpp	Implement the following policy method with valid mnemonics `$token` `$attr`: bool C_ParserPolicy::changeToken(T_LexID &token, C_LexPtr &attr) const. An attempt to break down a scanned token input and take its first char as new input to resume parsing. Example: %UPCAST_TOKEN [[ if (isascii($token) && !iscntrl($token) && !isalnum($token) && !isspace($token)) { $attr.assign(bux::createLex<std::string>(1,char($token)), true); $token = TID_LEX_Operator; return true; } return false; ]]
`%ON_ERROR`	Parser.cpp	Implement the following policy method with valid mnemonics `$p` `$P` `$c` `$pos` `$message` void C_ParserPolicy::onError( bux::LR1::C_Parser &, const bux::C_SourcePos &pos, const std::string &message) const Example 1 %CONTEXT [[C_Context]] %ON_ERROR [[ $c.issueError(LL_ERROR, $pos, $message); ]] Example 2 %CONTEXT [[std::ostream &]] %ON_ERROR [[ $c <<'(' <<$pos.m_Line <<',' <<$pos.m_Col <<"): " <<$message <<'\n'; ]]
`%SHOW_UNDEFINED`	Parser.cpp Parser.h tokens.txt	When defined, for every other known option that is not defined, say `%FOO`, the place where its expansion is omitted gets the comment line: // %FOO undefined (expanded here otherwise). Read all 3 output files of MinLang to find the exact locations of such comment lines for the various known options.
`%CONTEXT`	Parser.cpp Parser.h	Type of public member data `m_context` of the generated parser class. This becomes necessary when user needs more tailored controls within code blocks either for reduction or defined by some of these known options thru mnemonic `$c`
`%IGNORE_KEYWORD_CASE`	ParserIdDef.h tokens.txt	This option tells `parsergen` to treat keywords case-insensitively. Convenient when you define a case-insensitive language, e.g. `SQL`
`%HEADERS_FOR_HEADER`	Parser.h	Output before entering namespace scope of the target parser class: // %HEADERS_FOR_HEADER expanded BEGIN ...(your code)... // %HEADERS_FOR_HEADER expanded END
`%PRECLASSDECL`	Parser.h	Output within namespace scope of the target parser class and before the class is defined: // %PRECLASSDECL expanded BEGIN ...(your code)... // %PRECLASSDECL expanded END
`%INCLASSDECL`	Parser.h	Output within the definition of target parser class and right after the common members are declared: // %INCLASSDECL expanded BEGIN ...(your code)... // %INCLASSDECL expanded END ℹ️ If `%CONTEXT` is not defined, the embedding block starts with public access; otherwise, the embedding block starts with private access. The starting access can be explicitly changed within to whichever access you want, of course.
`%HEADERS_FOR_CPP`	Parser.cpp	Output after the banner comment and before any non-comment code: // %HEADERS_FOR_CPP expanded BEGIN ...(your code)... // %HEADERS_FOR_CPP expanded END
`%LOCAL_CPP`	Parser.cpp	Output within anonymous namespace scope and between common `using namespace` declarations and in-module constant definitions: // %LOCAL_CPP expanded BEGIN ...(your code)... // %LOCAL_CPP expanded END
`%SCOPED_CPP_HEAD`	Parser.cpp	Output within namespace scope of the target parser class and before ctor/method bodies of the class: // %SCOPED_CPP_HEAD expanded BEGIN ...(your code)... // %SCOPED_CPP_HEAD expanded END
`%SCOPED_CPP_TAIL`	Parser.cpp	Output within namespace scope of the target parser class and after ctor/method bodies of the class: // %SCOPED_CPP_TAIL expanded BEGIN ...(your code)... // %SCOPED_CPP_TAIL expanded END
`%SCANNEROPTION`	tokens.txt	Output as the first part of tokens.txt
`%EXTRA_TOKENS`	tokens.txt	\|-separated token identifiers, which are in turn \|-joined with `parsergen`-generated keywords & compound operators to form the final token definition for `scannergen`. The very last token is the mandated initial state of the underlying finite state machine. ℹ️ Multiple `%EXTRA_TOKENS` are allowed. The result token will \|-concatenate all of them. Input: %EXTRA_TOKENS [[dec_num\|hex_num\|identifier\|c_char\|c_str\|spaces]] %EXTRA_TOKENS [[bracketed\|c_comment\|line_comment]] %EXTRA_TOKENS [[LexSymbol\|Nonterminal\|CompoundSymbol]] Output: the_very_last = …(generated keywords & compound operators)… \| dec_num\|hex_num\|identifier\|c_char\|c_str\|spaces\|bracketed\| …(the rest)…
`%HEADERS_FOR_SCANNER_CPP`	tokens.txt	Output as part of `%HEADERS_FOR_CPP` option value for `scannergen` like %HEADERS_FOR_CPP [[ #include "ParserIdDef.h" // %HEADERS_FOR_SCANNER_CPP expanded BEGIN #include "BracketBalance.h" // %HEADERS_FOR_SCANNER_CPP expanded END using namespace Main; ]]
`%LOCALS_FOR_SCANNER_CPP`	tokens.txt	Output as `%LOCAL_ACTION_DEFS` option value for `scannergen` like %LOCAL_ACTION_DEFS [[ // %LOCALS_FOR_SCANNER_CPP expanded BEGIN ...(your code)... // %LOCALS_FOR_SCANNER_CPP expanded END ]]
`%SUPPRESS_GLR_CONFLICTS`		When defined, the conflicting actions that turn the target parser into a GLR parser are not printed to the console during parser generation.

4. $ New lexid

Syntax

lexid Id1 Id2 …

Example

lexid Spaces

Notes

If you lexid an identifier, say foo, and you also use $foo in production rules, then the lexid line is completely redundant.
Currently the only recurring use case is the example above, where the ready-made "RE_Suite.txt" defines consecutive space chars, C-style comments, and C++-style comments to be turned into a Spaces token (specifically a lexical token with id TID_LEX_Spaces), and the target language (parser) ignores all spaces. This is when the screener comes in handy:

C_Parser                            parser;
bux::C_ScreenerNo<TID_LEX_Spaces>   screener{parser};
C_Scanner                           scanner{screener};
bux::scanFile(">", in, scanner);

// Test acceptance
if (!parser.accepted())
{
   std::cerr <<"Incomplete expression!\n";
   continue; // or break or return
}

// Apply the result
// ... parser.getFinalLex()

5. # Directives

Seriously, these are not preprocessor directives; they are processed in the same pass as the other types of lines. They just happen to use the same old syntax:

Directive	Meaning
#include "Foo.txt"	Replace this line with the lines read from file "Foo.txt". The grammar definition file of `parsergen` is pretty much a POC of this directive.
#ifdef Bar	💣 If option `%Bar` is defined, include the subsequent lines up to the paired `#else` or `#endif`, whichever is reached first; otherwise, include the lines between `#else` and `#endif` if `#else` is present.
#ifndef Bar	💣 If option `%Bar` is not defined, include the subsequent lines up to the paired `#else` or `#endif`, whichever is reached first; otherwise, include the lines between `#else` and `#endif` if `#else` is present.
#else	💣
#endif	💣

❗	💣 Pairing rules of `#ifdef`, `#ifndef`, `#else`, `#endif` follow those of their C++ preprocessor counterparts.

💡	There is no `#if (expr)` or `#elif (expr)` because no relevant scenario has come up yet and the implementation effort is estimated to be high.
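
The conditional directives can be pictured with a small line filter like the one below. It is a sketch under assumptions, not the tool's implementation: `filterLines` is a hypothetical name, an option is assumed to count as defined only once an earlier kept line starting with `%Name` has been seen (one plausible reading of why order matters), and unbalanced directives are silently tolerated rather than reported.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns the lines that survive #ifdef/#ifndef/#else/#endif filtering.
std::vector<std::string> filterLines(const std::vector<std::string> &lines)
{
    std::set<std::string>    defined;   // option names defined so far
    std::vector<bool>        active;    // one entry per open #ifdef/#ifndef
    std::vector<bool>        parentOn;  // was the enclosing block active?
    std::vector<std::string> kept;

    const auto on = [&] { return active.empty() || active.back(); };
    const auto startsWith = [](const std::string &s, const char *prefix) {
        return s.rfind(prefix, 0) == 0;
    };
    // First blank- or '['-delimited word of s at or after position 'from'.
    const auto word = [](const std::string &s, std::size_t from) {
        const auto b = s.find_first_not_of(" \t", from);
        if (b == std::string::npos)
            return std::string{};
        const auto e = s.find_first_of(" \t[", b);
        return s.substr(b, e == std::string::npos ? e : e - b);
    };

    for (const auto &line : lines)
    {
        const bool isIfndef = startsWith(line, "#ifndef");
        if (isIfndef || startsWith(line, "#ifdef"))
        {
            const bool isDefined = defined.count(word(line, isIfndef ? 7 : 6)) != 0;
            const bool outer = on();
            parentOn.push_back(outer);
            active.push_back(outer && isDefined != isIfndef);
        }
        else if (startsWith(line, "#else"))
        {
            if (!active.empty())
                active.back() = parentOn.back() && !active.back();
        }
        else if (startsWith(line, "#endif"))
        {
            if (!active.empty())
            {
                active.pop_back();
                parentOn.pop_back();
            }
        }
        else if (on())
        {
            kept.push_back(line);
            if (startsWith(line, "%"))
                defined.insert(word(line, 1));  // "%Bar [[...]]" defines Bar
        }
    }
    return kept;
}
```

Note that in this sketch a commented-out option such as `//%SHOW_UNDEFINED` does not start with `%` and therefore defines nothing, which matches the tip about disabling lines with comments.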

6. Parser class naming

Syntax

class (<namespace> ::)* <class_name>

Example

class Main::C_BNFParser

Notes

At most one such line is allowed.
When absent, the parser class gets a default name: C_ followed by the base name of the 2nd command line argument, i.e. <Filename>, with every char that is neither letter nor digit replaced by '_'. For example:
- If <Filename> is Parser, the class name will be C_Parser.
- If <Filename> is Script/Parser, the class name will still be C_Parser.
- If <Filename> is Parser-2nd, the class name will be C_Parser_2nd.
This will become a problem only when an application uses multiple parsergen-generated parsers.
Use of namespace(s) is encouraged when the generated parser is part of a library.
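
The default naming rule in the notes above can be written down as a small function. This is a minimal sketch: `defaultClassName` is a hypothetical helper, not part of parsergen, and treating both '/' and '\' as path separators is an assumption.

```cpp
#include <cctype>
#include <string>

// Derives the default parser class name from the <Filename> argument.
std::string defaultClassName(const std::string &filename)
{
    // Keep only the base name, i.e. whatever follows the last path separator.
    const auto slash = filename.find_last_of("/\\");
    const std::string base =
        slash == std::string::npos ? filename : filename.substr(slash + 1);

    // Prefix C_ and replace every char that is neither letter nor digit by '_'.
    std::string name = "C_";
    for (const unsigned char ch : base)
        name += std::isalnum(ch) ? char(ch) : '_';
    return name;
}
```

Called with `Parser`, `Script/Parser`, and `Parser-2nd`, the sketch yields `C_Parser`, `C_Parser`, and `C_Parser_2nd`, the three results listed above. The second case shows how two parsers generated into different directories end up with the same class name.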

7. Operator precedence

Syntax

(left|right|prec) op1 op2 op3 …

ℹ️	left: Left-associative, left operator first. right: Right-associative, right operator first. prec: No associativity, a conflict leads directly to an error.

Example

left + -
left * / %
right ( )

ℹ️	Lines parsed later get higher precedence.
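
One way to read these rules is the conventional precedence-table scheme sketched below. It is not taken from parsergen's source: `OpInfo`, `buildPrecedence`, and `whichFirst` are hypothetical names, and how the tool applies the table while resolving conflicts internally is not shown here.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

enum class Assoc { Left, Right, None };

struct OpInfo
{
    int   level;   // larger value = parsed later = higher precedence
    Assoc assoc;
};

// Builds an operator table from lines such as "left + -".
// Lines that do not start with left/right/prec are skipped.
std::map<std::string, OpInfo> buildPrecedence(const std::vector<std::string> &lines)
{
    std::map<std::string, OpInfo> table;
    int level = 0;
    for (const auto &line : lines)
    {
        std::istringstream in(line);
        std::string kind;
        in >> kind;
        Assoc assoc;
        if (kind == "left")       assoc = Assoc::Left;
        else if (kind == "right") assoc = Assoc::Right;
        else if (kind == "prec")  assoc = Assoc::None;
        else continue;

        ++level;
        for (std::string op; in >> op;)
            table[op] = OpInfo{level, assoc};
    }
    return table;
}

enum class First { LeftOp, RightOp, Conflict };

// Decides which operator binds first in "a leftOp b rightOp c".
// Both operators are assumed to be present in the table.
First whichFirst(const std::map<std::string, OpInfo> &table,
                 const std::string &leftOp, const std::string &rightOp)
{
    const OpInfo &l = table.at(leftOp);
    const OpInfo &r = table.at(rightOp);
    if (l.level != r.level)
        return l.level > r.level ? First::LeftOp : First::RightOp;

    // Same level means same line, hence the same associativity.
    switch (l.assoc)
    {
    case Assoc::Left:  return First::LeftOp;
    case Assoc::Right: return First::RightOp;
    default:           return First::Conflict;
    }
}
```

With the lines `left + -` followed by `left * / %`, the sketch lets `*` bind before `+` in `a + b * c` because its line is parsed later, while `a - b + c` groups from the left.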

Error Recovery & Parser Logger

Token $Error, which is guaranteed never to be generated by the scanner, is used in some productions. The parser always tries the productions that do not use $Error first, to shift or reduce. Only if that attempt fails does the parser start to roll back the process (or state stack), seeking the first feasible point to insert $Error, i.e. to match one of the productions using $Error so that parsing can move on. That’s all for the current error recovery, folks!

A supported way to have a parser logger is to declare a user context type that provides logging methods, as illustrated below:

Typical example (Use both `$Error` & `$c.log()`)

From grammar of JSON parser:

Routine options

%ERROR_TOKEN                              // (1)
%CONTEXT    [[bux::C_ParserOStreamCount]] // (2)
%ON_ERROR   [[                            // (3)
    $c.log(LL_ERROR, $pos, $message);
]]

(1) Awaken the target parser’s error recovery.
- If grammar token $Error, which has C++ token id TID_LEX_Error, cannot possibly be produced by the scanner, $Error appears in the right-hand sides of productions to indicate the context & position where parsing goes wrong, with C++ code annotations to issue parser logs and/or to let parsing move on (to catch more errors in one run);
- Otherwise, simply give the error token a new name, say
  %ERROR_TOKEN [[MyErr]]
  and thus we have token $MyErr and corresponding token id TID_LEX_MyErr to replace $Error and TID_LEX_Error. $Error can then be used to represent real inputs like any other normal token, e.g. $Num, $Id, …
(2) bux::C_ParserOStreamCount is the class currently supported for logging parser messages in chronological order while counting them in 5 error levels, i.e. LL_FATAL, LL_ERROR, LL_WARNING, LL_INFO, LL_VERBOSE. The class is defined in ParserBase.h (implicitly included by every generated parser header). You can of course still have your own context class, either deriving from bux::C_ParserOStreamCount or having it as a data member.
(3) Implement policy method onError() by calling bux::C_ParserOStreamCount::log()

Identify specific errors/warnings & log them

<value> ::= { <members> }   [[ // (1)
    $r = bux::createLex<json::value>(bux::unlex<json::object>($2));
]]

<members> ::= <member>              [[
    json::object t;
    auto &src = bux::unlex<std::pair<std::string,json::value>>($1);
    t.try_emplace(std::move(src.first), std::move(src.second));
    $r = bux::createLex(std::move(t));
]]
<members> ::= <members> , <member>  [[          // (2)
    auto &src = bux::unlex<std::pair<std::string,json::value>>($3);
    bux::unlex<json::object>($1).try_emplace(std::move(src.first), std::move(src.second));
    $r = $1;
]]
<members> ::= <members> , $Error    [[          // (3)
    $c.log(LL_WARNING, $2, "Superfluous ','");  // (4)
    $r = $1;                                    // (5)
]]

<member> ::= $String : <value>          [[      // (6)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), bux::unlex<json::value>($3)});
]]
<member> ::= $String : $Error           [[      // (7)
    $p.onError($3, "Expect <value>");           // (8)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (9)
]]
<member> ::= $String $Error             [[      // (10)
    $p.onError($2, "Expect ':'");               // (11)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (12)
]]
<member> ::= $Error <value> : <value>   [[      // (13)
    $p.onError($1, "Only string key allowed");  // (14)
    $r = bux::createLex(std::pair{std::string{"NonStrKey__"}, bux::unlex<json::value>($4)}); // (15)
]]

(1) In a JSON doc, an object consists of key:value pairs (members) which as a whole are braced by { }
(2) Members are comma(,)-separated.
(3) A trailing comma is not legal, but acceptable (negligible).
(4) Treat a trailing comma as a warning rather than an error. Warning count incremented.
(5) Just move on with the parsing (recover as if nothing happened).
(6) Legit key:value pair.
(7) No value after ':'
(8) Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $3, "Expect <value>");
(9) Pair the key with a null value and move on (recover with a fake value)
(10) No ':' after key
(11) Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $2, "Expect ':'");
(12) Pair the key with a null value and move on (recover with a fake value)
(13) Non-string key
(14) Issue an error. Error count incremented. The following line means the same:
    $c.log(LL_ERROR, $1, "Only string key allowed");
(15) Use "NonStrKey__" as the key to pair with the value after ':' and move on (recover with a fake key)

Boilerplate code to parse JSON stream (source)

    C_Parser            parser{*log};
    bux::C_Screener     preparser(parser, [](auto token){ return token == TID_LEX_Spaces || token == '\n'; });
    C_JSONScanner       scanner(preparser);
    bux::scanFile({}, in, scanner);

    // Check if parsing is ok
    if (const auto n_errs =
        parser.m_context.getCount(LL_FATAL) +
        parser.m_context.getCount(LL_ERROR))      // (1)
        RUNTIME_ERROR("Total {} errors", n_errs);

    // Acceptance
    if (!parser.accepted())
        RUNTIME_ERROR("Incomplete expression!");

    return bux::unlex<value>(parser.getFinalLex());

(1) Any fatal or error message fails the parsing. In other words, parsing is ok with any number of warning, info, or verbose messages. It is totally fine to apply different criteria for what counts as ok.
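
To make the acceptance criterion concrete, here is a self-contained stand-in for a counting logger context. It is not bux::C_ParserOStreamCount: `C_CountingLog`, its `messages()` and `ok()` methods, and the numeric values of the level enum are assumptions made for illustration; only the five level names and `getCount()` come from the text above.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Level names as listed above; their order and values are assumed.
enum E_LogLevel { LL_FATAL, LL_ERROR, LL_WARNING, LL_INFO, LL_VERBOSE };

// Hypothetical context type: keeps messages in chronological order
// and counts them per level.
class C_CountingLog
{
public:
    void log(E_LogLevel level, const std::string &message)
    {
        ++m_counts[level];
        m_messages.push_back(message);
    }
    std::size_t getCount(E_LogLevel level) const { return m_counts[level]; }
    const std::vector<std::string> &messages() const { return m_messages; }

    // Only fatal and error messages fail the parsing.
    bool ok() const { return getCount(LL_FATAL) + getCount(LL_ERROR) == 0; }

private:
    std::array<std::size_t, 5> m_counts{};   // zero-initialized counters
    std::vector<std::string>   m_messages;
};
```

A stricter application could, for instance, also reject warnings by adding `getCount(LL_WARNING)` to the sum in `ok()`.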

Parsing error/warning without `$Error`

Example 1

<members> ::= <members> ,   [[  // (1)
    $c.log(LL_WARNING, $2, "Superfluous ','");
    $r = $1;
]]

(1) Almost the same production as the warning-issuing one exemplified above, except that this one is $Error-free. The effect is completely identical.

Example 2

<value> ::= ( <elements> )  [[
    $p.onError($1, "Tuple (...) not allowed, use array [...] instead");
    $r = bux::createLex<json::value>(bux::unlex<json::array>($2));
]]

`$Error` not error

<member> ::= $String : $Error           [[  // (1)
    $r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}});
]]

(1) Extend JSON syntax by allowing default value null (and not issuing anything)

Generated files

(To be explained)
