parsergen -h
USAGE: parsergen <Grammar> <Filename> <TokensOutput> [-I ARG] [-a] [--with-bom] [-h]
DESCRIPTION:
LR(1)/GLR-Parser Generator command line tool v1.7.4
Where:
1. <Grammar> is a grammar definition file.
2. Generated C++ source files are named as:
<Filename>IdDef.h - Lexical token enumerations
<Filename>.h - Header of parser class
<Filename>.cpp - Implementation of parser class
3. Generated token definitions are written to <TokensOutput> to feed scannergen
VALID FLAGS:
-I, --include-dir ARG
Search path of #include directive within <Grammar>
-a, --yes-to-all
Quietly overwrite all existing output files
--with-bom
Prefix BOM to all output files
-h, --help
Display this help and exit
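Based on the usage line above, a typical invocation (the grammar and output file names here are hypothetical) looks like:

```shell
# Reads MyLang.txt, writes ParserIdDef.h, Parser.h, Parser.cpp,
# and emits token definitions to tokens.txt for scannergen
parsergen MyLang.txt Parser tokens.txt -I grammar/include -a
```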
A grammar definition file consists of lines. There are 7 types of lines, which can be mixed in almost any order. No line type is mandatory. The cases where line order does matter are highlighted by 💣
Empty lines, together with
/*
 * Multi-lined Comment
 */
// Single-lined Comment
are all helpful to enhance readability. Put as many as you like, wherever you like.
💡
|
Comments can also disable other types of lines (and later re-enable them just as quickly): //%SHOW_UNDEFINED
|
<NID> ::= (<NID>|AnythingElse)*
<NID> ::= (<NID>|AnythingElse)* [[
? Multi-lined reduction code in C++
]]
💡
|
NID is short for Non-terminal ID. |
<All> ::= <Line> // (1)
<All> ::= <All> "\n" <Line>
<Line> ::= <Production> <Semantics> [[ // (2)
auto &c = $c;
if (!c.testCond())
return;
auto &prod = dynamic_cast<C_Production&>(*$1);
if (!c.addProduction(prod, bux::tryUnlex<C_Semantic>($2))) [[unlikely]] // (3)
$p.onError($1, "Production re-defined:\n"
"\t" + prod.str());
]]
- No [[ ]], no reduction.
- Doubly-bracketed [[reduction code]] may contain the following mnemonics:
  - $p : Reference to the base parser class, of type either bux::LR1::C_Parser & or bux::GLR::C_Parser &
  - $P : Reference to the generated parser class
  - $c : Reference to the context instance of the generated parser class, if the context type is defined by option %CONTEXT
  - $r : The result token buffer, which can be freely assigned
  - $1, $2, $3, … denote the 1st, 2nd, 3rd, … operand to the right of ::= respectively, terminal or non-terminal.
- C++ attributes can still be used in a reduction block.
- parsergen deals only with context-free grammars. Therefore, exactly one non-terminal is allowed to the left of ::= per production.
- A reduced non-terminal operand combines with zero or more terminal/non-terminal neighbors, reduces again into an 'upper' non-terminal, … and eventually reduces into <@>, the root non-terminal, aka the start symbol. The whole parsed input string is thereby deemed accepted.
- When there is no production rule at all, the grammar defines a language accepting only the empty string, as demonstrated by MinLang.
🔥
|
💣 parsergen has to know the start symbol before calculation. If there is a production like <@> ::= … , then <@> is the start symbol; otherwise, the left side of ::= in the first parsed production, say <All> , becomes the start symbol, and an extra production <@> ::= <All> is added implicitly.
|
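For instance, in a hypothetical grammar fragment like the following, <All> becomes the start symbol because no <@> production is given:

```
<All> ::= <Line>             // first parsed production: <All> is picked as the start symbol
<All> ::= <All> "\n" <Line>
// parsergen implicitly adds:  <@> ::= <All>
```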
%Id
%Id [[Single-lined contents]]
%Id [[
? Multi-lined contents
]]
%SHOW_UNDEFINED
%CONTEXT [[C_BNFContext]]
%HEADERS_FOR_HEADER [[
#include "BNFContext.h" // C_BNFContext
]]
Known Option | Output To | Action / Meaning |
---|---|---|
|
ParserIdDef.h |
#include "Path/To/IdDef.h" ℹ️ Defining this option means the parser will work with an existing scanner. "Path/To/IdDef.h" should contain all token ids of the scanner, which also happen to include all token ids needed by the target parser. |
|
Parser.cpp |
If |
%UPCAST_TOKEN |
Parser.cpp |
Implement the following policy method with valid mnemonics: bool C_ParserPolicy::changeToken(T_LexID &token, C_LexPtr &attr) const. An attempt to break down a scanned input token and take its first char as the new input to resume parsing. Example
%UPCAST_TOKEN [[
if (isascii($token) &&
!iscntrl($token) &&
!isalnum($token) &&
!isspace($token))
{
$attr.assign(bux::createLex<std::string>(1,char($token)), true);
$token = TID_LEX_Operator;
return true;
}
return false;
]] |
%ON_ERROR |
Parser.cpp |
Implement the following policy method with valid mnemonics: void C_ParserPolicy::onError(
bux::LR1::C_Parser &,
const bux::C_SourcePos &pos,
const std::string &message) const Example 1
%CONTEXT [[C_Context]]
%ON_ERROR [[
$c.issueError(LL_ERROR, $pos, $message);
]] Example 2
%CONTEXT [[std::ostream &]]
%ON_ERROR [[
$c <<'(' <<$pos.m_Line <<',' <<$pos.m_Col <<"): " <<$message <<'\n';
]] |
%SHOW_UNDEFINED |
Parser.cpp |
When defined, for every other known option that is not defined, a comment line like // %FOO undefined (expanded here otherwise) is emitted. Read all 3 output files of MinLang to find the exact locations of such comment lines for the various known options. |
%CONTEXT |
Parser.cpp |
Type of public member data |
|
ParserIdDef.h |
This option tells |
%HEADERS_FOR_HEADER |
Parser.h |
Output before entering namespace scope of the target parser class: // %HEADERS_FOR_HEADER expanded BEGIN
...(your code)...
// %HEADERS_FOR_HEADER expanded END |
%PRECLASSDECL |
Parser.h |
Output within namespace scope of the target parser class and before the class is defined: // %PRECLASSDECL expanded BEGIN
...(your code)...
// %PRECLASSDECL expanded END |
%INCLASSDECL |
Parser.h |
Output within the definition of target parser class and right after the common members are declared: // %INCLASSDECL expanded BEGIN
...(your code)...
// %INCLASSDECL expanded END ℹ️ If |
%HEADERS_FOR_CPP |
Parser.cpp |
Output after the banner comment and before any non-comment code: // %HEADERS_FOR_CPP expanded BEGIN
...(your code)...
// %HEADERS_FOR_CPP expanded END |
%LOCAL_CPP |
Parser.cpp |
Output within anonymous namespace scope and between common // %LOCAL_CPP expanded BEGIN
...(your code)...
// %LOCAL_CPP expanded END |
%SCOPED_CPP_HEAD |
Parser.cpp |
Output within namespace scope of the target parser class and before ctor/method bodies of the class: // %SCOPED_CPP_HEAD expanded BEGIN
...(your code)...
// %SCOPED_CPP_HEAD expanded END |
%SCOPED_CPP_TAIL |
Parser.cpp |
Output within namespace scope of the target parser class and after ctor/method bodies of the class: // %SCOPED_CPP_TAIL expanded BEGIN
...(your code)...
// %SCOPED_CPP_TAIL expanded END |
|
tokens.txt |
Output as the first part of tokens.txt |
%EXTRA_TOKENS |
tokens.txt |
|-separated token identifiers, which are again |-combined with the generated token definitions. ℹ️ Example with multiple tokens. Input
%EXTRA_TOKENS [[dec_num|hex_num|identifier|c_char|c_str|spaces]] Output
the_very_last = …(generated keywords & compound operators)… | dec_num|hex_num|identifier|c_char|c_str|spaces|bracketed| …(the rest)… |
%HEADERS_FOR_SCANNER_CPP |
tokens.txt |
Output as part of %HEADERS_FOR_CPP [[
#include "ParserIdDef.h"
// %HEADERS_FOR_SCANNER_CPP expanded BEGIN
#include "BracketBalance.h"
// %HEADERS_FOR_SCANNER_CPP expanded END
using namespace Main;
]] |
%LOCALS_FOR_SCANNER_CPP |
tokens.txt |
Output as %LOCAL_ACTION_DEFS [[
// %LOCALS_FOR_SCANNER_CPP expanded BEGIN
...(your code)...
// %LOCALS_FOR_SCANNER_CPP expanded END
]] |
|
When defined, the conflicted actions that turn the target parser into a GLR parser will not be printed to the console throughout parser generation. |
lexid Id1 Id2 …
lexid Spaces
- If you lexid an identifier, say foo, and you also use $foo in production rules, then the lexid line is completely redundant.
- Currently the only recurring use case is the example above, where the ready-made "RE_Suite.txt" defines consecutive space chars, C-style comments, and C++-style comments to be turned into a Spaces token (specifically a lexical token with id TID_LEX_Spaces), and the target language (parser) tries to ignore all spaces. This is when the screener comes in handy.
C_Parser parser;
bux::C_ScreenerNo<TID_LEX_Spaces> screener{parser};
C_Scanner scanner{screener};
bux::scanFile(">", in, scanner);
// Test acceptance
if (!parser.accepted())
{
std::cerr <<"Incomplete expression!\n";
continue; // or break or return
}
// Apply the result
// ... parser.getFinalLex()
Seriously, these are not preprocessor directives; they are processed in the same pass as other types of lines. They just happen to use the same old syntax:
Directive | Meaning
---|---
#include "Foo.txt" | Replace this line with lines read from file "Foo.txt"
#ifdef Bar | 💣 The lines that follow take effect only if option Bar is defined
#ifndef Bar | 💣 The lines that follow take effect only if option Bar is not defined
#else | 💣 Reverses the condition of the matching #ifdef / #ifndef
#endif | 💣 Ends the matching #ifdef / #ifndef block
❗
|
💣 Pairing rules of #ifdef , #ifndef , #else , #endif comply with C++ preprocessor counterparts
|
💡
|
No #if (expr) and #elif (expr), because relevant scenarios are yet to be seen and the implementation effort is estimated to be high.
|
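A sketch of how these directives could be combined, assuming a grammar that includes a shared rule file and defines an option named DEBUG somewhere (both names are hypothetical):

```
#include "CommonRules.txt"   // spliced in, searched via the -I include path
#ifdef DEBUG
%SHOW_UNDEFINED              // takes effect only when option DEBUG is defined
#endif
```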
class (<namespace>::)*<class_name>
class Main::C_BNFParser
- At most one such line is allowed.
- When absent, the parser class gets a default name formatted from the base name of the 2nd command-line argument, i.e. <Filename>, except that every char which is neither a letter nor a digit is replaced by '_'. For example:
  - If <Filename> is Parser, the class name will be C_Parser.
  - If <Filename> is Script/Parser, the class name will still be C_Parser.
  - If <Filename> is Parser-2nd, the class name will be C_Parser_2nd.
- This becomes a problem only when an application uses multiple parsergen-generated parsers.
- Use of namespace(s) is encouraged when the generated parser is part of a library.
Token $Error, which is assured never to be generated by the scanner, is used in some productions. The parser always tries first to match productions not using $Error, to shift or reduce. Only if that attempt fails does the parser start rolling back the process (or state stack), seeking the first feasible point to insert $Error, i.e. matching one of the productions using $Error, so that parsing can move on. That's all there is to the current error recovery, folks!
A supported way to log parser messages is to declare a user context type which supports methods to do so, as illustrated below:
From grammar of JSON parser:
%ERROR_TOKEN // (1)
%CONTEXT [[bux::C_ParserOStreamCount]] // (2)
%ON_ERROR [[ // (3)
$c.log(LL_ERROR, $pos, $message);
]]
1. Awakens the target parser's error recovery. If the grammar token $Error, which has C++ token id TID_LEX_Error, cannot possibly be produced by the scanner, $Error appears in right halves of productions to indicate the context & position where parsing goes wrong, with C++ code annotations to issue parser logs and/or to make parsing move on (to catch more errors in one run). Otherwise, simply assign the error token a new name, say %ERROR_TOKEN MyErr, and thus we have token $MyErr and corresponding token id TID_LEX_MyErr to replace $Error and TID_LEX_Error; then use $Error to represent real inputs like any other normal token, e.g. $Num, $Id, …
2. bux::C_ParserOStreamCount currently supports logging parser messages in chronological order while counting them in 5 error levels, i.e. LL_FATAL, LL_ERROR, LL_WARNING, LL_INFO, LL_VERBOSE. The class is defined in ParserBase.h (implicitly included by every generated parser header). Surely you can still have your own context class, either deriving from bux::C_ParserOStreamCount or having it as member data.
3. Implement policy method onError() by calling bux::C_ParserOStreamCount::log()
<value> ::= { <members> } [[ // (1)
$r = bux::createLex<json::value>(bux::unlex<json::object>($2));
]]
<members> ::= <member> [[
json::object t;
auto &src = bux::unlex<std::pair<std::string,json::value>>($1);
t.try_emplace(std::move(src.first), std::move(src.second));
$r = bux::createLex(std::move(t));
]]
<members> ::= <members> , <member> [[ // (2)
auto &src = bux::unlex<std::pair<std::string,json::value>>($3);
bux::unlex<json::object>($1).try_emplace(std::move(src.first), std::move(src.second));
$r = $1;
]]
<members> ::= <members> , $Error [[ // (3)
$c.log(LL_WARNING, $2, "Superfluous ','"); // (4)
$r = $1; // (5)
]]
<member> ::= $String : <value> [[ // (6)
$r = bux::createLex(std::pair{bux::unlex<std::string>($1), bux::unlex<json::value>($3)});
]]
<member> ::= $String : $Error [[ // (7)
$p.onError($3, "Expect <value>"); // (8)
$r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (9)
]]
<member> ::= $String $Error [[ // (10)
$p.onError($2, "Expect ':'"); // (11)
$r = bux::createLex(std::pair{bux::unlex<std::string>($1), json::value{}}); // (12)
]]
<member> ::= $Error <value> : <value> [[ // (13)
$p.onError($1, "Only string key allowed"); // (14)
$r = bux::createLex(std::pair{std::string{"NonStrKey__"}, bux::unlex<json::value>($4)}); // (15)
]]
1. In a JSON doc, an object consists of key:value pairs (members) which as a whole are braced by { }
2. Members are comma(,)-separated.
3. A trailing comma is not legal, but acceptable (negligible).
4. Treat a trailing comma as a warning rather than an error. Warning count incremented.
5. Just move the parsing on (recover as if nothing happened).
6. Legit key:value pair.
7. No value after ':'
8. Issue an error. Error count incremented. The following line means the same: $c.log(LL_ERROR, $3, "Expect <value>");
9. Pair the key with a null value and move on (recover with a fake value)
10. No ':' after the key
11. Issue an error. Error count incremented. The following line means the same: $c.log(LL_ERROR, $2, "Expect ':'");
12. Pair the key with a null value and move on (recover with a fake value)
13. Non-string key
14. Issue an error. Error count incremented. The following line means the same: $c.log(LL_ERROR, $1, "Only string key allowed");
15. Use "NonStrKey__" as the key to pair with the value after ':' and move on (recover with a fake key)
C_Parser parser{*log};
bux::C_Screener preparser(parser, [](auto token){ return token == TID_LEX_Spaces || token == '\n'; });
C_JSONScanner scanner(preparser);
bux::scanFile({}, in, scanner);
// Check if parsing is ok
if (const auto n_errs =
parser.m_context.getCount(LL_FATAL) +
parser.m_context.getCount(LL_ERROR)) // (1)
RUNTIME_ERROR("Total {} errors", n_errs);
// Acceptance
if (!parser.accepted())
RUNTIME_ERROR("Incomplete expression!");
return bux::unlex<value>(parser.getFinalLex());
1. Any fatal or error message fails the parsing. In other words, parsing is OK with any number of warning, info, or verbose messages. But it is totally fine to apply different criteria for what is deemed OK.
<members> ::= <members> , [[ // (1)
$c.log(LL_WARNING, $2, "Superfluous ','");
$r = $1;
]]
1. Almost the same production as the warning-issuing one exemplified above, except this one is $Error-free. The effect is completely identical.
<value> ::= ( <elements> ) [[
$p.onError($1, "Tuple (...) not allowed, use array [...] instead");
$r = bux::createLex<json::value>(bux::unlex<json::array>($2));
]]