Self-hosted parser/scanner generator from LR grammar with semantic annotations in C++20

buck-yeh/parsergen


  • The parsergen/scannergen combo generates source code files of LR1/GLR parser & scanner from a set of annotated production rules, aka grammar.
  • Both parsergen & scannergen use the same combo (i.e. themselves) to re-generate their own parser & scanner, respectively, to evolve.
  • The generated code must be built with -std=c++2a (C++20).
  • 🧘 Most often you need the combo, but not always:
    • Sometimes reusing an existing scanner with another parser is feasible and cheaper. (%IDDEF_SOURCE)
    • Sometimes a standalone scanner suffices. (see CBrackets)

Installation

in ArchLinux

  1. Make sure you have installed yay or any other pacman wrapper.

  2. Run yay -S parsergen to install.

  3. Run yay -Ql parsergen to see the installed files:

    parsergen /usr/
    parsergen /usr/bin/
    parsergen /usr/bin/grammarstrip
    parsergen /usr/bin/parsergen
    parsergen /usr/bin/scannergen
    parsergen /usr/share/
    parsergen /usr/share/licenses/
    parsergen /usr/share/licenses/parsergen/
    parsergen /usr/share/licenses/parsergen/LICENSE
    parsergen /usr/share/parsergen/
    parsergen /usr/share/parsergen/RE_Suite.txt
  4. Three commands are now at your disposal: grammarstrip, parsergen, and scannergen.

from GitHub on any Linux distro

  1. Make sure you have installed cmake, make, gcc, and git, or their equivalents.

  2. git clone https://github.com/buck-yeh/parsergen.git
    cd parsergen
    cmake -D FETCH_DEPENDEES=1 -D DEPENDEE_ROOT=_deps .
    make -j
    PSGEN_DIR="/full/path/to/current/dir"

    P.S. To install a tagged version, replace main with the tag name.

  3. Three commands at your disposal:

    • $PSGEN_DIR/ParserGen/grammarstrip
    • $PSGEN_DIR/ParserGen/parsergen
    • $PSGEN_DIR/ScannerGen/scannergen
  4. 🤔 But is it possible to just type grammarstrip, parsergen, or scannergen to run them?
    💡 Append the following lines to ~/.bashrc:

    PSGEN_DIR="/full/path/to/parsergen/dir"
    alias grammarstrip="$PSGEN_DIR/ParserGen/grammarstrip"
    alias parsergen="$PSGEN_DIR/ParserGen/parsergen"
    alias scannergen="$PSGEN_DIR/ScannerGen/scannergen"

    And run the following line:

    . ~/.bashrc

    There you go! It will also take effect in subsequently opened console windows and will last after reboot.

A quick guide to parsergen/scannergen combo

When you need to quickly implement a parser for a DSL, whether improvised or deliberately designed, write a grammar file of simple BNF rules with semantic annotations, then let the combo generate the C++ code of the parser & scanner.

Write grammar

example/CalcInt/grammar.txt defines a calculator for basic arithmetic (+ - * / %) on integer constants written in decimal, octal, or hexadecimal.

lexid   Spaces // (1)

//
//      Output Options (2)
//
%CONTEXT [[std::ostream &]]

%ON_ERROR [[
    $c <<"COL#" <<$pos.m_Col <<": " <<$message <<'\n';
]]

%EXTRA_TOKENS   [[dec_num|oct_num|hex_num|spaces]]
//%SHOW_UNDEFINED

//
//      Operator Precedence (3)
//
left   + -
left   * / %
right  ( )

//
//      Grammar with Reduction Code (4)
//
<@> ::= <Expr>  [[
    $r = $1;
]]

<Expr> ::= <Expr> + <Expr>  [[
    bux::unlex<int>($1) += bux::unlex<int>($3);
    $r = $1;
]]
<Expr> ::= <Expr> - <Expr>  [[
    bux::unlex<int>($1) -= bux::unlex<int>($3);
    $r = $1;
]]
<Expr> ::= <Expr> * <Expr>  [[
    bux::unlex<int>($1) *= bux::unlex<int>($3);
    $r = $1;
]]
<Expr> ::= <Expr> / <Expr>  [[
    bux::unlex<int>($1) /= bux::unlex<int>($3);
    $r = $1;
]]
<Expr> ::= <Expr> % <Expr>  [[
    bux::unlex<int>($1) %= bux::unlex<int>($3);
    $r = $1;
]]
<Expr> ::= ( <Expr> )       [[
    $r = $2;
]]
<Expr> ::= $Num             [[
    $r = bux::createLex(dynamic_cast<bux::C_IntegerLex&>(*$1).value<int>());
]]

(1) New lexid

(2) % Option

(3) Operator precedence

(4) Production rule
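
To make the effect of the precedence section (3) concrete, here is a minimal hand-written sketch of what the grammar above is meant to compute. It is not the code parsergen generates and it does not use LR tables; it swaps in plain precedence climbing so that it can stand alone. The names CalcSketch and evalSketch are hypothetical, the `right ( )` line is not modelled, and the literal forms (0x prefix for hexadecimal, leading 0 for octal) are an assumption, since the token definitions in RE_Suite.txt are not shown here.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hand-written evaluator, NOT generated by parsergen. It only mirrors the
// precedence table of the grammar: "left + -" binds weaker than "left * / %".
class CalcSketch {
public:
    explicit CalcSketch(std::string_view text) : m_text(text) {}

    int run() {
        const int value = expr(1);
        skipSpaces();
        if (m_pos != m_text.size())
            throw std::runtime_error("COL#" + std::to_string(m_pos + 1) + ": unexpected input");
        return value;
    }

private:
    std::string_view m_text;
    std::size_t      m_pos = 0;

    void skipSpaces() {
        while (m_pos < m_text.size() && std::isspace(static_cast<unsigned char>(m_text[m_pos])))
            ++m_pos;
    }

    // Binding strength of a binary operator; 0 means "not an operator".
    static int level(char op) {
        switch (op) {
        case '+': case '-':           return 1;
        case '*': case '/': case '%': return 2;
        default:                      return 0;
        }
    }

    // <Expr> ::= ( <Expr> )  or  <Expr> ::= $Num
    int primary() {
        skipSpaces();
        if (m_pos < m_text.size() && m_text[m_pos] == '(') {
            ++m_pos;
            const int inner = expr(1);
            skipSpaces();
            if (m_pos >= m_text.size() || m_text[m_pos] != ')')
                throw std::runtime_error("missing ')'");
            ++m_pos;
            return inner;
        }
        if (m_pos >= m_text.size() || !std::isdigit(static_cast<unsigned char>(m_text[m_pos])))
            throw std::runtime_error("number expected");
        // Assumption: C-style literals. Base 0 lets std::stoi pick
        // decimal, octal (leading 0), or hexadecimal (0x prefix).
        const std::string rest(m_text.substr(m_pos));
        std::size_t used = 0;
        const int value = std::stoi(rest, &used, 0);
        m_pos += used;
        return value;
    }

    // Precedence climbing: only operators of at least minLevel are consumed here.
    int expr(int minLevel) {
        int lhs = primary();
        for (;;) {
            skipSpaces();
            if (m_pos >= m_text.size()) return lhs;
            const char op = m_text[m_pos];
            const int  lv = level(op);
            if (lv == 0 || lv < minLevel) return lhs;
            ++m_pos;
            const int rhs = expr(lv + 1); // lv + 1 makes the operator left-associative
            if ((op == '/' || op == '%') && rhs == 0)
                throw std::runtime_error("division by zero");
            switch (op) {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            default:  lhs %= rhs; break; // '%'
            }
        }
    }
};

// Convenience wrapper (hypothetical name).
inline int evalSketch(std::string_view text) { return CalcSketch(text).run(); }
```

With this sketch, evalSketch("1+2*3") yields 7 because * binds tighter than +, and evalSketch("10-4-3") yields 3 because - groups to the left. These are the same decisions the `left` declarations ask the generated parser to make when resolving the ambiguous `<Expr> ::= <Expr> op <Expr>` rules.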

Generate C++ code of parser & scanner

When package parsergen is installed in ArchLinux

parsergen grammar.txt Parser tokens.txt && \
scannergen Scanner /usr/share/parsergen/RE_Suite.txt tokens.txt

When parsergen is built from github

parsergen grammar.txt Parser tokens.txt && \
scannergen Scanner "$PSGEN_DIR/ScannerGen/RE_Suite.txt" tokens.txt

where

| Parameter | Description |
| --- | --- |
| grammar.txt | Annotated BNF rules and other types of options. |
| Parser | Output file base: parsergen generates Parser.cpp, Parser.h, and ParserIdDef.h |
| Scanner | Output file base: scannergen generates Scanner.cpp and Scanner.h |
| tokens.txt | Output of parsergen & input of scannergen |
| RE_Suite.txt | Recurring token definitions provided with scannergen and used by tokens.txt |

If target source files already exist

💡 Put the commands in a script called reparse for repeated use.

ℹ️ parsergen asks a (y/n) overwrite question for each of its three output files, and scannergen for each of its two.

> ./reparse 
About to parse 'grammar.txt' ...
Total 1 lex-symbols 1 nonterms 9 literals
states = 30	shifts = 106
Spent 0.005232879"
38 out of 106 goto keys erased for redundancy.
ParserIdDef.h already exists. Overwrite it ?(y/n)y
Parser.h already exists. Overwrite it ?(y/n)y
Parser.cpp already exists. Overwrite it ?(y/n)y
Parser created
#pos_args = 4
About to parse '/usr/share/parsergen/RE_Suite.txt' ...
About to parse 'tokens.txt' ...
Scanner.h already exists. Overwrite it ?(y/n)y
Scanner.cpp already exists. Overwrite it ?(y/n)y
> _ 

Use the generated code

ℹ️ from example/CalcInt/main.cpp

Includes

#include "Parser.h"         // C_Parser
#include "ParserIdDef.h"    // TID_LEX_Spaces
#include "Scanner.h"        // C_Scanner

💡 ParserIdDef.h is included here only for TID_LEX_Spaces. It may not be necessary when spaces can't be ignored.

Scanner|screener|parser piped to parse

C_Parser                            parser{/*args of context ctor*/};
bux::C_ScreenerNo<TID_LEX_Spaces>   screener{parser}; // (1)
C_Scanner                           scanner{screener};
bux::C_IMemStream                   in{line}; // or other std::istream derived
bux::scanFile(">", in, scanner);

// Check if parsing is ok
// ... (2)

// Acceptance
if (!parser.accepted())
{
   std::cerr <<"Incomplete expression!\n";
   continue; // or break or return
}

// Apply the result 
// parser.getFinalLex() ... (3)

(1) A screener is a filter on the scanner's output that can drop, change, or aggregate selected tokens before they reach the parser. Leave it out if you don't need it:

C_Parser                            parser{/*args of context ctor*/};
C_Scanner                           scanner{parser};
bux::C_IMemStream                   in{line}; // or other std::istream derived
bux::scanFile(">", in, scanner);

(2) This is the point at which to check the integrity of your context state.

(3) parser.getFinalLex() returns a reference to the merged result, of type bux::LR1::C_LexInfo. In this example the expected result is an integer value of type int, which can be conveniently obtained by calling bux::unlex<T>():

bux::unlex<int>(parser.getFinalLex())

Alternatively, store the result in the user context instance through the reduction code of the production rules instead of calling parser.getFinalLex().
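
The scanner|screener|parser chain from note (1) can be pictured with a small self-contained sketch. It uses hypothetical stand-in types (Token, TokenSink, TinyScanner, DropScreener, CollectingParser) rather than the bux classes, whose interfaces are not shown here, and it only illustrates the idea that each stage pushes tokens to the next one and that a screener can drop selected token ids on the way.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only, not the bux API.
struct Token {
    int         id;
    std::string text;
};

constexpr int TOKEN_SPACES = 1; // plays the role of TID_LEX_Spaces
constexpr int TOKEN_NUMBER = 2;
constexpr int TOKEN_OP     = 3;

// Anything that consumes tokens one at a time.
struct TokenSink {
    virtual ~TokenSink() = default;
    virtual void accept(Token tok) = 0;
};

// Last stage: simply records what reaches the "parser".
struct CollectingParser : TokenSink {
    std::vector<Token> received;
    void accept(Token tok) override { received.push_back(std::move(tok)); }
};

// Middle stage: forwards every token except those whose id is DroppedId.
template<int DroppedId>
class DropScreener : public TokenSink {
public:
    explicit DropScreener(TokenSink &next) : m_next(next) {}
    void accept(Token tok) override {
        if (tok.id != DroppedId)
            m_next.accept(std::move(tok));
    }
private:
    TokenSink &m_next;
};

// First stage: splits text into space runs, digit runs, and single-character operators.
class TinyScanner {
public:
    explicit TinyScanner(TokenSink &next) : m_next(next) {}

    void scan(std::string_view text) {
        std::size_t i = 0;
        while (i < text.size()) {
            const std::size_t start = i;
            const auto ch = static_cast<unsigned char>(text[i]);
            int id = TOKEN_OP;
            if (std::isspace(ch)) {
                id = TOKEN_SPACES;
                while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
            } else if (std::isdigit(ch)) {
                id = TOKEN_NUMBER;
                while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) ++i;
            } else {
                ++i; // any other character is a one-character token
            }
            m_next.accept(Token{id, std::string(text.substr(start, i - start))});
        }
    }
private:
    TokenSink &m_next;
};
```

Wiring `TinyScanner` to a `DropScreener<TOKEN_SPACES>` in front of a `CollectingParser` mirrors the three-stage setup above, and wiring the scanner straight to the parser mirrors the variant without a screener. In the second case the parser also receives the space tokens, which is why a grammar used that way has to account for them itself.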