UniLex is lexical analyzer generator (similar to lex
and flex
) with Unicode support.
It's written in PHP and generates code in PHP.
[WIP] Work in progress
- PHP 8
UniLex library is licensed under MIT license.
Installation is as simple as any other composer library's one:
composer require remorhaz/php-unilex
Let's imagine we want to write a simple calculator and we need a lexer (lexical analyzer) that provides a stream of IDs, numbers and operators. Create a new Composer project and execute following command from project directory:
composer require --dev remorhaz/php-unilex
Next step is creating a lexer specification in LexerSpec.php
file. We use @lexToken
tag in comments to specify regular expression for a token:
<?php
/**
* @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
* @lexTargetClass TokenMatcher
* @lexHeader
*/
const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;
/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context->setNewToken(TOKEN_ID);
/** @lexToken /[+\-*\/]/ */
$context->setNewToken(TOKEN_OPERATOR);
/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);
Next step is building a token matcher from specification:
vendor/bin/unilex LexerSpec.php > TokenMatcher.php
Now we have a compiled token matcher in TokenMatcher.php
file. Let's use it and read all tokens from the buffer:
<?php
use Remorhaz\UniLex\Lexer\TokenFactory;
use Remorhaz\UniLex\Lexer\TokenReader;
use Remorhaz\UniLex\Unicode\CharBufferFactory;
require_once "vendor/autoload.php";
require_once "TokenMatcher.php";
$buffer = CharBufferFactory::createFromString("x+2*3");
$tokenReader = new TokenReader($buffer, new TokenMatcher, new TokenFactory(0xFF));
do {
$token = $tokenReader->read();
echo "Token ID: {$token->getType()}\n";
} while (!$token->isEoi());
On execution this script outputs:
Token ID: 1
Token ID: 2
Token ID: 3
Token ID: 2
Token ID: 3
Token ID: 255
Let's go a bit further and make it possible to retrieve text presentation of every token from input buffer. We need to modify a lexer specification to attach the result to each non-EOI token as an attribute:
<?php
/**
* @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
* @lexTargetClass TokenMatcher
* @lexHeader
*/
const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;
/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context
->setNewToken(TOKEN_ID)
->setTokenAttribute('text', $context->getSymbolString());
/** @lexToken /[+\-*\/]/ */
$context
->setNewToken(TOKEN_OPERATOR)
->setTokenAttribute('text', $context->getSymbolString());
/** @lexToken /[0-9]+/ */
$context
->setNewToken(TOKEN_NUMBER)
->setTokenAttribute('text', $context->getSymbolString());
After rebuilding token matcher with CLI utility we need to modify output cycle of our example program to make it print text with token IDs:
do {
$token = $tokenReader->read();
echo
"Token ID: {$token->getType()}",
$token->isEoi() ? "\n" : " / '{$token->getAttribute('text')}'\n";
} while (!$token->isEoi());
And now program prints:
Token ID: 1 / 'x'
Token ID: 2 / '+'
Token ID: 3 / '2'
Token ID: 2 / '*'
Token ID: 3 / '3'
Token ID: 255
You can use command-line utility to build token matcher from specification:
vendor/bin/unilex path/to/spec/LexerSpec.php path/to/target/TokenMatcher.php --desc="My example matcher."
Specification is a PHP file that is split in parts by DocBlock comments with special tags. There is a special variable $context
that contains context object with \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface
interface. Current implementation also uses int
variable $char
that contains current symbol (TODO: should be moved into context object).
This block can contain namespace
and use
statements that will be used during matcher generation.
This block is executed before the beginning of matching procedure and can be used to initialize some additional variables.
This block is executed on each symbol matched by token's regular expression.
This block is executed on matching given regular expression from the input buffer. Most commonly it just setups new token in context object.
This tag tells parser that matching @lexToken
expression matches only if current lexical mode is mode_name
. Lexical mode can be switched with $context->setMode('mode_name')
method. Using lexical modes allows to have several "sub-grammars" in one specification (i. e. some tokens can be recognized only in comments or strings).
This block is executed if matcher fails to match any of token's regular expressions. By default it just returns false
.