UniLex

UniLex is a lexical analyzer generator (similar to lex and flex) with Unicode support. It is written in PHP and generates PHP code.

[WIP] Work in progress

Requirements

PHP 8

License

The UniLex library is licensed under the MIT license.

Installation

Install UniLex with Composer, like any other Composer library:

composer require remorhaz/php-unilex

Usage

Quick start in example

Let's imagine we want to write a simple calculator, and we need a lexer (lexical analyzer) that provides a stream of IDs, numbers and operators. Create a new Composer project and execute the following command from the project directory:

composer require --dev remorhaz/php-unilex

The next step is to create a lexer specification in the LexerSpec.php file. We use the @lexToken tag in DocBlock comments to specify the regular expression for a token; the code that follows each comment runs when that expression matches:

<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context->setNewToken(TOKEN_ID);

/** @lexToken /[+\-*\/]/ */
$context->setNewToken(TOKEN_OPERATOR);

/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);

The next step is to build a token matcher from the specification:

vendor/bin/unilex LexerSpec.php > TokenMatcher.php

Now we have a compiled token matcher in the TokenMatcher.php file. Let's use it to read all tokens from the buffer:

<?php

use Remorhaz\UniLex\Lexer\TokenFactory;
use Remorhaz\UniLex\Lexer\TokenReader;
use Remorhaz\UniLex\Unicode\CharBufferFactory;

require_once "vendor/autoload.php";
require_once "TokenMatcher.php";

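// Wrap the input string in a character buffer.
$buffer = CharBufferFactory::createFromString("x+2*3");
// 0xFF is the type given to the end-of-input (EOI) token; it is printed as 255.
$tokenReader = new TokenReader($buffer, new TokenMatcher(), new TokenFactory(0xFF));

// Read tokens one by one until the EOI token has been read and printed.
do {
    $token = $tokenReader->read();
    echo "Token ID: {$token->getType()}\n";
} while (!$token->isEoi());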

When executed, this script outputs:

Token ID: 1
Token ID: 2
Token ID: 3
Token ID: 2
Token ID: 3
Token ID: 255
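
To see where these IDs come from, the three token rules can be approximated in plain PHP. This is only a rough sketch and not UniLex code: sketchTokenize() and the SKETCH_* constants are hypothetical names, it uses PCRE on bytes instead of a generated Unicode-aware matcher, and it assumes that the first rule that matches at the current position wins, which may differ from how the generated matcher resolves rules.

```php
<?php

// Plain-PHP approximation of the three @lexToken rules; NOT UniLex code.
// sketchTokenize() and the SKETCH_* constants are hypothetical names.
const SKETCH_TOKEN_ID = 1;
const SKETCH_TOKEN_OPERATOR = 2;
const SKETCH_TOKEN_NUMBER = 3;
const SKETCH_TOKEN_EOI = 0xFF;

/**
 * @return list<int> token types, always ending with the EOI type
 */
function sketchTokenize(string $input): array
{
    // \G anchors each pattern at the offset passed to preg_match().
    $rules = [
        SKETCH_TOKEN_ID => '/\G[a-zA-Z][0-9a-zA-Z]*/',
        SKETCH_TOKEN_OPERATOR => '/\G[+\-*\/]/',
        SKETCH_TOKEN_NUMBER => '/\G[0-9]+/',
    ];
    $types = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $type => $pattern) {
            if (preg_match($pattern, $input, $match, 0, $offset) === 1) {
                $types[] = $type;
                $offset += strlen($match[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("Unexpected input at offset {$offset}");
        }
    }
    // The end of input is reported as one final token.
    $types[] = SKETCH_TOKEN_EOI;

    return $types;
}

echo implode(' ', sketchTokenize('x+2*3')), "\n"; // prints "1 2 3 2 3 255"
```

The sketch yields the same sequence of types as the generated matcher does for this input: an ID, then alternating operators and numbers, then the end-of-input type.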

Let's go a bit further and make it possible to retrieve the textual representation of every token from the input buffer. We need to modify the lexer specification so that it attaches the matched text to each non-EOI token as an attribute:

<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context
    ->setNewToken(TOKEN_ID)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[+\-*\/]/ */
$context
    ->setNewToken(TOKEN_OPERATOR)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[0-9]+/ */
$context
    ->setNewToken(TOKEN_NUMBER)
    ->setTokenAttribute('text', $context->getSymbolString());

After rebuilding the token matcher with the CLI utility, we need to modify the output loop of our example program so that it prints the text along with the token IDs:

do {
    $token = $tokenReader->read();
    echo
        "Token ID: {$token->getType()}",
        $token->isEoi() ? "\n" : " / '{$token->getAttribute('text')}'\n";
} while (!$token->isEoi());

Now the program prints:

Token ID: 1 / 'x'
Token ID: 2 / '+'
Token ID: 3 / '2'
Token ID: 2 / '*'
Token ID: 3 / '3'
Token ID: 255

CLI

You can use the command-line utility to build a token matcher from a specification:

vendor/bin/unilex path/to/spec/LexerSpec.php path/to/target/TokenMatcher.php --desc="My example matcher."

Specification

A specification is a PHP file that is split into parts by DocBlock comments with special tags. A special variable $context contains the context object, which implements the \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface interface. The current implementation also uses an int variable $char that contains the current symbol (TODO: this should be moved into the context object).

@lexHeader

This block can contain namespace and use statements that will be used during matcher generation.

@lexBeforeMatch

This block is executed before the matching procedure begins and can be used to initialize additional variables.

@lexOnTransition

This block is executed on each symbol matched by a token's regular expression.

@lexToken /regexp/

This block is executed when the given regular expression matches the input buffer. Most commonly it just sets up a new token in the context object.

@lexMode 'mode_name'

This tag tells the parser that the corresponding @lexToken expression matches only if the current lexical mode is mode_name. The lexical mode can be switched with the $context->setMode('mode_name') method. Lexical modes make it possible to have several "sub-grammars" in one specification (e.g. some tokens can be recognized only in comments or strings).
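
The idea of lexical modes can be modelled in plain PHP. The following is a minimal sketch of the concept and not UniLex code: sketchModeTokenize(), the token names and the mode names 'default' and 'string' are all hypothetical. Rules are grouped by mode, only the rules of the current mode are tried, and a rule may switch the mode after it matches.

```php
<?php

// Plain-PHP model of lexical modes; NOT UniLex code.
// sketchModeTokenize(), the token names and the mode names are hypothetical.

/**
 * @return list<array{string, string}> pairs of [token name, matched text]
 */
function sketchModeTokenize(string $input): array
{
    // Each rule is [token name, pattern, mode to switch to (or null)].
    $rules = [
        'default' => [
            ['word', '/\G[a-z]+/', null],
            ['quote', '/\G"/', 'string'],
            ['space', '/\G +/', null],
        ],
        'string' => [
            ['quote', '/\G"/', 'default'],
            ['text', '/\G[^"]+/', null],
        ],
    ];
    $mode = 'default';
    $tokens = [];
    $offset = 0;
    while ($offset < strlen($input)) {
        // Only the rules of the current mode are considered.
        foreach ($rules[$mode] as [$name, $pattern, $nextMode]) {
            if (preg_match($pattern, $input, $match, 0, $offset) === 1) {
                $tokens[] = [$name, $match[0]];
                $offset += strlen($match[0]);
                $mode = $nextMode ?? $mode;
                continue 2; // go on with the next token
            }
        }
        throw new RuntimeException("No rule matches in mode '{$mode}' at offset {$offset}");
    }

    return $tokens;
}

foreach (sketchModeTokenize('say "a b"') as [$name, $text]) {
    echo "{$name}: '{$text}'\n";
}
```

Inside the quotes the input a b becomes a single text token, because the word and space rules belong to the 'default' mode and are not tried while the 'string' mode is active.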

@lexOnError

This block is executed if the matcher fails to match any of the tokens' regular expressions. By default it just returns false.
