A simple lexer based on regular expressions.
Inspired by https://eli.thegreenplace.net/2013/06/25/regex-based-lexical-analysis-in-python-and-javascript
You define the lexing rules and lexery matches them iteratively as a look-up:
>>> import lexery
>>> import re
>>> text = 'crop \t   ( 20, 30, 40, 10 ) ;'
>>>
>>> lexer = lexery.Lexer(
...     rules=[
...         lexery.Rule(identifier='identifier',
...             pattern=re.compile(r'[a-zA-Z_][a-zA-Z_]*')),
...         lexery.Rule(identifier='lpar', pattern=re.compile(r'\(')),
...         lexery.Rule(identifier='number', pattern=re.compile(r'[1-9][0-9]*')),
...         lexery.Rule(identifier='rpar', pattern=re.compile(r'\)')),
...         lexery.Rule(identifier='comma', pattern=re.compile(r',')),
...         lexery.Rule(identifier='semi', pattern=re.compile(r';'))
...     ],
...     skip_whitespace=True)
>>> tokens = lexer.lex(text=text)
>>> assert tokens == [[
...     lexery.Token('identifier', 'crop', 0, 0),
...     lexery.Token('lpar', '(', 9, 0),
...     lexery.Token('number', '20', 11, 0),
...     lexery.Token('comma', ',', 13, 0),
...     lexery.Token('number', '30', 15, 0),
...     lexery.Token('comma', ',', 17, 0),
...     lexery.Token('number', '40', 19, 0),
...     lexery.Token('comma', ',', 21, 0),
...     lexery.Token('number', '10', 23, 0),
...     lexery.Token('rpar', ')', 26, 0),
...     lexery.Token('semi', ';', 28, 0)]]
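Note that the tokens come back grouped by input line, so you always get a list of token lists (here a single line). A minimal sketch of flattening the nested result:

>>> flat_tokens = [token for line_tokens in tokens for token in line_tokens]
>>> assert len(flat_tokens) == 11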
Mind that if a part of the text cannot be matched, a lexery.Error is raised:
>>> import lexery
>>> import re
>>> text = 'some-identifier ( 23 )'
>>>
>>> lexer = lexery.Lexer(
...     rules=[
...         lexery.Rule(identifier='identifier', pattern=re.compile(r'[a-zA-Z_][a-zA-Z_]*')),
...         lexery.Rule(identifier='number', pattern=re.compile(r'[1-9][0-9]*')),
...     ],
...     skip_whitespace=True)
>>> tokens = lexer.lex(text=text)
Traceback (most recent call last):
...
lexery.Error: Unmatched text at line 0 and position 4:
some-identifier ( 23 )
    ^
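A minimal sketch of handling the failure yourself instead of letting the error propagate (here we simply fall back to an empty result):

>>> try:
...     tokens = lexer.lex(text=text)
... except lexery.Error:
...     tokens = []
>>> assert tokens == []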
If you specify an unmatched_identifier, all the unmatched characters are accumulated in tokens with that identifier:
>>> import lexery
>>> import re
>>> text = 'some-identifier ( 23 )-'
>>>
>>> lexer = lexery.Lexer(
...     rules=[
...         lexery.Rule(identifier='identifier', pattern=re.compile(r'[a-zA-Z_][a-zA-Z_]*')),
...         lexery.Rule(identifier='number', pattern=re.compile(r'[1-9][0-9]*')),
...     ],
...     skip_whitespace=True,
...     unmatched_identifier='unmatched')
>>> tokens = lexer.lex(text=text)
>>> assert tokens == [[
...     lexery.Token('identifier', 'some', 0, 0),
...     lexery.Token('unmatched', '-', 4, 0),
...     lexery.Token('identifier', 'identifier', 5, 0),
...     lexery.Token('unmatched', '(', 16, 0),
...     lexery.Token('number', '23', 18, 0),
...     lexery.Token('unmatched', ')-', 21, 0)]]
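This is handy when you want to report every problematic spot in one pass instead of stopping at the first. A minimal sketch of collecting the unmatched tokens, assuming the rule name is exposed as the Token attribute identifier (an assumption, not shown above):

>>> # Assumption: Token exposes its rule name as the attribute ``identifier``.
>>> unmatched = [token
...              for line_tokens in tokens
...              for token in line_tokens
...              if token.identifier == 'unmatched']
>>> assert len(unmatched) == 3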
- Install lexery with pip:

  pip3 install lexery

- Check out the repository.
- In the repository root, create the virtual environment:

  python3 -m venv venv3

- Activate the virtual environment:

  source venv3/bin/activate

- Install the development dependencies:

  pip3 install -e .[dev]
We provide a set of pre-commit checks that run the unit tests, lint the code and check its formatting.
Namely, we use:
- yapf to check the formatting,
- pydocstyle to check the style of the docstrings,
- mypy for static type analysis, and
- pylint for various linter checks.
Run the pre-commit checks locally from an activated virtual environment with the development dependencies installed:

  ./precommit.py

The pre-commit script can also automatically format the code:

  ./precommit.py --overwrite
We follow Semantic Versioning. The version X.Y.Z indicates:
- X is the major version (backward-incompatible),
- Y is the minor version (backward-compatible), and
- Z is the patch version (backward-compatible bug fix).
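In practice this means you can pin lexery to a major version and still receive backward-compatible updates; a hypothetical constraint (version numbers for illustration only):

  pip3 install 'lexery>=1.0,<2.0'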