Get identifiers, names, paths, URLs and words from the command output.
The xontrib-output-search extension for the xonsh shell uses this library.
If you like the idea, click ⭐ on the repo and stay tuned by watching releases.
```shell
pip install -U tokenize-output
```
You can use the `tokenize-output` command as well as import the tokenizers in Python:
```python
from tokenize_output.tokenize_output import *

tokenizer_split("Hello world!")
# {'final': set(), 'new': {'Hello', 'world!'}}
```
echo "Try https://github.com/xxh/xxh" | tokenize-output -p
# Try
# https://github.com/xxh/xxh
echo '{"Try": "xonsh shell"}' | tokenize-output -p
# Try
# shell
# xonsh
# xonsh shell
```shell
echo 'PATH=/one/two:/three/four' | tokenize-output -p
# /one/two
# /one/two:/three/four
# /three/four
# PATH
```
A tokenizer is a function that extracts tokens from text. The built-in tokenizers are tried in priority order:
| Priority | Tokenizer | Text example | Tokens |
|---|---|---|---|
| 1 | dict | `{"key": "val as str"}` | `key`, `val as str` |
| 2 | env | `PATH=/bin:/etc` | `PATH`, `/bin:/etc`, `/bin`, `/etc` |
| 3 | split | `Split me \n now!` | `Split`, `me`, `now!` |
| 4 | strip | `{Hello}!.` | `Hello` |
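For example, row 3 of the table can be reproduced with the `tokenizer_split` function shown earlier; the exact `final`/`new` placement in the comment is an assumption by analogy with that first example:

```python
from tokenize_output.tokenize_output import tokenizer_split

# Row 3 of the table: split the text on whitespace.
tokenizer_split("Split me \n now!")
# Assumed result, by analogy with the earlier example:
# {'final': set(), 'new': {'Split', 'me', 'now!'}}
```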
You can create your own tokenizer and add it to `tokenizers_all` in `tokenize_output.py`.
Tokenizing is a recursive process in which every tokenizer returns `final` and `new` tokens. The `final` tokens go directly to the resulting list of tokens. The `new` tokens are passed to all tokenizers again to find further tokens. As a result, if the output mixes JSON and env-style data, both will be found and tokenized appropriately.
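A minimal sketch of that loop, independent of the library's internals (the function name, variables, and the "first tokenizer that matches wins" rule below are illustrative assumptions, not the library's API):

```python
def tokenize_all(text, tokenizers):
    """Illustrative sketch of the recursion described above; not the library's API."""
    result, pending, seen = set(), {text}, set()
    while pending:
        chunk = pending.pop()
        if chunk in seen:  # avoid tokenizing the same text twice
            continue
        seen.add(chunk)
        for tokenizer in tokenizers:  # tokenizers are tried in priority order
            out = tokenizer(chunk)
            result |= out['final']    # 'final' tokens go straight to the result
            pending |= out['new']     # 'new' tokens are fed to all tokenizers again
            if out['final'] or out['new']:
                break  # assumption: the first tokenizer that produces tokens handles the chunk
    return result
```

Note that the real library also keeps some intermediate `new` tokens in the result (the `PATH` example above shows `/one/two:/three/four` in the output); the sketch omits that detail for brevity.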
You can start from the `env` tokenizer as a template:

- Prepare the regexp.
- Prepare the tokenizer function (see the sketch after this list).
- Add the function to the `tokenizers_all` list and to the preset.
- Add a test.
- Now you can test and debug (see below).
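For illustration, here is a minimal sketch of what such a tokenizer might look like. The `{'final', 'new'}` return shape follows the contract shown above, while `tokenizer_version` and its regexp are hypothetical names invented for this example:

```python
import re

# Hypothetical tokenizer: extract semantic version numbers like 1.2.3.
_VERSION_RE = re.compile(r'\b\d+\.\d+\.\d+\b')

def tokenizer_version(text):
    versions = set(_VERSION_RE.findall(text))
    return {
        'final': versions,  # complete tokens: go straight to the result
        'new': set(),       # nothing left for the other tokenizers
    }

tokenizer_version("python 3.11.4")
# {'final': {'3.11.4'}, 'new': set()}
```

After that, add the function to `tokenizers_all` in `tokenize_output.py` and cover it with a test in `tests/`.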
Run the tests:

```shell
cd ~
git clone https://github.com/anki-code/tokenize-output
cd tokenize-output
python -m pytest tests/
```
To debug the tokenizer:

```shell
echo "Hello world" | ./tokenize-output -p
```