A small tokenizer and tokenizer library written in Rust

andyslucky/Tok2me

Tok2me

Tok2me is a maximal munch tokenizer designed to fit into terminal pipelines and simplify text processing. Any program that can read TSV can use Tok2me to tokenize user input. Tok2me reads a token definition YAML file, tokenizes an input file (or stdin if no input file is provided), and writes the tokenized output to stdout, one token per line, in the format TOKEN_NAME<TAB>token_value.
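To illustrate what "maximal munch" means here, the following is a minimal Python sketch of the general technique, not Tok2me's actual Rust implementation. At each position it tries every expression of every token and keeps the longest match. The `tokenize` function and `TOKEN_DEFS` table are hypothetical names; the table uses the same four tokens as the example definition in this README, written in Python `re` syntax. Breaking ties in favour of the token defined first is an assumption, as is raising an error when nothing matches.

```python
import re

# Hypothetical token table (same four tokens as the README's example definition).
# Patterns are written for Python's `re`, which may differ from Tok2me's regex engine.
TOKEN_DEFS = [
    ("COMMA", [r","]),
    ("WS", [r"[ \t]+"]),
    ("NL", [r"\n", r"\r\n"]),
    ("STRING", [r'"([\s\S]*?)"']),
]


def tokenize(text, token_defs=TOKEN_DEFS):
    """Split `text` into (token_type, value) pairs using maximal munch."""
    compiled = [(name, [re.compile(e) for e in exprs]) for name, exprs in token_defs]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, patterns in compiled:
            for pattern in patterns:
                match = pattern.match(text, pos)  # anchored at the current position
                # Keep only strictly longer matches, so on a tie the
                # earlier-defined token wins (an assumption of this sketch).
                if match and match.end() > best_end:
                    best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Requiring a strictly longer match also rejects zero-length matches, so the loop always advances.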

e.g. Token definition:

# tokens.yaml
# Example token definition document
# Each token has a name and a list of regular expressions that it can match
ignore: []
tokens:
  - 
    token_type: "COMMA"
    exprs: [","]
  - 
    token_type: "WS"
    exprs: ["[ \\t]+"]
  -
    token_type: "NL"
    exprs: ["\\n","\\r\\n"]
  -
    token_type: "STRING"
    exprs: ["\"([\\s\\S]*?)\""]

e.g. Running tok2me with the provided token file on standard input:

printf "\"This is a string\" ,\r\n" | tok2me.exe -t tokens.yaml

output:

# This is the tokenized output from tok2me.
# Lines beginning with '#' are comments and may be skipped!
STRING	"This is a string"
WS	 
COMMA	,
NL	\r\n
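A downstream program only needs to skip comment lines and split each remaining line at the first tab. The sketch below shows one way to do that in Python; `parse_tok2me_output` is a hypothetical helper name, not part of Tok2me. It assumes that values never contain a raw newline (the example output shows the NL token as the escaped text `\r\n`) and it leaves such escape sequences undecoded.

```python
def parse_tok2me_output(lines):
    """Turn Tok2me-style TSV lines into a list of (token_type, value) pairs."""
    tokens = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line terminator, keep other whitespace
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comment lines carry no token
        name, sep, value = line.partition("\t")  # split at the first tab only
        if not sep:
            raise ValueError(f"expected TOKEN_NAME<TAB>token_value, got {line!r}")
        tokens.append((name, value))
    return tokens
```

In a pipeline, the function can be fed `sys.stdin` directly, since it accepts any iterable of lines.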
