Skip to content

khchen/tinyre

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Donate

TinyRE

TinyRE is a Nim wrap for a tiny regex engine based on Rob Pike's VM implementation. Compare to other small regex engines, this engine supports unicode and most of common regex syntax, in less than 10K code size (LOC < 1K), and guarantees that input regex will scale O(n) with the size of the string.

Features

  • Support unicode.
  • Support case-insensitive matching.
  • Support global matching.
  • Support most common regex syntax, including:
    • Greedy and non-greedy expressions: *, +, ?, *?, +?, ??.
    • Characters sets: [xyz], [a-z], [^xyz], [^m-z].
    • Meta characters: \s, \S ,\w, \W, \d, \D, \n, \r, \t etc.
    • Ascii or unicode characters: \x00, \u0000, \U00000000
    • Alternation operator: |.
    • Beginning and end assertions: ^, $.
    • Repetition operators: {n}, {n,m}, {n,}.
    • Non-greedy repetition operators: {n,m}?, {n,}?.
    • Group and non-capture group: (...), (?:...).
    • Start-of-word, end-of-word, and word/nonword boundary assertions: \<, \>, \b, \B.

Examples

import tinyre

doAssert match("abc123", re"\d+") == @["123"]
doAssert bounds("abc123", re"\d+") == @[3..5]
doAssert contains("abc123", re"\d+") == true
doAssert startsWith("abc123", re"[a-z]+") == true
doAssert endsWith("abc123", re"\d+") == true
doAssert split("abc123", re"\d+") == @["abc", ""]
doAssert replacef("abc123", re"([a-z]+)(\d+)", "$2$1") == "123abc"

# reG for global matching
doAssert match("abc123", reG".") == @["a", "b", "c", "1", "2", "3"]

# reI for case insensitive matching
doAssert match("abc123", reI"ABC") == @["abc"]

# reU for utf8 matching
doAssert match("中文", reU"..") == @["中文"]

Code Size

Comparing the code size of different regex library.

# gcc: version 11.1.0 MinGW-W64
# compile options: -d:release -d:danger --opt:size -d:lto -d:strip

# test codes
echo contains("abc123def", "123") # for strutils
echo contains("abc123def", re"\d+") # for regex library

# Results
strutils:       65,536 bytes (no dependence)
std/re:         68,608 bytes (dependence: pcre64.dll: 526,336 bytes)
tinyre:         75,264 bytes (no dependence)
nim regex:     296,448 bytes (no dependence)

Performance

In summary, faster than std/re in small string, but slower than std/re in large string. Here is the benchmark result on my computer. The test file and pattern is from https://github.com/mariomka/regex-benchmark.

# small string: "abc123def".contains("\d+")
# large string: 6.71 MB text file
#   email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
#   uri: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
#   ipv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])
# compile options: -d:release -d:danger --opt:speed -d:lto

name ............................... min time      avg time    std dv   runs
tinyre (small string) .............. 0.366 ms      0.379 ms    ±0.018  x1000
std/re (small string) .............. 5.862 ms      6.218 ms    ±0.171   x797
nim-regex (small string) .......... 16.132 ms     17.067 ms    ±0.580   x288
tinyre (large string, email) ..... 140.684 ms    151.663 ms    ±8.625    x33
std/re (large string, email) ...... 44.793 ms     48.884 ms    ±2.716   x102
nim-regex (large string, email) .... 3.680 ms      3.921 ms    ±0.132  x1000
tinyre (large string, uri) ....... 127.465 ms    131.721 ms    ±2.110    x38
std/re (large string, uri) ........ 40.380 ms     42.812 ms    ±1.175   x117
nim-regex (large string, uri) ..... 21.400 ms     22.205 ms    ±0.344   x225
tinyre (large string, ipv4) ...... 182.995 ms    186.441 ms    ±1.057    x27
std/re (large string, ipv4) ........ 4.854 ms      5.965 ms    ±0.903   x838
nim-regex (large string, ipv4) ..... 7.569 ms      7.849 ms    ±0.159   x635

Docs

Reference

License

Copyright (c) Chen Kai-Hung, Ward. All rights reserved.

Donate

If this project help you reduce time to develop, you can give me a cup of coffee :)

paypal