GitHub

An implementation of POSIX Extended Regular Expressions in Hoon.

Files

lib/regex.hoon: A regular expressions library.
tests/lib/regex.hoon: A test suite for the library.

Usage

Import with Ford per usual (/+ regex), or try it out in the dojo with =regex -build-file %/lib/regex/hoon. Regex syntax is Posix ERE, with some extensions (see below).

When writing regexes in Hoon, take care to properly escape the characters that need escaping. A regex that matches five characters is written in Hoon as ".\{5}" to avoid string interpolation; a regex that matches a literal { is written as "\\\{". Some sequences, like the backreference \2, can have a deceptive but valid meaning if not escaped: "(.(.))\2" is a syntax error, but "(.(.))\20" encodes a space character. This backrefence should be written as \\2.

Regexes involving nested repetition, such as (\w+\.*)* (or its tape representation, "(\\w+\\.*)*"), can be especially slow to execute because of the combinatorial explosion of valid groupings. For example, abcd could be matched as {abcd}, {abc d}, {ab cd}, {ab c d}, and so forth. This can be usually be remedied by turning * into + within the capture group, preventing overlapping match candidates.

Types

Regex operations usually produce either a $range (some text and its location within the subject) or a $match (a map of ranges by capture number).

+$  range  (pair pint tape)
+$  match  (pair range (map @u range))

Every function that produces a $match, with all subcaptures, has a variant that produces only the total $range of matched text.

Operations

`++valid:regex`

Determine if a regular expression is valid.

Accepts: a regular expression

Produces: %.y if the regex is valid; %.n otherwise

> (valid:regex "")
%.y
> (valid:regex "(?)")
%.n
> (valid:regex "[a-b-c]")
%.n

`++run:regex`

Find the first match of a regular expression.

Accepts: a regular expression and a match target

Produces: (unit match)

Crashes if the regular expression is invalid.

> (run:regex ".+" "")
~
> (run:regex "[a-z]+" "Abc")
[~ {[p=0 q=[p=[p=[p=1 q=2] q=[p=1 q=4]] q="bc"]]}]
> (run:regex "([a-z])+" "abc")
[ ~
  { [p=1 q=[p=[p=[p=1 q=3] q=[p=1 q=4]] q="c"]]
    [p=0 q=[p=[p=[p=1 q=1] q=[p=1 q=4]] q="abc"]]
  }
]

`++ran:regex`

Find the first match crashing if there is none.

Accepts: a regular expression and a match target

Produces: match

Crashes if the regular expression is invalid or of there is no match.

> (ran:regex ".+" "")
...
dojo: hoon expression failed
> (ran:regex "[a-z]+" "Abc")
{[p=0 q=[p=[p=[p=1 q=2] q=[p=1 q=4]] q="bc"]]}
> (ran:regex "([a-z])+" "abc")
{ [p=1 q=[p=[p=[p=1 q=3] q=[p=1 q=4]] q="c"]]
  [p=0 q=[p=[p=[p=1 q=1] q=[p=1 q=4]] q="abc"]]
}

`++rut:regex`

Variant of +run that produces (unit range) instead of (unit match).

`++rat:regex`

Variant of +ran that produces range instead of match.

`++all:regex`

Find all matches of a regular expression.

Accepts: a regular expression and a match target

Produces: (list match)

Crashes if the regular expression is invalid.

> (all:regex "\\w*" "the quick brown fox")
~[
  {[p=0 q=[p=[p=[p=1 q=1] q=[p=1 q=4]] q="the"]]}
  {[p=0 q=[p=[p=[p=1 q=5] q=[p=1 q=10]] q="quick"]]}
  {[p=0 q=[p=[p=[p=1 q=11] q=[p=1 q=16]] q="brown"]]}
  {[p=0 q=[p=[p=[p=1 q=17] q=[p=1 q=20]] q="fox"]]}
]

`++alt:regex`

Variant of +all that produces (list range) instead of (list match).

`++sub:regex`

Replace the first match of a regular expression.

Accepts: a regular expression, a replacement string, and a match target

Produces: the match target, potentially modified

Crashes if the regular expression is invalid.

> (sub:regex "(?i)hoon" "HOON" "Hoon hoon hoon")
"HOON hoon hoon"

`++gsub:regex`

Replace the all matches of a regular expression.

Accepts: a regular expression, a replacement string, and a match target

Produces: the match target, potentially modified

Crashes if the regular expression is invalid.

> (gsub:regex "(?i)hoon" "HOON" "Hoon hoon hoon")
"HOON HOON HOON

`++subf:regex`

Variant of +sub that accepts a gate from tape to tape instead of a replacement string, slamming that gate on the text of the first match.

`++gsubf:regex`

Variant of +gsub that accepts a gate from tape to tape instead of a replacement string, slamming that gate on the text of each match.

Regular expression syntax

Same as POSIX Extended Regular Expressions (grep -E), with the following extensions:

Extension	Syntax	Example
Perl classes	`\w\d\s\u\l`	`\w+\.` (word followed by period)
	`\W\D\S\U\L`	`\d+(\W+\d+)*` (series of numbers)
Perl anchors	`\b\B\<\>`	`\<\w+\>` (any full word)
Lookahead	`(?=...)`	`(?=\u)\w+` (capitalized word)
	`(?!...)`	`(?!\d)\w+` (word not beginning with number)
Case sensitivity	`(?i)`	`(?i)(\w+)(\s*\1)+` (repeated word)
	`(?-i)`	`(?i)(\w+)(?-i)(\s*\1)+` (word repeated in same case)

These character classes are equivalent:

Perl class	Posix class
`\w`	`[[:word:]]`
`\W`	`[[:^word:]]`
`\d`	`[[:digit:]]`
`\D`	`[[:^digit:]]`
`\s`	`[[:space:]]`
`\S`	`[[:^space:]]`
`\u`	`[[:upper:]]`
`\U`	`[[:^upper:]]`
`\l`	`[[:lower:]]`
`\L`	`[[:^lower:]]`

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Files

Usage

Types

Operations

`++valid:regex`

`++run:regex`

`++ran:regex`

`++rut:regex`

`++rat:regex`

`++all:regex`

`++alt:regex`

`++sub:regex`

`++gsub:regex`

`++subf:regex`

`++gsubf:regex`

Regular expression syntax

About

Releases

Packages

Contributors 3

Languages

lynko/re.hoon

Folders and files

Latest commit

History

Repository files navigation

Files

Usage

Types

Operations

++valid:regex

++run:regex

++ran:regex

++rut:regex

++rat:regex

++all:regex

++alt:regex

++sub:regex

++gsub:regex

++subf:regex

++gsubf:regex

Regular expression syntax

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

`++valid:regex`

`++run:regex`

`++ran:regex`

`++rut:regex`

`++rat:regex`

`++all:regex`

`++alt:regex`

`++sub:regex`

`++gsub:regex`

`++subf:regex`

`++gsubf:regex`

Packages