-
Notifications
You must be signed in to change notification settings - Fork 134
/
tre-syntax.html
433 lines (379 loc) · 14.4 KB
/
tre-syntax.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
<h1>TRE Regexp Syntax</h1>
<p>
This document describes the POSIX 1003.2 extended RE (ERE) syntax and
the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style
notation is used to describe the grammar.
</p>
<h2>ERE Syntax</h2>
<h3>Alternation operator</h3>
<a name="alternation"></a>
<a name="extended-regexp"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
| <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
</pre>
</td></tr>
</table>
<p>
An extended regexp (ERE) is one or more <i>branches</i>, separated by
<tt>|</tt>. An ERE matches anything that matches one or more of the
branches.
</p>
<h3>Catenation of REs</h3>
<a name="catenation"></a>
<a name="branch"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>branch</i> ::= <i>piece</i>
| <i>branch</i> <i>piece</i>
</pre>
</td></tr>
</table>
<p>
A branch is one or more <i>pieces</i> concatenated. It matches a
match for the first piece, followed by a match for the second piece,
and so on.
</p>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>piece</i> ::= <i>atom</i>
| <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
| <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
</pre>
</td></tr>
</table>
<p>
A piece is an <i>atom</i> possibly followed by a repeat operator or an
expression controlling approximate matching parameters for the <i>atom</i>.
</p>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
| <a href="#bracket-expression"><i>bracket-expression</i></a>
| <b>"."</b>
| <a href="#assertion"><i>assertion</i></a>
| <a href="#literal"><i>literal</i></a>
| <a href="#backref"><i>back-reference</i></a>
| <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
</pre>
</td></tr>
</table>
<p>
An atom is either an ERE enclosed in parenthesis, a bracket
expression, a <tt>.</tt> (period), an assertion, or a literal.
</p>
<p>
The dot (<tt>.</tt>) matches any single character.
If the <code>REG_NEWLINE</code> compilation flag (see <a
href="api.html">API manual</a>) is specified, the newline
character is not matched.
</p>
<p>
<tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
completely ignored by the regex parser and it used solely for readability purposes.
</p>
<h3>Repeat operators</h3>
<a name="repeat-operator"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>repeat-operator</i> ::= <b>"*"</b>
| <b>"+"</b>
| <b>"?"</b>
| <i>bound</i>
| <b>"*?"</b>
| <b>"+?"</b>
| <b>"??"</b>
| <i>bound</i> <b>?</b>
</pre>
</td></tr>
</table>
<p>
An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
of 1 or more matches of the atom. An atom followed by <tt>?</tt>
matches a sequence of 0 or 1 matches of the atom.
</p>
<p>
A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
are unsigned decimal integers between <tt>0</tt> and
<tt>RE_DUP_MAX</tt>:
</p>
<ol>
<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
<li><tt>{</tt><i>m</i><tt>,}</tt></li>
<li><tt>{</tt><i>m</i><tt>}</tt></li>
</ol>
<p>
An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
(inclusive) matches of the atom. An atom followed by [2]
matches a sequence of <i>m</i> or more matches of the atom. An atom
followed by [3] matches a sequence of exactly <i>m</i> matches of the
atom.
</p>
<p>
Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
non-greedy. Normally a repeated expression is greedy, that is, it matches as
many characters as possible. A non-greedy subexpression matches as few
characters as possible. Note that this does not (always) mean the same thing
as matching as many or few repetitions as possible. Also note
that <strong>minimal repetitions are not currently supported for approximate
matching</strong>.
</p>
<h3>Approximate matching settings</h3>
<a name="approx-settings"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
<i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
| <b>"-"</b> <i>number</i>?
| <b>"#"</b> <i>number</i>?
| <b>"~"</b> <i>number</i>?
<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i>
<i>cost-term</i> ::= <i>number</i> <b>"i"</b>
| <i>number</i> <b>"d"</b>
| <i>number</i> <b>"s"</b>
</pre>
</td></tr>
</table>
<p>
The approximate matching settings for a subpattern can be changed
by appending <i>approx-settings</i> to the subpattern. Limits for
the number of errors can be set and an expression for specifying and
limiting the costs can be given.
</p>
<p>
The <i>count-limits</i> can be used to set limits for the number of
insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
(<tt>#</tt>), and total number of errors (<tt>~</tt>). If the
<i>number</i> part is omitted, the specified error count will be
unlimited.
</p>
<p>
The <i>cost-equation</i> can be thought of as a mathematical equation,
where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
insertions, deletions, and substitutions, respectively. The equation
can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
<tt>s</tt>. The multiplier is the cost of the error, and the number
after <tt><</tt> is the maximum allowed cost of a match. Spaces
and pluses can be inserted to make the equation readable. In fact, when
specifying only a cost equation, adding a space after the opening <tt>{</tt>
is <strong>required</strong>.
</p>
<p>
Examples:
<dl>
<dt><tt>{~}</tt></dt>
<dd>Sets the maximum number of errors to unlimited.</dd>
<dt><tt>{~3}</tt></dt>
<dd>Sets the maximum number of errors to three.</dd>
<dt><tt>{+2~5}</tt></dt>
<dd>Sets the maximum number of errors to five, and the maximum number
of insertions to two.</dd>
<dt><tt>{<3}</tt></dt>
<dd>Sets the maximum cost to three.
<dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt>
<dd>Sets the cost of an insertion to two, a deletion to one, a
substitution to two, and the maximum cost to five.
</dl>
<h3>Bracket expressions</h3>
<a name="bracket-expression"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
| <b>"[^"</b> <i>item</i>+ <b>"]"</b>
</pre>
</td></tr>
</table>
<p>
A bracket expression specifies a set of characters by enclosing a
nonempty list of items in brackets. Normally anything matching any
item in the list is matched. If the list begins with <tt>^</tt> the
meaning is negated; any character matching no item in the list is
matched.
</p>
<p>
An item is any of the following:
</p>
<ul>
<li>A single character, matching that character.</li>
<li>Two characters separated by <tt>-</tt>. This is shorthand for the
full range of characters between those two (inclusive) in the
collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any
decimal digit.</li>
<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
matching the collating element. This can be used to include a literal
<tt>-</tt> or a multi-character collating element in the list.</li>
<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
equivalence class), matching all collating elements with the same
primary collation weight as that element, including the element
itself.</li>
<li>The name of a character class enclosed in <tt>[:</tt> and
<tt>:]</tt>, matching any character belonging to the class. The set
of valid names depends on the <code>LC_CTYPE</code> category of the
current locale, but the following names are valid in all locales:
<ul>
<li><tt>alnum</tt> - alphanumeric characters</li>
<li><tt>alpha</tt> - alphabetic characters</li>
<li><tt>blank</tt> - blank characters</li>
<li><tt>cntrl</tt> - control characters</li>
<li><tt>digit</tt> - decimal digits (0 through 9)</li>
<li><tt>graph</tt> - all printable characters except space</li>
<li><tt>lower</tt> - lower-case letters</li>
<li><tt>print</tt> - printable characters including space</li>
<li><tt>punct</tt> - printable characters not space or alphanumeric</li>
<li><tt>space</tt> - white-space characters</li>
<li><tt>upper</tt> - upper case letters</li>
<li><tt>xdigit</tt> - hexadecimal digits</li>
</ul>
</ul>
<p>
To include a literal <tt>-</tt> in the list, make it either the first
or last item, the second endpoint of a range, or enclose it in
<tt>[.</tt> and <tt>.]</tt> to make it a collating element. To
include a literal <tt>]</tt> in the list, make it either the first
item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
<tt>.]</tt>. To use a literal <tt>-</tt> as the first
endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
</p>
<h3>Assertions</h3>
<a name="assertion"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>assertion</i> ::= <b>"^"</b>
| <b>"$"</b>
| <b>"\"</b> <i>assertion-character</i>
</pre>
</td></tr>
</table>
<p>
The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
"right anchor", respectively. The left anchor matches the empty
string at the beginning of the string. The right anchor matches the
empty string at the end of the string. The behaviour of both anchors
can be varied by specifying certain execution and compilation flags;
see the <a href="api.html">API manual</a>.
</p>
<p>
An assertion-character can be any of the following:
</p>
<ul>
<li><tt><</tt> - Beginning of word
<li><tt>></tt> - End of word
<li><tt>b</tt> - Word boundary
<li><tt>B</tt> - Non-word boundary
<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
</ul>
<h3>Literals</h3>
<a name="literal"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>literal</i> ::= <i>ordinary-character</i>
| <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
| <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
| <b>"\"</b> <i>character</i>
</pre>
</td></tr>
</table>
<p>
A literal is either an ordinary character (a character that has no
other significance in the context), an 8 bit hexadecimal encoded
character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
(e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped
character is a <tt>\</tt> followed by any character, and matches that
character. Escaping can be used to match characters which have a
special meaning in regexp syntax. A <tt>\</tt> cannot be the last
character of an ERE. Escaping also allows you to include a few
non-printable characters in the regular expression. These special
escape sequences include:
</p>
<ul>
<li><tt>\a</tt> - Bell character (ASCII code 7)
<li><tt>\e</tt> - Escape character (ASCII code 27)
<li><tt>\f</tt> - Form-feed character (ASCII code 12)
<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
<li><tt>\r</tt> - Carriage return character (ASCII code 13)
<li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
</ul>
<p>
An ordinary character is just a single character with no other
significance, and matches that character. A <tt>{</tt> followed by
something else than a digit is considered an ordinary character.
</p>
<h3>Back references</h3>
<a name="backref"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
</pre>
</td></tr>
</table>
<p>
A back reference is a backslash followed by a single non-zero decimal
digit <i>d</i>. It matches the same sequence of characters
matched by the <i>d</i>th parenthesized subexpression.
</p>
<p>
Back references are not defined for POSIX EREs (for BREs they are),
but many matchers, including TRE, implement back references for both
EREs and BREs.
</p>
<h3>Options</h3>
<a name="options"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
</pre>
</td></tr>
</table>
Options allow compile time options to be turned on/off for particular parts of the
regular expression. The options equate to several compile time options specified to
the regcomp API function. If the option is specified in the first section, it is
turned on. If it is specified in the second section (after the <tt>-</tt>), it is
turned off.
<ul>
<li>i - Case insensitive.
<li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in
the <a href="tre-api.html">API Manual</a>.
<li>r - Causes the regex to be matched in a right associative manner rather than the normal
left associative manner.
<li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
</ul>
<h2>BRE Syntax</h2>
<p>
The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
follows:
</p>
<ul>
<li><tt>|</tt> is an ordinary character, and there is no equivalent
for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary
characters.</li>
<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
<li>The parentheses for nested subexpressions are <tt>\(</tt> and
<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
characters.</li>
<li><tt>^</tt> is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression. Similarly,
<tt>$</tt> is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression.</li>
</ul>