<html>
<head>
<title>pcrepattern specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepattern man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
<li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
<li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
<li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">BACKSLASH</a>
<li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
<li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a>
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a>
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
<li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
<li><a name="TOC12" href="#SEC12">VERTICAL BAR</a>
<li><a name="TOC13" href="#SEC13">INTERNAL OPTION SETTING</a>
<li><a name="TOC14" href="#SEC14">SUBPATTERNS</a>
<li><a name="TOC15" href="#SEC15">DUPLICATE SUBPATTERN NUMBERS</a>
<li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a>
<li><a name="TOC17" href="#SEC17">REPETITION</a>
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACK REFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
<li><a name="TOC30" href="#SEC30">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
<P>
The syntax and semantics of the regular expressions that are supported by PCRE
are described in detail below. There is a quick-reference syntax summary in the
<a href="pcresyntax.html"><b>pcresyntax</b></a>
page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
also supports some alternative regular expression syntax (which does not
conflict with the Perl syntax) in order to provide some compatibility with
regular expressions in Python, .NET, and Oniguruma.
</P>
<P>
Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some of which
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
</P>
<P>
This document discusses the patterns that are supported by PCRE when one of its
main matching functions, <b>pcre_exec()</b> (8-bit) or <b>pcre[16|32]_exec()</b>
(16- or 32-bit), is used. PCRE also has alternative matching functions,
<b>pcre_dfa_exec()</b> and <b>pcre[16|32]_dfa_exec()</b>, which match using a
different algorithm that is not Perl-compatible. Some of the features discussed
below are not available when DFA matching is used. The advantages and
disadvantages of the alternative functions, and how they differ from the normal
functions, are discussed in the
<a href="pcrematching.html"><b>pcrematching</b></a>
page.
</P>
<br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
<P>
A number of options that can be passed to <b>pcre_compile()</b> can also be set
by special items at the start of a pattern. These are not Perl-compatible, but
are provided to make these options accessible to pattern writers who are not
able to change the program that processes the pattern. Any number of these
items may appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
</P>
<br><b>
UTF support
</b><br>
<P>
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
extra library that supports 16-bit and UTF-16 character strings, and a
third library that supports 32-bit and UTF-32 character strings. To use these
features, PCRE must be built to include appropriate support. When using UTF
strings you must either call the compiling function with the PCRE_UTF8,
PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
these special sequences:
<pre>
(*UTF8)
(*UTF16)
(*UTF32)
(*UTF)
</pre>
(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
option. How setting a UTF mode affects pattern matching is mentioned in several
places below. There is also a summary of features in the
<a href="pcreunicode.html"><b>pcreunicode</b></a>
page.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF
option is set at compile time, (*UTF) etc. are not allowed, and their
appearance causes an error.
</P>
<br><b>
Unicode property support
</b><br>
<P>
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup
table.
</P>
<br><b>
Disabling auto-possessification
</b><br>
<P>
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
quantifiers possessive when what follows cannot match the repeated item. For
example, by default a+b is treated as a++b. For more details, see the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
<br><b>
Disabling start-up optimizations
</b><br>
<P>
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables
several optimizations for quickly reaching "no match" results. For more
details, see the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
<a name="newlines"></a></P>
<br><b>
Newline conventions
</b><br>
<P>
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence. The
<a href="pcreapi.html"><b>pcreapi</b></a>
page has
<a href="pcreapi.html#newlines">further discussion</a>
about newlines, and shows how to set the newline convention in the
<i>options</i> arguments for the compiling and matching functions.
</P>
<P>
It is also possible to specify a newline convention by starting a pattern
string with one of the following five sequences:
<pre>
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
<pre>
(*CR)a.b
</pre>
changes the convention to CR. That pattern matches "a\nb" because LF is no
longer a newline. If more than one of these settings is present, the last one
is used.
</P>
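<P>
The "last one is used" rule can be sketched as follows. This is an illustrative
model in Python, not PCRE's own parser: the helper name
<b>newline_from_pattern()</b>, the KNOWN_ITEMS table, and the "LF" default are
assumptions made for the example. It steps over the leading items described in
this document and remembers the most recent newline setting.
</P>

```python
import re

# Start-of-pattern items described in this document (limits carry "=digits").
KNOWN_ITEMS = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_AUTO_POSSESS", "NO_START_OPT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "BSR_ANYCRLF", "BSR_UNICODE",
    "LIMIT_MATCH", "LIMIT_RECURSION",
}
NEWLINE_ITEMS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}

# One leading item: "(*", an upper-case name, an optional "=digits", ")".
_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=\d+)?\)")


def newline_from_pattern(pattern, default="LF"):
    """Return (newline convention, rest of pattern).

    Hypothetical helper: models only the 'last setting wins' rule.
    """
    convention = default
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None or item.group(1) not in KNOWN_ITEMS:
            break  # items must all be together at the very start
        if item.group(1) in NEWLINE_ITEMS:
            convention = item.group(1)  # a later setting replaces an earlier one
        pos = item.end()
    return convention, pattern[pos:]
```

<P>
With these assumptions, "(*CR)(*LF)x" yields the LF convention, and a lower-case
"(*cr)" is not recognized as a setting at all.
</P>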
<P>
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
</P>
<br><b>
Setting match and recursion limits
</b><br>
<P>
The caller of <b>pcre_exec()</b> can set a limit on the number of times the
internal <b>match()</b> function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches that
are provoked by patterns with huge matching trees (a typical example is a
pattern with nested unlimited repeats) and to avoid running out of system stack
by too much recursion. When one of these limits is reached, <b>pcre_exec()</b>
gives an error return. The limits can also be set by items at the start of the
pattern of the form
<pre>
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre_exec()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P>
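<P>
A minimal sketch of how the limits combine, written in Python as a model of the
rule rather than as PCRE code. The function name <b>effective_limits()</b> and
its arguments are hypothetical, and the sketch assumes that the pattern's
leading items are all limit settings.
</P>

```python
import re

_LIMIT = re.compile(r"\(\*(LIMIT_MATCH|LIMIT_RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match, caller_recursion):
    """Combine the caller's limits with (*LIMIT_...=d) items in a pattern."""
    limits = {"LIMIT_MATCH": caller_match, "LIMIT_RECURSION": caller_recursion}
    pos = 0
    while True:
        item = _LIMIT.match(pattern, pos)
        if item is None:
            break
        name, value = item.group(1), int(item.group(2))
        # A pattern setting takes effect only when it lowers the limit, so
        # repeated settings end up at the lowest value.
        if value < limits[name]:
            limits[name] = value
        pos = item.end()
    return limits
```

<P>
In this model a pattern that asks for a limit higher than the caller's value
leaves the caller's value unchanged.
</P>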
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE can be compiled to run in an environment that uses EBCDIC as its character
code rather than ASCII or Unicode (typically a mainframe system). In the
sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
</P>
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
<P>
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
<pre>
The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are matched
independently of case. In a UTF mode, PCRE always understands the concept of
case for characters whose values are less than 128, so caseless matching is
always possible. For characters with higher values, the concept of case is
supported if PCRE is compiled with Unicode property support, but not otherwise.
If you want to use caseless matching for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as with
UTF support.
</P>
<P>
The power of regular expressions comes from the ability to include alternatives
and repetitions in the pattern. These are encoded in the pattern by the use of
<i>metacharacters</i>, which do not stand for themselves but instead are
interpreted in some special way.
</P>
<P>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized within square brackets. Outside square brackets, the metacharacters
are as follows:
<pre>
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
</pre>
Part of a pattern that is in square brackets is called a "character class". In
a character class the only metacharacters are:
<pre>
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
</pre>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
<P>
The backslash character has several uses. Firstly, if it is followed by a
character that is not a number or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you write \* in the pattern.
This escaping action applies whether or not the following character would
otherwise be interpreted as a metacharacter, so it is always safe to precede a
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \\.
</P>
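<P>
Because a backslash before any non-alphanumeric is safe, literal text can be
prepared for use in a pattern mechanically. The following Python sketch shows
one way to do so. The helper name <b>quote_literal()</b> is hypothetical, and
the sketch assumes that the text contains no binary zero.
</P>

```python
def quote_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    out = []
    for ch in text:
        # Only non-alphanumerics are escaped: a backslash followed by a
        # letter or digit would have a special meaning of its own.
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)
```

<P>
For example, the text 1+1=2 becomes 1\+1\=2, and a single backslash becomes \\.
</P>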
<P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
greater than 127) are treated as literals.
</P>
<P>
If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
pattern (other than in a character class), and characters between a # outside a
character class and the next newline, inclusive, are ignored. An escaping
backslash can be used to include a white space or # character as part of the
pattern.
</P>
<P>
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \Q and \E. This is different from Perl in
that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
Perl, $ and @ cause variable interpolation. Note the following examples:
<pre>
  Pattern            PCRE matches   Perl matches
  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is not
terminated.
<a name="digitsafterbackslash"></a></P>
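<P>
The \Q...\E rules can be modelled by rewriting each quoted span as individually
escaped characters. This Python sketch is an illustration of the rules stated
above, not PCRE's implementation. The function name <b>expand_quoted()</b> is
hypothetical, and character classes are not given any special treatment.
</P>

```python
def expand_quoted(pattern):
    r"""Rewrite \Q...\E spans so that each quoted character is escaped."""
    out = []
    i = 0
    n = len(pattern)
    quoting = False
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False          # end of the literal span
                i += 2
            else:
                ch = pattern[i]
                if ord(ch) < 128 and not ch.isalnum():
                    out.append("\\")     # make the character stand for itself
                out.append(ch)
                i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2                       # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)              # keep any other escape intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # An unterminated \Q simply runs to the end of the pattern.
    return "".join(out)
```

<P>
Applied to the three patterns in the table, the sketch produces abc\$xyz,
abc\\\$xyz, and abc\$xyz respectively, which agree with the "PCRE matches"
column.
</P>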
<br><b>
Non-printing characters
</b><br>
<P>
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters, apart from the binary zero that terminates a pattern,
but when a pattern is being prepared by text editing, it is often easier to use
one of the following escape sequences than the binary character it represents:
<pre>
  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any ASCII character
  \e        escape (hex 1B)
  \f        form feed (hex 0C)
  \n        linefeed (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \0dd      character with octal code 0dd
  \ddd      character with octal code ddd, or back reference
  \o{ddd..} character with octal code ddd..
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
  \uhhhh    character with hex code hhhh (JavaScript mode only)
</pre>
The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
data item (byte or 16-bit value) following \c has a value greater than 127, a
compile-time error occurs. This locks out non-ASCII characters in all modes.
</P>
<P>
The \c facility was designed for use with ASCII characters, but with the
extension to Unicode it is even less useful than it once was. It is, however,
recognized when PCRE is compiled in EBCDIC mode, where data items are always
bytes. In this mode, all values are valid after \c. If the next character is a
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
characters also generate different values.
</P>
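<P>
The arithmetic behind \cx is small enough to write out. The Python sketch below
models the two cases described above. The function names are hypothetical, and
the EBCDIC version assumes that the caller has already converted a lower case
letter to upper case and passes the EBCDIC byte value.
</P>

```python
def control_ascii(x):
    r"""Code generated by \cx when x is an ASCII character."""
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case is converted to upper case first
    return code ^ 0x40          # invert bit 6 (hex 40)


def control_ebcdic(upper_case_byte):
    r"""Code generated by \cx in EBCDIC mode for an upper-case byte value."""
    return upper_case_byte ^ 0xC0   # invert the 0xc0 bits
```

<P>
The values quoted in the text serve as a check: "A" gives hex 01 and "{" gives
hex 3B in ASCII, while the EBCDIC bytes C1 and E9 give hex 01 and hex 29.
</P>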
<P>
After \0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \0\x\07
specifies two binary zeros followed by a BEL character (code value 7). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
</P>
<P>
The escape \o must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as octal
numbers greater than 0777, and it also allows octal numbers and back references
to be unambiguously specified.
</P>
<P>
For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{} or \x{} to specify character
numbers, and \g{} to specify back references. The following paragraphs
describe the old, ambiguous syntax.
</P>
<P>
The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed in recent releases, causing PCRE also to change. Outside a
character class, PCRE reads the digit and any following digits as a decimal
number. If the number is less than 8, or if there have been at least that many
previous capturing left parentheses in the expression, the entire sequence is
taken as a <i>back reference</i>. A description of how this works is given
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
</P>
<P>
Inside a character class, or if the decimal number following \ is greater than
7 and there have not been that many capturing subpatterns, PCRE handles \8 and
\9 as the literal characters "8" and "9", and otherwise re-reads up to three
octal digits following the backslash, using them to generate a data character.
Any subsequent digits stand for themselves. For example:
<pre>
  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40 previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the character with octal code 113
  \377   might be a back reference, otherwise the value 255 (decimal)
  \81    is either a back reference, or the two characters "8" and "1"
</pre>
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
</P>
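<P>
The decision between a back reference and an octal character can be summarized
in a few lines of Python. This is a sketch of the rules as stated above, not
code taken from PCRE. The function name and its arguments are hypothetical:
<i>digits</i> is the whole run of decimal digits after the backslash, and
<i>captures_so_far</i> is the number of previous capturing left parentheses.
</P>

```python
def interpret_backslash_digits(digits, captures_so_far, in_class=False):
    """Return ("backref", n) or ("chars", [code, ...]) for backslash + digits."""
    if not in_class and digits[0] != "0":
        number = int(digits)  # read as a decimal number first
        if number < 8 or number <= captures_so_far:
            return ("backref", number)
    if digits[0] in "89":
        # \8 and \9 are literal; any following digits stand for themselves.
        return ("chars", [ord(d) for d in digits])
    # Re-read up to three octal digits to make one data character.
    i = 0
    while i < len(digits) and i < 3 and digits[i] in "01234567":
        i += 1
    return ("chars", [int(digits[:i], 8)] + [ord(d) for d in digits[i:]])
```

<P>
Under this model \40 is a space when there are no capturing subpatterns but a
back reference when there are 40 of them, and \0113 is always a tab followed by
"3".
</P>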
<P>
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than
a hexadecimal digit appears between \x{ and }, or if there is no terminating
}, an error occurs.
</P>
<P>
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
as just described only when it is followed by two hexadecimal digits.
Otherwise, it matches a literal "x" character. In JavaScript mode, support for
code points greater than 256 is provided by \u, which must be followed by
four hexadecimal digits; otherwise it matches a literal "u" character.
</P>
<P>
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
way they are handled. For example, \xdc is exactly the same as \x{dc} (or
\u00dc in JavaScript mode).
</P>
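<P>
The difference between the default and JavaScript readings of \x and \u can be
sketched in Python as follows. The function <b>read_escape()</b> is a
hypothetical model: it receives the text that follows the backslash, it leaves
the braced form \x{...} out, and it raises an exception for \u outside
JavaScript mode on the assumption that an unsupported escape is an error.
</P>

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _hex_run(text, start, limit):
    """Return up to `limit` hexadecimal digits of text, beginning at start."""
    end = start
    while end < len(text) and end - start < limit and text[end] in HEX_DIGITS:
        end += 1
    return text[start:end]


def read_escape(text, javascript_compat=False):
    """Interpret text after a backslash that begins with 'x' or 'u'.

    Returns (character code, number of characters consumed).
    """
    letter = text[0]
    if not javascript_compat:
        if letter != "x":
            raise ValueError("\\u is only recognized in JavaScript mode")
        if text[1:2] == "{":
            raise NotImplementedError("the braced form is not modelled here")
        digits = _hex_run(text, 1, 2)          # zero to two digits are read
        return (int(digits, 16) if digits else 0, 1 + len(digits))
    wanted = 2 if letter == "x" else 4         # \xhh or \uhhhh, exactly
    digits = _hex_run(text, 1, wanted)
    if len(digits) == wanted:
        return (int(digits, 16), 1 + wanted)
    return (ord(letter), 1)                    # otherwise a literal letter
```

<P>
In this model \xdc and \u00dc (JavaScript mode) both give the value hex DC,
whereas \x7g gives the value 7 by default but a literal "x" in JavaScript mode.
</P>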
<br><b>
Constraints on character values
</b><br>
<P>
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
  8-bit non-UTF mode    less than 0x100
  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
  16-bit non-UTF mode   less than 0x10000
  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
  32-bit non-UTF mode   less than 0x100000000
  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
</pre>
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
"surrogate" codepoints), and 0xffef.
</P>
<br><b>
Escape sequences in character classes
</b><br>
<P>
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08).
</P>
<P>
\N is not allowed in a character class. \B, \R, and \X are not special
inside a character class. Like other unrecognized escape sequences, they are
treated as the literal characters "B", "R", and "X" by default, but cause an
error if the PCRE_EXTRA option is set. Outside a character class, these
sequences have different meanings.
</P>
<br><b>
Unsupported escape sequences
</b><br>
<P>
In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE
does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
option is set, \U matches a "U" character, and \u can be used to define a
character by code point, as described in the previous section.
</P>
<br><b>
Absolute and relative back references
</b><br>
<P>
The sequence \g followed by an unsigned or a negative number, optionally
enclosed in braces, is an absolute or relative back reference. A named back
reference can be coded as \g{name}. Back references are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
</P>
<br><b>
Absolute and relative subroutine calls
</b><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
syntax for referencing a subpattern as a "subroutine". Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a back reference; the latter is a
<a href="#subpatternsassubroutines">subroutine</a>
call.
<a name="genericchartypes"></a></P>
<br><b>
Generic character types
</b><br>
<P>
Another use of backslash is for specifying generic character types:
<pre>
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
</pre>
There is also the single sequence \N, which matches a non-newline character.
This is the same as
<a href="#fullstopdot">the "." metacharacter</a>
when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;
PCRE does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair. The sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, because
there is no character to match.
</P>
<P>
For compatibility with Perl, \s formerly did not match the VT character (code
11), which made it different from the POSIX "space" class. However, Perl
added VT at release 5.18, and PCRE followed suit at release 8.34. The default
\s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
(32), which are defined as white space in the "C" locale. This list may vary if
locale-specific matching is taking place. For example, in some locales the
"non-breaking space" character (\xA0) is recognized as white space, and in
others the VT character is not.
</P>
<P>
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
<a href="pcreapi.html#localesupport">"Locale support"</a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales with
Unicode is discouraged.
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
\s, or \w, and always match \D, \S, and \W, although this may vary for
characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If PCRE is compiled with
Unicode property support, and the PCRE_UCP option is set, the behaviour is
changed so that Unicode properties are used to determine character types, as
follows:
<pre>
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore
</pre>
The upper case escapes match the inverse sets of characters. Note that \d
matches only decimal digits, whereas \w matches any Unicode digit, as well as
any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and
\B because they are defined in terms of \w and \W. Matching these sequences
is noticeably slower when PCRE_UCP is set.
</P>
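<P>
The effect of PCRE_UCP on \d, \s, and \w can be approximated in Python with the
standard <b>unicodedata</b> module. This is a sketch only: the function name
<b>matches()</b> is hypothetical, locale-specific tables are ignored, and the
Unicode data shipped with Python may not be the same version as the tables
built into PCRE.
</P>

```python
import unicodedata

# Characters matched by \h and \v, which take part in the UCP definition of \s.
H_SPACE = {0x09, 0x20, 0xA0, 0x1680, 0x180E, *range(0x2000, 0x200B),
           0x202F, 0x205F, 0x3000}
V_SPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}
DEFAULT_SPACE = {9, 10, 11, 12, 13, 32}   # HT, LF, VT, FF, CR, space


def matches(escape, ch, ucp=False):
    """True if ch matches the escape 'd', 's' or 'w' in this model."""
    code = ord(ch)
    if not ucp:
        if code > 127:
            return False                  # default: ASCII characters only
        if escape == "d":
            return ch.isdigit()
        if escape == "s":
            return code in DEFAULT_SPACE
        return ch.isalnum() or ch == "_"
    category = unicodedata.category(ch)   # two letters, such as "Nd" or "Zs"
    if escape == "d":
        return category == "Nd"
    if escape == "s":
        return category[0] == "Z" or code in H_SPACE or code in V_SPACE
    return category[0] in "LN" or ch == "_"
```

<P>
For instance, U+00B2 (superscript two) is a Unicode digit but not a decimal
digit, so with <i>ucp</i> set it matches \w but not \d in this model.
</P>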
<P>
The sequences \h, \H, \v, and \V are features that were added to Perl at
release 5.10. In contrast to the other sequences, which match only ASCII
characters by default, these always match certain high-valued code points,
whether or not PCRE_UCP is set. The horizontal space characters are:
<pre>
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
</pre>
The vertical space characters are:
<pre>
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
</pre>
In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are
relevant.
<a name="newlineseq"></a></P>
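<P>
The two lists above translate directly into lookup sets. The following Python
sketch uses hypothetical names (H_SPACE, V_SPACE, <b>space_kind()</b>,
<b>relevant_in_8bit()</b>) and shows which entries remain relevant in 8-bit
non-UTF-8 mode.
</P>

```python
H_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))   # U+2000 (en quad) to U+200A (hair space)
    + [0x202F, 0x205F, 0x3000]
)
V_SPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def space_kind(code_point):
    """Return 'h' for horizontal space, 'v' for vertical space, else None."""
    if code_point in H_SPACE:
        return "h"
    if code_point in V_SPACE:
        return "v"
    return None


def relevant_in_8bit(code_points):
    """Keep only the code points below 256, in ascending order."""
    return sorted(c for c in code_points if c < 256)
```

<P>
In 8-bit non-UTF-8 mode only HT, space, and non-break space are left from the
horizontal list, and LF, VT, FF, CR, and NEL from the vertical list.
</P>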
<br><b>
Newline sequences
</b><br>
<P>
Outside a character class, by default, the escape sequence \R matches any
Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
following:
<pre>
(?>\r\n|\n|\x0b|\f|\r|\x85)
</pre>
This is an example of an "atomic group", details of which are given
<a href="#atomicgroup">below.</a>
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). The two-character sequence is treated as a single unit that
cannot be split.
</P>
<P>
In other modes, two additional characters whose codepoints are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode character property support is not needed for these characters to be
recognized.
</P>
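<P>
To illustrate why the group is atomic, here is a rough Python emulation of \R.
It is a sketch under stated assumptions, not how PCRE implements \R: Python's
<b>re</b> module has no \R, so the atomic group is imitated with a lookahead
followed by a back reference, and the names backslash_r and find_newlines are
hypothetical.
</P>

```python
import re

# Alternatives in the same order as the 8-bit group shown above.
EIGHT_BIT_NEWLINES = r'\r\n|\n|\x0b|\f|\r|\x85'
# LS and PS, which are added in the other modes.
WIDE_NEWLINES = r'|\u2028|\u2029'


def backslash_r(wide=True):
    # (?=(...))\1 stands in for the atomic group (?>...): the lookahead
    # commits to one alternative and is never re-entered on backtracking,
    # then \1 consumes exactly the text that was captured.
    # Assumption: this is the first capturing group of the final pattern.
    alternatives = EIGHT_BIT_NEWLINES + (WIDE_NEWLINES if wide else '')
    return r'(?=(' + alternatives + r'))\1'


def find_newlines(text, wide=True):
    """Return every newline sequence found in text, in order."""
    return [m.group(0) for m in re.finditer(backslash_r(wide), text)]
```

<P>
With this sketch, matching backslash_r() followed by \n against the two
characters CR LF finds nothing, because CRLF is consumed as one unit. The same
alternation in an ordinary non-capturing group would backtrack to a lone CR and
let the trailing \n match the LF.
</P>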
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
either at compile time or when the pattern is matched. (BSR is an abbreviation
for "backslash R".) This can be made the default when PCRE is built; if this is
the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
It is also possible to specify these settings by starting a pattern string with
one of the following sequences:
<pre>
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
</pre>
These override the default and the options given to the compiling function, but
they can themselves be overridden by options given to a matching function. Note
that these special settings, which are not Perl-compatible, are recognized only
at the very start of a pattern, and that they must be in upper case. If more
than one of them is present, the last one is used. They can be combined with a
change of newline convention; for example, a pattern can start with:
<pre>
(*ANY)(*BSR_ANYCRLF)
</pre>
They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
(*UCP) special sequences. Inside a character class, \R is treated as an
unrecognized escape sequence, and so matches the letter "R" by default, but
causes an error if PCRE_EXTRA is set.
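</P>
<P>
The rules for the leading sequences (start of pattern only, upper case only,
last one wins) can be modelled in a few lines of Python. This is a hypothetical
helper, not PCRE's parser: the function name leading_bsr_setting is invented,
the set of other leading items is limited to the newline conventions and the
UTF and UCP sequences, and the later override by options given to a matching
function is not modelled.
</P>

```python
import re

# One leading item of the form (*NAME), upper case only.
_ITEM = re.compile(r'\(\*([A-Z0-9_]+)\)')

_BSR = {'BSR_ANYCRLF': 'anycrlf', 'BSR_UNICODE': 'unicode'}

# Other leading items this sketch knows how to skip (deliberately incomplete).
_OTHER = {'CR', 'LF', 'CRLF', 'ANYCRLF', 'ANY',
          'UTF8', 'UTF16', 'UTF32', 'UTF', 'UCP'}


def leading_bsr_setting(pattern):
    """Return 'anycrlf', 'unicode', or None if the pattern sets nothing."""
    setting = None
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break                      # no further item at this position
        name = m.group(1)
        if name in _BSR:
            setting = _BSR[name]       # a later setting replaces an earlier one
        elif name not in _OTHER:
            break                      # unknown item: stop scanning
        pos = m.end()
    return setting
```

<P>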
<a name="uniextseq"></a></P>
<br><b>
Unicode character properties
</b><br>
<P>
When PCRE is built with Unicode character property support, three additional
escape sequences that match characters with specific properties are available.
When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
The extra escape sequences are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster
</pre>
The property names represented by <i>xx</i> above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE properties (described
in the
<a href="#extraprops">next section).</a>
Other Perl properties such as "InMusicalSymbols" are not currently supported by
PCRE. Note that \P{Any} does not match any character, so it always causes a
match failure.
</P>
<P>
Sets of Unicode characters are defined as belonging to certain scripts. A
character from one of these sets can be matched using a script name. For
example:
<pre>
\p{Greek}
\P{Han}
</pre>
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
</P>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<P>
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the property
name. For example, \p{^Lu} is the same as \P{Lu}.
</P>
<P>
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these two
examples have the same effect:
<pre>
\p{L}
\pL
</pre>
The following general category property codes are supported:
<pre>
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
</pre>
The special property L& is also supported: it matches a character that has
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
a modifier or "other".
</P>
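<P>
The general category rules above (two-letter codes, one-letter groups, the
circumflex negation, and L&) can be approximated with Python's standard
<b>unicodedata</b> module. This is only a sketch: has_property and
lacks_property are invented names, script names are not handled, and the
Unicode version behind <b>unicodedata</b> may differ from the tables in a given
PCRE build.
</P>

```python
import unicodedata


def has_property(ch, name):
    """Rough model of \\p{name} for general categories, 'Any' and 'L&'."""
    negate = name.startswith('^')      # \p{^Lu} means the same as \P{Lu}
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)  # two-letter code such as 'Lu'
    if name == 'Any':
        hit = True
    elif name == 'L&':
        hit = category in ('Lu', 'Ll', 'Lt')
    elif len(name) == 1:
        hit = category.startswith(name)  # one letter covers the whole group
    else:
        hit = category == name
    return hit != negate


def lacks_property(ch, name):
    """Rough model of \\P{name}."""
    return not has_property(ch, name)
```

<P>
For instance, U+02B0 (a modifier letter, category Lm) satisfies "L" in this
model but not "L&", and a code point that is not assigned to any character,
such as U+FFFF, is reported as "Cn".
</P>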
<P>
The Cs (Surrogate) property applies only to characters in the range U+D800 to
U+DFFF. Such characters are not valid in Unicode strings and so
cannot be tested by PCRE, unless UTF validity checking has been turned off
(see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and
PCRE_NO_UTF32_CHECK in the
<a href="pcreapi.html"><b>pcreapi</b></a>
page). Perl does not support the Cs property.
</P>
<P>
The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE, nor is it permitted to prefix any of these
properties with "Is".
</P>
<P>
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
</P>
<P>
Specifying caseless matching does not affect these escape sequences. For
example, \p{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl.
</P>
<P>
Matching characters by Unicode property is not fast, because PCRE has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE by default, though you can make them do so by setting the
PCRE_UCP option or by starting the pattern with (*UCP).
</P>
<br><b>
Extended grapheme clusters
</b><br>
<P>
The \X escape matches any number of Unicode characters that form an "extended
grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Up to and including release 8.31, PCRE matched an earlier, simpler definition
that was equivalent to
<pre>
(?>\PM\pM*)
</pre>
That is, it matched a character without the "mark" property, followed by zero
or more characters with the "mark" property. Characters with the "mark"
property are typically non-spacing accents that affect the preceding character.
</P>
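<P>
The earlier definition is simple enough to sketch directly. The Python below is
an illustration only, not PCRE code: legacy_clusters is an invented name, the
"mark" test relies on Python's <b>unicodedata</b> module, and the function
models what repeated searches for the pattern above would return, so a mark
with no preceding base character is skipped.
</P>

```python
import unicodedata


def is_mark(ch):
    """True if ch has a general category starting with M (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith('M')


def legacy_clusters(text):
    """Split text the way successive matches of (?>\\PM\\pM*) would."""
    clusters = []
    i = 0
    n = len(text)
    while i < n:
        if is_mark(text[i]):
            i += 1                 # \PM cannot match a mark: no match here
            continue
        j = i + 1
        while j < n and is_mark(text[j]):
            j += 1                 # \pM* takes every following mark
        clusters.append(text[i:j])
        i = j
    return clusters
```
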
<P>
This simple definition was extended in Unicode to include more complicated
kinds of composite character by giving each character a grapheme breaking
property, and creating rules that use these properties to define the boundaries
of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches
one of these clusters.
</P>
<P>
\X always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
</P>
<P>
2. Do not end between CR and LF; otherwise end after any control character.
</P>
<P>
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks. Characters with
the "mark" property always have the "extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Otherwise, end the cluster.
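</P>
<P>
Rule 3 above amounts to a small table of permitted followers. The Python sketch
below transcribes just that rule; it takes the Hangul type names as strings
rather than classifying real characters, it ignores the other rules, and the
names HANGUL_MAY_FOLLOW, hangul_keeps_cluster and split_hangul_types are
hypothetical.
</P>

```python
# Permitted followers within one cluster, transcribed from rule 3.
HANGUL_MAY_FOLLOW = {
    'L':   {'L', 'V', 'LV', 'LVT'},
    'V':   {'V', 'T'},
    'LV':  {'V', 'T'},
    'LVT': {'T'},
    'T':   {'T'},
}


def hangul_keeps_cluster(prev_type, next_type):
    """True if rule 3 forbids ending the cluster between the two types."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())


def split_hangul_types(types):
    """Group a list of type names into the runs that rule 3 keeps together."""
    runs = []
    for t in types:
        if runs and hangul_keeps_cluster(runs[-1][-1], t):
            runs[-1].append(t)     # still inside the same syllable sequence
        else:
            runs.append([t])       # rule 3 allows a break before t
    return runs
```

<P>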
<a name="extraprops"></a></P>
<br><b>
PCRE's additional properties
</b><br>
<P>
As well as the standard Unicode properties described above, PCRE supports four
more that make it possible to convert traditional escape sequences such as \w
and \s to use Unicode properties. PCRE uses these non-standard, non-Perl
properties internally when PCRE_UCP is set. However, they may also be used
explicitly. These properties are:
<pre>
Xan Any alphanumeric character
Xps Any POSIX space character
Xsp Any Perl space character
Xwd Any Perl "word" character
</pre>
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps; it used to exclude vertical tab, for Perl
compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
matches the same characters as Xan, plus underscore.
</P>
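<P>
The four definitions above translate almost word for word into predicates. The
following Python sketch assumes that the general categories reported by the
standard <b>unicodedata</b> module are close enough to PCRE's tables; the
function names simply mirror the property names and are not part of any API.
</P>

```python
import unicodedata

# Tab, linefeed, vertical tab, form feed and carriage return.
_SPACE_CONTROLS = {'\t', '\n', '\x0b', '\f', '\r'}


def xan(ch):
    """Alphanumeric: L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ('L', 'N')


def xps(ch):
    """POSIX space: the five controls above, or any Z (separator) character."""
    return ch in _SPACE_CONTROLS or unicodedata.category(ch)[0] == 'Z'


def xsp(ch):
    """Perl space: the same set as Xps since release 8.34."""
    return xps(ch)


def xwd(ch):
    """Perl "word" character: Xan plus underscore."""
    return ch == '_' or xan(ch)
```
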
<P>
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
languages. These are the characters $, @, ` (grave accent), and all characters
with Unicode code points greater than or equal to U+00A0, except for the
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
where H is a hexadecimal digit. Note that the Xuc property does not match these
sequences but the characters that they represent.)
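</P>
<P>
The Xuc definition is a pure code point test, so it can be sketched without
any Unicode tables. The function name xuc below is illustrative and the code is
a direct reading of the paragraph above, not PCRE's implementation.
</P>

```python
def xuc(ch):
    """True if ch could be written as a Universal Character Name."""
    if ch in {'$', '@', '`'}:
        return True                          # the three permitted base characters
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDFFF:
        return False                         # surrogates are excluded
    return cp >= 0xA0                        # everything from U+00A0 upwards
```

<P>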
<a name="resetmatchstart"></a></P>
<br><b>
Resetting the match start
</b><br>
<P>
The escape sequence \K causes any previously matched characters not to be
included in the final matched sequence. For example, the pattern:
<pre>
foo\Kbar
</pre>
matches "foobar", but reports that it has matched "bar". This feature is
similar to a lookbehind assertion
<a href="#lookbehind">(described below).</a>
However, unlike a lookbehind assertion, \K does not require the part of the
subject before the real match to be of fixed length. The use of \K does
not interfere with the setting of
<a href="#subpattern">captured substrings.</a>
For example, when the pattern
<pre>
(foo)\Kbar
</pre>