| 24 |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
| 25 |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
| 26 |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
| 27 |
<li><a name="TOC12" href="#SEC12">NAMED SUBPATTERNS</a> |
<li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a> |
| 28 |
<li><a name="TOC13" href="#SEC13">REPETITION</a> |
<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a> |
| 29 |
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
<li><a name="TOC14" href="#SEC14">REPETITION</a> |
| 30 |
<li><a name="TOC15" href="#SEC15">BACK REFERENCES</a> |
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
| 31 |
<li><a name="TOC16" href="#SEC16">ASSERTIONS</a> |
<li><a name="TOC16" href="#SEC16">BACK REFERENCES</a> |
| 32 |
<li><a name="TOC17" href="#SEC17">CONDITIONAL SUBPATTERNS</a> |
<li><a name="TOC17" href="#SEC17">ASSERTIONS</a> |
| 33 |
<li><a name="TOC18" href="#SEC18">COMMENTS</a> |
<li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a> |
| 34 |
<li><a name="TOC19" href="#SEC19">RECURSIVE PATTERNS</a> |
<li><a name="TOC19" href="#SEC19">COMMENTS</a> |
| 35 |
<li><a name="TOC20" href="#SEC20">SUBPATTERNS AS SUBROUTINES</a> |
<li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a> |
| 36 |
<li><a name="TOC21" href="#SEC21">CALLOUTS</a> |
<li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a> |
| 37 |
<li><a name="TOC22" href="#SEC22">SEE ALSO</a> |
<li><a name="TOC22" href="#SEC22">CALLOUTS</a> |
| 38 |
<li><a name="TOC23" href="#SEC23">AUTHOR</a> |
<li><a name="TOC23" href="#SEC23">SEE ALSO</a> |
| 39 |
<li><a name="TOC24" href="#SEC24">REVISION</a> |
<li><a name="TOC24" href="#SEC24">AUTHOR</a> |
| 40 |
|
<li><a name="TOC25" href="#SEC25">REVISION</a> |
| 41 |
</ul> |
</ul> |
| 42 |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
| 43 |
<P> |
<P> |
| 271 |
<pre> |
<pre> |
| 272 |
\d any decimal digit |
\d any decimal digit |
| 273 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 274 |
|
\h any horizontal whitespace character |
| 275 |
|
\H any character that is not a horizontal whitespace character |
| 276 |
\s any whitespace character |
\s any whitespace character |
| 277 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
| 278 |
|
\v any vertical whitespace character |
| 279 |
|
\V any character that is not a vertical whitespace character |
| 280 |
\w any "word" character |
\w any "word" character |
| 281 |
\W any "non-word" character |
\W any "non-word" character |
| 282 |
</pre> |
</pre> |
| 292 |
<P> |
<P> |
| 293 |
For compatibility with Perl, \s does not match the VT character (code 11). |
For compatibility with Perl, \s does not match the VT character (code 11). |
| 294 |
This makes it different from the POSIX "space" class. The \s characters |
This makes it different from the POSIX "space" class. The \s characters |
| 295 |
are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is |
| 296 |
included in a Perl script, \s may match the VT character. In PCRE, it never |
included in a Perl script, \s may match the VT character. In PCRE, it never |
| 297 |
does.) |
does. |
| 298 |
|
</P> |
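<P>
The \s set described above can be sketched outside PCRE. The following is a minimal illustration in Python, not PCRE code: the names PCRE_SPACE and is_pcre_space are hypothetical, and the explicit character class simply spells out the five characters listed in the text. Python's own \s does match VT, which is why the class is written out by hand.
</P>

```python
import re

# Assumption: this class lists exactly the five characters the text gives
# for PCRE's \s: HT (9), LF (10), FF (12), CR (13), and space (32).
# It is written for Python's re module, whose built-in \s also matches VT.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")


def is_pcre_space(ch: str) -> bool:
    """Return True if the single character ch is in the listed set."""
    return PCRE_SPACE.fullmatch(ch) is not None


if __name__ == "__main__":
    for code in (9, 10, 11, 12, 13, 32):
        print(code, is_pcre_space(chr(code)))
```

Using an explicit class rather than the engine's own \s keeps the behaviour the same regardless of how a given regex engine treats VT.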
| 299 |
|
<P> |
| 300 |
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
| 301 |
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
| 302 |
|
character property support is available. These sequences retain their original |
| 303 |
|
meanings from before UTF-8 support was available, mainly for efficiency |
| 304 |
|
reasons. |
| 305 |
|
</P> |
| 306 |
|
<P> |
| 307 |
|
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the |
| 308 |
|
other sequences, these do match certain high-valued codepoints in UTF-8 mode. |
| 309 |
|
The horizontal space characters are: |
| 310 |
|
<pre> |
| 311 |
|
U+0009 Horizontal tab |
| 312 |
|
U+0020 Space |
| 313 |
|
U+00A0 Non-break space |
| 314 |
|
U+1680 Ogham space mark |
| 315 |
|
U+180E Mongolian vowel separator |
| 316 |
|
U+2000 En quad |
| 317 |
|
U+2001 Em quad |
| 318 |
|
U+2002 En space |
| 319 |
|
U+2003 Em space |
| 320 |
|
U+2004 Three-per-em space |
| 321 |
|
U+2005 Four-per-em space |
| 322 |
|
U+2006 Six-per-em space |
| 323 |
|
U+2007 Figure space |
| 324 |
|
U+2008 Punctuation space |
| 325 |
|
U+2009 Thin space |
| 326 |
|
U+200A Hair space |
| 327 |
|
U+202F Narrow no-break space |
| 328 |
|
U+205F Medium mathematical space |
| 329 |
|
U+3000 Ideographic space |
| 330 |
|
</pre> |
| 331 |
|
The vertical space characters are: |
| 332 |
|
<pre> |
| 333 |
|
U+000A Linefeed |
| 334 |
|
U+000B Vertical tab |
| 335 |
|
U+000C Formfeed |
| 336 |
|
U+000D Carriage return |
| 337 |
|
U+0085 Next line |
| 338 |
|
U+2028 Line separator |
| 339 |
|
U+2029 Paragraph separator |
| 340 |
|
</PRE> |
| 341 |
</P> |
</P> |
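<P>
The two lists above can be written down as lookup sets. This is a minimal sketch in Python, assuming nothing beyond the code points listed in the text; the names HORIZONTAL_SPACE, VERTICAL_SPACE and classify are hypothetical and do not correspond to any PCRE function.
</P>

```python
# Code points taken directly from the two lists in the text.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 (En quad) through U+200A (Hair space)
    + [0x202F, 0x205F, 0x3000]
)

VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def classify(ch: str):
    """Return 'h', 'v', or None for a single character.

    'h' means the character is in the horizontal list (what \\h is
    described as matching), 'v' means it is in the vertical list (\\v).
    """
    cp = ord(ch)
    if cp in HORIZONTAL_SPACE:
        return "h"
    if cp in VERTICAL_SPACE:
        return "v"
    return None


if __name__ == "__main__":
    for sample in ("\t", "\u3000", "\n", "\u2028", "a"):
        print(repr(sample), classify(sample))
```

The two sets are disjoint, so a character is never classified as both horizontal and vertical whitespace.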
| 342 |
<P> |
<P> |
| 343 |
A "word" character is an underscore or any character less than 256 that is a |
A "word" character is an underscore or any character less than 256 that is a |
| 349 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 350 |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
| 351 |
or "french" in Windows, some character codes greater than 128 are used for |
or "french" in Windows, some character codes greater than 128 are used for |
| 352 |
accented letters, and these are matched by \w. |
accented letters, and these are matched by \w. The use of locales with Unicode |
| 353 |
</P> |
is discouraged. |
|
<P> |
|
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
|
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
|
|
character property support is available. The use of locales with Unicode is |
|
|
discouraged. |
|
| 354 |
</P> |
</P> |
| 355 |
<br><b> |
<br><b> |
| 356 |
Newline sequences |
Newline sequences |
| 357 |
</b><br> |
</b><br> |
| 358 |
<P> |
<P> |
| 359 |
Outside a character class, the escape sequence \R matches any Unicode newline |
Outside a character class, the escape sequence \R matches any Unicode newline |
| 360 |
sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to |
sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to |
| 361 |
the following: |
the following: |
| 362 |
<pre> |
<pre> |
| 363 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 1009 |
is reached, an option setting in one branch does affect subsequent branches, so |
is reached, an option setting in one branch does affect subsequent branches, so |
| 1010 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
| 1011 |
</P> |
</P> |
| 1012 |
<br><a name="SEC12" href="#TOC1">NAMED SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> |
| 1013 |
|
<P> |
| 1014 |
|
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
| 1015 |
|
the same numbers for its capturing parentheses. Such a subpattern starts with |
| 1016 |
|
(?| and is itself a non-capturing subpattern. For example, consider this |
| 1017 |
|
pattern: |
| 1018 |
|
<pre> |
| 1019 |
|
(?|(Sat)ur|(Sun))day |
| 1020 |
|
</pre> |
| 1021 |
|
Because the two alternatives are inside a (?| group, both sets of capturing |
| 1022 |
|
parentheses are numbered one. Thus, when the pattern matches, you can look |
| 1023 |
|
at captured substring number one, whichever alternative matched. This construct |
| 1024 |
|
is useful when you want to capture part, but not all, of one of a number of |
| 1025 |
|
alternatives. Inside a (?| group, parentheses are numbered as usual, but the |
| 1026 |
|
number is reset at the start of each branch. The numbers of any capturing |
| 1027 |
|
buffers that follow the subpattern start after the highest number used in any |
| 1028 |
|
branch. The following example is taken from the Perl documentation. |
| 1029 |
|
The numbers underneath show in which buffer the captured content will be |
| 1030 |
|
stored. |
| 1031 |
|
<pre> |
| 1032 |
|
# before ---------------branch-reset----------- after |
| 1033 |
|
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| 1034 |
|
# 1 2 2 3 2 3 4 |
| 1035 |
|
</pre> |
| 1036 |
|
A backreference or a recursive call to a numbered subpattern always refers to |
| 1037 |
|
the first one in the pattern with the given number. |
| 1038 |
|
</P> |
| 1039 |
|
<P> |
| 1040 |
|
An alternative approach to using this "branch reset" feature is to use |
| 1041 |
|
duplicate named subpatterns, as described in the next section. |
| 1042 |
|
</P> |
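<P>
To show what the (?| group saves, here is a minimal sketch in Python, whose standard re module does not support the (?| syntax. Without branch reset, the two alternatives of the example pattern receive different numbers, so the caller has to check both. The names WITHOUT_RESET and captured_prefix are hypothetical.
</P>

```python
import re

# Assumption: Python's re module stands in for an engine without
# branch reset. A plain non-capturing group is used instead of (?| ... ),
# so (Sat) is group 1 and (Sun) is group 2.
WITHOUT_RESET = re.compile(r"(?:(Sat)ur|(Sun))day")


def captured_prefix(text: str):
    """Return the captured part of the day name, or None if no match."""
    m = WITHOUT_RESET.fullmatch(text)
    if m is None:
        return None
    # Only one of the two groups is set; the other is None.
    # With (?| ... ), as the text describes, both would be group 1
    # and this selection step would not be needed.
    return m.group(1) if m.group(1) is not None else m.group(2)


if __name__ == "__main__":
    for day in ("Saturday", "Sunday", "Monday"):
        print(day, captured_prefix(day))
```

The extra selection step grows with the number of alternatives, which is the bookkeeping the branch reset group removes.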
| 1043 |
|
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> |
| 1044 |
<P> |
<P> |
| 1045 |
Identifying capturing parentheses by number is simple, but it can be very hard |
Identifying capturing parentheses by number is simple, but it can be very hard |
| 1046 |
to keep track of the numbers in complicated regular expressions. Furthermore, |
to keep track of the numbers in complicated regular expressions. Furthermore, |
| 1082 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
| 1083 |
</pre> |
</pre> |
| 1084 |
There are five capturing substrings, but only one is ever set after a match. |
There are five capturing substrings, but only one is ever set after a match. |
| 1085 |
|
(An alternative way of solving this problem is to use a "branch reset" |
| 1086 |
|
subpattern, as described in the previous section.) |
| 1087 |
|
</P> |
| 1088 |
|
<P> |
| 1089 |
The convenience function for extracting the data by name returns the substring |
The convenience function for extracting the data by name returns the substring |
| 1090 |
for the first (and in this example, the only) subpattern of that name that |
for the first (and in this example, the only) subpattern of that name that |
| 1091 |
matched. This saves searching to find which numbered subpattern it was. If you |
matched. This saves searching to find which numbered subpattern it was. If you |
| 1095 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 1096 |
documentation. |
documentation. |
| 1097 |
</P> |
</P> |
| 1098 |
<br><a name="SEC13" href="#TOC1">REPETITION</a><br> |
<br><a name="SEC14" href="#TOC1">REPETITION</a><br> |
| 1099 |
<P> |
<P> |
| 1100 |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
| 1101 |
items: |
items: |
| 1246 |
</pre> |
</pre> |
| 1247 |
matches "aba" the value of the second captured substring is "b". |
matches "aba" the value of the second captured substring is "b". |
| 1248 |
<a name="atomicgroup"></a></P> |
<a name="atomicgroup"></a></P> |
| 1249 |
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
| 1250 |
<P> |
<P> |
| 1251 |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| 1252 |
repetition, failure of what follows normally causes the repeated item to be |
repetition, failure of what follows normally causes the repeated item to be |
| 1345 |
</pre> |
</pre> |
| 1346 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
| 1347 |
<a name="backreferences"></a></P> |
<a name="backreferences"></a></P> |
| 1348 |
<br><a name="SEC15" href="#TOC1">BACK REFERENCES</a><br> |
<br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br> |
| 1349 |
<P> |
<P> |
| 1350 |
Outside a character class, a backslash followed by a digit greater than 0 (and |
Outside a character class, a backslash followed by a digit greater than 0 (and |
| 1351 |
possibly further digits) is a back reference to a capturing subpattern earlier |
possibly further digits) is a back reference to a capturing subpattern earlier |
| 1458 |
done using alternation, as in the example above, or by a quantifier with a |
done using alternation, as in the example above, or by a quantifier with a |
| 1459 |
minimum of zero. |
minimum of zero. |
| 1460 |
<a name="bigassertions"></a></P> |
<a name="bigassertions"></a></P> |
| 1461 |
<br><a name="SEC16" href="#TOC1">ASSERTIONS</a><br> |
<br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br> |
| 1462 |
<P> |
<P> |
| 1463 |
An assertion is a test on the characters following or preceding the current |
An assertion is a test on the characters following or preceding the current |
| 1464 |
matching point that does not actually consume any characters. The simple |
matching point that does not actually consume any characters. The simple |
| 1618 |
is another pattern that matches "foo" preceded by three digits and any three |
is another pattern that matches "foo" preceded by three digits and any three |
| 1619 |
characters that are not "999". |
characters that are not "999". |
| 1620 |
<a name="conditions"></a></P> |
<a name="conditions"></a></P> |
| 1621 |
<br><a name="SEC17" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
<br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
| 1622 |
<P> |
<P> |
| 1623 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
| 1624 |
conditionally or to choose between two alternative subpatterns, depending on |
conditionally or to choose between two alternative subpatterns, depending on |
| 1756 |
against the second. This pattern matches strings in one of the two forms |
against the second. This pattern matches strings in one of the two forms |
| 1757 |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
| 1758 |
<a name="comments"></a></P> |
<a name="comments"></a></P> |
| 1759 |
<br><a name="SEC18" href="#TOC1">COMMENTS</a><br> |
<br><a name="SEC19" href="#TOC1">COMMENTS</a><br> |
| 1760 |
<P> |
<P> |
| 1761 |
The sequence (?# marks the start of a comment that continues up to the next |
The sequence (?# marks the start of a comment that continues up to the next |
| 1762 |
closing parenthesis. Nested parentheses are not permitted. The characters |
closing parenthesis. Nested parentheses are not permitted. The characters |
| 1767 |
character class introduces a comment that continues to immediately after the |
character class introduces a comment that continues to immediately after the |
| 1768 |
next newline in the pattern. |
next newline in the pattern. |
| 1769 |
<a name="recursion"></a></P> |
<a name="recursion"></a></P> |
| 1770 |
<br><a name="SEC19" href="#TOC1">RECURSIVE PATTERNS</a><br> |
<br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br> |
| 1771 |
<P> |
<P> |
| 1772 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
| 1773 |
unlimited nested parentheses. Without the use of recursion, the best that can |
unlimited nested parentheses. Without the use of recursion, the best that can |
| 1897 |
different alternatives for the recursive and non-recursive cases. The (?R) item |
different alternatives for the recursive and non-recursive cases. The (?R) item |
| 1898 |
is the actual recursive call. |
is the actual recursive call. |
| 1899 |
<a name="subpatternsassubroutines"></a></P> |
<a name="subpatternsassubroutines"></a></P> |
| 1900 |
<br><a name="SEC20" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
<br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
| 1901 |
<P> |
<P> |
| 1902 |
If the syntax for a recursive subpattern reference (either by number or by |
If the syntax for a recursive subpattern reference (either by number or by |
| 1903 |
name) is used outside the parentheses to which it refers, it operates like a |
name) is used outside the parentheses to which it refers, it operates like a |
| 1937 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 1938 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
| 1939 |
</P> |
</P> |
| 1940 |
<br><a name="SEC21" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC22" href="#TOC1">CALLOUTS</a><br> |
| 1941 |
<P> |
<P> |
| 1942 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
| 1943 |
code to be obeyed in the middle of matching a regular expression. This makes it |
code to be obeyed in the middle of matching a regular expression. This makes it |
| 1972 |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
| 1973 |
documentation. |
documentation. |
| 1974 |
</P> |
</P> |
| 1975 |
<br><a name="SEC22" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br> |
| 1976 |
<P> |
<P> |
| 1977 |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
| 1978 |
</P> |
</P> |
| 1979 |
<br><a name="SEC23" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC24" href="#TOC1">AUTHOR</a><br> |
| 1980 |
<P> |
<P> |
| 1981 |
Philip Hazel |
Philip Hazel |
| 1982 |
<br> |
<br> |
| 1985 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 1986 |
<br> |
<br> |
| 1987 |
</P> |
</P> |
| 1988 |
<br><a name="SEC24" href="#TOC1">REVISION</a><br> |
<br><a name="SEC25" href="#TOC1">REVISION</a><br> |
| 1989 |
<P> |
<P> |
| 1990 |
Last updated: 29 May 2007 |
Last updated: 13 June 2007 |
| 1991 |
<br> |
<br> |
| 1992 |
Copyright © 1997-2007 University of Cambridge. |
Copyright © 1997-2007 University of Cambridge. |
| 1993 |
<br> |
<br> |