| 24 |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
| 25 |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
| 26 |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
| 27 |
<li><a name="TOC12" href="#SEC12">NAMED SUBPATTERNS</a> |
<li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a> |
| 28 |
<li><a name="TOC13" href="#SEC13">REPETITION</a> |
<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a> |
| 29 |
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
<li><a name="TOC14" href="#SEC14">REPETITION</a> |
| 30 |
<li><a name="TOC15" href="#SEC15">BACK REFERENCES</a> |
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
| 31 |
<li><a name="TOC16" href="#SEC16">ASSERTIONS</a> |
<li><a name="TOC16" href="#SEC16">BACK REFERENCES</a> |
| 32 |
<li><a name="TOC17" href="#SEC17">CONDITIONAL SUBPATTERNS</a> |
<li><a name="TOC17" href="#SEC17">ASSERTIONS</a> |
| 33 |
<li><a name="TOC18" href="#SEC18">COMMENTS</a> |
<li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a> |
| 34 |
<li><a name="TOC19" href="#SEC19">RECURSIVE PATTERNS</a> |
<li><a name="TOC19" href="#SEC19">COMMENTS</a> |
| 35 |
<li><a name="TOC20" href="#SEC20">SUBPATTERNS AS SUBROUTINES</a> |
<li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a> |
| 36 |
<li><a name="TOC21" href="#SEC21">CALLOUTS</a> |
<li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a> |
| 37 |
<li><a name="TOC22" href="#SEC22">SEE ALSO</a> |
<li><a name="TOC22" href="#SEC22">CALLOUTS</a> |
| 38 |
<li><a name="TOC23" href="#SEC23">AUTHOR</a> |
<li><a name="TOC23" href="#SEC23">SEE ALSO</a> |
| 39 |
<li><a name="TOC24" href="#SEC24">REVISION</a> |
<li><a name="TOC24" href="#SEC24">AUTHOR</a> |
| 40 |
|
<li><a name="TOC25" href="#SEC25">REVISION</a> |
| 41 |
</ul> |
</ul> |
| 42 |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
| 43 |
<P> |
<P> |
| 64 |
PCRE when its main matching function, <b>pcre_exec()</b>, is used. |
PCRE when its main matching function, <b>pcre_exec()</b>, is used. |
| 65 |
From release 6.0, PCRE offers a second matching function, |
From release 6.0, PCRE offers a second matching function, |
| 66 |
<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not |
<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not |
| 67 |
Perl-compatible. The advantages and disadvantages of the alternative function, |
Perl-compatible. Some of the features discussed below are not available when |
| 68 |
and how it differs from the normal function, are discussed in the |
<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the |
| 69 |
|
alternative function, and how it differs from the normal function, are |
| 70 |
|
discussed in the |
| 71 |
<a href="pcrematching.html"><b>pcrematching</b></a> |
<a href="pcrematching.html"><b>pcrematching</b></a> |
| 72 |
page. |
page. |
| 73 |
</P> |
</P> |
| 256 |
</b><br> |
</b><br> |
| 257 |
<P> |
<P> |
| 258 |
The sequence \g followed by a positive or negative number, optionally enclosed |
The sequence \g followed by a positive or negative number, optionally enclosed |
| 259 |
in braces, is an absolute or relative back reference. Back references are |
in braces, is an absolute or relative back reference. A named back reference |
| 260 |
discussed |
can be coded as \g{name}. Back references are discussed |
| 261 |
<a href="#backreferences">later,</a> |
<a href="#backreferences">later,</a> |
| 262 |
following the discussion of |
following the discussion of |
| 263 |
<a href="#subpattern">parenthesized subpatterns.</a> |
<a href="#subpattern">parenthesized subpatterns.</a> |
| 271 |
<pre> |
<pre> |
| 272 |
\d any decimal digit |
\d any decimal digit |
| 273 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 274 |
|
\h any horizontal whitespace character |
| 275 |
|
\H any character that is not a horizontal whitespace character |
| 276 |
\s any whitespace character |
\s any whitespace character |
| 277 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
| 278 |
|
\v any vertical whitespace character |
| 279 |
|
\V any character that is not a vertical whitespace character |
| 280 |
\w any "word" character |
\w any "word" character |
| 281 |
\W any "non-word" character |
\W any "non-word" character |
| 282 |
</pre> |
</pre> |
| 292 |
<P> |
<P> |
| 293 |
For compatibility with Perl, \s does not match the VT character (code 11). |
For compatibility with Perl, \s does not match the VT character (code 11). |
| 294 |
This makes it different from the POSIX "space" class. The \s characters |
This makes it different from the POSIX "space" class. The \s characters |
| 295 |
are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is |
| 296 |
included in a Perl script, \s may match the VT character. In PCRE, it never |
included in a Perl script, \s may match the VT character. In PCRE, it never |
| 297 |
does.) |
does. |
| 298 |
|
</P> |
| 299 |
|
<P> |
| 300 |
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
| 301 |
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
| 302 |
|
character property support is available. These sequences retain their original |
| 303 |
|
meanings from before UTF-8 support was available, mainly for efficiency |
| 304 |
|
reasons. |
| 305 |
|
</P> |
| 306 |
|
<P> |
| 307 |
|
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the |
| 308 |
|
other sequences, these do match certain high-valued codepoints in UTF-8 mode. |
| 309 |
|
The horizontal space characters are: |
| 310 |
|
<pre> |
| 311 |
|
U+0009 Horizontal tab |
| 312 |
|
U+0020 Space |
| 313 |
|
U+00A0 Non-break space |
| 314 |
|
U+1680 Ogham space mark |
| 315 |
|
U+180E Mongolian vowel separator |
| 316 |
|
U+2000 En quad |
| 317 |
|
U+2001 Em quad |
| 318 |
|
U+2002 En space |
| 319 |
|
U+2003 Em space |
| 320 |
|
U+2004 Three-per-em space |
| 321 |
|
U+2005 Four-per-em space |
| 322 |
|
U+2006 Six-per-em space |
| 323 |
|
U+2007 Figure space |
| 324 |
|
U+2008 Punctuation space |
| 325 |
|
U+2009 Thin space |
| 326 |
|
U+200A Hair space |
| 327 |
|
U+202F Narrow no-break space |
| 328 |
|
U+205F Medium mathematical space |
| 329 |
|
U+3000 Ideographic space |
| 330 |
|
</pre> |
| 331 |
|
The vertical space characters are: |
| 332 |
|
<pre> |
| 333 |
|
U+000A Linefeed |
| 334 |
|
U+000B Vertical tab |
| 335 |
|
U+000C Formfeed |
| 336 |
|
U+000D Carriage return |
| 337 |
|
U+0085 Next line |
| 338 |
|
U+2028 Line separator |
| 339 |
|
U+2029 Paragraph separator |
| 340 |
|
</PRE> |
| 341 |
</P> |
</P> |
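<P>
As a rough illustration only, the two lists above can be written out as explicit
character classes in an engine that has no \h or \v. The sketch below uses
Python's standard re module as a stand-in, not PCRE itself. The names
HORIZONTAL, VERTICAL, H_SPACE, V_SPACE and split_on_horizontal_space are
hypothetical and exist only for this example.
</P>

```python
import re

# Code points copied from the two lists above. Consecutive code points are
# written as character-class ranges (U+2000-U+200A and U+000A-U+000D).
HORIZONTAL = "\u0009\u0020\u00a0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000"
VERTICAL = "\u000a-\u000d\u0085\u2028\u2029"

# Hypothetical stand-ins for \h, \H, \v and \V.
H_SPACE = re.compile("[" + HORIZONTAL + "]")
NOT_H_SPACE = re.compile("[^" + HORIZONTAL + "]")
V_SPACE = re.compile("[" + VERTICAL + "]")
NOT_V_SPACE = re.compile("[^" + VERTICAL + "]")


def split_on_horizontal_space(text):
    """Split text on runs of the horizontal space characters listed above."""
    return [part for part in re.split("[" + HORIZONTAL + "]+", text) if part]
```

<P>
Spelling the classes out this way keeps the behaviour independent of whatever
the host engine's own \s happens to match.
</P>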
| 342 |
<P> |
<P> |
| 343 |
A "word" character is an underscore or any character less than 256 that is a |
A "word" character is an underscore or any character less than 256 that is a |
| 347 |
<a href="pcreapi.html#localesupport">"Locale support"</a> |
<a href="pcreapi.html#localesupport">"Locale support"</a> |
| 348 |
in the |
in the |
| 349 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 350 |
page). For example, in the "fr_FR" (French) locale, some character codes |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
| 351 |
greater than 128 are used for accented letters, and these are matched by \w. |
or "french" in Windows, some character codes greater than 128 are used for |
| 352 |
</P> |
accented letters, and these are matched by \w. The use of locales with Unicode |
| 353 |
<P> |
is discouraged. |
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
|
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
|
|
character property support is available. The use of locales with Unicode is |
|
|
discouraged. |
|
| 354 |
</P> |
</P> |
| 355 |
<br><b> |
<br><b> |
| 356 |
Newline sequences |
Newline sequences |
| 357 |
</b><br> |
</b><br> |
| 358 |
<P> |
<P> |
| 359 |
Outside a character class, the escape sequence \R matches any Unicode newline |
Outside a character class, the escape sequence \R matches any Unicode newline |
| 360 |
sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to |
sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to |
| 361 |
the following: |
the following: |
| 362 |
<pre> |
<pre> |
| 363 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 573 |
a structure that contains data for over fifteen thousand characters. That is |
a structure that contains data for over fifteen thousand characters. That is |
| 574 |
why the traditional escape sequences such as \d and \w do not use Unicode |
why the traditional escape sequences such as \d and \w do not use Unicode |
| 575 |
properties in PCRE. |
properties in PCRE. |
| 576 |
|
<a name="resetmatchstart"></a></P> |
| 577 |
|
<br><b> |
| 578 |
|
Resetting the match start |
| 579 |
|
</b><br> |
| 580 |
|
<P> |
| 581 |
|
The escape sequence \K, which is a Perl 5.10 feature, causes any previously |
| 582 |
|
matched characters not to be included in the final matched sequence. For |
| 583 |
|
example, the pattern: |
| 584 |
|
<pre> |
| 585 |
|
foo\Kbar |
| 586 |
|
</pre> |
| 587 |
|
matches "foobar", but reports that it has matched "bar". This feature is |
| 588 |
|
similar to a lookbehind assertion |
| 589 |
|
<a href="#lookbehind">(described below).</a> |
| 590 |
|
However, in this case, the part of the subject before the real match does not |
| 591 |
|
have to be of fixed length, as lookbehind assertions do. The use of \K does |
| 592 |
|
not interfere with the setting of |
| 593 |
|
<a href="#subpattern">captured substrings.</a> |
| 594 |
|
For example, when the pattern |
| 595 |
|
<pre> |
| 596 |
|
(foo)\Kbar |
| 597 |
|
</pre> |
| 598 |
|
matches "foobar", the first substring is still set to "foo". |
| 599 |
<a name="smallassertions"></a></P> |
<a name="smallassertions"></a></P> |
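<P>
The contrast with lookbehind can be sketched in an engine that lacks \K.
Python's standard re module is used here purely as a stand-in and does not
support \K. It accepts a fixed-length lookbehind, rejects a variable-length
one, and needs a capturing group to approximate what \K reports. The helper
lookbehind_compiles is a hypothetical name used only in this example.
</P>

```python
import re

# Fixed-length lookbehind: "foo" must precede the match, but only "bar"
# is reported, which is the effect foo\Kbar has in PCRE.
m = re.search(r"(?<=foo)bar", "xfoobar")
reported = (m.group(), m.start())


def lookbehind_compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A variable-length prefix cannot go in a lookbehind here, so a capturing
# group is used instead: the overall match includes the prefix, and group 1
# holds the part that \K would have reported.
m2 = re.search(r"fo+(bar)", "xfoooobar")
kept = (m2.group(1), m2.span(1))
```

<P>
The capturing-group workaround changes the overall match, which still includes
the prefix, so it is an approximation rather than an exact equivalent of \K.
</P>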
| 600 |
<br><b> |
<br><b> |
| 601 |
Simple assertions |
Simple assertions |
| 825 |
If a range that includes letters is used when caseless matching is set, it |
If a range that includes letters is used when caseless matching is set, it |
| 826 |
matches the letters in either case. For example, [W-c] is equivalent to |
matches the letters in either case. For example, [W-c] is equivalent to |
| 827 |
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character |
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character |
| 828 |
tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches accented E |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
| 829 |
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for |
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for |
| 830 |
characters with values greater than 128 only when it is compiled with Unicode |
characters with values greater than 128 only when it is compiled with Unicode |
| 831 |
property support. |
property support. |
| 1009 |
is reached, an option setting in one branch does affect subsequent branches, so |
is reached, an option setting in one branch does affect subsequent branches, so |
| 1010 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
| 1011 |
</P> |
</P> |
| 1012 |
<br><a name="SEC12" href="#TOC1">NAMED SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> |
| 1013 |
|
<P> |
| 1014 |
|
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
| 1015 |
|
the same numbers for its capturing parentheses. Such a subpattern starts with |
| 1016 |
|
(?| and is itself a non-capturing subpattern. For example, consider this |
| 1017 |
|
pattern: |
| 1018 |
|
<pre> |
| 1019 |
|
(?|(Sat)ur|(Sun))day |
| 1020 |
|
</pre> |
| 1021 |
|
Because the two alternatives are inside a (?| group, both sets of capturing |
| 1022 |
|
parentheses are numbered one. Thus, when the pattern matches, you can look |
| 1023 |
|
at captured substring number one, whichever alternative matched. This construct |
| 1024 |
|
is useful when you want to capture part, but not all, of one of a number of |
| 1025 |
|
alternatives. Inside a (?| group, parentheses are numbered as usual, but the |
| 1026 |
|
number is reset at the start of each branch. The numbers of any capturing |
| 1027 |
|
buffers that follow the subpattern start after the highest number used in any |
| 1028 |
|
branch. The following example is taken from the Perl documentation. |
| 1029 |
|
The numbers underneath show in which buffer the captured content will be |
| 1030 |
|
stored. |
| 1031 |
|
<pre> |
| 1032 |
|
# before ---------------branch-reset----------- after |
| 1033 |
|
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| 1034 |
|
# 1 2 2 3 2 3 4 |
| 1035 |
|
</pre> |
| 1036 |
|
A backreference or a recursive call to a numbered subpattern always refers to |
| 1037 |
|
the first one in the pattern with the given number. |
| 1038 |
|
</P> |
| 1039 |
|
<P> |
| 1040 |
|
An alternative approach to using this "branch reset" feature is to use |
| 1041 |
|
duplicate named subpatterns, as described in the next section. |
| 1042 |
|
</P> |
| 1043 |
|
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> |
| 1044 |
<P> |
<P> |
| 1045 |
Identifying capturing parentheses by number is simple, but it can be very hard |
Identifying capturing parentheses by number is simple, but it can be very hard |
| 1046 |
to keep track of the numbers in complicated regular expressions. Furthermore, |
to keep track of the numbers in complicated regular expressions. Furthermore, |
| 1082 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
| 1083 |
</pre> |
</pre> |
| 1084 |
There are five capturing substrings, but only one is ever set after a match. |
There are five capturing substrings, but only one is ever set after a match. |
| 1085 |
|
(An alternative way of solving this problem is to use a "branch reset" |
| 1086 |
|
subpattern, as described in the previous section.) |
| 1087 |
|
</P> |
| 1088 |
|
<P> |
| 1089 |
The convenience function for extracting the data by name returns the substring |
The convenience function for extracting the data by name returns the substring |
| 1090 |
for the first (and in this example, the only) subpattern of that name that |
for the first (and in this example, the only) subpattern of that name that |
| 1091 |
matched. This saves searching to find which numbered subpattern it was. If you |
matched. This saves searching to find which numbered subpattern it was. If you |
| 1095 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 1096 |
documentation. |
documentation. |
| 1097 |
</P> |
</P> |
| 1098 |
<br><a name="SEC13" href="#TOC1">REPETITION</a><br> |
<br><a name="SEC14" href="#TOC1">REPETITION</a><br> |
| 1099 |
<P> |
<P> |
| 1100 |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
| 1101 |
items: |
items: |
| 1246 |
</pre> |
</pre> |
| 1247 |
matches "aba" the value of the second captured substring is "b". |
matches "aba" the value of the second captured substring is "b". |
| 1248 |
<a name="atomicgroup"></a></P> |
<a name="atomicgroup"></a></P> |
| 1249 |
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
| 1250 |
<P> |
<P> |
| 1251 |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| 1252 |
repetition, failure of what follows normally causes the repeated item to be |
repetition, failure of what follows normally causes the repeated item to be |
| 1345 |
</pre> |
</pre> |
| 1346 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
| 1347 |
<a name="backreferences"></a></P> |
<a name="backreferences"></a></P> |
| 1348 |
<br><a name="SEC15" href="#TOC1">BACK REFERENCES</a><br> |
<br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br> |
| 1349 |
<P> |
<P> |
| 1350 |
Outside a character class, a backslash followed by a digit greater than 0 (and |
Outside a character class, a backslash followed by a digit greater than 0 (and |
| 1351 |
possibly further digits) is a back reference to a capturing subpattern earlier |
possibly further digits) is a back reference to a capturing subpattern earlier |
| 1412 |
capturing subpattern is matched caselessly. |
capturing subpattern is matched caselessly. |
| 1413 |
</P> |
</P> |
| 1414 |
<P> |
<P> |
| 1415 |
Back references to named subpatterns use the Perl syntax \k<name> or \k'name' |
There are several different ways of writing back references to named |
| 1416 |
or the Python syntax (?P=name). We could rewrite the above example in either of |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
| 1417 |
|
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified |
| 1418 |
|
back reference syntax, in which \g can be used for both numeric and named |
| 1419 |
|
references, is also supported. We could rewrite the above example in any of |
| 1420 |
the following ways: |
the following ways: |
| 1421 |
<pre> |
<pre> |
| 1422 |
(?<p1>(?i)rah)\s+\k<p1> |
(?<p1>(?i)rah)\s+\k<p1> |
| 1423 |
|
(?'p1'(?i)rah)\s+\k{p1} |
| 1424 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
| 1425 |
|
(?<p1>(?i)rah)\s+\g{p1} |
| 1426 |
</pre> |
</pre> |
| 1427 |
A subpattern that is referenced by name may appear in the pattern before or |
A subpattern that is referenced by name may appear in the pattern before or |
| 1428 |
after the reference. |
after the reference. |
| 1458 |
done using alternation, as in the example above, or by a quantifier with a |
done using alternation, as in the example above, or by a quantifier with a |
| 1459 |
minimum of zero. |
minimum of zero. |
| 1460 |
<a name="bigassertions"></a></P> |
<a name="bigassertions"></a></P> |
| 1461 |
<br><a name="SEC16" href="#TOC1">ASSERTIONS</a><br> |
<br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br> |
| 1462 |
<P> |
<P> |
| 1463 |
An assertion is a test on the characters following or preceding the current |
An assertion is a test on the characters following or preceding the current |
| 1464 |
matching point that does not actually consume any characters. The simple |
matching point that does not actually consume any characters. The simple |
| 1540 |
<pre> |
<pre> |
| 1541 |
(?<=abc|abde) |
(?<=abc|abde) |
| 1542 |
</pre> |
</pre> |
| 1543 |
|
In some cases, the Perl 5.10 escape sequence \K |
| 1544 |
|
<a href="#resetmatchstart">(see above)</a> |
| 1545 |
|
can be used instead of a lookbehind assertion; this is not restricted to a |
| 1546 |
|
fixed length. |
| 1547 |
|
</P> |
| 1548 |
|
<P> |
| 1549 |
The implementation of lookbehind assertions is, for each alternative, to |
The implementation of lookbehind assertions is, for each alternative, to |
| 1550 |
temporarily move the current position back by the fixed length and then try to |
temporarily move the current position back by the fixed length and then try to |
| 1551 |
match. If there are insufficient characters before the current position, the |
match. If there are insufficient characters before the current position, the |
| 1618 |
is another pattern that matches "foo" preceded by three digits and any three |
is another pattern that matches "foo" preceded by three digits and any three |
| 1619 |
characters that are not "999". |
characters that are not "999". |
| 1620 |
<a name="conditions"></a></P> |
<a name="conditions"></a></P> |
| 1621 |
<br><a name="SEC17" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
<br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
| 1622 |
<P> |
<P> |
| 1623 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
| 1624 |
conditionally or to choose between two alternative subpatterns, depending on |
conditionally or to choose between two alternative subpatterns, depending on |
| 1642 |
<P> |
<P> |
| 1643 |
If the text between the parentheses consists of a sequence of digits, the |
If the text between the parentheses consists of a sequence of digits, the |
| 1644 |
condition is true if the capturing subpattern of that number has previously |
condition is true if the capturing subpattern of that number has previously |
| 1645 |
matched. |
matched. An alternative notation is to precede the digits with a plus or minus |
| 1646 |
|
sign. In this case, the subpattern number is relative rather than absolute. |
| 1647 |
|
The most recently opened parentheses can be referenced by (?(-1), the next most |
| 1648 |
|
recent by (?(-2), and so on. In looping constructs it can also make sense to |
| 1649 |
|
refer to subsequent groups with constructs such as (?(+2). |
| 1650 |
</P> |
</P> |
| 1651 |
<P> |
<P> |
| 1652 |
Consider the following pattern, which contains non-significant white space to |
Consider the following pattern, which contains non-significant white space to |
| 1665 |
subpattern matches nothing. In other words, this pattern matches a sequence of |
subpattern matches nothing. In other words, this pattern matches a sequence of |
| 1666 |
non-parentheses, optionally enclosed in parentheses. |
non-parentheses, optionally enclosed in parentheses. |
| 1667 |
</P> |
</P> |
| 1668 |
|
<P> |
| 1669 |
|
If you were embedding this pattern in a larger one, you could use a relative |
| 1670 |
|
reference: |
| 1671 |
|
<pre> |
| 1672 |
|
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| 1673 |
|
</pre> |
| 1674 |
|
This makes the fragment independent of the parentheses in the larger pattern. |
| 1675 |
|
</P> |
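<P>
The conditional can be tried out with Python's standard re module, used here
only as a stand-in for PCRE. That module accepts the absolute form (?(1)...)
but not the relative form (?(-1)...), so this sketch uses group number 1. The
names OPTIONAL_PARENS and is_optionally_parenthesized are hypothetical.
</P>

```python
import re

# Group 1 is the optional opening parenthesis. The conditional (?(1) \) )
# requires a closing parenthesis only if group 1 took part in the match.
# re.VERBOSE makes the white space in the pattern non-significant.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """True if text is a run of non-parentheses, optionally enclosed in ()."""
    return OPTIONAL_PARENS.fullmatch(text) is not None
```

<P>
With this sketch, "(abc)" and "abc" are accepted, while "(abc" and "abc)" are
rejected, because the closing parenthesis is required exactly when the opening
one was matched.
</P>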
| 1676 |
<br><b> |
<br><b> |
| 1677 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
| 1678 |
</b><br> |
</b><br> |
| 1756 |
against the second. This pattern matches strings in one of the two forms |
against the second. This pattern matches strings in one of the two forms |
| 1757 |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
| 1758 |
<a name="comments"></a></P> |
<a name="comments"></a></P> |
| 1759 |
<br><a name="SEC18" href="#TOC1">COMMENTS</a><br> |
<br><a name="SEC19" href="#TOC1">COMMENTS</a><br> |
| 1760 |
<P> |
<P> |
| 1761 |
The sequence (?# marks the start of a comment that continues up to the next |
The sequence (?# marks the start of a comment that continues up to the next |
| 1762 |
closing parenthesis. Nested parentheses are not permitted. The characters |
closing parenthesis. Nested parentheses are not permitted. The characters |
| 1767 |
character class introduces a comment that continues to immediately after the |
character class introduces a comment that continues to immediately after the |
| 1768 |
next newline in the pattern. |
next newline in the pattern. |
| 1769 |
<a name="recursion"></a></P> |
<a name="recursion"></a></P> |
| 1770 |
<br><a name="SEC19" href="#TOC1">RECURSIVE PATTERNS</a><br> |
<br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br> |
| 1771 |
<P> |
<P> |
| 1772 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
| 1773 |
unlimited nested parentheses. Without the use of recursion, the best that can |
unlimited nested parentheses. Without the use of recursion, the best that can |
| 1823 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
| 1824 |
</pre> |
</pre> |
| 1825 |
We have put the pattern into parentheses, and caused the recursion to refer to |
We have put the pattern into parentheses, and caused the recursion to refer to |
| 1826 |
them instead of the whole pattern. In a larger pattern, keeping track of |
them instead of the whole pattern. |
| 1827 |
parenthesis numbers can be tricky. It may be more convenient to use named |
</P> |
| 1828 |
parentheses instead. The Perl syntax for this is (?&name); PCRE's earlier |
<P> |
| 1829 |
syntax (?P>name) is also supported. We could rewrite the above example as |
In a larger pattern, keeping track of parenthesis numbers can be tricky. This |
| 1830 |
follows: |
is made easier by the use of relative references. (A Perl 5.10 feature.) |
| 1831 |
|
Instead of (?1) in the pattern above you can write (?-2) to refer to the second |
| 1832 |
|
most recently opened parentheses preceding the recursion. In other words, a |
| 1833 |
|
negative number counts capturing parentheses leftwards from the point at which |
| 1834 |
|
it is encountered. |
| 1835 |
|
</P> |
| 1836 |
|
<P> |
| 1837 |
|
It is also possible to refer to subsequently opened parentheses, by writing |
| 1838 |
|
references such as (?+2). However, these cannot be recursive because the |
| 1839 |
|
reference is not inside the parentheses that are referenced. They are always |
| 1840 |
|
"subroutine" calls, as described in the next section. |
| 1841 |
|
</P> |
| 1842 |
|
<P> |
| 1843 |
|
An alternative approach is to use named parentheses instead. The Perl syntax |
| 1844 |
|
for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We |
| 1845 |
|
could rewrite the above example as follows: |
| 1846 |
<pre> |
<pre> |
| 1847 |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
| 1848 |
</pre> |
</pre> |
| 1849 |
If there is more than one subpattern with the same name, the earliest one is |
If there is more than one subpattern with the same name, the earliest one is |
| 1850 |
used. This particular example pattern contains nested unlimited repeats, and so |
used. |
| 1851 |
the use of atomic grouping for matching strings of non-parentheses is important |
</P> |
| 1852 |
when applying the pattern to strings that do not match. For example, when this |
<P> |
| 1853 |
pattern is applied to |
This particular example pattern that we have been looking at contains nested |
| 1854 |
|
unlimited repeats, and so the use of atomic grouping for matching strings of |
| 1855 |
|
non-parentheses is important when applying the pattern to strings that do not |
| 1856 |
|
match. For example, when this pattern is applied to |
| 1857 |
<pre> |
<pre> |
| 1858 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 1859 |
</pre> |
</pre> |
| 1897 |
different alternatives for the recursive and non-recursive cases. The (?R) item |
different alternatives for the recursive and non-recursive cases. The (?R) item |
| 1898 |
is the actual recursive call. |
is the actual recursive call. |
| 1899 |
<a name="subpatternsassubroutines"></a></P> |
<a name="subpatternsassubroutines"></a></P> |
| 1900 |
<br><a name="SEC20" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
<br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
| 1901 |
<P> |
<P> |
| 1902 |
If the syntax for a recursive subpattern reference (either by number or by |
If the syntax for a recursive subpattern reference (either by number or by |
| 1903 |
name) is used outside the parentheses to which it refers, it operates like a |
name) is used outside the parentheses to which it refers, it operates like a |
| 1904 |
subroutine in a programming language. The "called" subpattern may be defined |
subroutine in a programming language. The "called" subpattern may be defined |
| 1905 |
before or after the reference. An earlier example pointed out that the pattern |
before or after the reference. A numbered reference can be absolute or |
| 1906 |
|
relative, as in these examples: |
| 1907 |
|
<pre> |
| 1908 |
|
(...(absolute)...)...(?2)... |
| 1909 |
|
(...(relative)...)...(?-1)... |
| 1910 |
|
(...(?+1)...(relative)... |
| 1911 |
|
</pre> |
| 1912 |
|
An earlier example pointed out that the pattern |
| 1913 |
<pre> |
<pre> |
| 1914 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 1915 |
</pre> |
</pre> |
| 1932 |
case-independence are fixed when the subpattern is defined. They cannot be |
case-independence are fixed when the subpattern is defined. They cannot be |
| 1933 |
changed for different calls. For example, consider this pattern: |
changed for different calls. For example, consider this pattern: |
| 1934 |
<pre> |
<pre> |
| 1935 |
(abc)(?i:(?1)) |
(abc)(?i:(?-1)) |
| 1936 |
</pre> |
</pre> |
| 1937 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 1938 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
| 1939 |
</P> |
</P> |
| 1940 |
<br><a name="SEC21" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC22" href="#TOC1">CALLOUTS</a><br> |
| 1941 |
<P> |
<P> |
| 1942 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
| 1943 |
code to be obeyed in the middle of matching a regular expression. This makes it |
code to be obeyed in the middle of matching a regular expression. This makes it |
| 1956 |
can put a number less than 256 after the letter C. The default value is zero. |
can put a number less than 256 after the letter C. The default value is zero. |
| 1957 |
For example, this pattern has two callout points: |
For example, this pattern has two callout points: |
| 1958 |
<pre> |
<pre> |
| 1959 |
(?C1)\dabc(?C2)def |
(?C1)abc(?C2)def |
| 1960 |
</pre> |
</pre> |
| 1961 |
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are |
| 1962 |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
| 1972 |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
| 1973 |
documentation. |
documentation. |
| 1974 |
</P> |
</P> |
| 1975 |
<br><a name="SEC22" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br> |
| 1976 |
<P> |
<P> |
| 1977 |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
| 1978 |
</P> |
</P> |
| 1979 |
<br><a name="SEC23" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC24" href="#TOC1">AUTHOR</a><br> |
| 1980 |
<P> |
<P> |
| 1981 |
Philip Hazel |
Philip Hazel |
| 1982 |
<br> |
<br> |
| 1985 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 1986 |
<br> |
<br> |
| 1987 |
</P> |
</P> |
| 1988 |
<br><a name="SEC24" href="#TOC1">REVISION</a><br> |
<br><a name="SEC25" href="#TOC1">REVISION</a><br> |
| 1989 |
<P> |
<P> |
| 1990 |
Last updated: 06 March 2007 |
Last updated: 13 June 2007 |
| 1991 |
<br> |
<br> |
| 1992 |
Copyright © 1997-2007 University of Cambridge. |
Copyright © 1997-2007 University of Cambridge. |
| 1993 |
<br> |
<br> |