| 35 |
<li><a name="TOC20" href="#SEC20">COMMENTS</a> |
<li><a name="TOC20" href="#SEC20">COMMENTS</a> |
| 36 |
<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a> |
<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a> |
| 37 |
<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a> |
<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a> |
| 38 |
<li><a name="TOC23" href="#SEC23">CALLOUTS</a> |
<li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a> |
| 39 |
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a> |
<li><a name="TOC24" href="#SEC24">CALLOUTS</a> |
| 40 |
<li><a name="TOC25" href="#SEC25">SEE ALSO</a> |
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a> |
| 41 |
<li><a name="TOC26" href="#SEC26">AUTHOR</a> |
<li><a name="TOC26" href="#SEC26">SEE ALSO</a> |
| 42 |
<li><a name="TOC27" href="#SEC27">REVISION</a> |
<li><a name="TOC27" href="#SEC27">AUTHOR</a> |
| 43 |
|
<li><a name="TOC28" href="#SEC28">REVISION</a> |
| 44 |
</ul> |
</ul> |
| 45 |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
| 46 |
<P> |
<P> |
| 47 |
The syntax and semantics of the regular expressions that are supported by PCRE |
The syntax and semantics of the regular expressions that are supported by PCRE |
| 48 |
are described in detail below. There is a quick-reference syntax summary in the |
are described in detail below. There is a quick-reference syntax summary in the |
| 49 |
<a href="pcresyntax.html"><b>pcresyntax</b></a> |
<a href="pcresyntax.html"><b>pcresyntax</b></a> |
| 50 |
page. Perl's regular expressions are described in its own documentation, and |
page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE |
| 51 |
|
also supports some alternative regular expression syntax (which does not |
| 52 |
|
conflict with the Perl syntax) in order to provide some compatibility with |
| 53 |
|
regular expressions in Python, .NET, and Oniguruma. |
| 54 |
|
</P> |
| 55 |
|
<P> |
| 56 |
|
Perl's regular expressions are described in its own documentation, and |
| 57 |
regular expressions in general are covered in a number of books, some of which |
regular expressions in general are covered in a number of books, some of which |
| 58 |
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", |
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", |
| 59 |
published by O'Reilly, covers regular expressions in great detail. This |
published by O'Reilly, covers regular expressions in great detail. This |
| 63 |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
| 64 |
there is now also support for UTF-8 character strings. To use this, you must |
there is now also support for UTF-8 character strings. To use this, you must |
| 65 |
build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with |
build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with |
| 66 |
the PCRE_UTF8 option. How this affects pattern matching is mentioned in several |
the PCRE_UTF8 option. There is also a special sequence that can be given at the |
| 67 |
places below. There is also a summary of UTF-8 features in the |
start of a pattern: |
| 68 |
|
<pre> |
| 69 |
|
(*UTF8) |
| 70 |
|
</pre> |
| 71 |
|
Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8 |
| 72 |
|
option. This feature is not Perl-compatible. How setting UTF-8 mode affects |
| 73 |
|
pattern matching is mentioned in several places below. There is also a summary |
| 74 |
|
of UTF-8 features in the |
| 75 |
<a href="pcre.html#utf8support">section on UTF-8 support</a> |
<a href="pcre.html#utf8support">section on UTF-8 support</a> |
| 76 |
in the main |
in the main |
| 77 |
<a href="pcre.html"><b>pcre</b></a> |
<a href="pcre.html"><b>pcre</b></a> |
| 127 |
default, this is any Unicode newline sequence, for Perl compatibility. However, |
default, this is any Unicode newline sequence, for Perl compatibility. However, |
| 128 |
this can be changed; see the description of \R in the section entitled |
this can be changed; see the description of \R in the section entitled |
| 129 |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
| 130 |
below. |
below. A change of \R setting can be combined with a change of newline |
| 131 |
|
convention. |
| 132 |
</P> |
</P> |
| 133 |
<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> |
<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> |
| 134 |
<P> |
<P> |
| 326 |
<a href="#subpattern">parenthesized subpatterns.</a> |
<a href="#subpattern">parenthesized subpatterns.</a> |
| 327 |
</P> |
</P> |
| 328 |
<br><b> |
<br><b> |
| 329 |
|
Absolute and relative subroutine calls |
| 330 |
|
</b><br> |
| 331 |
|
<P> |
| 332 |
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or |
| 333 |
|
a number enclosed either in angle brackets or single quotes is an alternative |
| 334 |
|
syntax for referencing a subpattern as a "subroutine". Details are discussed |
| 335 |
|
<a href="#onigurumasubroutines">later.</a> |
| 336 |
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> |
| 337 |
|
synonymous. The former is a back reference; the latter is a subroutine call. |
| 338 |
|
</P> |
| 339 |
|
<br><b> |
| 340 |
Generic character types |
Generic character types |
| 341 |
</b><br> |
</b><br> |
| 342 |
<P> |
<P> |
| 375 |
\w, and always match \D, \S, and \W. This is true even when Unicode |
\w, and always match \D, \S, and \W. This is true even when Unicode |
| 376 |
character property support is available. These sequences retain their original |
character property support is available. These sequences retain their original |
| 377 |
meanings from before UTF-8 support was available, mainly for efficiency |
meanings from before UTF-8 support was available, mainly for efficiency |
| 378 |
reasons. |
reasons. Note that this also affects \b, because it is defined in terms of \w |
| 379 |
|
and \W. |
| 380 |
</P> |
</P> |
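<P>
The dependence of \b on \w can be seen in a small sketch. It uses Python's
<b>re</b> module with its ASCII flag as a stand-in, not PCRE itself, so it is
only an analogy for the behaviour described above; the variable names are
illustrative.
</P>

```python
import re

# Assumption: Python's re.ASCII flag stands in for PCRE's rule that
# characters above 127 never match \w. This is not PCRE itself.
text = "\u00e9t\u00e9"  # e-acute, "t", e-acute

ascii_word = re.compile(r"\w", re.ASCII)
ascii_boundary = re.compile(r"\bt", re.ASCII)
unicode_boundary = re.compile(r"\bt")

# With ASCII-only \w, the accented letter is a non-word character,
# so a word boundary exists immediately before the "t".
ascii_hit = ascii_boundary.search(text)

# With Unicode-aware \w, the accented letter is a word character,
# so there is no boundary before the "t" and the search fails.
unicode_hit = unicode_boundary.search(text)
```

<P>
Because \b is defined purely in terms of \w and \W, changing what \w accepts
moves the word boundaries as well.
</P>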
| 381 |
<P> |
<P> |
| 382 |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the |
| 454 |
<P> |
<P> |
| 455 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the |
| 456 |
complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF |
complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF |
| 457 |
either at compile time or when the pattern is matched. This can be made the |
either at compile time or when the pattern is matched. (BSR is an abbreviation |
| 458 |
default when PCRE is built; if this is the case, the other behaviour can be |
for "backslash R".) This can be made the default when PCRE is built; if this is |
| 459 |
requested via the PCRE_BSR_UNICODE option. It is also possible to specify these |
the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option. |
| 460 |
settings by starting a pattern string with one of the following sequences: |
It is also possible to specify these settings by starting a pattern string with |
| 461 |
|
one of the following sequences: |
| 462 |
<pre> |
<pre> |
| 463 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
| 464 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
| 467 |
they can be overridden by options given to <b>pcre_exec()</b>. Note that these |
they can be overridden by options given to <b>pcre_exec()</b>. Note that these |
| 468 |
special settings, which are not Perl-compatible, are recognized only at the |
special settings, which are not Perl-compatible, are recognized only at the |
| 469 |
very start of a pattern, and that they must be in upper case. If more than one |
very start of a pattern, and that they must be in upper case. If more than one |
| 470 |
of them is present, the last one is used. |
of them is present, the last one is used. They can be combined with a change of |
| 471 |
</P> |
newline convention; for example, a pattern can start with: |
| 472 |
<P> |
<pre> |
| 473 |
|
(*ANY)(*BSR_ANYCRLF) |
| 474 |
|
</pre> |
| 475 |
Inside a character class, \R matches the letter "R". |
Inside a character class, \R matches the letter "R". |
| 476 |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
| 477 |
<br><b> |
<br><b> |
| 1038 |
J, U and X respectively. |
J, U and X respectively. |
| 1039 |
</P> |
</P> |
| 1040 |
<P> |
<P> |
| 1041 |
When an option change occurs at top level (that is, not inside subpattern |
When one of these option changes occurs at top level (that is, not inside |
| 1042 |
parentheses), the change applies to the remainder of the pattern that follows. |
subpattern parentheses), the change applies to the remainder of the pattern |
| 1043 |
If the change is placed right at the start of a pattern, PCRE extracts it into |
that follows. If the change is placed right at the start of a pattern, PCRE |
| 1044 |
the global options (and it will therefore show up in data extracted by the |
extracts it into the global options (and it will therefore show up in data |
| 1045 |
<b>pcre_fullinfo()</b> function). |
extracted by the <b>pcre_fullinfo()</b> function). |
| 1046 |
</P> |
</P> |
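<P>
As a rough analogy only, Python's <b>re</b> module also folds a leading inline
option into the compiled pattern's flags, much as a leading option change shows
up in the data returned by <b>pcre_fullinfo()</b>. The sketch below does not
call PCRE, and the variable names are illustrative.
</P>

```python
import re

# Assumption: Python's re module is used as a stand-in for PCRE here.
leading_option = re.compile(r"(?i)abc")  # option change at the very start
no_option = re.compile(r"abc")

# The inline (?i) is reported in the compiled pattern's flags, in the
# same spirit as PCRE extracting it into the global options.
leading_has_flag = bool(leading_option.flags & re.IGNORECASE)
plain_has_flag = bool(no_option.flags & re.IGNORECASE)
```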
| 1047 |
<P> |
<P> |
| 1048 |
An option change within a subpattern (see below for a description of |
An option change within a subpattern (see below for a description of |
| 1061 |
branch is abandoned before the option setting. This is because the effects of |
branch is abandoned before the option setting. This is because the effects of |
| 1062 |
option settings happen at compile time. There would be some very weird |
option settings happen at compile time. There would be some very weird |
| 1063 |
behaviour otherwise. |
behaviour otherwise. |
| 1064 |
|
</P> |
| 1065 |
|
<P> |
| 1066 |
|
<b>Note:</b> There are other PCRE-specific options that can be set by the |
| 1067 |
|
application when the compile or match functions are called. In some cases the |
| 1068 |
|
pattern can contain special leading sequences such as (*CRLF) to override what |
| 1069 |
|
the application has set or what has been defaulted. Details are given in the |
| 1070 |
|
section entitled |
| 1071 |
|
<a href="#newlineseq">"Newline sequences"</a> |
| 1072 |
|
above. There is also the (*UTF8) leading sequence that can be used to set UTF-8 |
| 1073 |
|
mode; this is equivalent to setting the PCRE_UTF8 option. |
| 1074 |
<a name="subpattern"></a></P> |
<a name="subpattern"></a></P> |
| 1075 |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
| 1076 |
<P> |
<P> |
| 1212 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 1213 |
documentation. |
documentation. |
| 1214 |
</P> |
</P> |
| 1215 |
|
<P> |
| 1216 |
|
<b>Warning:</b> You cannot use different names to distinguish between two |
| 1217 |
|
subpatterns with the same number (see the previous section) because PCRE uses |
| 1218 |
|
only the numbers when matching. |
| 1219 |
|
</P> |
| 1220 |
<br><a name="SEC15" href="#TOC1">REPETITION</a><br> |
<br><a name="SEC15" href="#TOC1">REPETITION</a><br> |
| 1221 |
<P> |
<P> |
| 1222 |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
| 1264 |
</P> |
</P> |
| 1265 |
<P> |
<P> |
| 1266 |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
| 1267 |
previous item and the quantifier were not present. |
previous item and the quantifier were not present. This may be useful for |
| 1268 |
|
subpatterns that are referenced as |
| 1269 |
|
<a href="#subpatternsassubroutines">subroutines</a> |
| 1270 |
|
from elsewhere in the pattern. Items other than subpatterns that have a {0} |
| 1271 |
|
quantifier are omitted from the compiled pattern. |
| 1272 |
</P> |
</P> |
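<P>
A minimal sketch of the {0} rule, written against Python's <b>re</b> module as
an assumption rather than against PCRE: a pattern containing an item quantified
with {0} should accept exactly the same subjects as the pattern with that item
removed. The helper name <b>same_result</b> is hypothetical.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
with_zero = re.compile(r"a(?:bc){0}d")  # group quantified with {0}
without_item = re.compile(r"ad")        # same pattern, group removed


def same_result(subject):
    """Return True if both patterns agree on whether subject matches."""
    return bool(with_zero.fullmatch(subject)) == bool(
        without_item.fullmatch(subject)
    )


samples = ["ad", "abcd", "a", "d", ""]
all_agree = all(same_result(s) for s in samples)
```

<P>
The sketch does not show the subroutine use mentioned above, because Python's
<b>re</b> module has no subroutine calls.
</P>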
| 1273 |
<P> |
<P> |
| 1274 |
For convenience, the three most common quantifiers have single-character |
For convenience, the three most common quantifiers have single-character |
| 2068 |
</pre> |
</pre> |
| 2069 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 2070 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
| 2071 |
|
<a name="onigurumasubroutines"></a></P> |
| 2072 |
|
<br><a name="SEC23" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> |
| 2073 |
|
<P> |
| 2074 |
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or |
| 2075 |
|
a number enclosed either in angle brackets or single quotes is an alternative |
| 2076 |
|
syntax for referencing a subpattern as a subroutine, possibly recursively. Here |
| 2077 |
|
are two of the examples used above, rewritten using this syntax: |
| 2078 |
|
<pre> |
| 2079 |
|
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| 2080 |
|
(sens|respons)e and \g'1'ibility |
| 2081 |
|
</pre> |
| 2082 |
|
PCRE supports an extension to Oniguruma: if a number is preceded by a |
| 2083 |
|
plus or a minus sign it is taken as a relative reference. For example: |
| 2084 |
|
<pre> |
| 2085 |
|
(abc)(?i:\g<-1>) |
| 2086 |
|
</pre> |
| 2087 |
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> |
| 2088 |
|
synonymous. The former is a back reference; the latter is a subroutine call. |
| 2089 |
</P> |
</P> |
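<P>
The difference between a back reference and a subroutine call can be sketched
in Python. Python's <b>re</b> module does not support \g&lt;...&gt; subroutine
calls, so the call is emulated here by repeating the subpattern's text; this is
an assumption made for illustration and does not reproduce every detail of a
real PCRE subroutine call, such as its handling of option settings.
</P>

```python
import re

# Text of subpattern 1 from the "sense and responsibility" example.
SUB = r"sens|respons"

# Back reference: \1 must repeat exactly what group 1 matched.
backref = re.compile(rf"({SUB})e and \1ibility")

# Emulated subroutine call (an assumption, since Python's re lacks the
# Oniguruma syntax): the subpattern itself is tried again, so it may
# match a different alternative from the one group 1 matched.
subroutine = re.compile(rf"({SUB})e and (?:{SUB})ibility")

same_word = "sense and sensibility"
different_word = "sense and responsibility"

backref_same = backref.fullmatch(same_word) is not None
backref_different = backref.fullmatch(different_word) is not None
subroutine_different = subroutine.fullmatch(different_word) is not None
```

<P>
Only the emulated subroutine form accepts "sense and responsibility", because
the back reference is tied to the text that group 1 actually matched.
</P>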
| 2090 |
<br><a name="SEC23" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br> |
| 2091 |
<P> |
<P> |
| 2092 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
| 2093 |
code to be obeyed in the middle of matching a regular expression. This makes it |
code to be obeyed in the middle of matching a regular expression. This makes it |
| 2122 |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
| 2123 |
documentation. |
documentation. |
| 2124 |
</P> |
</P> |
| 2125 |
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br> |
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br> |
| 2126 |
<P> |
<P> |
| 2127 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
| 2128 |
are described in the Perl documentation as "experimental and subject to change |
are described in the Perl documentation as "experimental and subject to change |
| 2131 |
remarks apply to the PCRE features described in this section. |
remarks apply to the PCRE features described in this section. |
| 2132 |
</P> |
</P> |
| 2133 |
<P> |
<P> |
| 2134 |
Since these verbs are specifically related to backtracking, they can be used |
Since these verbs are specifically related to backtracking, most of them can be |
| 2135 |
only when the pattern is to be matched using <b>pcre_exec()</b>, which uses a |
used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses |
| 2136 |
backtracking algorithm. They cause an error if encountered by |
a backtracking algorithm. With the exception of (*FAIL), which behaves like a |
| 2137 |
|
failing negative assertion, they cause an error if encountered by |
| 2138 |
<b>pcre_dfa_exec()</b>. |
<b>pcre_dfa_exec()</b>. |
| 2139 |
</P> |
</P> |
| 2140 |
<P> |
<P> |
| 2238 |
second alternative and tries COND2, without backtracking into COND1. If (*THEN) |
second alternative and tries COND2, without backtracking into COND1. If (*THEN) |
| 2239 |
is used outside of any alternation, it acts exactly like (*PRUNE). |
is used outside of any alternation, it acts exactly like (*PRUNE). |
| 2240 |
</P> |
</P> |
| 2241 |
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
| 2242 |
<P> |
<P> |
| 2243 |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
| 2244 |
</P> |
</P> |
| 2245 |
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> |
| 2246 |
<P> |
<P> |
| 2247 |
Philip Hazel |
Philip Hazel |
| 2248 |
<br> |
<br> |
| 2251 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 2252 |
<br> |
<br> |
| 2253 |
</P> |
</P> |
| 2254 |
<br><a name="SEC27" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
| 2255 |
<P> |
<P> |
| 2256 |
Last updated: 11 September 2007 |
Last updated: 11 April 2009 |
| 2257 |
<br> |
<br> |
| 2258 |
Copyright © 1997-2007 University of Cambridge. |
Copyright © 1997-2009 University of Cambridge. |
| 2259 |
<br> |
<br> |
| 2260 |
<p> |
<p> |
| 2261 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |