Diff of /code/trunk/doc/pcrepattern.3

revision 227 by ph10, Tue Aug 21 15:00:15 2007 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 9  are described in detail below. There is Line 9  are described in detail below. There is
9  .\" HREF  .\" HREF
10  \fBpcresyntax\fP  \fBpcresyntax\fP
11  .\"  .\"
12  page. Perl's regular expressions are described in its own documentation, and  page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
13    also supports some alternative regular expression syntax (which does not
14    conflict with the Perl syntax) in order to provide some compatibility with
15    regular expressions in Python, .NET, and Oniguruma.
16    .P
17    Perl's regular expressions are described in its own documentation, and
18  regular expressions in general are covered in a number of books, some of which  regular expressions in general are covered in a number of books, some of which
19  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
20  published by O'Reilly, covers regular expressions in great detail. This  published by O'Reilly, covers regular expressions in great detail. This
# Line 18  description of PCRE's regular expression Line 23  description of PCRE's regular expression
23  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
24  there is now also support for UTF-8 character strings. To use this, you must  there is now also support for UTF-8 character strings. To use this, you must
25  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
26  the PCRE_UTF8 option. How this affects pattern matching is mentioned in several  the PCRE_UTF8 option. There is also a special sequence that can be given at the
27  places below. There is also a summary of UTF-8 features in the  start of a pattern:
28    .sp
29      (*UTF8)
30    .sp
31    Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
32    option. This feature is not Perl-compatible. How setting UTF-8 mode affects
33    pattern matching is mentioned in several places below. There is also a summary
34    of UTF-8 features in the
35  .\" HTML <a href="pcre.html#utf8support">  .\" HTML <a href="pcre.html#utf8support">
36  .\" </a>  .\" </a>
37  section on UTF-8 support  section on UTF-8 support
# Line 79  example, on a Unix system where LF is th Line 91  example, on a Unix system where LF is th
91  changes the convention to CR. That pattern matches "a\enb" because LF is no  changes the convention to CR. That pattern matches "a\enb" because LF is no
92  longer a newline. Note that these special settings, which are not  longer a newline. Note that these special settings, which are not
93  Perl-compatible, are recognized only at the very start of a pattern, and that  Perl-compatible, are recognized only at the very start of a pattern, and that
94  they must be in upper case.  they must be in upper case. If more than one of them is present, the last one
95    is used.
96    .P
97    The newline convention does not affect what the \eR escape sequence matches. By
98    default, this is any Unicode newline sequence, for Perl compatibility. However,
99    this can be changed; see the description of \eR in the section entitled
100    .\" HTML <a href="#newlineseq">
101    .\" </a>
102    "Newline sequences"
103    .\"
104    below. A change of \eR setting can be combined with a change of newline
105    convention.
106  .  .
107  .  .
108  .SH "CHARACTERS AND METACHARACTERS"  .SH "CHARACTERS AND METACHARACTERS"
# Line 299  parenthesized subpatterns. Line 322  parenthesized subpatterns.
322  .\"  .\"
323  .  .
324  .  .
325    .SS "Absolute and relative subroutine calls"
326    .rs
327    .sp
328    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
329    a number enclosed either in angle brackets or single quotes, is an alternative
330    syntax for referencing a subpattern as a "subroutine". Details are discussed
331    .\" HTML <a href="#onigurumasubroutines">
332    .\" </a>
333    later.
334    .\"
335    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
336    synonymous. The former is a back reference; the latter is a subroutine call.
337    .
338    .
339  .SS "Generic character types"  .SS "Generic character types"
340  .rs  .rs
341  .sp  .sp
# Line 334  In UTF-8 mode, characters with values gr Line 371  In UTF-8 mode, characters with values gr
371  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
372  character property support is available. These sequences retain their original  character property support is available. These sequences retain their original
373  meanings from before UTF-8 support was available, mainly for efficiency  meanings from before UTF-8 support was available, mainly for efficiency
374  reasons.  reasons. Note that this also affects \eb, because it is defined in terms of \ew
375    and \eW.
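The knock-on effect on \eb can be seen with any engine whose \ew is limited to ASCII. The sketch below is an analogy only: it uses Python's standard `re` module with the `re.ASCII` flag as a stand-in for PCRE's behaviour in UTF-8 mode, and it does not call PCRE. The sample word is an arbitrary choice.

```python
import re

# With an ASCII-only \w, the character U+00EF is a non-word character,
# so \b sees a word boundary on both sides of it.
subject = "na\u00efve"

ascii_words = re.findall(r"\b\w+\b", subject, re.ASCII)
unicode_words = re.findall(r"\b\w+\b", subject)

print(ascii_words)    # the word is split around the accented letter
print(unicode_words)  # the word stays whole when \w is Unicode-aware
```

The first result corresponds to what the text describes: a high-valued character never matches \ew, so a boundary is reported next to it.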
376  .P  .P
377  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
378  other sequences, these do match certain high-valued codepoints in UTF-8 mode.  other sequences, these do match certain high-valued codepoints in UTF-8 mode.
# Line 388  accented letters, and these are matched Line 426  accented letters, and these are matched
426  is discouraged.  is discouraged.
427  .  .
428  .  .
429    .\" HTML <a name="newlineseq"></a>
430  .SS "Newline sequences"  .SS "Newline sequences"
431  .rs  .rs
432  .sp  .sp
433  Outside a character class, the escape sequence \eR matches any Unicode newline  Outside a character class, by default, the escape sequence \eR matches any
434  sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is equivalent to  Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is
435  the following:  equivalent to the following:
436  .sp  .sp
437    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
438  .sp  .sp
# Line 413  are added: LS (line separator, U+2028) a Line 452  are added: LS (line separator, U+2028) a
452  Unicode character property support is not needed for these characters to be  Unicode character property support is not needed for these characters to be
453  recognized.  recognized.
454  .P  .P
455    It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
456    complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
457    either at compile time or when the pattern is matched. (BSR is an abbreviation
458    for "backslash R".) This can be made the default when PCRE is built; if this is
459    the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
460    It is also possible to specify these settings by starting a pattern string with
461    one of the following sequences:
462    .sp
463      (*BSR_ANYCRLF)   CR, LF, or CRLF only
464      (*BSR_UNICODE)   any Unicode newline sequence
465    .sp
466    These override the default and the options given to \fBpcre_compile()\fP, but
467    they can be overridden by options given to \fBpcre_exec()\fP. Note that these
468    special settings, which are not Perl-compatible, are recognized only at the
469    very start of a pattern, and that they must be in upper case. If more than one
470    of them is present, the last one is used. They can be combined with a change of
471    newline convention, for example, a pattern can start with:
472    .sp
473      (*ANY)(*BSR_ANYCRLF)
474    .sp
475  Inside a character class, \eR matches the letter "R".  Inside a character class, \eR matches the letter "R".
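As a rough illustration of the two \eR settings, the sketch below spells out both sets as explicit byte patterns in Python's standard `re` module. This is an emulation, not PCRE: the names `BSR_UNICODE` and `BSR_ANYCRLF` are borrowed from the option names only for readability, and the atomic group `(?>...)` is emulated with a lookahead plus back reference so that the code does not depend on a recent Python release.

```python
import re

# Non-UTF-8 equivalent of \R, written as (?=(...))\1 to emulate (?>...).
BSR_UNICODE = rb"(?=(\r\n|\n|\x0b|\f|\r|\x85))\1"
# Restricted form: CR, LF, or CRLF only.
BSR_ANYCRLF = rb"(?=(\r\n|\n|\r))\1"

def count_newlines(subject, newline_seq):
    """Count non-overlapping newline sequences in a byte string."""
    return sum(1 for _ in re.finditer(newline_seq, subject))

subject = b"a\r\nb\x0bc\x85d\n"
unicode_count = count_newlines(subject, BSR_UNICODE)
anycrlf_count = count_newlines(subject, BSR_ANYCRLF)

# Atomicity: once CRLF has been taken as one sequence, the engine may not
# back up and treat the CR alone as the newline.
atomic_check = re.fullmatch(BSR_UNICODE + rb"\n", b"\r\n")
```

With the restricted set, VT (0x0b) and NEL (0x85) are ordinary data bytes, which is why the two counts differ for the same subject.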
476  .  .
477  .  .
# Line 583  cannot be tested by PCRE, unless UTF-8 v Line 642  cannot be tested by PCRE, unless UTF-8 v
642  .\" HREF  .\" HREF
643  \fBpcreapi\fP  \fBpcreapi\fP
644  .\"  .\"
645  page).  page). Perl does not support the Cs property.
646  .P  .P
647  The long synonyms for these properties that Perl supports (such as \ep{Letter})  The long synonyms for property names that Perl supports (such as \ep{Letter})
648  are not supported by PCRE, nor is it permitted to prefix any of these  are not supported by PCRE, nor is it permitted to prefix any of these
649  properties with "Is".  properties with "Is".
650  .P  .P
# Line 960  alternative in the subpattern. Line 1019  alternative in the subpattern.
1019  .rs  .rs
1020  .sp  .sp
1021  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
1022  PCRE_EXTENDED options can be changed from within the pattern by a sequence of  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
1023  Perl option letters enclosed between "(?" and ")". The option letters are  the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
1024    The option letters are
1025  .sp  .sp
1026    i  for PCRE_CASELESS    i  for PCRE_CASELESS
1027    m  for PCRE_MULTILINE    m  for PCRE_MULTILINE
# Line 975  PCRE_MULTILINE while unsetting PCRE_DOTA Line 1035  PCRE_MULTILINE while unsetting PCRE_DOTA
1035  permitted. If a letter appears both before and after the hyphen, the option is  permitted. If a letter appears both before and after the hyphen, the option is
1036  unset.  unset.
1037  .P  .P
1038  When an option change occurs at top level (that is, not inside subpattern  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
1039  parentheses), the change applies to the remainder of the pattern that follows.  changed in the same way as the Perl-compatible options by using the characters
1040  If the change is placed right at the start of a pattern, PCRE extracts it into  J, U and X respectively.
1041  the global options (and it will therefore show up in data extracted by the  .P
1042  \fBpcre_fullinfo()\fP function).  When one of these option changes occurs at top level (that is, not inside
1043    subpattern parentheses), the change applies to the remainder of the pattern
1044    that follows. If the change is placed right at the start of a pattern, PCRE
1045    extracts it into the global options (and it will therefore show up in data
1046    extracted by the \fBpcre_fullinfo()\fP function).
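Python's standard `re` module uses the same Perl option letters, so it can serve as a hedged analogy for a leading option setting being lifted into the global options. The sketch below is not PCRE code: `pattern.flags` merely plays the role that the data returned by \fBpcre_fullinfo()\fP plays in PCRE, and the subject string is made up for the example.

```python
import re

# A leading (?im) sets the caseless and multiline options for the whole pattern.
pattern = re.compile(r"(?im)^abc$")

# The options set inside the pattern are visible on the compiled object.
has_caseless = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)

# Each line is tested separately (m) and without regard to case (i).
found = pattern.findall("ABC\nabc\nabd")
```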
1047  .P  .P
1048  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
1049  subpatterns) affects only that part of the current pattern that follows it, so  subpatterns) affects only that part of the current pattern that follows it, so
# Line 998  branch is abandoned before the option se Line 1062  branch is abandoned before the option se
1062  option settings happen at compile time. There would be some very weird  option settings happen at compile time. There would be some very weird
1063  behaviour otherwise.  behaviour otherwise.
1064  .P  .P
1065  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be  \fBNote:\fP There are other PCRE-specific options that can be set by the
1066  changed in the same way as the Perl-compatible options by using the characters  application when the compile or match functions are called. In some cases the
1067  J, U and X respectively.  pattern can contain special leading sequences such as (*CRLF) to override what
1068    the application has set or what has been defaulted. Details are given in the
1069    section entitled
1070    .\" HTML <a href="#newlineseq">
1071    .\" </a>
1072    "Newline sequences"
1073    .\"
1074    above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
1075    mode; this is equivalent to setting the PCRE_UTF8 option.
1076  .  .
1077  .  .
1078  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 1149  details of the interfaces for handling n Line 1221  details of the interfaces for handling n
1221  \fBpcreapi\fP  \fBpcreapi\fP
1222  .\"  .\"
1223  documentation.  documentation.
1224    .P
1225    \fBWarning:\fP You cannot use different names to distinguish between two
1226    subpatterns with the same number (see the previous section) because PCRE uses
1227    only the numbers when matching.
1228  .  .
1229  .  .
1230  .SH REPETITION  .SH REPETITION
# Line 1197  support is available, \eX{3} matches thr Line 1273  support is available, \eX{3} matches thr
1273  which may be several bytes long (and they may be of different lengths).  which may be several bytes long (and they may be of different lengths).
1274  .P  .P
1275  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1276  previous item and the quantifier were not present.  previous item and the quantifier were not present. This may be useful for
1277    subpatterns that are referenced as
1278    .\" HTML <a href="#subpatternsassubroutines">
1279    .\" </a>
1280    subroutines
1281    .\"
1282    from elsewhere in the pattern. Items other than subpatterns that have a {0}
1283    quantifier are omitted from the compiled pattern.
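The basic effect of {0} can be checked in any Perl-style engine. The sketch below uses Python's standard `re` module as a stand-in (it has no subroutine calls, so it shows only the "as if not present" behaviour, not the subroutine use mentioned above); the patterns and subjects are arbitrary.

```python
import re

with_zero = re.compile(r"a(?:bc){0}d")   # group quantified with {0}
without_item = re.compile(r"ad")         # same pattern with the item removed

# Both patterns should accept and reject exactly the same subjects.
results = {
    subject: (bool(with_zero.fullmatch(subject)),
              bool(without_item.fullmatch(subject)))
    for subject in ("ad", "abcd", "abd")
}
```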
1284  .P  .P
1285  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
1286  abbreviations:  abbreviations:
# Line 1839  recursively to the pattern in which it a Line 1922  recursively to the pattern in which it a
1922  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
1923  supports special syntax for recursion of the entire pattern, and also for  supports special syntax for recursion of the entire pattern, and also for
1924  individual subpattern recursion. After its introduction in PCRE and Python,  individual subpattern recursion. After its introduction in PCRE and Python,
1925  this kind of recursion was introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
1926  .P  .P
1927  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
1928  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
# Line 1847  provided that it occurs inside that subp Line 1930  provided that it occurs inside that subp
1930  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
1931  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
1932  .P  .P
 In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  
 treated as an atomic group. That is, once it has matched some of the subject  
 string, it is never re-entered, even if it contains untried alternatives and  
 there is a subsequent matching failure.  
 .P  
1933  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1934  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1935  .sp  .sp
# Line 1939  different alternatives for the recursive Line 2017  different alternatives for the recursive
2017  is the actual recursive call.  is the actual recursive call.
2018  .  .
2019  .  .
2020    .\" HTML <a name="recursiondifference"></a>
2021    .SS "Recursion difference from Perl"
2022    .rs
2023    .sp
2024    In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2025    treated as an atomic group. That is, once it has matched some of the subject
2026    string, it is never re-entered, even if it contains untried alternatives and
2027    there is a subsequent matching failure. This can be illustrated by the
2028    following pattern, which purports to match a palindromic string that contains
2029    an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2030    .sp
2031      ^(.|(.)(?1)\e2)$
2032    .sp
2033    The idea is that it either matches a single character, or two identical
2034    characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2035    it does not if the subject string is longer than three characters. Consider the
2036    subject string "abcba":
2037    .P
2038    At the top level, the first character is matched, but as it is not at the end
2039    of the string, the first alternative fails; the second alternative is taken
2040    and the recursion kicks in. The recursive call to subpattern 1 successfully
2041    matches the next character ("b"). (Note that the beginning and end of line
2042    tests are not part of the recursion).
2043    .P
2044    Back at the top level, the next character ("c") is compared with what
2045    subpattern 2 matched, which was "a". This fails. Because the recursion is
2046    treated as an atomic group, there are now no backtracking points, and so the
2047    entire match fails. (Perl is able, at this point, to re-enter the recursion and
2048    try the second alternative.) However, if the pattern is written with the
2049    alternatives in the other order, things are different:
2050    .sp
2051      ^((.)(?1)\e2|.)$
2052    .sp
2053    This time, the recursing alternative is tried first, and continues to recurse
2054    until it runs out of characters, at which point the recursion fails. But this
2055    time we do have another alternative to try at the higher level. That is the big
2056    difference: in the previous case the remaining alternative is at a deeper
2057    recursion level, which PCRE cannot use.
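The difference between the two semantics can be traced with a small hand-written matcher. The sketch below is a simulation in plain Python, not PCRE or Perl: the function names and the `atomic` and `recursing_first` parameters are invented for the illustration. With `atomic=True` a recursive call keeps only the first way it succeeds, as described above for PCRE; with `atomic=False` every way is tried, as described for Perl.

```python
import itertools

def group1_ends(s, i, atomic, recursing_first):
    """Yield each index where subpattern 1 can end when started at s[i]."""

    def single():                          # alternative:  .
        if i < len(s):
            yield i + 1

    def wrapped():                         # alternative:  (.)(?1)\2
        if i >= len(s):
            return
        outer = s[i]                       # text captured by group 2 at this level
        inner = group1_ends(s, i + 1, atomic, recursing_first)   # the (?1) call
        if atomic:
            # Atomic recursion: the call is never re-entered after it succeeds.
            inner = itertools.islice(inner, 1)
        for j in inner:
            if j < len(s) and s[j] == outer:                     # the \2 test
                yield j + 1

    order = (wrapped, single) if recursing_first else (single, wrapped)
    for alternative in order:
        yield from alternative()

def matches(s, atomic, recursing_first=False):
    """Emulate ^(...)$: group 1 must start at 0 and end at len(s)."""
    return any(end == len(s)
               for end in group1_ends(s, 0, atomic, recursing_first))

perl_style = matches("abcba", atomic=False)                        # ^(.|(.)(?1)\2)$
pcre_style = matches("abcba", atomic=True)                         # same pattern
pcre_reordered = matches("abcba", atomic=True, recursing_first=True)  # ^((.)(?1)\2|.)$
```

Under these assumptions the first pattern fails on "abcba" only in the atomic case, while the reordered pattern succeeds because the untried alternative sits at the outer level rather than inside a completed recursive call.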
2058    .P
2059    To change the pattern so that it matches all palindromic strings, not just those
2060    with an odd number of characters, it is tempting to change the pattern to this:
2061    .sp
2062      ^((.)(?1)\e2|.?)$
2063    .sp
2064    Again, this works in Perl, but not in PCRE, and for the same reason. When a
2065    deeper recursion has matched a single character, it cannot be entered again in
2066    order to match an empty string. The solution is to separate the two cases, and
2067    write out the odd and even cases as alternatives at the higher level:
2068    .sp
2069      ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2070    .sp
2071    If you want to match typical palindromic phrases, the pattern has to ignore all
2072    non-word characters, which can be done like this:
2073    .sp
2074      ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
2075    .sp
2076    If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2077    man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2078    the use of the possessive quantifier *+ to avoid backtracking into sequences of
2079    non-word characters. Without this, PCRE takes a great deal longer (ten times or
2080    more) to match typical phrases, and Perl takes so long that you think it has
2081    gone into a loop.
2082    .
2083    .
2084  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2085  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2086  .rs  .rs
# Line 1980  It matches "abcabc". It does not match " Line 2122  It matches "abcabc". It does not match "
2122  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
2123  .  .
2124  .  .
2125    .\" HTML <a name="onigurumasubroutines"></a>
2126    .SH "ONIGURUMA SUBROUTINE SYNTAX"
2127    .rs
2128    .sp
2129    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
2130    a number enclosed either in angle brackets or single quotes, is an alternative
2131    syntax for referencing a subpattern as a subroutine, possibly recursively. Here
2132    are two of the examples used above, rewritten using this syntax:
2133    .sp
2134      (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
2135      (sens|respons)e and \eg'1'ibility
2136    .sp
2137    PCRE supports an extension to Oniguruma: if a number is preceded by a
2138    plus or a minus sign it is taken as a relative reference. For example:
2139    .sp
2140      (abc)(?i:\eg<-1>)
2141    .sp
2142    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
2143    synonymous. The former is a back reference; the latter is a subroutine call.
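The distinction can be sketched without PCRE by writing out what each form means. The example below uses Python's standard `re` module, which has back references but no subroutine calls, so the subroutine call is emulated by repeating the subpattern text by hand. That expansion is an assumption made for illustration and is not how PCRE implements the call.

```python
import re

# Back reference: must match the same TEXT that group 1 matched.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine call, emulated: re-applies the same SUBPATTERN, so either
# alternative may be chosen again.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subject = "sense and responsibility"
backref_result = bool(back_reference.fullmatch(subject))
subroutine_result = bool(subroutine_like.fullmatch(subject))
```

A subject such as "sense and sensibility" is accepted by both forms, since there the repeated text and the repeated subpattern coincide.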
2144    .
2145    .
2146  .SH CALLOUTS  .SH CALLOUTS
2147  .rs  .rs
2148  .sp  .sp
# Line 2016  description of the interface to the call Line 2179  description of the interface to the call
2179  documentation.  documentation.
2180  .  .
2181  .  .
2182  .SH "BACTRACKING CONTROL"  .SH "BACKTRACKING CONTROL"
2183  .rs  .rs
2184  .sp  .sp
2185  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
# Line 2025  or removal in a future version of Perl". Line 2188  or removal in a future version of Perl".
2188  production code should be noted to avoid problems during upgrades." The same  production code should be noted to avoid problems during upgrades." The same
2189  remarks apply to the PCRE features described in this section.  remarks apply to the PCRE features described in this section.
2190  .P  .P
2191  Since these verbs are specifically related to backtracking, they can be used  Since these verbs are specifically related to backtracking, most of them can be
2192  only when the pattern is to be matched using \fBpcre_exec()\fP, which uses a  used only when the pattern is to be matched using \fBpcre_exec()\fP, which uses
2193  backtracking algorithm. They cause an error if encountered by  a backtracking algorithm. With the exception of (*FAIL), which behaves like a
2194    failing negative assertion, they cause an error if encountered by
2195  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2196  .P  .P
2197    If any of these verbs are used in an assertion subpattern, their effect is
2198    confined to that subpattern; it does not extend to the surrounding pattern.
2199    Note that assertion subpatterns are processed as anchored at the point where
2200    they are tested.
2201    .P
2202  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2203  parenthesis followed by an asterisk. In Perl, they are generally of the form  parenthesis followed by an asterisk. In Perl, they are generally of the form
2204  (*VERB:ARG) but PCRE does not support the use of arguments, so its general  (*VERB:ARG) but PCRE does not support the use of arguments, so its general
# Line 2045  The following verbs act as soon as they Line 2214  The following verbs act as soon as they
2214  .sp  .sp
2215  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2216  pattern. When inside a recursion, only the innermost pattern is ended  pattern. When inside a recursion, only the innermost pattern is ended
2217  immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside  immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
2218  capturing parentheses. In Perl, the data so far is captured: in PCRE no data is  is captured. (This feature was added to PCRE at release 8.00.) For example:
 captured. For example:  
2219  .sp  .sp
2220    A(A|B(*ACCEPT)|C)D    A((?:A|B(*ACCEPT)|C)D)
2221  .sp  .sp
2222  This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2223  captured.  the outer parentheses.
2224  .sp  .sp
2225    (*FAIL) or (*F)    (*FAIL) or (*F)
2226  .sp  .sp
# Line 2149  Cambridge CB2 3QH, England. Line 2317  Cambridge CB2 3QH, England.
2317  .rs  .rs
2318  .sp  .sp
2319  .nf  .nf
2320  Last updated: 21 August 2007  Last updated: 18 September 2009
2321  Copyright (c) 1997-2007 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2322  .fi  .fi
