/[pcre]/code/trunk/doc/pcrepattern.3
ViewVC logotype

Diff of /code/trunk/doc/pcrepattern.3

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 452 by ph10, Wed Sep 16 11:05:51 2009 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 1922  recursively to the pattern in which it a Line 1922  recursively to the pattern in which it a
1922  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
1923  supports special syntax for recursion of the entire pattern, and also for  supports special syntax for recursion of the entire pattern, and also for
1924  individual subpattern recursion. After its introduction in PCRE and Python,  individual subpattern recursion. After its introduction in PCRE and Python,
1925  this kind of recursion was introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
1926  .P  .P
1927  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
1928  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
# Line 1930  provided that it occurs inside that subp Line 1930  provided that it occurs inside that subp
1930  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
1931  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
1932  .P  .P
 In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  
 treated as an atomic group. That is, once it has matched some of the subject  
 string, it is never re-entered, even if it contains untried alternatives and  
 there is a subsequent matching failure.  
 .P  
1933  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1934  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1935  .sp  .sp
# Line 2022  different alternatives for the recursive Line 2017  different alternatives for the recursive
2017  is the actual recursive call.  is the actual recursive call.
2018  .  .
2019  .  .
2020    .\" HTML <a name="recursiondifference"></a>
2021    .SS "Recursion difference from Perl"
2022    .rs
2023    .sp
2024    In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2025    treated as an atomic group. That is, once it has matched some of the subject
2026    string, it is never re-entered, even if it contains untried alternatives and
2027    there is a subsequent matching failure. This can be illustrated by the
2028    following pattern, which purports to match a palindromic string that contains
2029    an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2030    .sp
2031      ^(.|(.)(?1)\e2)$
2032    .sp
2033    The idea is that it either matches a single character, or two identical
2034    characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2035    it does not if the pattern is longer than three characters. Consider the
2036    subject string "abcba":
2037    .P
2038    At the top level, the first character is matched, but as it is not at the end
2039    of the string, the first alternative fails; the second alternative is taken
2040    and the recursion kicks in. The recursive call to subpattern 1 successfully
2041    matches the next character ("b"). (Note that the beginning and end of line
2042    tests are not part of the recursion).
2043    .P
2044    Back at the top level, the next character ("c") is compared with what
2045    subpattern 2 matched, which was "a". This fails. Because the recursion is
2046    treated as an atomic group, there are now no backtracking points, and so the
2047    entire match fails. (Perl is able, at this point, to re-enter the recursion and
2048    try the second alternative.) However, if the pattern is written with the
2049    alternatives in the other order, things are different:
2050    .sp
2051      ^((.)(?1)\e2|.)$
2052    .sp
2053    This time, the recursing alternative is tried first, and continues to recurse
2054    until it runs out of characters, at which point the recursion fails. But this
2055    time we do have another alternative to try at the higher level. That is the big
2056    difference: in the previous case the remaining alternative is at a deeper
2057    recursion level, which PCRE cannot use.
2058    .P
2059    To change the pattern so that matches all palindromic strings, not just those
2060    with an odd number of characters, it is tempting to change the pattern to this:
2061    .sp
2062      ^((.)(?1)\e2|.?)$
2063    .sp
2064    Again, this works in Perl, but not in PCRE, and for the same reason. When a
2065    deeper recursion has matched a single character, it cannot be entered again in
2066    order to match an empty string. The solution is to separate the two cases, and
2067    write out the odd and even cases as alternatives at the higher level:
2068    .sp
2069      ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2070    .sp
2071    If you want to match typical palindromic phrases, the pattern has to ignore all
2072    non-word characters, which can be done like this:
2073    .sp
2074      ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\4|\eW*+.\eW*+))\eW*+$
2075    .sp
2076    If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2077    man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2078    the use of the possessive quantifier *+ to avoid backtracking into sequences of
2079    non-word characters. Without this, PCRE takes a great deal longer (ten times or
2080    more) to match typical phrases, and Perl takes so long that you think it has
2081    gone into a loop.
2082    .
2083    .
2084  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2085  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2086  .rs  .rs
# Line 2258  Cambridge CB2 3QH, England. Line 2317  Cambridge CB2 3QH, England.
2317  .rs  .rs
2318  .sp  .sp
2319  .nf  .nf
2320  Last updated: 16 September 2009  Last updated: 18 September 2009
2321  Copyright (c) 1997-2009 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2322  .fi  .fi

Legend:
Removed from v.452  
changed lines
  Added in v.453

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12