
Diff of /code/trunk/doc/html/pcrepattern.html


revision 452 by ph10, Sat Apr 11 14:34:02 2009 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 644  U+DFFF. Such characters are not valid in Line 644  U+DFFF. Such characters are not valid in
644  cannot be tested by PCRE, unless UTF-8 validity checking has been turned off  cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
645  (see the discussion of PCRE_NO_UTF8_CHECK in the  (see the discussion of PCRE_NO_UTF8_CHECK in the
646  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
647  page).  page). Perl does not support the Cs property.
648  </P>  </P>
649  <P>  <P>
650  The long synonyms for these properties that Perl supports (such as \p{Letter})  The long synonyms for property names that Perl supports (such as \p{Letter})
651  are not supported by PCRE, nor is it permitted to prefix any of these  are not supported by PCRE, nor is it permitted to prefix any of these
652  properties with "Is".  properties with "Is".
653  </P>  </P>
# Line 1922  recursively to the pattern in which it a Line 1922  recursively to the pattern in which it a
1922  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
1923  supports special syntax for recursion of the entire pattern, and also for  supports special syntax for recursion of the entire pattern, and also for
1924  individual subpattern recursion. After its introduction in PCRE and Python,  individual subpattern recursion. After its introduction in PCRE and Python,
1925  this kind of recursion was introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
1926  </P>  </P>
1927  <P>  <P>
1928  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
# Line 1932  call, which is described in the next sec Line 1932  call, which is described in the next sec
1932  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
1933  </P>  </P>
1934  <P>  <P>
 In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  
 treated as an atomic group. That is, once it has matched some of the subject  
 string, it is never re-entered, even if it contains untried alternatives and  
 there is a subsequent matching failure.  
 </P>  
 <P>  
1935  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1936  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1937  <pre>  <pre>
# Line 2028  recursing), whereas any characters are p Line 2022  recursing), whereas any characters are p
2022  In this pattern, (?(R) is the start of a conditional subpattern, with two  In this pattern, (?(R) is the start of a conditional subpattern, with two
2023  different alternatives for the recursive and non-recursive cases. The (?R) item  different alternatives for the recursive and non-recursive cases. The (?R) item
2024  is the actual recursive call.  is the actual recursive call.
2025    <a name="recursiondifference"></a></P>
2026    <br><b>
2027    Recursion difference from Perl
2028    </b><br>
2029    <P>
2030    In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2031    treated as an atomic group. That is, once it has matched some of the subject
2032    string, it is never re-entered, even if it contains untried alternatives and
2033    there is a subsequent matching failure. This can be illustrated by the
2034    following pattern, which purports to match a palindromic string that contains
2035    an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2036    <pre>
2037      ^(.|(.)(?1)\2)$
2038    </pre>
2039    The idea is that it either matches a single character, or two identical
2040    characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2041    it does not if the subject string is longer than three characters. Consider the
2042    subject string "abcba":
2043    </P>
2044    <P>
2045    At the top level, the first character is matched, but as it is not at the end
2046    of the string, the first alternative fails; the second alternative is taken
2047    and the recursion kicks in. The recursive call to subpattern 1 successfully
2048    matches the next character ("b"). (Note that the beginning and end of line
2049    tests are not part of the recursion).
2050    </P>
2051    <P>
2052    Back at the top level, the next character ("c") is compared with what
2053    subpattern 2 matched, which was "a". This fails. Because the recursion is
2054    treated as an atomic group, there are now no backtracking points, and so the
2055    entire match fails. (Perl is able, at this point, to re-enter the recursion and
2056    try the second alternative.) However, if the pattern is written with the
2057    alternatives in the other order, things are different:
2058    <pre>
2059      ^((.)(?1)\2|.)$
2060    </pre>
2061    This time, the recursing alternative is tried first, and continues to recurse
2062    until it runs out of characters, at which point the recursion fails. But this
2063    time we do have another alternative to try at the higher level. That is the big
2064    difference: in the previous case the remaining alternative is at a deeper
2065    recursion level, which PCRE cannot use.
2066    </P>
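The difference between the two orderings can be reproduced outside any regex engine. The following is a minimal Python sketch, not PCRE's or Perl's actual implementation: the names `group1` and `matches` and the flags `atomic` and `recurse_first` are hypothetical, and "." is assumed to match any character. Setting `atomic=True` keeps only the first way a recursive call matched, as described above for PCRE; `atomic=False` allows the call to be re-entered, as described for Perl.

```python
from typing import Iterator


def group1(s: str, pos: int, atomic: bool, recurse_first: bool) -> Iterator[int]:
    """Yield each end position at which group 1 can match, starting at pos."""

    def single() -> Iterator[int]:
        # Alternative ".": consume exactly one character.
        if pos < len(s):
            yield pos + 1

    def wrapped() -> Iterator[int]:
        # Alternative "(.)(?1)\2": one character, a recursive call to
        # group 1, then the same character again.
        if pos >= len(s):
            return
        for end in group1(s, pos + 1, atomic, recurse_first):
            if end < len(s) and s[end] == s[pos]:
                yield end + 1
            if atomic:
                # Atomic recursion: only the first way the call matched
                # is kept, so it is never re-entered after a failure.
                break

    # recurse_first=False models ^(.|(.)(?1)\2)$
    # recurse_first=True  models ^((.)(?1)\2|.)$
    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s: str, atomic: bool, recurse_first: bool) -> bool:
    """Anchored match: group 1 must consume the whole subject string."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))
```

In this model, `matches("abcba", atomic=True, recurse_first=False)` is false while the same call with `atomic=False` is true, and `matches("abcba", atomic=True, recurse_first=True)` is true, which mirrors the behaviour described for the two patterns.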
2067    <P>
2068    To change the pattern so that it matches all palindromic strings, not just those
2069    with an odd number of characters, it is tempting to change the pattern to this:
2070    <pre>
2071      ^((.)(?1)\2|.?)$
2072    </pre>
2073    Again, this works in Perl, but not in PCRE, and for the same reason. When a
2074    deeper recursion has matched a single character, it cannot be entered again in
2075    order to match an empty string. The solution is to separate the two cases, and
2076    write out the odd and even cases as alternatives at the higher level:
2077    <pre>
2078      ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
2079    </pre>
2080    If you want to match typical palindromic phrases, the pattern has to ignore all
2081    non-word characters, which can be done like this:
2082    <pre>
2083      ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2084    </pre>
2085    If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2086    man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2087    the use of the possessive quantifier *+ to avoid backtracking into sequences of
2088    non-word characters. Without this, PCRE takes a great deal longer (ten times or
2089    more) to match typical phrases, and Perl takes so long that you think it has
2090    gone into a loop.
2091  <a name="subpatternsassubroutines"></a></P>  <a name="subpatternsassubroutines"></a></P>
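When experimenting with the phrase pattern, it can help to have an independent check of the expected answer. The sketch below is such a check in plain Python; it is not the regex itself, and the function name `is_palindromic_phrase` is hypothetical. It assumes that "non-word character" means ASCII `\W` and that caseless matching can be approximated by lower-casing.

```python
import re


def is_palindromic_phrase(text: str) -> bool:
    """Return True if text reads the same both ways once non-word
    characters are dropped and case is ignored."""
    # Drop everything that ASCII \W matches (punctuation, spaces, ...).
    letters = re.sub(r"\W", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]
```

For example, `is_palindromic_phrase("A man, a plan, a canal: Panama!")` returns `True`, in line with the phrase quoted above.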
2092  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
2093  <P>  <P>
# Line 2138  failing negative assertion, they cause a Line 2198  failing negative assertion, they cause a
2198  <b>pcre_dfa_exec()</b>.  <b>pcre_dfa_exec()</b>.
2199  </P>  </P>
2200  <P>  <P>
2201    If any of these verbs are used in an assertion subpattern, their effect is
2202    confined to that subpattern; it does not extend to the surrounding pattern.
2203    Note that assertion subpatterns are processed as anchored at the point where
2204    they are tested.
2205    </P>
2206    <P>
2207  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2208  parenthesis followed by an asterisk. In Perl, they are generally of the form  parenthesis followed by an asterisk. In Perl, they are generally of the form
2209  (*VERB:ARG) but PCRE does not support the use of arguments, so its general  (*VERB:ARG) but PCRE does not support the use of arguments, so its general
# Line 2154  The following verbs act as soon as they Line 2220  The following verbs act as soon as they
2220  </pre>  </pre>
2221  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2222  pattern. When inside a recursion, only the innermost pattern is ended  pattern. When inside a recursion, only the innermost pattern is ended
2223  immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside  immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
2224  capturing parentheses. In Perl, the data so far is captured: in PCRE no data is  is captured. (This feature was added to PCRE at release 8.00.) For example:
 captured. For example:  
2225  <pre>  <pre>
2226    A(A|B(*ACCEPT)|C)D    A((?:A|B(*ACCEPT)|C)D)
2227  </pre>  </pre>
2228  This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2229  captured.  the outer parentheses.
2230  <pre>  <pre>
2231    (*FAIL) or (*F)    (*FAIL) or (*F)
2232  </pre>  </pre>
# Line 2253  Cambridge CB2 3QH, England. Line 2318  Cambridge CB2 3QH, England.
2318  </P>  </P>
2319  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2320  <P>  <P>
2321  Last updated: 11 April 2009  Last updated: 18 September 2009
2322  <br>  <br>
2323  Copyright &copy; 1997-2009 University of Cambridge.  Copyright &copy; 1997-2009 University of Cambridge.
2324  <br>  <br>
