Diff of /code/trunk/doc/pcrepattern.3

revision 155 by ph10, Tue Apr 24 13:36:11 2007 UTC revision 175 by ph10, Mon Jun 11 13:38:38 2007 UTC
# Line 30  The remainder of this document discusses Line 30  The remainder of this document discusses
30  PCRE when its main matching function, \fBpcre_exec()\fP, is used.  PCRE when its main matching function, \fBpcre_exec()\fP, is used.
31  From release 6.0, PCRE offers a second matching function,  From release 6.0, PCRE offers a second matching function,
32  \fBpcre_dfa_exec()\fP, which matches using a different algorithm that is not  \fBpcre_dfa_exec()\fP, which matches using a different algorithm that is not
33  Perl-compatible. The advantages and disadvantages of the alternative function,  Perl-compatible. Some of the features discussed below are not available when
34  and how it differs from the normal function, are discussed in the  \fBpcre_dfa_exec()\fP is used. The advantages and disadvantages of the
35    alternative function, and how it differs from the normal function, are
36    discussed in the
37  .\" HREF  .\" HREF
38  \fBpcrematching\fP  \fBpcrematching\fP
39  .\"  .\"
# Line 239  meanings Line 241  meanings
241  .rs  .rs
242  .sp  .sp
243  The sequence \eg followed by a positive or negative number, optionally enclosed  The sequence \eg followed by a positive or negative number, optionally enclosed
244  in braces, is an absolute or relative back reference. Back references are  in braces, is an absolute or relative back reference. A named back reference
245  discussed  can be coded as \eg{name}. Back references are discussed
246  .\" HTML <a href="#backreferences">  .\" HTML <a href="#backreferences">
247  .\" </a>  .\" </a>
248  later,  later,
# Line 519  why the traditional escape sequences suc Line 521  why the traditional escape sequences suc
521  properties in PCRE.  properties in PCRE.
522  .  .
523  .  .
524    .\" HTML <a name="resetmatchstart"></a>
525    .SS "Resetting the match start"
526    .rs
527    .sp
528    The escape sequence \eK, which is a Perl 5.10 feature, causes any previously
529    matched characters not to be included in the final matched sequence. For
530    example, the pattern:
531    .sp
532      foo\eKbar
533    .sp
534    matches "foobar", but reports that it has matched "bar". This feature is
535    similar to a lookbehind assertion
536    .\" HTML <a href="#lookbehind">
537    .\" </a>
538    (described below).
539    .\"
540    However, in this case, the part of the subject before the real match does not
541    have to be of fixed length, as lookbehind assertions do. The use of \eK does
542    not interfere with the setting of
543    .\" HTML <a href="#subpattern">
544    .\" </a>
545    captured substrings.
546    .\"
547    For example, when the pattern
548    .sp
549      (foo)\eKbar
550    .sp
551    matches "foobar", the first substring is still set to "foo".
552    .
553    .
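For comparison, the two alternatives to \eK mentioned above (a fixed-length lookbehind and a capturing group) can be sketched as follows. This is a minimal illustration using Python's standard re module, which is an assumption of this example and is not PCRE; that module does not implement \eK, so only the alternatives are shown.

```python
import re

subject = "foobar"

# Fixed-length lookbehind: "foo" must precede the match but is not part of it.
lookbehind = re.search(r"(?<=foo)bar", subject)

# Capturing group: the overall match is "foobar"; group 1 holds the part
# that a \K-style pattern would report.
grouped = re.search(r"foo(bar)", subject)

print(lookbehind.group(), lookbehind.span())
print(grouped.group(1), grouped.span(1))
```

The lookbehind form only works here because "foo" has a fixed length, which is exactly the restriction that \eK lifts.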
554  .\" HTML <a name="smallassertions"></a>  .\" HTML <a name="smallassertions"></a>
555  .SS "Simple assertions"  .SS "Simple assertions"
556  .rs  .rs
# Line 926  is reached, an option setting in one bra Line 958  is reached, an option setting in one bra
958  the above patterns match "SUNDAY" as well as "Saturday".  the above patterns match "SUNDAY" as well as "Saturday".
959  .  .
960  .  .
961    .SH "DUPLICATE SUBPATTERN NUMBERS"
962    .rs
963    .sp
964    Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
965    the same numbers for its capturing parentheses. Such a subpattern starts with
966    (?| and is itself a non-capturing subpattern. For example, consider this
967    pattern:
968    .sp
969      (?|(Sat)ur|(Sun))day
970    .sp
971    Because the two alternatives are inside a (?| group, both sets of capturing
972    parentheses are numbered one. Thus, when the pattern matches, you can look
973    at captured substring number one, whichever alternative matched. This construct
974    is useful when you want to capture part, but not all, of one of a number of
975    alternatives. Inside a (?| group, parentheses are numbered as usual, but the
976    number is reset at the start of each branch. The numbers of any capturing
977    buffers that follow the subpattern start after the highest number used in any
978    branch. The following example is taken from the Perl documentation.
979    The numbers underneath show in which buffer the captured content will be
980    stored.
981    .sp
982      # before  ---------------branch-reset----------- after
983      / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
984      # 1            2         2  3        2     3     4
985    .sp
986    A back reference or a recursive call to a numbered subpattern always refers to
987    the first one in the pattern with the given number.
988    .P
989    An alternative approach to using this "branch reset" feature is to use
990    duplicate named subpatterns, as described in the next section.
991    .
992    .
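To see what the branch reset group saves, it may help to look at the same alternation without it. The sketch below uses Python's standard re module purely as an illustration (an assumption of this example; it is not PCRE and has no (?| construct), and the helper name day_prefix is hypothetical. With an ordinary non-capturing group the two alternatives get numbers one and two, so the caller has to check both.

```python
import re

# Ordinary non-capturing group: (Sat) is group 1, (Sun) is group 2.
pattern = re.compile(r"(?:(Sat)ur|(Sun))day")

def day_prefix(text):
    """Return the captured prefix, whichever alternative matched."""
    m = pattern.fullmatch(text)
    if m is None:
        return None
    # Only one of the two groups takes part in any given match,
    # so the other one is None.
    return m.group(1) or m.group(2)

print(day_prefix("Saturday"))
print(day_prefix("Sunday"))
```

With (?|(Sat)ur|(Sun))day in PCRE, both alternatives would be numbered one and the "or" step would not be needed.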
993  .SH "NAMED SUBPATTERNS"  .SH "NAMED SUBPATTERNS"
994  .rs  .rs
995  .sp  .sp
# Line 975  abbreviation. This pattern (ignoring the Line 1039  abbreviation. This pattern (ignoring the
1039    (?<DN>Sat)(?:urday)?    (?<DN>Sat)(?:urday)?
1040  .sp  .sp
1041  There are five capturing substrings, but only one is ever set after a match.  There are five capturing substrings, but only one is ever set after a match.
1042    (An alternative way of solving this problem is to use a "branch reset"
1043    subpattern, as described in the previous section.)
1044    .P
1045  The convenience function for extracting the data by name returns the substring  The convenience function for extracting the data by name returns the substring
1046  for the first (and in this example, the only) subpattern of that name that  for the first (and in this example, the only) subpattern of that name that
1047  matched. This saves searching to find which numbered subpattern it was. If you  matched. This saves searching to find which numbered subpattern it was. If you
# Line 1293  back reference, the case of letters is r Line 1360  back reference, the case of letters is r
1360  matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original  matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1361  capturing subpattern is matched caselessly.  capturing subpattern is matched caselessly.
1362  .P  .P
1363  Back references to named subpatterns use the Perl syntax \ek<name> or \ek'name'  There are several different ways of writing back references to named
1364  or the Python syntax (?P=name). We could rewrite the above example in either of  subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
1365    \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
1366    back reference syntax, in which \eg can be used for both numeric and named
1367    references, is also supported. We could rewrite the above example in any of
1368  the following ways:  the following ways:
1369  .sp  .sp
1370    (?<p1>(?i)rah)\es+\ek<p1>    (?<p1>(?i)rah)\es+\ek<p1>
1371      (?'p1'(?i)rah)\es+\ek{p1}
1372    (?P<p1>(?i)rah)\es+(?P=p1)    (?P<p1>(?i)rah)\es+(?P=p1)
1373      (?<p1>(?i)rah)\es+\eg{p1}
1374  .sp  .sp
1375  A subpattern that is referenced by name may appear in the pattern before or  A subpattern that is referenced by name may appear in the pattern before or
1376  after the reference.  after the reference.
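The Python form of the named back reference can be tried directly in Python's standard re module, which is used here only as an illustration and is not PCRE. One adaptation is assumed: the option setting inside the group is written as the scoped form (?i:rah), because recent Python versions reject an inline (?i) that is not at the start of the pattern.

```python
import re

# (?i:...) limits caseless matching to the named group; the back reference
# outside it must repeat the captured text exactly.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

results = {
    subject: bool(pattern.fullmatch(subject))
    for subject in ("rah rah", "RAH RAH", "RAH rah")
}
print(results)
```

As in the description above, the two same-case subjects match and the mixed-case subject does not.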
# Line 1421  lengths, but it is acceptable if rewritt Line 1493  lengths, but it is acceptable if rewritt
1493  .sp  .sp
1494    (?<=abc|abde)    (?<=abc|abde)
1495  .sp  .sp
1496    In some cases, the Perl 5.10 escape sequence \eK
1497    .\" HTML <a href="#resetmatchstart">
1498    .\" </a>
1499    (see above)
1500    .\"
1501    can be used instead of a lookbehind assertion; this is not restricted to a
1502    fixed length.
1503    .P
1504  The implementation of lookbehind assertions is, for each alternative, to  The implementation of lookbehind assertions is, for each alternative, to
1505  temporarily move the current position back by the fixed length and then try to  temporarily move the current position back by the fixed length and then try to
1506  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
# Line 1515  recursion, a pseudo-condition called DEF Line 1595  recursion, a pseudo-condition called DEF
1595  .sp  .sp
1596  If the text between the parentheses consists of a sequence of digits, the  If the text between the parentheses consists of a sequence of digits, the
1597  condition is true if the capturing subpattern of that number has previously  condition is true if the capturing subpattern of that number has previously
1598  matched.  matched. An alternative notation is to precede the digits with a plus or minus
1599    sign. In this case, the subpattern number is relative rather than absolute.
1600    The most recently opened parentheses can be referenced by (?(-1), the next most
1601    recent by (?(-2), and so on. In looping constructs it can also make sense to
1602    refer to subsequent groups with constructs such as (?(+2).
1603  .P  .P
1604  Consider the following pattern, which contains non-significant white space to  Consider the following pattern, which contains non-significant white space to
1605  make it more readable (assume the PCRE_EXTENDED option) and to divide it into  make it more readable (assume the PCRE_EXTENDED option) and to divide it into
# Line 1532  the condition is true, and so the yes-pa Line 1616  the condition is true, and so the yes-pa
1616  parenthesis is required. Otherwise, since no-pattern is not present, the  parenthesis is required. Otherwise, since no-pattern is not present, the
1617  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
1618  non-parentheses, optionally enclosed in parentheses.  non-parentheses, optionally enclosed in parentheses.
1619    .P
1620    If you were embedding this pattern in a larger one, you could use a relative
1621    reference:
1622    .sp
1623      ...other stuff... ( \e( )?    [^()]+    (?(-1) \e) ) ...
1624    .sp
1625    This makes the fragment independent of the parentheses in the larger pattern.
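The conditional pattern discussed above can be exercised with an absolute group number in Python's standard re module. This is only a sketch under that assumption: Python's re is not PCRE, and it does not accept the relative form (?(-1), so the condition is written as (?(1) here.

```python
import re

# Verbose mode allows the same non-significant white space as PCRE_EXTENDED.
pattern = re.compile(
    r"""
    ( \( )?      # group 1: optional opening parenthesis
    [^()]+       # one or more non-parentheses
    (?(1) \) )   # closing parenthesis required only if group 1 matched
    """,
    re.VERBOSE,
)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, bool(pattern.fullmatch(subject)))
```

Because the condition names group 1 explicitly, this version would need renumbering if it were embedded after other capturing groups, which is the problem the relative form avoids.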
1626  .  .
1627  .SS "Checking for a used subpattern by name"  .SS "Checking for a used subpattern by name"
1628  .rs  .rs
# Line 1674  pattern, so instead you could use this: Line 1765  pattern, so instead you could use this:
1765    ( \e( ( (?>[^()]+) | (?1) )* \e) )    ( \e( ( (?>[^()]+) | (?1) )* \e) )
1766  .sp  .sp
1767  We have put the pattern into parentheses, and caused the recursion to refer to  We have put the pattern into parentheses, and caused the recursion to refer to
1768  them instead of the whole pattern. In a larger pattern, keeping track of  them instead of the whole pattern.
1769  parenthesis numbers can be tricky. It may be more convenient to use named  .P
1770  parentheses instead. The Perl syntax for this is (?&name); PCRE's earlier  In a larger pattern, keeping track of parenthesis numbers can be tricky. This
1771  syntax (?P>name) is also supported. We could rewrite the above example as  is made easier by the use of relative references (a Perl 5.10 feature).
1772  follows:  Instead of (?1) in the pattern above you can write (?-2) to refer to the second
1773    most recently opened parentheses preceding the recursion. In other words, a
1774    negative number counts capturing parentheses leftwards from the point at which
1775    it is encountered.
1776    .P
1777    It is also possible to refer to subsequently opened parentheses, by writing
1778    references such as (?+2). However, these cannot be recursive because the
1779    reference is not inside the parentheses that are referenced. They are always
1780    "subroutine" calls, as described in the next section.
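The counting rule for relative references can be written down as a small helper. This is a sketch of the arithmetic described above, not PCRE's implementation; the function name and its parameters are hypothetical, and rejecting a relative value of zero is an assumption of the sketch.

```python
def absolute_group(opened_so_far, relative):
    """Convert a relative group reference to an absolute group number.

    opened_so_far: capturing parentheses opened to the left of the reference.
    relative: signed offset, e.g. -2 for (?-2) or +1 for (?+1).
    """
    if relative < 0:
        # -1 is the most recently opened group, -2 the one before, and so on.
        number = opened_so_far + 1 + relative
    elif relative > 0:
        # +1 is the next group to be opened after the reference.
        number = opened_so_far + relative
    else:
        raise ValueError("a relative reference must not be zero")
    if number < 1:
        raise ValueError("reference points before the first group")
    return number

# In ( \( ( (?>[^()]+) | (?1) )* \) ) two capturing groups are open at the
# recursion (the atomic group does not capture), so (?-2) means group 1.
print(absolute_group(2, -2))
```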
1781    .P
1782    An alternative approach is to use named parentheses instead. The Perl syntax
1783    for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
1784    could rewrite the above example as follows:
1785  .sp  .sp
1786    (?<pn> \e( ( (?>[^()]+) | (?&pn) )* \e) )    (?<pn> \e( ( (?>[^()]+) | (?&pn) )* \e) )
1787  .sp  .sp
1788  If there is more than one subpattern with the same name, the earliest one is  If there is more than one subpattern with the same name, the earliest one is
1789  used. This particular example pattern contains nested unlimited repeats, and so  used.
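What the recursive parentheses pattern accepts can also be expressed as an ordinary recursive function, which may make the role of the recursive call easier to see. The following is a hand-written matcher, not a regular expression and not how PCRE works internally; the function name match_group is hypothetical.

```python
def match_group(s, i=0):
    """Return the index just past a balanced group starting at s[i], or -1."""
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1                          # consume the opening parenthesis
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_group(s, i)   # the recursive call, like (?1) or (?&pn)
            if i == -1:
                return -1
        else:
            i += 1                  # a non-parenthesis, like [^()]+
    # A closing parenthesis is required to finish the group.
    return i + 1 if i < len(s) else -1

print(match_group("(ab(cd)ef)"))
print(match_group("(aaa()"))
```

The function scans each character once, so an unbalanced subject is rejected without the repeated re-splitting that nested unlimited repeats can cause in a backtracking matcher.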
1790  the use of atomic grouping for matching strings of non-parentheses is important  .P
1791  when applying the pattern to strings that do not match. For example, when this  This particular example pattern that we have been looking at contains nested
1792  pattern is applied to  unlimited repeats, and so the use of atomic grouping for matching strings of
1793    non-parentheses is important when applying the pattern to strings that do not
1794    match. For example, when this pattern is applied to
1795  .sp  .sp
1796    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1797  .sp  .sp
# Line 1738  is the actual recursive call. Line 1843  is the actual recursive call.
1843  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern reference (either by number or by
1844  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
1845  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The "called" subpattern may be defined
1846  before or after the reference. An earlier example pointed out that the pattern  before or after the reference. A numbered reference can be absolute or
1847    relative, as in these examples:
1848    .sp
1849      (...(absolute)...)...(?2)...
1850      (...(relative)...)...(?-1)...
1851      (...(?+1)...(relative)...
1852    .sp
1853    An earlier example pointed out that the pattern
1854  .sp  .sp
1855    (sens|respons)e and \e1ibility    (sens|respons)e and \e1ibility
1856  .sp  .sp
# Line 1759  When a subpattern is used as a subroutin Line 1871  When a subpattern is used as a subroutin
1871  case-independence are fixed when the subpattern is defined. They cannot be  case-independence are fixed when the subpattern is defined. They cannot be
1872  changed for different calls. For example, consider this pattern:  changed for different calls. For example, consider this pattern:
1873  .sp  .sp
1874    (abc)(?i:(?1))    (abc)(?i:(?-1))
1875  .sp  .sp
1876  It matches "abcabc". It does not match "abcABC" because the change of  It matches "abcabc". It does not match "abcABC" because the change of
1877  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
# Line 1821  Cambridge CB2 3QH, England. Line 1933  Cambridge CB2 3QH, England.
1933  .rs  .rs
1934  .sp  .sp
1935  .nf  .nf
1936  Last updated: 06 March 2007  Last updated: 11 June 2007
1937  Copyright (c) 1997-2007 University of Cambridge.  Copyright (c) 1997-2007 University of Cambridge.
1938  .fi  .fi

