
Diff of /code/trunk/doc/html/pcrepattern.html


revision 172 by ph10, Tue Jun 5 10:40:13 2007 UTC revision 182 by ph10, Wed Jun 13 15:09:54 2007 UTC
# Line 24  man page, in case the conversion went wr Line 24  man page, in case the conversion went wr
24  <li><a name="TOC9" href="#SEC9">VERTICAL BAR</a>  <li><a name="TOC9" href="#SEC9">VERTICAL BAR</a>
25  <li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a>  <li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a>
26  <li><a name="TOC11" href="#SEC11">SUBPATTERNS</a>  <li><a name="TOC11" href="#SEC11">SUBPATTERNS</a>
27  <li><a name="TOC12" href="#SEC12">NAMED SUBPATTERNS</a>  <li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a>
28  <li><a name="TOC13" href="#SEC13">REPETITION</a>  <li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a>
29  <li><a name="TOC14" href="#SEC14">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>  <li><a name="TOC14" href="#SEC14">REPETITION</a>
30  <li><a name="TOC15" href="#SEC15">BACK REFERENCES</a>  <li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
31  <li><a name="TOC16" href="#SEC16">ASSERTIONS</a>  <li><a name="TOC16" href="#SEC16">BACK REFERENCES</a>
32  <li><a name="TOC17" href="#SEC17">CONDITIONAL SUBPATTERNS</a>  <li><a name="TOC17" href="#SEC17">ASSERTIONS</a>
33  <li><a name="TOC18" href="#SEC18">COMMENTS</a>  <li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a>
34  <li><a name="TOC19" href="#SEC19">RECURSIVE PATTERNS</a>  <li><a name="TOC19" href="#SEC19">COMMENTS</a>
35  <li><a name="TOC20" href="#SEC20">SUBPATTERNS AS SUBROUTINES</a>  <li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a>
36  <li><a name="TOC21" href="#SEC21">CALLOUTS</a>  <li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a>
37  <li><a name="TOC22" href="#SEC22">SEE ALSO</a>  <li><a name="TOC22" href="#SEC22">CALLOUTS</a>
38  <li><a name="TOC23" href="#SEC23">AUTHOR</a>  <li><a name="TOC23" href="#SEC23">SEE ALSO</a>
39  <li><a name="TOC24" href="#SEC24">REVISION</a>  <li><a name="TOC24" href="#SEC24">AUTHOR</a>
40    <li><a name="TOC25" href="#SEC25">REVISION</a>
41  </ul>  </ul>
42  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
43  <P>  <P>
# Line 270  following are always recognized: Line 271  following are always recognized:
271  <pre>  <pre>
272    \d     any decimal digit    \d     any decimal digit
273    \D     any character that is not a decimal digit    \D     any character that is not a decimal digit
274      \h     any horizontal whitespace character
275      \H     any character that is not a horizontal whitespace character
276    \s     any whitespace character    \s     any whitespace character
277    \S     any character that is not a whitespace character    \S     any character that is not a whitespace character
278      \v     any vertical whitespace character
279      \V     any character that is not a vertical whitespace character
280    \w     any "word" character    \w     any "word" character
281    \W     any "non-word" character    \W     any "non-word" character
282  </pre>  </pre>
# Line 287  there is no character to match. Line 292  there is no character to match.
292  <P>  <P>
293  For compatibility with Perl, \s does not match the VT character (code 11).  For compatibility with Perl, \s does not match the VT character (code 11).
294  This makes it different from the POSIX "space" class. The \s characters  This makes it different from the POSIX "space" class. The \s characters
295  are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is  are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
296  included in a Perl script, \s may match the VT character. In PCRE, it never  included in a Perl script, \s may match the VT character. In PCRE, it never
297  does.)  does.
298    </P>
299    <P>
300    In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
301    \w, and always match \D, \S, and \W. This is true even when Unicode
302    character property support is available. These sequences retain their original
303    meanings from before UTF-8 support was available, mainly for efficiency
304    reasons.
305    </P>
306    <P>
307    The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
308    other sequences, these do match certain high-valued codepoints in UTF-8 mode.
309    The horizontal space characters are:
310    <pre>
311      U+0009     Horizontal tab
312      U+0020     Space
313      U+00A0     Non-break space
314      U+1680     Ogham space mark
315      U+180E     Mongolian vowel separator
316      U+2000     En quad
317      U+2001     Em quad
318      U+2002     En space
319      U+2003     Em space
320      U+2004     Three-per-em space
321      U+2005     Four-per-em space
322      U+2006     Six-per-em space
323      U+2007     Figure space
324      U+2008     Punctuation space
325      U+2009     Thin space
326      U+200A     Hair space
327      U+202F     Narrow no-break space
328      U+205F     Medium mathematical space
329      U+3000     Ideographic space
330    </pre>
331    The vertical space characters are:
332    <pre>
333      U+000A     Linefeed
334      U+000B     Vertical tab
335      U+000C     Formfeed
336      U+000D     Carriage return
337      U+0085     Next line
338      U+2028     Line separator
339      U+2029     Paragraph separator
340    </pre>
341  </P>  </P>
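The two code point lists above can be checked outside PCRE. The following is a minimal sketch, assuming Python's standard `re` module, which does not provide \h or \v as horizontal and vertical whitespace classes. The names `HSPACE`, `VSPACE`, and `char_class` are hypothetical helpers introduced only for illustration; the explicit character classes emulate the listed sets and are not how PCRE implements them.

```python
import re

# Code points listed in the diff for \h (horizontal whitespace).
# U+2000..U+200A covers "En quad" through "Hair space".
HSPACE = ([0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
          + list(range(0x2000, 0x200B))
          + [0x202F, 0x205F, 0x3000])

# Code points listed in the diff for \v (vertical whitespace).
VSPACE = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points, negate=False):
    """Build an explicit character class such as [\\u0009\\u0020...].

    Hypothetical helper: with negate=True it emulates \\H or \\V.
    """
    body = "".join("\\u%04X" % cp for cp in code_points)
    return "[%s%s]" % ("^" if negate else "", body)


H = re.compile(char_class(HSPACE))                    # emulates \h
NOT_H = re.compile(char_class(HSPACE, negate=True))   # emulates \H
V = re.compile(char_class(VSPACE))                    # emulates \v
H_RUN = re.compile(char_class(HSPACE) + "+")          # emulates \h+


def split_on_hspace(text):
    """Split text on runs of horizontal whitespace only."""
    return H_RUN.split(text)
```

For instance, `split_on_hspace("a\u00a0b\tc")` separates the three letters, while a linefeed is left alone because it belongs to the vertical set.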
342  <P>  <P>
343  A "word" character is an underscore or any character less than 256 that is a  A "word" character is an underscore or any character less than 256 that is a
# Line 301  in the Line 349  in the
349  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
350  page). For example, in a French locale such as "fr_FR" in Unix-like systems,  page). For example, in a French locale such as "fr_FR" in Unix-like systems,
351  or "french" in Windows, some character codes greater than 128 are used for  or "french" in Windows, some character codes greater than 128 are used for
352  accented letters, and these are matched by \w.  accented letters, and these are matched by \w. The use of locales with Unicode
353  </P>  is discouraged.
 <P>  
 In UTF-8 mode, characters with values greater than 128 never match \d, \s, or  
 \w, and always match \D, \S, and \W. This is true even when Unicode  
 character property support is available. The use of locales with Unicode is  
 discouraged.  
354  </P>  </P>
355  <br><b>  <br><b>
356  Newline sequences  Newline sequences
357  </b><br>  </b><br>
358  <P>  <P>
359  Outside a character class, the escape sequence \R matches any Unicode newline  Outside a character class, the escape sequence \R matches any Unicode newline
360  sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to  sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to
361  the following:  the following:
362  <pre>  <pre>
363    (?&#62;\r\n|\n|\x0b|\f|\r|\x85)    (?&#62;\r\n|\n|\x0b|\f|\r|\x85)
# Line 966  from left to right, and options are not Line 1009  from left to right, and options are not
1009  is reached, an option setting in one branch does affect subsequent branches, so  is reached, an option setting in one branch does affect subsequent branches, so
1010  the above patterns match "SUNDAY" as well as "Saturday".  the above patterns match "SUNDAY" as well as "Saturday".
1011  </P>  </P>
1012  <br><a name="SEC12" href="#TOC1">NAMED SUBPATTERNS</a><br>  <br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
1013    <P>
1014    Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1015    the same numbers for its capturing parentheses. Such a subpattern starts with
1016    (?| and is itself a non-capturing subpattern. For example, consider this
1017    pattern:
1018    <pre>
1019      (?|(Sat)ur|(Sun))day
1020    </pre>
1021    Because the two alternatives are inside a (?| group, both sets of capturing
1022    parentheses are numbered one. Thus, when the pattern matches, you can look
1023    at captured substring number one, whichever alternative matched. This construct
1024    is useful when you want to capture part, but not all, of one of a number of
1025    alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1026    number is reset at the start of each branch. The numbers of any capturing
1027    buffers that follow the subpattern start after the highest number used in any
1028    branch. The following example is taken from the Perl documentation.
1029    The numbers underneath show in which buffer the captured content will be
1030    stored.
1031    <pre>
1032      # before  ---------------branch-reset----------- after
1033      / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1034      # 1            2         2  3        2     3     4
1035    </pre>
1036    A backreference or a recursive call to a numbered subpattern always refers to
1037    the first one in the pattern with the given number.
1038    </P>
1039    <P>
1040    An alternative approach to using this "branch reset" feature is to use
1041    duplicate named subpatterns, as described in the next section.
1042    </P>
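To see what the (?| group saves, compare it with an engine that lacks it. The following is a rough sketch, assuming Python's standard `re` module, which does not implement branch reset. There the two alternatives get separate numbers (1 and 2), and the caller has to find the one that participated. The helper names `first_set_group` and `day_prefix` are hypothetical.

```python
import re

# Without branch reset, (Sat) is group 1 and (Sun) is group 2.
# With PCRE's (?|(Sat)ur|(Sun))day both would be group 1.
PLAIN = re.compile(r"(?:(Sat)ur|(Sun))day")


def first_set_group(match, *numbers):
    """Return the first group among `numbers` that took part in the match.

    This is the bookkeeping that a branch reset group makes unnecessary.
    """
    for n in numbers:
        value = match.group(n)
        if value is not None:
            return value
    return None


def day_prefix(text):
    """Return 'Sat' or 'Sun' for a matching day name, else None."""
    m = PLAIN.fullmatch(text)
    return first_set_group(m, 1, 2) if m else None
```

With branch reset the caller would simply read captured substring number one, whichever alternative matched.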
1043    <br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
1044  <P>  <P>
1045  Identifying capturing parentheses by number is simple, but it can be very hard  Identifying capturing parentheses by number is simple, but it can be very hard
1046  to keep track of the numbers in complicated regular expressions. Furthermore,  to keep track of the numbers in complicated regular expressions. Furthermore,
# Line 1008  abbreviation. This pattern (ignoring the Line 1082  abbreviation. This pattern (ignoring the
1082    (?&#60;DN&#62;Sat)(?:urday)?    (?&#60;DN&#62;Sat)(?:urday)?
1083  </pre>  </pre>
1084  There are five capturing substrings, but only one is ever set after a match.  There are five capturing substrings, but only one is ever set after a match.
1085    (An alternative way of solving this problem is to use a "branch reset"
1086    subpattern, as described in the previous section.)
1087    </P>
1088    <P>
1089  The convenience function for extracting the data by name returns the substring  The convenience function for extracting the data by name returns the substring
1090  for the first (and in this example, the only) subpattern of that name that  for the first (and in this example, the only) subpattern of that name that
1091  matched. This saves searching to find which numbered subpattern it was. If you  matched. This saves searching to find which numbered subpattern it was. If you
# Line 1017  details of the interfaces for handling n Line 1095  details of the interfaces for handling n
1095  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
1096  documentation.  documentation.
1097  </P>  </P>
1098  <br><a name="SEC13" href="#TOC1">REPETITION</a><br>  <br><a name="SEC14" href="#TOC1">REPETITION</a><br>
1099  <P>  <P>
1100  Repetition is specified by quantifiers, which can follow any of the following  Repetition is specified by quantifiers, which can follow any of the following
1101  items:  items:
# Line 1168  example, after Line 1246  example, after
1246  </pre>  </pre>
1247  matches "aba" the value of the second captured substring is "b".  matches "aba" the value of the second captured substring is "b".
1248  <a name="atomicgroup"></a></P>  <a name="atomicgroup"></a></P>
1249  <br><a name="SEC14" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>  <br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
1250  <P>  <P>
1251  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1252  repetition, failure of what follows normally causes the repeated item to be  repetition, failure of what follows normally causes the repeated item to be
# Line 1267  an atomic group, like this: Line 1345  an atomic group, like this:
1345  </pre>  </pre>
1346  sequences of non-digits cannot be broken, and failure happens quickly.  sequences of non-digits cannot be broken, and failure happens quickly.
1347  <a name="backreferences"></a></P>  <a name="backreferences"></a></P>
1348  <br><a name="SEC15" href="#TOC1">BACK REFERENCES</a><br>  <br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br>
1349  <P>  <P>
1350  Outside a character class, a backslash followed by a digit greater than 0 (and  Outside a character class, a backslash followed by a digit greater than 0 (and
1351  possibly further digits) is a back reference to a capturing subpattern earlier  possibly further digits) is a back reference to a capturing subpattern earlier
# Line 1380  that the first iteration does not need t Line 1458  that the first iteration does not need t
1458  done using alternation, as in the example above, or by a quantifier with a  done using alternation, as in the example above, or by a quantifier with a
1459  minimum of zero.  minimum of zero.
1460  <a name="bigassertions"></a></P>  <a name="bigassertions"></a></P>
1461  <br><a name="SEC16" href="#TOC1">ASSERTIONS</a><br>  <br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br>
1462  <P>  <P>
1463  An assertion is a test on the characters following or preceding the current  An assertion is a test on the characters following or preceding the current
1464  matching point that does not actually consume any characters. The simple  matching point that does not actually consume any characters. The simple
# Line 1540  preceded by "foo", while Line 1618  preceded by "foo", while
1618  is another pattern that matches "foo" preceded by three digits and any three  is another pattern that matches "foo" preceded by three digits and any three
1619  characters that are not "999".  characters that are not "999".
1620  <a name="conditions"></a></P>  <a name="conditions"></a></P>
1621  <br><a name="SEC17" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>  <br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
1622  <P>  <P>
1623  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
1624  conditionally or to choose between two alternative subpatterns, depending on  conditionally or to choose between two alternative subpatterns, depending on
# Line 1678  subject is matched against the first alt Line 1756  subject is matched against the first alt
1756  against the second. This pattern matches strings in one of the two forms  against the second. This pattern matches strings in one of the two forms
1757  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
1758  <a name="comments"></a></P>  <a name="comments"></a></P>
1759  <br><a name="SEC18" href="#TOC1">COMMENTS</a><br>  <br><a name="SEC19" href="#TOC1">COMMENTS</a><br>
1760  <P>  <P>
1761  The sequence (?# marks the start of a comment that continues up to the next  The sequence (?# marks the start of a comment that continues up to the next
1762  closing parenthesis. Nested parentheses are not permitted. The characters  closing parenthesis. Nested parentheses are not permitted. The characters
# Line 1689  If the PCRE_EXTENDED option is set, an u Line 1767  If the PCRE_EXTENDED option is set, an u
1767  character class introduces a comment that continues to immediately after the  character class introduces a comment that continues to immediately after the
1768  next newline in the pattern.  next newline in the pattern.
1769  <a name="recursion"></a></P>  <a name="recursion"></a></P>
1770  <br><a name="SEC19" href="#TOC1">RECURSIVE PATTERNS</a><br>  <br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br>
1771  <P>  <P>
1772  Consider the problem of matching a string in parentheses, allowing for  Consider the problem of matching a string in parentheses, allowing for
1773  unlimited nested parentheses. Without the use of recursion, the best that can  unlimited nested parentheses. Without the use of recursion, the best that can
# Line 1819  In this pattern, (?(R) is the start of a Line 1897  In this pattern, (?(R) is the start of a
1897  different alternatives for the recursive and non-recursive cases. The (?R) item  different alternatives for the recursive and non-recursive cases. The (?R) item
1898  is the actual recursive call.  is the actual recursive call.
1899  <a name="subpatternsassubroutines"></a></P>  <a name="subpatternsassubroutines"></a></P>
1900  <br><a name="SEC20" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>  <br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
1901  <P>  <P>
1902  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern reference (either by number or by
1903  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
# Line 1859  changed for different calls. For example Line 1937  changed for different calls. For example
1937  It matches "abcabc". It does not match "abcABC" because the change of  It matches "abcabc". It does not match "abcABC" because the change of
1938  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
1939  </P>  </P>
1940  <br><a name="SEC21" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC22" href="#TOC1">CALLOUTS</a><br>
1941  <P>  <P>
1942  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
1943  code to be obeyed in the middle of matching a regular expression. This makes it  code to be obeyed in the middle of matching a regular expression. This makes it
# Line 1894  description of the interface to the call Line 1972  description of the interface to the call
1972  <a href="pcrecallout.html"><b>pcrecallout</b></a>  <a href="pcrecallout.html"><b>pcrecallout</b></a>
1973  documentation.  documentation.
1974  </P>  </P>
1975  <br><a name="SEC22" href="#TOC1">SEE ALSO</a><br>  <br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
1976  <P>  <P>
1977  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).
1978  </P>  </P>
1979  <br><a name="SEC23" href="#TOC1">AUTHOR</a><br>  <br><a name="SEC24" href="#TOC1">AUTHOR</a><br>
1980  <P>  <P>
1981  Philip Hazel  Philip Hazel
1982  <br>  <br>
# Line 1907  University Computing Service Line 1985  University Computing Service
1985  Cambridge CB2 3QH, England.  Cambridge CB2 3QH, England.
1986  <br>  <br>
1987  </P>  </P>
1988  <br><a name="SEC24" href="#TOC1">REVISION</a><br>  <br><a name="SEC25" href="#TOC1">REVISION</a><br>
1989  <P>  <P>
1990  Last updated: 29 May 2007  Last updated: 13 June 2007
1991  <br>  <br>
1992  Copyright &copy; 1997-2007 University of Cambridge.  Copyright &copy; 1997-2007 University of Cambridge.
1993  <br>  <br>
