Diff of /code/trunk/doc/html/pcrepattern.html

revision 243 by ph10, Thu Sep 13 09:28:14 2007 UTC revision 416 by ph10, Sat Apr 11 14:34:02 2009 UTC
# Line 35  man page, in case the conversion went wr Line 35  man page, in case the conversion went wr
35  <li><a name="TOC20" href="#SEC20">COMMENTS</a>  <li><a name="TOC20" href="#SEC20">COMMENTS</a>
36  <li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a>  <li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a>
37  <li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a>  <li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a>
38  <li><a name="TOC23" href="#SEC23">CALLOUTS</a>  <li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a>
39  <li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>  <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40  <li><a name="TOC25" href="#SEC25">SEE ALSO</a>  <li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
41  <li><a name="TOC26" href="#SEC26">AUTHOR</a>  <li><a name="TOC26" href="#SEC26">SEE ALSO</a>
42  <li><a name="TOC27" href="#SEC27">REVISION</a>  <li><a name="TOC27" href="#SEC27">AUTHOR</a>
43    <li><a name="TOC28" href="#SEC28">REVISION</a>
44  </ul>  </ul>
45  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
46  <P>  <P>
47  The syntax and semantics of the regular expressions that are supported by PCRE  The syntax and semantics of the regular expressions that are supported by PCRE
48  are described in detail below. There is a quick-reference syntax summary in the  are described in detail below. There is a quick-reference syntax summary in the
49  <a href="pcresyntax.html"><b>pcresyntax</b></a>  <a href="pcresyntax.html"><b>pcresyntax</b></a>
50  page. Perl's regular expressions are described in its own documentation, and  page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
51    also supports some alternative regular expression syntax (which does not
52    conflict with the Perl syntax) in order to provide some compatibility with
53    regular expressions in Python, .NET, and Oniguruma.
54    </P>
55    <P>
56    Perl's regular expressions are described in its own documentation, and
57  regular expressions in general are covered in a number of books, some of which  regular expressions in general are covered in a number of books, some of which
58  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
59  published by O'Reilly, covers regular expressions in great detail. This  published by O'Reilly, covers regular expressions in great detail. This
# Line 56  description of PCRE's regular expression Line 63  description of PCRE's regular expression
63  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
64  there is now also support for UTF-8 character strings. To use this, you must  there is now also support for UTF-8 character strings. To use this, you must
65  build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with  build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
66  the PCRE_UTF8 option. How this affects pattern matching is mentioned in several  the PCRE_UTF8 option. There is also a special sequence that can be given at the
67  places below. There is also a summary of UTF-8 features in the  start of a pattern:
68    <pre>
69      (*UTF8)
70    </pre>
71    Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
72    option. This feature is not Perl-compatible. How setting UTF-8 mode affects
73    pattern matching is mentioned in several places below. There is also a summary
74    of UTF-8 features in the
75  <a href="pcre.html#utf8support">section on UTF-8 support</a>  <a href="pcre.html#utf8support">section on UTF-8 support</a>
76  in the main  in the main
77  <a href="pcre.html"><b>pcre</b></a>  <a href="pcre.html"><b>pcre</b></a>
# Line 113  The newline convention does not affect w Line 127  The newline convention does not affect w
127  default, this is any Unicode newline sequence, for Perl compatibility. However,  default, this is any Unicode newline sequence, for Perl compatibility. However,
128  this can be changed; see the description of \R in the section entitled  this can be changed; see the description of \R in the section entitled
129  <a href="#newlineseq">"Newline sequences"</a>  <a href="#newlineseq">"Newline sequences"</a>
130  below.  below. A change of \R setting can be combined with a change of newline
131    convention.
132  </P>  </P>
133  <br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>  <br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
134  <P>  <P>
# Line 311  following the discussion of Line 326  following the discussion of
326  <a href="#subpattern">parenthesized subpatterns.</a>  <a href="#subpattern">parenthesized subpatterns.</a>
327  </P>  </P>
328  <br><b>  <br><b>
329    Absolute and relative subroutine calls
330    </b><br>
331    <P>
332    For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
333    a number enclosed either in angle brackets or single quotes, is an alternative
334    syntax for referencing a subpattern as a "subroutine". Details are discussed
335    <a href="#onigurumasubroutines">later.</a>
336    Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
337    synonymous. The former is a back reference; the latter is a subroutine call.
338    </P>
339    <br><b>
340  Generic character types  Generic character types
341  </b><br>  </b><br>
342  <P>  <P>
# Line 349  In UTF-8 mode, characters with values gr Line 375  In UTF-8 mode, characters with values gr
375  \w, and always match \D, \S, and \W. This is true even when Unicode  \w, and always match \D, \S, and \W. This is true even when Unicode
376  character property support is available. These sequences retain their original  character property support is available. These sequences retain their original
377  meanings from before UTF-8 support was available, mainly for efficiency  meanings from before UTF-8 support was available, mainly for efficiency
378  reasons.  reasons. Note that this also affects \b, because it is defined in terms of \w
379    and \W.
380  </P>  </P>
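The effect on \b noted in the added text can be illustrated with a rough analogue. The sketch below is an assumption: it uses Python's re module, which is a separate engine from PCRE. Its re.ASCII flag restricts \w to ASCII characters, which resembles PCRE's treatment of characters with values greater than 128 when \w keeps its original meaning. The sample string is hypothetical.

```python
import re

# Assumption: Python's re with re.ASCII stands in for PCRE's behaviour;
# it is not PCRE itself. Under re.ASCII, \w matches only [a-zA-Z0-9_].
subject = "caf\u00e9 bar"  # U+00E9 has a value greater than 128

# U+00E9 does not match \w here, so \b sees a boundary just before it.
ascii_words = re.findall(r"\b\w+\b", subject, re.ASCII)

# Without re.ASCII, Python treats U+00E9 as a word character.
unicode_words = re.findall(r"\b\w+\b", subject)

print(ascii_words)
print(unicode_words)
```

The first result splits the word before the non-ASCII letter, because \b follows the narrower definition of \w; the second does not.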
381  <P>  <P>
382  The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the  The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
# Line 427  recognized. Line 454  recognized.
454  <P>  <P>
455  It is possible to restrict \R to match only CR, LF, or CRLF (instead of the  It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
456  complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF  complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
457  either at compile time or when the pattern is matched. This can be made the  either at compile time or when the pattern is matched. (BSR is an abbreviation
458  default when PCRE is built; if this is the case, the other behaviour can be  for "backslash R".) This can be made the default when PCRE is built; if this is
459  requested via the PCRE_BSR_UNICODE option. It is also possible to specify these  the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
460  settings by starting a pattern string with one of the following sequences:  It is also possible to specify these settings by starting a pattern string with
461    one of the following sequences:
462  <pre>  <pre>
463    (*BSR_ANYCRLF)   CR, LF, or CRLF only    (*BSR_ANYCRLF)   CR, LF, or CRLF only
464    (*BSR_UNICODE)   any Unicode newline sequence    (*BSR_UNICODE)   any Unicode newline sequence
# Line 439  These override the default and the optio Line 467  These override the default and the optio
467  they can be overridden by options given to <b>pcre_exec()</b>. Note that these  they can be overridden by options given to <b>pcre_exec()</b>. Note that these
468  special settings, which are not Perl-compatible, are recognized only at the  special settings, which are not Perl-compatible, are recognized only at the
469  very start of a pattern, and that they must be in upper case. If more than one  very start of a pattern, and that they must be in upper case. If more than one
470  of them is present, the last one is used.  of them is present, the last one is used. They can be combined with a change of
471  </P>  newline convention; for example, a pattern can start with:
472  <P>  <pre>
473      (*ANY)(*BSR_ANYCRLF)
474    </pre>
475  Inside a character class, \R matches the letter "R".  Inside a character class, \R matches the letter "R".
476  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
477  <br><b>  <br><b>
# Line 1008  changed in the same way as the Perl-comp Line 1038  changed in the same way as the Perl-comp
1038  J, U and X respectively.  J, U and X respectively.
1039  </P>  </P>
1040  <P>  <P>
1041  When an option change occurs at top level (that is, not inside subpattern  When one of these option changes occurs at top level (that is, not inside
1042  parentheses), the change applies to the remainder of the pattern that follows.  subpattern parentheses), the change applies to the remainder of the pattern
1043  If the change is placed right at the start of a pattern, PCRE extracts it into  that follows. If the change is placed right at the start of a pattern, PCRE
1044  the global options (and it will therefore show up in data extracted by the  extracts it into the global options (and it will therefore show up in data
1045  <b>pcre_fullinfo()</b> function).  extracted by the <b>pcre_fullinfo()</b> function).
1046  </P>  </P>
1047  <P>  <P>
1048  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
# Line 1031  matches "ab", "aB", "c", and "C", even t Line 1061  matches "ab", "aB", "c", and "C", even t
1061  branch is abandoned before the option setting. This is because the effects of  branch is abandoned before the option setting. This is because the effects of
1062  option settings happen at compile time. There would be some very weird  option settings happen at compile time. There would be some very weird
1063  behaviour otherwise.  behaviour otherwise.
1064    </P>
1065    <P>
1066    <b>Note:</b> There are other PCRE-specific options that can be set by the
1067    application when the compile or match functions are called. In some cases the
1068    pattern can contain special leading sequences such as (*CRLF) to override what
1069    the application has set or what has been defaulted. Details are given in the
1070    section entitled
1071    <a href="#newlineseq">"Newline sequences"</a>
1072    above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
1073    mode; this is equivalent to setting the PCRE_UTF8 option.
1074  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
1075  <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
1076  <P>  <P>
# Line 1172  details of the interfaces for handling n Line 1212  details of the interfaces for handling n
1212  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
1213  documentation.  documentation.
1214  </P>  </P>
1215    <P>
1216    <b>Warning:</b> You cannot use different names to distinguish between two
1217    subpatterns with the same number (see the previous section) because PCRE uses
1218    only the numbers when matching.
1219    </P>
1220  <br><a name="SEC15" href="#TOC1">REPETITION</a><br>  <br><a name="SEC15" href="#TOC1">REPETITION</a><br>
1221  <P>  <P>
1222  Repetition is specified by quantifiers, which can follow any of the following  Repetition is specified by quantifiers, which can follow any of the following
# Line 1219  which may be several bytes long (and the Line 1264  which may be several bytes long (and the
1264  </P>  </P>
1265  <P>  <P>
1266  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1267  previous item and the quantifier were not present.  previous item and the quantifier were not present. This may be useful for
1268    subpatterns that are referenced as
1269    <a href="#subpatternsassubroutines">subroutines</a>
1270    from elsewhere in the pattern. Items other than subpatterns that have a {0}
1271    quantifier are omitted from the compiled pattern.
1272  </P>  </P>
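A minimal sketch of the basic {0} behaviour described above, assuming Python's re module as a stand-in (it is a different engine from PCRE and has no subroutine calls, so only the "as if not present" effect is shown). The pattern and subject strings are hypothetical.

```python
import re

# Assumption: Python's re is used only as an analogue of PCRE here.
# "b{0}" contributes nothing, so "ab{0}c" should behave like "ac".
with_zero = re.compile(r"ab{0}c")
without_item = re.compile(r"ac")

for subject in ("ac", "abc"):
    matched = with_zero.fullmatch(subject) is not None
    agrees = matched == (without_item.fullmatch(subject) is not None)
    print(subject, matched, agrees)
```

Both patterns accept "ac" and reject "abc", which is the behaviour the paragraph describes for an item followed by {0}.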
1273  <P>  <P>
1274  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
# Line 2019  changed for different calls. For example Line 2068  changed for different calls. For example
2068  </pre>  </pre>
2069  It matches "abcabc". It does not match "abcABC" because the change of  It matches "abcabc". It does not match "abcABC" because the change of
2070  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
2071    <a name="onigurumasubroutines"></a></P>
2072    <br><a name="SEC23" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
2073    <P>
2074    For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
2075    a number enclosed either in angle brackets or single quotes, is an alternative
2076    syntax for referencing a subpattern as a subroutine, possibly recursively. Here
2077    are two of the examples used above, rewritten using this syntax:
2078    <pre>
2079      (?&#60;pn&#62; \( ( (?&#62;[^()]+) | \g&#60;pn&#62; )* \) )
2080      (sens|respons)e and \g'1'ibility
2081    </pre>
2082    PCRE supports an extension to Oniguruma: if a number is preceded by a
2083    plus or a minus sign it is taken as a relative reference. For example:
2084    <pre>
2085      (abc)(?i:\g&#60;-1&#62;)
2086    </pre>
2087    Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
2088    synonymous. The former is a back reference; the latter is a subroutine call.
2089  </P>  </P>
2090  <br><a name="SEC23" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
2091  <P>  <P>
2092  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
2093  code to be obeyed in the middle of matching a regular expression. This makes it  code to be obeyed in the middle of matching a regular expression. This makes it
# Line 2055  description of the interface to the call Line 2122  description of the interface to the call
2122  <a href="pcrecallout.html"><b>pcrecallout</b></a>  <a href="pcrecallout.html"><b>pcrecallout</b></a>
2123  documentation.  documentation.
2124  </P>  </P>
2125  <br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>  <br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
2126  <P>  <P>
2127  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
2128  are described in the Perl documentation as "experimental and subject to change  are described in the Perl documentation as "experimental and subject to change
# Line 2064  production code should be noted to avoid Line 2131  production code should be noted to avoid
2131  remarks apply to the PCRE features described in this section.  remarks apply to the PCRE features described in this section.
2132  </P>  </P>
2133  <P>  <P>
2134  Since these verbs are specifically related to backtracking, they can be used  Since these verbs are specifically related to backtracking, most of them can be
2135  only when the pattern is to be matched using <b>pcre_exec()</b>, which uses a  used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses
2136  backtracking algorithm. They cause an error if encountered by  a backtracking algorithm. With the exception of (*FAIL), which behaves like a
2137    failing negative assertion, they cause an error if encountered by
2138  <b>pcre_dfa_exec()</b>.  <b>pcre_dfa_exec()</b>.
2139  </P>  </P>
2140  <P>  <P>
# Line 2170  the end of the group if FOO succeeds); o Line 2238  the end of the group if FOO succeeds); o
2238  second alternative and tries COND2, without backtracking into COND1. If (*THEN)  second alternative and tries COND2, without backtracking into COND1. If (*THEN)
2239  is used outside of any alternation, it acts exactly like (*PRUNE).  is used outside of any alternation, it acts exactly like (*PRUNE).
2240  </P>  </P>
2241  <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
2242  <P>  <P>
2243  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).
2244  </P>  </P>
2245  <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>  <br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
2246  <P>  <P>
2247  Philip Hazel  Philip Hazel
2248  <br>  <br>
# Line 2183  University Computing Service Line 2251  University Computing Service
2251  Cambridge CB2 3QH, England.  Cambridge CB2 3QH, England.
2252  <br>  <br>
2253  </P>  </P>
2254  <br><a name="SEC27" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2255  <P>  <P>
2256  Last updated: 11 September 2007  Last updated: 11 April 2009
2257  <br>  <br>
2258  Copyright &copy; 1997-2007 University of Cambridge.  Copyright &copy; 1997-2009 University of Cambridge.
2259  <br>  <br>
2260  <p>  <p>
2261  Return to the <a href="index.html">PCRE index page</a>.  Return to the <a href="index.html">PCRE index page</a>.
