
Diff of /code/trunk/doc/html/pcrepattern.html


revision 534 by ph10, Tue May 18 15:47:01 2010 UTC revision 535 by ph10, Thu Jun 3 19:18:24 2010 UTC
# Line 78  in the main Line 78  in the main
78  page.  page.
79  </P>  </P>
80  <P>  <P>
81    Another special sequence that may appear at the start of a pattern or in
82    combination with (*UTF8) is:
83    <pre>
84      (*UCP)
85    </pre>
86    This has the same effect as setting the PCRE_UCP option: it causes sequences
87    such as \d and \w to use Unicode properties to determine character types,
88    instead of recognizing only characters with codes less than 128 via a lookup
89    table.
90    </P>
91    <P>
92  The remainder of this document discusses the patterns that are supported by  The remainder of this document discusses the patterns that are supported by
93  PCRE when its main matching function, <b>pcre_exec()</b>, is used.  PCRE when its main matching function, <b>pcre_exec()</b>, is used.
94  From release 6.0, PCRE offers a second matching function,  From release 6.0, PCRE offers a second matching function,
# Line 357  Another use of backslash is for specifyi Line 368  Another use of backslash is for specifyi
368    \w     any "word" character    \w     any "word" character
369    \W     any "non-word" character    \W     any "non-word" character
370  </pre>  </pre>
371  There is also the single sequence \N, which matches a non-newline character.  There is also the single sequence \N, which matches a non-newline character.
372  This is the same as  This is the same as
373  <a href="#fullstopdot">the "." metacharacter</a>  <a href="#fullstopdot">the "." metacharacter</a>
374  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set.
375  </P>  </P>
376  <P>  <P>
377  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
378  of characters into two disjoint sets. Any given character matches one, and only  of characters into two disjoint sets. Any given character matches one, and only
379  one, of each pair.  one, of each pair. The sequences can appear both inside and outside character
 </P>  
 <P>  
 These character type sequences can appear both inside and outside character  
380  classes. They each match one character of the appropriate type. If the current  classes. They each match one character of the appropriate type. If the current
381  matching point is at the end of the subject string, all of them fail, since  matching point is at the end of the subject string, all of them fail, because
382  there is no character to match.  there is no character to match.
383  </P>  </P>
384  <P>  <P>
# Line 381  included in a Perl script, \s may match Line 389  included in a Perl script, \s may match
389  does.  does.
390  </P>  </P>
391  <P>  <P>
392  In UTF-8 mode, characters with values greater than 128 never match \d, \s, or  A "word" character is an underscore or any character that is a letter or digit.
393  \w, and always match \D, \S, and \W. This is true even when Unicode  By default, the definition of letters and digits is controlled by PCRE's
394  character property support is available. These sequences retain their original  low-valued character tables, and may vary if locale-specific matching is taking
395  meanings from before UTF-8 support was available, mainly for efficiency  place (see
396  reasons. Note that this also affects \b, because it is defined in terms of \w  <a href="pcreapi.html#localesupport">"Locale support"</a>
397  and \W.  in the
398    <a href="pcreapi.html"><b>pcreapi</b></a>
399    page). For example, in a French locale such as "fr_FR" in Unix-like systems,
400    or "french" in Windows, some character codes greater than 128 are used for
401    accented letters, and these are then matched by \w. The use of locales with
402    Unicode is discouraged.
403    </P>
404    <P>
405    By default, in UTF-8 mode, characters with values greater than 128 never match
406    \d, \s, or \w, and always match \D, \S, and \W. These sequences retain
407    their original meanings from before UTF-8 support was available, mainly for
408    efficiency reasons. However, if PCRE is compiled with Unicode property support,
409    and the PCRE_UCP option is set, the behaviour is changed so that Unicode
410    properties are used to determine character types, as follows:
411    <pre>
412      \d  any character that \p{Nd} matches (decimal digit)
413      \s  any character that \p{Z} matches, plus HT, LF, FF, CR
414      \w  any character that \p{L} or \p{N} matches, plus underscore
415    </pre>
416    The upper case escapes match the inverse sets of characters. Note that \d
417    matches only decimal digits, whereas \w matches any Unicode digit, as well as
418    any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and
419    \B because they are defined in terms of \w and \W. Matching these sequences
420    is noticeably slower when PCRE_UCP is set.
421  </P>  </P>
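The difference between the default tables and property-based matching can be sketched in Python. This is only an analogy: Python's re module is not PCRE and has no PCRE_UCP option. The assumption here is that re.ASCII roughly resembles PCRE's default behaviour and that Python's default str matching roughly resembles PCRE_UCP; the exact character sets are not identical. The helper names digits() and words() are made up for illustration.

```python
import re

# Analogy only: Python's re is not PCRE. re.ASCII approximates PCRE's
# default lookup-table behaviour; the default str mode approximates PCRE_UCP.
# Subject: ASCII "4", ARABIC-INDIC DIGIT THREE (U+0663), then "caf\u00e9_1".
SUBJECT = "4\u0663 caf\u00e9_1"


def digits(subject, unicode_props):
    """Return every \\d match, with or without Unicode properties."""
    flags = 0 if unicode_props else re.ASCII
    return re.findall(r"\d", subject, flags)


def words(subject, unicode_props):
    """Return every \\w+ match, with or without Unicode properties."""
    flags = 0 if unicode_props else re.ASCII
    return re.findall(r"\w+", subject, flags)


if __name__ == "__main__":
    print(digits(SUBJECT, False), digits(SUBJECT, True))
    print(words(SUBJECT, False), words(SUBJECT, True))
```

With table-based matching the accented letter and the Arabic-Indic digit split words and are not counted as digits; with property-based matching they are treated as a letter and a decimal digit.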
422  <P>  <P>
423  The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the  The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
424  other sequences, these do match certain high-valued codepoints in UTF-8 mode.  other sequences, which match only ASCII characters by default, these always
425  The horizontal space characters are:  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
426    set. The horizontal space characters are:
427  <pre>  <pre>
428    U+0009     Horizontal tab    U+0009     Horizontal tab
429    U+0020     Space    U+0020     Space
# Line 422  The vertical space characters are: Line 454  The vertical space characters are:
454    U+0085     Next line    U+0085     Next line
455    U+2028     Line separator    U+2028     Line separator
456    U+2029     Paragraph separator    U+2029     Paragraph separator
457  </PRE>  <a name="newlineseq"></a></PRE>
458  </P>  </P>
 <P>  
 A "word" character is an underscore or any character less than 256 that is a  
 letter or digit. The definition of letters and digits is controlled by PCRE's  
 low-valued character tables, and may vary if locale-specific matching is taking  
 place (see  
 <a href="pcreapi.html#localesupport">"Locale support"</a>  
 in the  
 <a href="pcreapi.html"><b>pcreapi</b></a>  
 page). For example, in a French locale such as "fr_FR" in Unix-like systems,  
 or "french" in Windows, some character codes greater than 128 are used for  
 accented letters, and these are matched by \w. The use of locales with Unicode  
 is discouraged.  
 <a name="newlineseq"></a></P>  
459  <br><b>  <br><b>
460  Newline sequences  Newline sequences
461  </b><br>  </b><br>
# Line 479  These override the default and the optio Line 498  These override the default and the optio
498  which are not Perl-compatible, are recognized only at the very start of a  which are not Perl-compatible, are recognized only at the very start of a
499  pattern, and that they must be in upper case. If more than one of them is  pattern, and that they must be in upper case. If more than one of them is
500  present, the last one is used. They can be combined with a change of newline  present, the last one is used. They can be combined with a change of newline
501  convention, for example, a pattern can start with:  convention; for example, a pattern can start with:
502  <pre>  <pre>
503    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
504  </pre>  </pre>
505  Inside a character class, \R is treated as an unrecognized escape sequence,  They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside
506  and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is  a character class, \R is treated as an unrecognized escape sequence, and so
507  set.  matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
508  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
509  <br><b>  <br><b>
510  Unicode character properties  Unicode character properties
# Line 504  The extra escape sequences are: Line 523  The extra escape sequences are:
523  The property names represented by <i>xx</i> above are limited to the Unicode  The property names represented by <i>xx</i> above are limited to the Unicode
524  script names, the general category properties, "Any", which matches any  script names, the general category properties, "Any", which matches any
525  character (including newline), and some special PCRE properties (described  character (including newline), and some special PCRE properties (described
526  in the  in the
527  <a href="#extraprops">next section).</a>  <a href="#extraprops">next section).</a>
528  Other Perl properties such as "InMusicalSymbols" are not currently supported by  Other Perl properties such as "InMusicalSymbols" are not currently supported by
529  PCRE. Note that \P{Any} does not match any characters, so always causes a  PCRE. Note that \P{Any} does not match any characters, so always causes a
# Line 720  non-UTF-8 mode \X matches any one charac Line 739  non-UTF-8 mode \X matches any one charac
739  Matching characters by Unicode property is not fast, because PCRE has to search  Matching characters by Unicode property is not fast, because PCRE has to search
740  a structure that contains data for over fifteen thousand characters. That is  a structure that contains data for over fifteen thousand characters. That is
741  why the traditional escape sequences such as \d and \w do not use Unicode  why the traditional escape sequences such as \d and \w do not use Unicode
742  properties in PCRE.  properties in PCRE by default, though you can make them do so by setting the
743    PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with
744    (*UCP).
745  <a name="extraprops"></a></P>  <a name="extraprops"></a></P>
746  <br><b>  <br><b>
747  PCRE's additional properties  PCRE's additional properties
748  </b><br>  </b><br>
749  <P>  <P>
750  As well as the standard Unicode properties described in the previous  As well as the standard Unicode properties described in the previous
751  section, PCRE supports four more that make it possible to convert traditional  section, PCRE supports four more that make it possible to convert traditional
752  escape sequences such as \w and \s and POSIX character classes to use Unicode  escape sequences such as \w and \s and POSIX character classes to use Unicode
753  properties. These are:  properties. PCRE uses these non-standard, non-Perl properties internally when
754    PCRE_UCP is set. They are:
755  <pre>  <pre>
756    Xan   Any alphanumeric character    Xan   Any alphanumeric character
757    Xps   Any POSIX space character    Xps   Any POSIX space character
758    Xsp   Any Perl space character    Xsp   Any Perl space character
759    Xwd   Any Perl "word" character    Xwd   Any Perl "word" character
760  </pre>  </pre>
761  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
762  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
763  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
764  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
765  same characters as Xan, plus underscore.  same characters as Xan, plus underscore.
766  <a name="resetmatchstart"></a></P>  <a name="resetmatchstart"></a></P>
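The four definitions above can be written out as small predicates. The following is a sketch in Python using the standard unicodedata module, not PCRE's own implementation; the function names are hypothetical, and the Unicode version behind unicodedata may differ from the tables built into a given PCRE release.

```python
import unicodedata


def is_xan(ch):
    """Xan: general category L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, formfeed, carriage return, or Z."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch):
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)
```

For instance, vertical tab satisfies is_xps() but not is_xsp(), and underscore satisfies is_xwd() but not is_xan().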
767  <br><b>  <br><b>
# Line 790  The backslashed assertions are: Line 812  The backslashed assertions are:
812    \G     matches at the first matching position in the subject    \G     matches at the first matching position in the subject
813  </pre>  </pre>
814  Inside a character class, \b has a different meaning; it matches the backspace  Inside a character class, \b has a different meaning; it matches the backspace
815  character. If any other of these assertions appears in a character class, by  character. If any other of these assertions appears in a character class, by
816  default it matches the corresponding literal character (for example, \B  default it matches the corresponding literal character (for example, \B
817  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
818  escape sequence" error is generated instead.  escape sequence" error is generated instead.
# Line 799  escape sequence" error is generated inst Line 821  escape sequence" error is generated inst
821  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
822  and the previous character do not both match \w or \W (i.e. one matches  and the previous character do not both match \w or \W (i.e. one matches
823  \w and the other matches \W), or the start or end of the string if the  \w and the other matches \W), or the start or end of the string if the
824  first or last character matches \w, respectively. Neither PCRE nor Perl has a  first or last character matches \w, respectively. In UTF-8 mode, the meanings
825  separte "start of word" or "end of word" metasequence. However, whatever  of \w and \W can be changed by setting the PCRE_UCP option. When this is
826  follows \b normally determines which it is. For example, the fragment  done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start
827  \ba matches "a" at the start of a word.  of word" or "end of word" metasequence. However, whatever follows \b normally
828    determines which it is. For example, the fragment \ba matches "a" at the start
829    of a word.
830  </P>  </P>
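Because \b is defined in terms of \w and \W, changing what counts as a word character moves the word boundaries. A rough illustration in Python follows; it assumes that re.ASCII stands in for PCRE's default tables and that Python's default str mode stands in for PCRE_UCP, which is an analogy and not the PCRE API. The function names are invented for this sketch.

```python
import re


def boundary_after_caf(subject, unicode_props):
    """True if a word boundary directly follows 'caf' in the subject."""
    flags = 0 if unicode_props else re.ASCII
    return re.search(r"caf\b", subject, flags) is not None


def word_initial_a(subject):
    """Offsets where the fragment \\ba matches 'a' at the start of a word."""
    return [m.start() for m in re.finditer(r"\ba", subject)]
```

In "caf\u00e9" the position after "f" is a boundary only when the accented letter is not a word character, so boundary_after_caf() is true with table-based matching and false with property-based matching. word_initial_a("a banana and") reports only the offsets of the two word-initial letters.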
831  <P>  <P>
832  The \A, \Z, and \z assertions differ from the traditional circumflex and  The \A, \Z, and \z assertions differ from the traditional circumflex and
# Line 916  dollar, the only relationship being that Line 940  dollar, the only relationship being that
940  special meaning in a character class.  special meaning in a character class.
941  </P>  </P>
942  <P>  <P>
943  The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not  The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not
944  set. In other words, it matches any one character except one that signifies the  set. In other words, it matches any one character except one that signifies the
945  end of a line.  end of a line.
946  </P>  </P>
947  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
# Line 1016  characters with values greater than 128 Line 1040  characters with values greater than 128
1040  property support.  property support.
1041  </P>  </P>
1042  <P>  <P>
1043  The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear  The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and
1044  in a character class, and add the characters that they match to the class. For  \W may also appear in a character class, and add the characters that they
1045  example, [\dABCDEF] matches any hexadecimal digit. A circumflex can  match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A
1046  conveniently be used with the upper case character types to specify a more  circumflex can conveniently be used with the upper case character types to
1047  restricted set of characters than the matching lower case type. For example,  specify a more restricted set of characters than the matching lower case type.
1048  the class [^\W_] matches any letter or digit, but not underscore.  For example, the class [^\W_] matches any letter or digit, but not underscore.
1049  </P>  </P>
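The two class examples in this paragraph behave the same way in most Perl-style regex engines. A small check in Python, whose re module is not PCRE; re.ASCII is used on the assumption that it keeps \d and \W close to PCRE's default ASCII-only meaning:

```python
import re

# [\dABCDEF]: \d adds the digits to the class alongside the letters A-F.
HEX_DIGITS = re.compile(r"[\dABCDEF]+", re.ASCII)

# [^\W_]: anything that is neither a non-word character nor underscore,
# which leaves letters and digits only.
LETTER_OR_DIGIT = re.compile(r"[^\W_]", re.ASCII)


def is_upper_hex(text):
    """True if text consists only of digits and upper case A-F."""
    return HEX_DIGITS.fullmatch(text) is not None
```

Note that the class as written contains only the upper case letters, so "1a3f" is rejected while "1A3F" is accepted.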
1050  <P>  <P>
1051  The only metacharacters that are recognized in character classes are backslash,  The only metacharacters that are recognized in character classes are backslash,
# Line 1040  this notation. For example, Line 1064  this notation. For example,
1064    [01[:alpha:]%]    [01[:alpha:]%]
1065  </pre>  </pre>
1066  matches "0", "1", any alphabetic character, or "%". The supported class names  matches "0", "1", any alphabetic character, or "%". The supported class names
1067  are  are:
1068  <pre>  <pre>
1069    alnum    letters and digits    alnum    letters and digits
1070    alpha    letters    alpha    letters
# Line 1051  are Line 1075  are
1075    graph    printing characters, excluding space    graph    printing characters, excluding space
1076    lower    lower case letters    lower    lower case letters
1077    print    printing characters, including space    print    printing characters, including space
1078    punct    printing characters, excluding letters and digits    punct    printing characters, excluding letters and digits and space
1079    space    white space (not quite the same as \s)    space    white space (not quite the same as \s)
1080    upper    upper case letters    upper    upper case letters
1081    word     "word" characters (same as \w)    word     "word" characters (same as \w)
# Line 1074  syntax [.ch.] and [=ch=] where "ch" is a Line 1098  syntax [.ch.] and [=ch=] where "ch" is a
1098  supported, and an error is given if they are encountered.  supported, and an error is given if they are encountered.
1099  </P>  </P>
1100  <P>  <P>
1101  In UTF-8 mode, characters with values greater than 128 do not match any of  By default, in UTF-8 mode, characters with values greater than 128 do not match
1102  the POSIX character classes.  any of the POSIX character classes. However, if the PCRE_UCP option is passed
1103    to <b>pcre_compile()</b>, some of the classes are changed so that Unicode
1104    character properties are used. This is achieved by replacing the POSIX classes
1105    by other sequences, as follows:
1106    <pre>
1107      [:alnum:]  becomes  \p{Xan}
1108      [:alpha:]  becomes  \p{L}
1109      [:blank:]  becomes  \h
1110      [:digit:]  becomes  \p{Nd}
1111      [:lower:]  becomes  \p{Ll}
1112      [:space:]  becomes  \p{Xps}
1113      [:upper:]  becomes  \p{Lu}
1114      [:word:]   becomes  \p{Xwd}
1115    </pre>
1116    Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX
1117    classes are unchanged, and match only characters with code points less than
1118    128.
1119  </P>  </P>
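The replacement table above can be expressed as a simple lookup. This Python sketch only mirrors the table; it is not how PCRE performs the substitution internally, and rewrite_posix_class() is a hypothetical helper. Mapping the negated [:^blank:] to \H is an assumption, since the text gives only the \h form.

```python
import re

# Copied from the table above: POSIX class name -> Unicode property name.
POSIX_TO_PROPERTY = {
    "alnum": "Xan",
    "alpha": "L",
    "digit": "Nd",
    "lower": "Ll",
    "space": "Xps",
    "upper": "Lu",
    "word": "Xwd",
}

_POSIX_ITEM = re.compile(r"\[:(\^?)([a-z]+):\]")


def rewrite_posix_class(item):
    """Return the sequence that replaces a POSIX class item under PCRE_UCP."""
    match = _POSIX_ITEM.fullmatch(item)
    if match is None:
        raise ValueError(f"not a POSIX class item: {item!r}")
    negated = match.group(1) == "^"
    name = match.group(2)
    if name == "blank":
        # Assumption: the negated form uses \H; the text only documents \h.
        return r"\H" if negated else r"\h"
    prop = POSIX_TO_PROPERTY.get(name)
    if prop is None:
        # The other POSIX classes are unchanged.
        return item
    return (r"\P{" if negated else r"\p{") + prop + "}"
```

Classes that are not in the table, such as [:punct:], are returned unchanged, matching the statement that they continue to match only code points below 128.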
1120  <br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>  <br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>
1121  <P>  <P>
# Line 1148  pattern can contain special leading sequ Line 1188  pattern can contain special leading sequ
1188  the application has set or what has been defaulted. Details are given in the  the application has set or what has been defaulted. Details are given in the
1189  section entitled  section entitled
1190  <a href="#newlineseq">"Newline sequences"</a>  <a href="#newlineseq">"Newline sequences"</a>
1191  above. There is also the (*UTF8) leading sequence that can be used to set UTF-8  above. There are also the (*UTF8) and (*UCP) leading sequences that can be used
1192  mode; this is equivalent to setting the PCRE_UTF8 option.  to set UTF-8 and Unicode property modes; they are equivalent to setting the
1193    PCRE_UTF8 and the PCRE_UCP options, respectively.
1194  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
1195  <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
1196  <P>  <P>
# Line 2589  Cambridge CB2 3QH, England. Line 2630  Cambridge CB2 3QH, England.
2630  </P>  </P>
2631  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2632  <P>  <P>
2633  Last updated: 05 May 2010  Last updated: 18 May 2010
2634  <br>  <br>
2635  Copyright &copy; 1997-2010 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
2636  <br>  <br>

