Diff of /code/trunk/doc/pcre.txt

revision 243 by ph10, Thu Sep 13 09:28:14 2007 UTC revision 247 by ph10, Mon Sep 17 09:38:32 2007 UTC
# Line 2911  NEWLINE CONVENTIONS Line 2911  NEWLINE CONVENTIONS
2911         The  newline  convention  does  not  affect what the \R escape sequence         The  newline  convention  does  not  affect what the \R escape sequence
2912         matches. By default, this is any Unicode  newline  sequence,  for  Perl         matches. By default, this is any Unicode  newline  sequence,  for  Perl
2913         compatibility.  However, this can be changed; see the description of \R         compatibility.  However, this can be changed; see the description of \R
2914         in the section entitled "Newline sequences" below.         in the section entitled "Newline sequences" below. A change of \R  set-
2915           ting can be combined with a change of newline convention.
2916    
2917    
2918  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
2919    
2920         A regular expression is a pattern that is  matched  against  a  subject         A  regular  expression  is  a pattern that is matched against a subject
2921         string  from  left  to right. Most characters stand for themselves in a         string from left to right. Most characters stand for  themselves  in  a
2922         pattern, and match the corresponding characters in the  subject.  As  a         pattern,  and  match  the corresponding characters in the subject. As a
2923         trivial example, the pattern         trivial example, the pattern
2924    
2925           The quick brown fox           The quick brown fox
2926    
2927         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2928         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2929         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2930         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2931         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2932         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2933         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2934         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2935         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2936    
2937         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2938         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2939         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2940         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2941    
2942         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2943         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2944         that are recognized within square brackets.  Outside  square  brackets,         that  are  recognized  within square brackets. Outside square brackets,
2945         the metacharacters are as follows:         the metacharacters are as follows:
2946    
2947           \      general escape character with several uses           \      general escape character with several uses
# Line 2959  CHARACTERS AND METACHARACTERS Line 2960  CHARACTERS AND METACHARACTERS
2960                  also "possessive quantifier"                  also "possessive quantifier"
2961           {      start min/max quantifier           {      start min/max quantifier
2962    
2963         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2964         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2965    
2966           \      general escape character           \      general escape character
# Line 2969  CHARACTERS AND METACHARACTERS Line 2970  CHARACTERS AND METACHARACTERS
2970                    syntax)                    syntax)
2971           ]      terminates the character class           ]      terminates the character class
2972    
2973         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2974    
2975    
2976  BACKSLASH  BACKSLASH
2977    
2978         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2979         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2980         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2981         applies both inside and outside character classes.         applies both inside and outside character classes.
2982    
2983         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2984         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2985         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2986         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2987         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2988         slash, you write \\.         slash, you write \\.
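
The escaping rule above can be sketched as a small helper. This is an illustration written in Python, not part of PCRE; the name backslash_escape is hypothetical, and treating only ASCII letters and digits as "alphanumeric" is an assumption made for the sketch. The check at the end uses Python's own re module, which happens to accept the same kind of escape for non-alphanumeric characters.

```python
import re


def backslash_escape(literal: str) -> str:
    """Precede every non-alphanumeric character with a backslash.

    Assumption: only ASCII letters and digits count as alphanumeric here.
    """
    out = []
    for ch in literal:
        if ch.isascii() and ch.isalnum():
            out.append(ch)          # letters and digits stand for themselves
        else:
            out.append("\\" + ch)   # everything else is escaped
    return "".join(out)


# A literal * needs \*, and a literal backslash needs \\ in the pattern.
star = backslash_escape("*")
slash = backslash_escape("\\")
matched = re.fullmatch(backslash_escape("1+1=2? (yes)"), "1+1=2? (yes)")
```

Escaping more characters than strictly necessary is harmless under this rule, which is why the helper does not need a list of metacharacters.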
2989    
2990         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2991         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2992         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
2993         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
2994         part of the pattern.         part of the pattern.
2995    
2996         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2997         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2998         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2999         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
3000         tion. Note the following examples:         tion. Note the following examples:
3001    
3002           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 3005  BACKSLASH Line 3006  BACKSLASH
3006           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
3007           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3008    
3009         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
3010         classes.         classes.
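
As a rough illustration of the PCRE column in the table above, the following Python sketch rewrites \Q...\E spans into individually escaped characters. Python's re module does not support \Q...\E, so expand_quoted is a hypothetical helper, and re.escape is used only as a stand-in for "treat this text literally". Two simplifications are assumptions of the sketch: an unterminated \Q is taken to run to the end of the pattern, and a \Q that is itself preceded by an escaping backslash is not detected.

```python
import re


def expand_quoted(pattern: str) -> str:
    """Replace each \\Q...\\E span with its contents escaped literally."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])            # no more quoted spans
            break
        out.append(pattern[pos:start])           # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: a missing \E quotes everything to the end.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)


# The two table rows visible above, checked with Python's re module.
row1 = re.fullmatch(expand_quoted(r"\Qabc\$xyz\E"), r"abc\$xyz")
row2 = re.fullmatch(expand_quoted(r"\Qabc\E\$\Qxyz\E"), "abc$xyz")
```

Because $ and @ are never interpolated in this sketch, it models the PCRE behaviour only, not the Perl behaviour described alongside it.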
3011    
3012     Non-printing characters     Non-printing characters
3013    
3014         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
3015         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
3016         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
3017         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
3018         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
3019         sequences than the binary character it represents:         sequences than the binary character it represents:
3020    
3021           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 3028  BACKSLASH Line 3029  BACKSLASH
3029           \xhh      character with hex code hh           \xhh      character with hex code hh
3030           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
3031    
3032         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
3033         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
3034         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
3035         becomes hex 7B.         becomes hex 7B.
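
The \cx rule is simple enough to check by hand. The Python sketch below follows the two steps as stated (upper-case a lower case letter, then invert hex 40); the function name control_char is hypothetical, and rejecting non-ASCII input is an assumption of the sketch rather than something the text specifies.

```python
def control_char(x: str) -> int:
    """Return the character code produced by \\cx under the rule above."""
    if len(x) != 1 or ord(x) > 127:
        # Assumption: only single ASCII characters are considered.
        raise ValueError("expected a single ASCII character")
    if "a" <= x <= "z":
        x = x.upper()        # lower case letters are upper-cased first
    return ord(x) ^ 0x40     # then bit 6 (hex 40) is inverted


codes = [control_char(c) for c in "z{;"]
```

The three inputs are the ones used in the paragraph above, so the resulting codes can be compared directly with the values it gives (hex 1A, 3B, and 7B).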
3036    
3037         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
3038         in  upper  or  lower case). Any number of hexadecimal digits may appear         in upper or lower case). Any number of hexadecimal  digits  may  appear
3039         between \x{ and }, but the value of the character  code  must  be  less         between  \x{  and  },  but the value of the character code must be less
3040         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3041         the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
3042         than the largest Unicode code point, which is 10FFFF.         than the largest Unicode code point, which is 10FFFF.
3043    
3044         If  characters  other than hexadecimal digits appear between \x{ and },         If characters other than hexadecimal digits appear between \x{  and  },
3045         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
3046         Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal         Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal
3047         escape, with no following digits, giving a  character  whose  value  is         escape,  with  no  following  digits, giving a character whose value is
3048         zero.         zero.
3049    
3050         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3051         two syntaxes for \x. There is no difference in the way  they  are  han-         two  syntaxes  for  \x. There is no difference in the way they are han-
3052         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
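
The two \x forms and the fallback between them can be sketched as follows. This is an illustrative Python model of the rules in the preceding paragraphs, not PCRE's parser; parse_x_escape is a hypothetical name. The text does not say what happens for empty braces, so the sketch arbitrarily falls back to the basic form in that case, and it reports an out-of-range value with ValueError, which is also an assumption.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_x_escape(rest: str, utf8: bool = False):
    """Interpret the pattern text that immediately follows '\\x'.

    Returns (character value, number of characters consumed from rest).
    """
    if rest.startswith("{"):
        close = rest.find("}")
        body = rest[1:close] if close != -1 else ""
        # Recognized only if terminated by } and made up of hex digits.
        # Assumption: empty braces are treated as "not recognized".
        if close != -1 and body and all(ch in HEX_DIGITS for ch in body):
            value = int(body, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                raise ValueError("character value too large for this mode")
            return value, close + 1
    # Basic form: from zero to two hex digits; no digits gives value zero.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    value = int(rest[:n], 16) if n else 0
    return value, n
```

For instance, parse_x_escape("dc") and parse_x_escape("{dc}") yield the same character value and differ only in how many characters they consume, while parse_x_escape("{zz}") consumes nothing and yields zero.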
3053    
3054         After  \0  up  to two further octal digits are read. If there are fewer         After \0 up to two further octal digits are read. If  there  are  fewer
3055         than two digits, just  those  that  are  present  are  used.  Thus  the         than  two  digits,  just  those  that  are  present  are used. Thus the
3056         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
3057         (code value 7). Make sure you supply two digits after the initial  zero         (code  value 7). Make sure you supply two digits after the initial zero
3058         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
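
The advice about supplying two digits can be seen in a small Python model of the \0 rule. parse_zero_escape is a hypothetical helper that looks only at the text following \0; it is a sketch of the rule as stated, not PCRE code.

```python
OCTAL_DIGITS = "01234567"


def parse_zero_escape(rest: str):
    """Interpret the pattern text that immediately follows '\\0'.

    Returns (character value, number of further octal digits consumed).
    """
    n = 0
    # Up to two further octal digits are read; fewer is allowed.
    while n < 2 and n < len(rest) and rest[n] in OCTAL_DIGITS:
        n += 1
    value = int(rest[:n], 8) if n else 0
    return value, n


no_digits = parse_zero_escape("\\x")   # \0 followed by a backslash
one_digit = parse_zero_escape("7")     # \07
greedy = parse_zero_escape("123")      # \0123: "12" is consumed, "3" is not
```

In the last case the escape absorbs the digits "12", so a pattern author who intended a binary zero followed by the literal text "123" would need to write \000123 instead.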
3059    
3060         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
3061         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
3062         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
3063         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
3064         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
3065         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
3066         of parenthesized subpatterns.         of parenthesized subpatterns.
3067    
3068         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
3069         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
3070         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
3071         erate a data character. Any subsequent digits stand for themselves.  In         erate  a data character. Any subsequent digits stand for themselves. In
3072         non-UTF-8  mode,  the  value  of a character specified in octal must be         non-UTF-8 mode, the value of a character specified  in  octal  must  be
3073         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
3074         example:         example:
3075    
3076           \040   is another way of writing a space           \040   is another way of writing a space
# Line 3087  BACKSLASH Line 3088  BACKSLASH
3088           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
3089                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
3090    
3091         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
3092         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
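
The decision between a back reference and an octal character can be modelled as below. This Python sketch follows the two paragraphs above literally; interpret_backslash_digits and its parameters are hypothetical names, and signalling an out-of-range octal value with ValueError is an assumption of the sketch.

```python
DECIMAL_DIGITS = "0123456789"
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, captures_so_far, in_class=False, utf8=False):
    """Interpret a backslash followed by digits, the first of which is 1-9.

    Returns ("backref", number, "") or ("char", value, literal_digits).
    """
    if not digits or digits[0] == "0" or any(d not in DECIMAL_DIGITS for d in digits):
        raise ValueError("expected decimal digits starting with 1-9")
    number = int(digits)
    # Outside a class: a number below 10, or one not exceeding the count of
    # previous capturing parentheses, is a back reference.
    if not in_class and (number < 10 or captures_so_far >= number):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    value = int(digits[:n], 8) if n else 0
    if not utf8 and value >= 0o400:
        raise ValueError("octal value must be less than \\400 in non-UTF-8 mode")
    # Any remaining digits stand for themselves.
    return ("char", value, digits[n:])
```

With no capturing groups, interpret_backslash_digits("81", 0) gives a binary zero followed by the literal digits "81", which is the second reading of \81 given above; with 81 groups already open, the same call gives a back reference.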
3093    
3094         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
3095         inside  and  outside character classes. In addition, inside a character         inside and outside character classes. In addition, inside  a  character
3096         class, the sequence \b is interpreted as the backspace  character  (hex         class,  the  sequence \b is interpreted as the backspace character (hex
3097         08),  and the sequences \R and \X are interpreted as the characters "R"         08), and the sequences \R and \X are interpreted as the characters  "R"
3098         and "X", respectively. Outside a character class, these sequences  have         and  "X", respectively. Outside a character class, these sequences have
3099         different meanings (see below).         different meanings (see below).
3100    
3101     Absolute and relative back references     Absolute and relative back references
3102    
3103         The  sequence  \g followed by an unsigned or a negative number, option-         The sequence \g followed by an unsigned or a negative  number,  option-
3104         ally enclosed in braces, is an absolute or relative back  reference.  A         ally  enclosed  in braces, is an absolute or relative back reference. A
3105         named back reference can be coded as \g{name}. Back references are dis-         named back reference can be coded as \g{name}. Back references are dis-
3106         cussed later, following the discussion of parenthesized subpatterns.         cussed later, following the discussion of parenthesized subpatterns.
3107    
# Line 3121  BACKSLASH Line 3122  BACKSLASH
3122           \W     any "non-word" character           \W     any "non-word" character
3123    
3124         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
3125         into two disjoint sets. Any given character matches one, and only  one,         into  two disjoint sets. Any given character matches one, and only one,
3126         of each pair.         of each pair.
3127    
3128         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
3129         acter classes. They each match one character of the  appropriate  type.         acter  classes.  They each match one character of the appropriate type.
3130         If  the current matching point is at the end of the subject string, all         If the current matching point is at the end of the subject string,  all
3131         of them fail, since there is no character to match.         of them fail, since there is no character to match.
3132    
3133         For compatibility with Perl, \s does not match the VT  character  (code         For  compatibility  with Perl, \s does not match the VT character (code
3134         11).   This makes it different from the POSIX "space" class. The \s         11).  This makes it different from the POSIX "space" class. The  \s
3135         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If         characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
3136         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
3137         ter. In PCRE, it never does.         ter. In PCRE, it never does.
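
The difference from the POSIX "space" class comes down to one code point, which a few lines of Python can confirm. The set PCRE_BACKSLASH_S is copied from the list above; using Python's string.whitespace as the POSIX "space" set for the C locale is an assumption of the sketch (it contains the same six ASCII characters).

```python
import string

# HT, LF, FF, CR and space, as listed in the text above.
PCRE_BACKSLASH_S = {9, 10, 12, 13, 32}

# Assumption: string.whitespace stands in for POSIX "space" in the C locale.
POSIX_SPACE_C_LOCALE = {ord(c) for c in string.whitespace}

# Code points matched by POSIX "space" but not by \s in PCRE.
only_in_posix = POSIX_SPACE_C_LOCALE - PCRE_BACKSLASH_S
```

The only element of only_in_posix is 11, the VT character.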
3138    
3139         In UTF-8 mode, characters with values greater than 128 never match  \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
3140         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
3141         code character property support is available.  These  sequences  retain         code  character  property  support is available. These sequences retain
3142         their original meanings from before UTF-8 support was available, mainly         their original meanings from before UTF-8 support was available, mainly
3143         for efficiency reasons.         for efficiency reasons.
3144    
3145         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
3146         the  other  sequences, these do match certain high-valued codepoints in         the other sequences, these do match certain high-valued  codepoints  in
3147         UTF-8 mode.  The horizontal space characters are:         UTF-8 mode.  The horizontal space characters are:
3148    
3149           U+0009     Horizontal tab           U+0009     Horizontal tab
# Line 3176  BACKSLASH Line 3177  BACKSLASH
3177           U+2029     Paragraph separator           U+2029     Paragraph separator
3178    
3179         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
3180         is  a  letter  or  digit.  The definition of letters and digits is con-         is a letter or digit. The definition of  letters  and  digits  is  con-
3181         trolled by PCRE's low-valued character tables, and may vary if  locale-         trolled  by PCRE's low-valued character tables, and may vary if locale-
3182         specific  matching is taking place (see "Locale support" in the pcreapi         specific matching is taking place (see "Locale support" in the  pcreapi
3183         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
3184         systems,  or "french" in Windows, some character codes greater than 128         systems, or "french" in Windows, some character codes greater than  128
3185         are used for accented letters, and these are matched by \w. The use  of         are  used for accented letters, and these are matched by \w. The use of
3186         locales with Unicode is discouraged.         locales with Unicode is discouraged.
3187    
3188     Newline sequences     Newline sequences
3189    
3190         Outside  a  character class, by default, the escape sequence \R matches         Outside a character class, by default, the escape sequence  \R  matches
3191         any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8         any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
3192         mode \R is equivalent to the following:         mode \R is equivalent to the following:
3193    
3194           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3195    
3196         This  is  an  example  of an "atomic group", details of which are given         This is an example of an "atomic group", details  of  which  are  given
3197         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3198         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
3199         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3200         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3201         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
3202    
3203         In UTF-8 mode, two additional characters whose codepoints  are  greater         In  UTF-8  mode, two additional characters whose codepoints are greater
3204         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3205         rator, U+2029).  Unicode character property support is not  needed  for         rator,  U+2029).   Unicode character property support is not needed for
3206         these characters to be recognized.         these characters to be recognized.
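
To make the two character sets concrete, the sketch below splits text on the newline sequences listed above, using Python's re module as a stand-in. It is not PCRE and does not use \R; the patterns R_NON_UTF8 and R_UTF8 and the helper split_lines are hypothetical names. The atomic group is not reproduced: listing \r\n first in the alternation is enough to keep CRLF together for a simple split, but it does not give the no-backtracking guarantee described above.

```python
import re

# Python `re` approximations of what \R matches by default.
R_NON_UTF8 = r"\r\n|[\n\x0b\x0c\r\x85]"
R_UTF8 = r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]"   # adds LS and PS


def split_lines(text: str, utf8: bool = False):
    """Split text at each newline sequence in the chosen set."""
    return re.split(R_UTF8 if utf8 else R_NON_UTF8, text)


basic = split_lines("a\r\nb\x85c\x0bd")
without_ls = split_lines("a\u2028b")
with_ls = split_lines("a\u2028b\u2029c", utf8=True)
```

CRLF produces a single split point rather than two, and U+2028 and U+2029 act as line breaks only when the UTF-8 set is selected.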
3207    
3208         It is possible to restrict \R to match only CR, LF, or CRLF (instead of         It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3209         the complete set  of  Unicode  line  endings)  by  setting  the  option         the  complete  set  of  Unicode  line  endings)  by  setting the option
3210         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3211         This can be made the default when PCRE is built; if this is  the  case,         (BSR is an abbreviation for "backslash R".) This can be made the default
3212         the  other  behaviour can be requested via the PCRE_BSR_UNICODE option.         when PCRE is built; if this is the case, the  other  behaviour  can  be
3213         It is also possible to specify these settings  by  starting  a  pattern         requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to
3214         string with one of the following sequences:         specify these settings by starting a pattern string  with  one  of  the
3215           following sequences:
3216    
3217           (*BSR_ANYCRLF)   CR, LF, or CRLF only           (*BSR_ANYCRLF)   CR, LF, or CRLF only
3218           (*BSR_UNICODE)   any Unicode newline sequence           (*BSR_UNICODE)   any Unicode newline sequence
# Line 3219  BACKSLASH Line 3221  BACKSLASH
3221         they can be overridden by options given to pcre_exec(). Note that these         they can be overridden by options given to pcre_exec(). Note that these
3222         special settings, which are not Perl-compatible, are recognized only at         special settings, which are not Perl-compatible, are recognized only at
3223         the very start of a pattern, and that they must be in  upper  case.  If         the very start of a pattern, and that they must be in  upper  case.  If
3224         more than one of them is present, the last one is used.         more  than  one  of  them is present, the last one is used. They can be
3225           combined with a change of newline convention, for  example,  a  pattern
3226           can start with:
3227    
3228             (*ANY)(*BSR_ANYCRLF)
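
The rules for these leading settings (start of pattern only, upper case only, last one wins, combinable with a newline convention) can be sketched as a small scanner. This is a Python illustration, not PCRE's implementation; leading_settings is a hypothetical name. Only (*ANY) appears in this excerpt; the other newline names in NEWLINE_SETTINGS are taken from the NEWLINE CONVENTIONS section of the same manual and should be read as an assumption here.

```python
BSR_SETTINGS = {"BSR_ANYCRLF": "anycrlf", "BSR_UNICODE": "unicode"}
# Assumption: newline-convention names other than ANY come from elsewhere
# in the manual, not from the text shown above.
NEWLINE_SETTINGS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}


def leading_settings(pattern: str):
    """Return (bsr_setting, newline_setting, rest_of_pattern)."""
    bsr = None
    newline = None
    pos = 0
    # Settings are recognized only at the very start of the pattern.
    while pattern.startswith("(*", pos):
        close = pattern.find(")", pos)
        if close == -1:
            break
        name = pattern[pos + 2:close]
        if name in BSR_SETTINGS:          # names must be in upper case
            bsr = BSR_SETTINGS[name]      # a later setting replaces an earlier one
        elif name in NEWLINE_SETTINGS:
            newline = name
        else:
            break
        pos = close + 1
    return bsr, newline, pattern[pos:]


combined = leading_settings("(*ANY)(*BSR_ANYCRLF)a\\Rb")
last_wins = leading_settings("(*BSR_ANYCRLF)(*BSR_UNICODE)x")
```

A lower case name such as (*bsr_anycrlf), or a setting that appears after other pattern text, is left in the pattern untouched by this sketch.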
3229    
3230         Inside a character class, \R matches the letter "R".         Inside a character class, \R matches the letter "R".
3231    
# Line 4850  AUTHOR Line 4856  AUTHOR
4856    
4857  REVISION  REVISION
4858    
4859         Last updated: 11 September 2007         Last updated: 14 September 2007
4860         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4861  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4862    
