Diff of /code/trunk/doc/pcre.txt

revision 172 by ph10, Tue Jun 5 10:40:13 2007 UTC revision 202 by ph10, Fri Aug 3 09:44:26 2007 UTC
# Line 114  LIMITATIONS Line 114  LIMITATIONS
114         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
115         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
116    
117           If  a  non-capturing subpattern with an unlimited repetition quantifier
118           can match an empty string, there is a limit of 1000 on  the  number  of
119           times  it  can  be  repeated while not matching an empty string - if it
120           does match an empty string, the loop is immediately broken.
121    
122         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
123         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
124    
125         The maximum length of a subject string is the largest  positive  number         The  maximum  length of a subject string is the largest positive number
126         that  an integer variable can hold. However, when using the traditional         that an integer variable can hold. However, when using the  traditional
127         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
128         inite  repetition.  This means that the available stack space may limit         inite repetition.  This means that the available stack space may  limit
129         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
130         For a discussion of stack issues, see the pcrestack documentation.         For a discussion of stack issues, see the pcrestack documentation.
131    
132    
133  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
134    
135         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
136         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
137         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
138         port for Unicode general category properties was added.         port for Unicode general category properties was added.
139    
140         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
141         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
143         any  subject  strings  that are matched against it are treated as UTF-8         any subject strings that are matched against it are  treated  as  UTF-8
144         strings instead of just strings of bytes.         strings instead of just strings of bytes.
145    
146         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
147         the  library will be a bit bigger, but the additional run time overhead         the library will be a bit bigger, but the additional run time  overhead
148         is limited to testing the PCRE_UTF8 flag occasionally, so should not be         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
149         very big.         very big.
150    
151         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
152         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
153         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
154         general category properties such as Lu for an upper case letter  or  Nd         general  category  properties such as Lu for an upper case letter or Nd
155         for  a  decimal number, the Unicode script names such as Arabic or Han,         for a decimal number, the Unicode script names such as Arabic  or  Han,
156         and the derived properties Any and L&. A full  list  is  given  in  the         and  the  derived  properties  Any  and L&. A full list is given in the
157         pcrepattern documentation. Only the short names for properties are sup-         pcrepattern documentation. Only the short names for properties are sup-
158         ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-         ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
159         ter},  is  not  supported.   Furthermore,  in Perl, many properties may         ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
160         optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE         optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
161         does not support this.         does not support this.
162    
163         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
164    
165         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
166         subjects are checked for validity on entry to the  relevant  functions.         subjects  are  checked for validity on entry to the relevant functions.
167         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
168         situations, you may already know  that  your  strings  are  valid,  and         situations,  you  may  already  know  that  your strings are valid, and
169         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
170         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
171         PCRE  assumes  that  the  pattern or subject it is given (respectively)         PCRE assumes that the pattern or subject  it  is  given  (respectively)
172         contains only valid UTF-8 codes. In this case, it does not diagnose  an         contains  only valid UTF-8 codes. In this case, it does not diagnose an
173         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
174         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
175         crash.         crash.
176    
177         2.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
178         two-byte UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
179    
180         3. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
181         characters for values greater than \177.         characters for values greater than \177.
182    
183         4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
184         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
185    
186         5. The dot metacharacter matches one UTF-8 character instead of a  sin-         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
187         gle byte.         gle byte.
188    
189         6.  The  escape sequence \C can be used to match a single byte in UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
190         mode, but its use can lead to some strange effects.  This  facility  is         mode,  but  its  use can lead to some strange effects. This facility is
191         not available in the alternative matching function, pcre_dfa_exec().         not available in the alternative matching function, pcre_dfa_exec().
192    
193         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
194         test characters of any code value, but the characters that PCRE  recog-         test  characters of any code value, but the characters that PCRE recog-
195         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes as digits, spaces, or word characters  remain  the  same  set  as
196         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
197         includes  Unicode  property support, because to do otherwise would slow         includes Unicode property support, because to do otherwise  would  slow
198         down PCRE in many common cases. If you really want to test for a  wider         down  PCRE in many common cases. If you really want to test for a wider
199         sense  of,  say,  "digit",  you must use Unicode property tests such as         sense of, say, "digit", you must use Unicode  property  tests  such  as
200         \p{Nd}.         \p{Nd}.
201    
202         8. Similarly, characters that match the POSIX named  character  classes         8.  Similarly,  characters that match the POSIX named character classes
203         are all low-valued characters.         are all low-valued characters.
204    
205         9.  Case-insensitive  matching  applies only to characters whose values         9. However, the Perl 5.10 horizontal and vertical  whitespace  matching
206         are less than 128, unless PCRE is built with Unicode property  support.         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
207         Even  when  Unicode  property support is available, PCRE still uses its         acters.
208         own character tables when checking the case of  low-valued  characters,  
209         so  as not to degrade performance.  The Unicode property information is         10. Case-insensitive matching applies only to characters  whose  values
210           are  less than 128, unless PCRE is built with Unicode property support.
211           Even when Unicode property support is available, PCRE  still  uses  its
212           own  character  tables when checking the case of low-valued characters,
213           so as not to degrade performance.  The Unicode property information  is
214         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Even when Unicode property
215         support is available, PCRE supports case-insensitive matching only when         support is available, PCRE supports case-insensitive matching only when
216         there is a one-to-one mapping between a letter's  cases.  There  are  a         there  is  a  one-to-one  mapping between a letter's cases. There are a
217         small  number  of  many-to-one  mappings in Unicode; these are not sup-         small number of many-to-one mappings in Unicode;  these  are  not  sup-
218         ported by PCRE.         ported by PCRE.
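
Points 2 and 3 above say that \xb3 and octal escapes above \177 match two-byte characters in UTF-8 mode. The following is a minimal sketch of the standard UTF-8 encoding rule that explains why; it is not PCRE code, and the function name utf8_encode is an assumption made up for this illustration. It does not reject surrogate code points.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode one code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is above
   U+10FFFF. Surrogates are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                 /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {             /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this rule, the value 0xb3 becomes the two bytes C2 B3, and the largest octal escape, \777 (511), still fits in two bytes.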
219    
220    
# Line 215  AUTHOR Line 224  AUTHOR
224         University Computing Service         University Computing Service
225         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
226    
227         Putting an actual email address here seems to have been a spam  magnet,         Putting  an actual email address here seems to have been a spam magnet,
228         so  I've  taken  it away. If you want to email me, use my two initials,         so I've taken it away. If you want to email me, use  my  two  initials,
229         followed by the two digits 10, at the domain cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
230    
231    
232  REVISION  REVISION
233    
234         Last updated: 18 April 2007         Last updated: 30 July 2007
235         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
236  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
237    
# Line 390  AVOIDING EXCESSIVE STACK USAGE Line 399  AVOIDING EXCESSIVE STACK USAGE
399    
400         to  the  configure  command. With this configuration, PCRE will use the         to  the  configure  command. With this configuration, PCRE will use the
401         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
402         ment  functions.  Separate  functions are provided because the usage is         ment  functions. By default these point to malloc() and free(), but you
403         very predictable: the block sizes requested are always  the  same,  and         can replace the pointers so that your own functions are used.
404         the  blocks  are always freed in reverse order. A calling program might  
405         be able to implement optimized functions that perform better  than  the         Separate functions are  provided  rather  than  using  pcre_malloc  and
406         standard  malloc()  and  free()  functions.  PCRE  runs noticeably more         pcre_free  because  the  usage  is  very  predictable:  the block sizes
407         slowly when built in this way. This option affects only the pcre_exec()         requested are always the same, and  the  blocks  are  always  freed  in
408         function; it is not relevant for the pcre_dfa_exec() function.         reverse  order.  A calling program might be able to implement optimized
409           functions that perform better  than  malloc()  and  free().  PCRE  runs
410           noticeably more slowly when built in this way. This option affects only
411           the  pcre_exec()  function;  it  is  not  relevant  for  the
412           pcre_dfa_exec() function.
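
The predictable usage described above (equal-sized blocks, freed in reverse order) suits a simple stack-like pool. The following is a minimal sketch under stated assumptions, not PCRE's own code: the names lifo_stack_malloc and lifo_stack_free, and the SLOT_SIZE and NSLOTS values, are hypothetical and would need tuning for a real build.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE 512   /* hypothetical upper bound on the block size */
#define NSLOTS    128   /* hypothetical number of preallocated blocks */

/* The union forces each slot to be suitably aligned for common types. */
typedef union slot {
    unsigned char bytes[SLOT_SIZE];
    long double   align_ld;
    long long     align_ll;
    void         *align_p;
} slot;

static slot   arena[NSLOTS];
static size_t arena_top = 0;    /* index of the next free slot */

/* True if p points into the static arena. */
static int in_arena(const void *p)
{
    uintptr_t a  = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)&arena[0];
    uintptr_t hi = (uintptr_t)&arena[NSLOTS];
    return a >= lo && a < hi;
}

/* Hand out the next slot; fall back to malloc() when the request is
   too large or the arena is exhausted. */
void *lifo_stack_malloc(size_t size)
{
    if (size <= SLOT_SIZE && arena_top < NSLOTS)
        return &arena[arena_top++];
    return malloc(size);
}

/* Release a block. Arena blocks are reclaimed only when they are freed
   in reverse order of allocation; an out-of-order free of an arena
   block is ignored in this sketch. */
void lifo_stack_free(void *block)
{
    if (block == NULL)
        return;
    if (in_arena(block)) {
        if (arena_top > 0 && block == (void *)&arena[arena_top - 1])
            arena_top--;
        return;
    }
    free(block);
}
```

A calling program would then point pcre_stack_malloc and pcre_stack_free at these two functions before calling pcre_exec(). The sketch is not thread-safe; a multi-threaded program would need one arena per thread or a lock.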
413    
414    
415  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
# Line 451  USING EBCDIC CODE Line 464  USING EBCDIC CODE
464    
465         PCRE  assumes  by  default that it will run in an environment where the         PCRE  assumes  by  default that it will run in an environment where the
466         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
467         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by         This  is  the  case for most computer operating systems. PCRE can, how-
468         adding         ever, be compiled to run in an EBCDIC environment by adding
469    
470           --enable-ebcdic           --enable-ebcdic
471    
472         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
473         bles.         bles.  You  should  only  use  it if you know that you are in an EBCDIC
474           environment (for example, an IBM mainframe operating system).
475    
476    
477  SEE ALSO  SEE ALSO
# Line 474  AUTHOR Line 488  AUTHOR
488    
489  REVISION  REVISION
490    
491         Last updated: 16 April 2007         Last updated: 30 July 2007
492         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
493  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
494    
# Line 1259  COMPILATION ERROR CODES Line 1273  COMPILATION ERROR CODES
1273           26  malformed number or name after (?(           26  malformed number or name after (?(
1274           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1275           28  assertion expected after (?(           28  assertion expected after (?(
1276           29  (?R or (?digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
1277           30  unknown POSIX class name           30  unknown POSIX class name
1278           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
1279           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is not compiled with PCRE_UTF8 support
# Line 1280  COMPILATION ERROR CODES Line 1294  COMPILATION ERROR CODES
1294           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1295           48  subpattern name is too long (maximum 32 characters)           48  subpattern name is too long (maximum 32 characters)
1296           49  too many named subpatterns (maximum 10,000)           49  too many named subpatterns (maximum 10,000)
1297           50  repeated subpattern is too long           50  [this code is not in use]
1298           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 (not in UTF-8 mode)
1299           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
1300           53   internal  error:  previously-checked  referenced  subpattern not           53   internal  error:  previously-checked  referenced  subpattern not
# Line 1288  COMPILATION ERROR CODES Line 1302  COMPILATION ERROR CODES
1302           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
1303           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
1304           56  inconsistent NEWLINE options           56  inconsistent NEWLINE options
1305             57  \g is not followed by a braced name or an optionally braced
1306                   non-zero number
1307             58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1308    
1309    
1310  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1480  INFORMATION ABOUT A PATTERN Line 1497  INFORMATION ABOUT A PATTERN
1497    
1498         Return  1  if the (?J) option setting is used in the pattern, otherwise         Return  1  if the (?J) option setting is used in the pattern, otherwise
1499         0. The fourth argument should point to an int variable. The (?J) inter-         0. The fourth argument should point to an int variable. The (?J) inter-
1500         nal option setting changes the local PCRE_DUPNAMES value.         nal option setting changes the local PCRE_DUPNAMES option.
1501    
1502           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1503    
# Line 1548  INFORMATION ABOUT A PATTERN Line 1565  INFORMATION ABOUT A PATTERN
1565         Return a copy of the options with which the pattern was  compiled.  The         Return a copy of the options with which the pattern was  compiled.  The
1566         fourth  argument  should  point to an unsigned long int variable. These         fourth  argument  should  point to an unsigned long int variable. These
1567         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1568         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1569           other words, they are the options that will be in force  when  matching
1570           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1571           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1572           and PCRE_EXTENDED.
1573    
1574         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1575         alternatives begin with one of the following:         alternatives begin with one of the following:
# Line 2039  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2060  MATCHING A PATTERN: THE TRADITIONAL FUNC
2060         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2061         description above.         description above.
2062    
          PCRE_ERROR_NULLWSLIMIT    (-22)  
   
        When a group that can match an empty  substring  is  repeated  with  an  
        unbounded  upper  limit, the subject position at the start of the group  
        must be remembered, so that a test for an empty string can be made when  
        the  end  of the group is reached. Some workspace is required for this;  
        if it runs out, this error is given.  
   
2063           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
2064    
2065         An invalid combination of PCRE_NEWLINE_xxx options was given.         An invalid combination of PCRE_NEWLINE_xxx options was given.
2066    
2067         Error numbers -16 to -20 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2068    
2069    
2070  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 2406  AUTHOR Line 2419  AUTHOR
2419    
2420  REVISION  REVISION
2421    
2422         Last updated: 04 June 2007         Last updated: 30 July 2007
2423         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2424  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2425    
# Line 2593  DIFFERENCES BETWEEN PCRE AND PERL Line 2606  DIFFERENCES BETWEEN PCRE AND PERL
2606    
2607         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2608         handle regular expressions. The differences described here  are  mainly         handle regular expressions. The differences described here  are  mainly
2609         with  respect  to  Perl 5.8, though PCRE version 7.0 contains some fea-         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2610         tures that are expected to be in the forthcoming Perl 5.10.         some features that are expected to be in the forthcoming Perl 5.10.
2611    
2612         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2613         of  what  it does have are given in the section on UTF-8 support in the         of  what  it does have are given in the section on UTF-8 support in the
# Line 2672  DIFFERENCES BETWEEN PCRE AND PERL Line 2685  DIFFERENCES BETWEEN PCRE AND PERL
2685         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2686    
2687         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2688         cial  meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash is         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2689         ignored. (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
2690    
2691         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2692         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
# Line 2705  AUTHOR Line 2718  AUTHOR
2718    
2719  REVISION  REVISION
2720    
2721         Last updated: 06 March 2007         Last updated: 13 June 2007
2722         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2723  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2724    
# Line 2938  BACKSLASH Line 2951  BACKSLASH
2951    
2952           \d     any decimal digit           \d     any decimal digit
2953           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
2954             \h     any horizontal whitespace character
2955             \H     any character that is not a horizontal whitespace character
2956           \s     any whitespace character           \s     any whitespace character
2957           \S     any character that is not a whitespace character           \S     any character that is not a whitespace character
2958             \v     any vertical whitespace character
2959             \V     any character that is not a vertical whitespace character
2960           \w     any "word" character           \w     any "word" character
2961           \W     any "non-word" character           \W     any "non-word" character
2962    
# Line 2954  BACKSLASH Line 2971  BACKSLASH
2971    
2972         For compatibility with Perl, \s does not match the VT  character  (code         For compatibility with Perl, \s does not match the VT  character  (code
2973         11).   This makes it different from the POSIX "space" class. The \s         11).   This makes it different from the POSIX "space" class. The \s
2974         characters are HT (9), LF (10), FF (12), CR (13), and space  (32).  (If         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
2975         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
2976         ter. In PCRE, it never does.)         ter. In PCRE, it never does.
2977    
2978           In UTF-8 mode, characters with values greater than 128 never match  \d,
2979           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2980           code character property support is available.  These  sequences  retain
2981           their original meanings from before UTF-8 support was available, mainly
2982           for efficiency reasons.
2983    
2984           The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
2985           the  other  sequences, these do match certain high-valued codepoints in
2986           UTF-8 mode.  The horizontal space characters are:
2987    
2988             U+0009     Horizontal tab
2989             U+0020     Space
2990             U+00A0     Non-break space
2991             U+1680     Ogham space mark
2992             U+180E     Mongolian vowel separator
2993             U+2000     En quad
2994             U+2001     Em quad
2995             U+2002     En space
2996             U+2003     Em space
2997             U+2004     Three-per-em space
2998             U+2005     Four-per-em space
2999             U+2006     Six-per-em space
3000             U+2007     Figure space
3001             U+2008     Punctuation space
3002             U+2009     Thin space
3003             U+200A     Hair space
3004             U+202F     Narrow no-break space
3005             U+205F     Medium mathematical space
3006             U+3000     Ideographic space
3007    
3008           The vertical space characters are:
3009    
3010             U+000A     Linefeed
3011             U+000B     Vertical tab
3012             U+000C     Formfeed
3013             U+000D     Carriage return
3014             U+0085     Next line
3015             U+2028     Line separator
3016             U+2029     Paragraph separator
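
The two tables above can be expressed directly as membership tests. This is a minimal sketch, not PCRE's internal implementation; the function names is_hspace and is_vspace are assumptions chosen for this illustration, and the code points are copied from the lists above.

```c
/* Hypothetical helper: nonzero if cp is one of the horizontal space
   characters listed for \h. */
int is_hspace(unsigned long cp)
{
    switch (cp) {
    case 0x0009:    /* Horizontal tab */
    case 0x0020:    /* Space */
    case 0x00A0:    /* Non-break space */
    case 0x1680:    /* Ogham space mark */
    case 0x180E:    /* Mongolian vowel separator */
    case 0x202F:    /* Narrow no-break space */
    case 0x205F:    /* Medium mathematical space */
    case 0x3000:    /* Ideographic space */
        return 1;
    default:
        /* U+2000 (En quad) through U+200A (Hair space) are contiguous */
        return cp >= 0x2000 && cp <= 0x200A;
    }
}

/* Hypothetical helper: nonzero if cp is one of the vertical space
   characters listed for \v. */
int is_vspace(unsigned long cp)
{
    switch (cp) {
    case 0x000A:    /* Linefeed */
    case 0x000B:    /* Vertical tab */
    case 0x000C:    /* Formfeed */
    case 0x000D:    /* Carriage return */
    case 0x0085:    /* Next line */
    case 0x2028:    /* Line separator */
    case 0x2029:    /* Paragraph separator */
        return 1;
    default:
        return 0;
    }
}
```

In these terms, \H and \V correspond to the negations !is_hspace(cp) and !is_vspace(cp).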
3017    
3018         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
3019         is  a  letter  or  digit.  The definition of letters and digits is con-         is  a  letter  or  digit.  The definition of letters and digits is con-
# Line 2964  BACKSLASH Line 3021  BACKSLASH
3021         specific  matching is taking place (see "Locale support" in the pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3022         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3023         systems,  or "french" in Windows, some character codes greater than 128         systems,  or "french" in Windows, some character codes greater than 128
3024         are used for accented letters, and these are matched by \w.         are used for accented letters, and these are matched by \w. The use  of
3025           locales with Unicode is discouraged.
        In UTF-8 mode, characters with values greater than 128 never match  \d,  
        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-  
        code character property support is available. The use of  locales  with  
        Unicode is discouraged.  
3026    
3027     Newline sequences     Newline sequences
3028    
3029         Outside  a  character class, the escape sequence \R matches any Unicode         Outside  a  character class, the escape sequence \R matches any Unicode
3030         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is         newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R  is
3031         equivalent to the following:         equivalent to the following:
3032    
3033           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
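
The non-UTF-8 equivalent of \R shown above can be read as the following byte-level test. This is a minimal sketch rather than PCRE's matcher, and the function name newline_seq_len is an assumption made up for this illustration. Checking for CRLF first and never falling back to a lone CR mirrors the atomic group in the pattern.

```c
#include <stddef.h>

/* Hypothetical helper: return the length in bytes of the newline
   sequence at the start of s (non-UTF-8 mode), or 0 if there is none. */
size_t newline_seq_len(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    switch (s[0]) {
    case '\r':
        /* CRLF counts as one two-byte sequence; otherwise a lone CR */
        return (len > 1 && s[1] == '\n') ? 2 : 1;
    case '\n':
    case 0x0b:      /* vertical tab */
    case '\f':
    case 0x85:      /* next line */
        return 1;
    default:
        return 0;
    }
}
```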
# Line 2996  BACKSLASH Line 3049  BACKSLASH
3049     Unicode character properties     Unicode character properties
3050    
3051         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3052         tional escape sequences to match  character  properties  are  available         tional escape sequences that match characters with specific  properties
3053         when UTF-8 mode is selected. They are:         are  available.   When not in UTF-8 mode, these sequences are of course
3054           limited to testing characters whose codepoints are less than  256,  but
3055           they do work in this mode.  The extra escape sequences are:
3056    
3057           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3058           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
# Line 3111  BACKSLASH Line 3166  BACKSLASH
3166         That is, it matches a character without the "mark"  property,  followed         That is, it matches a character without the "mark"  property,  followed
3167         by  zero  or  more  characters with the "mark" property, and treats the         by  zero  or  more  characters with the "mark" property, and treats the
3168         sequence as an atomic group (see below).  Characters  with  the  "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3169         property are typically accents that affect the preceding character.         property  are  typically  accents  that affect the preceding character.
3170           None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3171           matches any one character.
3172    
3173         Matching  characters  by Unicode property is not fast, because PCRE has         Matching  characters  by Unicode property is not fast, because PCRE has
3174         to search a structure that contains  data  for  over  fifteen  thousand         to search a structure that contains  data  for  over  fifteen  thousand
# Line 3537  SUBPATTERNS Line 3594  SUBPATTERNS
3594         "Saturday".         "Saturday".
3595    
3596    
3597    DUPLICATE SUBPATTERN NUMBERS
3598    
3599           Perl 5.10 introduced a feature whereby each alternative in a subpattern
3600           uses the same numbers for its capturing parentheses. Such a  subpattern
3601           starts  with (?| and is itself a non-capturing subpattern. For example,
3602           consider this pattern:
3603    
3604             (?|(Sat)ur|(Sun))day
3605    
3606           Because the two alternatives are inside a (?| group, both sets of  cap-
3607           turing  parentheses  are  numbered one. Thus, when the pattern matches,
3608           you can look at captured substring number  one,  whichever  alternative
3609           matched.  This  construct  is useful when you want to capture part, but
3610           not all, of one of a number of alternatives. Inside a (?| group, paren-
3611           theses  are  numbered as usual, but the number is reset at the start of
3612           each branch. The numbers of any capturing buffers that follow the  sub-
3613           pattern  start after the highest number used in any branch. The follow-
3614           ing example is taken from the Perl documentation.  The  numbers  under-
3615           neath show in which buffer the captured content will be stored.
3616    
3617             # before  ---------------branch-reset----------- after
3618             / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3619             # 1            2         2  3        2     3     4
3620    
3621           A  backreference  or  a  recursive call to a numbered subpattern always
3622           refers to the first one in the pattern with the given number.
3623    
3624           An alternative approach to using this "branch reset" feature is to  use
3625           duplicate named subpatterns, as described in the next section.
3626    
3627    
3628  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3629    
3630         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying  capturing  parentheses  by number is simple, but it can be
# Line 3576  NAMED SUBPATTERNS Line 3664  NAMED SUBPATTERNS
3664           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
3665    
3666         There  are  five capturing substrings, but only one is ever set after a         There  are  five capturing substrings, but only one is ever set after a
3667         match.  The convenience  function  for  extracting  the  data  by  name         match.  (An alternative way of solving this problem is to use a "branch
3668         returns  the  substring  for  the first (and in this example, the only)         reset" subpattern, as described in the previous section.)
3669         subpattern of that name that matched.  This  saves  searching  to  find  
3670         which  numbered  subpattern  it  was. If you make a reference to a non-         The  convenience  function  for extracting the data by name returns the
3671         unique named subpattern from elsewhere in the  pattern,  the  one  that         substring for the first (and in this example, the only)  subpattern  of
3672         corresponds  to  the  lowest number is used. For further details of the         that  name  that  matched.  This saves searching to find which numbered
3673         interfaces for handling named subpatterns, see the  pcreapi  documenta-         subpattern it was. If you make a reference to a non-unique  named  sub-
3674         tion.         pattern  from elsewhere in the pattern, the one that corresponds to the
3675           lowest number is used. For further details of the interfaces  for  han-
3676           dling named subpatterns, see the pcreapi documentation.
3677    
3678    
3679  REPETITION  REPETITION
# Line 4455  AUTHOR Line 4545  AUTHOR
4545    
4546  REVISION  REVISION
4547    
4548         Last updated: 29 May 2007         Last updated: 19 June 2007
4549         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4550  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4551    
# Line 4786  RE-USING A PRECOMPILED PATTERN Line 4876  RE-USING A PRECOMPILED PATTERN
4876    
4877  COMPATIBILITY WITH DIFFERENT PCRE RELEASES  COMPATIBILITY WITH DIFFERENT PCRE RELEASES
4878    
4879         The layout of the control block that is at the start of the  data  that         In general, it is safest to  recompile  all  saved  patterns  when  you
4880         makes  up  a  compiled pattern was changed for release 5.0. If you have         update  to  a new PCRE release, though not all updates actually require
4881         any saved patterns that were compiled with  previous  releases  (not  a         this. Recompiling is definitely needed for release 7.2.
        facility  that  was  previously advertised), you will have to recompile  
        them for release 5.0 and above.  
   
        If you have any saved patterns in UTF-8 mode that use  \p  or  \P  that  
        were  compiled  with any release up to and including 6.4, you will have  
        to recompile them for release 6.5 and above.  
   
        All saved patterns from earlier releases must be recompiled for release  
        7.0  or  higher,  because  there was an internal reorganization at that  
        release.  
4882    
4883    
4884  AUTHOR  AUTHOR
# Line 4810  AUTHOR Line 4890  AUTHOR
4890    
4891  REVISION  REVISION
4892    
4893         Last updated: 24 April 2007         Last updated: 13 June 2007
4894         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4895  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4896    
# Line 5545  PCRE SAMPLE PROGRAM Line 5625  PCRE SAMPLE PROGRAM
5625         bility  of  matching an empty string. Comments in the code explain what         bility  of  matching an empty string. Comments in the code explain what
5626         is going on.         is going on.
5627    
5628         If PCRE is installed in the standard include  and  library  directories         The demonstration program is automatically built if you use  "./config-
5629         for  your  system, you should be able to compile the demonstration pro-         ure;make"  to  build PCRE. Otherwise, if PCRE is installed in the stan-
5630         gram using this command:         dard include and library directories for your  system,  you  should  be
5631           able to compile the demonstration program using this command:
5632    
5633           gcc -o pcredemo pcredemo.c -lpcre           gcc -o pcredemo pcredemo.c -lpcre
5634    
5635         If PCRE is installed elsewhere, you may need to add additional  options         If  PCRE is installed elsewhere, you may need to add additional options
5636         to  the  command line. For example, on a Unix-like system that has PCRE         to the command line. For example, on a Unix-like system that  has  PCRE
5637         installed in /usr/local, you  can  compile  the  demonstration  program         installed  in  /usr/local,  you  can  compile the demonstration program
5638         using a command like this:         using a command like this:
5639    
5640           gcc -o pcredemo -I/usr/local/include pcredemo.c \           gcc -o pcredemo -I/usr/local/include pcredemo.c \
5641               -L/usr/local/lib -lpcre               -L/usr/local/lib -lpcre
5642    
5643         Once  you  have  compiled the demonstration program, you can run simple         Once you have compiled the demonstration program, you  can  run  simple
5644         tests like this:         tests like this:
5645    
5646           ./pcredemo 'cat|dog' 'the cat sat on the mat'           ./pcredemo 'cat|dog' 'the cat sat on the mat'
5647           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
5648    
5649         Note that there is a  much  more  comprehensive  test  program,  called         Note  that  there  is  a  much  more comprehensive test program, called
5650         pcretest,  which  supports  many  more  facilities  for testing regular         pcretest, which supports  many  more  facilities  for  testing  regular
5651         expressions and the PCRE library. The pcredemo program is provided as a         expressions and the PCRE library. The pcredemo program is provided as a
5652         simple coding example.         simple coding example.
5653    
# Line 5574  PCRE SAMPLE PROGRAM Line 5655  PCRE SAMPLE PROGRAM
5655         the standard library directory, you may get an error like this when you         the standard library directory, you may get an error like this when you
5656         try to run pcredemo:         try to run pcredemo:
5657    
5658           ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or           ld.so.1: a.out: fatal: libpcre.so.0: open failed:  No  such  file  or
5659         directory         directory
5660    
5661         This is caused by the way shared library support works  on  those  sys-         This  is  caused  by the way shared library support works on those sys-
5662         tems. You need to add         tems. You need to add
5663    
5664           -R/usr/local/lib           -R/usr/local/lib
# Line 5594  AUTHOR Line 5675  AUTHOR
5675    
5676  REVISION  REVISION
5677    
5678         Last updated: 06 March 2007         Last updated: 13 June 2007
5679         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5680  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5681  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
# Line 5664  PCRE DISCUSSION OF STACK USAGE Line 5745  PCRE DISCUSSION OF STACK USAGE
5745         In environments where stack memory is constrained, you  might  want  to         In environments where stack memory is constrained, you  might  want  to
5746         compile  PCRE to use heap memory instead of stack for remembering back-         compile  PCRE to use heap memory instead of stack for remembering back-
5747         up points. This makes it run a lot more slowly, however. Details of how         up points. This makes it run a lot more slowly, however. Details of how
5748         to do this are given in the pcrebuild documentation.         to do this are given in the pcrebuild documentation. When built in this
5749           way, instead of using the stack, PCRE obtains and frees memory by call-
5750         In  Unix-like environments, there is not often a problem with the stack         ing  the  functions  that  are  pointed to by the pcre_stack_malloc and
5751         unless very long strings are involved,  though  the  default  limit  on         pcre_stack_free variables. By default,  these  point  to  malloc()  and
5752         stack  size  varies  from system to system. Values from 8Mb to 64Mb are         free(),  but you can replace the pointers to cause PCRE to use your own
5753           functions. Since the block sizes are always the same,  and  are  always
5754           freed in reverse order, it may be possible to implement customized mem-
5755           ory handlers that are more efficient than the standard functions.
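
The idea of a customized handler can be sketched as follows. This is only one possible approach, not code from PCRE: it keeps freed blocks of the usual size on a small last-in-first-out cache so that the next request can reuse them without calling malloc(). The names my_stack_malloc, my_stack_free, frame_header, and FRAME_CACHE_MAX are all hypothetical. A small header records each block's size so that a block of an unexpected size is never reused by mistake.

```c
#include <stdlib.h>

#define FRAME_CACHE_MAX 64      /* illustrative cache depth */

/* Header placed in front of every block. The union forces an alignment
   suitable for the data that follows it. */
typedef union frame_header {
    size_t size;
    long double align_ld;
    void *align_p;
} frame_header;

static frame_header *frame_cache[FRAME_CACHE_MAX];
static int frame_cache_count = 0;
static size_t frame_size = 0;   /* size of the blocks being cached */

void *my_stack_malloc(size_t size)
{
    frame_header *h;

    /* Reuse the most recently freed block when the size matches. */
    if (frame_cache_count > 0 && size == frame_size)
        return frame_cache[--frame_cache_count] + 1;

    h = malloc(sizeof(frame_header) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    if (frame_size == 0)
        frame_size = size;      /* remember the first size seen */
    return h + 1;
}

void my_stack_free(void *block)
{
    frame_header *h;

    if (block == NULL)
        return;
    h = (frame_header *)block - 1;
    if (h->size == frame_size && frame_cache_count < FRAME_CACHE_MAX)
        frame_cache[frame_cache_count++] = h;   /* keep for reuse */
    else
        free(h);
}
```

In a program linked with a PCRE library built to use the heap, these functions would be installed by assigning them to pcre_stack_malloc and pcre_stack_free before the first call to pcre_exec(). The sketch is not thread-safe, and whether it is actually faster than malloc() on a given system would need to be measured.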
5756    
5757           In Unix-like environments, there is not often a problem with the  stack
5758           unless  very  long  strings  are  involved, though the default limit on
5759           stack size varies from system to system. Values from 8Mb  to  64Mb  are
5760         common. You can find your default limit by running the command:         common. You can find your default limit by running the command:
5761    
5762           ulimit -s           ulimit -s
5763    
5764         Unfortunately, the effect of running out of  stack  is  often  SIGSEGV,         Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
5765         though  sometimes  a more explicit error message is given. You can nor-         though sometimes a more explicit error message is given. You  can  nor-
5766         mally increase the limit on stack size by code such as this:         mally increase the limit on stack size by code such as this:
5767    
5768           struct rlimit rlim;           struct rlimit rlim;
# Line 5682  PCRE DISCUSSION OF STACK USAGE Line 5770  PCRE DISCUSSION OF STACK USAGE
5770           rlim.rlim_cur = 100*1024*1024;           rlim.rlim_cur = 100*1024*1024;
5771           setrlimit(RLIMIT_STACK, &rlim);           setrlimit(RLIMIT_STACK, &rlim);
5772    
5773         This reads the current limits (soft and hard) using  getrlimit(),  then         This  reads  the current limits (soft and hard) using getrlimit(), then
5774         attempts  to  increase  the  soft limit to 100Mb using setrlimit(). You         attempts to increase the soft limit to  100Mb  using  setrlimit().  You
5775         must do this before calling pcre_exec().         must do this before calling pcre_exec().
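
Because the diff shows only part of the fragment above, here is a self-contained sketch of the same technique with error checking added. The function name raise_stack_limit is hypothetical; getrlimit(), setrlimit(), and RLIMIT_STACK are the standard POSIX interfaces. The sketch assumes a Unix-like system and never asks for more than the hard limit.

```c
#include <sys/resource.h>

/* Hypothetical helper. Tries to make the soft stack limit at least
   want_bytes, without exceeding the hard limit. Returns 0 if the soft
   limit was already large enough or was raised (possibly only as far
   as the hard limit), and -1 if getrlimit() or setrlimit() failed. */
int raise_stack_limit(rlim_t want_bytes)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= want_bytes)
        return 0;                       /* nothing to do */
    if (rlim.rlim_max != RLIM_INFINITY && want_bytes > rlim.rlim_max)
        want_bytes = rlim.rlim_max;     /* clamp to the hard limit */
    rlim.rlim_cur = want_bytes;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A caller would use something like raise_stack_limit((rlim_t)100 * 1024 * 1024) early in main(), before any call to pcre_exec().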
5776    
5777         PCRE has an internal counter that can be used to  limit  the  depth  of         PCRE  has  an  internal  counter that can be used to limit the depth of
5778         recursion,  and  thus cause pcre_exec() to give an error code before it         recursion, and thus cause pcre_exec() to give an error code  before  it
5779         runs out of stack. By default, the limit is very  large,  and  unlikely         runs  out  of  stack. By default, the limit is very large, and unlikely
5780         ever  to operate. It can be changed when PCRE is built, and it can also         ever to operate. It can be changed when PCRE is built, and it can  also
5781         be set when pcre_exec() is called. For details of these interfaces, see         be set when pcre_exec() is called. For details of these interfaces, see
5782         the pcrebuild and pcreapi documentation.         the pcrebuild and pcreapi documentation.
5783    
5784         As a very rough rule of thumb, you should reckon on about 500 bytes per         As a very rough rule of thumb, you should reckon on about 500 bytes per
5785         recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you         recursion.  Thus,  if  you  want  to limit your stack usage to 8Mb, you
5786         should  set  the  limit at 16000 recursions. A 64Mb stack, on the other         should set the limit at 16000 recursions. A 64Mb stack,  on  the  other
5787         hand, can support around 128000 recursions. The pcretest  test  program         hand,  can  support around 128000 recursions. The pcretest test program
5788         has a command line option (-S) that can be used to increase the size of         has a command line option (-S) that can be used to increase the size of
5789         its stack.         its stack.
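
The arithmetic behind these figures is a simple division, sketched below. The function name recursion_limit_for_stack and the macro BYTES_PER_RECURSION are hypothetical, and the 500-byte figure is only the rough estimate quoted above. The round numbers in the text correspond to treating 8Mb as 8,000,000 bytes; with 8*1024*1024 bytes the result would be slightly higher.

```c
/* Rough per-recursion stack cost quoted in the text; an estimate only. */
#define BYTES_PER_RECURSION 500UL

/* Hypothetical helper: the largest recursion depth that should fit in
   stack_bytes of stack, under the 500-bytes-per-recursion estimate. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

The resulting value is the kind of number that could be supplied as the recursion limit when pcre_exec() is called, using the interface described in the pcreapi documentation.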
5790    
# Line 5710  AUTHOR Line 5798  AUTHOR
5798    
5799  REVISION  REVISION
5800    
5801         Last updated: 12 March 2007         Last updated: 05 June 2007
5802         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5803  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5804    
