Diff of /code/trunk/doc/pcre.txt

revision 53 by nigel, Sat Feb 24 21:39:42 2007 UTC revision 63 by nigel, Sat Feb 24 21:40:03 2007 UTC
# Line 1  Line 1 
1    This file contains a concatenation of the PCRE man pages, converted to plain
2    text format for ease of searching with a text editor, or for use on systems
3    that do not have a man page processor. The small individual files that give
4    synopses of each function in the library have not been included. There are
5    separate text files for the pcregrep and pcretest commands.
6    -----------------------------------------------------------------------------
7    
8    NAME
9         PCRE - Perl-compatible regular expressions
10    
11    
12    DESCRIPTION
13    
14         The PCRE library is a set of functions that implement  regu-
15         lar  expression  pattern  matching using the same syntax and
16         semantics as Perl, with just a few differences. The  current
17         implementation  of  PCRE  (release 4.x) corresponds approxi-
18         mately with Perl 5.8, including support  for  UTF-8  encoded
19         strings.    However,  this  support  has  to  be  explicitly
20         enabled; it is not the default.
21    
22         PCRE is written in C and released as a C library. However, a
23         number  of  people  have  written wrappers and interfaces of
24         various kinds. A C++ class is included  in  these  contribu-
25         tions,  which  can  be found in the Contrib directory at the
26         primary FTP site, which is:
27    
28         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
29    
30         Details of exactly which Perl  regular  expression  features
31         are  and  are  not  supported  by PCRE are given in separate
32         documents. See the pcrepattern and pcrecompat pages.
33    
34         Some features of PCRE can be included, excluded, or  changed
35         when  the library is built. The pcre_config() function makes
36         it possible for a client  to  discover  which  features  are
37         available.  Documentation  about  building  PCRE for various
38         operating systems can be found in the  README  file  in  the
39         source distribution.
40    
41    
42    USER DOCUMENTATION
43    
44         The user documentation for PCRE has been  split  up  into  a
45         number  of  different sections. In the "man" format, each of
46         these is a separate "man page". In the HTML format, each  is
47         a  separate  page,  linked from the index page. In the plain
48         text format, all the sections are concatenated, for ease  of
49         searching. The sections are as follows:
50    
51           pcre              this document
52           pcreapi           details of PCRE's native API
53           pcrebuild         options for building PCRE
54           pcrecallout       details of the callout feature
55           pcrecompat        discussion of Perl compatibility
56           pcregrep          description of the pcregrep command
57           pcrepattern       syntax and semantics of supported
58                               regular expressions
59           pcreperform       discussion of performance issues
60           pcreposix         the POSIX-compatible API
61           pcresample        discussion of the sample program
62           pcretest          the pcretest testing command
63    
64         In addition, in the "man" and HTML formats, there is a short
65         page  for  each  library function, listing its arguments and
66         results.
67    
68    
69    LIMITATIONS
70    
71         There are some size limitations in PCRE but it is hoped that
72         they will never in practice be relevant.
73    
74         The maximum length of a  compiled  pattern  is  65539  (sic)
75         bytes  if PCRE is compiled with the default internal linkage
76         size of 2. If you want to process regular  expressions  that
77         are  truly  enormous,  you can compile PCRE with an internal
78         linkage size of 3 or 4 (see the README file  in  the  source
79         distribution  and  the pcrebuild documentation for details).
80         In these cases the limit is substantially larger.   However,
81         the speed of execution will be slower.
82    
83         All values in repeating quantifiers must be less than 65536.
84         The maximum number of capturing subpatterns is 65535.
85    
86         There is no limit to the  number  of  non-capturing  subpat-
87         terns,  but  the  maximum  depth  of nesting of all kinds of
88         parenthesized subpattern, including  capturing  subpatterns,
89         assertions, and other types of subpattern, is 200.
90    
91         The maximum length of a subject string is the largest  posi-
92         tive number that an integer variable can hold. However, PCRE
93         uses recursion to handle subpatterns and indefinite  repeti-
94         tion.  This  means  that the available stack space may limit
95         the size of a subject string that can be processed  by  cer-
96         tain patterns.
97    
98    
99    UTF-8 SUPPORT
100    
101         Starting at release 3.3, PCRE has had some support for char-
102         acter  strings  encoded in the UTF-8 format. For release 4.0
103         this has been greatly extended to cover most common require-
104         ments.
105    
106         In order to process UTF-8 strings, you must  build  PCRE  to
107         include  UTF-8  support  in  the code, and, in addition, you
108         must call pcre_compile() with  the  PCRE_UTF8  option  flag.
109         When  you  do this, both the pattern and any subject strings
110         that are matched against it are  treated  as  UTF-8  strings
111         instead of just strings of bytes.
112    
113         If you compile PCRE with UTF-8 support, but do not use it at
114         run  time,  the  library will be a bit bigger, but the addi-
115         tional run time overhead is limited to testing the PCRE_UTF8
116         flag in several places, so should not be very large.
117    
118         The following comments apply when PCRE is running  in  UTF-8
119         mode:
120    
121         1. PCRE assumes that the strings it is given  contain  valid
122         UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
123         you pass invalid UTF-8 strings  to  PCRE,  the  results  are
124         undefined.
125    
126         2. In a pattern, the escape sequence \x{...}, where the con-
127         tents  of  the  braces is a string of hexadecimal digits, is
128         interpreted as a UTF-8 character whose code  number  is  the
129         given  hexadecimal  number, for example: \x{1234}. If a non-
130         hexadecimal digit appears between the braces,  the  item  is
131         not  recognized.  This escape sequence can be used either as
132         a literal, or within a character class.
133    
134         3. The original hexadecimal escape sequence, \xhh, matches a
135         two-byte UTF-8 character if the value is greater than 127.
136    
137         4. Repeat quantifiers apply to  complete  UTF-8  characters,
138         not to individual bytes, for example: \x{100}{3}.
139    
140         5. The dot metacharacter matches one UTF-8 character instead
141         of a single byte.
142    
143         6. The escape sequence \C can be used to match a single byte
144         in UTF-8 mode, but its use can lead to some strange effects.
145    
146         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W
147         correctly test characters of any code value, but the charac-
148         ters that PCRE recognizes as digits, spaces, or word charac-
149         ters  remain  the  same  set as before, all with values less
150         than 256.
151    
152         8. Case-insensitive  matching  applies  only  to  characters
153         whose  values  are  less than 256. PCRE does not support the
154         notion of "case" for higher-valued characters.
155    
156         9. PCRE does not support the use of Unicode tables and  pro-
157         perties or the Perl escapes \p, \P, and \X.
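
     Items 2 to 5 above depend on how a character's code number maps
     to UTF-8 bytes. The following is a minimal sketch of that mapping
     in plain C. It is independent of PCRE: utf8_encode() and
     utf8_strlen() are hypothetical helper names, not library
     functions, and the encoder assumes code numbers no greater than
     0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode the code number cp as
   UTF-8 into out (room for 4 bytes needed) and return the byte count. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);           /* assumption: modern UTF-8 range */
    if (cp < 0x80UL) {                  /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                 /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Count characters, not bytes, in a valid zero-terminated UTF-8 string
   by skipping continuation bytes (those of the form 10xxxxxx). */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

     With this sketch, the code number in \x{1234} encodes as the
     three bytes E1 88 B4, and a value above 127 such as 0xE9 encodes
     as the two bytes C3 A9. A subject that \x{100}{3} can match is
     six bytes long but counts as three characters.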
158    
159    
160    AUTHOR
161    
162         Philip Hazel <ph10@cam.ac.uk>
163         University Computing Service,
164         Cambridge CB2 3QG, England.
165         Phone: +44 1223 334714
166    
167    Last updated: 04 February 2003
168    Copyright (c) 1997-2003 University of Cambridge.
169    -----------------------------------------------------------------------------
170    
171    NAME
172         PCRE - Perl-compatible regular expressions
173    
174    
175    PCRE BUILD-TIME OPTIONS
176    
177         This document describes the optional features of  PCRE  that
178         can  be  selected when the library is compiled. They are all
179         selected, or deselected, by providing options to the config-
180         ure  script  which  is run before the make command. The com-
181         plete list of options  for  configure  (which  includes  the
182         standard  ones  such  as  the  selection of the installation
183         directory) can be obtained by running
184    
185           ./configure --help
186    
187         The following sections describe certain options whose  names
188         begin  with  --enable  or  --disable. These settings specify
189         changes to the defaults for the configure  command.  Because
190         of  the  way  that  configure  works, --enable and --disable
191         always come in pairs, so  the  complementary  option  always
192         exists  as  well, but as it specifies the default, it is not
193         described.
194    
195    
196    UTF-8 SUPPORT
197    
198         To build PCRE with support for UTF-8 character strings, add
199    
200           --enable-utf8
201    
202         to the configure command. Of itself, this does not make PCRE
203         treat  strings as UTF-8. As well as compiling PCRE with this
204         option, you also have to set the PCRE_UTF8 option  when
205         you call the pcre_compile() function.
206    
207    
208    CODE VALUE OF NEWLINE
209    
210         By default, PCRE treats character 10 (linefeed) as the  new-
211         line  character.  This  is  the  normal newline character on
212         Unix-like systems. You can compile PCRE to use character  13
213         (carriage return) instead by adding
214    
215           --enable-newline-is-cr
216    
217         to the configure command. For completeness there is  also  a
218         --enable-newline-is-lf  option,  which  explicitly specifies
219         linefeed as the newline character.
220    
221    
222    BUILDING SHARED AND STATIC LIBRARIES
223    
224         The PCRE building process uses libtool to build both  shared
225         and  static  Unix libraries by default. You can suppress one
226         of these by adding one of
227    
228           --disable-shared
229           --disable-static
230    
231         to the configure command, as required.
232    
233    
234    POSIX MALLOC USAGE
235    
236         When PCRE is called through the  POSIX  interface  (see  the
237         pcreposix  documentation),  additional  working  storage  is
238         required for holding the pointers  to  capturing  substrings
239         because  PCRE requires three integers per substring, whereas
240         the POSIX interface provides only  two.  If  the  number  of
241         expected  substrings  is  small,  the  wrapper function uses
242         space on the stack, because this is faster than  using  mal-
243         loc()  for  each call. The default threshold above which the
244         stack is no longer used is 10; it can be changed by adding a
245         setting such as
246    
247           --with-posix-malloc-threshold=20
248    
249         to the configure command.
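
     The stack-or-malloc() decision described above can be sketched
     as follows. This is an illustration only, not the wrapper's
     actual code: ovector_demo() and the POSIX_MALLOC_THRESHOLD macro
     are hypothetical names, and the loop merely stands in for the
     real matching call.

```c
#include <assert.h>
#include <stdlib.h>

#define POSIX_MALLOC_THRESHOLD 10   /* default value stated in the text */

/* Hypothetical sketch: obtain working storage for nmatch substrings.
   Sets *used_malloc to 1 if the heap was needed, otherwise to 0.
   Returns 0 on success and -1 if malloc() fails. */
int ovector_demo(size_t nmatch, int *used_malloc)
{
    int small[POSIX_MALLOC_THRESHOLD * 3];  /* stack space for small cases */
    int *ovector = small;
    size_t count = nmatch * 3;              /* three integers per substring */
    size_t i;

    assert(used_malloc != NULL);
    *used_malloc = 0;
    if (nmatch > POSIX_MALLOC_THRESHOLD) {  /* above the threshold: use heap */
        ovector = malloc(count * sizeof(int));
        if (ovector == NULL)
            return -1;
        *used_malloc = 1;
    }

    for (i = 0; i < count; i++)             /* stand-in for the real match */
        ovector[i] = -1;

    if (*used_malloc)
        free(ovector);
    return 0;
}
```

     In this sketch a request for 10 substrings stays on the stack
     and a request for 11 goes to malloc(), which is the boundary
     that --with-posix-malloc-threshold moves.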
250    
251    
252    LIMITING PCRE RESOURCE USAGE
253    
254         Internally, PCRE has a  function  called  match()  which  it
255         calls  repeatedly  (possibly  recursively) when performing a
256         matching operation. By limiting the  number  of  times  this
257         function  may  be  called,  a  limit  can  be  placed on the
258         resources used by a single call to  pcre_exec().  The  limit
259         can  be  changed  at  run  time, as described in the pcreapi
260         documentation. The default is 10 million, but  this  can  be
261         changed by adding a setting such as
262    
263           --with-match-limit=500000
264    
265         to the configure command.
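
     The idea of bounding work by counting calls to a recursive
     matching function can be sketched with a toy matcher. This is
     not PCRE's match() function: toy_match(), struct budget, and the
     MATCH_* codes are hypothetical, and the toy supports only
     literal characters, the dot, and a trailing * on a single
     character, matching the whole subject.

```c
#include <assert.h>

#define MATCH_OK      1
#define MATCH_FAIL    0
#define MATCH_LIMIT (-1)    /* hypothetical code, not a PCRE constant */

/* Call counter and the limit it must not exceed. */
struct budget {
    unsigned long calls;
    unsigned long limit;
};

/* Toy recursive matcher. Every entry counts against the budget; once
   the limit is exceeded the whole match is abandoned. */
int toy_match(const char *pat, const char *s, struct budget *b)
{
    assert(pat != 0 && s != 0 && b != 0);
    if (++b->calls > b->limit)
        return MATCH_LIMIT;
    if (pat[0] == '\0')
        return *s == '\0' ? MATCH_OK : MATCH_FAIL;

    if (pat[1] == '*') {
        /* Try zero repetitions first, then one more on each failure. */
        for (;;) {
            int rc = toy_match(pat + 2, s, b);
            if (rc != MATCH_FAIL)
                return rc;              /* success, or limit reached */
            if (*s == '\0' || (pat[0] != '.' && pat[0] != *s))
                return MATCH_FAIL;
            s++;
        }
    }

    if (*s != '\0' && (pat[0] == '.' || pat[0] == *s))
        return toy_match(pat + 1, s + 1, b);
    return MATCH_FAIL;
}
```

     A pattern such as a*a*a*a*b applied to a subject of twenty "a"
     characters forces heavy backtracking before it can fail. With a
     limit of 100 calls the toy matcher gives up with MATCH_LIMIT,
     whereas a limit of 10 million lets it run to an ordinary
     failure.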
266    
267    
268    HANDLING VERY LARGE PATTERNS
269    
270         Within a compiled pattern, offset values are used  to  point
271         from  one  part  to  another  (for  example, from an opening
272         parenthesis to an  alternation  metacharacter).  By  default
273         two-byte  values  are  used  for these offsets, leading to a
274         maximum size for a compiled pattern of around 64K.  This  is
275         sufficient  to  handle  all  but the most gigantic patterns.
276         Nevertheless, some people do want to process  enormous  pat-
277         terns,  so  it is possible to compile PCRE to use three-byte
278         or four-byte offsets by adding a setting such as
279    
280           --with-link-size=3
281    
282         to the configure command. The value given must be 2,  3,  or
283         4.  Using  longer  offsets  slows down the operation of PCRE
284         because it has to load additional bytes when handling them.
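
     The trade-off between link size and pattern size can be
     illustrated with a small sketch. The helpers put_link(),
     get_link(), and max_link() are hypothetical and are not PCRE's
     internal code; storing the most significant byte first is an
     assumption made for the example.

```c
#include <assert.h>

/* Hypothetical helper: store value in link_size bytes at p,
   most significant byte first. */
void put_link(unsigned char *p, unsigned long value, int link_size)
{
    int i;
    assert(link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
    assert(value == 0);     /* the offset must fit in link_size bytes */
}

/* Hypothetical helper: load an offset back, one byte at a time. A
   larger link_size means more bytes to load for every offset. */
unsigned long get_link(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    assert(link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits: 65535 for 2 bytes, 16777215 for 3 bytes,
   4294967295 for 4 bytes. */
unsigned long max_link(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    if (link_size == 4)
        return 0xFFFFFFFFUL;    /* avoids shifting by the full width */
    return (1UL << (8 * link_size)) - 1;
}
```

     Two-byte offsets top out at 65535, which is where the limit of
     around 64K comes from. Each extra byte multiplies the reachable
     range by 256 at the cost of one more load per offset.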
285    
286         If you build PCRE with an increased link size, test  2  (and
287         test 5 if you are using UTF-8) will fail. Part of the output
288         of these tests is a representation of the compiled  pattern,
289         and this changes with the link size.
290    
291    Last updated: 21 January 2003
292    Copyright (c) 1997-2003 University of Cambridge.
293    -----------------------------------------------------------------------------
294    
295  NAME  NAME
296       pcre - Perl-compatible regular expressions.       PCRE - Perl-compatible regular expressions
297    
298    
299    SYNOPSIS OF PCRE API
300    
 SYNOPSIS  
301       #include <pcre.h>       #include <pcre.h>
302    
303       pcre *pcre_compile(const char *pattern, int options,       pcre *pcre_compile(const char *pattern, int options,
# Line 17  SYNOPSIS Line 311  SYNOPSIS
311            const char *subject, int length, int startoffset,            const char *subject, int length, int startoffset,
312            int options, int *ovector, int ovecsize);            int options, int *ovector, int ovecsize);
313    
314         int pcre_copy_named_substring(const pcre *code,
315              const char *subject, int *ovector,
316              int stringcount, const char *stringname,
317              char *buffer, int buffersize);
318    
319       int pcre_copy_substring(const char *subject, int *ovector,       int pcre_copy_substring(const char *subject, int *ovector,
320            int stringcount, int stringnumber, char *buffer,            int stringcount, int stringnumber, char *buffer,
321            int buffersize);            int buffersize);
322    
323         int pcre_get_named_substring(const pcre *code,
324              const char *subject, int *ovector,
325              int stringcount, const char *stringname,
326              const char **stringptr);
327    
328         int pcre_get_stringnumber(const pcre *code,
329              const char *name);
330    
331       int pcre_get_substring(const char *subject, int *ovector,       int pcre_get_substring(const char *subject, int *ovector,
332            int stringcount, int stringnumber,            int stringcount, int stringnumber,
333            const char **stringptr);            const char **stringptr);
# Line 37  SYNOPSIS Line 344  SYNOPSIS
344       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
345            int what, void *where);            int what, void *where);
346    
347    
348       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
349    
350         int pcre_config(int what, void *where);
351    
352       char *pcre_version(void);       char *pcre_version(void);
353    
354       void *(*pcre_malloc)(size_t);       void *(*pcre_malloc)(size_t);
355    
356       void (*pcre_free)(void *);       void (*pcre_free)(void *);
357    
358         int (*pcre_callout)(pcre_callout_block *);
359    
360    
361    PCRE API
 DESCRIPTION  
      The PCRE library is a set of functions that implement  regu-  
      lar  expression  pattern  matching using the same syntax and  
      semantics as Perl  5,  with  just  a  few  differences  (see  
   
      below).  The  current  implementation  corresponds  to  Perl  
      5.005, with some additional features  from  later  versions.  
      This  includes  some  experimental,  incomplete  support for  
      UTF-8 encoded strings. Details of exactly what is  and  what  
      is not supported are given below.  
362    
363       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
364       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
# Line 76  DESCRIPTION Line 377  DESCRIPTION
377       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
378       are  used  for compiling and matching regular expressions. A       are  used  for compiling and matching regular expressions. A
379       sample program that demonstrates the simplest way  of  using       sample program that demonstrates the simplest way  of  using
380       them  is  given  in the file pcredemo.c. The last section of       them  is  given in the file pcredemo.c. The pcresample docu-
381       this man page describes how to run it.       mentation describes how to run it.
382    
383         There are convenience functions for extracting captured sub-
384         strings from a matched subject string. They are:
385    
386       The functions  pcre_copy_substring(),  pcre_get_substring(),         pcre_copy_substring()
387       and  pcre_get_substring_list() are convenience functions for         pcre_copy_named_substring()
388       extracting  captured  substrings  from  a  matched   subject         pcre_get_substring()
389       string; pcre_free_substring() and pcre_free_substring_list()         pcre_get_named_substring()
390       are also provided, to free the  memory  used  for  extracted         pcre_get_substring_list()
391    
392         pcre_free_substring()  and  pcre_free_substring_list()   are
393         also  provided,  to  free  the  memory  used  for  extracted
394       strings.       strings.
395    
396       The function pcre_maketables() is used (optionally) to build       The function pcre_maketables() is used (optionally) to build
# Line 104  DESCRIPTION Line 411  DESCRIPTION
411       replace them if it  wishes  to  intercept  the  calls.  This       replace them if it  wishes  to  intercept  the  calls.  This
412       should be done before calling any PCRE functions.       should be done before calling any PCRE functions.
413    
414         The global variable pcre_callout initially contains NULL. It
415         can be set by the caller to a "callout" function, which PCRE
416         will then call at specified points during a matching  opera-
417         tion. Details are given in the pcrecallout documentation.
418    
419    
420    MULTITHREADING
421    
 MULTI-THREADING  
422       The PCRE functions can be used in  multi-threading  applica-       The PCRE functions can be used in  multi-threading  applica-
423       tions, with the proviso that the memory management functions       tions, with the proviso that the memory management functions
424       pointed to by pcre_malloc and pcre_free are  shared  by  all       pointed to by pcre_malloc and  pcre_free,  and  the  callout
425         function  pointed  to  by  pcre_callout,  are  shared by all
426       threads.       threads.
427    
428       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
# Line 117  MULTI-THREADING Line 430  MULTI-THREADING
430       used by several threads at once.       used by several threads at once.
431    
432    
433    CHECKING BUILD-TIME OPTIONS
434    
435         int pcre_config(int what, void *where);
436    
437         The function pcre_config() makes  it  possible  for  a  PCRE
438         client  to  discover  which optional features have been com-
439         piled into the PCRE library. The pcrebuild documentation has
440         more details about these optional features.
441    
442         The first argument for pcre_config() is an integer, specify-
443         ing  which information is required; the second argument is a
444         pointer to a variable into which the information is  placed.
445         The following information is available:
446    
447           PCRE_CONFIG_UTF8
448    
449         The output is an integer that is set to one if UTF-8 support
450         is available; otherwise it is set to zero.
451    
452           PCRE_CONFIG_NEWLINE
453    
454         The output is an integer that is set to  the  value  of  the
455         code  that  is  used for the newline character. It is either
456         linefeed (10) or carriage return (13), and  should  normally
457         be the standard character for your operating system.
458    
459           PCRE_CONFIG_LINK_SIZE
460    
461         The output is an integer that contains the number  of  bytes
462         used  for  internal linkage in compiled regular expressions.
463         The value is 2, 3, or 4. Larger values allow larger  regular
464         expressions  to be compiled, at the expense of slower match-
465         ing. The default value of 2 is sufficient for  all  but  the
466         most  massive patterns, since it allows the compiled pattern
467         to be up to 64K in size.
468    
469           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
470    
471         The output is an integer that contains the  threshold  above
472         which  the POSIX interface uses malloc() for output vectors.
473         Further details are given in the pcreposix documentation.
474    
475           PCRE_CONFIG_MATCH_LIMIT
476    
477         The output is an integer that gives the  default  limit  for
478         the   number  of  internal  matching  function  calls  in  a
479         pcre_exec()  execution.  Further  details  are  given   with
480         pcre_exec() below.
481    
482    
483  COMPILING A PATTERN  COMPILING A PATTERN
484    
485         pcre *pcre_compile(const char *pattern, int options,
486              const char **errptr, int *erroffset,
487              const unsigned char *tableptr);
488    
489       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
490       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
491       by a binary zero, and is passed in the argument  pattern.  A       by a binary zero, and is passed in the argument  pattern.  A
# Line 134  COMPILING A PATTERN Line 501  COMPILING A PATTERN
501       pcre data block is not fully relocatable,  because  it  con-       pcre data block is not fully relocatable,  because  it  con-
502       tains  a  copy of the tableptr argument, which is an address       tains  a  copy of the tableptr argument, which is an address
503       (see below).       (see below).
   
      The size of a compiled pattern is  roughly  proportional  to  
      the length of the pattern string, except that each character  
      class (other than those containing just a single  character,  
      negated  or  not)  requires 33 bytes, and repeat quantifiers  
      with a minimum greater than one or a bounded  maximum  cause  
      the  relevant  portions of the compiled pattern to be repli-  
      cated.  
   
504       The options argument contains independent bits  that  affect       The options argument contains independent bits  that  affect
505       the  compilation.  It  should  be  zero  if  no  options are       the  compilation.  It  should  be  zero  if  no  options are
506       required. Some of the options, in particular, those that are       required. Some of the options, in particular, those that are
507       compatible  with Perl, can also be set and unset from within       compatible  with Perl, can also be set and unset from within
508       the pattern (see the detailed description of regular expres-       the pattern (see the detailed description of regular expres-
509       sions below). For these options, the contents of the options       sions  in the pcrepattern documentation). For these options,
510       argument specifies their initial settings at  the  start  of       the contents of the options argument specifies their initial
511       compilation  and  execution. The PCRE_ANCHORED option can be       settings  at  the  start  of  compilation and execution. The
512       set at the time of matching as well as at compile time.       PCRE_ANCHORED option can be set at the time of  matching  as
513         well as at compile time.
514    
515       If errptr is NULL, pcre_compile() returns NULL  immediately.       If errptr is NULL, pcre_compile() returns NULL  immediately.
516       Otherwise, if compilation of a pattern fails, pcre_compile()       Otherwise, if compilation of a pattern fails, pcre_compile()
# Line 181  COMPILING A PATTERN Line 540  COMPILING A PATTERN
540           &erroffset,       /* for error offset */           &erroffset,       /* for error offset */
541           NULL);            /* use default character tables */           NULL);            /* use default character tables */
542    
543       The following option bits are defined in the header file:       The following option bits are defined:
544    
545         PCRE_ANCHORED         PCRE_ANCHORED
546    
547       If this bit is set, the pattern is forced to be  "anchored",       If this bit is set, the pattern is forced to be  "anchored",
548       that is, it is constrained to match only at the start of the       that is, it is constrained to match only at the first match-
549       string which is being searched (the "subject string").  This       ing point in the string which is being searched  (the  "sub-
550       effect can also be achieved by appropriate constructs in the       ject string"). This effect can also be achieved by appropri-
551       pattern itself, which is the only way to do it in Perl.       ate constructs in the pattern itself, which is the only  way
552         to do it in Perl.
553    
554         PCRE_CASELESS         PCRE_CASELESS
555    
556       If this bit is set, letters in the pattern match both  upper       If this bit is set, letters in the pattern match both  upper
557       and  lower  case  letters.  It  is  equivalent  to Perl's /i       and  lower  case  letters.  It  is  equivalent  to Perl's /i
558       option.       option, and it can be changed within a  pattern  by  a  (?i)
559         option setting.
560    
561         PCRE_DOLLAR_ENDONLY         PCRE_DOLLAR_ENDONLY
562    
# Line 205  COMPILING A PATTERN Line 566  COMPILING A PATTERN
566       character  if it is a newline (but not before any other new-       character  if it is a newline (but not before any other new-
567       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if
568       PCRE_MULTILINE is set. There is no equivalent to this option       PCRE_MULTILINE is set. There is no equivalent to this option
569       in Perl.       in Perl, and no way to set it within a pattern.
570    
571         PCRE_DOTALL         PCRE_DOTALL
572    
573       If this bit is  set,  a  dot  metacharacter  in  the  pattern       If this bit is  set,  a  dot  metacharacter  in  the  pattern
574       matches all characters, including newlines. Without it, new-       matches all characters, including newlines. Without it, new-
575       lines are excluded. This option is equivalent to  Perl's  /s       lines are excluded. This option is equivalent to  Perl's  /s
576       option.  A negative class such as [^a] always matches a new-       option,  and  it  can  be changed within a pattern by a (?s)
577       line character, independent of the setting of this option.       option setting. A negative class such as [^a] always matches
578         a  newline  character,  independent  of  the setting of this
579         option.
580    
581         PCRE_EXTENDED         PCRE_EXTENDED
582    
583       If this bit is set, whitespace data characters in  the  pat-       If this bit is set, whitespace data characters in  the  pat-
584       tern  are  totally  ignored  except when escaped or inside a       tern  are  totally  ignored  except when escaped or inside a
585       character class, and characters between an unescaped #  out-       character class. Whitespace does not include the VT  charac-
586       side  a  character  class  and  the  next newline character,       ter  (code 11). In addition, characters between an unescaped
587         # outside a character class and the next newline  character,
588       inclusive, are also ignored. This is equivalent to Perl's /x       inclusive, are also ignored. This is equivalent to Perl's /x
589       option,  and  makes  it  possible to include comments inside       option, and it can be changed within a  pattern  by  a  (?x)
590       complicated patterns. Note, however, that this applies  only       option setting.
591       to  data  characters. Whitespace characters may never appear  
592         This option makes it possible  to  include  comments  inside
593         complicated patterns.  Note, however, that this applies only
594         to data characters. Whitespace characters may  never  appear
595       within special character sequences in a pattern, for example       within special character sequences in a pattern, for example
596       within  the sequence (?( which introduces a conditional sub-       within the sequence (?( which introduces a conditional  sub-
597       pattern.       pattern.
598    
599         PCRE_EXTRA         PCRE_EXTRA
# Line 256  COMPILING A PATTERN Line 623  COMPILING A PATTERN
623       of  line"  constructs match immediately following or immedi-       of  line"  constructs match immediately following or immedi-
624       ately before any newline  in  the  subject  string,  respec-       ately before any newline  in  the  subject  string,  respec-
625       tively,  as  well  as  at  the  very  start and end. This is       tively,  as  well  as  at  the  very  start and end. This is
626       equivalent to Perl's /m option. If there are no "\n" charac-       equivalent to Perl's /m option, and it can be changed within
627       ters  in  a subject string, or no occurrences of ^ or $ in a       a  pattern  by  a  (?m) option setting. If there are no "\n"
628       pattern, setting PCRE_MULTILINE has no effect.       characters in a subject string, or no occurrences of ^ or  $
629         in a pattern, setting PCRE_MULTILINE has no effect.
630    
631           PCRE_NO_AUTO_CAPTURE
632    
633         If this option is set, it disables the use of numbered  cap-
634         turing  parentheses  in the pattern. Any opening parenthesis
635         that is not followed by ? behaves as if it were followed  by
636         ?:  but  named  parentheses  can still be used for capturing
637         (and they acquire numbers in the usual  way).  There  is  no
638         equivalent of this option in Perl.
639    
640         PCRE_UNGREEDY         PCRE_UNGREEDY
641    
# Line 270  COMPILING A PATTERN Line 647  COMPILING A PATTERN
647         PCRE_UTF8         PCRE_UTF8
648    
649       This option causes PCRE to regard both the pattern  and  the       This option causes PCRE to regard both the pattern  and  the
650       subject  as strings of UTF-8 characters instead of just byte       subject  as  strings  of UTF-8 characters instead of single-
651       strings. However, it is available  only  if  PCRE  has  been       byte character strings. However, it  is  available  only  if
652       built  to  include  UTF-8  support.  If not, the use of this       PCRE  has  been  built to include UTF-8 support. If not, the
653       option provokes an error. Support for UTF-8 is new,  experi-       use of this option provokes an error. Details  of  how  this
654       mental,  and incomplete.  Details of exactly what it entails       option  changes  the behaviour of PCRE are given in the sec-
655       are given below.       tion on UTF-8 support in the main pcre page.
   
656    
657    
658  STUDYING A PATTERN  STUDYING A PATTERN
659    
660         pcre_extra *pcre_study(const pcre *code, int options,
661              const char **errptr);
662    
663       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
664       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
665       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
666       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to  a compiled pattern as its first argument. If
667       returns a pointer to a pcre_extra block (another typedef for       studying the pattern produces  additional  information  that
668       a  structure  with  hidden  contents)  containing additional       will  help speed up matching, pcre_study() returns a pointer
669       information  about  the  pattern;  this  can  be  passed  to       to a pcre_extra block, in which the study_data field  points
670       pcre_exec(). If no additional information is available, NULL       to the results of the study.
671       is returned.  
672         The  returned  value   from   pcre_study()   can  be  passed
673         directly  to pcre_exec(). However, the pcre_extra block also
674         contains other fields that can be set by the  caller  before
675         the  block is passed; these are described below. If studying
676         the pattern does not  produce  any  additional  information,
677         pcre_study() returns NULL. In that circumstance, if the cal-
678         ling program wants to pass  some  of  the  other  fields  to
679         pcre_exec(), it must set up its own pcre_extra block.
680    
681       The second argument contains option  bits.  At  present,  no       The second argument contains option  bits.  At  present,  no
682       options  are  defined  for  pcre_study(),  and this argument       options  are  defined  for  pcre_study(),  and this argument
683       should always be zero.       should always be zero.
684    
685       The third argument for pcre_study() is a pointer to an error       The third argument for pcre_study()  is  a  pointer  for  an
686       message. If studying succeeds (even if no data is returned),       error  message.  If  studying  succeeds  (even if no data is
687       the variable it points to  is  set  to  NULL.  Otherwise  it       returned), the variable it points to is set to NULL.  Other-
688       points to a textual error message.       wise it points to a textual error message. You should there-
689         fore  test  the  error  pointer  for  NULL   after   calling
690         pcre_study(), to be sure that it has run successfully.
691    
692       This is a typical call to pcre_study():       This is a typical call to pcre_study():
693    
# Line 313  STUDYING A PATTERN Line 703  STUDYING A PATTERN
703       created.       created.
704    
705    
   
706  LOCALE SUPPORT  LOCALE SUPPORT
707    
708       PCRE handles caseless matching, and determines whether char-       PCRE handles caseless matching, and determines whether char-
709       acters  are  letters, digits, or whatever, by reference to a       acters  are  letters, digits, or whatever, by reference to a
710       set of tables. The library contains a default set of  tables       set of tables. When running in UTF-8 mode, this applies only
711       which  is  created in the default C locale when PCRE is com-       to characters with codes less than 256. The library contains
712       piled.  This  is   used   when   the   final   argument   of       a default set of tables that is created  in  the  default  C
713       pcre_compile()  is NULL, and is sufficient for many applica-       locale  when  PCRE  is compiled. This is used when the final
714       tions.       argument of pcre_compile() is NULL, and  is  sufficient  for
715         many applications.
716    
717       An alternative set of tables can, however, be supplied. Such       An alternative set of tables can, however, be supplied. Such
718       tables  are built by calling the pcre_maketables() function,       tables  are built by calling the pcre_maketables() function,
# Line 339  LOCALE SUPPORT Line 730  LOCALE SUPPORT
730       The  tables  are  built  in  memory  that  is  obtained  via       The  tables  are  built  in  memory  that  is  obtained  via
731       pcre_malloc.  The  pointer that is passed to pcre_compile is       pcre_malloc.  The  pointer that is passed to pcre_compile is
732       saved with the compiled pattern, and  the  same  tables  are       saved with the compiled pattern, and  the  same  tables  are
733       used  via this pointer by pcre_study() and pcre_exec(). Thus       used via this pointer by pcre_study() and pcre_exec(). Thus,
734       for any single pattern, compilation, studying  and  matching       for any single pattern, compilation, studying  and  matching
735       all happen in the same locale, but different patterns can be       all happen in the same locale, but different patterns can be
736       compiled in different locales. It is the caller's  responsi-       compiled in different locales. It is the caller's  responsi-
# Line 347  LOCALE SUPPORT Line 738  LOCALE SUPPORT
738       remains available for as long as it is needed.       remains available for as long as it is needed.
739    
740    
   
741  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
742    
743         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
744              int what, void *where);
745    
746       The pcre_fullinfo() function  returns  information  about  a       The pcre_fullinfo() function  returns  information  about  a
747       compiled pattern. It replaces the obsolete pcre_info() func-       compiled pattern. It replaces the obsolete pcre_info() func-
748       tion, which is nevertheless retained for backwards compabil-       tion, which is nevertheless retained for backwards compabil-
# Line 358  INFORMATION ABOUT A PATTERN Line 752  INFORMATION ABOUT A PATTERN
752       compiled  pattern.  The  second  argument  is  the result of       compiled  pattern.  The  second  argument  is  the result of
753       pcre_study(), or NULL if the pattern was  not  studied.  The       pcre_study(), or NULL if the pattern was  not  studied.  The
754       third  argument  specifies  which  piece  of  information is       third  argument  specifies  which  piece  of  information is
755       required, while the fourth argument is a pointer to a  vari-       required, and the fourth argument is a pointer to a variable
756       able  to receive the data. The yield of the function is zero       to  receive  the data. The yield of the function is zero for
757       for success, or one of the following negative numbers:       success, or one of the following negative numbers:
758    
759         PCRE_ERROR_NULL       the argument code was NULL         PCRE_ERROR_NULL       the argument code was NULL
760                               the argument where was NULL                               the argument where was NULL
# Line 381  INFORMATION ABOUT A PATTERN Line 775  INFORMATION ABOUT A PATTERN
775       The possible values for the third argument  are  defined  in       The possible values for the third argument  are  defined  in
776       pcre.h, and are as follows:       pcre.h, and are as follows:
777    
778         PCRE_INFO_OPTIONS         PCRE_INFO_BACKREFMAX
   
      Return a copy of the options with which the pattern was com-  
      piled.  The fourth argument should point to an unsigned long  
      int variable. These option bits are those specified  in  the  
      call  to  pcre_compile(),  modified  by any top-level option  
      settings  within  the   pattern   itself,   and   with   the  
      PCRE_ANCHORED  bit  forcibly  set if the form of the pattern  
      implies that it can match only at the  start  of  a  subject  
      string.  
   
        PCRE_INFO_SIZE  
779    
780       Return the size of the compiled pattern, that is, the  value       Return the number of the highest back reference in the  pat-
781       that  was  passed as the argument to pcre_malloc() when PCRE       tern.  The  fourth argument should point to an int variable.
782       was getting memory in which to place the compiled data.  The       Zero is returned if there are no back references.
      fourth argument should point to a size_t variable.  
783    
784         PCRE_INFO_CAPTURECOUNT         PCRE_INFO_CAPTURECOUNT
785    
786       Return the number of capturing subpatterns in  the  pattern.       Return the number of capturing subpatterns in  the  pattern.
787       The fourth argument should point to an int variable.       The fourth argument should point to an int variable.
788    
789         PCRE_INFO_BACKREFMAX         PCRE_INFO_FIRSTBYTE
   
      Return the number of the highest back reference in the  pat-  
      tern.  The  fourth argument should point to an int variable.  
      Zero is returned if there are no back references.  
790    
791         PCRE_INFO_FIRSTCHAR       Return information about  the  first  byte  of  any  matched
792         string,  for a non-anchored pattern. (This option used to be
793         called PCRE_INFO_FIRSTCHAR; the old name is still recognized
794         for backwards compatibility.)
795    
796       Return information about the first character of any  matched       If there is a fixed first byte, e.g. from a pattern such  as
      string,  for  a  non-anchored  pattern.  If there is a fixed  
      first   character,   e.g.   from   a   pattern    such    as  
797       (cat|cow|coyote),  it  is returned in the integer pointed to       (cat|cow|coyote),  it  is returned in the integer pointed to
798       by where. Otherwise, if either       by where. Otherwise, if either
799    
# Line 426  INFORMATION ABOUT A PATTERN Line 805  INFORMATION ABOUT A PATTERN
805       anchored),       anchored),
806    
807       -1 is returned, indicating that the pattern matches only  at       -1 is returned, indicating that the pattern matches only  at
808       the  start  of a subject string or after any "\n" within the       the  start  of  a subject string or after any newline within
809       string. Otherwise -2 is returned.  For anchored patterns, -2       the string. Otherwise -2 is returned. For anchored patterns,
810       is returned.       -2 is returned.
811    
812         PCRE_INFO_FIRSTTABLE         PCRE_INFO_FIRSTTABLE
813    
814       If the pattern was studied, and this resulted  in  the  con-       If the pattern was studied, and this resulted  in  the  con-
815       struction of a 256-bit table indicating a fixed set of char-       struction of a 256-bit table indicating a fixed set of bytes
816       acters for the first character in  any  matching  string,  a       for the first byte in any matching string, a pointer to  the
817       pointer   to  the  table  is  returned.  Otherwise  NULL  is       table  is  returned.  Otherwise NULL is returned. The fourth
818       returned. The fourth argument should point  to  an  unsigned       argument should point to an unsigned char * variable.
      char * variable.  
819    
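The following is a minimal sketch of how a caller might test whether a given byte can start a match, using the table returned for PCRE_INFO_FIRSTTABLE. It assumes, and the text above does not state this, that the bit for byte value c is bit (c & 7) of byte (c / 8) in the 32-byte table. The names first_byte_allowed and mark_first_byte are hypothetical helpers and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if byte value c is marked in a 256-bit (32-byte) table,
 * 0 otherwise.
 * ASSUMPTION: the bit for c is bit (c & 7) of byte (c / 8). */
static int first_byte_allowed(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Hypothetical helper, used only to build a table for demonstration;
 * in real use the table comes from pcre_fullinfo(). */
static void mark_first_byte(unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

A caller would first check that the pointer obtained for PCRE_INFO_FIRSTTABLE is not NULL, since NULL means that no such table was built.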
820         PCRE_INFO_LASTLITERAL         PCRE_INFO_LASTLITERAL
821    
822       For a non-anchored pattern, return the value of  the  right-       For a non-anchored pattern, return the value of  the  right-
823       most  literal  character  which  must  exist  in any matched       most  literal  byte  which must exist in any matched string,
824       string, other than at its start. The fourth argument  should       other than at its start. The fourth argument should point to
825       point  to an int variable. If there is no such character, or       an int variable. If there is no such byte, or if the pattern
826       if the pattern is anchored, -1 is returned. For example, for       is anchored, -1 is returned. For example,  for  the  pattern
827       the pattern /a\d+z\d+/ the returned value is 'z'.       /a\d+z\d+/ the returned value is 'z'.
828    
829           PCRE_INFO_NAMECOUNT
830           PCRE_INFO_NAMEENTRYSIZE
831           PCRE_INFO_NAMETABLE
832    
833         PCRE supports the use of named as well as numbered capturing
834         parentheses. The names are just an additional way of identi-
835         fying the parentheses,  which  still  acquire  a  number.  A
836         caller  that  wants  to extract data from a named subpattern
837         must convert the name to a number in  order  to  access  the
838         correct  pointers  in  the  output  vector  (described  with
839         pcre_exec() below). In order to do this, it must  first  use
840         these  three  values  to  obtain  the name-to-number mapping
841         table for the pattern.
842    
843         The  map  consists  of  a  number  of  fixed-size   entries.
844         PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and
845         PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both
846         of  these return an int value. The entry size depends on the
847         length of the longest name.  PCRE_INFO_NAMETABLE  returns  a
848         pointer to the first entry of the table (a pointer to char).
849         The first two bytes of each entry are the number of the cap-
850         turing parenthesis, most significant byte first. The rest of
851         the entry is the corresponding name,  zero  terminated.  The
852         names  are  in alphabetical order. For example, consider the
853         following pattern (assume PCRE_EXTENDED  is  set,  so  white
854         space - including newlines - is ignored):
855    
856           (?P<date> (?P<year>(\d\d)?\d\d) -
857           (?P<month>\d\d) - (?P<day>\d\d) )
858    
859         There are four named subpatterns,  so  the  table  has  four
860         entries,  and  each  entry in the table is eight bytes long.
861         The table is as follows, with non-printing  bytes  shown  in
862         hex, and undefined bytes shown as ??:
863    
864           00 01 d  a  t  e  00 ??
865           00 05 d  a  y  00 ?? ??
866           00 04 m  o  n  t  h  00
867           00 02 y  e  a  r  00 ??
868    
869         When writing code to extract data  from  named  subpatterns,
870         remember  that the length of each entry may be different for
871         each compiled pattern.
872    
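As an illustration of the name table layout described above, here is a minimal sketch of a name-to-number lookup. The function name_to_number and the array example_table are hypothetical and not part of the PCRE API. The array reproduces the four-entry table shown above, with the undefined bytes arbitrarily set to zero. In real use the three inputs would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up a subpattern name in a name table.
 *   table      pointer to the first entry
 *   count      number of entries
 *   entry_size size of each entry in bytes
 * Returns the subpattern number, or -1 if the name is not present. */
static int name_to_number(const unsigned char *table, int count,
                          int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* First two bytes: the number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* The rest of the entry is the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}

/* The example table from the text; undefined bytes are set to zero. */
static const unsigned char example_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

Because the entry size is passed in rather than hard-coded, the same function works for any compiled pattern. A linear scan is used here for clarity; since the names are in alphabetical order, a binary search would also be possible.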
873           PCRE_INFO_OPTIONS
874    
875         Return a copy of the options with which the pattern was com-
876         piled.  The fourth argument should point to an unsigned long
877         int variable. These option bits are those specified  in  the
878         call  to  pcre_compile(),  modified  by any top-level option
879         settings within the pattern itself.
880    
881         A pattern is automatically anchored by PCRE if  all  of  its
882         top-level alternatives begin with one of the following:
883    
884           ^     unless PCRE_MULTILINE is set
885           \A    always
886           \G    always
887           .*    if PCRE_DOTALL is set and there are no back
888                   references to the subpattern in which .* appears
889    
890         For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the
891         options returned by pcre_fullinfo().
892    
893           PCRE_INFO_SIZE
894    
895         Return the size of the compiled pattern, that is, the  value
896         that  was  passed as the argument to pcre_malloc() when PCRE
897         was getting memory in which to place the compiled data.  The
898         fourth argument should point to a size_t variable.
899    
900           PCRE_INFO_STUDYSIZE
901    
902         Returns the size  of  the  data  block  pointed  to  by  the
903         study_data  field  in a pcre_extra block. That is, it is the
904         value that was passed to pcre_malloc() when PCRE was getting
905         memory into which to place the data created by pcre_study().
906         The fourth argument should point to a size_t variable.
907    
908    
909    OBSOLETE INFO FUNCTION
910    
911         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
912    
913       The pcre_info() function is now obsolete because its  inter-       The pcre_info() function is now obsolete because its  inter-
914       face  is  too  restrictive  to return all the available data       face  is  too  restrictive  to return all the available data
# Line 465  INFORMATION ABOUT A PATTERN Line 927  INFORMATION ABOUT A PATTERN
927       If the pattern is not anchored and the firstcharptr argument       If the pattern is not anchored and the firstcharptr argument
928       is  not  NULL, it is used to pass back information about the       is  not  NULL, it is used to pass back information about the
929       first    character    of    any    matched    string    (see       first    character    of    any    matched    string    (see
930       PCRE_INFO_FIRSTCHAR above).       PCRE_INFO_FIRSTBYTE above).
   
931    
932    
933  MATCHING A PATTERN  MATCHING A PATTERN
      The function pcre_exec() is called to match a subject string  
   
   
   
   
   
934    
935         int pcre_exec(const pcre *code, const pcre_extra *extra,
936              const char *subject, int length, int startoffset,
937              int options, int *ovector, int ovecsize);
938    
939         The function pcre_exec() is called to match a subject string
940       against  a pre-compiled pattern, which is passed in the code       against  a pre-compiled pattern, which is passed in the code
941       argument. If the pattern has been studied, the result of the       argument. If the pattern has been studied, the result of the
942       study should be passed in the extra argument. Otherwise this       study should be passed in the extra argument.
      must be NULL.  
943    
944       Here is an example of a simple call to pcre_exec():       Here is an example of a simple call to pcre_exec():
945    
# Line 499  SunOS 5.8 Last change: Line 955  SunOS 5.8 Last change:
955           ovector,        /* vector for substring information */           ovector,        /* vector for substring information */
956           30);            /* number of elements in the vector */           30);            /* number of elements in the vector */
957    
958         If the extra argument is  not  NULL,  it  must  point  to  a
959         pcre_extra  data  block.  The  pcre_study() function returns
960         such a block (when it doesn't return NULL), but you can also
961         create  one for yourself, and pass additional information in
962         it. The fields in the block are as follows:
963    
964           unsigned long int flags;
965           void *study_data;
966           unsigned long int match_limit;
967           void *callout_data;
968    
969         The flags field is a bitmap  that  specifies  which  of  the
970         other fields are set. The flag bits are:
971    
972           PCRE_EXTRA_STUDY_DATA
973           PCRE_EXTRA_MATCH_LIMIT
974           PCRE_EXTRA_CALLOUT_DATA
975    
976         Other flag bits should be set to zero. The study_data  field
977         is   set  in  the  pcre_extra  block  that  is  returned  by
978         pcre_study(), together with the appropriate  flag  bit.  You
979         should  not  set this yourself, but you can add to the block
980         by setting the other fields.
981    
982         The match_limit field provides a means  of  preventing  PCRE
983         from  using  up a vast amount of resources when running pat-
984         terns that are not going to match, but  which  have  a  very
985         large  number  of  possibilities  in their search trees. The
986         classic example is the  use  of  nested  unlimited  repeats.
987         Internally,  PCRE  uses  a  function called match() which it
988         calls  repeatedly  (sometimes  recursively).  The  limit  is
989         imposed  on the number of times this function is called dur-
990         ing a match, which has the effect of limiting the amount  of
991         recursion and backtracking that can take place. For patterns
992         that are not anchored, the count starts from zero  for  each
993         position in the subject string.
994    
995         The default limit for the library can be set  when  PCRE  is
996         built;  the default default is 10 million, which handles all
997         but the most extreme cases. You can reduce  the  default  by
998         supplying pcre_exec()  with  a  pcre_extra  block  in  which
999         match_limit   is   set   to    a    smaller    value,    and
1000         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the
1001         limit      is      exceeded,       pcre_exec()       returns
1002         PCRE_ERROR_MATCHLIMIT.
1003    
1004         The callout_data field is used  in  conjunction  with  the
1005         "callout"  feature,  which is described in the pcrecallout
1006         documentation.
1007    
1008       The PCRE_ANCHORED option can be passed in the options  argu-       The PCRE_ANCHORED option can be passed in the options  argu-
1009       ment,  whose unused bits must be zero. However, if a pattern       ment,   whose   unused   bits  must  be  zero.  This  limits
1010       was  compiled  with  PCRE_ANCHORED,  or  turned  out  to  be       pcre_exec() to matching at the first matching position. How-
1011       anchored  by  virtue  of  its  contents,  it  cannot be made       ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or
1012       unanchored at matching time.       turned out to be anchored by virtue of its contents, it
1013         cannot be made unanchored at matching time.
1014    
1015       There are also three further options that can be set only at       There are also three further options that can be set only at
1016       matching time:       matching time:
# Line 547  SunOS 5.8 Last change: Line 1054  SunOS 5.8 Last change:
1054       advancing the starting offset  (see  below)  and  trying  an       advancing the starting offset  (see  below)  and  trying  an
1055       ordinary match again.       ordinary match again.
1056    
1057       The subject string is passed as  a  pointer  in  subject,  a       The subject string is passed to pcre_exec() as a pointer  in
1058       length  in  length,  and  a  starting offset in startoffset.       subject,  a length in length, and a starting offset in star-
1059       Unlike the pattern string, the subject  may  contain  binary       toffset. Unlike the pattern string, the subject may  contain
1060       zero  characters.  When  the  starting  offset  is zero, the       binary  zero  bytes.  When  the starting offset is zero, the
1061       search for a match starts at the beginning of  the  subject,       search for a match starts at the beginning of  the  subject,
1062       and this is by far the most common case.       and this is by far the most common case.
1063    
1064         If the pattern was compiled with the PCRE_UTF8  option,  the
1065         subject  must  be  a sequence of bytes that is a valid UTF-8
1066         string.  If  an  invalid  UTF-8  string  is  passed,  PCRE's
1067         behaviour is not defined.
1068    
1069       A non-zero starting offset  is  useful  when  searching  for       A non-zero starting offset  is  useful  when  searching  for
1070       another  match  in  the  same subject by calling pcre_exec()       another  match  in  the  same subject by calling pcre_exec()
1071       again after a previous success.  Setting startoffset differs       again after a previous success.  Setting startoffset differs
# Line 615  SunOS 5.8 Last change: Line 1127  SunOS 5.8 Last change:
1127       there  are no capturing subpatterns, the return value from a       there  are no capturing subpatterns, the return value from a
1128       successful match is 1, indicating that just the  first  pair       successful match is 1, indicating that just the  first  pair
1129       of offsets has been set.       of offsets has been set.
   
1130       Some convenience functions are provided for  extracting  the       Some convenience functions are provided for  extracting  the
1131       captured substrings as separate strings. These are described       captured substrings as separate strings. These are described
1132       in the following section.       in the following section.
# Line 645  SunOS 5.8 Last change: Line 1156  SunOS 5.8 Last change:
1156       Note that pcre_info() can be used to find out how many  cap-       Note that pcre_info() can be used to find out how many  cap-
1157       turing  subpatterns  there  are  in  a compiled pattern. The       turing  subpatterns  there  are  in  a compiled pattern. The
1158       smallest size for ovector that will  allow  for  n  captured       smallest size for ovector that will  allow  for  n  captured
1159       substrings  in  addition  to  the  offsets  of the substring       substrings,  in  addition  to  the  offsets of the substring
1160       matched by the whole pattern is (n+1)*3.       matched by the whole pattern, is (n+1)*3.
1161    
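The sizing rule above can be written as a one-line helper. This is only a sketch; ovector_size_for is a hypothetical name and not a PCRE function. The capture count would typically be obtained with PCRE_INFO_CAPTURECOUNT.

```c
#include <assert.h>

/* Smallest ovector size (in ints) that allows for capture_count
 * captured substrings plus the substring matched by the whole
 * pattern: (n+1)*3. */
static int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

For instance, a 30-element vector is exactly large enough for a pattern with nine capturing subpatterns.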
1162       If pcre_exec() fails, it returns a negative number. The fol-       If pcre_exec() fails, it returns a negative number. The fol-
1163       lowing are defined in the header file:       lowing are defined in the header file:
# Line 686  SunOS 5.8 Last change: Line 1197  SunOS 5.8 Last change:
1197       pcre_malloc() fails, this error  is  given.  The  memory  is       pcre_malloc() fails, this error  is  given.  The  memory  is
1198       freed at the end of matching.       freed at the end of matching.
1199    
1200           PCRE_ERROR_NOSUBSTRING    (-7)
1201    
1202         This   error   is   used   by   the   pcre_copy_substring(),
1203         pcre_get_substring(),  and  pcre_get_substring_list()  func-
1204         tions (see below). It is never returned by pcre_exec().
1205    
1206           PCRE_ERROR_MATCHLIMIT     (-8)
1207    
1208         The recursion and backtracking limit, as  specified  by  the
1209         match_limit  field  in a pcre_extra structure (or defaulted)
1210         was reached. See the description above.
1211    
1212           PCRE_ERROR_CALLOUT        (-9)
1213    
1214         This error is never generated by pcre_exec() itself.  It  is
1215         provided  for  use by callout functions that want to yield a
1216         distinctive error code. See  the  pcrecallout  documentation
1217         for details.
1218    
1219    
1220    EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1221    
1222         int pcre_copy_substring(const char *subject, int *ovector,
1223              int stringcount, int stringnumber, char *buffer,
1224              int buffersize);
1225    
1226         int pcre_get_substring(const char *subject, int *ovector,
1227              int stringcount, int stringnumber,
1228              const char **stringptr);
1229    
1230         int pcre_get_substring_list(const char *subject,
1231              int *ovector, int stringcount, const char ***listptr);
1232    
1233    
 EXTRACTING CAPTURED SUBSTRINGS  
1234       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
1235       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
1236       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
1237       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
1238       captured  substrings  as  new,   separate,   zero-terminated       captured  substrings  as  new,   separate,   zero-terminated
1239         strings.  These functions identify substrings by number. The
1240         next section describes functions for extracting  named  sub-
1241       strings.   A  substring  that  contains  a  binary  zero  is       strings.   A  substring  that  contains  a  binary  zero  is
1242       correctly extracted and has a further zero added on the end,       correctly extracted and has a further zero added on the end,
1243       but the result does not, of course, function as a C string.       but the result is not, of course, a C string.
1244    
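Direct access through the offsets might look like the sketch below. It assumes the usual layout in which ovector[2*n] and ovector[2*n+1] hold the start and end offsets of substring n, with substring 0 being the whole match and unset substrings marked by negative offsets. The function copy_capture is a hypothetical illustration, not one of the library's convenience functions.

```c
#include <assert.h>
#include <string.h>

/* Copy captured substring n from subject into buf, adding a
 * terminating zero. Returns the substring length, or -1 if the
 * substring is unset or does not fit into bufsize bytes.
 * ASSUMPTION: ovector[2*n] is the start offset and ovector[2*n+1]
 * the end offset of substring n. */
static int copy_capture(const char *subject, const int *ovector,
                        int n, char *buf, int bufsize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buf != NULL && n >= 0);
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;                  /* substring was not set */
    len = end - start;
    if (len + 1 > bufsize)
        return -1;                  /* no room for data plus zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

The caller is expected to pass only values of n that are less than the count returned by pcre_exec(). The length is returned separately because, as noted above, a substring containing a binary zero cannot be measured with strlen().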
1245       The first three arguments are the same for all  three  func-       The first three arguments are the  same  for  all  three  of
1246       tions:  subject  is  the  subject string which has just been       these  functions:   subject  is the subject string which has
1247       successfully matched, ovector is a pointer to the vector  of       just been successfully matched, ovector is a pointer to  the
1248       integer   offsets   that  was  passed  to  pcre_exec(),  and       vector  of  integer  offsets that was passed to pcre_exec(),
1249       stringcount is the number of substrings that  were  captured       and stringcount is the number of substrings that  were  cap-
1250       by  the  match,  including  the  substring  that matched the       tured by the match, including the substring that matched the
1251       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
1252       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
1253       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
# Line 763  EXTRACTING CAPTURED SUBSTRINGS Line 1306  EXTRACTING CAPTURED SUBSTRINGS
1306       the functions are provided.       the functions are provided.
1307    
1308    
1309    EXTRACTING CAPTURED SUBSTRINGS BY NAME
1310    
1311  LIMITATIONS       int pcre_copy_named_substring(const pcre *code,
1312       There are some size limitations in PCRE but it is hoped that            const char *subject, int *ovector,
1313       they will never in practice be relevant.  The maximum length            int stringcount, const char *stringname,
1314       of a compiled pattern is 65539 (sic) bytes.  All  values  in            char *buffer, int buffersize);
1315       repeating  quantifiers  must be less than 65536.  There max-  
1316       imum number of capturing subpatterns is 65535.  There is  no       int pcre_get_stringnumber(const pcre *code,
1317       limit  to  the  number of non-capturing subpatterns, but the            const char *name);
1318       maximum depth of nesting of all kinds of parenthesized  sub-  
1319       pattern,  including  capturing  subpatterns, assertions, and       int pcre_get_named_substring(const pcre *code,
1320       other types of subpattern, is 200.            const char *subject, int *ovector,
1321              int stringcount, const char *stringname,
1322              const char **stringptr);
1323    
1324       The maximum length of a subject string is the largest  posi-       To extract a substring by name, you first have to  find  the
1325       tive number that an integer variable can hold. However, PCRE       associated  number.   This   can   be   done   by    calling
1326       uses recursion to handle subpatterns and indefinite  repeti-       pcre_get_stringnumber(). The first argument is the  compiled
1327       tion.  This  means  that the available stack space may limit       pattern,  and  the second is the name. For example, for this
1328       the size of a subject string that can be processed  by  cer-       pattern
1329       tain patterns.  
1330           ab(?<xxx>\d+)...
1331    
1332         the number of the subpattern called "xxx" is  1.  Given  the
1333         number,  you can then extract the substring directly, or use
1334         one of the functions described in the previous section.  For
1335         convenience,  there are also two functions that do the whole
1336         job.
1337    
1338         Most of the  arguments  of  pcre_copy_named_substring()  and
1339         pcre_get_named_substring()  are  the  same  as those for the
1340         functions that  extract  by  number,  and  so  are  not  re-
1341         described here. There are just two differences.
1342    
1343         First, instead of a substring number, a  substring  name  is
1344         given.  Second,  there  is  an  extra argument, given at the
1345         start, which is a pointer to the compiled pattern.  This  is
1346         needed  in order to gain access to the name-to-number trans-
1347         lation table.
1348    
1349         These functions  call  pcre_get_stringnumber(),  and  if  it
1350         succeeds,    they   then   call   pcre_copy_substring()   or
1351         pcre_get_substring(), as appropriate.
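
     The two-step flow described above (name to number,  then  number
     to substring) can be sketched without  the  library.  This  is  a
     minimal illustration, not PCRE's code: the  name  table  layout,
     the sketch_ function names, and the two error codes are  assump-
     tions made for the example. It relies only on the fact that sub-
     string n occupies the offset pair ovector[2*n], ovector[2*n+1].

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the name-to-number translation table that
   lives inside a compiled pattern; the real layout is internal to PCRE. */
struct sketch_name_entry {
    const char *name;
    int number;
};

/* Sketch-local error codes; the library has its own PCRE_ERROR_xxx values. */
enum { SKETCH_NO_SUBSTRING = -1, SKETCH_NO_SPACE = -2 };

/* Step 1: translate a substring name into its number. */
int sketch_stringnumber(const struct sketch_name_entry *table, int entries,
                        const char *name)
{
    int i;
    for (i = 0; i < entries; i++) {
        if (strcmp(table[i].name, name) == 0)
            return table[i].number;
    }
    return SKETCH_NO_SUBSTRING;
}

/* Step 2: copy substring `number` into buffer as a zero-terminated string. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int number,
                          char *buffer, int buffersize)
{
    int start, length;
    assert(buffer != NULL && buffersize > 0);
    if (number < 0 || number >= stringcount)
        return SKETCH_NO_SUBSTRING;
    start = ovector[2 * number];
    length = ovector[2 * number + 1] - start;
    if (start < 0) {            /* unset pair: treated as empty (assumption) */
        start = 0;
        length = 0;
    }
    if (length >= buffersize)   /* no room for the text plus the final zero */
        return SKETCH_NO_SPACE;
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}

/* The "whole job": look the name up and, if that succeeds, copy by number. */
int sketch_copy_named_substring(const struct sketch_name_entry *table,
                                int entries, const char *subject,
                                const int *ovector, int stringcount,
                                const char *stringname,
                                char *buffer, int buffersize)
{
    int number = sketch_stringnumber(table, entries, stringname);
    if (number < 0)
        return number;
    return sketch_copy_substring(subject, ovector, stringcount, number,
                                 buffer, buffersize);
}
```

     With the pattern ab(?<xxx>\d+) matched against "ab123cd", the off-
     sets vector would start 0, 5, 2, 5, so  asking  for  "xxx"  yields
     "123". In real code the pcre_copy_named_substring() function does
     this work, and the table is reached through the compiled pattern.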
1352    
1353    Last updated: 03 February 2003
1354    Copyright (c) 1997-2003 University of Cambridge.
1355    -----------------------------------------------------------------------------
1356    
1357    NAME
1358         PCRE - Perl-compatible regular expressions
1359    
1360    
1361    PCRE CALLOUTS
1362    
1363         int (*pcre_callout)(pcre_callout_block *);
1364    
1365         PCRE provides a feature called "callout", which is  a  means
1366         of  temporarily passing control to the caller of PCRE in the
1367         middle of pattern matching. The caller of PCRE  provides  an
1368         external  function  by putting its entry point in the global
1369         variable pcre_callout. By default,  this  variable  contains
1370         NULL, which disables all calling out.
1371    
1372         Within a regular expression, (?C) indicates  the  points  at
1373         which  the external function is to be called. Different cal-
1374         lout points can be identified by putting a number less  than
1375         256  after  the  letter  C.  The default value is zero.  For
1376         example, this pattern has two callout points:
1377    
1378           (?C1)9abc(?C2)def
1379    
1380         During matching, when PCRE  reaches  a  callout  point  (and
1381         pcre_callout  is  set), the external function is called. Its
1382         only argument is a pointer to  a  pcre_callout  block.  This
1383         contains the following variables:
1384    
1385           int          version;
1386           int          callout_number;
1387           int         *offset_vector;
1388           const char  *subject;
1389           int          subject_length;
1390           int          start_match;
1391           int          current_position;
1392           int          capture_top;
1393           int          capture_last;
1394           void        *callout_data;
1395    
1396         The version field  is  an  integer  containing  the  version
1397         number of the block format. The current version is zero. The
1398         version number may change in future if additional fields are
1399         added,  but  the  intention  is  never  to remove any of the
1400         existing fields.
1401    
1402         The callout_number field contains the number of the callout,
1403         as compiled into the pattern (that is, the number after ?C).
1404    
1405         The offset_vector field  is  a  pointer  to  the  vector  of
1406         offsets  that  was  passed by the caller to pcre_exec(). The
1407         contents can be inspected in  order  to  extract  substrings
1408         that  have  been  matched  so  far,  in  the same way as for
1409         extracting substrings after a match has completed.
1410         The subject and subject_length fields contain copies of  the
1411         values that were passed to pcre_exec().
1412    
1413         The start_match field contains the offset within the subject
1414         at  which  the current match attempt started. If the pattern
1415         is not anchored, the callout function may be called  several
1416         times for different starting points.
1417    
1418         The current_position field contains the  offset  within  the
1419         subject of the current match pointer.
1420    
1421         The capture_top field contains the  number  of  the  highest
1422         captured substring so far.
1423    
1424         The capture_last field  contains  the  number  of  the  most
1425         recently captured substring.
1426    
1427         The callout_data field contains a value that  is  passed  to
1428         pcre_exec()  by  the  caller  specifically so that it can be
1429         passed back in callouts. It is passed  in  the  callout_data
1430         field  of the pcre_extra data structure. If no such data was
1431         passed, the value of callout_data in a pcre_callout block is
1432         NULL.  There is a description of the pcre_extra structure in
1433         the pcreapi documentation.
1434    
1435    
1436    
1437    RETURN VALUES
1438    
1439         The callout function returns an integer.  If  the  value  is
1440         zero,  matching  proceeds as normal. If the value is greater
1441         than zero, matching fails at the current  point,  but  back-
1442         tracking  to test other possibilities goes ahead, just as if
1443         a lookahead assertion had failed. If the value is less  than
1444         zero,  the  match  is abandoned, and pcre_exec() returns the
1445         value.
1446    
1447         Negative values should normally be chosen from  the  set  of
1448         PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH
1449         forces a standard "no  match"  failure.   The  error  number
1450         PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1451         it will never be used by PCRE itself.
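
     The three classes of return value can be illustrated with a small
     callout function. This is a sketch under stated assumptions:  the
     struct below is a local mirror of the fields listed earlier, used
     so that the example compiles without pcre.h; the budget type, the
     SKETCH_ABANDON value, and the policy  applied  to  callout  2  are
     invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of the documented block fields (assumption: real code
   would include pcre.h and use pcre_callout_block instead). */
typedef struct sketch_callout_block {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} sketch_callout_block;

/* Stand-in for a negative PCRE_ERROR_xxx value chosen by the caller. */
#define SKETCH_ABANDON (-9)

/* Hypothetical per-match state handed back through callout_data. */
typedef struct sketch_budget {
    int calls_left;
} sketch_budget;

int sketch_callout(sketch_callout_block *block)
{
    sketch_budget *budget;
    assert(block != NULL);
    budget = (sketch_budget *)block->callout_data;

    /* Less than zero: abandon the whole match once the budget is spent. */
    if (budget != NULL) {
        if (budget->calls_left <= 0)
            return SKETCH_ABANDON;
        budget->calls_left--;
    }

    /* Greater than zero: fail at this point but let backtracking go on.
       Invented policy: callout 2 insists that at least four characters
       have been matched since the start of this attempt. */
    if (block->callout_number == 2 &&
        block->current_position - block->start_match < 4)
        return 1;

    /* Zero: matching proceeds as normal. */
    return 0;
}
```

     In a real program a function of this shape, taking  a  pcre_cal-
     lout_block pointer, would be installed by storing its address in
     the global variable pcre_callout before calling pcre_exec().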
1452    
1453    Last updated: 21 January 2003
1454    Copyright (c) 1997-2003 University of Cambridge.
1455    -----------------------------------------------------------------------------
1456    
1457    NAME
1458         PCRE - Perl-compatible regular expressions
1459    
1460    
1461  DIFFERENCES FROM PERL  DIFFERENCES FROM PERL
      The differences described here  are  with  respect  to  Perl  
      5.005.  
1462    
1463       1. By default, a whitespace character is any character  that       This document describes the differences  in  the  ways  that
1464       the  C  library  function isspace() recognizes, though it is       PCRE  and  Perl  handle regular expressions. The differences
1465       possible to compile PCRE  with  alternative  character  type       described here are with respect to Perl 5.8.
      tables. Normally isspace() matches space, formfeed, newline,  
      carriage return, horizontal tab, and vertical tab. Perl 5 no  
      longer  includes vertical tab in its set of whitespace char-  
      acters. The \v escape that was in the Perl documentation for  
      a long time was never in fact recognized. However, the char-  
      acter itself was treated as whitespace at least up to 5.002.  
      In 5.004 and 5.005 it does not match \s.  
1466    
1467       2. PCRE does  not  allow  repeat  quantifiers  on  lookahead       1. PCRE does  not  allow  repeat  quantifiers  on  lookahead
1468       assertions. Perl permits them, but they do not mean what you       assertions. Perl permits them, but they do not mean what you
1469       might think. For example, (?!a){3} does not assert that  the       might think. For example, (?!a){3} does not assert that  the
1470       next  three characters are not "a". It just asserts that the       next  three characters are not "a". It just asserts that the
1471       next character is not "a" three times.       next character is not "a" three times.
1472    
1473       3. Capturing subpatterns that occur inside  negative  looka-       2. Capturing subpatterns that occur inside  negative  looka-
1474       head  assertions  are  counted,  but  their  entries  in the       head  assertions  are  counted,  but  their  entries  in the
1475       offsets vector are never set. Perl sets its numerical  vari-       offsets vector are never set. Perl sets its numerical  vari-
1476       ables  from  any  such  patterns that are matched before the       ables  from  any  such  patterns that are matched before the
# Line 813  DIFFERENCES FROM PERL Line 1478  DIFFERENCES FROM PERL
1478       only  if  the negative lookahead assertion contains just one       only  if  the negative lookahead assertion contains just one
1479       branch.       branch.
1480    
1481       4. Though binary zero characters are supported in  the  sub-       3. Though binary zero characters are supported in  the  sub-
1482       ject  string,  they  are  not  allowed  in  a pattern string       ject  string,  they  are  not  allowed  in  a pattern string
1483       because it is passed as a normal  C  string,  terminated  by       because it is passed as a normal  C  string,  terminated  by
1484       zero. The escape sequence "\0" can be used in the pattern to       zero. The escape sequence "\0" can be used in the pattern to
1485       represent a binary zero.       represent a binary zero.
1486    
1487       5. The following Perl escape sequences  are  not  supported:       4. The following Perl escape sequences  are  not  supported:
1488       \l,  \u,  \L,  \U,  \E, \Q. In fact these are implemented by       \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-
1489       Perl's general string-handling and are not part of its  pat-       mented by Perl's general string-handling and are not part of
1490       tern matching engine.       its pattern matching engine. If any of these are encountered
1491         by PCRE, an error is generated.
1492    
1493         5. PCRE does support the \Q...\E  escape  for  quoting  sub-
1494         strings. Characters in between are treated as literals. This
1495         is slightly different from Perl in that $  and  @  are  also
1496         handled  as  literals inside the quotes. In Perl, they cause
1497         variable interpolation (but of course  PCRE  does  not  have
1498         variables). Note the following examples:
1499    
1500             Pattern            PCRE matches      Perl matches
1501    
1502             \Qabc$xyz\E        abc$xyz           abc followed by the
1503                                                    contents of $xyz
1504             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1505             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1506    
1507       6. The Perl \G assertion is  not  supported  as  it  is  not       In PCRE, the \Q...\E mechanism is not  recognized  inside  a
1508       relevant to single pattern matches.       character class.
1509    
1510       7. Fairly obviously, PCRE does not support the (?{code}) and       8. Fairly obviously, PCRE does not support the (?{code}) and
1511       (?p{code})  constructions. However, there is some experimen-       (?p{code})  constructions. However, there is some experimen-
1512       tal support for recursive patterns using the  non-Perl  item       tal support for recursive patterns using the non-Perl  items
1513       (?R).       (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"
1514         feature allows an external function to be called during pat-
1515       8. There are at the time of writing some  oddities  in  Perl       tern matching.
1516       5.005_02  concerned  with  the  settings of captured strings  
1517       when part of a pattern is repeated.  For  example,  matching       9. There are some differences that are  concerned  with  the
1518       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       settings  of  captured  strings  when  part  of a pattern is
1519       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       repeated. For example, matching "aba"  against  the  pattern
1520       unset.    However,    if   the   pattern   is   changed   to       /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set
1521       /^(aa(b(b))?)+$/ then $2 (and $3) are set.       to "b".
   
      In Perl 5.004 $2 is set in both cases, and that is also true  
      of PCRE. If in the future Perl changes to a consistent state  
      that is different, PCRE may change to follow.  
   
      9. Another as yet unresolved discrepancy  is  that  in  Perl  
      5.005_02  the  pattern /^(a)?(?(1)a|b)+$/ matches the string  
      "a", whereas in PCRE it does not.  However, in both Perl and  
      PCRE /^(a)?a/ matched against "a" leaves $1 unset.  
1522    
1523       10. PCRE  provides  some  extensions  to  the  Perl  regular       10. PCRE  provides  some  extensions  to  the  Perl  regular
1524       expression facilities:       expression facilities:
1525    
1526       (a) Although lookbehind assertions must match  fixed  length       (a) Although lookbehind assertions must match  fixed  length
1527       strings,  each  alternative branch of a lookbehind assertion       strings,  each  alternative branch of a lookbehind assertion
1528       can match a different length of string. Perl 5.005  requires       can match a different length of string. Perl  requires  them
1529       them all to have the same length.       all to have the same length.
1530    
1531       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not
1532       set,  the  $ meta- character matches only at the very end of       set,  the  $  meta-character matches only at the very end of
1533       the string.       the string.
1534    
1535       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
# Line 869  DIFFERENCES FROM PERL Line 1540  DIFFERENCES FROM PERL
1540       not greedy, but if followed by a question mark they are.       not greedy, but if followed by a question mark they are.
1541    
1542       (e) PCRE_ANCHORED can be used to force a pattern to be tried       (e) PCRE_ANCHORED can be used to force a pattern to be tried
1543       only at the start of the subject.       only at the first matching position in the subject string.
1544    
1545       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options       (f)  The  PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   and
1546       for pcre_exec() have no Perl equivalents.       PCRE_NO_AUTO_CAPTURE  options  for  pcre_exec() have no Perl
1547         equivalents.
1548    
1549       (g) The (?R) construct allows for recursive pattern matching       (g) The (?R), (?number), and (?P>name) constructs allow  for
1550       (Perl  5.6 can do this using the (?p{code}) construct, which       recursive  pattern  matching  (Perl  can  do  this using the
1551       PCRE cannot of course support.)       (?p{code}) construct, which PCRE cannot support.)
1552    
1553         (h) PCRE supports  named  capturing  substrings,  using  the
1554         Python syntax.
1555    
1556         (i) PCRE supports the  possessive  quantifier  "++"  syntax,
1557         taken from Sun's Java package.
1558    
1559         (j) The (R) condition, for  testing  recursion,  is  a  PCRE
1560         extension.
1561    
1562         (k) The callout facility is PCRE-specific.
1563    
1564    Last updated: 03 February 2003
1565    Copyright (c) 1997-2003 University of Cambridge.
1566    -----------------------------------------------------------------------------
1567    
1568    NAME
1569         PCRE - Perl-compatible regular expressions
1570    
1571    
1572    PCRE REGULAR EXPRESSION DETAILS
1573    
 REGULAR EXPRESSION DETAILS  
1574       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
1575       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
1576       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
1577       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
1578       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
1579       O'Reilly (ISBN 1-56592-257), covers them in great detail.       O'Reilly,  covers them in great detail. The description here
1580         is intended as reference documentation.
1581    
      The description here is intended as reference documentation.  
1582       The basic operation of PCRE is on strings of bytes. However,       The basic operation of PCRE is on strings of bytes. However,
1583       there is the beginnings of some support for UTF-8  character       there  is  also  support for UTF-8 character strings. To use
1584       strings.  To  use  this  support  you must configure PCRE to       this support you must build PCRE to include  UTF-8  support,
1585       include it, and then call pcre_compile() with the  PCRE_UTF8       and  then call pcre_compile() with the PCRE_UTF8 option. How
1586       option.  How  this affects the pattern matching is described       this affects the pattern matching is  mentioned  in  several
1587       in the final section of this document.       places  below.  There is also a summary of UTF-8 features in
1588         the section on UTF-8 support in the main pcre page.
1589    
1590       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
1591       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 916  REGULAR EXPRESSION DETAILS Line 1607  REGULAR EXPRESSION DETAILS
1607       Outside square brackets, the meta-characters are as follows:       Outside square brackets, the meta-characters are as follows:
1608    
1609         \      general escape character with several uses         \      general escape character with several uses
1610         ^      assert start of  subject  (or  line,  in  multiline         ^      assert start of string (or line, in multiline mode)
1611       mode)         $      assert end of string (or line, in multiline mode)
        $      assert end of subject (or line, in multiline mode)  
1612         .      match any character except newline (by default)         .      match any character except newline (by default)
1613         [      start character class definition         [      start character class definition
1614         |      start of alternative branch         |      start of alternative branch
# Line 929  REGULAR EXPRESSION DETAILS Line 1619  REGULAR EXPRESSION DETAILS
1619                also quantifier minimizer                also quantifier minimizer
1620         *      0 or more quantifier         *      0 or more quantifier
1621         +      1 or more quantifier         +      1 or more quantifier
1622                  also "possessive quantifier"
1623         {      start min/max quantifier         {      start min/max quantifier
1624    
1625       Part of a pattern that is in square  brackets  is  called  a       Part of a pattern that is in square  brackets  is  called  a
# Line 938  REGULAR EXPRESSION DETAILS Line 1629  REGULAR EXPRESSION DETAILS
1629         \      general escape character         \      general escape character
1630         ^      negate the class, but only if the first character         ^      negate the class, but only if the first character
1631         -      indicates character range         -      indicates character range
1632           [      POSIX character class (only if followed by POSIX
1633                    syntax)
1634         ]      terminates the character class         ]      terminates the character class
1635    
1636       The following sections describe  the  use  of  each  of  the       The following sections describe  the  use  of  each  of  the
1637       meta-characters.       meta-characters.
1638    
1639    
   
1640  BACKSLASH  BACKSLASH
1641    
1642       The backslash character has several uses. Firstly, if it  is       The backslash character has several uses. Firstly, if it  is
1643       followed  by  a  non-alphameric character, it takes away any       followed  by  a  non-alphameric character, it takes away any
1644       special  meaning  that  character  may  have.  This  use  of       special  meaning  that  character  may  have.  This  use  of
   
1645       backslash  as  an  escape  character applies both inside and       backslash  as  an  escape  character applies both inside and
1646       outside character classes.       outside character classes.
1647    
1648       For example, if you want to match a "*" character, you write       For example, if you want to match a * character,  you  write
1649       "\*" in the pattern. This applies whether or not the follow-       \*  in the pattern.  This escaping action applies whether or
1650       ing character would otherwise  be  interpreted  as  a  meta-       not the following character would otherwise  be  interpreted
1651       character,  so it is always safe to precede a non-alphameric       as  a meta-character, so it is always safe to precede a non-
1652       with "\" to specify that it stands for itself.  In  particu-       alphameric with backslash to  specify  that  it  stands  for
1653       lar, if you want to match a backslash, you write "\\".       itself. In particular, if you want to match a backslash, you
1654         write \\.
1655    
1656       If a pattern is compiled with the PCRE_EXTENDED option, whi-       If a pattern is compiled with the PCRE_EXTENDED option, whi-
1657       tespace in the pattern (other than in a character class) and       tespace in the pattern (other than in a character class) and
1658       characters between a "#" outside a character class  and  the       characters between a # outside a  character  class  and  the
1659       next  newline  character  are ignored. An escaping backslash       next  newline  character  are ignored. An escaping backslash
1660       can be used to include a whitespace or "#" character as part       can be used to include a whitespace or # character  as  part
1661       of the pattern.       of the pattern.
1662    
1663         If you want to remove the special meaning from a sequence of
1664         characters, you can do so by putting them between \Q and \E.
1665         This is different from Perl in that $ and @ are  handled  as
1666         literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $
1667         and @ cause variable interpolation. Note the following exam-
1668         ples:
1669    
1670           Pattern            PCRE matches   Perl matches
1671    
1672           \Qabc$xyz\E        abc$xyz        abc followed by the
1674                                               contents of $xyz
1675           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1676           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1677    
1678         The \Q...\E sequence is recognized both inside  and  outside
1679         character classes.
1680    
1681       A second use of backslash provides a way  of  encoding  non-       A second use of backslash provides a way  of  encoding  non-
1682       printing  characters  in patterns in a visible manner. There       printing  characters  in patterns in a visible manner. There
1683       is no restriction on the appearance of non-printing  charac-       is no restriction on the appearance of non-printing  charac-
# Line 975  BACKSLASH Line 1686  BACKSLASH
1686       usually  easier to use one of the following escape sequences       usually  easier to use one of the following escape sequences
1687       than the binary character it represents:       than the binary character it represents:
1688    
1689         \a     alarm, that is, the BEL character (hex 07)         \a        alarm, that is, the BEL character (hex 07)
1690         \cx    "control-x", where x is any character         \cx       "control-x", where x is any character
1691         \e     escape (hex 1B)         \e        escape (hex 1B)
1692         \f     formfeed (hex 0C)         \f        formfeed (hex 0C)
1693         \n     newline (hex 0A)         \n        newline (hex 0A)
1694         \r     carriage return (hex 0D)         \r        carriage return (hex 0D)
1695         \t     tab (hex 09)         \t        tab (hex 09)
1696         \xhh   character with hex code hh         \ddd      character with octal code ddd, or backreference
1697         \ddd   character with octal code ddd, or backreference         \xhh      character with hex code hh
1698           \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1699    
1700       The precise effect of "\cx" is as follows: if "x" is a lower       The precise effect of \cx is as follows: if  x  is  a  lower
1701       case  letter,  it  is converted to upper case. Then bit 6 of       case  letter,  it  is converted to upper case. Then bit 6 of
1702       the character (hex 40) is inverted.  Thus "\cz" becomes  hex       the character (hex 40) is inverted.  Thus  \cz  becomes  hex
1703       1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.       1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
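
     The \cx rule is small enough to write out. The function below is
     an illustrative sketch (its name is an assumption, and it assumes
     an ASCII character set); it applies the two steps in  the  order
     given: fold lower case to upper case, then invert hex 40.

```c
#include <assert.h>

/* Returns the character that \cx stands for, given x (ASCII assumed). */
int sketch_control_char(int x)
{
    assert(x >= 0 && x <= 127);
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';      /* lower case letter becomes upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

     Applied to the three examples in the text, z gives  hex  1A,  {
     gives hex 3B, and ; gives hex 7B.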
1704    
1705       After "\x", up to two hexadecimal digits are  read  (letters       After \x, from zero  to  two  hexadecimal  digits  are  read
1706       can be in upper or lower case).       (letters  can be in upper or lower case). In UTF-8 mode, any
1707         number of hexadecimal digits may appear between \x{  and  },
1708         but  the value of the character code must be less than 2**31
1709         (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If
1710         characters  other than hexadecimal digits appear between \x{
1711         and }, or if there is no terminating }, this form of  escape
1712         is  not  recognized.  Instead, the initial \x will be inter-
1713         preted as a basic  hexadecimal  escape,  with  no  following
1714         digits, giving a byte whose value is zero.
1715    
1716         Characters whose value is less than 256 can  be  defined  by
1717         either  of  the  two  syntaxes  for \x when PCRE is in UTF-8
1718         mode. There is no difference in the way  they  are  handled.
1719         For example, \xdc is exactly the same as \x{dc}.
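
     The two forms of \x can be sketched as a  small  parser.  This  is
     not PCRE's implementation: the function names are  assumptions,
     as is the choice to return -1 for a braced  value  of  2**31  or
     more and to treat empty braces as unrecognized, since  the  text
     does not say what happens in those cases.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Numeric value of one hexadecimal digit (either case). */
static int sketch_hex_value(int c)
{
    if (isdigit(c))
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* p points just past "\x". Returns the character value and stores in
   *consumed how many pattern characters after "\x" were used. */
long sketch_hex_escape(const char *p, int utf8, int *consumed)
{
    long value = 0;
    int n = 0;
    assert(p != NULL && consumed != NULL);

    if (utf8 && p[0] == '{') {
        const char *q = p + 1;
        unsigned long v = 0;
        int too_big = 0;
        while (isxdigit((unsigned char)*q)) {
            unsigned long d = (unsigned long)sketch_hex_value((unsigned char)*q);
            if (v > (0x7FFFFFFFUL - d) / 16)
                too_big = 1;            /* would reach 2**31 or more */
            else
                v = v * 16 + d;
            q++;
        }
        if (*q == '}' && q > p + 1) {   /* only hex digits, and a closing } */
            *consumed = (int)(q - p) + 1;
            return too_big ? -1L : (long)v;
        }
        /* Otherwise fall through: the braced form is not recognized. */
    }

    /* Basic form: from zero to two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)p[n])) {
        value = value * 16 + sketch_hex_value((unsigned char)p[n]);
        n++;
    }
    *consumed = n;
    return value;
}
```

     Under these assumptions "dc" and, in UTF-8 mode, "{dc}" both give
     hex DC, while "{dz}" or an unterminated "{dc"  gives  zero  with
     nothing consumed, leaving the brace to be read as a literal.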
1720    
1721       After "\0" up to two further octal digits are read. In  both       After \0 up to two further octal digits are  read.  In  both
1722       cases,  if  there are fewer than two digits, just those that       cases,  if  there are fewer than two digits, just those that
1723       are present are used. Thus the sequence "\0\x\07"  specifies       are present are used. Thus the  sequence  \0\x\07  specifies
1724       two binary zeros followed by a BEL character.  Make sure you       two binary zeros followed by a BEL character (code value 7).
1725       supply two digits after the initial zero  if  the  character       Make sure you supply two digits after the  initial  zero  if
1726       that follows is itself an octal digit.       the character that follows is itself an octal digit.
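
     The reading of octal digits after \0 can be sketched in the  same
     way. The function name is an assumption made for  the  example;
     the logic is just "take up to two further octal digits, and  use
     whatever is present".

```c
#include <assert.h>
#include <stddef.h>

/* p points just past "\0". Returns the byte value and stores in
   *consumed how many further octal digits were used (0, 1 or 2). */
int sketch_octal_zero_escape(const char *p, int *consumed)
{
    int value = 0;
    int n = 0;
    assert(p != NULL && consumed != NULL);
    while (n < 2 && p[n] >= '0' && p[n] <= '7') {
        value = value * 8 + (p[n] - '0');
        n++;
    }
    *consumed = n;
    return value;
}
```

     For \07 this reads one digit and gives the BEL  character.  For
     \0113 it reads "11" and gives a tab, leaving "3" as  a  literal,
     which is why two digits should be supplied when an  octal  digit
     follows.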
1727    
1728       The handling of a backslash followed by a digit other than 0       The handling of a backslash followed by a digit other than 0
1729       is  complicated.   Outside  a character class, PCRE reads it       is  complicated.   Outside  a character class, PCRE reads it
# Line 1024  BACKSLASH Line 1749  BACKSLASH
1749                   writing a tab                   writing a tab
1750         \011   is always a tab         \011   is always a tab
1751         \0113  is a tab followed by the character "3"         \0113  is a tab followed by the character "3"
1752         \113   is the character with octal code 113 (since there         \113   might be a back reference, otherwise the
1753                   can be no more than 99 back references)                   character with octal code 113
1754         \377   is a byte consisting entirely of 1 bits         \377   might be a back reference, otherwise
1755                     the byte consisting entirely of 1 bits
1756         \81    is either a back reference, or a binary zero         \81    is either a back reference, or a binary zero
1757                   followed by the two characters "8" and "1"                   followed by the two characters "8" and "1"
1758    
# Line 1034  BACKSLASH Line 1760  BACKSLASH
1760       duced  by  a  leading zero, because no more than three octal       duced  by  a  leading zero, because no more than three octal
1761       digits are ever read.       digits are ever read.
1762    
1763       All the sequences that define a single  byte  value  can  be       All the sequences that define a single byte value or a  sin-
1764       used both inside and outside character classes. In addition,       gle  UTF-8 character (in UTF-8 mode) can be used both inside
1765       inside a character class, the sequence "\b"  is  interpreted       and outside character classes. In addition, inside a charac-
1766       as  the  backspace  character  (hex 08). Outside a character       ter  class,  the sequence \b is interpreted as the backspace
1767       class it has a different meaning (see below).       character (hex 08). Outside a character class it has a  dif-
1768         ferent meaning (see below).
1769    
1770       The third use of backslash is for specifying generic charac-       The third use of backslash is for specifying generic charac-
1771       ter types:       ter types:
# Line 1048  BACKSLASH Line 1775  BACKSLASH
1775         \s     any whitespace character         \s     any whitespace character
1776         \S     any character that is not a whitespace character         \S     any character that is not a whitespace character
1777         \w     any "word" character         \w     any "word" character
1778         \W     any "non-word" character         \W     any "non-word" character
1779    
1780       Each pair of escape sequences partitions the complete set of       Each pair of escape sequences partitions the complete set of
1781       characters  into  two  disjoint  sets.  Any  given character       characters  into  two  disjoint  sets.  Any  given character
1782       matches one, and only one, of each pair.       matches one, and only one, of each pair.
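
The partition property can be checked mechanically by testing every single-byte character against both members of each pair. This is a minimal sketch using Python's re module in place of PCRE (an assumption: the two engines differ in detail, for example Python's \s also matches VT), restricted to ASCII tables with re.ASCII. The helper names are hypothetical.

```python
import re

# Escape pairs from the table above; each pair should split the
# character set into two disjoint halves.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def matches(escape, ch):
    """Return True if the one-character string ch matches the escape."""
    return re.fullmatch(escape, ch, re.ASCII) is not None

def is_partition(pos, neg, chars):
    """True if every character matches exactly one of pos and neg."""
    return all(matches(pos, c) != matches(neg, c) for c in chars)

all_bytes = [chr(i) for i in range(256)]
results = {pair: is_partition(pair[0], pair[1], all_bytes) for pair in PAIRS}
```

Every value in results should be true: no character matches both escapes of a pair, and none matches neither.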
1783    
1784         In UTF-8 mode, characters with values greater than 255 never
1785         match \d, \s, or \w, and always match \D, \S, and \W.
1786    
1787         For compatibility with Perl, \s does not match the VT  char-
1788         acter (code 11).  This makes it different  from  the  POSIX
1789         "space" class. The \s characters are HT  (9),  LF  (10),  FF
1790         (12), CR (13), and space (32).
1791    
1792       A "word" character is any letter or digit or the  underscore       A "word" character is any letter or digit or the  underscore
1793       character,  that  is,  any  character which can be part of a       character,  that  is,  any  character which can be part of a
1794       Perl "word". The definition of letters and  digits  is  con-       Perl "word". The definition of letters and  digits  is  con-
1795       trolled  by PCRE's character tables, and may vary if locale-       trolled  by PCRE's character tables, and may vary if locale-
1796       specific matching is  taking  place  (see  "Locale  support"       specific matching is taking place (see "Locale  support"  in
1797       above). For example, in the "fr" (French) locale, some char-       the pcreapi page). For example, in the "fr" (French) locale,
1798       acter codes greater than 128 are used for accented  letters,       some character codes greater than 128 are used for  accented
1799       and these are matched by \w.       letters, and these are matched by \w.
1800    
1801       These character type sequences can appear  both  inside  and       These character type sequences can appear  both  inside  and
1802       outside  character classes. They each match one character of       outside  character classes. They each match one character of
# Line 1076  BACKSLASH Line 1811  BACKSLASH
1811       for more complicated  assertions  is  described  below.  The       for more complicated  assertions  is  described  below.  The
1812       backslashed assertions are       backslashed assertions are
1813    
1814         \b     word boundary         \b     matches at a word boundary
1815         \B     not a word boundary         \B     matches when not at a word boundary
1816         \A     start of subject (independent of multiline mode)         \A     matches at start of subject
1817         \Z     end of subject or newline at  end  (independent  of         \Z     matches at end of subject or before newline at end
1818       multiline mode)         \z     matches at end of subject
1819         \z     end of subject (independent of multiline mode)         \G     matches at first matching position in subject
1820    
1821       These assertions may not appear in  character  classes  (but       These assertions may not appear in  character  classes  (but
1822       note that "\b" has a different meaning, namely the backspace       note  that  \b has a different meaning, namely the backspace
1823       character, inside a character class).       character, inside a character class).
1824    
1825       A word boundary is a position in the  subject  string  where       A word boundary is a position in the  subject  string  where
# Line 1092  BACKSLASH Line 1827  BACKSLASH
1827       match \w or \W (i.e. one matches \w and  the  other  matches       match \w or \W (i.e. one matches \w and  the  other  matches
1828       \W),  or the start or end of the string if the first or last       \W),  or the start or end of the string if the first or last
1829       character matches \w, respectively.       character matches \w, respectively.
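
The definition of a word boundary can be made concrete by listing the offsets at which \b matches. This sketch uses Python's re module as a stand-in for PCRE (an assumption; for ASCII subjects the definition is the same), and the function name is hypothetical.

```python
import re

def boundaries(subject):
    """Return the zero-based offsets in subject at which \\b matches."""
    return [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]

positions = boundaries("ab cd")
```

For "ab cd" the boundaries are at offsets 0, 2, 3, and 5: the start and end of the string (because the first and last characters match \w) and both sides of the space.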
   
1830       The \A, \Z, and \z assertions differ  from  the  traditional       The \A, \Z, and \z assertions differ  from  the  traditional
1831       circumflex  and  dollar  (described below) in that they only       circumflex  and  dollar  (described below) in that they only
1832       ever match at the very start and end of the subject  string,       ever match at the very start and end of the subject  string,
1833       whatever  options  are  set.  They  are  not affected by the       whatever options are set. Thus, they are independent of mul-
1834       PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-       tiline mode.
1835       ment  of  pcre_exec()  is  non-zero, \A can never match. The  
1836         They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL
1837         options.  If the startoffset argument of pcre_exec() is non-
1838         zero, indicating that matching is to start at a point  other
1839         than  the  beginning of the subject, \A can never match. The
1840       difference between \Z and \z is that  \Z  matches  before  a       difference between \Z and \z is that  \Z  matches  before  a
1841       newline  that is the last character of the string as well as       newline  that is the last character of the string as well as
1842       at the end of the string, whereas \z  matches  only  at  the       at the end of the string, whereas \z  matches  only  at  the
1843       end.       end.
1844    
1845         The \G assertion is true  only  when  the  current  matching
1846         position is at the start point of the match, as specified by
1847         the startoffset argument of pcre_exec(). It differs from  \A
1848         when  the  value  of  startoffset  is  non-zero.  By calling
1849         pcre_exec() multiple times with appropriate  arguments,  you
1850         can mimic Perl's /g option, and it is in this kind of imple-
1851         mentation where \G can be useful.
1852    
1853         Note, however, that PCRE's  interpretation  of  \G,  as  the
1854         start of the current match, is subtly different from Perl's,
1855         which defines it as the end of the previous match. In  Perl,
1856         these  can  be  different when the previously matched string
1857         was empty. Because PCRE does just one match at  a  time,  it
1858         cannot reproduce this behaviour.
1859    
1860         If all the alternatives of a  pattern  begin  with  \G,  the
1861         expression  is  anchored to the starting match position, and
1862         the "anchored" flag is set in the compiled  regular  expres-
1863         sion.
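
The repeated-call technique described above can be sketched as a loop in which each match must begin exactly where the previous one ended. Python's re module has no \G, so this sketch relies on an analogy rather than on PCRE's API: Pattern.match(subject, pos) anchors the match at pos, much as \G anchors at startoffset. TOKEN and tokens() are hypothetical names.

```python
import re

# Optional whitespace, then a run of digits or lower case letters.
TOKEN = re.compile(r"\s*(\d+|[a-z]+)")

def tokens(subject):
    """Collect tokens, restarting each match where the previous one ended."""
    out, pos = [], 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)  # anchored at pos, like \G
        if m is None:
            break                      # nothing matches at this position
        out.append(m.group(1))
        pos = m.end()                  # next call starts here
    return out, pos
```

Because every match is anchored at the current position, the loop stops at the first text that is not a token instead of skipping over it: tokens("12 ?? 34") yields only "12" and stops at offset 2.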
1864    
1865    
1866  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
1867    
1868       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1869       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1870       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1871       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1872       zero, circumflex can never match. Inside a character  class,       zero, circumflex  can  never  match  if  the  PCRE_MULTILINE
1873       circumflex has an entirely different meaning (see below).       option is unset. Inside a character class, circumflex has an
1874         entirely different meaning (see below).
1875    
1876       Circumflex need not be the first character of the pattern if       Circumflex need not be the first character of the pattern if
1877       a  number of alternatives are involved, but it should be the       a  number of alternatives are involved, but it should be the
# Line 1134  CIRCUMFLEX AND DOLLAR Line 1893  CIRCUMFLEX AND DOLLAR
1893    
1894       The meaning of dollar can be changed so that it matches only       The meaning of dollar can be changed so that it matches only
1895       at   the   very   end   of   the   string,  by  setting  the       at   the   very   end   of   the   string,  by  setting  the
1896       PCRE_DOLLAR_ENDONLY option at compile or matching time. This       PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not
1897       does not affect the \Z assertion.       affect the \Z assertion.
1898    
1899       The meanings of the circumflex  and  dollar  characters  are       The meanings of the circumflex  and  dollar  characters  are
1900       changed  if  the  PCRE_MULTILINE option is set. When this is       changed  if  the  PCRE_MULTILINE option is set. When this is
1901       the case,  they  match  immediately  after  and  immediately       the case,  they  match  immediately  after  and  immediately
1902       before an internal "\n" character, respectively, in addition       before an internal newline character, respectively, in addi-
1903       to matching at the start and end of the subject string.  For       tion to matching at the start and end of the subject string.
1904       example,  the  pattern  /^abc$/  matches  the subject string       For  example, the pattern /^abc$/ matches the subject string
1905       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-
1906       quently,  patterns  that  are  anchored  in single line mode       quently,  patterns  that  are  anchored  in single line mode
1907       because all branches start with "^" are not anchored in mul-       because all branches start with ^ are not anchored in multi-
1908       tiline mode, and a match for circumflex is possible when the       line  mode,  and a match for circumflex is possible when the
1909       startoffset  argument  of  pcre_exec()  is   non-zero.   The       startoffset  argument  of  pcre_exec()  is   non-zero.   The
1910       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is
1911       set.       set.
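
The /^abc$/ example can be reproduced as follows. The sketch uses Python's re module, whose re.MULTILINE flag is assumed here to correspond to PCRE_MULTILINE for this case.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)
```

Here single is None, while multi matches at offset 4, immediately after the newline.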
# Line 1157  CIRCUMFLEX AND DOLLAR Line 1916  CIRCUMFLEX AND DOLLAR
1916       whether PCRE_MULTILINE is set or not.       whether PCRE_MULTILINE is set or not.
1917    
1918    
   
1919  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
1920    
1921       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1922       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1923       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default) newline.  In UTF-8 mode,  a  dot
1924       option  is set, dots match newlines as well. The handling of       matches  any  UTF-8  character, which might be more than one
1925       dot is entirely independent of the  handling  of  circumflex       byte  long,  except  (by  default)  for  newline.   If   the
1926       and  dollar,  the  only  relationship  being  that they both       PCRE_DOTALL  option is set, dots match newlines as well. The
1927       involve newline characters. Dot has no special meaning in  a       handling of dot is entirely independent of the  handling  of
1928       character class.       circumflex and dollar, the only relationship being that they
1929         both involve newline characters. Dot has no special  meaning
1930         in a character class.
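
The behaviour of dot can be illustrated with three small matches. This is a sketch using Python's re module, with re.DOTALL assumed to play the role of PCRE_DOTALL.

```python
import re

# By default a dot does not match newline.
default = re.fullmatch(r"a.b", "a\nb")

# With the dot-all option set, it does.
dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# A non-printing character other than newline is matched in either mode.
control = re.fullmatch(r"a.b", "a\x01b")
```

Only the first match fails.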
1931    
1932    
1933    
1934    MATCHING A SINGLE BYTE
1935    
1936         Outside a character class, the escape  sequence  \C  matches
1937         any  one  byte, both in and out of UTF-8 mode. Unlike a dot,
1938         it always matches a newline. The feature is provided in Perl
1939         in  order  to match individual bytes in UTF-8 mode.  Because
1940         it breaks up UTF-8 characters into  individual  bytes,  what
1941         remains  in  the string may be a malformed UTF-8 string. For
1942         this reason it is best avoided.
1943    
1944         PCRE does not allow \C to appear  in  lookbehind  assertions
1945         (see below), because in UTF-8 mode it makes it impossible to
1946         calculate the length of the lookbehind.
1947    
1948    
1949  SQUARE BRACKETS  SQUARE BRACKETS
1950    
1951       An opening square bracket introduces a character class, ter-       An opening square bracket introduces a character class, ter-
1952       minated  by  a  closing  square  bracket.  A  closing square       minated  by  a  closing  square  bracket.  A  closing square
1953       bracket on its own is  not  special.  If  a  closing  square       bracket on its own is  not  special.  If  a  closing  square
# Line 1178  SQUARE BRACKETS Line 1955  SQUARE BRACKETS
1955       the first data character in the class (after an initial cir-       the first data character in the class (after an initial cir-
1956       cumflex, if present) or escaped with a backslash.       cumflex, if present) or escaped with a backslash.
1957    
1958       A character class matches a single character in the subject;       A character class matches a single character in the subject.
1959       the  character  must  be in the set of characters defined by       In  UTF-8 mode, the character may occupy more than one byte.
1960       the class, unless the first character in the class is a cir-       A matched character must be in the set of characters defined
1961       cumflex,  in which case the subject character must not be in       by the class, unless the first character in the class defin-
1962       the set defined by the class. If a  circumflex  is  actually       ition is a circumflex, in which case the  subject  character
1963       required  as  a  member  of  the class, ensure it is not the       must not be in the set defined by the class. If a circumflex
1964       first character, or escape it with a backslash.       is actually required as a member of the class, ensure it  is
1965         not the first character, or escape it with a backslash.
1966    
1967       For example, the character class [aeiou] matches  any  lower       For example, the character class [aeiou] matches  any  lower
1968       case vowel, while [^aeiou] matches any character that is not       case vowel, while [^aeiou] matches any character that is not
# Line 1195  SQUARE BRACKETS Line 1973  SQUARE BRACKETS
1973       string, and fails if the current pointer is at  the  end  of       string, and fails if the current pointer is at  the  end  of
1974       the string.       the string.
1975    
1976         In UTF-8 mode, characters with values greater than  255  can
1977         be  included  in a class as a literal string of bytes, or by
1978         using the \x{ escaping mechanism.
1979    
1980       When caseless matching  is  set,  any  letters  in  a  class       When caseless matching  is  set,  any  letters  in  a  class
1981       represent  both their upper case and lower case versions, so       represent  both their upper case and lower case versions, so
1982       for example, a caseless [aeiou] matches "A" as well as  "a",       for example, a caseless [aeiou] matches "A" as well as  "a",
1983       and  a caseless [^aeiou] does not match "A", whereas a case-       and  a caseless [^aeiou] does not match "A", whereas a case-
1984       ful version would.       ful version would. PCRE does not support the concept of case
1985         for characters with values greater than 255.
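
The four caseless and caseful cases above can be checked directly. The sketch uses Python's re module with re.IGNORECASE standing in for PCRE_CASELESS (an assumption), and is limited to ASCII letters because Python, unlike PCRE as described here, does apply case to characters above 255. The helper name is hypothetical.

```python
import re

def in_class(cls, ch, caseless=False):
    """Return True if the single character ch matches the class cls."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(cls, ch, flags) is not None

checks = {
    "caseless [aeiou] vs A": in_class("[aeiou]", "A", caseless=True),
    "caseful [aeiou] vs A": in_class("[aeiou]", "A"),
    "caseless [^aeiou] vs A": in_class("[^aeiou]", "A", caseless=True),
    "caseful [^aeiou] vs A": in_class("[^aeiou]", "A"),
}
```

The first and last entries are true; the middle two are false.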
1986       The newline character is never treated in any special way in       The newline character is never treated in any special way in
1987       character  classes,  whatever the setting of the PCRE_DOTALL       character  classes,  whatever the setting of the PCRE_DOTALL
1988       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will
# Line 1224  SQUARE BRACKETS Line 2006  SQUARE BRACKETS
2006       separate characters. The octal or hexadecimal representation       separate characters. The octal or hexadecimal representation
2007       of "]" can also be used to end a range.       of "]" can also be used to end a range.
2008    
2009       Ranges operate in ASCII collating sequence. They can also be       Ranges  operate  in  the  collating  sequence  of  character
2010       used  for  characters  specified  numerically,  for  example       values.  They  can  also  be  used  for characters specified
2011       [\000-\037]. If a range that includes letters is  used  when       numerically, for example [\000-\037]. In UTF-8 mode,  ranges
2012       caseless  matching  is set, it matches the letters in either       can  include  characters  whose values are greater than 255,
2013       case. For example, [W-c] is equivalent  to  [][\^_`wxyzabc],       for example [\x{100}-\x{2ff}].
2014       matched  caselessly,  and  if  character tables for the "fr"  
2015       locale are in use, [\xc8-\xcb] matches accented E characters       If a range that  includes  letters  is  used  when  caseless
2016       in both cases.       matching  is set, it matches the letters in either case. For
2017         example, [W-c] is  equivalent  to  [][\^_`wxyzabc],  matched
2018         caselessly,  and if character tables for the "fr" locale are
2019         in use, [\xc8-\xcb] matches accented E  characters  in  both
2020         cases.
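
The [W-c] equivalence can be verified by listing every printable ASCII character that the caseless range matches. This sketch assumes that Python's re module, with re.IGNORECASE and re.ASCII, treats the range as PCRE does with its default tables.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII
printable = [chr(i) for i in range(32, 127)]

# Characters matched by the caseless range.
matched = {c for c in printable if re.fullmatch(r"[W-c]", c, FLAGS)}

# The range itself (W X Y Z [ \ ] ^ _ ` a b c) plus the other case
# of each letter in it.
expected = set("WXYZ[\\]^_`abc") | set("wxyzABC")
```

The two sets are equal: the letters w to z and a to c in either case, plus the six punctuation characters that lie between "Z" and "a" in the collating sequence.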
2021    
2022       The character types \d, \D, \s, \S,  \w,  and  \W  may  also       The character types \d, \D, \s, \S,  \w,  and  \W  may  also
2023       appear  in  a  character  class, and add the characters that       appear  in  a  character  class, and add the characters that
# Line 1247  SQUARE BRACKETS Line 2033  SQUARE BRACKETS
2033       classes, but it does no harm if they are escaped.       classes, but it does no harm if they are escaped.
2034    
2035    
   
2036  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
2037       Perl 5.6 (not yet released at the time of writing) is  going  
2038       to  support  the POSIX notation for character classes, which       Perl supports the  POSIX  notation  for  character  classes,
2039       uses names enclosed by  [:  and  :]   within  the  enclosing       which  uses names enclosed by [: and :] within the enclosing
2040       square brackets. PCRE supports this notation. For example,       square brackets. PCRE also supports this notation. For exam-
2041         ple,
2042    
2043         [01[:alpha:]%]         [01[:alpha:]%]
2044    
# Line 1262  POSIX CHARACTER CLASSES Line 2048  POSIX CHARACTER CLASSES
2048         alnum    letters and digits         alnum    letters and digits
2049         alpha    letters         alpha    letters
2050         ascii    character codes 0 - 127         ascii    character codes 0 - 127
2051           blank    space or tab only
2052         cntrl    control characters         cntrl    control characters
2053         digit    decimal digits (same as \d)         digit    decimal digits (same as \d)
2054         graph    printing characters, excluding space         graph    printing characters, excluding space
2055         lower    lower case letters         lower    lower case letters
2056         print    printing characters, including space         print    printing characters, including space
2057         punct    printing characters, excluding letters and digits         punct    printing characters, excluding letters and digits
2058         space    white space (same as \s)         space    white space (not quite the same as \s)
2059         upper    upper case letters         upper    upper case letters
2060         word     "word" characters (same as \w)         word     "word" characters (same as \w)
2061         xdigit   hexadecimal digits         xdigit   hexadecimal digits
2062    
2063       The names "ascii" and "word" are  Perl  extensions.  Another       The "space" characters are HT (9),  LF  (10),  VT  (11),  FF
2064       Perl  extension is negation, which is indicated by a ^ char-       (12),  CR  (13),  and  space  (32).  Notice  that  this list
2065       acter after the colon. For example,       includes the VT character (code 11). This makes "space" dif-
2066         ferent  to  \s, which does not include VT (for Perl compati-
2067         bility).
2068    
2069         The name "word" is a Perl extension, and "blank"  is  a  GNU
2070         extension from Perl 5.8. Another Perl extension is negation,
2071         which is indicated by a ^ character  after  the  colon.  For
2072         example,
2073    
2074         [12[:^digit:]]         [12[:^digit:]]
2075    
# Line 1284  POSIX CHARACTER CLASSES Line 2078  POSIX CHARACTER CLASSES
2078       "collating element", but these are  not  supported,  and  an       "collating element", but these are  not  supported,  and  an
2079       error is given if they are encountered.       error is given if they are encountered.
2080    
2081         In UTF-8 mode, characters with values greater  than  255  do
2082         not match any of the POSIX character classes.
2083    
2084    
2085  VERTICAL BAR  VERTICAL BAR
2086    
2087       Vertical bar characters are  used  to  separate  alternative       Vertical bar characters are  used  to  separate  alternative
2088       patterns. For example, the pattern       patterns. For example, the pattern
2089    
# Line 1302  VERTICAL BAR Line 2099  VERTICAL BAR
2099       subpattern.       subpattern.
2100    
2101    
   
2102  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
2103       The settings of PCRE_CASELESS, PCRE_MULTILINE,  PCRE_DOTALL,  
2104       and  PCRE_EXTENDED can be changed from within the pattern by       The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,
2105       a sequence of Perl option letters enclosed between "(?"  and       PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from
2106       ")". The option letters are       within the pattern by a  sequence  of  Perl  option  letters
2107         enclosed between "(?" and ")". The option letters are
2108    
2109         i  for PCRE_CASELESS         i  for PCRE_CASELESS
2110         m  for PCRE_MULTILINE         m  for PCRE_MULTILINE
# Line 1322  INTERNAL OPTION SETTING Line 2119  INTERNAL OPTION SETTING
2119       If  a  letter  appears both before and after the hyphen, the       If  a  letter  appears both before and after the hyphen, the
2120       option is unset.       option is unset.
2121    
2122       The scope of these option changes depends on  where  in  the       When an option change occurs at  top  level  (that  is,  not
2123       pattern  the  setting  occurs. For settings that are outside       inside  subpattern  parentheses),  the change applies to the
2124       any subpattern (defined below), the effect is the same as if       remainder of the pattern that follows.   If  the  change  is
2125       the  options were set or unset at the start of matching. The       placed  right  at  the  start of a pattern, PCRE extracts it
2126       following patterns all behave in exactly the same way:       into the global options (and it will therefore  show  up  in
2127         data extracted by the pcre_fullinfo() function).
2128         (?i)abc  
2129         a(?i)bc       An option change within a subpattern affects only that  part
2130         ab(?i)c       of the current pattern that follows it, so
        abc(?i)  
   
      which in turn is the same as compiling the pattern abc  with  
      PCRE_CASELESS  set.   In  other words, such "top level" set-  
      tings apply to the whole pattern  (unless  there  are  other  
      changes  inside subpatterns). If there is more than one set-  
      ting of the same option at top level, the rightmost  setting  
      is used.  
   
      If an option change occurs inside a subpattern,  the  effect  
      is  different.  This is a change of behaviour in Perl 5.005.  
      An option change inside a subpattern affects only that  part  
      of the subpattern that follows it, so  
2131    
2132         (a(?i)b)c         (a(?i)b)c
2133    
# Line 1370  INTERNAL OPTION SETTING Line 2154  INTERNAL OPTION SETTING
2154       even when it is at top level. It is best put at the start.       even when it is at top level. It is best put at the start.
2155    
2156    
   
2157  SUBPATTERNS  SUBPATTERNS
2158    
2159       Subpatterns are delimited by parentheses  (round  brackets),       Subpatterns are delimited by parentheses  (round  brackets),
2160       which can be nested.  Marking part of a pattern as a subpat-       which can be nested.  Marking part of a pattern as a subpat-
2161       tern does two things:       tern does two things:
# Line 1404  SUBPATTERNS Line 2188  SUBPATTERNS
2188       The fact that plain parentheses fulfil two functions is  not       The fact that plain parentheses fulfil two functions is  not
2189       always  helpful.  There are often times when a grouping sub-       always  helpful.  There are often times when a grouping sub-
2190       pattern is required without a capturing requirement.  If  an       pattern is required without a capturing requirement.  If  an
2191       opening parenthesis is followed by "?:", the subpattern does       opening  parenthesis  is  followed  by a question mark and a
2192       not do any capturing, and is not counted when computing  the       colon, the subpattern does not do any capturing, and is  not
2193       number of any subsequent capturing subpatterns. For example,       counted  when computing the number of any subsequent captur-
2194       if the string "the white queen" is matched against the  pat-       ing subpatterns. For  example,  if  the  string  "the  white
2195       tern       queen" is matched against the pattern
2196    
2197         the ((?:red|white) (king|queen))         the ((?:red|white) (king|queen))
2198    
2199       the captured substrings are "white queen" and  "queen",  and       the captured substrings are "white queen" and  "queen",  and
2200       are  numbered  1  and 2. The maximum number of captured sub-       are  numbered  1 and 2. The maximum number of capturing sub-
2201       strings is 99, and the maximum number  of  all  subpatterns,       patterns is 65535, and the maximum depth of nesting  of  all
2202       both capturing and non-capturing, is 200.       subpatterns, both capturing and non-capturing, is 200.
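
The numbering in this example can be confirmed with a short sketch. It uses Python's re module, which is assumed to number capturing groups in the same way as PCRE.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.match("the white queen")

# (?:red|white) groups without capturing, so only two groups are counted.
captured = m.groups()
group_count = pattern.groups
```

Here captured holds "white queen" and "queen", numbered 1 and 2, and group_count is 2.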
2203    
2204       As a  convenient  shorthand,  if  any  option  settings  are       As a  convenient  shorthand,  if  any  option  settings  are
2205       required  at  the  start  of a non-capturing subpattern, the       required  at  the  start  of a non-capturing subpattern, the
# Line 1432  SUBPATTERNS Line 2216  SUBPATTERNS
2216       the above patterns match "SUNDAY" as well as "Saturday".       the above patterns match "SUNDAY" as well as "Saturday".
2217    
2218    
2219    NAMED SUBPATTERNS
2220    
2221         Identifying capturing parentheses by number is  simple,  but
2222         it  can be very hard to keep track of the numbers in compli-
2223         cated regular expressions. Furthermore, if an expression  is
2224         modified,  the  numbers  may change. To help with the diffi-
2225         culty, PCRE supports the naming  of  subpatterns,  something
2226         that  Perl does not provide. The Python syntax (?P<name>...)
2227         is used. Names consist of alphanumeric characters and under-
2228         scores, and must be unique within a pattern.
2229    
2230         Named capturing parentheses are still allocated  numbers  as
2231         well  as  names.  The  PCRE  API provides function calls for
2232         extracting the name-to-number translation table from a  com-
2233         piled  pattern. For further details see the pcreapi documen-
2234         tation.
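
Since the syntax is borrowed from Python, the idea can be sketched with Python's own re module. The groupindex attribute used below is Python's name-to-number table, shown only as an analogue of the translation table mentioned above; it is not the PCRE API. The pattern and group names are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d\d)-(?P<day>\d\d)")

# Named groups are still numbered from left to right.
name_to_number = dict(date.groupindex)

m = date.match("2007-02-24")
by_name = m.group("month")
by_number = m.group(2)
```

Both by_name and by_number hold "02", so the name is simply an alias for the number.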
2235    
2236    
2237  REPETITION  REPETITION
2238    
2239       Repetition is specified by quantifiers, which can follow any       Repetition is specified by quantifiers, which can follow any
2240       of the following items:       of the following items:
2241    
2242         a single character, possibly escaped         a literal data character
2243         the . metacharacter         the . metacharacter
2244           the \C escape sequence
2245           escapes such as \d that match single characters
2246         a character class         a character class
2247         a back reference (see next section)         a back reference (see next section)
2248         a parenthesized subpattern (unless it is  an  assertion  -         a parenthesized subpattern (unless it is an assertion)
      see below)  
2249    
2250       The general repetition quantifier specifies  a  minimum  and       The general repetition quantifier specifies  a  minimum  and
2251       maximum  number  of  permitted  matches,  by  giving the two       maximum  number  of  permitted  matches,  by  giving the two
# Line 1470  REPETITION Line 2273  REPETITION
2273       one that does not match the syntax of a quantifier, is taken       one that does not match the syntax of a quantifier, is taken
2274       as  a literal character. For example, {,6} is not a quantif-       as  a literal character. For example, {,6} is not a quantif-
2275       ier, but a literal string of four characters.       ier, but a literal string of four characters.
2276    
2277         In UTF-8 mode, quantifiers apply to UTF-8 characters  rather
2278         than  to  individual  bytes.  Thus,  for example, \x{100}{2}
2279         matches two UTF-8 characters, each of which  is  represented
2280         by a two-byte sequence.
2281    
2282       The quantifier {0} is permitted, causing the  expression  to       The quantifier {0} is permitted, causing the  expression  to
2283       behave  as  if the previous item and the quantifier were not       behave  as  if the previous item and the quantifier were not
2284       present.       present.
# Line 1539  REPETITION Line 2348  REPETITION
2348       repeat  count  that is greater than 1 or with a limited max-       repeat  count  that is greater than 1 or with a limited max-
2349       imum, more store is required for the  compiled  pattern,  in       imum, more store is required for the  compiled  pattern,  in
2350       proportion to the size of the minimum or maximum.       proportion to the size of the minimum or maximum.
   
2351       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
2352       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
2353       to match  newlines,  the  pattern  is  implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
2354       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
2355       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
2356       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
2357       PCRE treats such a pattern as though it were preceded by \A.       PCRE normally treats such a pattern as though it  were  pre-
2358       In  cases where it is known that the subject string contains       ceded by \A.
2359       no newlines, it is worth setting PCRE_DOTALL when  the  pat-  
2360       tern begins with .* in order to obtain this optimization, or       In cases where it is known that the subject string  contains
2361       alternatively using ^ to indicate anchoring explicitly.       no  newlines,  it  is  worth setting PCRE_DOTALL in order to
2362         obtain this optimization, or alternatively using ^ to  indi-
2363         cate anchoring explicitly.
2364    
2365         However, there is one situation where the optimization  can-
2366         not  be  used. When .*  is inside capturing parentheses that
2367         are the subject of a backreference elsewhere in the pattern,
2368         a match at the start may fail, and a later one succeed. Con-
2369         sider, for example:
2370    
2371           (.*)abc\1
2372    
2373         If the subject is "xyz123abc123"  the  match  point  is  the
2374         fourth  character.  For  this  reason, such a pattern is not
2375         implicitly anchored.
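
     The behaviour can be checked with any backtracking  matcher
     that  supports back references. The sketch below uses the re
     module of Python rather than PCRE itself (an assumption made
     only  to  keep the example self-contained); the pattern and
     subject are the ones given above.

```python
import re

# Pattern and subject from the text above. Python's re module stands in
# for PCRE here; it is not the PCRE library.
pattern = re.compile(r"(.*)abc\1", re.DOTALL)
subject = "xyz123abc123"

# An unanchored search has to move past the first three characters.
m = pattern.search(subject)
start = m.start()        # zero-based offset of the match point
captured = m.group(1)    # text held by the capturing parentheses

# An attempt that is only allowed to start at offset 0 fails, which is
# why implicit anchoring would give the wrong answer for this pattern.
anchored = pattern.match(subject)

print(start, repr(captured), anchored)
```

     The zero-based offset reported is 3, that  is,  the  fourth
     character of the subject.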
2376    
2377       When a capturing subpattern is repeated, the value  captured       When a capturing subpattern is repeated, the value  captured
2378       is the substring that matched the final iteration. For exam-       is the substring that matched the final iteration. For exam-
# Line 1570  REPETITION Line 2392  REPETITION
2392       "b".       "b".
2393    
2394    
2395    ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2396    
2397  BACK REFERENCES       With both maximizing and minimizing repetition,  failure  of
2398       Outside a character class, a backslash followed by  a  digit       what  follows  normally  causes  the repeated item to be re-
2399       greater  than  0  (and  possibly  further  digits) is a back       evaluated to see if a different number of repeats allows the
2400         rest  of  the  pattern  to  match. Sometimes it is useful to
2401         prevent this, either to change the nature of the  match,  or
2402         to cause it to fail earlier than it otherwise might, when the
2403         author of the pattern knows there is no  point  in  carrying
2404         on.
2405    
2406         Consider, for example, the pattern \d+foo  when  applied  to
2407         the subject line
2408    
2409           123456bar
2410    
2411         After matching all 6 digits and then failing to match "foo",
2412         the normal action of the matcher is to try again with only 5
2413         digits matching the \d+ item, and then with 4,  and  so  on,
2414         before  ultimately  failing. "Atomic grouping" (a term taken
2415         from Jeffrey Friedl's book) provides the means for  specify-
2416         ing  that once a subpattern has matched, it is not to be re-
2417         evaluated in this way.
2418    
2419         If we use atomic grouping  for  the  previous  example,  the
2420         matcher  would give up immediately on failing to match "foo"
2421         the  first  time.  The  notation  is  a  kind   of   special
2422         parenthesis, starting with (?> as in this example:
2423    
2424           (?>\d+)foo
2425    
2426         This kind of parenthesis "locks up" the  part of the pattern
2427         it  contains once it has matched, and a failure further into
2428         the pattern is prevented from backtracking  into  it.  Back-
2429         tracking  past  it to previous items, however, works as nor-
2430         mal.
2431    
2432         An alternative description is that a subpattern of this type
2433         matches  the  string  of  characters that an identical stan-
2434         dalone pattern would match, if anchored at the current point
2435         in the subject string.
2436    
2437         Atomic grouping subpatterns are not  capturing  subpatterns.
2438         Simple  cases such as the above example can be thought of as
2439         a maximizing repeat that must swallow everything it can. So,
2440         while both \d+ and \d+? are prepared to adjust the number of
2441         digits they match in order to make the rest of  the  pattern
2442         match, (?>\d+) can only match an entire sequence of digits.
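
     The difference between an ordinary repeat and an atomic one
     can  be  sketched  as  follows. The example uses Python's re
     module, not PCRE, and because older Python releases have  no
     (?>...)  syntax  it emulates the atomic group with a looka-
     head followed by a back reference. That emulation,  and  the
     subject string, are assumptions chosen for illustration.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives
# one back so that the literal 5 can match.
greedy = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest run of
# digits and \1 then consumes exactly that run. The matcher cannot
# backtrack into a completed lookahead, so no digit is given back.
# (?:5) keeps the literal 5 separate from the back reference number.
atomic = re.search(r"(?=(\d+))\1(?:5)", subject)

print(greedy.group(0) if greedy else None)
print(atomic)
```

     The first search succeeds by giving a digit back;  the  se-
     cond  finds  no match, because the whole run of digits has
     been swallowed and cannot be re-evaluated.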
2443    
2444         Atomic groups in general can of course  contain  arbitrarily
2445         complicated  subpatterns,  and  can be nested. However, when
2446         the subpattern for an atomic group is just a single repeated
2447         item,  as in the example above, a simpler notation, called a
2448         "possessive quantifier" can be used.  This  consists  of  an
2449         additional  +  character  following a quantifier. Using this
2450         notation, the previous example can be rewritten as
2451    
2452           \d++foo
2453    
2454         Possessive quantifiers are always greedy; the setting of the
2455         PCRE_UNGREEDY option is ignored. They are a convenient nota-
2456         tion for the simpler forms of atomic group.  However,  there
2457         is  no  difference in the meaning or processing of a posses-
2458         sive quantifier and the equivalent atomic group.
2459    
2460         The possessive quantifier syntax is an extension to the Perl
2461         syntax. It originates in Sun's Java package.
2462    
2463         When a pattern contains an unlimited repeat inside a subpat-
2464         tern  that  can  itself  be  repeated an unlimited number of
2465         times, the use of an atomic group is the only way  to  avoid
2466         some  failing  matches  taking  a very long time indeed. The
2467         pattern
2468    
2469           (\D+|<\d+>)*[!?]
2470    
2471         matches an unlimited number of substrings that  either  con-
2472         sist  of  non-digits,  or digits enclosed in <>, followed by
2473         either ! or ?. When it matches, it runs quickly. However, if
2474         it is applied to
2475    
2476           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2477    
2478         it takes a long  time  before  reporting  failure.  This  is
2479         because the string can be divided between the two repeats in
2480         a large number of ways, and all have to be tried. (The exam-
2481         ple  used  [!?]  rather  than a single character at the end,
2482         because both PCRE and Perl have an optimization that  allows
2483         for  fast  failure  when  a  single  character is used. They
2484         remember the last single character that is  required  for  a
2485         match,  and  fail early if it is not present in the string.)
2486         If the pattern is changed to
2487    
2488           ((?>\D+)|<\d+>)*[!?]
2489    
2490  SunOS 5.8                 Last change:                         30       sequences of non-digits cannot be broken, and  failure  hap-
2491         pens quickly.
2492    
2493    
2494    BACK REFERENCES
2495    
2496       reference to a capturing subpattern  earlier  (i.e.  to  its       Outside a character class, a backslash followed by  a  digit
2497         greater  than  0  (and  possibly  further  digits) is a back
2498         reference to a capturing subpattern earlier (that is, to its
2499       left)  in  the  pattern,  provided there have been that many       left)  in  the  pattern,  provided there have been that many
2500       previous capturing left parentheses.       previous capturing left parentheses.
2501    
# Line 1597  SunOS 5.8 Last change: Line 2510  SunOS 5.8 Last change:
2510    
2511       A back reference matches whatever actually matched the  cap-       A back reference matches whatever actually matched the  cap-
2512       turing subpattern in the current subject string, rather than       turing subpattern in the current subject string, rather than
2513       anything matching the subpattern itself. So the pattern       anything matching the subpattern itself (see "Subpatterns as
2514         subroutines" below for a way of doing that). So the pattern
2515    
2516         (sens|respons)e and \1ibility         (sens|respons)e and \1ibility
2517    
# Line 1612  SunOS 5.8 Last change: Line 2526  SunOS 5.8 Last change:
2526       though  the  original  capturing subpattern is matched case-       though  the  original  capturing subpattern is matched case-
2527       lessly.       lessly.
2528    
2529         Back references to named subpatterns use the  Python  syntax
2530         (?P=name). We could rewrite the above example as follows:
2531    
2532           (?P<p1>(?i)rah)\s+(?P=p1)
2533    
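
     The case-sensitivity rule and the named form can  be  tried
     out  with  the  sketch  below.  It uses Python's re module,
     whose named-group syntax PCRE borrows; this is  an  assump-
     tion  made  for  illustration  and  not a call into the PCRE
     library. Python spells the locally scoped option as (?i:...)
     instead of placing (?i) inside the group.

```python
import re

# Numbered and named forms of the same pattern. Only the text inside
# the group is matched caselessly; the back reference is not.
numbered = re.compile(r"((?i:rah))\s+\1")
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

subjects = ["rah rah", "RAH RAH", "RAH rah"]

# For each subject, record whether each form matches the whole string.
results = {
    s: (bool(numbered.fullmatch(s)), bool(named.fullmatch(s)))
    for s in subjects
}

for s, outcome in results.items():
    print(s, outcome)
```

     Both forms accept "rah rah" and "RAH RAH" and reject  "RAH
     rah",  because  the back reference must repeat exactly what
     was captured.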
2534       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
2535       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
2536       particular match, any back references to it always fail. For       particular match, any back references to it always fail. For
# Line 1620  SunOS 5.8 Last change: Line 2539  SunOS 5.8 Last change:
2539         (a|(bc))\2         (a|(bc))\2
2540    
2541       always fails if it starts to match  "a"  rather  than  "bc".       always fails if it starts to match  "a"  rather  than  "bc".
2542       Because  there  may  be up to 99 back references, all digits       Because  there  may  be many capturing parentheses in a pat-
2543       following the backslash are taken as  part  of  a  potential       tern, all digits following the backslash are taken  as  part
2544       back reference number. If the pattern continues with a digit       of a potential back reference number. If the pattern contin-
2545       character, some delimiter must be used to terminate the back       ues with a digit character, some delimiter must be  used  to
2546       reference.   If the PCRE_EXTENDED option is set, this can be       terminate the back reference. If the PCRE_EXTENDED option is
2547       whitespace. Otherwise an empty comment can be used.       set, this can be whitespace.  Otherwise an empty comment can
2548         be used.
2549    
2550       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
2551       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
# Line 1644  SunOS 5.8 Last change: Line 2564  SunOS 5.8 Last change:
2564       example above, or by a quantifier with a minimum of zero.       example above, or by a quantifier with a minimum of zero.
2565    
2566    
   
2567  ASSERTIONS  ASSERTIONS
2568    
2569       An assertion is  a  test  on  the  characters  following  or       An assertion is  a  test  on  the  characters  following  or
2570       preceding  the current matching point that does not actually       preceding  the current matching point that does not actually
2571       consume any characters. The simple assertions coded  as  \b,       consume any characters. The simple assertions coded  as  \b,
2572       \B,  \A,  \Z,  \z, ^ and $ are described above. More compli-       \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-
2573       cated assertions are coded as  subpatterns.  There  are  two       plicated assertions are coded as subpatterns. There are  two
2574       kinds:  those that look ahead of the current position in the       kinds:  those that look ahead of the current position in the
2575       subject string, and those that look behind it.       subject string, and those that look behind it.
2576    
# Line 1677  ASSERTIONS Line 2597  ASSERTIONS
2597       when  the  next  three  characters  are  "bar". A lookbehind       when  the  next  three  characters  are  "bar". A lookbehind
2598       assertion is needed to achieve this effect.       assertion is needed to achieve this effect.
2599    
2600         If you want to force a matching failure at some point  in  a
2601         pattern,  the  most  convenient  way  to  do it is with (?!)
2602         because an empty string always matches, so an assertion that
2603         requires there not to be an empty string must always fail.
2604    
2605       Lookbehind assertions start with (?<=  for  positive  asser-       Lookbehind assertions start with (?<=  for  positive  asser-
2606       tions and (?<! for negative assertions. For example,       tions and (?<! for negative assertions. For example,
2607    
# Line 1697  ASSERTIONS Line 2622  ASSERTIONS
2622       causes an error at compile time. Branches  that  match  dif-       causes an error at compile time. Branches  that  match  dif-
2623       ferent length strings are permitted only at the top level of       ferent length strings are permitted only at the top level of
2624       a lookbehind assertion. This is an extension  compared  with       a lookbehind assertion. This is an extension  compared  with
2625       Perl  5.005,  which  requires all branches to match the same       Perl  (at  least  for  5.8),  which requires all branches to
2626       length of string. An assertion such as       match the same length of string. An assertion such as
2627    
2628         (?<=ab(c|de))         (?<=ab(c|de))
2629    
# Line 1712  ASSERTIONS Line 2637  ASSERTIONS
2637       alternative,  to  temporarily move the current position back       alternative,  to  temporarily move the current position back
2638       by the fixed width and then  try  to  match.  If  there  are       by the fixed width and then  try  to  match.  If  there  are
2639       insufficient  characters  before  the  current position, the       insufficient  characters  before  the  current position, the
2640       match is deemed to fail.  Lookbehinds  in  conjunction  with       match is deemed to fail.
2641       once-only  subpatterns can be particularly useful for match-  
2642       ing at the ends of strings; an example is given at  the  end       PCRE does not allow the \C escape (which  matches  a  single
2643       of the section on once-only subpatterns.       byte  in  UTF-8  mode)  to  appear in lookbehind assertions,
2644         because it makes it impossible to calculate  the  length  of
2645         the lookbehind.
2646    
2647         Atomic groups can be used  in  conjunction  with  lookbehind
2648         assertions  to  specify efficient matching at the end of the
2649         subject string. Consider a simple pattern such as
2650    
2651           abcd$
2652    
2653         when applied to a long string that does not  match.  Because
2654         matching  proceeds  from  left  to right, PCRE will look for
2655         each "a" in the subject and then see if what follows matches
2656         the rest of the pattern. If the pattern is specified as
2657    
2658           ^.*abcd$
2659    
2660         the initial .* matches the entire string at first, but  when
2661         this  fails  (because  there  is no following "a"), it back-
2662         tracks to match all but the last character, then all but the
2663         last  two  characters,  and so on. Once again the search for
2664         "a" covers the entire string, from right to left, so we  are
2665         no better off. However, if the pattern is written as
2666    
2667           ^(?>.*)(?<=abcd)
2668    
2669         or, equivalently,
2670    
2671           ^.*+(?<=abcd)
2672    
2673         there can be no backtracking for the .* item; it  can  match
2674         only  the entire string. The subsequent lookbehind assertion
2675         does a single test on the last four characters. If it fails,
2676         the match fails immediately. For long strings, this approach
2677         makes a significant difference to the processing time.
2678    
2679       Several assertions (of any sort) may  occur  in  succession.       Several assertions (of any sort) may  occur  in  succession.
2680       For example,       For example,
# Line 1760  ASSERTIONS Line 2719  ASSERTIONS
2719       for positive assertions, because it does not make sense  for       for positive assertions, because it does not make sense  for
2720       negative assertions.       negative assertions.
2721    
      Assertions count towards the maximum  of  200  parenthesized  
      subpatterns.  
   
   
   
 ONCE-ONLY SUBPATTERNS  
      With both maximizing and minimizing repetition,  failure  of  
      what  follows  normally  causes  the repeated item to be re-  
      evaluated to see if a different number of repeats allows the  
      rest  of  the  pattern  to  match. Sometimes it is useful to  
      prevent this, either to change the nature of the  match,  or  
      to  cause  it fail earlier than it otherwise might, when the  
      author of the pattern knows there is no  point  in  carrying  
      on.  
   
      Consider, for example, the pattern \d+foo  when  applied  to  
      the subject line  
   
        123456bar  
   
      After matching all 6 digits and then failing to match "foo",  
      the normal action of the matcher is to try again with only 5  
      digits matching the \d+ item, and then with 4,  and  so  on,  
      before ultimately failing. Once-only subpatterns provide the  
      means for specifying that once a portion of the pattern  has  
      matched,  it  is  not to be re-evaluated in this way, so the  
      matcher would give up immediately on failing to match  "foo"  
      the  first  time.  The  notation  is another kind of special  
      parenthesis, starting with (?> as in this example:  
   
        (?>\d+)bar  
   
      This kind of parenthesis "locks up" the  part of the pattern  
      it  contains once it has matched, and a failure further into  
      the pattern is prevented from backtracking  into  it.  Back-  
      tracking  past  it to previous items, however, works as nor-  
      mal.  
   
      An alternative description is that a subpattern of this type  
      matches  the  string  of  characters that an identical stan-  
      dalone pattern would match, if anchored at the current point  
      in the subject string.  
   
      Once-only subpatterns are not capturing subpatterns.  Simple  
      cases  such as the above example can be thought of as a max-  
      imizing repeat that must  swallow  everything  it  can.  So,  
      while both \d+ and \d+? are prepared to adjust the number of  
      digits they match in order to make the rest of  the  pattern  
      match, (?>\d+) can only match an entire sequence of digits.  
   
      This construction can of course contain arbitrarily  compli-  
      cated subpatterns, and it can be nested.  
   
      Once-only subpatterns can be used in conjunction with  look-  
      behind  assertions  to specify efficient matching at the end  
      of the subject string. Consider a simple pattern such as  
   
        abcd$  
   
      when applied to a long string which does not match.  Because  
      matching  proceeds  from  left  to right, PCRE will look for  
      each "a" in the subject and then see if what follows matches  
      the rest of the pattern. If the pattern is specified as  
   
        ^.*abcd$  
   
      the initial .* matches the entire string at first, but  when  
      this  fails  (because  there  is no following "a"), it back-  
      tracks to match all but the last character, then all but the  
      last  two  characters,  and so on. Once again the search for  
      "a" covers the entire string, from right to left, so we  are  
      no better off. However, if the pattern is written as  
   
        ^(?>.*)(?<=abcd)  
   
      there can be no backtracking for the .* item; it  can  match  
      only  the entire string. The subsequent lookbehind assertion  
      does a single test on the last four characters. If it fails,  
      the match fails immediately. For long strings, this approach  
      makes a significant difference to the processing time.  
   
      When a pattern contains an unlimited repeat inside a subpat-  
      tern  that  can  itself  be  repeated an unlimited number of  
      times, the use of a once-only subpattern is the only way  to  
      avoid  some  failing matches taking a very long time indeed.  
      The pattern  
   
        (\D+|<\d+>)*[!?]  
   
      matches an unlimited number of substrings that  either  con-  
      sist  of  non-digits,  or digits enclosed in <>, followed by  
      either ! or ?. When it matches, it runs quickly. However, if  
      it is applied to  
   
        aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  
   
      it takes a long  time  before  reporting  failure.  This  is  
      because the string can be divided between the two repeats in  
      a large number of ways, and all have to be tried. (The exam-  
      ple  used  [!?]  rather  than a single character at the end,  
      because both PCRE and Perl have an optimization that  allows  
      for  fast  failure  when  a  single  character is used. They  
      remember the last single character that is  required  for  a  
      match,  and  fail early if it is not present in the string.)  
      If the pattern is changed to  
   
        ((?>\D+)|<\d+>)*[!?]  
   
      sequences of non-digits cannot be broken, and  failure  hap-  
      pens quickly.  
   
   
2722    
2723  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
2724    
2725       It is possible to cause the matching process to obey a  sub-       It is possible to cause the matching process to obey a  sub-
2726       pattern  conditionally  or to choose between two alternative       pattern  conditionally  or to choose between two alternative
2727       subpatterns, depending on the result  of  an  assertion,  or       subpatterns, depending on the result  of  an  assertion,  or
# Line 1888  CONDITIONAL SUBPATTERNS Line 2736  CONDITIONAL SUBPATTERNS
2736       more than two alternatives in the subpattern, a compile-time       more than two alternatives in the subpattern, a compile-time
2737       error occurs.       error occurs.
2738    
2739       There are two kinds of condition. If the  text  between  the       There are three kinds of condition. If the text between  the
2740       parentheses  consists of a sequence of digits, the condition       parentheses  consists of a sequence of digits, the condition
2741       is satisfied if the capturing subpattern of that number  has       is satisfied if the capturing subpattern of that number  has
2742       previously  matched.  The  number must be greater than zero.       previously  matched.  The  number must be greater than zero.
# Line 1912  CONDITIONAL SUBPATTERNS Line 2760  CONDITIONAL SUBPATTERNS
2760       matches a sequence of non-parentheses,  optionally  enclosed       matches a sequence of non-parentheses,  optionally  enclosed
2761       in parentheses.       in parentheses.
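
     A condition on a numbered subpattern can  be  sketched  as
     follows.  The pattern is one that fits the description just
     given; it is written out here as an assumption. The  sketch
     uses  Python's re module (not PCRE), which accepts the same
     (?(1)...) syntax, with  re.VERBOSE  standing  in  for  the
     PCRE_EXTENDED option.

```python
import re

# ( \( )?      optional opening parenthesis, captured as group 1
# [^()]+       a run of non-parentheses
# (?(1) \) )   a closing parenthesis, required only if group 1 was set
cond = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]

# Record which subjects are matched in their entirety.
accepted = [s for s in subjects if cond.fullmatch(s)]

print(accepted)
```

     Only the bare string and the fully  parenthesized  one  are
     accepted; a lone opening or closing parenthesis is rejected.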
2762    
2763       If the condition is not a sequence of digits, it must be  an       If the condition is the string (R), it  is  satisfied  if  a
2764       assertion.  This  may be a positive or negative lookahead or       recursive  call  to the pattern or subpattern has been made.
2765       lookbehind assertion. Consider this pattern, again  contain-       At "top level", the condition is  false.   This  is  a  PCRE
2766       ing  non-significant  white space, and with the two alterna-       extension.  Recursive  patterns  are  described  in the next
2767       tives on the second line:       section.
2768    
2769         If the condition is not a sequence of digits or (R), it must
2770         be  an assertion.  This may be a positive or negative looka-
2771         head or lookbehind assertion. Consider this  pattern,  again
2772         containing  non-significant  white  space,  and with the two
2773         alternatives on the second line:
2774    
2775         (?(?=[^a-z]*[a-z])         (?(?=[^a-z]*[a-z])
2776         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
# Line 1931  CONDITIONAL SUBPATTERNS Line 2785  CONDITIONAL SUBPATTERNS
2785       letters and dd are digits.       letters and dd are digits.
2786    
2787    
   
2788  COMMENTS  COMMENTS
2789    
2790       The sequence (?# marks the start of a comment which  contin-       The sequence (?# marks the start of a comment which  contin-
2791       ues  up  to the next closing parenthesis. Nested parentheses       ues  up  to the next closing parenthesis. Nested parentheses
2792       are not permitted. The characters that  make  up  a  comment       are not permitted. The characters that  make  up  a  comment
# Line 1943  COMMENTS Line 2797  COMMENTS
2797       ues up to the next newline character in the pattern.       ues up to the next newline character in the pattern.
2798    
2799    
   
2800  RECURSIVE PATTERNS  RECURSIVE PATTERNS
2801    
2802       Consider the problem of matching a  string  in  parentheses,       Consider the problem of matching a  string  in  parentheses,
2803       allowing  for  unlimited nested parentheses. Without the use       allowing  for  unlimited nested parentheses. Without the use
2804       of recursion, the best that can be done is to use a  pattern       of recursion, the best that can be done is to use a  pattern
2805       that  matches  up  to some fixed depth of nesting. It is not       that  matches  up  to some fixed depth of nesting. It is not
2806       possible to handle an arbitrary nesting depth. Perl 5.6  has       possible to handle an arbitrary nesting depth. Perl has pro-
2807       provided   an  experimental  facility  that  allows  regular       vided  an  experimental facility that allows regular expres-
2808       expressions to recurse (amongst other things). It does  this       sions to recurse (amongst other things).  It  does  this  by
2809       by  interpolating  Perl  code in the expression at run time,       interpolating  Perl  code in the expression at run time, and
2810       and the code can refer to the expression itself. A Perl pat-       the code can refer to the expression itself. A Perl  pattern
2811       tern  to  solve  the parentheses problem can be created like       to solve the parentheses problem can be created like this:
      this:  
2812    
2813         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2814    
2815       The (?p{...}) item interpolates Perl code at run  time,  and       The (?p{...}) item interpolates Perl code at run  time,  and
2816       in  this  case refers recursively to the pattern in which it       in  this  case refers recursively to the pattern in which it
2817       appears. Obviously, PCRE cannot support the interpolation of       appears. Obviously, PCRE cannot support the interpolation of
2818       Perl  code.  Instead,  the special item (?R) is provided for       Perl  code.  Instead,  it  supports  some special syntax for
2819       the specific case of recursion. This PCRE pattern solves the       recursion of the entire pattern,  and  also  for  individual
2820       parentheses  problem (assume the PCRE_EXTENDED option is set       subpattern recursion.
2821       so that white space is ignored):  
2822         The special item that consists of (? followed  by  a  number
2823         greater  than  zero and a closing parenthesis is a recursive
2824         call of the subpattern of the given number, provided that it
2825         occurs inside that subpattern. (If not, it is a "subroutine"
2826         call, which is described in the next section.)  The  special
2827         item  (?R) is a recursive call of the entire regular expres-
2828         sion.
2829    
2830         For example, this PCRE pattern solves the nested parentheses
2831         problem  (assume  the  PCRE_EXTENDED  option  is set so that
2832         white space is ignored):
2833    
2834         \( ( (?>[^()]+) | (?R) )* \)         \( ( (?>[^()]+) | (?R) )* \)
2835    
2836       First it matches an opening parenthesis. Then it matches any       First it matches an opening parenthesis. Then it matches any
2837       number  of substrings which can either be a sequence of non-       number  of substrings which can either be a sequence of non-
2838       parentheses, or a recursive  match  of  the  pattern  itself       parentheses, or a recursive  match  of  the  pattern  itself
2839       (i.e. a correctly parenthesized substring). Finally there is       (that is, a correctly  parenthesized  substring).  Finally
2840       a closing parenthesis.       there is a closing parenthesis.
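
     The structure of the pattern can be mirrored  by  a  small
     hand-written  recursive function. This is only a sketch of
     what the pattern means, not of how PCRE matches it: Python's
     re  module has no (?R) item, and the name match_nested is a
     hypothetical helper introduced for illustration.

```python
def match_nested(subject, pos=0):
    """Return the offset just past a balanced group at pos, or -1."""
    # \(  : an opening parenthesis is required
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # ( ... | ... )*  : any number of the two alternatives
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # (?R) : recursive match of the whole pattern
            pos = match_nested(subject, pos)
            if pos == -1:
                return -1
        else:
            # (?>[^()]+) : swallow a whole run of non-parentheses
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    # \)  : a closing parenthesis is required
    if pos < len(subject):
        return pos + 1
    return -1


print(match_nested("(ab(cd)ef)"))
print(match_nested("(a(b)"))
```

     The function returns the offset just past the closing  pa-
     renthesis when the group at the given position is balanced,
     and -1 otherwise.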
2841    
2842         If this were part of a larger pattern, you would not want to
2843         recurse the entire pattern, so instead you could use this:
2844    
2845           ( \( ( (?>[^()]+) | (?1) )* \) )
2846    
2847         We have put the pattern into  parentheses,  and  caused  the
2848         recursion  to refer to them instead of the whole pattern. In
2849         a larger pattern, keeping track of parenthesis  numbers  can
2850         be   tricky.   It  may  be  more  convenient  to  use  named
2851         parentheses instead. For this, PCRE uses (?P>name), which is
2852         an  extension  to the Python syntax that PCRE uses for named
2853         parentheses (Perl does not provide  named  parentheses).  We
2854         could rewrite the above example as follows:
2855    
2856           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2857    
2858       This particular example pattern  contains  nested  unlimited       This particular example pattern  contains  nested  unlimited
2859       repeats, and so the use of a once-only subpattern for match-       repeats,  and  so  the  use  of atomic grouping for matching
2860       ing strings of non-parentheses is  important  when  applying       strings of non-parentheses is important  when  applying  the
2861       the  pattern to strings that do not match. For example, when       pattern to strings that do not match. For example, when this
2862       it is applied to       pattern is applied to
2863    
2864         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2865    
2866       it yields "no match" quickly. However, if a  once-only  sub-       it yields "no match" quickly. However, if atomic grouping is
2867       pattern  is  not  used,  the match runs for a very long time       not used, the match runs for a very long time indeed because
2868       indeed because there are so many different ways the + and  *       there are so many different ways the +  and  *  repeats  can
2869       repeats  can carve up the subject, and all have to be tested       carve  up  the  subject,  and  all  have to be tested before
2870       before failure can be reported.       failure can be reported.
2871         At the end of a match, the values set for any capturing sub-
2872       The values set for any capturing subpatterns are those  from       patterns are those from the outermost level of the recursion
2873       the outermost level of the recursion at which the subpattern       at which the subpattern value is set.  If you want to obtain
2874       value is set. If the pattern above is matched against       intermediate  values,  a  callout  function can be used (see
2875         below and the pcrecallout  documentation).  If  the  pattern
2876         above is matched against
2877    
2878         (ab(cd)ef)         (ab(cd)ef)
2879    
# Line 2001  RECURSIVE PATTERNS Line 2883  RECURSIVE PATTERNS
2883    
2884         \( ( ( (?>[^()]+) | (?R) )* ) \)         \( ( ( (?>[^()]+) | (?R) )* ) \)
2885            ^                        ^            ^                        ^
2886            ^                        ^ the string they  capture  is            ^                        ^
      "ab(cd)ef",  the  contents  of the top level parentheses. If  
      there are more than 15 capturing parentheses in  a  pattern,  
      PCRE  has  to  obtain  extra  memory  to store data during a  
      recursion, which it does by using  pcre_malloc,  freeing  it  
      via  pcre_free  afterwards. If no memory can be obtained, it  
      saves data for the first 15 capturing parentheses  only,  as  
      there is no way to give an out-of-memory error from within a  
      recursion.  
2887    
2888         the string they capture is "ab(cd)ef", the contents  of  the
2889         top  level  parentheses. If there are more than 15 capturing
2890         parentheses in a pattern, PCRE has to obtain extra memory to
2891         store  data  during  a  recursion,  which  it  does by using
2892         pcre_malloc, freeing it  via  pcre_free  afterwards.  If  no
2893         memory   can   be   obtained,   the  match  fails  with  the
2894         PCRE_ERROR_NOMEMORY error.
2895    
2896         Do not confuse the (?R) item with the condition  (R),  which
2897         tests  for  recursion.  Consider this pattern, which matches
2898         text in angle brackets, allowing for arbitrary nesting. Only
2899         digits are allowed in nested brackets (that is, when recurs-
2900         ing), whereas any characters  are  permitted  at  the  outer
2901         level.
2902    
2903           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2904    
2905         In this pattern, (?(R) is the start of a conditional subpat-
2906         tern,  with two different alternatives for the recursive and
2907         non-recursive cases. The (?R) item is the  actual  recursive
2908         call.
2909    
2910    
2911    SUBPATTERNS AS SUBROUTINES
2912    
2913         If the syntax for a recursive subpattern  reference  (either
2914         by  number  or  by  name) is used outside the parentheses to
2915         which it refers, it operates like a subroutine in a program-
2916         ming  language. An earlier example pointed out that the pat-
2917         tern
2918    
2919           (sens|respons)e and \1ibility
2920    
2921  PERFORMANCE       matches "sense and sensibility" and "response and  responsi-
2922       Certain items that may appear in patterns are more efficient       bility",  but not "sense and responsibility". If instead the
2923       than  others.  It is more efficient to use a character class       pattern
2924       like [aeiou] than a set of alternatives such as (a|e|i|o|u).  
2925       In  general,  the  simplest  construction  that provides the         (sens|respons)e and (?1)ibility
2926       required behaviour is usually the  most  efficient.  Jeffrey  
2927       Friedl's  book contains a lot of discussion about optimizing       is used, it does match "sense and responsibility" as well as
2928       regular expressions for efficient performance.       the other two strings. Such references must, however, follow
2929         the subpattern to which they refer.
2930       When a pattern begins with .* and the PCRE_DOTALL option  is  
2931       set,  the  pattern  is implicitly anchored by PCRE, since it  
2932       can match only at the start of a subject string. However, if  CALLOUTS
2933       PCRE_DOTALL  is not set, PCRE cannot make this optimization,  
2934       because the . metacharacter does not then match  a  newline,       Perl has a  feature  whereby  using  the  sequence  (?{...})
2935       and if the subject string contains newlines, the pattern may       causes  arbitrary  Perl  code  to be obeyed in the middle of
2936       match from the character immediately following one  of  them       matching a  regular  expression.  This  makes  it  possible,
2937       instead of from the very start. For example, the pattern       amongst  other  things, to extract different substrings that
2938         match the same pair of parentheses when there is  a  repeti-
2939         tion.
2940    
2941         PCRE provides a similar feature, but  of  course  it  cannot
2942         obey  arbitrary  Perl code. The feature is called "callout".
2943         The caller of PCRE provides an external function by  putting
2944         its  entry  point  in  the global variable pcre_callout.  By
2945         default, this variable contains  NULL,  which  disables  all
2946         calling out.
2947    
2948         Within a regular expression, (?C) indicates  the  points  at
2949         which  the external function is to be called. If you want to
2950         identify different callout points, you can put a number less
2951         than 256 after the letter C. The default value is zero.  For
2952         example, this pattern has two callout points:
2953    
2954           (?C1)9abc(?C2)def
2955    
2956         During matching, when PCRE  reaches  a  callout  point  (and
2957         pcre_callout is set), the external function is called. It is
2958         provided with the number of the  callout,  and,  optionally,
2959         one  item  of  data  originally  supplied  by  the caller of
2960         pcre_exec(). The callout  function  may  cause  matching  to
2961         backtrack,  or to fail altogether. A complete description of
2962         the interface to the callout function is given in the  pcre-
2963         callout documentation.
2964    
2965    Last updated: 03 February 2003
2966    Copyright (c) 1997-2003 University of Cambridge.
2967    -----------------------------------------------------------------------------
2968    
2969    NAME
2970         PCRE - Perl-compatible regular expressions
2971    
2972         (.*) second  
2973    PCRE PERFORMANCE
2974    
2975         Certain items that may appear in regular expression patterns
2976         are  more efficient than others. It is more efficient to use
2977         a character class like [aeiou] than a  set  of  alternatives
2978         such  as  (a|e|i|o|u). In general, the simplest construction
2979         that provides the required behaviour  is  usually  the  most
2980         efficient.  Jeffrey  Friedl's book contains a lot of discus-
2981         sion about optimizing regular expressions for efficient per-
2982         formance.
2983    
2984         When a pattern begins with .*  not  in  parentheses,  or  in
2985         parentheses that are not the subject of a backreference, and
2986         the PCRE_DOTALL option is set,  the  pattern  is  implicitly
2987         anchored  by PCRE, since it can match only at the start of a
2988         subject string. However, if PCRE_DOTALL  is  not  set,  PCRE
2989         cannot  make  this optimization, because the . metacharacter
2990         does not then match a newline, and  if  the  subject  string
2991         contains  newlines, the pattern may match from the character
2992         immediately following one of them instead of from  the  very
2993         start. For example, the pattern
2994    
2995           .*second
2996    
2997       matches the subject "first\nand second" (where \n stands for       matches the subject "first\nand second" (where \n stands for
2998       a newline character) with the first captured substring being       a newline character), with the match starting at the seventh
2999       "and". In order to do this, PCRE  has  to  retry  the  match       character. In order to do this, PCRE has to retry the  match
3000       starting after every newline in the subject.       starting after every newline in the subject.
3001    
3002       If you are using such a pattern with subject strings that do       If you are using such a pattern with subject strings that do
# Line 2058  PERFORMANCE Line 3019  PERFORMANCE
3019       that  the entire match is going to fail, PCRE has in princi-       that  the entire match is going to fail, PCRE has in princi-
3020       ple to try every possible variation, and this  can  take  an       ple to try every possible variation, and this  can  take  an
3021       extremely long time.       extremely long time.
   
3022       An optimization catches some of the more simple  cases  such       An optimization catches some of the more simple  cases  such
3023       as       as
3024    
# Line 2078  PERFORMANCE Line 3038  PERFORMANCE
3038       whereas the latter takes an appreciable  time  with  strings       whereas the latter takes an appreciable  time  with  strings
3039       longer than about 20 characters.       longer than about 20 characters.
3040    
3041    Last updated: 03 February 2003
3042    Copyright (c) 1997-2003 University of Cambridge.
3043    -----------------------------------------------------------------------------
3044    
3045    NAME
3046         PCRE - Perl-compatible regular expressions.
3047    
 UTF-8 SUPPORT  
      Starting at release 3.3, PCRE has some support for character  
      strings encoded in the UTF-8 format. This is incomplete, and  
      is regarded as experimental. In order to use  it,  you  must  
      configure PCRE to include UTF-8 support in the code, and, in  
      addition, you must call pcre_compile()  with  the  PCRE_UTF8  
      option flag. When you do this, both the pattern and any sub-  
      ject strings that are matched  against  it  are  treated  as  
      UTF-8  strings instead of just strings of bytes, but only in  
      the cases that are mentioned below.  
3048    
3049       If you compile PCRE with UTF-8 support, but do not use it at  SYNOPSIS OF POSIX API
3050       run  time,  the  library will be a bit bigger, but the addi-       #include <pcreposix.h>
      tional run time overhead is limited to testing the PCRE_UTF8  
      flag in several places, so should not be very large.  
3051    
3052       PCRE assumes that the strings  it  is  given  contain  valid       int regcomp(regex_t *preg, const char *pattern,
3053       UTF-8  codes. It does not diagnose invalid UTF-8 strings. If            int cflags);
      you pass invalid UTF-8 strings  to  PCRE,  the  results  are  
      undefined.  
3054    
3055       Running with PCRE_UTF8 set causes these changes in  the  way       int regexec(regex_t *preg, const char *string,
3056       PCRE works:            size_t nmatch, regmatch_t pmatch[], int eflags);
3057    
3058       1. In a pattern, the  escape  sequence  \x{...},  where  the       size_t regerror(int errcode, const regex_t *preg,
3059       contents of the braces is a string of hexadecimal digits, is            char *errbuf, size_t errbuf_size);
      interpreted as a UTF-8 character whose code  number  is  the  
      given   hexadecimal  number,  for  example:  \x{1234}.  This  
      inserts from one to six  literal  bytes  into  the  pattern,  
      using the UTF-8 encoding. If a non-hexadecimal digit appears  
      between the braces, the item is not recognized.  
   
      2. The original hexadecimal escape sequence, \xhh, generates  
      a two-byte UTF-8 character if its value is greater than 127.  
   
      3. Repeat quantifiers are NOT correctly handled if they fol-  
      low  a  multibyte character. For example, \x{100}* and \xc3+  
      do not work. If you want to repeat such characters, you must  
      enclose  them  in  non-capturing  parentheses,  for  example  
      (?:\x{100}), at present.  
3060    
3061       4. The dot metacharacter matches one UTF-8 character instead       void regfree(regex_t *preg);
      of a single byte.  
3062    
      5. Unlike literal UTF-8 characters,  the  dot  metacharacter  
      followed  by  a  repeat quantifier does operate correctly on  
      UTF-8 characters instead of single bytes.  
3063    
3064       4. Although the \x{...} escape is permitted in  a  character  DESCRIPTION
      class,  characters  whose values are greater than 255 cannot  
      be included in a class.  
3065    
3066       5. A class is matched against a UTF-8 character  instead  of       This set of functions provides a POSIX-style API to the PCRE
3067       just  a  single byte, but it can match only characters whose       regular  expression  package.  See the pcreapi documentation
3068       values are less than 256.  Characters  with  greater  values       for a description of the native API,  which  contains  addi-
3069       always fail to match a class.       tional functionality.
3070    
3071         The functions described here are just wrapper functions that
3072         ultimately  call  the  PCRE native API. Their prototypes are
3073         defined in the pcreposix.h header file, and on Unix  systems
3074         the library itself is called pcreposix.a, so can be accessed
3075         by adding -lpcreposix to the command for linking an applica-
3076         tion  which  uses them. Because the POSIX functions call the
3077         native ones, it is also necessary to add -lpcre.
3078    
3079         I have implemented only those option bits that can  be  rea-
3080         sonably  mapped  to  PCRE  native  options. In addition, the
3081         options REG_EXTENDED and  REG_NOSUB  are  defined  with  the
3082         value zero. They have no effect, but since programs that are
3083         written to the POSIX interface often use them, this makes it
3084         easier to slot in PCRE as a replacement library. Other POSIX
3085         options are not even defined.
3086    
3087         When PCRE is called via these functions, it is only the  API
3088         that is POSIX-like in style. The syntax and semantics of the
3089         regular expressions themselves are still those of Perl, sub-
3090         ject  to  the  setting of various PCRE options, as described
3091         below.
3092    
3093         The header for these functions is supplied as pcreposix.h to
3094         avoid  any  potential  clash  with other POSIX libraries. It
3095         can, of course, be renamed or aliased as regex.h,  which  is
3096         the "correct" name. It provides two structure types, regex_t
3097         for compiled internal forms, and  regmatch_t  for  returning
3098         captured  substrings.  It  also defines some constants whose
3099         names start with "REG_"; these are used for setting  options
3100         and identifying error codes.
3101    
      6. Repeated classes work correctly on multiple characters.  
3102    
3103       7. Classes containing just a single character whose value is  COMPILING A PATTERN
3104       greater than 127 (but less than 256), for example, [\x80] or  
3105       [^\x{93}], do not work because these are optimized into sin-       The function regcomp() is called to compile a  pattern  into
3106       gle  byte  matches.  In the first case, of course, the class       an  internal form. The pattern is a C string terminated by a
3107       brackets are just redundant.       binary zero, and is passed in the argument pattern. The preg
3108         argument  is  a pointer to a regex_t structure which is used
3109         as a base for storing information about the compiled expres-
3110         sion.
3111    
3112         The argument cflags is either zero, or contains one or  more
3113         of the bits defined by the following macros:
3114    
3115           REG_ICASE
3116    
3117         The PCRE_CASELESS option  is  set  when  the  expression  is
3118         passed for compilation to the native function.
3119    
3120           REG_NEWLINE
3121    
3122         The PCRE_MULTILINE option is  set  when  the  expression  is
3123         passed  for  compilation  to  the native function. Note that
3124         this  does  not  mimic  the  defined  POSIX  behaviour   for
3125         REG_NEWLINE (see the following section).
3126    
3127         In the absence of these flags, no options are passed to  the
3128         native  function.  This means that the regex is compiled with
3129         PCRE default semantics. In particular, the  way  it  handles
3130         newline  characters  in  the subject string is the Perl way,
3131         not the POSIX way. Note that setting PCRE_MULTILINE has only
3132         some  of  the effects specified for REG_NEWLINE. It does not
3133         affect the way newlines are matched by . (they aren't) or by
3134         a negative class such as [^a] (they are).
3135    
3136         The yield of regcomp() is zero on success, and non-zero oth-
3137         erwise.  The preg structure is filled in on success, and one
3138         member of the structure  is  public:  re_nsub  contains  the
3139         number  of  capturing subpatterns in the regular expression.
3140         Various error codes are defined in the header file.
3141    
3142    
3143    MATCHING NEWLINE CHARACTERS
3144    
3145         This area is not simple, because POSIX and  Perl  take  dif-
3146         ferent  views  of things.  It is not possible to get PCRE to
3147         obey POSIX semantics, but then PCRE was never intended to be
3148         a POSIX engine. The following table lists the different pos-
3149         sibilities for matching newline characters in PCRE:
3150    
3151                                   Default   Change with
3152    
3153           . matches newline          no     PCRE_DOTALL
3154           newline matches [^a]       yes    not changeable
3155           $ matches \n at end        yes    PCRE_DOLLARENDONLY
3156           $ matches \n in middle     no     PCRE_MULTILINE
3157           ^ matches \n in middle     no     PCRE_MULTILINE
3158    
3159         This is the equivalent table for POSIX:
3160    
3161                                   Default   Change with
3162    
3163           . matches newline          yes      REG_NEWLINE
3164           newline matches [^a]       yes      REG_NEWLINE
3165           $ matches \n at end        no       REG_NEWLINE
3166           $ matches \n in middle     no       REG_NEWLINE
3167           ^ matches \n in middle     no       REG_NEWLINE
3168    
3169         PCRE's behaviour is the same as Perl's, except that there is
3170         no  equivalent  for PCRE_DOLLARENDONLY in Perl. In both PCRE
3171         and Perl, there is no way  to  stop  newline  from  matching
3172         [^a].
3173    
3174         The default POSIX newline handling can be obtained  by  set-
3175         ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
3176         to make PCRE behave exactly as for the REG_NEWLINE action.
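     The POSIX table above can be checked with a small sketch. It
     is an illustration only: it assumes the system's own <regex.h>
     rather than pcreposix.h, so it shows the POSIX rows, not PCRE's
     behaviour. The helper name posix_matches is hypothetical. Built
     against pcreposix.h instead, the results would follow the PCRE
     table (for example, "." would not match a newline by default).

```c
#include <regex.h>   /* assumption: the system POSIX header, not pcreposix.h */
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches somewhere in subject,
   0 if it does not (or regexec() reports another error), and -1 if the
   pattern fails to compile. */
int posix_matches(const char *pattern, const char *subject, int cflags)
{
  regex_t re;
  int rc;

  /* REG_NOSUB: only a yes/no answer is wanted, no substring offsets */
  if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;

  rc = regexec(&re, subject, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}
```

     With the subject "a\nb", the pattern a.b is expected to match
     when cflags is zero and to stop matching with REG_NEWLINE, while
     ^b is expected to match only with REG_NEWLINE.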
3177    
3178    
3179    MATCHING A PATTERN
3180    
3181       8. Lookbehind assertions move backwards in the subject by  a       The function regexec() is called  to  match  a  pre-compiled
3182       fixed  number  of  characters  instead  of a fixed number of       pattern  preg against a given string, which is terminated by
3183       bytes. Simple cases have been tested to work correctly,  but       a zero byte, subject to the options in eflags. These can be:
3184       there may be hidden gotchas herein.  
3185           REG_NOTBOL
3186    
3187         The PCRE_NOTBOL option is set when  calling  the  underlying
3188         PCRE matching function.
3189    
3190           REG_NOTEOL
3191    
3192         The PCRE_NOTEOL option is set when  calling  the  underlying
3193         PCRE matching function.
3194    
3195         The portion of the string that was  matched,  and  also  any
3196         captured  substrings,  are returned via the pmatch argument,
3197         which points to  an  array  of  nmatch  structures  of  type
3198         regmatch_t,  containing  the  members rm_so and rm_eo. These
3199         contain the offset to the first character of each  substring
3200         and  the offset to the first character after the end of each
3201         substring, respectively.  The  0th  element  of  the  vector
3202         relates  to  the  entire portion of string that was matched;
3203         subsequent elements relate to the capturing  subpatterns  of
3204         the  regular  expression.  Unused  entries in the array have
3205         both structure members set to -1.
3206    
3207         A successful match yields a zero return; various error codes
3208         are  defined in the header file, of which REG_NOMATCH is the
3209         "expected" failure code.
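     As a minimal sketch of how the pmatch vector is read, the
     following helper compiles a pattern, runs regexec(), and copies
     out the rm_so and rm_eo offsets for one entry. The helper name
     match_offsets, the vector size of 10, and the -1 return for a
     bad group index are assumptions made for illustration. The
     sketch includes the system <regex.h> so that it builds without
     PCRE installed; to use the PCRE wrappers, include pcreposix.h
     instead and link with -lpcreposix -lpcre.

```c
#include <regex.h>   /* assumption: system header; with PCRE use <pcreposix.h> */
#include <stddef.h>

#define NGROUPS 10   /* hypothetical size chosen for the pmatch vector */

/* Hypothetical helper: on a successful match, stores the start and end
   offsets of entry `group` (0 = whole match) in *so and *eo and returns 0.
   Otherwise returns the non-zero code from regcomp() or regexec(),
   or -1 if group is outside the vector. */
int match_offsets(const char *pattern, const char *subject,
                  size_t group, int *so, int *eo)
{
  regex_t re;
  regmatch_t pmatch[NGROUPS];
  int rc;

  if (group >= NGROUPS)
    return -1;

  /* REG_EXTENDED is defined as zero by pcreposix.h, so it is harmless there */
  rc = regcomp(&re, pattern, REG_EXTENDED);
  if (rc != 0)
    return rc;

  rc = regexec(&re, subject, NGROUPS, pmatch, 0);
  if (rc == 0)
    {
    *so = (int)pmatch[group].rm_so;   /* offset of first character */
    *eo = (int)pmatch[group].rm_eo;   /* offset just past the end  */
    }

  regfree(&re);
  return rc;
}
```

     For the pattern (cat|dog) sat and the subject "the cat sat on
     the mat", entry 0 should cover offsets 4 to 11 and entry 1
     offsets 4 to 7. A subpattern that takes no part in the match
     reports -1 for both members.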
3210    
3211    
3212    ERROR MESSAGES
3213    
3214         The regerror()  function  maps  a  non-zero  errorcode  from
3215         either  regcomp()  or  regexec()  to a printable message. If
3216         preg is not NULL, the error should have arisen from the  use
3217         of  that structure. A message terminated by a binary zero is
3218         placed in errbuf. The length of the message,  including  the
3219         zero,  is  limited to errbuf_size. The yield of the function
3220         is the size of buffer needed to hold the whole message.
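     Because regerror() yields the buffer size needed for the whole
     message, one common approach is to call it twice: once with a
     zero-length buffer to learn the size, then again with a buffer
     of that size. The sketch below assumes this two-call idiom and
     the system <regex.h> (swap in pcreposix.h for the PCRE
     wrappers); the helper name compile_error_message is
     hypothetical.

```c
#include <regex.h>   /* assumption: system header; with PCRE use <pcreposix.h> */
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd, zero-terminated message that
   describes why pattern failed to compile. Returns NULL if the pattern
   compiles successfully or if no memory is available. The caller frees
   the result. */
char *compile_error_message(const char *pattern)
{
  regex_t re;
  size_t needed;
  char *buf;
  int rc = regcomp(&re, pattern, REG_EXTENDED);

  if (rc == 0)
    {
    regfree(&re);   /* compiled fine: release it, nothing to report */
    return NULL;
    }

  /* First call: zero-length buffer, just ask how much space is needed */
  needed = regerror(rc, &re, NULL, 0);

  buf = malloc(needed);
  if (buf != NULL)
    regerror(rc, &re, buf, needed);   /* second call fills the buffer */
  return buf;
}
```

     A pattern with an unclosed parenthesis, such as "(unclosed",
     should produce a non-empty message; the exact wording depends
     on the library in use.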
3221    
3222    
3223    STORAGE
3224    
3225         Compiling a regular expression causes memory to be allocated
3226         and  associated  with  the preg structure. The function reg-
3227         free() frees all such memory, after which preg may no longer
3228         be used as a compiled expression.
3229    
      9. The character types  such  as  \d  and  \w  do  not  work  
      correctly  with  UTF-8  characters.  They continue to test a  
      single byte.  
3230    
3231       10. Anything not explicitly mentioned here continues to work  AUTHOR
      in bytes rather than in characters.  
3232    
3233       The following UTF-8 features of  Perl  5.6  are  not  imple-       Philip Hazel <ph10@cam.ac.uk>
3234       mented:       University Computing Service,
3235         Cambridge CB2 3QG, England.
3236    
3237       1. The escape sequence \C to match a single byte.  Last updated: 03 February 2003
3238    Copyright (c) 1997-2003 University of Cambridge.
3239    -----------------------------------------------------------------------------
3240    
3241       2. The use of Unicode tables and properties and escapes  \p,  NAME
3242       \P, and \X.       PCRE - Perl-compatible regular expressions
3243    
3244    
3245    PCRE SAMPLE PROGRAM
3246    
3247  SAMPLE PROGRAM       A simple, complete demonstration program, to get you started
3248       The code below is a simple, complete demonstration  program,       with  using  PCRE, is supplied in the file pcredemo.c in the
3249       to  get  you started with using PCRE. This code is also sup-       PCRE distribution.
      plied in the file pcredemo.c in the PCRE distribution.  
3250    
3251       The program compiles the  regular  expression  that  is  its       The program compiles the  regular  expression  that  is  its
3252       first argument, and matches it against the subject string in       first argument, and matches it against the subject string in
3253       its second argument. No options are set, and default charac-       its second argument. No PCRE options are  set,  and  default
3254       ter  tables are used. If matching succeeds, the program out-       character tables are used. If matching succeeds, the program
3255       puts the portion of the subject that matched, together  with       outputs the portion of the subject  that  matched,  together
3256       the contents of any captured substrings.       with the contents of any captured substrings.
3257    
3258         If the -g option is given on the command line,  the  program
3259         then  goes on to check for further matches of the same regu-
3260         lar expression in the same subject string. The  logic  is  a
3261         little  bit tricky because of the possibility of matching an
3262         empty string. Comments in the code explain what is going on.
3263    
3264       On a Unix system that has PCRE installed in /usr/local,  you       On a Unix system that has PCRE installed in /usr/local,  you
3265       can  compile  the demonstration program using a command like       can  compile  the demonstration program using a command like
3266       this:       this:
3267    
3268         gcc   -o    pcredemo    pcredemo.c    -I/usr/local/include         gcc -o pcredemo pcredemo.c -I/usr/local/include \
3269       -L/usr/local/lib -lpcre             -L/usr/local/lib -lpcre
3270    
3271       Then you can run simple tests like this:       Then you can run simple tests like this:
3272    
3273         ./pcredemo 'cat|dog' 'the cat sat on the mat'         ./pcredemo 'cat|dog' 'the cat sat on the mat'
3274           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3275    
3276       Note that there is a much more comprehensive  test  program,       Note that there is a much more comprehensive  test  program,
3277       called  pcretest,  which  supports  many more facilities for       called  pcretest,  which  supports  many more facilities for
3278       testing regular expressions. The pcredemo  program  is  pro-       testing  regular  expressions  and  the  PCRE  library.  The
3279       vided as a simple coding example.       pcredemo program is provided as a simple coding example.
3280    
3281       On some operating systems (e.g.  Solaris)  you  may  get  an       On some operating systems (e.g.  Solaris)  you  may  get  an
3282       error like this when you try to run pcredemo:       error like this when you try to run pcredemo:
# Line 2206  SAMPLE PROGRAM Line 3289  SAMPLE PROGRAM
3289    
3290         -R/usr/local/lib         -R/usr/local/lib
3291    
3292       to the compile command to get round this problem. Here's the       to the compile command to get round this problem.
      code:  
3293    
3294         #include <stdio.h>  Last updated: 28 January 2003
3295         #include <string.h>  Copyright (c) 1997-2003 University of Cambridge.
3296         #include <pcre.h>  -----------------------------------------------------------------------------
   
        #define OVECCOUNT 30    /* should be a multiple of 3 */  
   
        int main(int argc, char **argv)  
        {  
        pcre *re;  
        const char *error;  
        int erroffset;  
        int ovector[OVECCOUNT];  
        int rc, i;  
   
        if (argc != 3)  
          {  
          printf("Two arguments required: a regex and a "  
            "subject string\n");  
          return 1;  
          }  
   
        /* Compile the regular expression in the first argument */  
   
        re = pcre_compile(  
          argv[1],     /* the pattern */  
          0,           /* default options */  
          &error,      /* for error message */  
          &erroffset,  /* for error offset */  
          NULL);       /* use default character tables */  
   
        /* Compilation failed: print the error message and exit */  
   
        if (re == NULL)  
          {  
          printf("PCRE compilation failed at offset %d: %s\n",  
            erroffset, error);  
          return 1;  
          }  
   
        /* Compilation succeeded: match the subject in the second  
           argument */  
   
        rc = pcre_exec(  
          re,          /* the compiled pattern */  
          NULL,        /* we didn't study the pattern */  
          argv[2],     /* the subject string */  
          (int)strlen(argv[2]), /* the length of the subject */  
          0,           /* start at offset 0 in the subject */  
          0,           /* default options */  
          ovector,     /* vector for substring information */  
          OVECCOUNT);  /* number of elements in the vector */  
   
        /* Matching failed: handle error cases */  
   
        if (rc < 0)  
          {  
          switch(rc)  
            {  
            case PCRE_ERROR_NOMATCH: printf("No match\n"); break;  
            /*  
            Handle other special cases if you like  
            */  
            default: printf("Matching error %d\n", rc); break;  
            }  
          return 1;  
          }  
   
        /* Match succeeded */  
   
        printf("Match succeeded\n");  
   
        /* The output vector wasn't big enough */  
   
        if (rc == 0)  
          {  
          rc = OVECCOUNT/3;  
          printf("ovector only has room for %d captured "  
            "substrings\n", rc - 1);  
          }  
   
        /* Show substrings stored in the output vector */  
   
        for (i = 0; i < rc; i++)  
          {  
          char *substring_start = argv[2] + ovector[2*i];  
          int substring_length = ovector[2*i+1] - ovector[2*i];  
          printf("%2d: %.*s\n", i, substring_length,  
            substring_start);  
          }  
   
        return 0;  
        }  
   
   
   
 AUTHOR  
      Philip Hazel <ph10@cam.ac.uk>  
      University Computing Service,  
      New Museums Site,  
      Cambridge CB2 3QG, England.  
      Phone: +44 1223 334714  
3297    
      Last updated: 15 August 2001  
      Copyright (c) 1997-2001 University of Cambridge.  
