Diff of /code/trunk/doc/pcre.txt

revision 53 by nigel, Sat Feb 24 21:39:42 2007 UTC revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC
# Line 1  Line 1 
1    This file contains a concatenation of the PCRE man pages, converted to plain
2    text format for ease of searching with a text editor, or for use on systems
3    that do not have a man page processor. The small individual files that give
4    synopses of each function in the library have not been included. There are
5    separate text files for the pcregrep and pcretest commands.
6    -----------------------------------------------------------------------------
7    
8    PCRE(3)                                                                PCRE(3)
9    
10    
11    
12  NAME  NAME
13       pcre - Perl-compatible regular expressions.         PCRE - Perl-compatible regular expressions
14    
15    DESCRIPTION
16    
17           The  PCRE  library is a set of functions that implement regular expres-
18           sion pattern matching using the same syntax and semantics as Perl, with
19           just  a  few  differences.  The current implementation of PCRE (release
20           4.x) corresponds approximately with Perl  5.8,  including  support  for
21           UTF-8  encoded  strings.   However,  this  support has to be explicitly
22           enabled; it is not the default.
23    
24           PCRE is written in C and released as a C library. However, a number  of
25           people  have  written  wrappers  and interfaces of various kinds. A C++
26           class is included in these contributions, which can  be  found  in  the
27           Contrib directory at the primary FTP site, which is:
28    
29           ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
30    
31           Details  of  exactly which Perl regular expression features are and are
32           not supported by PCRE are given in separate documents. See the pcrepat-
33           tern and pcrecompat pages.
34    
35           Some  features  of  PCRE can be included, excluded, or changed when the
36           library is built. The pcre_config() function makes it  possible  for  a
37           client  to  discover  which features are available. Documentation about
38           building PCRE for various operating systems can be found in the  README
39           file in the source distribution.
40    
41    
42    USER DOCUMENTATION
43    
44           The user documentation for PCRE has been split up into a number of dif-
45           ferent sections. In the "man" format, each of these is a separate  "man
46           page".  In  the  HTML  format, each is a separate page, linked from the
47           index page. In the plain text format, all  the  sections  are  concate-
48           nated, for ease of searching. The sections are as follows:
49    
50             pcre              this document
51             pcreapi           details of PCRE's native API
52             pcrebuild         options for building PCRE
53             pcrecallout       details of the callout feature
54             pcrecompat        discussion of Perl compatibility
55             pcregrep          description of the pcregrep command
56             pcrepattern       syntax and semantics of supported
57                                 regular expressions
58             pcreperform       discussion of performance issues
59             pcreposix         the POSIX-compatible API
60             pcresample        discussion of the sample program
61             pcretest          the pcretest testing command
62    
63           In  addition,  in the "man" and HTML formats, there is a short page for
64           each library function, listing its arguments and results.
65    
66    
67    LIMITATIONS
68    
69           There are some size limitations in PCRE but it is hoped that they  will
70           never in practice be relevant.
71    
72  SYNOPSIS         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
73       #include <pcre.h>         is compiled with the default internal linkage size of 2. If you want to
74           process  regular  expressions  that are truly enormous, you can compile
75           PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
76           the  source  distribution and the pcrebuild documentation for details).
77           In these cases the limit is substantially larger.  However,  the  speed
78           of execution will be slower.
79    
80           All values in repeating quantifiers must be less than 65536.  The maxi-
81           mum number of capturing subpatterns is 65535.
82    
83           There is no limit to the number of non-capturing subpatterns,  but  the
84           maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,
85           including capturing subpatterns, assertions, and other types of subpat-
86           tern, is 200.
87    
88           The  maximum  length of a subject string is the largest positive number
89           that an integer variable can hold. However, PCRE uses recursion to han-
90           dle  subpatterns  and indefinite repetition. This means that the avail-
91           able stack space may limit the size of a subject  string  that  can  be
92           processed by certain patterns.
93    
      pcre *pcre_compile(const char *pattern, int options,  
           const char **errptr, int *erroffset,  
           const unsigned char *tableptr);  
94    
95       pcre_extra *pcre_study(const pcre *code, int options,  UTF-8 SUPPORT
           const char **errptr);  
96    
97       int pcre_exec(const pcre *code, const pcre_extra *extra,         Starting  at  release  3.3,  PCRE  has  had  some support for character
98            const char *subject, int length, int startoffset,         strings encoded in the UTF-8 format. For  release  4.0  this  has  been
99            int options, int *ovector, int ovecsize);         greatly extended to cover most common requirements.
100    
101           In  order  to process  UTF-8 strings, you must build PCRE to include UTF-8
102           support in the code, and, in addition,  you  must  call  pcre_compile()
103           with  the PCRE_UTF8 option flag. When you do this, both the pattern and
104           any subject strings that are matched against it are  treated  as  UTF-8
105           strings instead of just strings of bytes.
106    
107           If  you compile PCRE with UTF-8 support, but do not use it at run time,
108           the library will be a bit bigger, but the additional run time  overhead
109           is  limited  to testing the PCRE_UTF8 flag in several places, so should
110           not be very large.
111    
112           The following comments apply when PCRE is running in UTF-8 mode:
113    
114           1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
115           subjects  are  checked for validity on entry to the relevant functions.
116           If an invalid UTF-8 string is passed, an error return is given. In some
117           situations,  you  may  already  know  that  your strings are valid, and
118           therefore want to skip these checks in order to improve performance. If
119           you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
120           PCRE assumes that the pattern or subject  it  is  given  (respectively)
121           contains  only valid UTF-8 codes. In this case, it does not diagnose an
122           invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
123           PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
124           crash.
125    
126           2. In a pattern, the escape sequence \x{...}, where the contents of the
127           braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8
128           character whose code number is the given hexadecimal number, for  exam-
129           ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,
130           the item is not recognized.  This escape sequence can be used either as
131           a literal, or within a character class.
132    
133           3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte
134           UTF-8 character if the value is greater than 127.
135    
136           4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
137           vidual bytes, for example: \x{100}{3}.
138    
139           5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a
140           single byte.
141    
142           6. The escape sequence \C can be used to match a single byte  in  UTF-8
143           mode, but its use can lead to some strange effects.
144    
145           7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
146           test characters of any code value, but the characters that PCRE  recog-
147           nizes  as  digits,  spaces,  or  word characters remain the same set as
148           before, all with values less than 256.
149    
150           8. Case-insensitive matching applies only to  characters  whose  values
151           are  less  than  256.  PCRE  does  not support the notion of "case" for
152           higher-valued characters.
153    
154       int pcre_copy_substring(const char *subject, int *ovector,         9. PCRE does not support the use of Unicode tables  and  properties  or
155            int stringcount, int stringnumber, char *buffer,         the Perl escapes \p, \P, and \X.
           int buffersize);  
156    
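Items 2 and 3 in the list above depend on how a character's code number maps to UTF-8 bytes: values below 128 occupy one byte, and values from 128 to 2047 occupy two, which is why \xhh with a value greater than 127 matches a two-byte character. The following is a minimal sketch of that mapping, not PCRE's own code. The function name utf8_encode is hypothetical, and the sketch assumes the modern code point range (up to 0x10FFFF) and does not reject surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper: encode the code point cp as UTF-8 into out[].
   Returns the number of bytes written (1 to 4), or 0 if cp is larger
   than 0x10FFFF. Surrogate code points are not checked in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80UL) {                 /* 7 bits: a single byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {              /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {             /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                          /* out of range for this sketch */
}
```

With this mapping, the value 0xE9 becomes the two bytes C3 A9, and the value 0x1234 used in the \x{1234} example becomes the three bytes E1 88 B4.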
      int pcre_get_substring(const char *subject, int *ovector,  
           int stringcount, int stringnumber,  
           const char **stringptr);  
157    
158       int pcre_get_substring_list(const char *subject,  AUTHOR
           int *ovector, int stringcount, const char ***listptr);  
159    
160       void pcre_free_substring(const char *stringptr);         Philip Hazel <ph10@cam.ac.uk>
161           University Computing Service,
162           Cambridge CB2 3QG, England.
163           Phone: +44 1223 334714
164    
165    Last updated: 20 August 2003
166    Copyright (c) 1997-2003 University of Cambridge.
167    -----------------------------------------------------------------------------
168    
169       void pcre_free_substring_list(const char **stringptr);  PCRE(3)                                                                PCRE(3)
170    
      const unsigned char *pcre_maketables(void);  
171    
      int pcre_fullinfo(const pcre *code, const pcre_extra *extra,  
           int what, void *where);  
172    
173       int pcre_info(const pcre *code, int *optptr, *firstcharptr);  NAME
174           PCRE - Perl-compatible regular expressions
175    
176       char *pcre_version(void);  PCRE BUILD-TIME OPTIONS
177    
178       void *(*pcre_malloc)(size_t);         This  document  describes  the  optional  features  of PCRE that can be
179           selected when the library is compiled. They are all selected, or  dese-
180           lected,  by  providing  options  to  the  configure script which is run
181           before the make command. The complete list  of  options  for  configure
182           (which  includes the standard ones such as the selection of the instal-
183           lation directory) can be obtained by running
184    
185             ./configure --help
186    
187           The following sections describe certain options whose names begin  with
188           --enable  or  --disable. These settings specify changes to the defaults
189           for the configure command. Because of the  way  that  configure  works,
190           --enable  and  --disable  always  come  in  pairs, so the complementary
191           option always exists as well, but as it specifies the  default,  it  is
192           not described.
193    
      void (*pcre_free)(void *);  
194    
195    UTF-8 SUPPORT
196    
197           To build PCRE with support for UTF-8 character strings, add
198    
199             --enable-utf8
200    
201  DESCRIPTION         to  the  configure  command.  Of  itself, this does not make PCRE treat
202       The PCRE library is a set of functions that implement  regu-         strings as UTF-8. As well as compiling PCRE with this option, you  also
203       lar  expression  pattern  matching using the same syntax and         have  have to set the PCRE_UTF8 option when you call the pcre_compile()
204       semantics as Perl  5,  with  just  a  few  differences  (see         function.
205    
206       below).  The  current  implementation  corresponds  to  Perl  
207       5.005, with some additional features  from  later  versions.  CODE VALUE OF NEWLINE
208       This  includes  some  experimental,  incomplete  support for  
209       UTF-8 encoded strings. Details of exactly what is  and  what         By default, PCRE treats character 10 (linefeed) as the newline  charac-
210       is not supported are given below.         ter. This is the normal newline character on Unix-like systems. You can
211           compile PCRE to use character 13 (carriage return) instead by adding
212       PCRE has its own native API,  which  is  described  in  this  
213       document.  There  is  also  a  set of wrapper functions that           --enable-newline-is-cr
214       correspond to the POSIX regular expression API.   These  are  
215       described in the pcreposix documentation.         to the configure command. For completeness there is  also  a  --enable-
216           newline-is-lf  option,  which explicitly specifies linefeed as the new-
217       The native API function prototypes are defined in the header         line character.
218       file  pcre.h,  and  on  Unix  systems  the library itself is  
219       called libpcre.a, so can be accessed by adding -lpcre to the  
220       command  for  linking  an  application  which  calls it. The  BUILDING SHARED AND STATIC LIBRARIES
221       header file defines the macros PCRE_MAJOR and PCRE_MINOR  to  
222       contain the major and minor release numbers for the library.         The PCRE building process uses libtool to build both shared and  static
223       Applications can use these to include support for  different         Unix  libraries by default. You can suppress one of these by adding one
224       releases.         of
225    
226       The functions pcre_compile(), pcre_study(), and  pcre_exec()           --disable-shared
227       are  used  for compiling and matching regular expressions. A           --disable-static
228       sample program that demonstrates the simplest way  of  using  
229       them  is  given  in the file pcredemo.c. The last section of         to the configure command, as required.
230       this man page describes how to run it.  
231    
232       The functions  pcre_copy_substring(),  pcre_get_substring(),  POSIX MALLOC USAGE
233       and  pcre_get_substring_list() are convenience functions for  
234       extracting  captured  substrings  from  a  matched   subject         When PCRE is called through the  POSIX  interface  (see  the  pcreposix
235       string; pcre_free_substring() and pcre_free_substring_list()         documentation),  additional working storage is required for holding the
236       are also provided, to free the  memory  used  for  extracted         pointers to capturing substrings because PCRE requires  three  integers
237       strings.         per  substring,  whereas  the POSIX interface provides only two. If the
238           number of expected substrings is small, the wrapper function uses space
239       The function pcre_maketables() is used (optionally) to build         on the stack, because this is faster than using malloc() for each call.
240       a  set of character tables in the current locale for passing         The default threshold above which the stack is no longer used is 10; it
241       to pcre_compile().         can be changed by adding a setting such as
242    
243       The function pcre_fullinfo() is used to find out information           --with-posix-malloc-threshold=20
244       about a compiled pattern; pcre_info() is an obsolete version  
245       which returns only some of the available information, but is         to the configure command.
246       retained   for   backwards   compatibility.    The  function  
247       pcre_version() returns a pointer to a string containing  the  
248       version of PCRE and its date of release.  LIMITING PCRE RESOURCE USAGE
249    
250       The global variables  pcre_malloc  and  pcre_free  initially         Internally,  PCRE  has a function called match() which it calls repeat-
251       contain the entry points of the standard malloc() and free()         edly (possibly recursively) when performing a  matching  operation.  By
252       functions respectively. PCRE  calls  the  memory  management         limiting  the  number of times this function may be called, a limit can
253       functions  via  these  variables,  so  a calling program can         be placed on the resources used by a single call  to  pcre_exec().  The
254       replace them if it  wishes  to  intercept  the  calls.  This         limit  can be changed at run time, as described in the pcreapi documen-
255       should be done before calling any PCRE functions.         tation. The default is 10 million, but this can be changed by adding  a
256           setting such as
257    
258             --with-match-limit=500000
259  MULTI-THREADING  
260       The PCRE functions can be used in  multi-threading  applica-         to the configure command.
261       tions, with the proviso that the memory management functions  
262       pointed to by pcre_malloc and pcre_free are  shared  by  all  
263       threads.  HANDLING VERY LARGE PATTERNS
264    
265       The compiled form of a regular  expression  is  not  altered         Within  a  compiled  pattern,  offset values are used to point from one
266       during  matching, so the same compiled pattern can safely be         part to another (for example, from an opening parenthesis to an  alter-
267       used by several threads at once.         nation  metacharacter).  By  default two-byte values are used for these
268           offsets, leading to a maximum size for a  compiled  pattern  of  around
269           64K.  This  is sufficient to handle all but the most gigantic patterns.
270           Nevertheless, some people do want to process enormous patterns,  so  it
271           is  possible  to compile PCRE to use three-byte or four-byte offsets by
272           adding a setting such as
273    
274             --with-link-size=3
275    
276           to the configure command. The value given must be 2,  3,  or  4.  Using
277           longer  offsets slows down the operation of PCRE because it has to load
278           additional bytes when handling them.
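The trade-off described above can be illustrated with a small sketch of how an offset might be stored in 2, 3, or 4 bytes. This is not PCRE's internal code: the names max_link_offset, put_link, and get_link are hypothetical, and the most-significant-byte-first order is an assumption made for the example. It shows why two bytes cap an offset at 65535 (around 64K) and why longer links mean more bytes to load per offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: largest offset that fits in link_size bytes (2, 3, or 4). */
unsigned long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    if (link_size == 4)
        return 0xFFFFFFFFUL;           /* avoid shifting by the full width */
    return (1UL << (8 * link_size)) - 1UL;
}

/* Hypothetical: store offset in link_size bytes, most significant first
   (the byte order is an assumption of this sketch). */
void put_link(unsigned char *p, int link_size, unsigned long offset)
{
    int i;
    assert(offset <= max_link_offset(link_size));
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(offset & 0xFF);
        offset >>= 8;
    }
}

/* Hypothetical: read back an offset; one load per byte of link size. */
unsigned long get_link(const unsigned char *p, int link_size)
{
    unsigned long offset = 0;
    int i;
    for (i = 0; i < link_size; i++)
        offset = (offset << 8) | p[i];
    return offset;
}
```

An offset of 70000 does not fit in a two-byte link but does fit in a three-byte one, at the cost of one extra byte read each time the link is followed.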
279    
280           If you build PCRE with an increased link size, test 2 (and  test  5  if
281           you  are using UTF-8) will fail. Part of the output of these tests is a
282           representation of the compiled pattern, and this changes with the  link
283           size.
284    
285    
286    AVOIDING EXCESSIVE STACK USAGE
287    
288           PCRE  implements  backtracking while matching by making recursive calls
289           to an internal function called match(). In environments where the  size
290           of the stack is limited, this can severely limit PCRE's operation. (The
291           Unix environment does not usually suffer from this problem.) An  alter-
292           native  approach  that  uses  memory  from  the  heap to remember data,
293           instead of using recursive function calls, has been implemented to work
294           round  this  problem. If you want to build a version of PCRE that works
295           this way, add
296    
297             --disable-stack-for-recursion
298    
299           to the configure command. With this configuration, PCRE  will  use  the
300           pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory
301           management functions. Separate functions are provided because the usage
302           is very predictable: the block sizes requested are always the same, and
303           the blocks are always freed in reverse order. A calling  program  might
304           be  able  to implement optimized functions that perform better than the
305           standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
306           slowly when built in this way.
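Because the requests are always the same size and are freed in reverse order, a caller could supply a very simple pair of functions. The following is a minimal sketch under those stated assumptions, not code from the PCRE distribution. The names frame_malloc, frame_free, frame_cached, and MAX_CACHED are hypothetical, and the sketch is not thread-safe. It keeps freed blocks in a small cache and hands them back on the next request instead of calling malloc() again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CACHED 64                  /* hypothetical cache capacity */

static void  *cached[MAX_CACHED];      /* freed blocks kept for reuse */
static size_t cached_count = 0;
static size_t frame_size = 0;          /* block size seen on first request */

/* Hypothetical replacement with the same signature as pcre_stack_malloc. */
void *frame_malloc(size_t size)
{
    if (frame_size == 0)
        frame_size = size;             /* first request fixes the size */
    assert(size == frame_size);        /* relies on the same-size property */
    if (cached_count > 0)
        return cached[--cached_count]; /* reuse the most recently freed block */
    return malloc(size);
}

/* Hypothetical replacement with the same signature as pcre_stack_free. */
void frame_free(void *block)
{
    if (block == NULL)
        return;
    if (cached_count < MAX_CACHED)
        cached[cached_count++] = block; /* keep for the next request */
    else
        free(block);                    /* cache full: return to the system */
}

/* Number of blocks currently held in the cache. */
size_t frame_cached(void)
{
    return cached_count;
}
```

In a program linked against a PCRE built with --disable-stack-for-recursion, these could be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before any matching is done.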
307    
308    
309    USING EBCDIC CODE
310    
311           PCRE  assumes  by  default that it will run in an environment where the
312           character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE
313           can, however, be compiled to run in an EBCDIC environment by adding
314    
315             --enable-ebcdic
316    
317           to the configure command.
318    
319    Last updated: 09 December 2003
320    Copyright (c) 1997-2003 University of Cambridge.
321    -----------------------------------------------------------------------------
322    
323    PCRE(3)                                                                PCRE(3)
324    
325    
326    
327    NAME
328           PCRE - Perl-compatible regular expressions
329    
330    SYNOPSIS OF PCRE API
331    
332           #include <pcre.h>
333    
334           pcre *pcre_compile(const char *pattern, int options,
335                const char **errptr, int *erroffset,
336                const unsigned char *tableptr);
337    
338           pcre_extra *pcre_study(const pcre *code, int options,
339                const char **errptr);
340    
341           int pcre_exec(const pcre *code, const pcre_extra *extra,
342                const char *subject, int length, int startoffset,
343                int options, int *ovector, int ovecsize);
344    
345           int pcre_copy_named_substring(const pcre *code,
346                const char *subject, int *ovector,
347                int stringcount, const char *stringname,
348                char *buffer, int buffersize);
349    
350           int pcre_copy_substring(const char *subject, int *ovector,
351                int stringcount, int stringnumber, char *buffer,
352                int buffersize);
353    
354           int pcre_get_named_substring(const pcre *code,
355                const char *subject, int *ovector,
356                int stringcount, const char *stringname,
357                const char **stringptr);
358    
359           int pcre_get_stringnumber(const pcre *code,
360                const char *name);
361    
362           int pcre_get_substring(const char *subject, int *ovector,
363                int stringcount, int stringnumber,
364                const char **stringptr);
365    
366           int pcre_get_substring_list(const char *subject,
367                int *ovector, int stringcount, const char ***listptr);
368    
369           void pcre_free_substring(const char *stringptr);
370    
371           void pcre_free_substring_list(const char **stringptr);
372    
373           const unsigned char *pcre_maketables(void);
374    
375           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
376                int what, void *where);
377    
378           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
379    
380           int pcre_config(int what, void *where);
381    
382           char *pcre_version(void);
383    
384           void *(*pcre_malloc)(size_t);
385    
386           void (*pcre_free)(void *);
387    
388           void *(*pcre_stack_malloc)(size_t);
389    
390           void (*pcre_stack_free)(void *);
391    
392           int (*pcre_callout)(pcre_callout_block *);
393    
394    
395    PCRE API
396    
397           PCRE has its own native API, which is described in this document. There
398           is also a set of wrapper functions that correspond to the POSIX regular
399           expression API.  These are described in the pcreposix documentation.
400    
401           The  native  API  function  prototypes  are  defined in the header file
402           pcre.h, and on Unix systems the library itself is called libpcre.a,  so
403           can be accessed by adding -lpcre to the command for linking an applica-
404           tion which calls it. The header file defines the macros PCRE_MAJOR  and
405           PCRE_MINOR  to  contain  the  major  and  minor release numbers for the
406           library. Applications can use these to include  support  for  different
407           releases.
408    
409           The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used
410           for compiling and matching regular expressions. A sample  program  that
411           demonstrates  the simplest way of using them is given in the file pcre-
412           demo.c. The pcresample documentation describes how to run it.
413    
414           There are convenience functions for extracting captured substrings from
415           a matched subject string. They are:
416    
417             pcre_copy_substring()
418             pcre_copy_named_substring()
419             pcre_get_substring()
420             pcre_get_named_substring()
421             pcre_get_substring_list()
422    
423           pcre_free_substring() and pcre_free_substring_list() are also provided,
424           to free the memory used for extracted strings.
425    
426           The function pcre_maketables() is used (optionally) to build a  set  of
427           character tables in the current locale for passing to pcre_compile().
428    
429           The  function  pcre_fullinfo()  is used to find out information about a
430           compiled pattern; pcre_info() is an obsolete version which returns only
431           some  of  the available information, but is retained for backwards com-
432           patibility.  The function pcre_version() returns a pointer to a  string
433           containing the version of PCRE and its date of release.
434    
435           The  global  variables  pcre_malloc and pcre_free initially contain the
436           entry points of the standard  malloc()  and  free()  functions  respec-
437           tively. PCRE calls the memory management functions via these variables,
438           so a calling program can replace them if it  wishes  to  intercept  the
439           calls. This should be done before calling any PCRE functions.
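The indirection described above can be sketched without PCRE itself. In the example below, lib_malloc and lib_free are hypothetical stand-ins for the pcre_malloc and pcre_free variables, declared with the same types as in the synopsis. The counting functions and install_counting_allocator are likewise made up for illustration. The library side would call through the pointers, so replacing them redirects every allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for pcre_malloc and pcre_free: they start out pointing at
   the standard functions, as the text describes. */
void *(*lib_malloc)(size_t) = malloc;
void (*lib_free)(void *) = free;

static size_t live_blocks = 0;         /* blocks obtained and not yet freed */

/* Intercepting allocator: counts successful allocations. */
static void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Intercepting free: counts releases of non-NULL blocks. */
static void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

/* Install the interceptors; do this before the library is first used. */
void install_counting_allocator(void)
{
    lib_malloc = counting_malloc;
    lib_free = counting_free;
}

size_t live_block_count(void)
{
    return live_blocks;
}
```

With the real library, the corresponding step would be assigning the replacement functions to pcre_malloc and pcre_free before the first call to pcre_compile(). A compiled pattern obtained through one pair of functions should then be released through the matching free function.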
440    
441           The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
442           indirections to memory management functions.  These  special  functions
443           are  used  only  when  PCRE is compiled to use the heap for remembering
444           data, instead of recursive function calls. This is a  non-standard  way
445           of  building  PCRE,  for  use in environments that have limited stacks.
446           Because of the greater use of memory management, it runs  more  slowly.
447           Separate  functions  are provided so that special-purpose external code
448           can be used for this case. When used, these functions are always called
449           in  a  stack-like  manner  (last obtained, first freed), and always for
450           memory blocks of the same size.
451    
452           The global variable pcre_callout initially contains NULL. It can be set
453           by  the  caller  to  a "callout" function, which PCRE will then call at
454           specified points during a matching operation. Details are given in  the
455           pcrecallout documentation.
456    
457    
458    MULTITHREADING
459    
460           The  PCRE  functions  can be used in multi-threading applications, with
461           the  proviso  that  the  memory  management  functions  pointed  to  by
462           pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
463           callout function pointed to by pcre_callout, are shared by all threads.
464    
465           The  compiled form of a regular expression is not altered during match-
466           ing, so the same compiled pattern can safely be used by several threads
467           at once.
468    
469    
470    CHECKING BUILD-TIME OPTIONS
471    
472           int pcre_config(int what, void *where);
473    
474           The  function pcre_config() makes it possible for a PCRE client to dis-
475           cover which optional features have been compiled into the PCRE library.
476           The  pcrebuild documentation has more details about these optional fea-
477           tures.
478    
479           The first argument for pcre_config() is an  integer,  specifying  which
480           information is required; the second argument is a pointer to a variable
481           into which the information is  placed.  The  following  information  is
482           available:
483    
484             PCRE_CONFIG_UTF8
485    
486           The  output is an integer that is set to one if UTF-8 support is avail-
487           able; otherwise it is set to zero.
488    
489             PCRE_CONFIG_NEWLINE
490    
491           The output is an integer that is set to the value of the code  that  is
492           used  for the newline character. It is either linefeed (10) or carriage
493           return (13), and should normally be the  standard  character  for  your
494           operating system.
495    
496             PCRE_CONFIG_LINK_SIZE
497    
498           The  output  is  an  integer that contains the number of bytes used for
499           internal linkage in compiled regular expressions. The value is 2, 3, or
500           4.  Larger  values  allow larger regular expressions to be compiled, at
501           the expense of slower matching. The default value of  2  is  sufficient
502           for  all  but  the  most massive patterns, since it allows the compiled
503           pattern to be up to 64K in size.
504    
505             PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
506    
507           The output is an integer that contains the threshold  above  which  the
508           POSIX  interface  uses malloc() for output vectors. Further details are
509           given in the pcreposix documentation.
510    
511             PCRE_CONFIG_MATCH_LIMIT
512    
513           The output is an integer that gives the default limit for the number of
514           internal  matching  function  calls in a pcre_exec() execution. Further
515           details are given with pcre_exec() below.
516    
517             PCRE_CONFIG_STACKRECURSE
518    
519           The output is an integer that is set to one if  internal  recursion  is
520           implemented  by recursive function calls that use the stack to remember
521           their state. This is the usual way that PCRE is compiled. The output is
522           zero  if PCRE was compiled to use blocks of data on the heap instead of
523           recursive  function  calls.  In  this   case,   pcre_stack_malloc   and
524           pcre_stack_free  are  called  to manage memory blocks on the heap, thus
525           avoiding the use of the stack.
526    
527    
528  COMPILING A PATTERN  COMPILING A PATTERN
529       The function pcre_compile() is called to compile  a  pattern  
530       into  an internal form. The pattern is a C string terminated         pcre *pcre_compile(const char *pattern, int options,
531       by a binary zero, and is passed in the argument  pattern.  A              const char **errptr, int *erroffset,
532       pointer  to  a  single  block of memory that is obtained via              const unsigned char *tableptr);
533       pcre_malloc is returned. This contains the compiled code and  
534       related  data.  The  pcre  type  is defined for the returned  
535       block; this is a typedef for a structure whose contents  are         The function pcre_compile() is called to  compile  a  pattern  into  an
536       not  externally  defined. It is up to the caller to free the         internal  form.  The pattern is a C string terminated by a binary zero,
537       memory when it is no longer required.         and is passed in the argument pattern. A pointer to a single  block  of
538           memory  that is obtained via pcre_malloc is returned. This contains the
539       Although the compiled code of a PCRE regex  is  relocatable,         compiled code and related data.  The  pcre  type  is  defined  for  the
540       that is, it does not depend on memory location, the complete         returned  block;  this  is a typedef for a structure whose contents are
541       pcre data block is not fully relocatable,  because  it  con-         not externally defined. It is up to the caller to free the memory  when
542       tains  a  copy of the tableptr argument, which is an address         it is no longer required.
543       (see below).  
544           Although  the compiled code of a PCRE regex is relocatable, that is, it
545       The size of a compiled pattern is  roughly  proportional  to         does not depend on memory location, the complete pcre data block is not
546       the length of the pattern string, except that each character         fully relocatable, because it contains a copy of the tableptr argument,
547       class (other than those containing just a single  character,         which is an address (see below).
548       negated  or  not)  requires 33 bytes, and repeat quantifiers  
549       with a minimum greater than one or a bounded  maximum  cause         The options argument contains independent bits that affect the compila-
550       the  relevant  portions of the compiled pattern to be repli-         tion.  It  should  be  zero  if  no  options  are required. Some of the
551       cated.         options, in particular, those that are compatible with Perl,  can  also
552           be  set and unset from within the pattern (see the detailed description
553       The options argument contains independent bits  that  affect         of regular expressions in the  pcrepattern  documentation).  For  these
554       the  compilation.  It  should  be  zero  if  no  options are         options,  the  contents  of the options argument specify their initial
555       required. Some of the options, in particular, those that are         settings at the start of compilation and execution.  The  PCRE_ANCHORED
556       compatible  with Perl, can also be set and unset from within         option can be set at the time of matching as well as at compile time.
557       the pattern (see the detailed description of regular expres-  
558       sions below). For these options, the contents of the options         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
559       argument specify  their initial settings at  the  start  of         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
560       compilation  and  execution. The PCRE_ANCHORED option can be         sets the variable pointed to by errptr to point to a textual error mes-
561       set at the time of matching as well as at compile time.         sage. The offset from the start of the pattern to the  character  where
562           the  error  was  discovered  is  placed  in  the variable pointed to by
563       If errptr is NULL, pcre_compile() returns NULL  immediately.         erroffset, which must not be NULL. If it  is,  an  immediate  error  is
564       Otherwise, if compilation of a pattern fails, pcre_compile()         given.
565       returns NULL, and sets the variable pointed to by errptr  to  
566       point  to a textual error message. The offset from the start         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
567       of  the  pattern  to  the  character  where  the  error  was         character tables which are built when it is compiled, using the default
568       discovered   is   placed  in  the  variable  pointed  to  by         C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to
569       erroffset, which must not be NULL. If it  is,  an  immediate         pcre_maketables(). See the section on locale support below.
570       error is given.  
571           This code fragment shows a typical straightforward  call  to  pcre_com-
572       If the final  argument,  tableptr,  is  NULL,  PCRE  uses  a         pile():
573       default  set  of character tables which are built when it is  
574       compiled, using the default C  locale.  Otherwise,  tableptr           pcre *re;
575       must  be  the result of a call to pcre_maketables(). See the           const char *error;
576       section on locale support below.           int erroffset;
577             re = pcre_compile(
578       This code fragment shows a typical straightforward  call  to             "^A.*Z",          /* the pattern */
579       pcre_compile():             0,                /* default options */
580               &error,           /* for error message */
581         pcre *re;             &erroffset,       /* for error offset */
582         const char *error;             NULL);            /* use default character tables */
583         int erroffset;  
584         re = pcre_compile(         The following option bits are defined:
585           "^A.*Z",          /* the pattern */  
586           0,                /* default options */           PCRE_ANCHORED
587           &error,           /* for error message */  
588           &erroffset,       /* for error offset */         If this bit is set, the pattern is forced to be "anchored", that is, it
589           NULL);            /* use default character tables */         is constrained to match only at the first matching point in the  string
590           which is being searched (the "subject string"). This effect can also be
591       The following option bits are defined in the header file:         achieved by appropriate constructs in the pattern itself, which is  the
592           only way to do it in Perl.
593         PCRE_ANCHORED  
594             PCRE_CASELESS
595       If this bit is set, the pattern is forced to be  "anchored",  
596       that is, it is constrained to match only at the start of the         If  this  bit is set, letters in the pattern match both upper and lower
597       string which is being searched (the "subject string").  This         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
598       effect can also be achieved by appropriate constructs in the         changed within a pattern by a (?i) option setting.
599       pattern itself, which is the only way to do it in Perl.  
600             PCRE_DOLLAR_ENDONLY
601         PCRE_CASELESS  
602           If  this bit is set, a dollar metacharacter in the pattern matches only
603       If this bit is set, letters in the pattern match both  upper         at the end of the subject string. Without this option,  a  dollar  also
604       and  lower  case  letters.  It  is  equivalent  to Perl's /i         matches  immediately before the final character if it is a newline (but
605       option.         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is
606           ignored if PCRE_MULTILINE is set. There is no equivalent to this option
607         PCRE_DOLLAR_ENDONLY         in Perl, and no way to set it within a pattern.
608    
609       If this bit is set, a dollar metacharacter  in  the  pattern           PCRE_DOTALL
610       matches  only at the end of the subject string. Without this  
611       option, a dollar also matches immediately before  the  final         If this bit is set, a dot metacharacter in the pattern matches all char-
612       character  if it is a newline (but not before any other new-         acters,  including  newlines.  Without  it, newlines are excluded. This
613       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if         option is equivalent to Perl's /s option, and it can be changed  within
614       PCRE_MULTILINE is set. There is no equivalent to this option         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]
615       in Perl.         always matches a newline character, independent of the setting of  this
616           option.
617         PCRE_DOTALL  
618             PCRE_EXTENDED
619       If this bit is  set,  a  dot  metacharacter in  the  pattern  
620       matches all characters, including newlines. Without it, new-         If  this  bit  is  set,  whitespace  data characters in the pattern are
621       lines are excluded. This option is equivalent to  Perl's  /s         totally ignored except  when  escaped  or  inside  a  character  class.
622       option.  A negative class such as [^a] always matches a new-         Whitespace  does  not  include the VT character (code 11). In addition,
623       line character, independent of the setting of this option.         characters between an unescaped # outside a  character  class  and  the
624           next newline character, inclusive, are also ignored. This is equivalent
625         PCRE_EXTENDED         to Perl's /x option, and it can be changed within a pattern by  a  (?x)
626           option setting.
627       If this bit is set, whitespace data characters in  the  pat-  
628       tern  are  totally  ignored  except when escaped or inside a         This  option  makes  it possible to include comments inside complicated
629       character class, and characters between an unescaped #  out-         patterns.  Note, however, that this applies only  to  data  characters.
630       side  a  character  class  and  the  next newline character,         Whitespace   characters  may  never  appear  within  special  character
631       inclusive, are also ignored. This is equivalent to Perl's /x         sequences in a pattern, for  example  within  the  sequence  (?(  which
632       option,  and  makes  it  possible to include comments inside         introduces a conditional subpattern.
633       complicated patterns. Note, however, that this applies  only  
634       to  data  characters. Whitespace characters may never appear           PCRE_EXTRA
635       within special character sequences in a pattern, for example  
636       within  the sequence (?( which introduces a conditional sub-         This  option  was invented in order to turn on additional functionality
637       pattern.         of PCRE that is incompatible with Perl, but it  is  currently  of  very
638           little  use. When set, any backslash in a pattern that is followed by a
639         PCRE_EXTRA         letter that has no special meaning  causes  an  error,  thus  reserving
640           these  combinations  for  future  expansion.  By default, as in Perl, a
641       This option was invented in  order  to  turn  on  additional         backslash followed by a letter with no special meaning is treated as  a
642       functionality of PCRE that is incompatible with Perl, but it         literal.  There  are  at  present  no other features controlled by this
643       is currently of very little use. When set, any backslash  in         option. It can also be set by a (?X) option setting within a pattern.
644       a  pattern  that is followed by a letter that has no special  
645       meaning causes an error, thus reserving  these  combinations           PCRE_MULTILINE
646       for  future  expansion.  By default, as in Perl, a backslash  
647       followed by a letter with no special meaning is treated as a         By default, PCRE treats the subject string as consisting  of  a  single
648       literal.  There  are at present no other features controlled         "line"  of  characters (even if it actually contains several newlines).
649       by this option. It can also be set by a (?X) option  setting         The "start of line" metacharacter (^) matches only at the start of  the
650       within a pattern.         string,  while  the "end of line" metacharacter ($) matches only at the
651           end of the string, or before a terminating  newline  (unless  PCRE_DOL-
652         PCRE_MULTILINE         LAR_ENDONLY is set). This is the same as Perl.
653    
654       By default, PCRE treats the subject string as consisting  of         When  PCRE_MULTILINE  is  set,  the  "start  of line" and "end of line"
655       a  single "line" of characters (even if it actually contains         constructs match immediately following or immediately before  any  new-
656       several newlines). The "start  of  line"  metacharacter  (^)         line  in the subject string, respectively, as well as at the very start
657       matches  only  at the start of the string, while the "end of         and end. This is equivalent to Perl's /m option, and it can be  changed
658       line" metacharacter ($) matches  only  at  the  end  of  the         within a pattern by a (?m) option setting. If there are no "\n" charac-
659       string,    or   before   a   terminating   newline   (unless         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,
660       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.         setting PCRE_MULTILINE has no effect.
661    
662       When PCRE_MULTILINE is set, the  "start of  line"  and  "end           PCRE_NO_AUTO_CAPTURE
663       of  line"  constructs match immediately following or immedi-  
664       ately before any newline  in  the  subject  string,  respec-         If this option is set, it disables the use of numbered capturing paren-
665       tively,  as  well  as  at  the  very  start and end. This is         theses in the pattern. Any opening parenthesis that is not followed  by
666       equivalent to Perl's /m option. If there are no "\n" charac-         ?  behaves as if it were followed by ?: but named parentheses can still
667       ters  in  a subject string, or no occurrences of ^ or $ in a         be used for capturing (and they acquire  numbers  in  the  usual  way).
668       pattern, setting PCRE_MULTILINE has no effect.         There is no equivalent of this option in Perl.
669    
670         PCRE_UNGREEDY           PCRE_UNGREEDY
671    
672       This option inverts the "greediness" of the  quantifiers  so         This  option  inverts  the "greediness" of the quantifiers so that they
673       that  they  are  not greedy by default, but become greedy if         are not greedy by default, but become greedy if followed by "?". It  is
674       followed by "?". It is not compatible with Perl. It can also         not  compatible  with Perl. It can also be set by a (?U) option setting
675       be set by a (?U) option setting within the pattern.         within the pattern.
676    
677         PCRE_UTF8           PCRE_UTF8
678    
679       This option causes PCRE to regard both the pattern  and  the         This option causes PCRE to regard both the pattern and the  subject  as
680       subject  as strings of UTF-8 characters instead of just byte         strings  of  UTF-8 characters instead of single-byte character strings.
681       strings. However, it is available  only  if  PCRE  has  been         However, it is available only if PCRE has been built to  include  UTF-8
682       built  to  include  UTF-8  support.  If not, the use of this         support.  If  not, the use of this option provokes an error. Details of
683       option provokes an error. Support for UTF-8 is new,  experi-         how this option changes the behaviour of PCRE are given in the  section
684       mental,  and incomplete.  Details of exactly what it entails         on UTF-8 support in the main pcre page.
685       are given below.  
686             PCRE_NO_UTF8_CHECK
687    
688           When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
689           automatically checked. If an invalid UTF-8 sequence of bytes is  found,
690           pcre_compile()  returns an error. If you already know that your pattern
691           is valid, and you want to skip this check for performance reasons,  you
692           can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
693           passing an invalid UTF-8 string as a pattern is undefined. It may cause
694           your  program  to  crash.  Note that there is a similar option for sup-
695           pressing the checking of subject strings passed to pcre_exec().
696    
697    
698    
699  STUDYING A PATTERN  STUDYING A PATTERN
      When a pattern is going to be  used  several  times,  it  is  
      worth  spending  more time analyzing it in order to speed up  
      the time taken for matching. The function pcre_study() takes  
      a  pointer  to a compiled pattern as its first argument, and  
      returns a pointer to a pcre_extra block (another typedef for  
      a  structure  with  hidden  contents)  containing additional  
      information  about  the  pattern;  this  can  be  passed  to  
      pcre_exec(). If no additional information is available, NULL  
      is returned.  
   
      The second argument contains option  bits.  At  present,  no  
      options  are  defined  for  pcre_study(),  and this argument  
      should always be zero.  
   
      The third argument for pcre_study() is a pointer to an error  
      message. If studying succeeds (even if no data is returned),  
      the variable it points to  is  set  to  NULL.  Otherwise  it  
      points to a textual error message.  
   
      This is a typical call to pcre_study():  
   
        pcre_extra *pe;  
        pe = pcre_study(  
          re,             /* result of pcre_compile() */  
          0,              /* no options exist */  
          &error);        /* set to NULL or points to a message */  
   
      At present, studying a  pattern  is  useful  only  for  non-  
      anchored  patterns  that do not have a single fixed starting  
      character. A  bitmap  of  possible  starting  characters  is  
      created.  
700    
701           pcre_extra *pcre_study(const pcre *code, int options,
702                const char **errptr);
703    
704           When a pattern is going to be used several times, it is worth  spending
705           more  time  analyzing it in order to speed up the time taken for match-
706           ing. The function pcre_study() takes a pointer to a compiled pattern as
707           its first argument. If studying the pattern produces additional informa-
708           tion that will help speed up matching, pcre_study() returns  a  pointer
709           to  a  pcre_extra  block,  in  which the study_data field points to the
710           results of the study.
711    
712           The  value  returned  from  pcre_study()  can be passed directly to
713           pcre_exec().  However,  the pcre_extra block also contains other fields
714           that can be set by the caller before the block  is  passed;  these  are
715           described  below.  If  studying  the pattern does not produce any addi-
716           tional information, pcre_study() returns NULL. In that circumstance, if
717           the  calling  program  wants  to  pass  some  of  the  other  fields to
718           pcre_exec(), it must set up its own pcre_extra block.
719    
720           The second argument contains option bits. At present,  no  options  are
721           defined for pcre_study(), and this argument should always be zero.
722    
723           The  third argument for pcre_study() is a pointer for an error message.
724           If studying succeeds (even if no data is  returned),  the  variable  it
725           points  to  is set to NULL. Otherwise it points to a textual error mes-
726           sage. You should therefore test the error pointer for NULL after  call-
727           ing pcre_study(), to be sure that it has run successfully.
728    
729           This is a typical call to pcre_study():
730    
731             pcre_extra *pe;
732             pe = pcre_study(
733               re,             /* result of pcre_compile() */
734               0,              /* no options exist */
735               &error);        /* set to NULL or points to a message */
736    
737           At present, studying a pattern is useful only for non-anchored patterns
738           that do not have a single fixed starting character. A bitmap of  possi-
739           ble starting characters is created.
740    
741    
742  LOCALE SUPPORT  LOCALE SUPPORT
      PCRE handles caseless matching, and determines whether char-  
      acters  are  letters, digits, or whatever, by reference to a  
      set of tables. The library contains a default set of  tables  
      which  is  created in the default C locale when PCRE is com-  
      piled.  This  is   used   when   the   final   argument   of  
      pcre_compile()  is NULL, and is sufficient for many applica-  
      tions.  
   
      An alternative set of tables can, however, be supplied. Such  
      tables  are built by calling the pcre_maketables() function,  
      which has no arguments, in the relevant locale.  The  result  
      can  then be passed to pcre_compile() as often as necessary.  
      For example, to build and use tables  that  are  appropriate  
      for  the French locale (where accented characters with codes  
      greater than 128 are treated as letters), the following code  
      could be used:  
   
        setlocale(LC_CTYPE, "fr");  
        tables = pcre_maketables();  
        re = pcre_compile(..., tables);  
   
      The  tables  are  built  in  memory  that  is  obtained  via  
      pcre_malloc.  The  pointer that is passed to pcre_compile is  
      saved with the compiled pattern, and  the  same  tables  are  
      used  via this pointer by pcre_study() and pcre_exec(). Thus  
      for any single pattern, compilation, studying  and  matching  
      all happen in the same locale, but different patterns can be  
      compiled in different locales. It is the caller's  responsi-  
      bility  to  ensure  that  the  memory  containing the tables  
      remains available for as long as it is needed.  
743    
744           PCRE  handles  caseless matching, and determines whether characters are
745           letters, digits, or whatever, by reference to a  set  of  tables.  When
746           running  in UTF-8 mode, this applies only to characters with codes less
747           than 256. The library contains a default set of tables that is  created
748           in  the  default  C locale when PCRE is compiled. This is used when the
749           final argument of pcre_compile() is NULL, and is  sufficient  for  many
750           applications.
751    
752           An alternative set of tables can, however, be supplied. Such tables are
753           built by calling the pcre_maketables() function,  which  has  no  argu-
754           ments,  in  the  relevant  locale.  The  result  can  then be passed to
755           pcre_compile() as often as necessary. For example,  to  build  and  use
756           tables that are appropriate for the French locale (where accented char-
757           acters with codes greater than 128 are treated as letters), the follow-
758           ing code could be used:
759    
760             setlocale(LC_CTYPE, "fr");
761             tables = pcre_maketables();
762             re = pcre_compile(..., tables);
763    
764           The  tables  are  built in memory that is obtained via pcre_malloc. The
765           pointer that is passed to pcre_compile is saved with the compiled  pat-
766           tern, and the same tables are used via this pointer by pcre_study() and
767           pcre_exec(). Thus, for any single pattern,  compilation,  studying  and
768           matching  all  happen in the same locale, but different patterns can be
769           compiled in different locales. It is  the  caller's  responsibility  to
770           ensure  that  the memory containing the tables remains available for as
771           long as it is needed.
772    
773    
774  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
      The pcre_fullinfo() function  returns  information  about  a  
      compiled pattern. It replaces the obsolete pcre_info() func-  
      tion, which is nevertheless retained for backwards compati-  
      bility (and is documented below).  
   
      The first argument for pcre_fullinfo() is a pointer  to  the  
      compiled  pattern.  The  second  argument  is  the result of  
      pcre_study(), or NULL if the pattern was  not  studied.  The  
      third  argument  specifies  which  piece  of  information is  
      required, while the fourth argument is a pointer to a  vari-  
      able  to receive the data. The yield of the function is zero  
      for success, or one of the following negative numbers:  
   
        PCRE_ERROR_NULL       the argument code was NULL  
                              the argument where was NULL  
        PCRE_ERROR_BADMAGIC   the "magic number" was not found  
        PCRE_ERROR_BADOPTION  the value of what was invalid  
   
      Here is a typical call of  pcre_fullinfo(),  to  obtain  the  
      length of the compiled pattern:  
   
        int rc;  
        size_t length;  
        rc = pcre_fullinfo(  
          re,               /* result of pcre_compile() */  
          pe,               /* result of pcre_study(), or NULL */  
          PCRE_INFO_SIZE,   /* what is required */  
          &length);         /* where to put the data */  
   
      The possible values for the third argument  are  defined  in  
      pcre.h, and are as follows:  
   
        PCRE_INFO_OPTIONS  
   
      Return a copy of the options with which the pattern was com-  
      piled.  The fourth argument should point to an unsigned long  
      int variable. These option bits are those specified  in  the  
      call  to  pcre_compile(),  modified  by any top-level option  
      settings  within  the   pattern   itself,   and   with   the  
      PCRE_ANCHORED  bit  forcibly  set if the form of the pattern  
      implies that it can match only at the  start  of  a  subject  
      string.  
   
        PCRE_INFO_SIZE  
   
      Return the size of the compiled pattern, that is, the  value  
      that  was  passed as the argument to pcre_malloc() when PCRE  
      was getting memory in which to place the compiled data.  The  
      fourth argument should point to a size_t variable.  
   
        PCRE_INFO_CAPTURECOUNT  
   
      Return the number of capturing subpatterns in  the  pattern.  
      The fourth argument should point to an int variable.  
   
        PCRE_INFO_BACKREFMAX  
   
      Return the number of the highest back reference in the  pat-  
      tern.  The  fourth argument should point to an int variable.  
      Zero is returned if there are no back references.  
   
        PCRE_INFO_FIRSTCHAR  
   
      Return information about the first character of any  matched  
      string,  for  a  non-anchored  pattern.  If there is a fixed  
      first   character,   e.g.   from   a   pattern    such    as  
      (cat|cow|coyote),  it  is returned in the integer pointed to  
      by where. Otherwise, if either  
   
      (a) the pattern was compiled with the PCRE_MULTILINE option,  
      and every branch starts with "^", or  
   
      (b) every  branch  of  the  pattern  starts  with  ".*"  and  
      PCRE_DOTALL is not set (if it were set, the pattern would be  
      anchored),  
   
      -1 is returned, indicating that the pattern matches only  at  
      the  start  of a subject string or after any "\n" within the  
      string. Otherwise -2 is returned.  For anchored patterns, -2  
      is returned.  
   
        PCRE_INFO_FIRSTTABLE  
   
      If the pattern was studied, and this resulted  in  the  con-  
      struction of a 256-bit table indicating a fixed set of char-  
      acters for the first character in  any  matching  string,  a  
      pointer   to  the  table  is  returned.  Otherwise  NULL  is  
      returned. The fourth argument should point  to  an  unsigned  
      char * variable.  
   
        PCRE_INFO_LASTLITERAL  
   
      For a non-anchored pattern, return the value of  the  right-  
      most  literal  character  which  must  exist  in any matched  
      string, other than at its start. The fourth argument  should  
      point  to an int variable. If there is no such character, or  
      if the pattern is anchored, -1 is returned. For example, for  
      the pattern /a\d+z\d+/ the returned value is 'z'.  
   
      The pcre_info() function is now obsolete because its  inter-  
      face  is  too  restrictive  to return all the available data  
      about  a  compiled  pattern.   New   programs   should   use  
      pcre_fullinfo()  instead.  The  yield  of pcre_info() is the  
      number of capturing subpatterns, or  one  of  the  following  
      negative numbers:  
   
        PCRE_ERROR_NULL       the argument code was NULL  
        PCRE_ERROR_BADMAGIC   the "magic number" was not found  
   
      If the optptr argument is not NULL, a copy  of  the  options  
      with which the pattern was compiled is placed in the integer  
      it points to (see PCRE_INFO_OPTIONS above).  
   
      If the pattern is not anchored and the firstcharptr argument  
      is  not  NULL, it is used to pass back information about the  
      first    character    of    any    matched    string    (see  
      PCRE_INFO_FIRSTCHAR above).  
775    
776           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
777                int what, void *where);
778    
779           The pcre_fullinfo() function returns information about a compiled  pat-
780           tern. It replaces the obsolete pcre_info() function, which is neverthe-
781           less retained for backwards compatibility (and is documented below).
782    
783           The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
784           pattern.  The second argument is the result of pcre_study(), or NULL if
785           the pattern was not studied. The third argument specifies  which  piece
786           of  information  is required, and the fourth argument is a pointer to a
787           variable to receive the data. The yield of the  function  is  zero  for
788           success, or one of the following negative numbers:
789    
790             PCRE_ERROR_NULL       the argument code was NULL
791                                   the argument where was NULL
792             PCRE_ERROR_BADMAGIC   the "magic number" was not found
793             PCRE_ERROR_BADOPTION  the value of what was invalid
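
       The return codes listed above can be turned into messages with a small
       helper. This is a minimal sketch: the function name
       fullinfo_strerror() is hypothetical, and the fallback #define values
       are the ones used by pcre.h, repeated here only so that the fragment
       compiles on its own. A real program would include pcre.h instead.

```c
/* Fallback definitions so that the sketch is self-contained; when pcre.h
   has been included, its own definitions are used instead. */
#ifndef PCRE_ERROR_NULL
#define PCRE_ERROR_NULL       (-2)
#define PCRE_ERROR_BADOPTION  (-3)
#define PCRE_ERROR_BADMAGIC   (-4)
#endif

/* Map a pcre_fullinfo() yield to a short description.
   Hypothetical helper; it is not part of the PCRE API. */
const char *fullinfo_strerror(int rc)
{
    switch (rc) {
    case 0:                    return "success";
    case PCRE_ERROR_NULL:      return "code or where was NULL";
    case PCRE_ERROR_BADMAGIC:  return "magic number not found";
    case PCRE_ERROR_BADOPTION: return "invalid value of what";
    default:                   return "unrecognized return code";
    }
}
```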
794    
795           Here  is a typical call of pcre_fullinfo(), to obtain the length of the
796           compiled pattern:
797    
798             int rc;
799             size_t length;
800             rc = pcre_fullinfo(
801               re,               /* result of pcre_compile() */
802               pe,               /* result of pcre_study(), or NULL */
803               PCRE_INFO_SIZE,   /* what is required */
804               &length);         /* where to put the data */
805    
806           The possible values for the third argument are defined in  pcre.h,  and
807           are as follows:
808    
809             PCRE_INFO_BACKREFMAX
810    
811           Return  the  number  of  the highest back reference in the pattern. The
812           fourth argument should point to an int variable. Zero  is  returned  if
813           there are no back references.
814    
815             PCRE_INFO_CAPTURECOUNT
816    
817           Return  the  number of capturing subpatterns in the pattern. The fourth
818           argument should point to an int variable.
819    
820             PCRE_INFO_FIRSTBYTE
821    
822           Return information about the first byte of any matched  string,  for  a
823           non-anchored    pattern.    (This    option    used    to   be   called
824           PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards
825           compatibility.)
826    
827           If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as
828           (cat|cow|coyote), it is returned in the integer pointed  to  by  where.
829           Otherwise, if either
830    
831           (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
832           branch starts with "^", or
833    
834           (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
835           set (if it were set, the pattern would be anchored),
836    
837           -1  is  returned, indicating that the pattern matches only at the start
838           of a subject string or after any newline within the  string.  Otherwise
839           -2 is returned. For anchored patterns, -2 is returned.
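
The three kinds of result described for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. This is only a sketch: the enum and the function name first_byte_kind are hypothetical and not part of the PCRE API; the argument is the int that pcre_fullinfo() stored via where.

```c
/* Hypothetical names for the three documented outcomes. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte itself        */
    FIRST_BYTE_LINE_START,  /* -1: matches only at start or after newline  */
    FIRST_BYTE_NONE         /* -2: no first-byte information, or anchored  */
};

/* Classify the value that PCRE_INFO_FIRSTBYTE placed in the caller's int. */
enum first_byte_kind first_byte_kind(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

For the pattern (cat|cow|coyote) the stored value would be the byte 'c', which this helper reports as FIRST_BYTE_FIXED.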
840    
841             PCRE_INFO_FIRSTTABLE
842    
843           If  the pattern was studied, and this resulted in the construction of a
844           256-bit table indicating a fixed set of bytes for the first byte in any
845           matching  string, a pointer to the table is returned. Otherwise NULL is
846           returned. The fourth argument should point to an unsigned char *  vari-
847           able.
848    
849             PCRE_INFO_LASTLITERAL
850    
851           Return  the  value of the rightmost literal byte that must exist in any
852           matched string, other than at its  start,  if  such  a  byte  has  been
853           recorded. The fourth argument should point to an int variable. If there
854           is no such byte, -1 is returned. For anchored patterns, a last  literal
855           byte  is  recorded only if it follows something of variable length. For
856           example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
857           /^a\dz\d/ the returned value is -1.
858    
859             PCRE_INFO_NAMECOUNT
860             PCRE_INFO_NAMEENTRYSIZE
861             PCRE_INFO_NAMETABLE
862    
863           PCRE  supports the use of named as well as numbered capturing parenthe-
864           ses. The names are just an additional way of identifying the  parenthe-
865           ses,  which still acquire a number. A caller that wants to extract data
866           from a named subpattern must convert the name to a number in  order  to
867           access  the  correct  pointers  in  the  output  vector (described with
868           pcre_exec() below). In order to do this, it must first use these  three
869           values to obtain the name-to-number mapping table for the pattern.
870    
871           The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
872           gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
873           of  each  entry;  both  of  these  return  an int value. The entry size
874           depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
875           a  pointer  to  the  first  entry of the table (a pointer to char). The
876           first two bytes of each entry are the number of the capturing parenthe-
877           sis,  most  significant byte first. The rest of the entry is the corre-
878           sponding name, zero terminated. The names are  in  alphabetical  order.
879           For  example,  consider  the following pattern (assume PCRE_EXTENDED is
880           set, so white space - including newlines - is ignored):
881    
882             (?P<date> (?P<year>(\d\d)?\d\d) -
883             (?P<month>\d\d) - (?P<day>\d\d) )
884    
885           There are four named subpatterns, so the table has  four  entries,  and
886           each  entry  in the table is eight bytes long. The table is as follows,
887           with non-printing bytes shown in hex, and undefined bytes shown as ??:
888    
889             00 01 d  a  t  e  00 ??
890             00 05 d  a  y  00 ?? ??
891             00 04 m  o  n  t  h  00
892             00 02 y  e  a  r  00 ??
893    
894           When writing code to extract data from named subpatterns, remember that
895           the length of each entry may be different for each compiled pattern.
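
A name-to-number lookup over a table with this layout might look like the following sketch. The function name lookup_group_number is hypothetical; table, count, and entry_size stand for the values obtained with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE. A linear scan is used for simplicity, although the alphabetical ordering would also permit a binary search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search a name table laid out as described above: "count" entries of
   "entry_size" bytes each. Bytes 0 and 1 of an entry hold the subpattern
   number, most significant byte first; the zero-terminated name follows.
   Returns the subpattern number, or -1 if the name is not in the table. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size >= 3);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

With the four-entry table shown above (entry size 8), looking up "month" yields 4 and looking up "day" yields 5.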
896    
897             PCRE_INFO_OPTIONS
898    
899           Return  a  copy of the options with which the pattern was compiled. The
900           fourth argument should point to an unsigned long  int  variable.  These
901           option bits are those specified in the call to pcre_compile(), modified
902           by any top-level option settings within the pattern itself.
903    
904           A pattern is automatically anchored by PCRE if  all  of  its  top-level
905           alternatives begin with one of the following:
906    
907             ^     unless PCRE_MULTILINE is set
908             \A    always
909             \G    always
910             .*    if PCRE_DOTALL is set and there are no back
911                     references to the subpattern in which .* appears
912    
913           For such patterns, the PCRE_ANCHORED bit is set in the options returned
914           by pcre_fullinfo().
915    
916             PCRE_INFO_SIZE
917    
918           Return the size of the compiled pattern, that is, the  value  that  was
919           passed as the argument to pcre_malloc() when PCRE was getting memory in
920           which to place the compiled data. The fourth argument should point to a
921           size_t variable.
922    
923             PCRE_INFO_STUDYSIZE
924    
925           Returns  the  size of the data block pointed to by the study_data field
926           in a pcre_extra block. That is, it is the  value  that  was  passed  to
927           pcre_malloc() when PCRE was getting memory into which to place the data
928           created by pcre_study(). The fourth argument should point to  a  size_t
929           variable.
930    
931    
932    OBSOLETE INFO FUNCTION
933    
934           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
935    
936           The  pcre_info()  function is now obsolete because its interface is too
937           restrictive to return all the available data about a compiled  pattern.
938           New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
939           pcre_info() is the number of capturing subpatterns, or one of the  fol-
940           lowing negative numbers:
941    
942             PCRE_ERROR_NULL       the argument code was NULL
943             PCRE_ERROR_BADMAGIC   the "magic number" was not found
944    
945           If  the  optptr  argument is not NULL, a copy of the options with which
946           the pattern was compiled is placed in the integer  it  points  to  (see
947           PCRE_INFO_OPTIONS above).
948    
949           If  the  pattern  is  not anchored and the firstcharptr argument is not
950           NULL, it is used to pass back information about the first character  of
951           any matched string (see PCRE_INFO_FIRSTBYTE above).
952    
953    
954    MATCHING A PATTERN
955    
956           int pcre_exec(const pcre *code, const pcre_extra *extra,
957                const char *subject, int length, int startoffset,
958                int options, int *ovector, int ovecsize);
959    
960           The  function pcre_exec() is called to match a subject string against a
961           pre-compiled pattern, which is passed in the code argument. If the pat-
962           tern  has been studied, the result of the study should be passed in the
963           extra argument.
964    
965           Here is an example of a simple call to pcre_exec():
966    
967             int rc;
968             int ovector[30];
969             rc = pcre_exec(
970               re,             /* result of pcre_compile() */
971               NULL,           /* we didn't study the pattern */
972               "some string",  /* the subject string */
973               11,             /* the length of the subject string */
974               0,              /* start at offset 0 in the subject */
975               0,              /* default options */
976               ovector,        /* vector for substring information */
977               30);            /* number of elements in the vector */
978    
979           If the extra argument is not NULL, it must point to a  pcre_extra  data
980           block.  The pcre_study() function returns such a block (when it doesn't
981           return NULL), but you can also create one for yourself, and pass  addi-
982           tional information in it. The fields in the block are as follows:
983    
984             unsigned long int flags;
985             void *study_data;
986             unsigned long int match_limit;
987             void *callout_data;
988    
989           The  flags  field  is a bitmap that specifies which of the other fields
990           are set. The flag bits are:
991    
992             PCRE_EXTRA_STUDY_DATA
993             PCRE_EXTRA_MATCH_LIMIT
994             PCRE_EXTRA_CALLOUT_DATA
995    
996           Other flag bits should be set to zero. The study_data field is  set  in
997           the  pcre_extra  block  that is returned by pcre_study(), together with
998           the appropriate flag bit. You should not set this yourself, but you can
999           add to the block by setting the other fields.
1000    
1001           The match_limit field provides a means of preventing PCRE from using up
1002           a vast amount of resources when running patterns that are not going  to
1003           match,  but  which  have  a very large number of possibilities in their
1004           search trees. The classic  example  is  the  use  of  nested  unlimited
1005           repeats. Internally, PCRE uses a function called match() which it calls
1006           repeatedly (sometimes recursively). The limit is imposed on the  number
1007           of  times  this function is called during a match, which has the effect
1008           of limiting the amount of recursion  and  backtracking  that  can  take
1009           place.  For  patterns that are not anchored, the count starts from zero
1010           for each position in the subject string.
1011    
1012           The default limit for the library can be set when PCRE  is  built;  the
1013           built-in default is 10 million, which handles all but the most  extreme
1014           cases.  You  can reduce the default by supplying pcre_exec() with a
1015           pcre_extra  block  in  which match_limit is set to a smaller value, and
1016           PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1017           exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1018    
1019           The  callout_data  field is used in conjunction with the "callout" fea-
1020           ture, which is described in the pcrecallout documentation.
1021    
1022           The PCRE_ANCHORED option can be passed in the options  argument,  whose
1023           unused  bits  must  be zero. This limits pcre_exec() to matching at the
1024           first matching position.  However,  if  a  pattern  was  compiled  with
1025           PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,
1026           it cannot be made unanchored at matching time.
1027    
1028           When PCRE_UTF8 was set at compile time, the validity of the subject  as
1029           a  UTF-8  string is automatically checked, and the value of startoffset
1030           is also checked to ensure that it points to the start of a UTF-8  char-
1031           acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()
1032           returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an
1033           invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1034    
1035           If  you  already  know that your subject is valid, and you want to skip
1036           these   checks   for   performance   reasons,   you   can    set    the
1037           PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1038           do this for the second and subsequent calls to pcre_exec() if  you  are
1039           making  repeated  calls  to  find  all  the matches in a single subject
1040           string. However, you should be  sure  that  the  value  of  startoffset
1041           points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1042           set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1043           value  of startoffset that does not point to the start of a UTF-8 char-
1044           acter, is undefined. Your program may crash.
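
A caller that sets PCRE_NO_UTF8_CHECK can still make a cheap boundary check on startoffset itself. The sketch below uses a hypothetical helper name, is_utf8_char_start, and relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. It does not validate the subject as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if "offset" lies on a UTF-8 character boundary within a subject
   of "length" bytes, and 0 otherwise. The end of the subject counts as a
   boundary. Continuation bytes (10xxxxxx) are not character starts. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

For a subject consisting of "a", the two-byte sequence C3 A9, and "b", offsets 0, 1, 3, and 4 are boundaries, while offset 2 falls inside the two-byte character.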
1045    
1046           There are also three further options that can be set only  at  matching
1047           time:
1048    
1049             PCRE_NOTBOL
1050    
1051           The  first  character  of the string is not the beginning of a line, so
1052           the circumflex metacharacter should not match before it.  Setting  this
1053           without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to
1054           match.
1055    
1056             PCRE_NOTEOL
1057    
1058           The end of the string is not the end of a line, so the dollar metachar-
1059           acter  should  not  match  it  nor (except in multiline mode) a newline
1060           immediately before it. Setting this without PCRE_MULTILINE (at  compile
1061           time) causes dollar never to match.
1062    
1063             PCRE_NOTEMPTY
1064    
1065           An empty string is not considered to be a valid match if this option is
1066           set. If there are alternatives in the pattern, they are tried.  If  all
1067           the  alternatives  match  the empty string, the entire match fails. For
1068           example, if the pattern
1069    
1070             a?b?
1071    
1072           is applied to a string not beginning with "a" or "b",  it  matches  the
1073           empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
1074           match is not valid, so PCRE searches further into the string for occur-
1075           rences of "a" or "b".
1076    
1077           Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1078           cial case of a pattern match of the empty  string  within  its  split()
1079           function,  and  when  using  the /g modifier. It is possible to emulate
1080           Perl's behaviour after matching a null string by first trying the match
1081           again at the same offset with PCRE_NOTEMPTY set, and then if that fails
1082           by advancing the starting offset (see below)  and  trying  an  ordinary
1083           match again.
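
The emulation just described can be sketched as a loop. To keep the sketch free of library dependencies, match_fn is a hypothetical stand-in for a call to pcre_exec() (its notempty argument plays the role of PCRE_NOTEMPTY), and match_a_star is a toy matcher for the pattern a* that exists only so that the loop can be exercised. The loop advances by one byte after a failed retry, so it assumes a subject that is not in UTF-8 mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pcre_exec(): search for a match at or after
   "start". If "notempty" is non-zero, an empty match is not accepted.
   On success, returns 1 and sets ov[0] and ov[1]; otherwise returns -1. */
typedef int (*match_fn)(const char *subject, int length, int start,
                        int notempty, int *ov);

/* Count all matches in the subject, handling empty matches as described
   above: retry at the same offset refusing an empty match, and if that
   fails, advance the offset and try an ordinary match. */
int count_all_matches(match_fn match, const char *subject, int length)
{
    int ov[2];
    int start = 0, notempty = 0, count = 0;

    assert(match != NULL && subject != NULL && length >= 0);
    for (;;) {
        if (match(subject, length, start, notempty, ov) >= 0) {
            count++;
            start = ov[1];
            notempty = (ov[0] == ov[1]);   /* empty match: retry non-empty */
        } else if (notempty && start < length) {
            start++;                       /* retry failed: move on a byte */
            notempty = 0;
        } else {
            break;                         /* ordinary match failed: done  */
        }
    }
    return count;
}

/* Toy unanchored matcher for the pattern a* (illustration only). */
int match_a_star(const char *subject, int length, int start,
                 int notempty, int *ov)
{
    int pos, end;

    for (pos = start; pos <= length; pos++) {
        end = pos;
        while (end < length && subject[end] == 'a')
            end++;
        if (end > pos || !notempty) {
            ov[0] = pos;
            ov[1] = end;
            return 1;
        }
    }
    return -1;
}
```

Applied to the subject "baab", the loop finds an empty match at offset 0, "aa" at offset 1, and empty matches at offsets 3 and 4, four matches in all.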
1084    
1085           The  subject string is passed to pcre_exec() as a pointer in subject, a
1086           length in length, and a starting byte offset in startoffset. Unlike the
1087           pattern  string,  the  subject  may contain binary zero bytes. When the
1088           starting offset is zero, the search for a match starts at the beginning
1089           of the subject, and this is by far the most common case.
1090    
1091           If the pattern was compiled with the PCRE_UTF8 option, the subject must
1092           be a sequence of bytes that is a valid UTF-8 string, and  the  starting
1093           offset  must point to the beginning of a UTF-8 character. If an invalid
1094           UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8
1095           or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option
1096           PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not
1097           defined.
1098    
1099           A  non-zero  starting offset is useful when searching for another match
1100           in the same subject by calling pcre_exec() again after a previous  suc-
1101           cess.   Setting  startoffset differs from just passing over a shortened
1102           string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
1103           with any kind of lookbehind. For example, consider the pattern
1104    
1105             \Biss\B
1106    
1107           which  finds  occurrences  of "iss" in the middle of words. (\B matches
1108           only if the current position in the subject is not  a  word  boundary.)
1109           When applied to the string "Mississippi" the first call to  pcre_exec()
1110           finds the first occurrence. If pcre_exec() is called  again  with  just
1111           the  remainder  of  the  subject,  namely "issippi", it does not match,
1112           because \B is always false at the start of the subject, which is deemed
1113           to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1114           string again, but with startoffset  set  to  4,  it  finds  the  second
1115           occurrence  of  "iss"  because  it  is able to look behind the starting
1116           point to discover that it is preceded by a letter.
1117    
1118           If a non-zero starting offset is passed when the pattern  is  anchored,
1119           one  attempt  to match at the given offset is tried. This can only suc-
1120           ceed if the pattern does not require the match to be at  the  start  of
1121           the subject.
1122    
1123           In  general, a pattern matches a certain portion of the subject, and in
1124           addition, further substrings from the subject  may  be  picked  out  by
1125           parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
1126           this is called "capturing" in what follows, and the  phrase  "capturing
1127           subpattern"  is  used for a fragment of a pattern that picks out a sub-
1128           string. PCRE supports several other kinds of  parenthesized  subpattern
1129           that do not cause substrings to be captured.
1130    
1131           Captured  substrings are returned to the caller via a vector of integer
1132           offsets whose address is passed in ovector. The number of  elements  in
1133           the vector is passed in ovecsize. The first two-thirds of the vector is
1134           used to pass back captured substrings, each substring using a  pair  of
1135           integers.  The  remaining  third  of the vector is used as workspace by
1136           pcre_exec() while matching capturing subpatterns, and is not  available
1137           for  passing  back  information.  The  length passed in ovecsize should
1138           always be a multiple of three. If it is not, it is rounded down.
1139    
1140           When a match has been successful, information about captured substrings
1141           is returned in pairs of integers, starting at the beginning of ovector,
1142           and continuing up to two-thirds of its length at the  most.  The  first
1143           element of a pair is set to the offset of the first character in a sub-
1144           string, and the second is set to the  offset  of  the  first  character
1145           after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1146           tor[1], identify the portion of  the  subject  string  matched  by  the
1147           entire  pattern.  The next pair is used for the first capturing subpat-
1148           tern, and so on. The value returned by pcre_exec()  is  the  number  of
1149           pairs  that  have  been set. If there are no capturing subpatterns, the
1150           return value from a successful match is 1,  indicating  that  just  the
1151           first pair of offsets has been set.
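
Reading a substring directly from the offset pairs might be done as in the following sketch. The helper name copy_capture and its -1 failure value are hypothetical conventions of this example, not part of PCRE; pairs stands for a positive value returned by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy substring "n" (0 is the whole match) from "subject" into "buf",
   using the offset pairs in "ovector". "pairs" is the number of pairs
   that were set. Returns the substring length, or -1 if n is out of
   range, the offsets are unusable, or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int pairs,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= pairs)
        return -1;
    start = ovector[2 * n];        /* offset of first character         */
    end = ovector[2 * n + 1];      /* offset just past the last one     */
    if (start < 0 || end < start)
        return -1;
    len = end - start;
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For a subject "2003-12-09" whose pairs are (0,10), (0,4), (5,7), and (8,10), substring 1 is "2003" and substring 3 is "09".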
1152    
1153           Some  convenience  functions  are  provided for extracting the captured
1154           substrings as separate strings. These are described  in  the  following
1155           section.
1156    
1157           It is possible for capturing subpattern number n+1 to  match  some
1158           part of the subject when subpattern n has not been  used  at  all.  For
1159           example, if the string "abc" is matched against the pattern (a|(z))(bc)
1160           subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both
1161           offset values corresponding to the unused subpattern are set to -1.
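
A test for an unused subpattern can be written against the offset pairs alone. The helper name capture_is_set is hypothetical; pairs is the positive return value from pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if subpattern "n" took part in the match, and 0 if it is out
   of range or its offsets were left at -1 because it was not used. */
int capture_is_set(const int *ovector, int pairs, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= pairs)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For "abc" matched against (a|(z))(bc), the pairs are (0,3), (0,1), (-1,-1), and (1,3), so the helper reports subpatterns 1 and 3 as set and subpattern 2 as unset.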
1162    
1163           If a capturing subpattern is matched repeatedly, it is the last portion
1164           of the string that it matched that gets returned.
1165    
1166           If the vector is too small to hold all the captured substrings,  it  is
1167           used as far as possible (up to two-thirds of its length), and the func-
1168           tion returns a value of zero. In particular, if the  substring  offsets
1169           are  not  of interest, pcre_exec() may be called with ovector passed as
1170           NULL and ovecsize as zero. However, if the pattern contains back refer-
1171           ences  and  the  ovector  isn't big enough to remember the related sub-
1172           strings, PCRE has to get additional memory  for  use  during  matching.
1173           Thus it is usually advisable to supply an ovector.
1174    
1175           Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT can  be  used  to
1176           find out how many capturing subpatterns there are in a compiled pattern.
1177           The smallest ovector size allowing for n captured  substrings,  as  well
1178           as the offsets of the substring matched by the whole pattern, is (n+1)*3.
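
The sizing rules can be expressed as two one-line helpers. Both names are hypothetical. The first applies the (n+1)*3 formula; the second gives the number of offset pairs that a vector of a given size can return, reflecting the rounding down to a multiple of three.

```c
#include <assert.h>

/* Smallest ovector size (in ints) that allows for n captured substrings
   plus the pair for the whole match. */
int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that an ovector of "ovecsize" ints can pass
   back: one third of the size, rounded down. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A vector of 30 ints, as in the example call above, therefore has room for the whole match and nine captured substrings.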
1179    
1180           If pcre_exec() fails, it returns a negative number. The  following  are
1181           defined in the header file:
1182    
1183             PCRE_ERROR_NOMATCH        (-1)
1184    
1185           The subject string did not match the pattern.
1186    
1187             PCRE_ERROR_NULL           (-2)
1188    
1189           Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1190           ovecsize was not zero.
1191    
1192             PCRE_ERROR_BADOPTION      (-3)
1193    
1194           An unrecognized bit was set in the options argument.
1195    
1196             PCRE_ERROR_BADMAGIC       (-4)
1197    
1198           PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1199           to  catch  the case when it is passed a junk pointer. This is the error
1200           it gives when the magic number isn't present.
1201    
1202             PCRE_ERROR_UNKNOWN_NODE   (-5)
1203    
1204           While running the pattern match, an unknown item was encountered in the
1205           compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1206           overwriting of the compiled pattern.
1207    
1208             PCRE_ERROR_NOMEMORY       (-6)
1209    
1210           If a pattern contains back references, but the ovector that  is  passed
1211           to pcre_exec() is not big enough to remember the referenced substrings,
1212           PCRE gets a block of memory at the start of matching to  use  for  this
1213           purpose.  If the call via pcre_malloc() fails, this error is given. The
1214           memory is freed at the end of matching.
1215    
1216             PCRE_ERROR_NOSUBSTRING    (-7)
1217    
1218           This error is used by the pcre_copy_substring(),  pcre_get_substring(),
1219           and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1220           returned by pcre_exec().
1221    
1222             PCRE_ERROR_MATCHLIMIT     (-8)
1223    
1224           The recursion and backtracking limit, as specified by  the  match_limit
1225           field  in  a  pcre_extra  structure (or defaulted) was reached. See the
1226           description above.
1227    
1228             PCRE_ERROR_CALLOUT        (-9)
1229    
1230           This error is never generated by pcre_exec() itself. It is provided for
1231           use  by  callout functions that want to yield a distinctive error code.
1232           See the pcrecallout documentation for details.
1233    
1234             PCRE_ERROR_BADUTF8        (-10)
1235    
1236           A string that contains an invalid UTF-8 byte sequence was passed  as  a
1237           subject.
1238    
1239             PCRE_ERROR_BADUTF8_OFFSET (-11)
1240    
1241           The UTF-8 byte sequence that was passed as a subject was valid, but the
1242           value of startoffset did not point to the beginning of a UTF-8  charac-
1243           ter.
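
For diagnostics, the negative return values listed above can be mapped to their names. This sketch uses the numeric values given in this section rather than including pcre.h, and the function name exec_error_name is hypothetical.

```c
/* Map a negative pcre_exec() return value to the name documented above.
   The numbers are the ones listed in this section. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_NODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    default:  return rc >= 0 ? "no error" : "unknown error";
    }
}
```

In a real program the symbolic constants from pcre.h would normally be used in place of the literal numbers.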
1244    
1245    
1246    EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1247    
1248           int pcre_copy_substring(const char *subject, int *ovector,
1249                int stringcount, int stringnumber, char *buffer,
1250                int buffersize);
1251    
1252           int pcre_get_substring(const char *subject, int *ovector,
1253                int stringcount, int stringnumber,
1254                const char **stringptr);
1255    
1256           int pcre_get_substring_list(const char *subject,
1257                int *ovector, int stringcount, const char ***listptr);
1258    
1259           Captured  substrings  can  be  accessed  directly  by using the offsets
1260           returned by pcre_exec() in  ovector.  For  convenience,  the  functions
1261           pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1262           string_list() are provided for extracting captured substrings  as  new,
1263           separate,  zero-terminated strings. These functions identify substrings
1264           by number. The next section describes functions  for  extracting  named
1265           substrings.  A  substring  that  contains  a  binary  zero is correctly
1266           extracted and has a further zero added on the end, but  the  result  is
1267           not, of course, a C string.
1268    
1269           The  first  three  arguments  are the same for all three of these func-
1270           tions: subject is the subject string which has just  been  successfully
1271           matched, ovector is a pointer to the vector of integer offsets that was
1272           passed to pcre_exec(), and stringcount is the number of substrings that
1273           were  captured  by  the match, including the substring that matched the
1274           entire regular expression. This is the value returned by pcre_exec() if
1275           it  is greater than zero. If pcre_exec() returned zero, indicating that
1276           it ran out of space in ovector, the value passed as stringcount  should
1277           be the size of the vector divided by three.
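
The rule for choosing stringcount can be captured in a small helper. The name stringcount_for is hypothetical; rc is the value returned by pcre_exec() and ovecsize is the size that was passed to it.

```c
/* Choose the stringcount argument for the substring functions from the
   result of pcre_exec(): the return value itself when it is positive,
   or one third of the vector size when pcre_exec() returned zero because
   the vector was too small. A negative rc means that there was no match
   (or an error), so there is nothing to extract. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return 0;
}
```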
1278    
1279           The  functions pcre_copy_substring() and pcre_get_substring() extract a
1280           single substring, whose number is given as  stringnumber.  A  value  of
1281           zero  extracts  the  substring  that  matched the entire pattern, while
1282           higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
1283           string(),  the  string  is  placed  in buffer, whose length is given by
1284           buffersize, while for pcre_get_substring() a new  block  of  memory  is
1285           obtained  via  pcre_malloc,  and its address is returned via stringptr.
1286           The yield of the function is the length of the  string,  not  including
1287           the terminating zero, or one of
1288    
1289             PCRE_ERROR_NOMEMORY       (-6)
1290    
1291           The  buffer  was too small for pcre_copy_substring(), or the attempt to
1292           get memory failed for pcre_get_substring().
1293    
1294             PCRE_ERROR_NOSUBSTRING    (-7)
1295    
1296           There is no substring whose number is stringnumber.
1297    
1298           The pcre_get_substring_list()  function  extracts  all  available  sub-
1299           strings  and  builds  a list of pointers to them. All this is done in a
1300           single block of memory which is obtained via pcre_malloc.  The  address
1301           of the memory block is returned via listptr, which is also the start of
1302           the list of string pointers. The end of the list is marked  by  a  NULL
1303           pointer. The yield of the function is zero if all went well, or
1304    
1305             PCRE_ERROR_NOMEMORY       (-6)
1306    
1307           if the attempt to get the memory block failed.
1308    
1309           When  any of these functions encounter a substring that is unset, which
1310           can happen when capturing subpattern number n+1 matches  some  part  of
1311           the  subject, but subpattern n has not been used at all, they return an
1312           empty string. This can be distinguished from a genuine zero-length sub-
1313           string  by inspecting the appropriate offset in ovector, which is nega-
1314           tive for unset substrings.
1315    
1316           The    two    convenience    functions    pcre_free_substring()     and
1317           pcre_free_substring_list() can be used to free the memory returned by a
1318           previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),
1319           respectively. They do nothing more than call the function pointed to by
1320           pcre_free, which of course could be called directly from a  C  program.
1321           However,  PCRE is used in some situations where it is linked via a spe-
1322           cial  interface  to  another  programming  language  which  cannot  use
1323           pcre_free  directly;  it is for these cases that the functions are pro-
1324           vided.
1325    
1326    
1327    EXTRACTING CAPTURED SUBSTRINGS BY NAME
1328    
1329           int pcre_copy_named_substring(const pcre *code,
1330                const char *subject, int *ovector,
1331                int stringcount, const char *stringname,
1332                char *buffer, int buffersize);
1333    
1334           int pcre_get_stringnumber(const pcre *code,
1335                const char *name);
1336    
1337           int pcre_get_named_substring(const pcre *code,
1338                const char *subject, int *ovector,
1339                int stringcount, const char *stringname,
1340                const char **stringptr);
1341    
1342           To extract a substring by name, you first have to find  the  associated
1343           number. This can be done by calling pcre_get_stringnumber(). The  first
1344           argument is the compiled pattern, and the second is the name. For exam-
1345           ple, for this pattern
1346    
1347             ab(?P<xxx>\d+)...
1348    
1349           the  number  of the subpattern called "xxx" is 1. Given the number, you
1350           can then extract the substring directly, or use one  of  the  functions
1351           described  in the previous section. For convenience, there are also two
1352           functions that do the whole job.
1353    
1354           Most   of   the   arguments    of    pcre_copy_named_substring()    and
1355           pcre_get_named_substring() are the same as those for the functions that
1356           extract by number, and so are not re-described here. There are just two
1357           differences.
1358    
1359           First,  instead  of a substring number, a substring name is given. Sec-
1360           ond, there is an extra argument, given at the start, which is a pointer
1361           to  the compiled pattern. This is needed in order to gain access to the
1362           name-to-number translation table.
1363    
1364           These functions call pcre_get_stringnumber(), and if it succeeds,  they
1365           then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
1366           ate.
1367    
1368    Last updated: 09 December 2003
1369    Copyright (c) 1997-2003 University of Cambridge.
1370    -----------------------------------------------------------------------------
1371    
1372    PCRE(3)                                                                PCRE(3)
1373    
1374    
1375    
1376    NAME
1377           PCRE - Perl-compatible regular expressions
1378    
1379    PCRE CALLOUTS
1380    
1381       against  a pre-compiled pattern, which is passed in the code         int (*pcre_callout)(pcre_callout_block *);
      argument. If the pattern has been studied, the result of the  
      study should be passed in the extra argument. Otherwise this  
      must be NULL.  
   
      Here is an example of a simple call to pcre_exec():  
   
        int rc;  
        int ovector[30];  
        rc = pcre_exec(  
          re,             /* result of pcre_compile() */  
          NULL,           /* we didn't study the pattern */  
          "some string",  /* the subject string */  
          11,             /* the length of the subject string */  
          0,              /* start at offset 0 in the subject */  
          0,              /* default options */  
          ovector,        /* vector for substring information */  
          30);            /* number of elements in the vector */  
   
      The PCRE_ANCHORED option can be passed in the options  argu-  
      ment,  whose unused bits must be zero. However, if a pattern  
      was  compiled  with  PCRE_ANCHORED,  or  turned  out  to  be  
      anchored  by  virtue  of  its  contents,  it  cannot be made  
      unanchored at matching time.  
   
      There are also three further options that can be set only at  
      matching time:  
   
        PCRE_NOTBOL  
   
      The first character of the string is not the beginning of  a  
      line,  so  the  circumflex  metacharacter  should  not match  
      before it. Setting this without PCRE_MULTILINE  (at  compile  
      time) causes circumflex never to match.  
   
        PCRE_NOTEOL  
   
      The end of the string is not the end of a line, so the  dol-  
      lar  metacharacter should not match it nor (except in multi-  
      line mode) a newline immediately  before  it.  Setting  this  
      without PCRE_MULTILINE (at compile time) causes dollar never  
      to match.  
   
        PCRE_NOTEMPTY  
   
      An empty string is not considered to be  a  valid  match  if  
      this  option  is  set. If there are alternatives in the pat-  
      tern, they are tried. If  all  the  alternatives  match  the  
      empty  string,  the  entire match fails. For example, if the  
      pattern  
   
        a?b?  
   
      is applied to a string not beginning with  "a"  or  "b",  it  
      matches  the  empty string at the start of the subject. With  
      PCRE_NOTEMPTY set, this match is not valid, so PCRE searches  
      further into the string for occurrences of "a" or "b".  
   
      Perl has no direct equivalent of PCRE_NOTEMPTY, but it  does  
      make  a  special case of a pattern match of the empty string  
      within its split() function, and when using the /g modifier.  
      It  is possible to emulate Perl's behaviour after matching a  
      null string by first trying the  match  again  at  the  same  
      offset  with  PCRE_NOTEMPTY  set,  and then if that fails by  
      advancing the starting offset  (see  below)  and  trying  an  
      ordinary match again.  
   
      The subject string is passed as  a  pointer  in  subject,  a  
      length  in  length,  and  a  starting offset in startoffset.  
      Unlike the pattern string, the subject  may  contain  binary  
      zero  characters.  When  the  starting  offset  is zero, the  
      search for a match starts at the beginning of  the  subject,  
      and this is by far the most common case.  
   
      A non-zero starting offset  is  useful  when  searching  for  
      another  match  in  the  same subject by calling pcre_exec()  
      again after a previous success.  Setting startoffset differs  
      from  just  passing  over  a  shortened  string  and setting  
      PCRE_NOTBOL in the case of a pattern that  begins  with  any  
      kind of lookbehind. For example, consider the pattern  
   
        \Biss\B  
   
      which finds occurrences of "iss" in the middle of words. (\B  
      matches only if the current position in the subject is not a  
      word boundary.) When applied to the string "Mississippi" the  
      first  call  to  pcre_exec()  finds the first occurrence. If  
      pcre_exec() is called again with just the remainder  of  the  
      subject,  namely "issippi", it does not match, because \B is  
      always false at the start of the subject, which is deemed to  
      be  a  word  boundary. However, if pcre_exec() is passed the  
      entire string again, but with startoffset set to 4, it finds  
      the  second  occurrence  of "iss" because it is able to look  
      behind the starting point to discover that it is preceded by  
      a letter.  
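
      The effect of startoffset on \Biss\B can be illustrated without
      calling PCRE at all. The following is a minimal sketch in standard
      C: find_inner_iss() is a hand-rolled stand-in for the pattern, not
      a PCRE function, and it assumes that a word character is a letter,
      a digit, or an underscore. It shows only why looking behind the
      starting point changes the result.

```c
#include <ctype.h>
#include <string.h>

/* Word character as assumed for this sketch: letter, digit, or underscore. */
static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if position pos in s[0..length) is a word boundary. The start and
   end of the subject are treated as adjacent to a non-word character. */
static int at_boundary(const char *s, int length, int pos)
{
    int before = pos > 0 && is_word((unsigned char)s[pos - 1]);
    int after = pos < length && is_word((unsigned char)s[pos]);
    return before != after;
}

/* Hand-rolled stand-in for \Biss\B: offset of the first match at or after
   startoffset, or -1 if there is none. The whole subject stays visible,
   so the check before the match can look behind startoffset. */
static int find_inner_iss(const char *s, int length, int startoffset)
{
    int i;
    for (i = startoffset; i + 3 <= length; i++) {
        if (memcmp(s + i, "iss", 3) == 0 &&
            !at_boundary(s, length, i) &&
            !at_boundary(s, length, i + 3))
            return i;
    }
    return -1;
}
```

      Searching the full subject from offset 4 finds the second "iss",
      whereas searching only the shortened remainder does not, because
      the start of that shorter string counts as a word boundary.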
   
      If a non-zero starting offset is passed when the pattern  is  
      anchored, one attempt to match at the given offset is tried.  
      This can only succeed if the pattern does  not  require  the  
      match to be at the start of the subject.  
   
      In general, a pattern matches a certain portion of the  sub-  
      ject,  and  in addition, further substrings from the subject  
      may be picked out by parts of  the  pattern.  Following  the  
      usage  in  Jeffrey Friedl's book, this is called "capturing"  
      in what follows, and the phrase  "capturing  subpattern"  is  
      used for a fragment of a pattern that picks out a substring.  
      PCRE supports several other kinds of  parenthesized  subpat-  
      tern that do not cause substrings to be captured.  
   
      Captured substrings are returned to the caller via a  vector  
      of  integer  offsets whose address is passed in ovector. The  
      number of elements in the vector is passed in ovecsize.  The  
      first two-thirds of the vector is used to pass back captured  
      substrings, each substring using a  pair  of  integers.  The  
      remaining  third  of  the  vector  is  used  as workspace by  
      pcre_exec() while matching capturing subpatterns, and is not  
      available for passing back information. The length passed in  
      ovecsize should always be a multiple of three. If it is not,  
      it is rounded down.  
   
      When a match has been successful, information about captured  
      substrings is returned in pairs of integers, starting at the  
      beginning of ovector, and continuing up to two-thirds of its  
      length  at  the  most. The first element of a pair is set to  
      the offset of the first character in a  substring,  and  the  
      second is set to the offset of the first character after the  
      end of a substring. The first  pair,  ovector[0]  and  ovec-  
      tor[1],  identify  the portion of the subject string matched  
      by the entire pattern. The next pair is used for  the  first  
      capturing  subpattern,  and  so  on.  The  value returned by  
      pcre_exec() is the number of pairs that have  been  set.  If  
      there  are no capturing subpatterns, the return value from a  
      successful match is 1, indicating that just the  first  pair  
      of offsets has been set.  
   
      Some convenience functions are provided for  extracting  the  
      captured substrings as separate strings. These are described  
      in the following section.  
   
      It is possible for a capturing  subpattern  number  n+1  to  
      match  some  part  of  the subject when subpattern n has not  
      been used at all.  For  example,  if  the  string  "abc"  is  
      matched  against the pattern (a|(z))(bc) subpatterns 1 and 3  
      are matched, but 2 is not. When this  happens,  both  offset  
      values corresponding to the unused subpattern are set to -1.  
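
      As a sketch of how the offset pairs can be used directly, the
      helper below copies one captured substring out of the subject and
      treats a pair of -1 values as "unset". The name copy_capture() and
      its return convention are hypothetical, chosen for illustration;
      PCRE's own convenience functions are described in the following
      section.

```c
#include <string.h>

/* Hypothetical helper (not part of PCRE): copy capture number n out of
   subject using the offset pair ovector[2*n] and ovector[2*n+1].
   Returns the substring length, or -1 if the capture is unset or the
   buffer is too small. */
static int copy_capture(const char *subject, const int *ovector, int n,
                        char *buffer, int buffersize)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    int len;

    if (start < 0 || end < 0)
        return -1;              /* unset subpattern: both offsets are -1 */
    len = end - start;
    if (len + 1 > buffersize)
        return -1;              /* no room for the text plus the terminator */
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

      For the (a|(z))(bc) example above, the offsets after a successful
      match on "abc" would be 0,3 for the whole match, 0,1 for
      subpattern 1, -1,-1 for subpattern 2, and 1,3 for subpattern 3.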
   
      If a capturing subpattern is matched repeatedly, it  is  the  
      last  portion  of  the  string  that  it  matched  that gets  
      returned.  
   
      If the vector is too small to hold  all  the  captured  sub-  
      strings,  it is used as far as possible (up to two-thirds of  
      its length), and the function returns a value  of  zero.  In  
      particular,  if  the  substring offsets are not of interest,  
      pcre_exec() may be called with ovector passed  as  NULL  and  
      ovecsize  as  zero.  However,  if  the pattern contains back  
      references and the ovector isn't big enough to remember  the  
      related  substrings,  PCRE  has to get additional memory for  
      use during matching. Thus it is usually advisable to  supply  
      an ovector.  
   
      Note that pcre_info() can be used to find out how many  cap-  
      turing  subpatterns  there  are  in  a compiled pattern. The  
      smallest size for ovector that will  allow  for  n  captured  
      substrings  in  addition  to  the  offsets  of the substring  
      matched by the whole pattern is (n+1)*3.  
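
      The sizing arithmetic can be written down as a short sketch. Both
      helper names below are hypothetical and not part of PCRE; they
      only encode the (n+1)*3 rule and the rounding down of ovecsize to
      a multiple of three described earlier.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): smallest ovector length that
   holds the whole-match pair plus n capturing subpatterns. */
static int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that can be passed back for
   a given ovecsize. Only two-thirds of the vector carries results, and a
   length that is not a multiple of three is rounded down first, so the
   answer is simply ovecsize / 3. */
static int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

      With the 30-element vector used in the earlier pcre_exec()
      example, this gives room for the whole match plus nine captured
      substrings.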
   
      If pcre_exec() fails, it returns a negative number. The fol-  
      lowing are defined in the header file:  
   
        PCRE_ERROR_NOMATCH        (-1)  
   
      The subject string did not match the pattern.  
   
        PCRE_ERROR_NULL           (-2)  
   
      Either code or subject was passed as NULL,  or  ovector  was  
      NULL and ovecsize was not zero.  
   
        PCRE_ERROR_BADOPTION      (-3)  
   
      An unrecognized bit was set in the options argument.  
   
        PCRE_ERROR_BADMAGIC       (-4)  
   
      PCRE stores a 4-byte "magic number" at the start of the com-  
      piled  code,  to  catch  the  case  when it is passed a junk  
      pointer. This is the error it gives when  the  magic  number  
      isn't present.  
   
        PCRE_ERROR_UNKNOWN_NODE   (-5)  
   
      While running the pattern match, an unknown item was encoun-  
      tered in the compiled pattern. This error could be caused by  
      a bug in PCRE or by overwriting of the compiled pattern.  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      If a pattern contains back references, but the ovector  that  
      is  passed  to pcre_exec() is not big enough to remember the  
      referenced substrings, PCRE gets a block of  memory  at  the  
      start  of  matching to use for this purpose. If the call via  
      pcre_malloc() fails, this error  is  given.  The  memory  is  
      freed at the end of matching.  
   
   
   
   
 EXTRACTING CAPTURED SUBSTRINGS  
      Captured substrings can be accessed directly  by  using  the  
      offsets returned by pcre_exec() in ovector. For convenience,  
      the functions  pcre_copy_substring(),  pcre_get_substring(),  
      and  pcre_get_substring_list()  are  provided for extracting  
      captured  substrings  as  new,   separate,   zero-terminated  
      strings.   A  substring  that  contains  a  binary  zero  is  
      correctly extracted and has a further zero added on the end,  
      but the result does not, of course, function as a C string.  
   
      The first three arguments are the same for all  three  func-  
      tions:  subject  is  the  subject string which has just been  
      successfully matched, ovector is a pointer to the vector  of  
      integer   offsets   that  was  passed  to  pcre_exec(),  and  
      stringcount is the number of substrings that  were  captured  
      by  the  match,  including  the  substring  that matched the  
      entire regular expression. This is  the  value  returned  by  
      pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()  
      returned zero, indicating that it ran out of space in  ovec-  
      tor,  the  value passed as stringcount should be the size of  
      the vector divided by three.  
   
      The functions pcre_copy_substring() and pcre_get_substring()  
      extract a single substring, whose number is given as string-  
      number. A value of zero extracts the substring that  matched  
      the entire pattern, while higher values extract the captured  
      substrings. For pcre_copy_substring(), the string is  placed  
      in  buffer,  whose  length is given by buffersize, while for  
      pcre_get_substring() a new block of memory is  obtained  via  
      pcre_malloc,  and its address is returned via stringptr. The  
      yield of the function is  the  length  of  the  string,  not  
      including the terminating zero, or one of  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      The buffer was too small for pcre_copy_substring(),  or  the  
      attempt to get memory failed for pcre_get_substring().  
   
        PCRE_ERROR_NOSUBSTRING    (-7)  
   
      There is no substring whose number is stringnumber.  
   
      The pcre_get_substring_list() function extracts  all  avail-  
      able  substrings  and builds a list of pointers to them. All  
      this is done in a single block of memory which  is  obtained  
      via pcre_malloc. The address of the memory block is returned  
      via listptr, which is also the start of the list  of  string  
      pointers.  The  end of the list is marked by a NULL pointer.  
      The yield of the function is zero if all went well, or  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      if the attempt to get the memory block failed.  
   
      When any of these functions encounter a  substring  that  is  
      unset, which can happen when capturing subpattern number n+1  
      matches some part of the subject, but subpattern n  has  not  
      been  used  at all, they return an empty string. This can be  
      distinguished  from  a  genuine  zero-length  substring   by  
      inspecting the appropriate offset in ovector, which is nega-  
      tive for unset substrings.  
   
      The  two  convenience  functions  pcre_free_substring()  and  
      pcre_free_substring_list()  can  be  used to free the memory  
      returned by  a  previous  call  of  pcre_get_substring()  or  
      pcre_get_substring_list(),  respectively.  They  do  nothing  
      more than call the function pointed to by  pcre_free,  which  
      of  course  could  be called directly from a C program. How-  
      ever, PCRE is used in some situations where it is linked via  
      a  special  interface  to another programming language which  
      cannot use pcre_free directly; it is for  these  cases  that  
      the functions are provided.  
1382    
1383           PCRE provides a feature called "callout", which is a means of temporar-
1384           ily passing control to the caller of PCRE  in  the  middle  of  pattern
1385           matching.  The  caller of PCRE provides an external function by putting
1386           its entry point in the global variable pcre_callout. By  default,  this
1387           variable contains NULL, which disables all calling out.
1388    
1389           Within  a  regular  expression,  (?C) indicates the points at which the
1390           external function is to be called.  Different  callout  points  can  be
1391           identified  by  putting  a number less than 256 after the letter C. The
1392           default value is zero.  For  example,  this  pattern  has  two  callout
1393           points:
1394    
1395             (?C1)abc(?C2)def
1396    
1397           During matching, when PCRE reaches a callout point (and pcre_callout is
1398           set), the external function is called. Its only argument is  a  pointer
1399           to a pcre_callout block. This contains the following variables:
1400    
1401             int          version;
1402             int          callout_number;
1403             int         *offset_vector;
1404             const char  *subject;
1405             int          subject_length;
1406             int          start_match;
1407             int          current_position;
1408             int          capture_top;
1409             int          capture_last;
1410             void        *callout_data;
1411    
1412           The  version  field  is an integer containing the version number of the
1413           block format. The current version  is  zero.  The  version  number  may
1414           change  in  future if additional fields are added, but the intention is
1415           never to remove any of the existing fields.
1416    
1417           The callout_number field contains the number of the  callout,  as  com-
1418           piled into the pattern (that is, the number after ?C).
1419    
1420           The  offset_vector field is a pointer to the vector of offsets that was
1421           passed by the caller to pcre_exec(). The contents can be  inspected  in
1422           order  to extract substrings that have been matched so far, in the same
1423           way as for extracting substrings after a match has completed.
1424    
1425           The subject and subject_length fields contain copies of the values that
1426           were passed to pcre_exec().
1427    
1428           The  start_match  field contains the offset within the subject at which
1429           the current match attempt started. If the pattern is not anchored,  the
1430           callout  function  may  be  called several times for different starting
1431           points.
1432    
1433           The current_position field contains the offset within  the  subject  of
1434           the current match pointer.
1435    
1436           The  capture_top field contains one more than the number of the highest
1437           numbered  captured  substring  so  far.  If  no  substrings  have  been
1438           captured, the value of capture_top is one.
1439    
1440           The  capture_last  field  contains the number of the most recently cap-
1441           tured substring.
1442    
1443           The callout_data field contains a value that is passed  to  pcre_exec()
1444           by  the  caller specifically so that it can be passed back in callouts.
1445           It is passed in the callout_data field of the pcre_extra data structure.
1446           If  no  such  data  was  passed,  the value of callout_data in a
1447           pcre_callout block is NULL. There is a description  of  the  pcre_extra
1448           structure in the pcreapi documentation.
1449    
1450    
1451    
1452    RETURN VALUES
1453    
1454           The callout function returns an integer. If the value is zero, matching
1455           proceeds as normal. If the value is greater than zero,  matching  fails
1456           at the current point, but backtracking to test other possibilities goes
1457           ahead, just as if a lookahead assertion had failed.  If  the  value  is
1458           less  than  zero,  the  match is abandoned, and pcre_exec() returns the
1459           value.
1460    
1461           Negative  values  should  normally  be   chosen   from   the   set   of
1462           PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
1463           dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
1464           reserved  for  use  by callout functions; it will never be used by PCRE
1465           itself.
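
       The three kinds of return value can be sketched as follows. To
       keep the example compilable on its own, it declares a stand-in
       structure that mirrors the fields listed above; in a real program
       the type comes from pcre.h. The names sketch_callout_block,
       sketch_callout, and SKETCH_ABANDON, and the policy the function
       applies, are assumptions made for illustration only.

```c
#include <stddef.h>

/* Stand-in for the callout block, mirroring the fields listed above.
   Redeclared here only so that the sketch compiles without pcre.h. */
typedef struct sketch_callout_block {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} sketch_callout_block;

/* Hypothetical negative code for this sketch; not a PCRE constant. */
#define SKETCH_ABANDON (-1000)

/* Hypothetical policy: callout 1 always continues; callout 2 fails the
   current path unless at least three characters have been consumed since
   start_match; any other callout number abandons the match. */
static int sketch_callout(sketch_callout_block *cb)
{
    if (cb->callout_data != NULL)
        ++*(int *)cb->callout_data;     /* count invocations for the caller */

    switch (cb->callout_number) {
    case 1:
        return 0;                       /* zero: matching proceeds as normal */
    case 2:
        /* positive: fail here, but allow backtracking to continue */
        return (cb->current_position - cb->start_match >= 3) ? 0 : 1;
    default:
        return SKETCH_ABANDON;          /* negative: abandon the match */
    }
}
```

       Here callout_data is assumed to point at an int counter supplied
       by the caller, which shows one way of passing state into the
       callout and back out again.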
1466    
1467    Last updated: 21 January 2003
1468    Copyright (c) 1997-2003 University of Cambridge.
1469    -----------------------------------------------------------------------------
1470    
1471    PCRE(3)                                                                PCRE(3)
1472    
 LIMITATIONS  
      There are some size limitations in PCRE but it is hoped that  
      they will never in practice be relevant.  The maximum length  
      of a compiled pattern is 65539 (sic) bytes.  All  values  in  
      repeating  quantifiers  must be less than 65536. The maximum  
      number of capturing subpatterns is 65535.  There is  no  
      limit  to  the  number of non-capturing subpatterns, but the  
      maximum depth of nesting of all kinds of parenthesized  sub-  
      pattern,  including  capturing  subpatterns, assertions, and  
      other types of subpattern, is 200.  
   
      The maximum length of a subject string is the largest  posi-  
      tive number that an integer variable can hold. However, PCRE  
      uses recursion to handle subpatterns and indefinite  repeti-  
      tion.  This  means  that the available stack space may limit  
      the size of a subject string that can be processed  by  cer-  
      tain patterns.  
1473    
1474    
1475    NAME
1476           PCRE - Perl-compatible regular expressions
1477    
1478  DIFFERENCES FROM PERL  DIFFERENCES FROM PERL
      The differences described here  are  with  respect  to  Perl  
      5.005.  
1479    
1480       1. By default, a whitespace character is any character  that         This  document describes the differences in the ways that PCRE and Perl
1481       the  C  library  function isspace() recognizes, though it is         handle regular expressions. The differences  described  here  are  with
1482       possible to compile PCRE  with  alternative  character  type         respect to Perl 5.8.
1483       tables. Normally isspace() matches space, formfeed, newline,  
1484       carriage return, horizontal tab, and vertical tab. Perl 5 no         1.  PCRE does not have full UTF-8 support. Details of what it does have
1485       longer  includes vertical tab in its set of whitespace char-         are given in the section on UTF-8 support in the main pcre page.
1486       acters. The \v escape that was in the Perl documentation for  
1487       a long time was never in fact recognized. However, the char-         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
1488       acter itself was treated as whitespace at least up to 5.002.         permits  them,  but they do not mean what you might think. For example,
1489       In 5.004 and 5.005 it does not match \s.         (?!a){3} does not assert that the next three characters are not "a". It
1490           just asserts that the next character is not "a" three times.
1491       2. PCRE does  not  allow  repeat  quantifiers  on  lookahead  
1492       assertions. Perl permits them, but they do not mean what you         3.  Capturing  subpatterns  that occur inside negative lookahead asser-
1493       might think. For example, (?!a){3} does not assert that  the         tions are counted, but their entries in the offsets  vector  are  never
1494       next  three characters are not "a". It just asserts that the         set.  Perl sets its numerical variables from any such patterns that are
1495       next character is not "a" three times.         matched before the assertion fails to match something (thereby succeed-
1496           ing),  but  only  if the negative lookahead assertion contains just one
1497       3. Capturing subpatterns that occur inside  negative  looka-         branch.
1498       head  assertions  are  counted,  but  their  entries  in the  
1499       offsets vector are never set. Perl sets its numerical  vari-         4. Though binary zero characters are supported in the  subject  string,
1500       ables  from  any  such  patterns that are matched before the         they are not allowed in a pattern string because it is passed as a nor-
1501       assertion fails to match something (thereby succeeding), but         mal C string, terminated by zero. The escape sequence "\0" can be  used
1502       only  if  the negative lookahead assertion contains just one         in the pattern to represent a binary zero.
1503       branch.  
1504           5.  The  following Perl escape sequences are not supported: \l, \u, \L,
1505       4. Though binary zero characters are supported in  the  sub-         \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
1506       ject  string,  they  are  not  allowed  in  a pattern string         string-handling and are not part of its pattern matching engine. If any
1507       because it is passed as a normal  C  string,  terminated  by         of these are encountered by PCRE, an error is generated.
1508       zero. The escape sequence "\0" can be used in the pattern to  
1509       represent a binary zero.         6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
1510           ters  in  between  are  treated as literals. This is slightly different
1511       5. The following Perl escape sequences  are  not  supported:         from Perl in that $ and @ are  also  handled  as  literals  inside  the
1512       \l,  \u,  \L,  \U,  \E, \Q. In fact these are implemented by         quotes.  In Perl, they cause variable interpolation (but of course PCRE
1513       Perl's general string-handling and are not part of its  pat-         does not have variables). Note the following examples:
1514       tern matching engine.  
1515               Pattern            PCRE matches      Perl matches
1516       6. The Perl \G assertion is  not  supported  as  it  is  not  
1517       relevant to single pattern matches.             \Qabc$xyz\E        abc$xyz           abc followed by the
1518                                                      contents of $xyz
1519       7. Fairly obviously, PCRE does not support the (?{code}) and             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1520       (?p{code})  constructions. However, there is some experimen-             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1521       tal support for recursive patterns using the  non-Perl  item  
1522       (?R).         The \Q...\E sequence is recognized both inside  and  outside  character
1523           classes.
1524       8. There are at the time of writing some  oddities  in  Perl  
1525       5.005_02  concerned  with  the  settings of captured strings         7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
1526       when part of a pattern is repeated.  For  example,  matching         constructions. However, there is some experimental support  for  recur-
1527       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value         sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).
1528       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2         Also, the PCRE "callout" feature allows  an  external  function  to  be
1529       unset.    However,    if   the   pattern   is   changed   to         called during pattern matching.
1530       /^(aa(b(b))?)+$/ then $2 (and $3) are set.  
1531           8.  There  are some differences that are concerned with the settings of
1532       In Perl 5.004 $2 is set in both cases, and that is also true         captured strings when part of  a  pattern  is  repeated.  For  example,
1533       of PCRE. If in the future Perl changes to a consistent state         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
1534       that is different, PCRE may change to follow.         unset, but in PCRE it is set to "b".
1535    
1536       9. Another as yet unresolved discrepancy  is  that  in  Perl         9. PCRE  provides  some  extensions  to  the  Perl  regular  expression
1537       5.005_02  the  pattern /^(a)?(?(1)a|b)+$/ matches the string         facilities:
1538       "a", whereas in PCRE it does not.  However, in both Perl and  
1539       PCRE /^(a)?a/ matched against "a" leaves $1 unset.         (a)  Although  lookbehind  assertions  must match fixed length strings,
1540           each alternative branch of a lookbehind assertion can match a different
1541       10. PCRE  provides  some  extensions  to  the  Perl  regular         length of string. Perl requires them all to have the same length.
1542       expression facilities:  
1543           (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
1544       (a) Although lookbehind assertions must match  fixed  length         meta-character matches only at the very end of the string.
1545       strings,  each  alternative branch of a lookbehind assertion  
1546       can match a different length of string. Perl 5.005  requires         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
1547       them all to have the same length.         cial meaning is faulted.
1548    
1549       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
1550       set,  the  $ meta- character matches only at the very end of         fiers is inverted, that is, by default they are not greedy, but if fol-
1551       the string.         lowed by a question mark they are.
   
      (c) If PCRE_EXTRA is set, a backslash followed by  a  letter  
      with no special meaning is faulted.  
   
      (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-  
      tion  quantifiers  is inverted, that is, by default they are  
      not greedy, but if followed by a question mark they are.  
   
      (e) PCRE_ANCHORED can be used to force a pattern to be tried  
      only at the start of the subject.  
   
      (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options  
      for pcre_exec() have no Perl equivalents.  
   
      (g) The (?R) construct allows for recursive pattern matching  
      (Perl  5.6 can do this using the (?p{code}) construct, which  
      PCRE cannot of course support.)  
   
   
   
 REGULAR EXPRESSION DETAILS  
      The syntax and semantics of  the  regular  expressions  sup-  
      ported  by PCRE are described below. Regular expressions are  
      also described in the Perl documentation and in a number  of  
      other  books,  some  of which have copious examples. Jeffrey  
      Friedl's  "Mastering  Regular  Expressions",  published   by  
      O'Reilly (ISBN 1-56592-257), covers them in great detail.  
   
      The description here is intended as reference documentation.  
      The basic operation of PCRE is on strings of bytes. However,  
      there is the beginnings of some support for UTF-8  character  
      strings.  To  use  this  support  you must configure PCRE to  
      include it, and then call pcre_compile() with the  PCRE_UTF8  
      option.  How  this affects the pattern matching is described  
      in the final section of this document.  
   
      A regular expression is a pattern that is matched against  a  
      subject string from left to right. Most characters stand for  
      themselves in a pattern, and match the corresponding charac-  
      ters in the subject. As a trivial example, the pattern  
   
        The quick brown fox  
   
      matches a portion of a subject string that is  identical  to  
      itself.  The  power  of  regular  expressions comes from the  
      ability to include alternatives and repetitions in the  pat-  
      tern.  These  are encoded in the pattern by the use of meta-  
      characters, which do not stand for  themselves  but  instead  
      are interpreted in some special way.  
   
      There are two different sets of meta-characters: those  that  
      are  recognized anywhere in the pattern except within square  
      brackets, and those that are recognized in square  brackets.  
      Outside square brackets, the meta-characters are as follows:  
   
        \      general escape character with several uses  
        ^      assert start of  subject  (or  line,  in  multiline  
      mode)  
        $      assert end of subject (or line, in multiline mode)  
        .      match any character except newline (by default)  
        [      start character class definition  
        |      start of alternative branch  
        (      start subpattern  
        )      end subpattern  
        ?      extends the meaning of (  
               also 0 or 1 quantifier  
               also quantifier minimizer  
        *      0 or more quantifier  
        +      1 or more quantifier  
        {      start min/max quantifier  
   
      Part of a pattern that is in square  brackets  is  called  a  
      "character  class".  In  a  character  class  the only meta-  
      characters are:  
   
        \      general escape character  
        ^      negate the class, but only if the first character  
        -      indicates character range  
        ]      terminates the character class  
1552    
      The following sections describe  the  use  of  each  of  the  
      meta-characters.  
1553           (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at
1554           the first matching position in the subject string.
1555    
1556           (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
1557           TURE options for pcre_exec() have no Perl equivalents.
1558    
1559           (g)  The (?R), (?number), and (?P>name) constructs allow for recursive
1560           pattern matching (Perl can do  this  using  the  (?p{code})  construct,
1561           which PCRE cannot support).
1562    
1563           (h)  PCRE supports named capturing substrings, using the Python syntax.
1564    
1565           (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
1566           Sun's Java package.
1567    
1568           (j) The (R) condition, for testing recursion, is a PCRE extension.
1569    
1570           (k) The callout facility is PCRE-specific.
1571    
1572    Last updated: 09 December 2003
1573    Copyright (c) 1997-2003 University of Cambridge.
1574    -----------------------------------------------------------------------------
1575    
1576    PCRE(3)                                                                PCRE(3)
1577    
1578    
1579    
1580    NAME
1581           PCRE - Perl-compatible regular expressions
1582    
1583    PCRE REGULAR EXPRESSION DETAILS
1584    
1585           The  syntax  and semantics of the regular expressions supported by PCRE
1586           are described below. Regular expressions are also described in the Perl
1587           documentation  and in a number of other books, some of which have copi-
1588           ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-
1589           lished  by  O'Reilly, covers them in great detail. The description here
1590           is intended as reference documentation.
1591    
1592           The basic operation of PCRE is on strings of bytes. However,  there  is
1593           also  support for UTF-8 character strings. To use this support you must
1594           build PCRE to include UTF-8 support, and then call pcre_compile()  with
1595           the  PCRE_UTF8  option.  How  this affects the pattern matching is men-
1596           tioned in several places below. There is also a summary of  UTF-8  fea-
1597           tures in the section on UTF-8 support in the main pcre page.
1598    
1599           A  regular  expression  is  a pattern that is matched against a subject
1600           string from left to right. Most characters stand for  themselves  in  a
1601           pattern,  and  match  the corresponding characters in the subject. As a
1602           trivial example, the pattern
1603    
1604             The quick brown fox
1605    
1606           matches a portion of a subject string that is identical to itself.  The
1607           power of regular expressions comes from the ability to include alterna-
1608           tives and repetitions in the pattern. These are encoded in the  pattern
1609           by  the  use  of meta-characters, which do not stand for themselves but
1610           instead are interpreted in some special way.
1611    
1612           There are two different sets of meta-characters: those that are  recog-
1613           nized  anywhere in the pattern except within square brackets, and those
1614           that are recognized in square brackets. Outside  square  brackets,  the
1615           meta-characters are as follows:
1616    
1617             \      general escape character with several uses
1618             ^      assert start of string (or line, in multiline mode)
1619             $      assert end of string (or line, in multiline mode)
1620             .      match any character except newline (by default)
1621             [      start character class definition
1622             |      start of alternative branch
1623             (      start subpattern
1624             )      end subpattern
1625             ?      extends the meaning of (
1626                    also 0 or 1 quantifier
1627                    also quantifier minimizer
1628             *      0 or more quantifier
1629             +      1 or more quantifier
1630                    also "possessive quantifier"
1631             {      start min/max quantifier
1632    
1633           Part  of  a  pattern  that is in square brackets is called a "character
1634           class". In a character class the only meta-characters are:
1635    
1636             \      general escape character
1637             ^      negate the class, but only if the first character
1638             -      indicates character range
1639             [      POSIX character class (only if followed by POSIX
1640                      syntax)
1641             ]      terminates the character class
1642    
1643           The following sections describe the use of each of the meta-characters.
1644    
1645    
1646  BACKSLASH
      The backslash character has several uses. Firstly, if it  is  
      followed  by  a  non-alphameric character, it takes away any  
      special  meaning  that  character  may  have.  This  use  of  
   
      backslash  as  an  escape  character applies both inside and  
      outside character classes.  
   
      For example, if you want to match a "*" character, you write  
      "\*" in the pattern. This applies whether or not the follow-  
      ing character would otherwise  be  interpreted  as  a  meta-  
      character,  so it is always safe to precede a non-alphameric  
      with "\" to specify that it stands for itself.  In  particu-  
      lar, if you want to match a backslash, you write "\\".  
   
      If a pattern is compiled with the PCRE_EXTENDED option, whi-  
      tespace in the pattern (other than in a character class) and  
      characters between a "#" outside a character class  and  the  
      next  newline  character  are ignored. An escaping backslash  
      can be used to include a whitespace or "#" character as part  
      of the pattern.  
   
      A second use of backslash provides a way  of  encoding  non-  
      printing  characters  in patterns in a visible manner. There  
      is no restriction on the appearance of non-printing  charac-  
      ters,  apart from the binary zero that terminates a pattern,  
      but when a pattern is being prepared by text editing, it  is  
      usually  easier to use one of the following escape sequences  
      than the binary character it represents:  
   
        \a     alarm, that is, the BEL character (hex 07)  
        \cx    "control-x", where x is any character  
        \e     escape (hex 1B)  
        \f     formfeed (hex 0C)  
        \n     newline (hex 0A)  
        \r     carriage return (hex 0D)  
        \t     tab (hex 09)  
        \xhh   character with hex code hh  
        \ddd   character with octal code ddd, or backreference  
   
      The precise effect of "\cx" is as follows: if "x" is a lower  
      case  letter,  it  is converted to upper case. Then bit 6 of  
      the character (hex 40) is inverted.  Thus "\cz" becomes  hex  
      1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.  
   
      After "\x", up to two hexadecimal digits are  read  (letters  
      can be in upper or lower case).  
   
      After "\0" up to two further octal digits are read. In  both  
      cases,  if  there are fewer than two digits, just those that  
      are present are used. Thus the sequence "\0\x\07"  specifies  
      two binary zeros followed by a BEL character.  Make sure you  
      supply two digits after the initial zero  if  the  character  
      that follows is itself an octal digit.  
   
      The handling of a backslash followed by a digit other than 0  
      is  complicated.   Outside  a character class, PCRE reads it  
      and any following digits as a decimal number. If the  number  
      is  less  than  10, or if there have been at least that many  
      previous capturing left parentheses in the  expression,  the  
      entire  sequence is taken as a back reference. A description  
      of how this works is given later, following  the  discussion  
      of parenthesized subpatterns.  
   
      Inside a character  class,  or  if  the  decimal  number  is  
      greater  than  9 and there have not been that many capturing  
      subpatterns, PCRE re-reads up to three octal digits  follow-  
      ing  the  backslash,  and  generates  a single byte from the  
      least significant 8 bits of the value. Any subsequent digits  
      stand for themselves.  For example:  
   
        \040   is another way of writing a space  
        \40    is the same, provided there are fewer than 40  
                  previous capturing subpatterns  
        \7     is always a back reference  
        \11    might be a back reference, or another way of  
                  writing a tab  
        \011   is always a tab  
        \0113  is a tab followed by the character "3"  
        \113   is the character with octal code 113 (since there  
                  can be no more than 99 back references)  
        \377   is a byte consisting entirely of 1 bits  
        \81    is either a back reference, or a binary zero  
                  followed by the two characters "8" and "1"  
   
      Note that octal values of 100 or greater must not be  intro-  
      duced  by  a  leading zero, because no more than three octal  
      digits are ever read.  
   
      All the sequences that define a single  byte  value  can  be  
      used both inside and outside character classes. In addition,  
      inside a character class, the sequence "\b"  is  interpreted  
      as  the  backspace  character  (hex 08). Outside a character  
      class it has a different meaning (see below).  
   
      The third use of backslash is for specifying generic charac-  
      ter types:  
   
        \d     any decimal digit  
        \D     any character that is not a decimal digit  
        \s     any whitespace character  
        \S     any character that is not a whitespace character  
        \w     any "word" character  
        \W     any "non-word" character  
   
      Each pair of escape sequences partitions the complete set of  
      characters  into  two  disjoint  sets.  Any  given character  
      matches one, and only one, of each pair.  
   
      A "word" character is any letter or digit or the  underscore  
      character,  that  is,  any  character which can be part of a  
      Perl "word". The definition of letters and  digits  is  con-  
      trolled  by PCRE's character tables, and may vary if locale-  
      specific matching is  taking  place  (see  "Locale  support"  
      above). For example, in the "fr" (French) locale, some char-  
      acter codes greater than 128 are used for accented  letters,  
      and these are matched by \w.  
   
      These character type sequences can appear  both  inside  and  
      outside  character classes. They each match one character of  
      the appropriate type. If the current matching  point  is  at  
      the end of the subject string, all of them fail, since there  
      is no character to match.  
   
      The fourth use of backslash is  for  certain  simple  asser-  
      tions. An assertion specifies a condition that has to be met  
      at a particular point in  a  match,  without  consuming  any  
      characters  from  the subject string. The use of subpatterns  
      for more complicated  assertions  is  described  below.  The  
      backslashed assertions are  
   
        \b     word boundary  
        \B     not a word boundary  
        \A     start of subject (independent of multiline mode)  
        \Z     end of subject or newline at end (independent of multiline mode)  
        \z     end of subject (independent of multiline mode)  
   
      These assertions may not appear in  character  classes  (but  
      note that "\b" has a different meaning, namely the backspace  
      character, inside a character class).  
   
      A word boundary is a position in the  subject  string  where  
      the current character and the previous character do not both  
      match \w or \W (i.e. one matches \w and  the  other  matches  
      \W),  or the start or end of the string if the first or last  
      character matches \w, respectively.  
   
      The \A, \Z, and \z assertions differ  from  the  traditional  
      circumflex  and  dollar  (described below) in that they only  
      ever match at the very start and end of the subject  string,  
      whatever  options  are  set.  They  are  not affected by the  
      PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-  
      ment  of  pcre_exec()  is  non-zero, \A can never match. The  
      difference between \Z and \z is that  \Z  matches  before  a  
      newline  that is the last character of the string as well as  
      at the end of the string, whereas \z  matches  only  at  the  
      end.  
1647    
1648           The backslash character has several uses. Firstly, if it is followed by
1649           a non-alphameric character, it takes  away  any  special  meaning  that
1650           character  may  have.  This  use  of  backslash  as an escape character
1651           applies both inside and outside character classes.
1652    
1653           For example, if you want to match a * character, you write  \*  in  the
1654           pattern.   This  escaping  action  applies whether or not the following
1655           character would otherwise be interpreted as a meta-character, so it  is
1656           always  safe to precede a non-alphameric with backslash to specify that
1657           it stands for itself. In particular, if you want to match a  backslash,
1658           you write \\.
1659    
1660           If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
1661           the pattern (other than in a character class) and characters between  a
1662           # outside a character class and the next newline character are ignored.
1663           An escaping backslash can be used to include a whitespace or #  charac-
1664           ter as part of the pattern.
1665    
1666           If  you  want  to remove the special meaning from a sequence of charac-
1667           ters, you can do so by putting them between \Q and \E. This is  differ-
1668           ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
1669           sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
1670           tion. Note the following examples:
1671    
1672             Pattern            PCRE matches   Perl matches
1673    
1674             \Qabc$xyz\E        abc$xyz        abc followed by the
1675                                                 contents of $xyz
1676             \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1677             \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1678    
1679           The  \Q...\E  sequence  is recognized both inside and outside character
1680           classes.
1681    
1682           A second use of backslash provides a way of encoding non-printing char-
1683           acters  in patterns in a visible manner. There is no restriction on the
1684           appearance of non-printing characters, apart from the binary zero  that
1685           terminates  a  pattern,  but  when  a pattern is being prepared by text
1686           editing, it is usually easier  to  use  one  of  the  following  escape
1687           sequences than the binary character it represents:
1688    
1689             \a        alarm, that is, the BEL character (hex 07)
1690             \cx       "control-x", where x is any character
1691             \e        escape (hex 1B)
1692             \f        formfeed (hex 0C)
1693             \n        newline (hex 0A)
1694             \r        carriage return (hex 0D)
1695             \t        tab (hex 09)
1696             \ddd      character with octal code ddd, or backreference
1697             \xhh      character with hex code hh
1698             \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1699    
1700           The  precise  effect of \cx is as follows: if x is a lower case letter,
1701           it is converted to upper case. Then bit 6 of the character (hex 40)  is
1702           inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
1703           becomes hex 7B.
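
       The \cx rule above can be sketched in a few lines of C. This is only
       an illustration of the stated rule, not PCRE's own code: the function
       name control_escape is hypothetical, and the sketch assumes ASCII
       input in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper (not part of the PCRE API): returns the byte that
   \cx stands for, following the rule in the text above. */
unsigned char control_escape(unsigned char x)
{
    assert(x < 128);                      /* the sketch covers ASCII only */
    if (islower(x))
        x = (unsigned char)toupper(x);    /* lower case becomes upper case */
    return (unsigned char)(x ^ 0x40);     /* invert bit 6 (hex 40) */
}
```

       Under these assumptions, control_escape('z') gives hex 1A,
       control_escape('{') gives hex 3B, and control_escape(';') gives
       hex 7B, matching the three cases quoted above.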
1704    
1705           After \x, from zero to two hexadecimal digits are read (letters can  be
1706           in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
1707           its may appear between \x{ and }, but the value of the  character  code
1708           must  be  less  than  2**31  (that is, the maximum hexadecimal value is
1709           7FFFFFFF). If characters other than hexadecimal digits  appear  between
1710           \x{  and }, or if there is no terminating }, this form of escape is not
1711           recognized. Instead, the initial \x will be interpreted as a basic hex-
1712           adecimal escape, with no following digits, giving a byte whose value is
1713           zero.
1714    
1715           Characters whose value is less than 256 can be defined by either of the
1716           two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
1717           in the way they are handled. For example, \xdc is exactly the  same  as
1718           \x{dc}.
1719    
1720           After  \0  up  to  two further octal digits are read. In both cases, if
1721           there are fewer than two digits, just those that are present are  used.
1722           Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
1723           character (code value 7). Make sure you supply  two  digits  after  the
1724           initial zero if the character that follows is itself an octal digit.
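
       The digit-counting rules for \x and \0 can be illustrated with a small
       decoder. It is a sketch under stated assumptions, not PCRE's parser:
       the names read_digits and decode_escapes are hypothetical, only the
       non-UTF-8 forms \xhh and \0dd are handled, and every other character
       is copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a digit character, or -1 if it is not a hexadecimal digit. */
static int digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read at most max_digits digits in the given base, advancing *p.
   Fewer digits (even none) are accepted; the value is then built from
   just those that are present. */
unsigned char read_digits(const char **p, int base, int max_digits)
{
    unsigned value = 0;
    int i;
    for (i = 0; i < max_digits; i++) {
        int d = digit_value((unsigned char)**p);
        if (d < 0 || d >= base) break;
        value = value * (unsigned)base + (unsigned)d;
        (*p)++;
    }
    return (unsigned char)value;
}

/* Hypothetical decoder: expands \xhh and \0dd in a pattern fragment,
   writing at most cap bytes to out and returning the number written. */
size_t decode_escapes(const char *p, unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(p != NULL && out != NULL);
    while (*p != '\0' && n < cap) {
        if (p[0] == '\\' && p[1] == 'x') {
            p += 2;
            out[n++] = read_digits(&p, 16, 2);   /* zero to two hex digits */
        } else if (p[0] == '\\' && p[1] == '0') {
            p += 2;
            out[n++] = read_digits(&p, 8, 2);    /* up to two further octal digits */
        } else {
            out[n++] = (unsigned char)*p++;
        }
    }
    return n;
}
```

       Applied to the seven characters \0\x\07, the sketch produces the three
       bytes 0, 0, 7. Applied to \x41\0101 it produces hex 41, the byte 8,
       and then a literal "1", which shows why two digits should follow the
       initial zero when the next character is itself an octal digit.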
1725    
1726           The handling of a backslash followed by a digit other than 0 is compli-
1727           cated.  Outside a character class, PCRE reads it and any following dig-
1728           its  as  a  decimal  number. If the number is less than 10, or if there
1729           have been at least that many previous capturing left parentheses in the
1730           expression,  the  entire  sequence  is  taken  as  a  back reference. A
1731           description of how this works is given later, following the  discussion
1732           of parenthesized subpatterns.
1733    
1734           Inside  a  character  class, or if the decimal number is greater than 9
1735           and there have not been that many capturing subpatterns, PCRE  re-reads
1736           up  to three octal digits following the backslash, and generates a sin-
1737           gle byte from the least significant 8 bits of the value. Any subsequent
1738           digits stand for themselves.  For example:
1739    
1740             \040   is another way of writing a space
1741             \40    is the same, provided there are fewer than 40
1742                       previous capturing subpatterns
1743             \7     is always a back reference
1744             \11    might be a back reference, or another way of
1745                       writing a tab
1746             \011   is always a tab
1747             \0113  is a tab followed by the character "3"
1748             \113   might be a back reference, otherwise the
1749                       character with octal code 113
1750             \377   might be a back reference, otherwise
1751                       the byte consisting entirely of 1 bits
1752             \81    is either a back reference, or a binary zero
1753                       followed by the two characters "8" and "1"
1754    
1755           Note  that  octal  values of 100 or greater must not be introduced by a
1756           leading zero, because no more than three octal digits are ever read.
1757    
1758           All the sequences that define a single byte value  or  a  single  UTF-8
1759           character (in UTF-8 mode) can be used both inside and outside character
1760           classes. In addition, inside a character  class,  the  sequence  \b  is
1761           interpreted  as  the  backspace character (hex 08). Outside a character
1762           class it has a different meaning (see below).
1763    
1764           The third use of backslash is for specifying generic character types:
1765    
1766             \d     any decimal digit
1767             \D     any character that is not a decimal digit
1768             \s     any whitespace character
1769             \S     any character that is not a whitespace character
1770             \w     any "word" character
1771             \W     any "non-word" character
1772    
1773           Each pair of escape sequences partitions the complete set of characters
1774           into  two disjoint sets. Any given character matches one, and only one,
1775           of each pair.
1776    
1777           In UTF-8 mode, characters with values greater than 255 never match  \d,
1778           \s, or \w, and always match \D, \S, and \W.
1779    
1780           For  compatibility  with Perl, \s does not match the VT character (code
1781           11).  This makes it different from the POSIX "space" class. The  \s
1782           characters are HT (9), LF (10), FF (12), CR (13), and space (32).
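
       The \s set listed above is small enough to test directly. The sketch
       below uses hypothetical names and hard-codes the five code values from
       the text; it does not consult PCRE's character tables, so it ignores
       any locale-specific behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true for the five characters that \s matches
   according to the text: HT (9), LF (10), FF (12), CR (13), space (32).
   VT (11) is deliberately absent. */
int matches_backslash_s(unsigned char c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Advance pos past any run of \s characters and return the new position. */
size_t skip_backslash_s(const char *subject, size_t length, size_t pos)
{
    assert(subject != NULL && pos <= length);
    while (pos < length && matches_backslash_s((unsigned char)subject[pos]))
        pos++;
    return pos;
}
```

       For the four-byte subject consisting of space, tab, VT, and "x", the
       skip stops at offset 2, because VT is not in the set.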
1783    
1784           A  "word" character is any letter or digit or the underscore character,
1785           that is, any character which can be part of a Perl "word". The  defini-
1786           tion  of  letters  and digits is controlled by PCRE's character tables,
1787           and may vary if locale-specific matching is taking place (see  "Locale
1788           support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)
1789           locale, some character codes greater than 128  are  used  for  accented
1790           letters, and these are matched by \w.
1791    
1792           These character type sequences can appear both inside and outside char-
1793           acter classes. They each match one character of the  appropriate  type.
1794           If  the current matching point is at the end of the subject string, all
1795           of them fail, since there is no character to match.
1796    
1797           The fourth use of backslash is for certain simple assertions. An asser-
1798           tion  specifies a condition that has to be met at a particular point in
1799           a match, without consuming any characters from the subject string.  The
1800           use  of subpatterns for more complicated assertions is described below.
1801           The backslashed assertions are
1802    
1803             \b     matches at a word boundary
1804             \B     matches when not at a word boundary
1805             \A     matches at start of subject
1806             \Z     matches at end of subject or before newline at end
1807             \z     matches at end of subject
1808             \G     matches at first matching position in subject
1809    
1810           These assertions may not appear in character classes (but note that  \b
1811           has a different meaning, namely the backspace character, inside a char-
1812           acter class).
1813    
1814           A word boundary is a position in the subject string where  the  current
1815           character  and  the previous character do not both match \w or \W (i.e.
1816           one matches \w and the other matches \W), or the start or  end  of  the
1817           string if the first or last character matches \w, respectively.
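
       The word boundary definition above translates directly into a
       comparison of the characters on either side of a position. The
       following is a sketch with hypothetical names; it assumes the default
       "C" locale, whereas PCRE takes the definition of letters and digits
       from its character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A "word" character: letter, digit, or underscore (C locale assumed). */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: true if pos is a word boundary in subject.
   Positions outside the string count as non-word, so the start or end
   is a boundary only when the first or last character matches \w. */
int at_word_boundary(const char *subject, size_t length, size_t pos)
{
    int prev, curr;
    assert(subject != NULL && pos <= length);
    prev = pos > 0      && is_word_char((unsigned char)subject[pos - 1]);
    curr = pos < length && is_word_char((unsigned char)subject[pos]);
    return prev != curr;
}
```

       In the subject "ab cd", the boundaries are at offsets 0, 2, 3, and 5.
       In the subject " a", offset 0 is not a boundary because the first
       character does not match \w.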
1818    
1819           The  \A,  \Z,  and \z assertions differ from the traditional circumflex
1820           and dollar (described below) in that they only ever match at  the  very
1821           start  and  end  of the subject string, whatever options are set. Thus,
1822           they are independent of multiline mode.
1823    
1824           They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
1825           startoffset argument of pcre_exec() is non-zero, indicating that match-
1826           ing is to start at a point other than the beginning of the subject,  \A
1827           can  never  match.  The difference between \Z and \z is that \Z matches
1828           before a newline that is the last character of the string as well as at
1829           the end of the string, whereas \z matches only at the end.
1830    
1831           The  \G assertion is true only when the current matching position is at
1832           the start point of the match, as specified by the startoffset  argument
1833           of  pcre_exec().  It  differs  from \A when the value of startoffset is
1834           non-zero. By calling pcre_exec() multiple times with appropriate  argu-
1835           ments, you can mimic Perl's /g option, and it is in this kind of imple-
1836           mentation where \G can be useful.
1837    
1838           Note, however, that PCRE's interpretation of \G, as the  start  of  the
1839           current match, is subtly different from Perl's, which defines it as the
1840           end of the previous match. In Perl, these can  be  different  when  the
1841           previously  matched  string was empty. Because PCRE does just one match
1842           at a time, it cannot reproduce this behaviour.
1843    
1844           If all the alternatives of a pattern begin with \G, the  expression  is
1845           anchored to the starting match position, and the "anchored" flag is set
1846           in the compiled regular expression.
1847    
1848    
1849  CIRCUMFLEX AND DOLLAR
      Outside a character class, in the default matching mode, the  
      circumflex  character  is an assertion which is true only if  
      the current matching point is at the start  of  the  subject  
      string.  If  the startoffset argument of pcre_exec() is non-  
      zero, circumflex can never match. Inside a character  class,  
      circumflex has an entirely different meaning (see below).  
   
      Circumflex need not be the first character of the pattern if  
      a  number of alternatives are involved, but it should be the  
      first thing in each alternative in which it appears  if  the  
      pattern is ever to match that branch. If all possible alter-  
      natives start with a circumflex, that is, if the pattern  is  
      constrained to match only at the start of the subject, it is  
      said to be an "anchored" pattern. (There are also other con-  
      structs that can cause a pattern to be anchored.)  
   
      A dollar character is an assertion which is true only if the  
      current  matching point is at the end of the subject string,  
      or immediately before a newline character that is  the  last  
      character in the string (by default). Dollar need not be the  
      last character of the pattern if a  number  of  alternatives  
      are  involved,  but it should be the last item in any branch  
      in which it appears.  Dollar has no  special  meaning  in  a  
      character class.  
   
      The meaning of dollar can be changed so that it matches only  
      at   the   very   end   of   the   string,  by  setting  the  
      PCRE_DOLLAR_ENDONLY option at compile or matching time. This  
      does not affect the \Z assertion.  
   
      The meanings of the circumflex  and  dollar  characters  are  
      changed  if  the  PCRE_MULTILINE option is set. When this is  
      the case,  they  match  immediately  after  and  immediately  
      before an internal "\n" character, respectively, in addition  
      to matching at the start and end of the subject string.  For  
      example,  the  pattern  /^abc$/  matches  the subject string  
      "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-  
      quently,  patterns  that  are  anchored  in single line mode  
      because all branches start with "^" are not anchored in mul-  
      tiline mode, and a match for circumflex is possible when the  
      startoffset  argument  of  pcre_exec()  is   non-zero.   The  
      PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is  
      set.  
   
      Note that the sequences \A, \Z, and \z can be used to  match  
      the  start  and end of the subject in both modes, and if all  
      branches of a pattern start with \A it is  always  anchored,  
      whether PCRE_MULTILINE is set or not.  
1850    
1851           Outside a character class, in the default matching mode, the circumflex
1852           character  is  an  assertion which is true only if the current matching
1853           point is at the start of the subject string. If the  startoffset  argu-
1854           ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
1855           PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
1856           has an entirely different meaning (see below).
1857    
1858           Circumflex  need  not be the first character of the pattern if a number
1859           of alternatives are involved, but it should be the first thing in  each
1860           alternative  in  which  it appears if the pattern is ever to match that
1861           branch. If all possible alternatives start with a circumflex, that  is,
1862           if  the  pattern  is constrained to match only at the start of the sub-
1863           ject, it is said to be an "anchored" pattern.  (There  are  also  other
1864           constructs that can cause a pattern to be anchored.)
1865    
1866           A  dollar  character  is an assertion which is true only if the current
1867           matching point is at the end of  the  subject  string,  or  immediately
1868           before a newline character that is the last character in the string (by
1869           default). Dollar need not be the last character of  the  pattern  if  a
1870           number  of alternatives are involved, but it should be the last item in
1871           any branch in which it appears.  Dollar has no  special  meaning  in  a
1872           character class.
1873    
1874           The  meaning  of  dollar  can be changed so that it matches only at the
1875           very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
1876           compile time. This does not affect the \Z assertion.
1877    
1878           The meanings of the circumflex and dollar characters are changed if the
1879           PCRE_MULTILINE option is set. When this is the case, they match immedi-
1880           ately  after  and  immediately  before  an  internal newline character,
1881           respectively, in addition to matching at the start and end of the  sub-
1882           ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
1883           string "def\nabc" in multiline mode, but not  otherwise.  Consequently,
1884           patterns  that  are  anchored  in single line mode because all branches
1885           start with ^ are not anchored in multiline mode, and a match  for  cir-
1886           cumflex  is  possible  when  the startoffset argument of pcre_exec() is
1887           non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE
1888           is set.
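
The effect of multiline mode on circumflex and dollar can be sketched as follows. This is only an illustration using Python's re module as a stand-in: re.MULTILINE is assumed to play the role of PCRE_MULTILINE, and the variable names are invented for the example.

```python
import re

pattern = r"^abc$"
subject = "def\nabc"

# Single-line (default) mode: ^ can match only at the very start of the subject.
single_line = re.search(pattern, subject)

# Multiline mode: ^ and $ also match just after and just before an internal newline.
multi_line = re.search(pattern, subject, re.MULTILINE)

# By default, $ is also true immediately before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# \A stays tied to the true start of the subject, even in multiline mode.
start_only = re.search(r"\Aabc", subject, re.MULTILINE)

print(single_line)                  # None
print(multi_line.span())            # (4, 7)
print(before_final_newline.span())  # (0, 3)
print(start_only)                   # None
```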
1889    
1890           Note  that  the sequences \A, \Z, and \z can be used to match the start
1891           and end of the subject in both modes, and if all branches of a  pattern
1892           start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
1893           not.
1894    
1895    
1896  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
      Outside a character class, a dot in the pattern matches  any  
      one character in the subject, including a non-printing char-  
      acter, but not (by default)  newline.   If  the  PCRE_DOTALL  
      option  is set, dots match newlines as well. The handling of  
      dot is entirely independent of the  handling  of  circumflex  
      and  dollar,  the  only  relationship  being  that they both  
      involve newline characters. Dot has no special meaning in  a  
      character class.  
1897    
1898           Outside a character class, a dot in the pattern matches any one charac-
1899           ter  in  the  subject,  including a non-printing character, but not (by
1900           default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
1901           which  might  be  more than one byte long, except (by default) for new-
1902           line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.
1903           The  handling of dot is entirely independent of the handling of circum-
1904           flex and dollar, the only relationship being  that  they  both  involve
1905           newline characters. Dot has no special meaning in a character class.
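
A minimal sketch of the dot behaviour described above, again using Python's re module as a stand-in (re.DOTALL is assumed to correspond to PCRE_DOTALL; the names are illustrative only):

```python
import re

subject = "ab\ncd"

# Default: dot matches every character except newline.
default_dot = re.findall(r".", subject)

# With DOTALL set, dot matches the newline as well.
dotall_dot = re.findall(r".", subject, re.DOTALL)

# Inside a character class, dot is an ordinary literal character.
literal_dot = re.findall(r"[.]", "a.b")

print(default_dot)  # ['a', 'b', 'c', 'd']
print(dotall_dot)   # ['a', 'b', '\n', 'c', 'd']
print(literal_dot)  # ['.']
```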
1906    
1907    
1908    MATCHING A SINGLE BYTE
1909    
1910           Outside a character class, the escape sequence \C matches any one byte,
1911           both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-
1912           line.  The  feature  is  provided  in Perl in order to match individual
1913           bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-
1914           vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8
1915           string. For this reason it is best avoided.
1916    
1917           PCRE does not allow \C to appear in lookbehind assertions (see  below),
1918           because in UTF-8 mode it makes it impossible to calculate the length of
1919           the lookbehind.
1920    
1921    
1922  SQUARE BRACKETS  SQUARE BRACKETS
      An opening square bracket introduces a character class, ter-  
      minated  by  a  closing  square  bracket.  A  closing square  
      bracket on its own is  not  special.  If  a  closing  square  
      bracket  is  required as a member of the class, it should be  
      the first data character in the class (after an initial cir-  
      cumflex, if present) or escaped with a backslash.  
   
      A character class matches a single character in the subject;  
      the  character  must  be in the set of characters defined by  
      the class, unless the first character in the class is a cir-  
      cumflex,  in which case the subject character must not be in  
      the set defined by the class. If a  circumflex  is  actually  
      required  as  a  member  of  the class, ensure it is not the  
      first character, or escape it with a backslash.  
   
      For example, the character class [aeiou] matches  any  lower  
      case vowel, while [^aeiou] matches any character that is not  
      a lower case vowel. Note that a circumflex is  just  a  con-  
      venient  notation for specifying the characters which are in  
      the class by enumerating those that are not. It  is  not  an  
      assertion:  it  still  consumes a character from the subject  
      string, and fails if the current pointer is at  the  end  of  
      the string.  
   
      When caseless matching  is  set,  any  letters  in  a  class  
      represent  both their upper case and lower case versions, so  
      for example, a caseless [aeiou] matches "A" as well as  "a",  
      and  a caseless [^aeiou] does not match "A", whereas a case-  
      ful version would.  
   
      The newline character is never treated in any special way in  
      character  classes,  whatever the setting of the PCRE_DOTALL  
      or PCRE_MULTILINE options is. A  class  such  as  [^a]  will  
      always match a newline.  
   
      The minus (hyphen) character can be used to specify a  range  
      of  characters  in  a  character  class.  For example, [d-m]  
      matches any letter between d and m, inclusive.  If  a  minus  
      character  is required in a class, it must be escaped with a  
      backslash or appear in a position where it cannot be  inter-  
      preted as indicating a range, typically as the first or last  
      character in the class.  
   
      It is not possible to have the literal character "]" as  the  
      end  character  of  a  range.  A  pattern such as [W-]46] is  
      interpreted as a class of two characters ("W" and "-")  fol-  
      lowed by a literal string "46]", so it would match "W46]" or  
      "-46]". However, if the "]" is escaped with a  backslash  it  
      is  interpreted  as  the end of range, so [W-\]46] is inter-  
      preted as a single class containing a range followed by  two  
      separate characters. The octal or hexadecimal representation  
      of "]" can also be used to end a range.  
   
      Ranges operate in ASCII collating sequence. They can also be  
      used  for  characters  specified  numerically,  for  example  
      [\000-\037]. If a range that includes letters is  used  when  
      caseless  matching  is set, it matches the letters in either  
      case. For example, [W-c] is equivalent  to  [][\^_`wxyzabc],  
      matched  caselessly,  and  if  character tables for the "fr"  
      locale are in use, [\xc8-\xcb] matches accented E characters  
      in both cases.  
   
      The character types \d, \D, \s, \S,  \w,  and  \W  may  also  
      appear  in  a  character  class, and add the characters that  
      they match to the class. For example, [\dABCDEF] matches any  
      hexadecimal  digit.  A  circumflex  can conveniently be used  
      with the upper case character types to specify a  more  res-  
      tricted set of characters than the matching lower case type.  
      For example, the class [^\W_] matches any letter  or  digit,  
      but not underscore.  
   
      All non-alphameric characters other than \,  -,  ^  (at  the  
      start)  and  the  terminating ] are non-special in character  
      classes, but it does no harm if they are escaped.  
1923    
1924           An opening square bracket introduces a character class, terminated by a
1925           closing square bracket. A closing square bracket on its own is not spe-
1926           cial. If a closing square bracket is required as a member of the class,
1927           it  should  be  the first data character in the class (after an initial
1928           circumflex, if present) or escaped with a backslash.
1929    
1930           A character class matches a single character in the subject.  In  UTF-8
1931           mode,  the character may occupy more than one byte. A matched character
1932           must be in the set of characters defined by the class, unless the first
1933           character  in  the  class definition is a circumflex, in which case the
1934           subject character must not be in the set defined by  the  class.  If  a
1935           circumflex  is actually required as a member of the class, ensure it is
1936           not the first character, or escape it with a backslash.
1937    
1938           For example, the character class [aeiou] matches any lower case  vowel,
1939           while  [^aeiou]  matches  any character that is not a lower case vowel.
1940           Note that a circumflex is just a convenient notation for specifying the
1941           characters which are in the class by enumerating those that are not. It
1942           is not an assertion: it still consumes a  character  from  the  subject
1943           string, and fails if the current pointer is at the end of the string.
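
The points about negated classes can be checked with a short sketch. It uses Python's re module, whose class syntax is assumed to agree with PCRE for these simple cases; all names are invented for the example.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

print(bool(vowel.fullmatch("e")))       # True
print(bool(non_vowel.fullmatch("e")))   # False
print(bool(non_vowel.fullmatch("x")))   # True

# A negated class is not an assertion: it must consume one character,
# so it fails when the matching point is already at the end of the subject.
needs_char = re.compile(r"a[^aeiou]")
print(needs_char.search("a"))           # None
print(needs_char.search("ab").group())  # ab

# A literal ] placed first, and a literal ^ placed anywhere but first.
bracket_member = re.compile(r"[]x]")
caret_member = re.compile(r"[x^]")
print(bool(bracket_member.fullmatch("]")))  # True
print(bool(caret_member.fullmatch("^")))    # True
```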
1944    
1945           In  UTF-8 mode, characters with values greater than 255 can be included
1946           in a class as a literal string of bytes, or by using the  \x{  escaping
1947           mechanism.
1948    
1949           When  caseless  matching  is set, any letters in a class represent both
1950           their upper case and lower case versions, so for  example,  a  caseless
1951           [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
1952           match "A", whereas a caseful version would. PCRE does not  support  the
1953           concept of case for characters with values greater than 255.
1954    
1955           The  newline character is never treated in any special way in character
1956           classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE
1957           options is. A class such as [^a] will always match a newline.
1958    
1959           The  minus (hyphen) character can be used to specify a range of charac-
1960           ters in a character  class.  For  example,  [d-m]  matches  any  letter
1961           between  d  and  m,  inclusive.  If  a minus character is required in a
1962           class, it must be escaped with a backslash  or  appear  in  a  position
1963           where  it cannot be interpreted as indicating a range, typically as the
1964           first or last character in the class.
1965    
1966           It is not possible to have the literal character "]" as the end charac-
1967           ter  of a range. A pattern such as [W-]46] is interpreted as a class of
1968           two characters ("W" and "-") followed by a literal string "46]", so  it
1969           would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
1970           backslash it is interpreted as the end of range, so [W-\]46] is  inter-
1971           preted  as  a  single class containing a range followed by two separate
1972           characters. The octal or hexadecimal representation of "]" can also  be
1973           used to end a range.
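
The difference between [W-]46] and [W-\]46] is easy to misread, so here is a small sketch. It relies on Python's re module parsing these two classes the same way as described above, which is an assumption of the example rather than a statement about PCRE internals.

```python
import re

# A class of two characters ("W" and "-") followed by the literal string "46]".
two_item_class = re.compile(r"[W-]46]")

# One class: the range W..] plus the separate characters "4" and "6".
range_class = re.compile(r"[W-\]46]")

# The same range written with the hexadecimal code of "]".
hex_range_class = re.compile(r"[W-\x5d46]")

print(bool(two_item_class.fullmatch("W46]")))  # True
print(bool(two_item_class.fullmatch("-46]")))  # True
print(bool(range_class.fullmatch("Z")))        # True  (Z lies between W and ])
print(bool(range_class.fullmatch("4")))        # True
print(bool(range_class.fullmatch("-")))        # False (hyphen is the range operator)
print(bool(hex_range_class.fullmatch("Z")))    # True
```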
1974    
1975           Ranges  operate in the collating sequence of character values. They can
1976           also  be  used  for  characters  specified  numerically,  for   example
1977           [\000-\037].  In UTF-8 mode, ranges can include characters whose values
1978           are greater than 255, for example [\x{100}-\x{2ff}].
1979    
1980           If a range that includes letters is used when caseless matching is set,
1981           it matches the letters in either case. For example, [W-c] is equivalent
1982           to [][\^_`wxyzabc], matched caselessly, and if character tables for the
1983           "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in
1984           both cases.
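
As a rough check of the [W-c] example, the following sketch enumerates the 7-bit characters accepted by a caseless range. It uses Python's re module with re.ASCII and re.IGNORECASE as an assumed analogue of caseless matching with the default tables; locale-specific tables such as "fr" are not covered.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE | re.ASCII)

# Collect every 7-bit character that the class accepts.
accepted = {chr(code) for code in range(128) if caseless_range.fullmatch(chr(code))}

# W..c covers W X Y Z [ \ ] ^ _ ` a b c; caseless matching adds w x y z A B C.
expected = set("WXYZ[\\]^_`abc") | set("wxyzABC")

print(accepted == expected)  # True
```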
1985    
1986           The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a
1987           character  class,  and add the characters that they match to the class.
1988           For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
1989           conveniently  be  used with the upper case character types to specify a
1990           more restricted set of characters than the matching  lower  case  type.
1991           For  example,  the  class  [^\W_]  matches any letter or digit, but not
1992           underscore.
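
A brief sketch of character types used inside a class, with Python's re module standing in for PCRE. The re.ASCII flag is used so that \d and \W are limited to 7-bit characters; that choice, like the names, is an assumption of the example.

```python
import re

# Decimal digits plus the upper case letters A to F.
hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)

# "Not a non-word character and not underscore": letters and digits only.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

print(bool(hex_upper.fullmatch("7")))        # True
print(bool(hex_upper.fullmatch("C")))        # True
print(bool(hex_upper.fullmatch("c")))        # False
print(bool(letter_or_digit.fullmatch("a")))  # True
print(bool(letter_or_digit.fullmatch("5")))  # True
print(bool(letter_or_digit.fullmatch("_")))  # False
```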
1993    
1994           All non-alphameric characters other than \, -, ^ (at the start) and the
1995           terminating ] are non-special in character classes, but it does no harm
1996           if they are escaped.
1997    
1998    
1999  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
      Perl 5.6 (not yet released at the time of writing) is  going  
      to  support  the POSIX notation for character classes, which  
      uses names enclosed by  [:  and  :]   within  the  enclosing  
      square brackets. PCRE supports this notation. For example,  
   
        [01[:alpha:]%]  
   
      matches "0", "1", any alphabetic character, or "%". The sup-  
      ported class names are  
   
        alnum    letters and digits  
        alpha    letters  
        ascii    character codes 0 - 127  
        cntrl    control characters  
        digit    decimal digits (same as \d)  
        graph    printing characters, excluding space  
        lower    lower case letters  
        print    printing characters, including space  
        punct    printing characters, excluding letters and digits  
        space    white space (same as \s)  
        upper    upper case letters  
        word     "word" characters (same as \w)  
        xdigit   hexadecimal digits  
   
      The names "ascii" and "word" are  Perl  extensions.  Another  
      Perl  extension is negation, which is indicated by a ^ char-  
      acter after the colon. For example,  
   
        [12[:^digit:]]  
   
      matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also  
      recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a  
      "collating element", but these are  not  supported,  and  an  
      error is given if they are encountered.  
2000    
2001           Perl supports the POSIX notation  for  character  classes,  which  uses
2002           names  enclosed by [: and :] within the enclosing square brackets. PCRE
2003           also supports this notation. For example,
2004    
2005             [01[:alpha:]%]
2006    
2007           matches "0", "1", any alphabetic character, or "%". The supported class
2008           names are
2009    
2010             alnum    letters and digits
2011             alpha    letters
2012             ascii    character codes 0 - 127
2013             blank    space or tab only
2014             cntrl    control characters
2015             digit    decimal digits (same as \d)
2016             graph    printing characters, excluding space
2017             lower    lower case letters
2018             print    printing characters, including space
2019             punct    printing characters, excluding letters and digits
2020             space    white space (not quite the same as \s)
2021             upper    upper case letters
2022             word     "word" characters (same as \w)
2023             xdigit   hexadecimal digits
2024    
2025           The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
2026           and space (32). Notice that this list includes the VT  character  (code
2027           11). This makes "space" different to \s, which does not include VT (for
2028           Perl compatibility).
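
The difference between "space" and \s can be written out as plain sets of character codes. This sketch deliberately avoids Python's own \s (which does include VT) and only restates the list given above; the constant and function names are hypothetical.

```python
# Character codes taken from the list above: HT, LF, VT, FF, CR, space.
POSIX_SPACE = frozenset({9, 10, 11, 12, 13, 32})

# \s as described in this document: the same set without VT (code 11).
BACKSLASH_S = POSIX_SPACE - {11}


def is_posix_space(ch: str) -> bool:
    """True if ch would be in [:space:] according to the list above."""
    return ord(ch) in POSIX_SPACE


def is_backslash_s(ch: str) -> bool:
    """True if ch would match \\s according to the description above."""
    return ord(ch) in BACKSLASH_S


print(is_posix_space("\x0b"), is_backslash_s("\x0b"))  # True False
print(is_posix_space(" "), is_backslash_s(" "))        # True True
```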
2029    
2030           The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
2031           from  Perl  5.8. Another Perl extension is negation, which is indicated
2032           by a ^ character after the colon. For example,
2033    
2034             [12[:^digit:]]
2035    
2036           matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
2037           POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2038           these are not supported, and an error is given if they are encountered.
2039    
2040           In UTF-8 mode, characters with values greater than 255 do not match any
2041           of the POSIX character classes.
2042    
2043    
2044  VERTICAL BAR  VERTICAL BAR
2045    
2046           Vertical bar characters are used to separate alternative patterns.  For
2047           example, the pattern
2048    
2049             gilbert|sullivan
2050    
2051           matches  either "gilbert" or "sullivan". Any number of alternatives may
2052           appear, and an empty  alternative  is  permitted  (matching  the  empty
2053           string).   The  matching  process  tries each alternative in turn, from
2054           left to right, and the first one that succeeds is used. If the alterna-
2055           tives  are within a subpattern (defined below), "succeeds" means match-
2056           ing the rest of the main pattern as well as the alternative in the sub-
2057           pattern.
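
The left-to-right rule for alternatives can be illustrated with a short sketch. Python's re module is used as a stand-in, on the assumption that it tries alternatives in the same order; the pattern names are invented.

```python
import re

either = re.compile(r"gilbert|sullivan")
print(bool(either.fullmatch("gilbert")))   # True
print(bool(either.fullmatch("sullivan")))  # True

# Alternatives are tried from left to right; the first one that succeeds is used.
first_wins = re.match(r"a|ab", "ab")
print(first_wins.group())                  # a

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the second alternative is chosen here.
rest_matters = re.match(r"(a|ab)c", "abc")
print(rest_matters.group(1))               # ab

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"x(y|)", "x")
print(repr(empty_alt.group(1)))            # ''
```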
2058    
2059    
2060  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
      The settings of PCRE_CASELESS, PCRE_MULTILINE,  PCRE_DOTALL,  
      and  PCRE_EXTENDED can be changed from within the pattern by  
      a sequence of Perl option letters enclosed between "(?"  and  
      ")". The option letters are  
   
        i  for PCRE_CASELESS  
        m  for PCRE_MULTILINE  
        s  for PCRE_DOTALL  
        x  for PCRE_EXTENDED  
   
      For example, (?im) sets caseless, multiline matching. It  is  
      also possible to unset these options by preceding the letter  
      with a hyphen, and a combined setting and unsetting such  as  
      (?im-sx),  which sets PCRE_CASELESS and PCRE_MULTILINE while  
      unsetting PCRE_DOTALL and PCRE_EXTENDED, is also  permitted.  
      If  a  letter  appears both before and after the hyphen, the  
      option is unset.  
   
      The scope of these option changes depends on  where  in  the  
      pattern  the  setting  occurs. For settings that are outside  
      any subpattern (defined below), the effect is the same as if  
      the  options were set or unset at the start of matching. The  
      following patterns all behave in exactly the same way:  
   
        (?i)abc  
        a(?i)bc  
        ab(?i)c  
        abc(?i)  
   
      which in turn is the same as compiling the pattern abc  with  
      PCRE_CASELESS  set.   In  other words, such "top level" set-  
      tings apply to the whole pattern  (unless  there  are  other  
      changes  inside subpatterns). If there is more than one set-  
      ting of the same option at top level, the rightmost  setting  
      is used.  
   
      If an option change occurs inside a subpattern,  the  effect  
      is  different.  This is a change of behaviour in Perl 5.005.  
      An option change inside a subpattern affects only that  part  
      of the subpattern that follows it, so  
   
        (a(?i)b)c  
   
      matches  abc  and  aBc  and  no  other   strings   (assuming  
      PCRE_CASELESS  is  not used).  By this means, options can be  
      made to have different settings in different  parts  of  the  
      pattern.  Any  changes  made  in one alternative do carry on  
      into subsequent branches within  the  same  subpattern.  For  
      example,  
   
        (a(?i)b|c)  
   
      matches "ab", "aB", "c", and "C", even though when  matching  
      "C" the first branch is abandoned before the option setting.  
      This is because the effects of  option  settings  happen  at  
      compile  time. There would be some very weird behaviour oth-  
      erwise.  
   
      The PCRE-specific options PCRE_UNGREEDY and  PCRE_EXTRA  can  
      be changed in the same way as the Perl-compatible options by  
      using the characters U and X  respectively.  The  (?X)  flag  
      setting  is  special in that it must always occur earlier in  
      the pattern than any of the additional features it turns on,  
      even when it is at top level. It is best put at the start.  
2061    
2062           The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
2063           PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
2064           sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
2065           option letters are
2066    
2067             i  for PCRE_CASELESS
2068             m  for PCRE_MULTILINE
2069             s  for PCRE_DOTALL
2070             x  for PCRE_EXTENDED
2071    
2072           For example, (?im) sets caseless, multiline matching. It is also possi-
2073           ble to unset these options by preceding the letter with a hyphen, and a
2074           combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
2075           LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
2076           is also permitted. If a  letter  appears  both  before  and  after  the
2077           hyphen, the option is unset.
2078    
2079           When  an option change occurs at top level (that is, not inside subpat-
2080           tern parentheses), the change applies to the remainder of  the  pattern
2081           that follows.  If the change is placed right at the start of a pattern,
2082           PCRE extracts it into the global options (and it will therefore show up
2083           in data extracted by the pcre_fullinfo() function).
2084    
2085           An option change within a subpattern affects only that part of the cur-
2086           rent pattern that follows it, so
2087    
2088             (a(?i)b)c
2089    
2090           matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
2091           used).   By  this means, options can be made to have different settings
2092           in different parts of the pattern. Any changes made in one  alternative
2093           do  carry  on  into subsequent branches within the same subpattern. For
2094           example,
2095    
2096             (a(?i)b|c)
2097    
2098           matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
2099           first  branch  is  abandoned before the option setting. This is because
2100           the effects of option settings happen at compile time. There  would  be
2101           some very weird behaviour otherwise.
2102    
2103           The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
2104           in the same way as the Perl-compatible options by using the  characters
2105           U  and X respectively. The (?X) flag setting is special in that it must
2106           always occur earlier in the pattern than any of the additional features
2107           it turns on, even when it is at top level. It is best put at the start.
2108    
2109    
2110  SUBPATTERNS  SUBPATTERNS
      Subpatterns are delimited by parentheses  (round  brackets),  
      which can be nested.  Marking part of a pattern as a subpat-  
      tern does two things:  
   
      1. It localizes a set of alternatives. For example, the pat-  
      tern  
   
        cat(aract|erpillar|)  
   
      matches one of the words "cat",  "cataract",  or  "caterpil-  
      lar".  Without  the  parentheses, it would match "cataract",  
      "erpillar" or the empty string.  
   
      2. It sets up the subpattern as a capturing  subpattern  (as  
      defined  above).   When the whole pattern matches, that por-  
      tion of the subject string that matched  the  subpattern  is  
      passed  back  to  the  caller  via  the  ovector argument of  
      pcre_exec(). Opening parentheses are counted  from  left  to  
      right (starting from 1) to obtain the numbers of the captur-  
      ing subpatterns.  
   
      For example, if the string "the red king" is matched against  
      the pattern  
   
        the ((red|white) (king|queen))  
   
      the captured substrings are "red king", "red",  and  "king",  
      and are numbered 1, 2, and 3, respectively.  
   
      The fact that plain parentheses fulfil two functions is  not  
      always  helpful.  There are often times when a grouping sub-  
      pattern is required without a capturing requirement.  If  an  
      opening parenthesis is followed by "?:", the subpattern does  
      not do any capturing, and is not counted when computing  the  
      number of any subsequent capturing subpatterns. For example,  
      if the string "the white queen" is matched against the  pat-  
      tern  
   
        the ((?:red|white) (king|queen))  
   
      the captured substrings are "white queen" and  "queen",  and  
      are  numbered  1  and 2. The maximum number of captured sub-  
      strings is 99, and the maximum number  of  all  subpatterns,  
      both capturing and non-capturing, is 200.  
   
      As a  convenient  shorthand,  if  any  option  settings  are  
      required  at  the  start  of a non-capturing subpattern, the  
      option letters may appear between the "?" and the ":".  Thus  
      the two patterns  
   
        (?i:saturday|sunday)  
        (?:(?i)saturday|sunday)  
   
      match exactly the same set of strings.  Because  alternative  
      branches  are  tried from left to right, and options are not  
      reset until the end of the subpattern is reached, an  option  
      setting  in  one  branch does affect subsequent branches, so  
      the above patterns match "SUNDAY" as well as "Saturday".  
2111    
2112           Subpatterns are delimited by parentheses (round brackets), which can be
2113           nested.  Marking part of a pattern as a subpattern does two things:
2114    
2115           1. It localizes a set of alternatives. For example, the pattern
2116    
2117             cat(aract|erpillar|)
2118    
2119           matches  one  of the words "cat", "cataract", or "caterpillar". Without
2120           the parentheses, it would match "cataract",  "erpillar"  or  the  empty
2121           string.
2122    
2123           2.  It  sets  up  the  subpattern as a capturing subpattern (as defined
2124           above).  When the whole pattern matches, that portion  of  the  subject
2125           string that matched the subpattern is passed back to the caller via the
2126           ovector argument of pcre_exec(). Opening parentheses are  counted  from
2127           left  to right (starting from 1) to obtain the numbers of the capturing
2128           subpatterns.
2129    
2130           For example, if the string "the red king" is matched against  the  pat-
2131           tern
2132    
2133             the ((red|white) (king|queen))
2134    
2135           the captured substrings are "red king", "red", and "king", and are num-
2136           bered 1, 2, and 3, respectively.
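
Both roles of parentheses can be seen in a small sketch. It uses Python's re module, which is assumed to number capturing groups in the same left-to-right way; in PCRE itself the captured substrings come back through the ovector argument of pcre_exec(), which this example does not model.

```python
import re

# 1. Parentheses localize a set of alternatives.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

words = ["cat", "cataract", "caterpillar", "erpillar", ""]
grouped_hits = [w for w in words if grouped.fullmatch(w)]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]
print(grouped_hits)    # ['cat', 'cataract', 'caterpillar']
print(ungrouped_hits)  # ['cataract', 'erpillar', '']

# 2. Opening parentheses are counted from left to right, starting from 1.
numbered = re.search(r"the ((red|white) (king|queen))", "the red king")
print(numbered.groups())  # ('red king', 'red', 'king')
```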
2137    
2138           The fact that plain parentheses fulfil  two  functions  is  not  always
2139           helpful.   There are often times when a grouping subpattern is required
2140           without a capturing requirement. If an opening parenthesis is  followed
2141           by  a question mark and a colon, the subpattern does not do any captur-
2142           ing, and is not counted when computing the  number  of  any  subsequent
2143           capturing  subpatterns. For example, if the string "the white queen" is
2144           matched against the pattern
2145    
2146             the ((?:red|white) (king|queen))
2147    
2148           the captured substrings are "white queen" and "queen", and are numbered
2149           1  and 2. The maximum number of capturing subpatterns is 65535, and the
2150           maximum depth of nesting of all subpatterns, both  capturing  and  non-
2151           capturing, is 200.
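
A sketch of how a non-capturing group changes the numbering, using Python's re module as a stand-in (the limits on the number and nesting depth of subpatterns quoted above are specific to PCRE and are not modelled here):

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
result = pattern.search("the white queen")

# (?:red|white) groups the alternatives but is skipped when numbering.
print(pattern.groups)   # 2
print(result.group(1))  # white queen
print(result.group(2))  # queen
```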
2152    
2153           As  a  convenient shorthand, if any option settings are required at the
2154           start of a non-capturing subpattern,  the  option  letters  may  appear
2155           between the "?" and the ":". Thus the two patterns
2156    
2157             (?i:saturday|sunday)
2158             (?:(?i)saturday|sunday)
2159    
2160           match exactly the same set of strings. Because alternative branches are
2161           tried from left to right, and options are not reset until  the  end  of
2162           the  subpattern is reached, an option setting in one branch does affect
2163           subsequent branches, so the above patterns match "SUNDAY"  as  well  as
2164           "Saturday".
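
The shorthand form can be tried in Python's re module, which accepts the same (?i:...) syntax. Only the first of the two patterns is shown, because Python treats an option setting in the middle of a pattern differently from PCRE; the names are illustrative.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option set at the start of the group applies to both branches.
print(bool(weekend.fullmatch("SUNDAY")))    # True
print(bool(weekend.fullmatch("Saturday")))  # True

# The option is confined to the group: text outside it stays caseful.
scoped = re.compile(r"(?i:sun)day")
print(bool(scoped.fullmatch("SUNday")))     # True
print(bool(scoped.fullmatch("SUNDAY")))     # False
```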
2165    
2166    
2167    NAMED SUBPATTERNS
2168    
2169           Identifying  capturing  parentheses  by number is simple, but it can be
2170           very hard to keep track of the numbers in complicated  regular  expres-
2171           sions.  Furthermore,  if  an  expression  is  modified, the numbers may
2172           change. To help with the difficulty, PCRE supports the naming  of  sub-
2173           patterns,  something  that  Perl  does  not  provide. The Python syntax
2174           (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
2175           underscores, and must be unique within a pattern.
2176    
2177           Named  capturing  parentheses  are  still  allocated numbers as well as
2178           names. The PCRE API provides function calls for extracting the name-to-
2179           number  translation  table from a compiled pattern. For further details
2180           see the pcreapi documentation.
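
Since the syntax is borrowed from Python, Python's own re module gives a convenient illustration. The groupindex attribute shown here is Python's analogue of a name-to-number table; the corresponding PCRE calls are described in the pcreapi documentation and are not reproduced here. The pattern and names are invented for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
found = date.search("released 2003-02")

# A named group can be retrieved by name or by its allocated number.
print(found.group("year"))    # 2003
print(found.group(1))         # 2003
print(found.group("month"))   # 02

# Name-to-number translation table for the compiled pattern.
name_to_number = dict(date.groupindex)
print(name_to_number)         # {'year': 1, 'month': 2}
```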
2181    
2182    
2183  REPETITION  REPETITION
2184    
2185           Repetition is specified by quantifiers, which can  follow  any  of  the
2186           following items:

        a single character, possibly escaped
        the . metacharacter
        a character class
        a back reference (see next section)
        a parenthesized subpattern (unless it is an assertion - see below)
   
      The general repetition quantifier specifies  a  minimum  and  
      maximum  number  of  permitted  matches,  by  giving the two  
      numbers in curly brackets (braces), separated  by  a  comma.  
      The  numbers  must be less than 65536, and the first must be  
      less than or equal to the second. For example:  
   
        z{2,4}  
   
      matches "zz", "zzz", or "zzzz". A closing brace on  its  own  
      is not a special character. If the second number is omitted,  
      but the comma is present, there is no upper  limit;  if  the  
      second number and the comma are both omitted, the quantifier  
      specifies an exact number of required matches. Thus  
   
        [aeiou]{3,}  
   
      matches at least 3 successive vowels,  but  may  match  many  
      more, while  
   
        \d{8}  
   
      matches exactly 8 digits.  An  opening  curly  bracket  that
      appears  in a position where a quantifier is not allowed, or
      one that does not match the syntax of a quantifier, is taken
      as  a literal character. For example, {,6} is not a
      quantifier, but a literal string of four characters.

      The quantifier {0} is permitted, causing the  expression  to
      behave  as  if the previous item and the quantifier were not
      present.
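
The brace quantifiers above can be exercised with a short sketch using Python's re module. The {,6} case is left out on purpose, because Python does not treat it as a literal string; the other cases are assumed to behave as described.

```python
import re

# {2,4}: at least two and at most four repetitions.
two_to_four = [s for s in ("z", "zz", "zzzz", "zzzzz") if re.fullmatch(r"z{2,4}", s)]
print(two_to_four)                                    # ['zz', 'zzzz']

# {3,}: at least three, with no upper limit.
vowels = re.match(r"[aeiou]{3,}", "aeiouxyz")
print(vowels.group())                                 # aeiou

# {8}: exactly eight.
print(bool(re.fullmatch(r"\d{8}", "20030224")))       # True
print(bool(re.fullmatch(r"\d{8}", "2003022")))        # False

# {0}: as if the previous item and the quantifier were not present.
print(bool(re.fullmatch(r"ab{0}c", "ac")))            # True

# A brace that does not start a valid quantifier is a literal character.
print(bool(re.fullmatch(r"a{b", "a{b")))              # True
```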
   
      For convenience (and  historical  compatibility)  the  three  
      most common quantifiers have single-character abbreviations:  
   
        *    is equivalent to {0,}  
        +    is equivalent to {1,}  
        ?    is equivalent to {0,1}  
   
      It is possible to construct infinite loops  by  following  a  
      subpattern  that  can  match no characters with a quantifier  
      that has no upper limit, for example:  
   
        (a?)*  
   
      Earlier versions of Perl and PCRE used to give an  error  at  
      compile  time  for such patterns. However, because there are  
      cases where this  can  be  useful,  such  patterns  are  now  
      accepted,  but  if  any repetition of the subpattern does in  
      fact match no characters, the loop is forcibly broken.  
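
A sketch showing that such a pattern terminates rather than looping. It uses Python's re module, which is assumed to break empty repetitions in a comparable way; how PCRE does this internally is not shown.

```python
import re

loop = re.compile(r"(a?)*")

# The repeated group can match the empty string, yet matching still finishes.
with_letters = loop.match("aab")
without_letters = loop.match("bbb")

print(repr(with_letters.group()))     # 'aa'
print(repr(without_letters.group()))  # ''
```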
   
      By default, the quantifiers  are  "greedy",  that  is,  they  
      match  as much as possible (up to the maximum number of per-  
      mitted times), without causing the rest of  the  pattern  to  
      fail. The classic example of where this gives problems is in  
      trying to match comments in C programs. These appear between  
      the  sequences /* and */ and within the sequence, individual  
      * and / characters may appear. An attempt to  match  C  com-  
      ments by applying the pattern  
   
        /\*.*\*/  
   
      to the string  
   
        /* first comment */  not comment  /* second comment */  
   
      fails, because it matches the entire  string  owing  to  the  
      greediness of the .*  item.  
   
      However, if a quantifier is followed by a question mark,  it  
      ceases  to be greedy, and instead matches the minimum number  
      of times possible, so the pattern  
   
        /\*.*?\*/  
   
      does the right thing with the C comments. The meaning of the  
      various  quantifiers is not otherwise changed, just the pre-  
      ferred number of matches.  Do not confuse this use of  ques-  
      tion  mark  with  its  use as a quantifier in its own right.  
      Because it has two uses, it can sometimes appear doubled, as  
      in  
   
        \d??\d  
   
      which matches one digit by preference, but can match two  if  
      that is the only way the rest of the pattern matches.  
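
      The difference between greedy and minimizing repeats can be
      sketched as follows. Python's re is used as a stand-in for PCRE
      (an assumption of this example); the ? suffix has the same
      meaning in both.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/" it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# Doubled question mark: an optional digit, preferring zero.
one_digit = re.match(r"\d??\d", "12").group()
# Forced to cover the whole subject, the optional digit is used.
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, one_digit, two_digits)
```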
   
      If the PCRE_UNGREEDY option is set (an option which  is  not  
      available  in  Perl),  the  quantifiers  are  not  greedy by  
      default, but individual ones can be made greedy by following  
      them  with  a  question mark. In other words, it inverts the  
      default behaviour.  
   
      When a parenthesized subpattern is quantified with a minimum  
      repeat  count  that is greater than 1 or with a limited max-  
      imum, more store is required for the  compiled  pattern,  in  
      proportion to the size of the minimum or maximum.  
   
      If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL  
      option (equivalent to Perl's /s) is set, thus allowing the .  
      to match  newlines,  the  pattern  is  implicitly  anchored,  
      because whatever follows will be tried against every charac-  
      ter position in the subject string, so there is no point  in  
      retrying  the overall match at any position after the first.  
      PCRE treats such a pattern as though it were preceded by \A.  
      In  cases where it is known that the subject string contains  
      no newlines, it is worth setting PCRE_DOTALL when  the  pat-  
      tern begins with .* in order to obtain this optimization, or  
      alternatively using ^ to indicate anchoring explicitly.  
   
      When a capturing subpattern is repeated, the value  captured  
      is the substring that matched the final iteration. For exam-  
      ple, after  
   
        (tweedle[dume]{3}\s*)+  
   
      has matched "tweedledum tweedledee" the value  of  the  cap-  
      tured  substring  is  "tweedledee".  However,  if  there are  
      nested capturing  subpatterns,  the  corresponding  captured  
      values  may  have been set in previous iterations. For exam-  
      ple, after  
2187    
2188         /(a|(b))+/           a literal data character
2189             the . metacharacter
2190             the \C escape sequence
2191             escapes such as \d that match single characters
2192             a character class
2193             a back reference (see next section)
2194             a parenthesized subpattern (unless it is an assertion)
2195    
2196       matches "aba" the value of the second captured substring  is         The  general repetition quantifier specifies a minimum and maximum num-
2197       "b".         ber of permitted matches, by giving the two numbers in  curly  brackets
2198           (braces),  separated  by  a comma. The numbers must be less than 65536,
2199           and the first must be less than or equal to the second. For example:
2200    
2201             z{2,4}
2202    
2203           matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
2204           special  character.  If  the second number is omitted, but the comma is
2205           present, there is no upper limit; if the second number  and  the  comma
2206           are  both omitted, the quantifier specifies an exact number of required
2207           matches. Thus
2208    
2209  BACK REFERENCES           [aeiou]{3,}
2210       Outside a character class, a backslash followed by  a  digit  
2211       greater  than  0  (and  possibly  further  digits) is a back         matches at least 3 successive vowels, but may match many more, while
2212    
2213             \d{8}
2214    
2215           matches exactly 8 digits. An opening curly bracket that  appears  in  a
2216           position  where a quantifier is not allowed, or one that does not match
2217           the syntax of a quantifier, is taken as a literal character. For  exam-
2218           ple, {,6} is not a quantifier, but a literal string of four characters.
2219    
2220           In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
2221           individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
2222           acters, each of which is represented by a two-byte sequence.
2223    
2224           The quantifier {0} is permitted, causing the expression to behave as if
2225           the previous item and the quantifier were not present.
2226    
2227           For  convenience  (and  historical compatibility) the three most common
2228           quantifiers have single-character abbreviations:
2229    
2230             *    is equivalent to {0,}
2231             +    is equivalent to {1,}
2232             ?    is equivalent to {0,1}
2233    
2234           It is possible to construct infinite loops by  following  a  subpattern
2235           that can match no characters with a quantifier that has no upper limit,
2236           for example:
2237    
2238             (a?)*
2239    
2240           Earlier versions of Perl and PCRE used to give an error at compile time
2241           for  such  patterns. However, because there are cases where this can be
2242           useful, such patterns are now accepted, but if any  repetition  of  the
2243           subpattern  does in fact match no characters, the loop is forcibly bro-
2244           ken.
2245    
2246           By default, the quantifiers are "greedy", that is, they match  as  much
2247           as  possible  (up  to  the  maximum number of permitted times), without
2248           causing the rest of the pattern to fail. The classic example  of  where
2249           this gives problems is in trying to match comments in C programs. These
2250           appear between the sequences /* and */ and within the  sequence,  indi-
2251           vidual * and / characters may appear. An attempt to match C comments by
2252           applying the pattern
2253    
2254             /\*.*\*/
2255    
2256           to the string
2257    
2258             /* first comment */  not comment  /* second comment */
2259    
2260           fails, because it matches the entire string owing to the greediness  of
2261           the .*  item.
2262    
2263           However,  if  a quantifier is followed by a question mark, it ceases to
2264           be greedy, and instead matches the minimum number of times possible, so
2265           the pattern
2266    
2267             /\*.*?\*/
2268    
2269           does  the  right  thing with the C comments. The meaning of the various
2270           quantifiers is not otherwise changed,  just  the  preferred  number  of
2271           matches.   Do  not  confuse this use of question mark with its use as a
2272           quantifier in its own right. Because it has two uses, it can  sometimes
2273           appear doubled, as in
2274    
2275             \d??\d
2276    
2277           which matches one digit by preference, but can match two if that is the
2278           only way the rest of the pattern matches.
2279    
2280           If the PCRE_UNGREEDY option is set (an option which is not available in
2281           Perl),  the  quantifiers are not greedy by default, but individual ones
2282           can be made greedy by following them with a  question  mark.  In  other
2283           words, it inverts the default behaviour.
2284    
2285           When  a  parenthesized  subpattern  is quantified with a minimum repeat
2286           count that is greater than 1 or with a limited maximum, more  store  is
2287           required  for  the  compiled  pattern, in proportion to the size of the
2288           minimum or maximum.
2289    
2290       reference to a capturing subpattern  earlier  (i.e.  to  its         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
2291       left)  in  the  pattern,  provided there have been that many         alent  to Perl's /s) is set, thus allowing the . to match newlines, the
2292       previous capturing left parentheses.         pattern is implicitly anchored, because whatever follows will be  tried
2293           against  every character position in the subject string, so there is no
2294           point in retrying the overall match at any position  after  the  first.
2295           PCRE normally treats such a pattern as though it were preceded by \A.
2296    
2297       However, if the decimal number following  the  backslash  is         In  cases  where  it  is known that the subject string contains no new-
2298       less  than  10,  it is always taken as a back reference, and         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
2299       causes an error only if there are not  that  many  capturing         mization, or alternatively using ^ to indicate anchoring explicitly.
      left  parentheses in the entire pattern. In other words, the  
      parentheses that are referenced need not be to the  left  of  
      the  reference  for  numbers  less  than 10. See the section  
      entitled "Backslash" above for further details of  the  han-  
      dling of digits following a backslash.  
2300    
2301       A back reference matches whatever actually matched the  cap-         However,  there is one situation where the optimization cannot be used.
2302       turing subpattern in the current subject string, rather than         When .*  is inside capturing parentheses that  are  the  subject  of  a
2303       anything matching the subpattern itself. So the pattern         backreference  elsewhere in the pattern, a match at the start may fail,
2304           and a later one succeed. Consider, for example:
2305    
2306         (sens|respons)e and \1ibility           (.*)abc\1
2307    
2308       matches "sense and sensibility" and "response and  responsi-         If the subject is "xyz123abc123" the match point is the fourth  charac-
2309       bility",  but  not  "sense  and  responsibility". If caseful         ter. For this reason, such a pattern is not implicitly anchored.
      matching is in force at the time of the back reference,  the  
      case of letters is relevant. For example,  
2310    
2311         ((?i)rah)\s+\1         When a capturing subpattern is repeated, the value captured is the sub-
2312           string that matched the final iteration. For example, after
2313    
2314       matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even           (tweedle[dume]{3}\s*)+
      though  the  original  capturing subpattern is matched case-  
      lessly.  
2315    
2316       There may be more than one back reference to the  same  sub-         has matched "tweedledum tweedledee" the value of the captured substring
2317       pattern.  If  a  subpattern  has not actually been used in a         is  "tweedledee".  However,  if there are nested capturing subpatterns,
2318       particular match, any back references to it always fail. For         the corresponding captured values may have been set in previous  itera-
2319       example, the pattern         tions. For example, after
2320    
2321         (a|(bc))\2           /(a|(b))+/
2322    
2323       always fails if it starts to match  "a"  rather  than  "bc".         matches "aba" the value of the second captured substring is "b".
      Because  there  may  be up to 99 back references, all digits  
      following the backslash are taken as  part  of  a  potential  
      back reference number. If the pattern continues with a digit  
      character, some delimiter must be used to terminate the back  
      reference.   If the PCRE_EXTENDED option is set, this can be  
      whitespace. Otherwise an empty comment can be used.  
2324    
      A back reference that occurs inside the parentheses to which  
      it  refers  fails when the subpattern is first used, so, for  
      example, (a\1) never matches.  However, such references  can  
      be useful inside repeated subpatterns. For example, the pat-  
      tern  
2325    
2326         (a|b\1)+  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2327    
2328       matches any number of "a"s and also "aba", "ababbaa" etc. At         With both maximizing and minimizing repetition, failure of what follows
2329       each iteration of the subpattern, the back reference matches         normally causes the repeated item to be re-evaluated to see if  a  dif-
2330       the character string corresponding to  the  previous  itera-         ferent number of repeats allows the rest of the pattern to match. Some-
2331       tion.  In  order  for this to work, the pattern must be such         times it is useful to prevent this, either to change the nature of  the
2332       that the first iteration does not need  to  match  the  back         match,  or  to  cause it to fail earlier than it otherwise might, when the
2333       reference.  This  can  be  done using alternation, as in the         author of the pattern knows there is no point in carrying on.
      example above, or by a quantifier with a minimum of zero.  
2334    
2335           Consider, for example, the pattern \d+foo when applied to  the  subject
2336           line
2337    
2338             123456bar
2339    
2340           After matching all 6 digits and then failing to match "foo", the normal
2341           action of the matcher is to try again with only 5 digits  matching  the
2342           \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
2343           "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
2344           the  means for specifying that once a subpattern has matched, it is not
2345           to be re-evaluated in this way.
2346    
2347           If we use atomic grouping for the previous example, the  matcher  would
2348           give up immediately on failing to match "foo" the first time. The nota-
2349           tion is a kind of special parenthesis, starting with  (?>  as  in  this
2350           example:
2351    
2352             (?>\d+)foo
2353    
2354           This  kind  of  parenthesis "locks up" the  part of the pattern it con-
2355           tains once it has matched, and a failure further into  the  pattern  is
2356           prevented  from  backtracking into it. Backtracking past it to previous
2357           items, however, works as normal.
2358    
2359           An alternative description is that a subpattern of  this  type  matches
2360           the  string  of  characters  that an identical standalone pattern would
2361           match, if anchored at the current point in the subject string.
2362    
2363           Atomic grouping subpatterns are not capturing subpatterns. Simple cases
2364           such as the above example can be thought of as a maximizing repeat that
2365           must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
2366           pared  to  adjust  the number of digits they match in order to make the
2367           rest of the pattern match, (?>\d+) can only match an entire sequence of
2368           digits.
2369    
2370           Atomic  groups in general can of course contain arbitrarily complicated
2371           subpatterns, and can be nested. However, when  the  subpattern  for  an
2372           atomic group is just a single repeated item, as in the example above, a
2373           simpler notation, called a "possessive quantifier" can  be  used.  This
2374           consists  of  an  additional  + character following a quantifier. Using
2375           this notation, the previous example can be rewritten as
2376    
2377             \d++foo
2378    
2379           Possessive  quantifiers  are  always  greedy;  the   setting   of   the
2380           PCRE_UNGREEDY option is ignored. They are a convenient notation for the
2381           simpler forms of atomic group. However, there is no difference  in  the
2382           meaning  or  processing  of  a possessive quantifier and the equivalent
2383           atomic group.
2384    
2385           The possessive quantifier syntax is an extension to the Perl syntax. It
2386           originates in Sun's Java package.
2387    
2388           When  a  pattern  contains an unlimited repeat inside a subpattern that
2389           can itself be repeated an unlimited number of  times,  the  use  of  an
2390           atomic  group  is  the  only way to avoid some failing matches taking a
2391           very long time indeed. The pattern
2392    
2393             (\D+|<\d+>)*[!?]
2394    
2395           matches an unlimited number of substrings that either consist  of  non-
2396           digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
2397           matches, it runs quickly. However, if it is applied to
2398    
2399             aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2400    
2401           it takes a long time before reporting  failure.  This  is  because  the
2402           string  can  be  divided  between  the two repeats in a large number of
2403           ways, and all have to be tried. (The example used [!?]  rather  than  a
2404           single  character  at the end, because both PCRE and Perl have an opti-
2405           mization that allows for fast failure when a single character is  used.
2406           They  remember  the last single character that is required for a match,
2407           and fail early if it is not present in the string.)  If the pattern  is
2408           changed to
2409    
2410             ((?>\D+)|<\d+>)*[!?]
2411    
2412           sequences  of non-digits cannot be broken, and failure happens quickly.
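
      A hedged sketch of the same comparison in Python. As an
      assumption of this example, the atomic group is emulated with
      a capturing lookahead plus a back reference, because older
      versions of Python's re lack (?>...). The plain pattern is run
      only against a shorter subject, since its running time grows
      very quickly with length. PLAIN and ATOMIC are illustrative
      names.

```python
import re

# The pattern from the text, unchanged.
PLAIN = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulation of ((?>\D+)|<\d+>)*[!?]: the lookahead grabs the whole
# run of non-digits and \1 consumes it without allowing the run
# to be split on backtracking.
ATOMIC = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

# Both report failure; the plain form has to try many ways of
# dividing the string first, so it is given only 16 characters.
short_fail = PLAIN.search("a" * 16)
long_fail = ATOMIC.search("a" * 52)

# A subject that the atomic form still accepts.
tagged = ATOMIC.fullmatch("<123>!")

print(short_fail, long_fail, tagged.group())
```

      Note that locking up \D+ changes what can match as well as how
      fast failure is reported: a run of non-digits that swallows the
      final ! or ? can no longer give it back.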
2413    
2414    
2415    BACK REFERENCES
2416    
2417           Outside a character class, a backslash followed by a digit greater than
2418           0 (and possibly further digits) is a back reference to a capturing sub-
2419           pattern earlier (that is, to its left) in the pattern,  provided  there
2420           have been that many previous capturing left parentheses.
2421    
2422           However, if the decimal number following the backslash is less than 10,
2423           it is always taken as a back reference, and causes  an  error  only  if
2424           there  are  not that many capturing left parentheses in the entire pat-
2425           tern. In other words, the parentheses that are referenced need  not  be
2426           to  the left of the reference for numbers less than 10. See the section
2427           entitled "Backslash" above for further details of the handling of  dig-
2428           its following a backslash.
2429    
2430           A  back  reference matches whatever actually matched the capturing sub-
2431           pattern in the current subject string, rather  than  anything  matching
2432           the subpattern itself (see "Subpatterns as subroutines" below for a way
2433           of doing that). So the pattern
2434    
2435             (sens|respons)e and \1ibility
2436    
2437           matches "sense and sensibility" and "response and responsibility",  but
2438           not  "sense and responsibility". If caseful matching is in force at the
2439           time of the back reference, the case of letters is relevant. For  exam-
2440           ple,
2441    
2442             ((?i)rah)\s+\1
2443    
2444           matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
2445           original capturing subpattern is matched caselessly.
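
      These two examples can be checked with a short sketch in
      Python, used here as a stand-in for PCRE (an assumption).
      Recent Python versions reject (?i) in the middle of a pattern,
      so the scoped form (?i:rah) is substituted; the back reference
      still lies outside the caseless scope, as in the text.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
# The back reference must repeat the text actually captured.
pair_results = [bool(pair.fullmatch(s)) for s in subjects]

# Group 1 is matched caselessly, but \1 compares casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s))
               for s in ("rah rah", "RAH RAH", "RAH rah")]

print(pair_results, rah_results)
```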
2446    
2447           Back references to named subpatterns use the Python  syntax  (?P=name).
2448           We could rewrite the above example as follows:
2449    
2450             (?P<p1>(?i)rah)\s+(?P=p1)
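
      Since the named syntax is borrowed from Python, the example can
      be tried there directly. In this sketch the group is defined
      with Python's (?P<p1>...) form, and (?i:rah) replaces the
      inline (?i), which recent Python versions do not accept in
      mid-pattern; both substitutions are assumptions of the example.

```python
import re

# (?P<p1>...) names the group; (?P=p1) refers back to it by name.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

named_results = [bool(named.fullmatch(s))
                 for s in ("rah rah", "RAH RAH", "RAH rah")]
captured = named.fullmatch("RAH RAH").group("p1")

print(named_results, captured)
```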
2451    
2452           There  may be more than one back reference to the same subpattern. If a
2453           subpattern has not actually been used in a particular match,  any  back
2454           references to it always fail. For example, the pattern
2455    
2456             (a|(bc))\2
2457    
2458           always  fails if it starts to match "a" rather than "bc". Because there
2459           may be many capturing parentheses in a pattern,  all  digits  following
2460           the  backslash  are taken as part of a potential back reference number.
2461           If the pattern continues with a digit character, some delimiter must be
2462           used  to  terminate  the back reference. If the PCRE_EXTENDED option is
2463           set, this can be whitespace.  Otherwise an empty comment can be used.
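
      A sketch of both points, using Python's re as a stand-in for
      PCRE (an assumption; the variable names are illustrative). The
      empty comment (?#) is accepted by both engines and ends the
      back reference number before a literal digit.

```python
import re

unset = re.compile(r"(a|(bc))\2")

# Group 1 takes "a", group 2 never participates, so \2 fails.
took_a = unset.match("abc")

# Group 2 captures "bc" and \2 repeats it.
took_bc = unset.match("bcbc").group()

# Without a delimiter, \23 would be read as one reference number.
# The empty comment separates \2 from the literal "3".
delimited = re.fullmatch(r"(a|(bc))\2(?#)3", "bcbc3")

print(took_a, took_bc, delimited.group())
```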
2464    
2465           A back reference that occurs inside the parentheses to which it  refers
2466           fails  when  the subpattern is first used, so, for example, (a\1) never
2467           matches.  However, such references can be useful inside  repeated  sub-
2468           patterns. For example, the pattern
2469    
2470             (a|b\1)+
2471    
2472           matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
2473           ation of the subpattern,  the  back  reference  matches  the  character
2474           string  corresponding  to  the previous iteration. In order for this to
2475     &n