Contents of /code/trunk/doc/pcre.txt

Revision 63
Sat Feb 24 21:40:03 2007 UTC by nigel
File MIME type: text/plain
File size: 142247 byte(s)
Load pcre-4.0 into code/trunk.

This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------
7    
8 nigel 41 NAME
9 nigel 63 PCRE - Perl-compatible regular expressions
10 nigel 41
11    
12 nigel 63 DESCRIPTION
13 nigel 41
14 nigel 63 The PCRE library is a set of functions that implement regu-
15     lar expression pattern matching using the same syntax and
16     semantics as Perl, with just a few differences. The current
17     implementation of PCRE (release 4.x) corresponds approxi-
18     mately with Perl 5.8, including support for UTF-8 encoded
19     strings. However, this support has to be explicitly
20     enabled; it is not the default.
21    
22     PCRE is written in C and released as a C library. However, a
23     number of people have written wrappers and interfaces of
24     various kinds. A C++ class is included in these contribu-
25     tions, which can be found in the Contrib directory at the
26     primary FTP site, which is:
27    
28     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
29    
30     Details of exactly which Perl regular expression features
31     are and are not supported by PCRE are given in separate
32     documents. See the pcrepattern and pcrecompat pages.
33    
34     Some features of PCRE can be included, excluded, or changed
35     when the library is built. The pcre_config() function makes
36     it possible for a client to discover which features are
37     available. Documentation about building PCRE for various
38     operating systems can be found in the README file in the
39     source distribution.
40    
41    
42     USER DOCUMENTATION
43    
44     The user documentation for PCRE has been split up into a
45     number of different sections. In the "man" format, each of
46     these is a separate "man page". In the HTML format, each is
47     a separate page, linked from the index page. In the plain
48     text format, all the sections are concatenated, for ease of
49     searching. The sections are as follows:
50    
  pcre          this document
  pcreapi       details of PCRE's native API
  pcrebuild     options for building PCRE
  pcrecallout   details of the callout feature
  pcrecompat    discussion of Perl compatibility
  pcregrep      description of the pcregrep command
  pcrepattern   syntax and semantics of supported
                regular expressions
  pcreperform   discussion of performance issues
  pcreposix     the POSIX-compatible API
  pcresample    discussion of the sample program
  pcretest      the pcretest testing command
63    
64     In addition, in the "man" and HTML formats, there is a short
65     page for each library function, listing its arguments and
66     results.
67    
68    
69     LIMITATIONS
70    
71     There are some size limitations in PCRE but it is hoped that
72     they will never in practice be relevant.
73    
The maximum length of a compiled pattern is 65539 (sic)
bytes if PCRE is compiled with the default internal linkage
size of 2. If you want to process regular expressions that
are truly enormous, you can compile PCRE with an internal
linkage size of 3 or 4 (see the README file in the source
distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However,
the speed of execution will be slower.
82    
83     All values in repeating quantifiers must be less than 65536.
84     The maximum number of capturing subpatterns is 65535.
85    
86     There is no limit to the number of non-capturing subpat-
87     terns, but the maximum depth of nesting of all kinds of
88     parenthesized subpattern, including capturing subpatterns,
89     assertions, and other types of subpattern, is 200.
90    
91     The maximum length of a subject string is the largest posi-
92     tive number that an integer variable can hold. However, PCRE
93     uses recursion to handle subpatterns and indefinite repeti-
94     tion. This means that the available stack space may limit
95     the size of a subject string that can be processed by cer-
96     tain patterns.
97    
98    
99     UTF-8 SUPPORT
100    
101     Starting at release 3.3, PCRE has had some support for char-
102     acter strings encoded in the UTF-8 format. For release 4.0
103     this has been greatly extended to cover most common require-
104     ments.
105    
In order to process UTF-8 strings, you must build PCRE to
include UTF-8 support in the code, and, in addition, you
must call pcre_compile() with the PCRE_UTF8 option flag.
When you do this, both the pattern and any subject strings
that are matched against it are treated as UTF-8 strings
instead of just strings of bytes.
112    
113     If you compile PCRE with UTF-8 support, but do not use it at
114     run time, the library will be a bit bigger, but the addi-
115     tional run time overhead is limited to testing the PCRE_UTF8
116     flag in several places, so should not be very large.
117    
118     The following comments apply when PCRE is running in UTF-8
119     mode:
120    
121     1. PCRE assumes that the strings it is given contain valid
122     UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
123     you pass invalid UTF-8 strings to PCRE, the results are
124     undefined.
125    
126     2. In a pattern, the escape sequence \x{...}, where the con-
127     tents of the braces is a string of hexadecimal digits, is
128     interpreted as a UTF-8 character whose code number is the
129     given hexadecimal number, for example: \x{1234}. If a non-
130     hexadecimal digit appears between the braces, the item is
131     not recognized. This escape sequence can be used either as
132     a literal, or within a character class.
133    
134     3. The original hexadecimal escape sequence, \xhh, matches a
135     two-byte UTF-8 character if the value is greater than 127.
136    
137     4. Repeat quantifiers apply to complete UTF-8 characters,
138     not to individual bytes, for example: \x{100}{3}.
139    
140     5. The dot metacharacter matches one UTF-8 character instead
141     of a single byte.
142    
143     6. The escape sequence \C can be used to match a single byte
144     in UTF-8 mode, but its use can lead to some strange effects.
145    
146     7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W
147     correctly test characters of any code value, but the charac-
148     ters that PCRE recognizes as digits, spaces, or word charac-
149     ters remain the same set as before, all with values less
150     than 256.
151    
152     8. Case-insensitive matching applies only to characters
153     whose values are less than 256. PCRE does not support the
154     notion of "case" for higher-valued characters.
155    
156     9. PCRE does not support the use of Unicode tables and pro-
157     perties or the Perl escapes \p, \P, and \X.
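
The byte counts mentioned in points 2 to 5 follow from the UTF-8
encoding rules. As an illustration only, here is a minimal sketch of
an encoder for code points up to 0xFFFF. The helper name utf8_encode
is an assumption made for this example; it is not part of the PCRE
API, and the sketch does not validate its input beyond the range
check (surrogate values, for instance, are not rejected).

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point (limited to 0xFFFF here, for brevity) as UTF-8.
   Writes the bytes to out and returns how many were written.
   utf8_encode is a hypothetical helper, not part of the PCRE API. */
static size_t utf8_encode(unsigned int cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0xFFFF);
    if (cp < 0x80) {                  /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                 /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

With this encoding, \x{1234} stands for the three bytes E1 88 B4, and
\x{100} for the two bytes C4 80, so \x{100}{3} matches six bytes of
subject. Any value from 128 to 255 needs two bytes, which is why \xhh
with a value greater than 127 matches a two-byte character.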
158    
159    
160     AUTHOR
161    
162     Philip Hazel <ph10@cam.ac.uk>
163     University Computing Service,
164     Cambridge CB2 3QG, England.
165     Phone: +44 1223 334714
166    
167     Last updated: 04 February 2003
168     Copyright (c) 1997-2003 University of Cambridge.
169     -----------------------------------------------------------------------------
170    
171     NAME
172     PCRE - Perl-compatible regular expressions
173    
174    
175     PCRE BUILD-TIME OPTIONS
176    
177     This document describes the optional features of PCRE that
178     can be selected when the library is compiled. They are all
179     selected, or deselected, by providing options to the config-
180     ure script which is run before the make command. The com-
181     plete list of options for configure (which includes the
182     standard ones such as the selection of the installation
183     directory) can be obtained by running
184    
185     ./configure --help
186    
187     The following sections describe certain options whose names
188     begin with --enable or --disable. These settings specify
189     changes to the defaults for the configure command. Because
190     of the way that configure works, --enable and --disable
191     always come in pairs, so the complementary option always
192     exists as well, but as it specifies the default, it is not
193     described.
194    
195    
196     UTF-8 SUPPORT
197    
198     To build PCRE with support for UTF-8 character strings, add
199    
200     --enable-utf8
201    
to the configure command. Of itself, this does not make PCRE
treat strings as UTF-8. As well as compiling PCRE with this
option, you also have to set the PCRE_UTF8 option when
you call the pcre_compile() function.
206    
207    
208     CODE VALUE OF NEWLINE
209    
210     By default, PCRE treats character 10 (linefeed) as the new-
211     line character. This is the normal newline character on
212     Unix-like systems. You can compile PCRE to use character 13
213     (carriage return) instead by adding
214    
215     --enable-newline-is-cr
216    
217     to the configure command. For completeness there is also a
218     --enable-newline-is-lf option, which explicitly specifies
219     linefeed as the newline character.
220    
221    
222     BUILDING SHARED AND STATIC LIBRARIES
223    
224     The PCRE building process uses libtool to build both shared
225     and static Unix libraries by default. You can suppress one
226     of these by adding one of
227    
228     --disable-shared
229     --disable-static
230    
231     to the configure command, as required.
232    
233    
234     POSIX MALLOC USAGE
235    
236     When PCRE is called through the POSIX interface (see the
237     pcreposix documentation), additional working storage is
238     required for holding the pointers to capturing substrings
239     because PCRE requires three integers per substring, whereas
240     the POSIX interface provides only two. If the number of
241     expected substrings is small, the wrapper function uses
242     space on the stack, because this is faster than using mal-
243     loc() for each call. The default threshold above which the
244     stack is no longer used is 10; it can be changed by adding a
245     setting such as
246    
247     --with-posix-malloc-threshold=20
248    
249     to the configure command.
250    
251    
252     LIMITING PCRE RESOURCE USAGE
253    
254     Internally, PCRE has a function called match() which it
255     calls repeatedly (possibly recursively) when performing a
256     matching operation. By limiting the number of times this
257     function may be called, a limit can be placed on the
258     resources used by a single call to pcre_exec(). The limit
259     can be changed at run time, as described in the pcreapi
260     documentation. The default is 10 million, but this can be
261     changed by adding a setting such as
262    
263     --with-match-limit=500000
264    
265     to the configure command.
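
The idea of bounding work by counting calls can be shown with a toy
backtracking matcher. This is a sketch only and is not PCRE's match()
function: the names toy_match, toy_exec and TOY_ERROR_MATCHLIMIT are
invented for the example, and the toy pattern language has just
literals, '.', and 'c*'. Every entry to the matching function bumps a
counter, and matching is abandoned once the counter passes the limit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ERROR_MATCHLIMIT (-1)   /* hypothetical code, not a PCRE constant */

struct toy_state {
    unsigned long calls;   /* entries to toy_match() so far */
    unsigned long limit;   /* give up once calls exceeds this */
};

/* Match pattern p at the start of subject s (literals, '.', and 'c*').
   Returns 1 on a match, 0 on no match, TOY_ERROR_MATCHLIMIT when the
   call limit has been exceeded. */
static int toy_match(const char *p, const char *s, struct toy_state *st)
{
    assert(p != NULL && s != NULL && st != NULL);
    if (++st->calls > st->limit)
        return TOY_ERROR_MATCHLIMIT;
    if (*p == '\0')
        return 1;                        /* pattern used up: success */
    if (p[1] == '*') {
        const char *t = s;
        for (;;) {
            int rc = toy_match(p + 2, t, st);
            if (rc != 0)
                return rc;               /* a match, or the limit was hit */
            if (*t == '\0' || (*p != '.' && *p != *t))
                return 0;                /* cannot repeat any further */
            t++;                         /* take one more repetition, retry */
        }
    }
    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_match(p + 1, s + 1, st);
    return 0;
}

/* Run one match with a fresh counter. */
static int toy_exec(const char *pattern, const char *subject,
                    unsigned long limit)
{
    struct toy_state st = { 0, limit };
    return toy_match(pattern, subject, &st);
}
```

A pattern such as a*a*a*b applied to a long run of "a" characters
fails only after a great deal of backtracking. With a small limit the
toy reports the limit error instead of running to completion, which is
the behaviour the match limit is intended to provide.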
266    
267    
268     HANDLING VERY LARGE PATTERNS
269    
270     Within a compiled pattern, offset values are used to point
271     from one part to another (for example, from an opening
272     parenthesis to an alternation metacharacter). By default
273     two-byte values are used for these offsets, leading to a
274     maximum size for a compiled pattern of around 64K. This is
275     sufficient to handle all but the most gigantic patterns.
276     Nevertheless, some people do want to process enormous pat-
277     terns, so it is possible to compile PCRE to use three-byte
278     or four-byte offsets by adding a setting such as
279    
280     --with-link-size=3
281    
282     to the configure command. The value given must be 2, 3, or
283     4. Using longer offsets slows down the operation of PCRE
284     because it has to load additional bytes when handling them.
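
The "around 64K" figure is simply the range of a two-byte unsigned
offset. As a worked piece of arithmetic (the helper offset_range is
invented for this example and is not part of PCRE), the range for
each permitted link size is 2 to the power of 8 times the size. The
library's real limit need not equal this figure exactly; the main
pcre page quotes 65539 bytes for a link size of 2.

```c
#include <assert.h>

/* Number of distinct values an unsigned offset of link_size bytes can
   hold. offset_range is a hypothetical helper, not part of the PCRE API. */
static unsigned long long offset_range(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```

This gives 65536 for a link size of 2, 16777216 (16M) for 3, and
4294967296 (4G) for 4.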
285    
286     If you build PCRE with an increased link size, test 2 (and
287     test 5 if you are using UTF-8) will fail. Part of the output
288     of these tests is a representation of the compiled pattern,
289     and this changes with the link size.
290    
291     Last updated: 21 January 2003
292     Copyright (c) 1997-2003 University of Cambridge.
293     -----------------------------------------------------------------------------
294    
295     NAME
296     PCRE - Perl-compatible regular expressions
297    
298    
299     SYNOPSIS OF PCRE API
300    
301 nigel 41 #include <pcre.h>
302    
303     pcre *pcre_compile(const char *pattern, int options,
304     const char **errptr, int *erroffset,
305     const unsigned char *tableptr);
306    
307     pcre_extra *pcre_study(const pcre *code, int options,
308     const char **errptr);
309    
310     int pcre_exec(const pcre *code, const pcre_extra *extra,
311     const char *subject, int length, int startoffset,
312     int options, int *ovector, int ovecsize);
313    
314 nigel 63 int pcre_copy_named_substring(const pcre *code,
315     const char *subject, int *ovector,
316     int stringcount, const char *stringname,
317     char *buffer, int buffersize);
318    
319 nigel 41 int pcre_copy_substring(const char *subject, int *ovector,
320     int stringcount, int stringnumber, char *buffer,
321     int buffersize);
322    
323 nigel 63 int pcre_get_named_substring(const pcre *code,
324     const char *subject, int *ovector,
325     int stringcount, const char *stringname,
326     const char **stringptr);
327    
328     int pcre_get_stringnumber(const pcre *code,
329     const char *name);
330    
331 nigel 41 int pcre_get_substring(const char *subject, int *ovector,
332     int stringcount, int stringnumber,
333     const char **stringptr);
334    
335     int pcre_get_substring_list(const char *subject,
336     int *ovector, int stringcount, const char ***listptr);
337    
338 nigel 49 void pcre_free_substring(const char *stringptr);
339    
340     void pcre_free_substring_list(const char **stringptr);
341    
342 nigel 41 const unsigned char *pcre_maketables(void);
343    
344 nigel 43 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
345     int what, void *where);
346    
347 nigel 63
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
349    
350 nigel 63 int pcre_config(int what, void *where);
351    
352 nigel 41 char *pcre_version(void);
353    
354     void *(*pcre_malloc)(size_t);
355    
356     void (*pcre_free)(void *);
357    
358 nigel 63 int (*pcre_callout)(pcre_callout_block *);
359 nigel 41
360    
361 nigel 63 PCRE API
362 nigel 41
363     PCRE has its own native API, which is described in this
364     document. There is also a set of wrapper functions that
365 nigel 43 correspond to the POSIX regular expression API. These are
366     described in the pcreposix documentation.
367    
368 nigel 41 The native API function prototypes are defined in the header
369     file pcre.h, and on Unix systems the library itself is
370     called libpcre.a, so can be accessed by adding -lpcre to the
371 nigel 43 command for linking an application which calls it. The
372     header file defines the macros PCRE_MAJOR and PCRE_MINOR to
373     contain the major and minor release numbers for the library.
374     Applications can use these to include support for different
375     releases.
376 nigel 41
377     The functions pcre_compile(), pcre_study(), and pcre_exec()
378 nigel 53 are used for compiling and matching regular expressions. A
379     sample program that demonstrates the simplest way of using
380 nigel 63 them is given in the file pcredemo.c. The pcresample docu-
381     mentation describes how to run it.
382 nigel 49
383 nigel 63 There are convenience functions for extracting captured sub-
384     strings from a matched subject string. They are:
385    
386     pcre_copy_substring()
387     pcre_copy_named_substring()
388     pcre_get_substring()
389     pcre_get_named_substring()
390     pcre_get_substring_list()
391    
392     pcre_free_substring() and pcre_free_substring_list() are
393     also provided, to free the memory used for extracted
394 nigel 49 strings.
395 nigel 41
396 nigel 49 The function pcre_maketables() is used (optionally) to build
397     a set of character tables in the current locale for passing
398     to pcre_compile().
399    
400 nigel 43 The function pcre_fullinfo() is used to find out information
401     about a compiled pattern; pcre_info() is an obsolete version
402     which returns only some of the available information, but is
403     retained for backwards compatibility. The function
404     pcre_version() returns a pointer to a string containing the
405     version of PCRE and its date of release.
406 nigel 41
407     The global variables pcre_malloc and pcre_free initially
408     contain the entry points of the standard malloc() and free()
409     functions respectively. PCRE calls the memory management
410     functions via these variables, so a calling program can
411     replace them if it wishes to intercept the calls. This
412     should be done before calling any PCRE functions.
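
As a minimal sketch of such a replacement, the following code installs
a pair of wrappers that count outstanding blocks. The names
counting_malloc, counting_free, install_counting_allocator and
live_blocks are invented for the example, as is the USE_REAL_PCRE
macro: when it is not defined, local stand-ins with the types shown in
the synopsis are declared so that the sketch compiles without the
library. In a real program the two variables come from pcre.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_REAL_PCRE        /* hypothetical macro: build against the library */
#include <pcre.h>
#else
/* Stand-ins with the types given in the synopsis (assumption for the demo). */
static void *(*pcre_malloc)(size_t) = malloc;
static void (*pcre_free)(void *) = free;
#endif

static unsigned long live_blocks = 0;   /* obtained and not yet released */

/* Wrapper around malloc() that counts successful allocations. */
static void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Wrapper around free() that keeps the count in step. */
static void counting_free(void *p)
{
    if (p != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
        free(p);
    }
}

/* Call once, before any other PCRE function is used. */
static void install_counting_allocator(void)
{
    pcre_malloc = counting_malloc;
    pcre_free = counting_free;
}
```

The counter in this sketch is not protected against concurrent
updates, so a multi-threaded program would need its own locking.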
413    
414 nigel 63 The global variable pcre_callout initially contains NULL. It
415     can be set by the caller to a "callout" function, which PCRE
416     will then call at specified points during a matching opera-
417     tion. Details are given in the pcrecallout documentation.
418 nigel 41
419    
420 nigel 63 MULTITHREADING
421    
422 nigel 53 The PCRE functions can be used in multi-threading applica-
423     tions, with the proviso that the memory management functions
424 nigel 63 pointed to by pcre_malloc and pcre_free, and the callout
425     function pointed to by pcre_callout, are shared by all
426 nigel 53 threads.
427 nigel 41
428     The compiled form of a regular expression is not altered
429     during matching, so the same compiled pattern can safely be
430     used by several threads at once.
431    
432    
433 nigel 63 CHECKING BUILD-TIME OPTIONS
434 nigel 41
435 nigel 63 int pcre_config(int what, void *where);
436    
437     The function pcre_config() makes it possible for a PCRE
438     client to discover which optional features have been com-
439     piled into the PCRE library. The pcrebuild documentation has
440     more details about these optional features.
441    
442     The first argument for pcre_config() is an integer, specify-
443     ing which information is required; the second argument is a
444     pointer to a variable into which the information is placed.
445     The following information is available:
446    
447     PCRE_CONFIG_UTF8
448    
449     The output is an integer that is set to one if UTF-8 support
450     is available; otherwise it is set to zero.
451    
452     PCRE_CONFIG_NEWLINE
453    
454     The output is an integer that is set to the value of the
455     code that is used for the newline character. It is either
456     linefeed (10) or carriage return (13), and should normally
457     be the standard character for your operating system.
458    
459     PCRE_CONFIG_LINK_SIZE
460    
461     The output is an integer that contains the number of bytes
462     used for internal linkage in compiled regular expressions.
463     The value is 2, 3, or 4. Larger values allow larger regular
464     expressions to be compiled, at the expense of slower match-
465     ing. The default value of 2 is sufficient for all but the
466     most massive patterns, since it allows the compiled pattern
467     to be up to 64K in size.
468    
469     PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
470    
471     The output is an integer that contains the threshold above
472     which the POSIX interface uses malloc() for output vectors.
473     Further details are given in the pcreposix documentation.
474    
475     PCRE_CONFIG_MATCH_LIMIT
476    
477     The output is an integer that gives the default limit for
478     the number of internal matching function calls in a
479     pcre_exec() execution. Further details are given with
480     pcre_exec() below.
481    
482    
483 nigel 41 COMPILING A PATTERN
484 nigel 63
485     pcre *pcre_compile(const char *pattern, int options,
486     const char **errptr, int *erroffset,
487     const unsigned char *tableptr);
488    
489 nigel 41 The function pcre_compile() is called to compile a pattern
490     into an internal form. The pattern is a C string terminated
491     by a binary zero, and is passed in the argument pattern. A
492     pointer to a single block of memory that is obtained via
493     pcre_malloc is returned. This contains the compiled code and
494 nigel 53 related data. The pcre type is defined for the returned
495     block; this is a typedef for a structure whose contents are
496     not externally defined. It is up to the caller to free the
497     memory when it is no longer required.
498 nigel 41
499 nigel 53 Although the compiled code of a PCRE regex is relocatable,
500     that is, it does not depend on memory location, the complete
501     pcre data block is not fully relocatable, because it con-
502     tains a copy of the tableptr argument, which is an address
(see below).

The options argument contains independent bits that affect
505     the compilation. It should be zero if no options are
506     required. Some of the options, in particular, those that are
507     compatible with Perl, can also be set and unset from within
508     the pattern (see the detailed description of regular expres-
509 nigel 63 sions in the pcrepattern documentation). For these options,
510     the contents of the options argument specifies their initial
511     settings at the start of compilation and execution. The
512     PCRE_ANCHORED option can be set at the time of matching as
513     well as at compile time.
514 nigel 41
515     If errptr is NULL, pcre_compile() returns NULL immediately.
516     Otherwise, if compilation of a pattern fails, pcre_compile()
517     returns NULL, and sets the variable pointed to by errptr to
518     point to a textual error message. The offset from the start
519     of the pattern to the character where the error was
520     discovered is placed in the variable pointed to by
521     erroffset, which must not be NULL. If it is, an immediate
522     error is given.
523    
524     If the final argument, tableptr, is NULL, PCRE uses a
525     default set of character tables which are built when it is
526     compiled, using the default C locale. Otherwise, tableptr
527     must be the result of a call to pcre_maketables(). See the
528     section on locale support below.
529    
530 nigel 53 This code fragment shows a typical straightforward call to
531     pcre_compile():
532    
533     pcre *re;
534     const char *error;
535     int erroffset;
536     re = pcre_compile(
537     "^A.*Z", /* the pattern */
538     0, /* default options */
539     &error, /* for error message */
540     &erroffset, /* for error offset */
541     NULL); /* use default character tables */
542    
543 nigel 63 The following option bits are defined:
544 nigel 41
545     PCRE_ANCHORED
546    
547     If this bit is set, the pattern is forced to be "anchored",
548 nigel 63 that is, it is constrained to match only at the first match-
549     ing point in the string which is being searched (the "sub-
550     ject string"). This effect can also be achieved by appropri-
551     ate constructs in the pattern itself, which is the only way
552     to do it in Perl.
553 nigel 41
554     PCRE_CASELESS
555    
556     If this bit is set, letters in the pattern match both upper
557     and lower case letters. It is equivalent to Perl's /i
558 nigel 63 option, and it can be changed within a pattern by a (?i)
559     option setting.
560 nigel 41
561     PCRE_DOLLAR_ENDONLY
562    
563     If this bit is set, a dollar metacharacter in the pattern
564     matches only at the end of the subject string. Without this
565     option, a dollar also matches immediately before the final
566     character if it is a newline (but not before any other new-
567     lines). The PCRE_DOLLAR_ENDONLY option is ignored if
568     PCRE_MULTILINE is set. There is no equivalent to this option
569 nigel 63 in Perl, and no way to set it within a pattern.
570 nigel 41
571     PCRE_DOTALL
572    
If this bit is set, a dot metacharacter in the pattern
matches all characters, including newlines. Without it,
newlines are excluded. This option is equivalent to Perl's /s
option, and it can be changed within a pattern by a (?s)
option setting. A negative class such as [^a] always matches
a newline character, independent of the setting of this
option.
580 nigel 41
581     PCRE_EXTENDED
582    
583     If this bit is set, whitespace data characters in the pat-
584     tern are totally ignored except when escaped or inside a
585 nigel 63 character class. Whitespace does not include the VT charac-
586     ter (code 11). In addition, characters between an unescaped
587     # outside a character class and the next newline character,
588 nigel 41 inclusive, are also ignored. This is equivalent to Perl's /x
589 nigel 63 option, and it can be changed within a pattern by a (?x)
590     option setting.
591    
592     This option makes it possible to include comments inside
593     complicated patterns. Note, however, that this applies only
594     to data characters. Whitespace characters may never appear
595 nigel 41 within special character sequences in a pattern, for example
596 nigel 63 within the sequence (?( which introduces a conditional sub-
597 nigel 41 pattern.
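
The rules above can be illustrated with a toy filter that removes
what extended mode would ignore. It is a sketch under stated
assumptions, not PCRE's own code: the names is_x_space and
strip_extended are invented, whitespace is assumed to be space, TAB,
LF, FF and CR (that is, the C locale set without VT), and character
class parsing is simplified (a ] as the first character of a class
and POSIX class names are not handled).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Whitespace as assumed for this sketch: no VT (code 11). */
static int is_x_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Copy pattern to out, dropping unescaped whitespace and # comments
   that lie outside a character class. out needs room for
   strlen(pattern) + 1 bytes. Returns out. */
static char *strip_extended(const char *pattern, char *out)
{
    char *w = out;
    int in_class = 0;
    const char *p;

    assert(pattern != NULL && out != NULL);
    for (p = pattern; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {   /* escaped: keep both chars */
            *w++ = *p++;
            *w++ = *p;
        } else if (in_class) {              /* inside [...]: keep all */
            if (*p == ']')
                in_class = 0;
            *w++ = *p;
        } else if (*p == '[') {
            in_class = 1;
            *w++ = *p;
        } else if (*p == '#') {             /* skip through the newline */
            while (*p != '\n' && p[1] != '\0')
                p++;
        } else if (!is_x_space(*p)) {
            *w++ = *p;
        }
    }
    *w = '\0';
    return out;
}
```

For example, the pattern text "a b\ c [ x] # note", followed by a
newline and " d", reduces to "ab\ c[ x]d": the escaped space and the
space inside the class survive, while the other spaces and the
comment are dropped.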
598    
599     PCRE_EXTRA
600    
601 nigel 43 This option was invented in order to turn on additional
602     functionality of PCRE that is incompatible with Perl, but it
603     is currently of very little use. When set, any backslash in
604     a pattern that is followed by a letter that has no special
605     meaning causes an error, thus reserving these combinations
606     for future expansion. By default, as in Perl, a backslash
607     followed by a letter with no special meaning is treated as a
608     literal. There are at present no other features controlled
609     by this option. It can also be set by a (?X) option setting
610     within a pattern.
611 nigel 41
612     PCRE_MULTILINE
613    
614     By default, PCRE treats the subject string as consisting of
615     a single "line" of characters (even if it actually contains
616     several newlines). The "start of line" metacharacter (^)
617     matches only at the start of the string, while the "end of
618     line" metacharacter ($) matches only at the end of the
619     string, or before a terminating newline (unless
620     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
621    
When PCRE_MULTILINE is set, the "start of line" and "end
of line" constructs match immediately following or
immediately before any newline in the subject string,
respectively, as well as at the very start and end. This is
equivalent to Perl's /m option, and it can be changed within
a pattern by a (?m) option setting. If there are no "\n"
characters in a subject string, or no occurrences of ^ or $
in a pattern, setting PCRE_MULTILINE has no effect.
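
The interaction between the default behaviour, PCRE_DOLLAR_ENDONLY
and PCRE_MULTILINE can be summarized as a small decision function.
This is an illustrative sketch of the rules as stated in this
document, not PCRE's implementation. The flag names TOY_MULTILINE and
TOY_DOLLAR_ENDONLY and the function dollar_matches are invented for
the example, and the newline character is assumed to be linefeed, as
in a default build.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names local to this sketch; they are not the PCRE_* option bits. */
enum { TOY_MULTILINE = 1, TOY_DOLLAR_ENDONLY = 2 };

/* Return 1 if a dollar assertion succeeds at offset pos of the
   subject, which is len bytes long; otherwise return 0. */
static int dollar_matches(const char *subject, size_t len, size_t pos,
                          int flags)
{
    assert(subject != NULL && pos <= len);
    if (pos == len)
        return 1;                      /* the very end always matches */
    if (flags & TOY_MULTILINE)
        return subject[pos] == '\n';   /* before any newline; the
                                          ENDONLY flag is ignored */
    if (flags & TOY_DOLLAR_ENDONLY)
        return 0;                      /* only the very end counts */
    /* Default: also immediately before a final newline. */
    return pos == len - 1 && subject[pos] == '\n';
}
```

For the six-byte subject "ab\ncd\n", the default rules accept offsets
5 and 6, the end-only rules accept offset 6 alone, and the multiline
rules accept offsets 2, 5 and 6.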
630 nigel 41
631 nigel 63 PCRE_NO_AUTO_CAPTURE
632    
633     If this option is set, it disables the use of numbered cap-
634     turing parentheses in the pattern. Any opening parenthesis
635     that is not followed by ? behaves as if it were followed by
636     ?: but named parentheses can still be used for capturing
637     (and they acquire numbers in the usual way). There is no
638     equivalent of this option in Perl.
639    
640 nigel 41 PCRE_UNGREEDY
641    
642     This option inverts the "greediness" of the quantifiers so
643     that they are not greedy by default, but become greedy if
644     followed by "?". It is not compatible with Perl. It can also
645     be set by a (?U) option setting within the pattern.
646    
647 nigel 49 PCRE_UTF8
648 nigel 41
649 nigel 49 This option causes PCRE to regard both the pattern and the
650 nigel 63 subject as strings of UTF-8 characters instead of single-
651     byte character strings. However, it is available only if
652     PCRE has been built to include UTF-8 support. If not, the
653     use of this option provokes an error. Details of how this
654     option changes the behaviour of PCRE are given in the sec-
655     tion on UTF-8 support in the main pcre page.
656 nigel 41
657 nigel 49
658 nigel 63 STUDYING A PATTERN
659 nigel 49
660 nigel 63 pcre_extra *pcre_study(const pcre *code, int options,
661     const char **errptr);
662    
When a pattern is going to be used several times, it is
worth spending more time analyzing it in order to speed up
the time taken for matching. The function pcre_study() takes
a pointer to a compiled pattern as its first argument. If
studying the pattern produces additional information that
will help speed up matching, pcre_study() returns a pointer
to a pcre_extra block, in which the study_data field points
to the results of the study.
671 nigel 41
The value returned by pcre_study() can be passed
directly to pcre_exec(). However, the pcre_extra block also
contains other fields that can be set by the caller before
the block is passed; these are described below. If studying
the pattern does not produce any additional information,
pcre_study() returns NULL. In that circumstance, if the
calling program wants to pass some of the other fields to
pcre_exec(), it must set up its own pcre_extra block.
680    
681 nigel 41 The second argument contains option bits. At present, no
682     options are defined for pcre_study(), and this argument
683     should always be zero.
684    
685 nigel 63 The third argument for pcre_study() is a pointer for an
686     error message. If studying succeeds (even if no data is
687     returned), the variable it points to is set to NULL. Other-
688     wise it points to a textual error message. You should there-
689     fore test the error pointer for NULL after calling
690     pcre_study(), to be sure that it has run successfully.
691 nigel 41
692 nigel 53 This is a typical call to pcre_study():
693    
694     pcre_extra *pe;
695     pe = pcre_study(
696     re, /* result of pcre_compile() */
697     0, /* no options exist */
698     &error); /* set to NULL or points to a message */
699    
700 nigel 41 At present, studying a pattern is useful only for non-
701     anchored patterns that do not have a single fixed starting
702     character. A bitmap of possible starting characters is
703     created.
704    
705    
706 nigel 63 LOCALE SUPPORT
707 nigel 41
708     PCRE handles caseless matching, and determines whether char-
709     acters are letters, digits, or whatever, by reference to a
710 nigel 63 set of tables. When running in UTF-8 mode, this applies only
711     to characters with codes less than 256. The library contains
712     a default set of tables that is created in the default C
713     locale when PCRE is compiled. This is used when the final
714     argument of pcre_compile() is NULL, and is sufficient for
715     many applications.
716 nigel 41
717     An alternative set of tables can, however, be supplied. Such
718     tables are built by calling the pcre_maketables() function,
719     which has no arguments, in the relevant locale. The result
720     can then be passed to pcre_compile() as often as necessary.
721     For example, to build and use tables that are appropriate
722     for the French locale (where accented characters with codes
723     greater than 128 are treated as letters), the following code
724     could be used:
725    
726     setlocale(LC_CTYPE, "fr");
727     tables = pcre_maketables();
728     re = pcre_compile(..., tables);
729    
730     The tables are built in memory that is obtained via
731     pcre_malloc. The pointer that is passed to pcre_compile is
732     saved with the compiled pattern, and the same tables are
733 nigel 63 used via this pointer by pcre_study() and pcre_exec(). Thus,
734 nigel 41 for any single pattern, compilation, studying and matching
735     all happen in the same locale, but different patterns can be
736     compiled in different locales. It is the caller's responsi-
737     bility to ensure that the memory containing the tables
738     remains available for as long as it is needed.
739    
740    
741 nigel 63 INFORMATION ABOUT A PATTERN
742 nigel 41
743 nigel 63 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
744     int what, void *where);
745    
The pcre_fullinfo() function returns information about a
compiled pattern. It replaces the obsolete pcre_info()
function, which is nevertheless retained for backwards
compatibility (and is documented below).
750 nigel 41
751 nigel 43 The first argument for pcre_fullinfo() is a pointer to the
752     compiled pattern. The second argument is the result of
753     pcre_study(), or NULL if the pattern was not studied. The
754     third argument specifies which piece of information is
755 nigel 63 required, and the fourth argument is a pointer to a variable
756     to receive the data. The yield of the function is zero for
757     success, or one of the following negative numbers:
758 nigel 43
  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid
763 nigel 41
764 nigel 53 Here is a typical call of pcre_fullinfo(), to obtain the
765     length of the compiled pattern:
766    
  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */
774    
775 nigel 43 The possible values for the third argument are defined in
776     pcre.h, and are as follows:
777    
778 nigel 63 PCRE_INFO_BACKREFMAX
779 nigel 43
780 nigel 63 Return the number of the highest back reference in the pat-
781     tern. The fourth argument should point to an int variable.
782     Zero is returned if there are no back references.
783 nigel 41
784 nigel 43 PCRE_INFO_CAPTURECOUNT
785    
786     Return the number of capturing subpatterns in the pattern.
787     The fourth argument should point to an int variable.
788    
789 nigel 63 PCRE_INFO_FIRSTBYTE
790 nigel 43
791 nigel 63 Return information about the first byte of any matched
792     string, for a non-anchored pattern. (This option used to be
793     called PCRE_INFO_FIRSTCHAR; the old name is still recognized
794     for backwards compatibility.)
795 nigel 43
796 nigel 63 If there is a fixed first byte, e.g. from a pattern such as
797 nigel 47 (cat|cow|coyote), it is returned in the integer pointed to
798     by where. Otherwise, if either
799 nigel 41
800     (a) the pattern was compiled with the PCRE_MULTILINE option,
801     and every branch starts with "^", or
802    
803     (b) every branch of the pattern starts with ".*" and
804     PCRE_DOTALL is not set (if it were set, the pattern would be
805     anchored),
806 nigel 43
807 nigel 47 -1 is returned, indicating that the pattern matches only at
808 nigel 63 the start of a subject string or after any newline within
809     the string. Otherwise -2 is returned. For anchored patterns,
810     -2 is returned.
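
The three cases above can be told apart with a small helper.
This is only a sketch: the enum and the function name
classify_first_byte() are hypothetical and are not part of
the PCRE API. The argument is the integer that
pcre_fullinfo() stored for PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical names, used only for this illustration. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the byte itself (0..255) */
    FIRST_BYTE_LINE_START,  /* -1: start of subject or after a newline */
    FIRST_BYTE_NONE         /* -2: no such information, or anchored */
};

static enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)  return FIRST_BYTE_FIXED;
    if (value == -1) return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```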
811 nigel 41
812 nigel 43 PCRE_INFO_FIRSTTABLE
813 nigel 41
814 nigel 43 If the pattern was studied, and this resulted in the con-
815 nigel 63 struction of a 256-bit table indicating a fixed set of bytes
816     for the first byte in any matching string, a pointer to the
817     table is returned. Otherwise NULL is returned. The fourth
818     argument should point to an unsigned char * variable.
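
As a minimal sketch of how a 256-bit table of this kind can
be consulted, the helpers below keep one bit per byte value
in 32 bytes. The bit layout (byte c/8, bit c%8 counted from
the least significant end) and all three function names are
assumptions made for illustration; they are not part of the
PCRE API.

```c
/* Hypothetical helpers: one bit per possible byte value, 32 bytes in all.
   The layout (byte c/8, bit c%8) is an assumption for this sketch. */
static void first_table_set(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int first_table_has(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] & (1u << (c % 8))) != 0;
}

/* Return the offset of the first byte at which a match could start,
   or length if no byte of the subject is in the table. */
static int first_candidate(const unsigned char *table,
                           const char *subject, int length)
{
    int i;
    for (i = 0; i < length; i++)
        if (first_table_has(table, (unsigned char)subject[i]))
            return i;
    return length;
}
```

A scan of this kind lets a matcher skip over subject bytes
at which no match can possibly begin.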
819 nigel 41
820 nigel 43 PCRE_INFO_LASTLITERAL
821    
822     For a non-anchored pattern, return the value of the right-
823 nigel 63 most literal byte which must exist in any matched string,
824     other than at its start. The fourth argument should point to
825     an int variable. If there is no such byte, or if the pattern
826     is anchored, -1 is returned. For example, for the pattern
827     /a\d+z\d+/ the returned value is 'z'.
828 nigel 43
829 nigel 63 PCRE_INFO_NAMECOUNT
830     PCRE_INFO_NAMEENTRYSIZE
831     PCRE_INFO_NAMETABLE
832    
833     PCRE supports the use of named as well as numbered capturing
834     parentheses. The names are just an additional way of identi-
835     fying the parentheses, which still acquire a number. A
836     caller that wants to extract data from a named subpattern
837     must convert the name to a number in order to access the
838     correct pointers in the output vector (described with
839     pcre_exec() below). In order to do this, it must first use
840     these three values to obtain the name-to-number mapping
841     table for the pattern.
842    
843     The map consists of a number of fixed-size entries.
844     PCRE_INFO_NAMECOUNT gives the number of entries, and
845     PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
846     of these return an int value. The entry size depends on the
847     length of the longest name. PCRE_INFO_NAMETABLE returns a
848     pointer to the first entry of the table (a pointer to char).
849     The first two bytes of each entry are the number of the cap-
850     turing parenthesis, most significant byte first. The rest of
851     the entry is the corresponding name, zero terminated. The
852     names are in alphabetical order. For example, consider the
853     following pattern (assume PCRE_EXTENDED is set, so white
854     space - including newlines - is ignored):
855    
856     (?P<date> (?P<year>(\d\d)?\d\d) -
857     (?P<month>\d\d) - (?P<day>\d\d) )
858    
There are four named subpatterns, so the table has four
entries, and each entry in the table is eight bytes long.
The table is as follows, with non-printing bytes shown in
hex, and undefined bytes shown as ??:
863    
864     00 01 d a t e 00 ??
865     00 05 d a y 00 ?? ??
866     00 04 m o n t h 00
867     00 02 y e a r 00 ??
868    
869     When writing code to extract data from named subpatterns,
870     remember that the length of each entry may be different for
871     each compiled pattern.
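
The following sketch walks a table laid out as described
above and converts a name to its subpattern number. The
helper lookup_name() and the array example_table are
hypothetical; in a real program the table pointer, the entry
count, and the entry size would come from pcre_fullinfo()
with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and
PCRE_INFO_NAMEENTRYSIZE.

```c
#include <string.h>

/* The table from the example above: four entries of eight bytes each.
   Bytes shown as ?? in the text are filled with zero here. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Hypothetical helper: return the subpattern number for name, or -1. */
static int lookup_name(const unsigned char *table, int count,
                       int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + i * entry_size;
        /* The name starts after the two-byte number and is zero terminated. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant byte first */
    }
    return -1;
}
```

Because the entry size is passed in rather than fixed, the
same loop works for any compiled pattern.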
872    
873     PCRE_INFO_OPTIONS
874    
875     Return a copy of the options with which the pattern was com-
876     piled. The fourth argument should point to an unsigned long
877     int variable. These option bits are those specified in the
878     call to pcre_compile(), modified by any top-level option
879     settings within the pattern itself.
880    
881     A pattern is automatically anchored by PCRE if all of its
882     top-level alternatives begin with one of the following:
883    
  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears
889    
890     For such patterns, the PCRE_ANCHORED bit is set in the
891     options returned by pcre_fullinfo().
892    
893     PCRE_INFO_SIZE
894    
895     Return the size of the compiled pattern, that is, the value
896     that was passed as the argument to pcre_malloc() when PCRE
897     was getting memory in which to place the compiled data. The
898     fourth argument should point to a size_t variable.
899    
900     PCRE_INFO_STUDYSIZE
901    
902     Returns the size of the data block pointed to by the
903     study_data field in a pcre_extra block. That is, it is the
904     value that was passed to pcre_malloc() when PCRE was getting
905     memory into which to place the data created by pcre_study().
906     The fourth argument should point to a size_t variable.
907    
908    
909     OBSOLETE INFO FUNCTION
910    
int pcre_info(const pcre *code, int *optptr,
     int *firstcharptr);
912    
913 nigel 43 The pcre_info() function is now obsolete because its inter-
914     face is too restrictive to return all the available data
915     about a compiled pattern. New programs should use
916     pcre_fullinfo() instead. The yield of pcre_info() is the
917     number of capturing subpatterns, or one of the following
918     negative numbers:
919    
  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
922    
923     If the optptr argument is not NULL, a copy of the options
924     with which the pattern was compiled is placed in the integer
925     it points to (see PCRE_INFO_OPTIONS above).
926    
927     If the pattern is not anchored and the firstcharptr argument
928     is not NULL, it is used to pass back information about the
929     first character of any matched string (see
930 nigel 63 PCRE_INFO_FIRSTBYTE above).
931 nigel 43
932    
933 nigel 41 MATCHING A PATTERN
934 nigel 53
935 nigel 63 int pcre_exec(const pcre *code, const pcre_extra *extra,
936     const char *subject, int length, int startoffset,
937     int options, int *ovector, int ovecsize);
938 nigel 53
939 nigel 63 The function pcre_exec() is called to match a subject string
940 nigel 41 against a pre-compiled pattern, which is passed in the code
941     argument. If the pattern has been studied, the result of the
942 nigel 63 study should be passed in the extra argument.
943 nigel 41
944 nigel 53 Here is an example of a simple call to pcre_exec():
945    
946     int rc;
947     int ovector[30];
948     rc = pcre_exec(
949     re, /* result of pcre_compile() */
950     NULL, /* we didn't study the pattern */
951     "some string", /* the subject string */
952     11, /* the length of the subject string */
953     0, /* start at offset 0 in the subject */
954     0, /* default options */
955     ovector, /* vector for substring information */
956     30); /* number of elements in the vector */
957    
958 nigel 63 If the extra argument is not NULL, it must point to a
959     pcre_extra data block. The pcre_study() function returns
960     such a block (when it doesn't return NULL), but you can also
961     create one for yourself, and pass additional information in
962     it. The fields in the block are as follows:
963    
964     unsigned long int flags;
965     void *study_data;
966     unsigned long int match_limit;
967     void *callout_data;
968    
969     The flags field is a bitmap that specifies which of the
970     other fields are set. The flag bits are:
971    
972     PCRE_EXTRA_STUDY_DATA
973     PCRE_EXTRA_MATCH_LIMIT
974     PCRE_EXTRA_CALLOUT_DATA
975    
976     Other flag bits should be set to zero. The study_data field
977     is set in the pcre_extra block that is returned by
978     pcre_study(), together with the appropriate flag bit. You
979     should not set this yourself, but you can add to the block
980     by setting the other fields.
981    
982     The match_limit field provides a means of preventing PCRE
983     from using up a vast amount of resources when running pat-
984     terns that are not going to match, but which have a very
985     large number of possibilities in their search trees. The
986     classic example is the use of nested unlimited repeats.
987     Internally, PCRE uses a function called match() which it
988     calls repeatedly (sometimes recursively). The limit is
989     imposed on the number of times this function is called dur-
990     ing a match, which has the effect of limiting the amount of
991     recursion and backtracking that can take place. For patterns
992     that are not anchored, the count starts from zero for each
993     position in the subject string.
994    
The default limit for the library can be set when PCRE is
built; the built-in default is 10 million, which handles
all but the most extreme cases. You can reduce the limit by
supplying pcre_exec() with a pcre_extra block in which
match_limit is set to a smaller value, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
limit is exceeded, pcre_exec() returns
PCRE_ERROR_MATCHLIMIT.
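
To illustrate why a call limit is useful, here is a toy
matcher, written for this note only, for the anchored
pattern (a+)+b. It is not PCRE's match() function; the names
toy_match(), toy_exec(), and the TOY_ constants are
hypothetical. It counts its own invocations in the way the
text describes and gives up once the limit is exceeded.

```c
#include <stddef.h>

#define TOY_MATCH        1
#define TOY_NOMATCH    (-1)
#define TOY_MATCHLIMIT (-8)   /* same number as PCRE_ERROR_MATCHLIMIT */

/* Anchored toy matcher for (a+)+b that counts its own invocations. */
static int toy_match(const char *s, unsigned long *calls,
                     unsigned long limit)
{
    size_t run = 0, k;
    if (++*calls > limit) return TOY_MATCHLIMIT;
    while (s[run] == 'a') run++;
    /* Try the longest group of a's first, then back off one at a time. */
    for (k = run; k >= 1; k--) {
        int rc;
        if (s[k] == 'b') return TOY_MATCH;
        rc = toy_match(s + k, calls, limit);   /* another (a+) group */
        if (rc != TOY_NOMATCH) return rc;      /* a match, or limit reached */
    }
    return TOY_NOMATCH;
}

static int toy_exec(const char *subject, unsigned long limit)
{
    unsigned long calls = 0;
    return toy_match(subject, &calls, limit);
}
```

For a subject consisting of n letters "a" and no "b", this
sketch makes 2^n calls before it can report failure, so a
limit of 1000 is already exceeded for quite short subjects.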
1003    
The callout_data field is used in conjunction with the
"callout" feature, which is described in the pcrecallout
documentation.
1007    
The PCRE_ANCHORED option can be passed in the options
argument, whose unused bits must be zero. This limits
pcre_exec() to matching at the first matching position.
However, if a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it
cannot be made unanchored at matching time.
1014 nigel 41
1015     There are also three further options that can be set only at
1016     matching time:
1017    
1018     PCRE_NOTBOL
1019    
1020     The first character of the string is not the beginning of a
1021     line, so the circumflex metacharacter should not match
1022     before it. Setting this without PCRE_MULTILINE (at compile
1023     time) causes circumflex never to match.
1024    
1025     PCRE_NOTEOL
1026    
1027     The end of the string is not the end of a line, so the dol-
1028     lar metacharacter should not match it nor (except in multi-
1029     line mode) a newline immediately before it. Setting this
1030     without PCRE_MULTILINE (at compile time) causes dollar never
1031     to match.
1032    
1033     PCRE_NOTEMPTY
1034    
1035     An empty string is not considered to be a valid match if
1036     this option is set. If there are alternatives in the pat-
1037     tern, they are tried. If all the alternatives match the
1038     empty string, the entire match fails. For example, if the
1039     pattern
1040    
1041     a?b?
1042    
1043     is applied to a string not beginning with "a" or "b", it
1044     matches the empty string at the start of the subject. With
1045     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1046     further into the string for occurrences of "a" or "b".
1047    
1048     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1049     make a special case of a pattern match of the empty string
1050     within its split() function, and when using the /g modifier.
1051     It is possible to emulate Perl's behaviour after matching a
1052     null string by first trying the match again at the same
1053     offset with PCRE_NOTEMPTY set, and then if that fails by
1054     advancing the starting offset (see below) and trying an
1055     ordinary match again.
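
The retry recipe above can be sketched as a loop. To keep
the example self-contained it does not call pcre_exec();
instead xstar_exec() is a hypothetical stand-in that behaves
like a non-anchored search for the pattern x*, and
TOY_NOTEMPTY stands in for PCRE_NOTEMPTY. Only the control
flow of count_global_matches() is the point here.

```c
#define TOY_NOTEMPTY 1   /* hypothetical flag, stands in for PCRE_NOTEMPTY */

/* Toy non-anchored matcher for x*: returns 1 and fills ovector[0..1],
   or returns -1 if there is no match at or after startoffset. */
static int xstar_exec(const char *subject, int length, int startoffset,
                      int options, int *ovector)
{
    int p, end;
    for (p = startoffset; p <= length; p++) {
        end = p;
        while (end < length && subject[end] == 'x') end++;
        if (end > p || !(options & TOY_NOTEMPTY)) {
            ovector[0] = p;
            ovector[1] = end;
            return 1;
        }
    }
    return -1;
}

/* Count all matches in the subject, following the recipe above. */
static int count_global_matches(const char *subject, int length)
{
    int ovector[3];
    int start = 0, count = 0, options = 0;

    while (start <= length) {
        int rc = xstar_exec(subject, length, start, options, ovector);
        if (rc < 0) {
            if (options == 0) break;  /* an ordinary match failed: done */
            options = 0;              /* the retry failed: advance by one */
            start++;
            continue;
        }
        count++;
        start = ovector[1];
        /* After an empty match, retry at the same offset with NOTEMPTY. */
        options = (ovector[0] == ovector[1]) ? TOY_NOTEMPTY : 0;
    }
    return count;
}
```

For the subject "xxayy" this sketch counts five matches:
"xx", followed by empty matches at offsets 2, 3, 4, and 5.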
1056    
1057 nigel 63 The subject string is passed to pcre_exec() as a pointer in
1058     subject, a length in length, and a starting offset in star-
1059     toffset. Unlike the pattern string, the subject may contain
1060     binary zero bytes. When the starting offset is zero, the
1061 nigel 53 search for a match starts at the beginning of the subject,
1062     and this is by far the most common case.
1063 nigel 41
1064 nigel 63 If the pattern was compiled with the PCRE_UTF8 option, the
1065     subject must be a sequence of bytes that is a valid UTF-8
1066     string. If an invalid UTF-8 string is passed, PCRE's
1067     behaviour is not defined.
1068    
1069 nigel 41 A non-zero starting offset is useful when searching for
1070     another match in the same subject by calling pcre_exec()
1071     again after a previous success. Setting startoffset differs
1072     from just passing over a shortened string and setting
1073     PCRE_NOTBOL in the case of a pattern that begins with any
1074     kind of lookbehind. For example, consider the pattern
1075    
1076     \Biss\B
1077    
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a
word boundary.) When applied to the string "Mississippi" the
first call to pcre_exec() finds the first occurrence. If
pcre_exec() is called again with just the remainder of the
subject, namely "issippi", it does not match, because \B is
always false at the start of the subject, which is deemed to
be a word boundary. However, if pcre_exec() is passed the
entire string again, but with startoffset set to 4, it finds
the second occurrence of "iss" because it is able to look
behind the starting point to discover that it is preceded by
a letter.
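
The difference between the two approaches can be reproduced
without PCRE by a hand-written search for \Biss\B. This is a
sketch under simplifying assumptions: the helper names are
hypothetical, and a word character is taken to be an ASCII
letter, digit, or underscore.

```c
#include <ctype.h>
#include <string.h>

/* 1 if subject[i] is a word character; positions outside the subject
   count as non-word, so the subject's start and end can be boundaries. */
static int is_word(const char *subject, int length, int i)
{
    if (i < 0 || i >= length) return 0;
    return isalnum((unsigned char)subject[i]) || subject[i] == '_';
}

/* \B: true when the characters on both sides of pos are of the same kind. */
static int not_boundary(const char *subject, int length, int pos)
{
    return is_word(subject, length, pos - 1) == is_word(subject, length, pos);
}

/* Find \Biss\B at or after startoffset; return its offset or -1. */
static int find_iss(const char *subject, int length, int startoffset)
{
    int p;
    for (p = startoffset; p + 3 <= length; p++) {
        if (memcmp(subject + p, "iss", 3) == 0 &&
            not_boundary(subject, length, p) &&
            not_boundary(subject, length, p + 3))
            return p;
    }
    return -1;
}
```

Searching the whole of "Mississippi" from offset 4 can look
back at the preceding "s", whereas searching the shortened
string "issippi" from offset 0 has nothing to look back at.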
1090    
1091     If a non-zero starting offset is passed when the pattern is
1092     anchored, one attempt to match at the given offset is tried.
1093     This can only succeed if the pattern does not require the
1094     match to be at the start of the subject.
1095    
1096     In general, a pattern matches a certain portion of the sub-
1097     ject, and in addition, further substrings from the subject
1098     may be picked out by parts of the pattern. Following the
1099     usage in Jeffrey Friedl's book, this is called "capturing"
1100     in what follows, and the phrase "capturing subpattern" is
1101     used for a fragment of a pattern that picks out a substring.
1102     PCRE supports several other kinds of parenthesized subpat-
1103     tern that do not cause substrings to be captured.
1104    
1105     Captured substrings are returned to the caller via a vector
1106     of integer offsets whose address is passed in ovector. The
1107     number of elements in the vector is passed in ovecsize. The
1108     first two-thirds of the vector is used to pass back captured
1109     substrings, each substring using a pair of integers. The
1110     remaining third of the vector is used as workspace by
1111     pcre_exec() while matching capturing subpatterns, and is not
1112     available for passing back information. The length passed in
1113     ovecsize should always be a multiple of three. If it is not,
1114     it is rounded down.
1115    
When a match has been successful, information about captured
substrings is returned in pairs of integers, starting at the
beginning of ovector, and continuing up to two-thirds of its
length at the most. The first element of a pair is set to
the offset of the first character in a substring, and the
second is set to the offset of the first character after the
end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string
matched by the entire pattern. The next pair is used for the
first capturing subpattern, and so on. The value returned by
pcre_exec() is the number of pairs that have been set. If
there are no capturing subpatterns, the return value from a
successful match is 1, indicating that just the first pair
of offsets has been set.

Some convenience functions are provided for extracting the
captured substrings as separate strings. These are described
in the following section.
1133    
It is possible for a capturing subpattern number n+1 to
match some part of the subject when subpattern n has not
been used at all. For example, if the string "abc" is
matched against the pattern (a|(z))(bc) subpatterns 1 and 3
are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.
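
For that example the offset pairs would look like the array
below, and an unset subpattern can be detected by testing
for a negative offset. The array is written out by hand from
the description above rather than produced by a real call,
and the helper names are hypothetical.

```c
/* Offsets described in the text for "abc" matched against (a|(z))(bc). */
static const int example_ovector[] = {
     0,  3,   /* pair 0: the whole match, "abc" */
     0,  1,   /* pair 1: subpattern 1, "a" */
    -1, -1,   /* pair 2: subpattern 2, not used */
     1,  3    /* pair 3: subpattern 3, "bc" */
};

/* Hypothetical helper: non-zero if subpattern n took part in the match. */
static int is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0;
}

/* Length of substring n, or -1 if the subpattern is unset. */
static int substring_length(const int *ovector, int n)
{
    return is_set(ovector, n) ? ovector[2 * n + 1] - ovector[2 * n] : -1;
}
```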
1140    
1141     If a capturing subpattern is matched repeatedly, it is the
1142     last portion of the string that it matched that gets
1143     returned.
1144    
1145     If the vector is too small to hold all the captured sub-
1146     strings, it is used as far as possible (up to two-thirds of
1147     its length), and the function returns a value of zero. In
1148     particular, if the substring offsets are not of interest,
1149     pcre_exec() may be called with ovector passed as NULL and
1150     ovecsize as zero. However, if the pattern contains back
1151     references and the ovector isn't big enough to remember the
1152     related substrings, PCRE has to get additional memory for
1153     use during matching. Thus it is usually advisable to supply
1154     an ovector.
1155    
Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT (or
the obsolete pcre_info()) can be used to find out how many
capturing subpatterns there are in a compiled pattern. The
smallest size for ovector that will allow for n captured
substrings, in addition to the offsets of the substring
matched by the whole pattern, is (n+1)*3.
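
The arithmetic can be written down as two small helpers.
Their names are hypothetical; they restate the (n+1)*3 rule
and the rounding down to a multiple of three.

```c
/* Smallest ovecsize that can report n capturing subpatterns as well as
   the whole match. */
static int min_ovecsize(int n_captures)
{
    return (n_captures + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize ints can return;
   a size that is not a multiple of three is rounded down. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, the 30-element vector used in the earlier
pcre_exec() example has room for the whole match plus nine
captured substrings.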
1161 nigel 41
1162     If pcre_exec() fails, it returns a negative number. The fol-
1163     lowing are defined in the header file:
1164    
1165     PCRE_ERROR_NOMATCH (-1)
1166    
1167     The subject string did not match the pattern.
1168    
1169     PCRE_ERROR_NULL (-2)
1170    
1171     Either code or subject was passed as NULL, or ovector was
1172     NULL and ovecsize was not zero.
1173    
1174     PCRE_ERROR_BADOPTION (-3)
1175    
1176     An unrecognized bit was set in the options argument.
1177    
1178     PCRE_ERROR_BADMAGIC (-4)
1179    
1180     PCRE stores a 4-byte "magic number" at the start of the com-
1181     piled code, to catch the case when it is passed a junk
1182     pointer. This is the error it gives when the magic number
1183     isn't present.
1184    
1185     PCRE_ERROR_UNKNOWN_NODE (-5)
1186    
1187     While running the pattern match, an unknown item was encoun-
1188     tered in the compiled pattern. This error could be caused by
1189     a bug in PCRE or by overwriting of the compiled pattern.
1190    
1191     PCRE_ERROR_NOMEMORY (-6)
1192    
1193     If a pattern contains back references, but the ovector that
1194     is passed to pcre_exec() is not big enough to remember the
1195     referenced substrings, PCRE gets a block of memory at the
1196     start of matching to use for this purpose. If the call via
1197     pcre_malloc() fails, this error is given. The memory is
1198     freed at the end of matching.
1199    
1200 nigel 63 PCRE_ERROR_NOSUBSTRING (-7)
1201 nigel 41
1202 nigel 63 This error is used by the pcre_copy_substring(),
1203     pcre_get_substring(), and pcre_get_substring_list() func-
1204     tions (see below). It is never returned by pcre_exec().
1205 nigel 41
1206 nigel 63 PCRE_ERROR_MATCHLIMIT (-8)
1207 nigel 53
1208 nigel 63 The recursion and backtracking limit, as specified by the
1209     match_limit field in a pcre_extra structure (or defaulted)
1210     was reached. See the description above.
1211    
1212     PCRE_ERROR_CALLOUT (-9)
1213    
1214     This error is never generated by pcre_exec() itself. It is
1215     provided for use by callout functions that want to yield a
1216     distinctive error code. See the pcrecallout documentation
1217     for details.
1218    
1219    
1220     EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1221    
1222     int pcre_copy_substring(const char *subject, int *ovector,
1223     int stringcount, int stringnumber, char *buffer,
1224     int buffersize);
1225    
1226     int pcre_get_substring(const char *subject, int *ovector,
1227     int stringcount, int stringnumber,
1228     const char **stringptr);
1229    
1230     int pcre_get_substring_list(const char *subject,
1231     int *ovector, int stringcount, const char ***listptr);
1232    
1233    
1234 nigel 41 Captured substrings can be accessed directly by using the
1235     offsets returned by pcre_exec() in ovector. For convenience,
1236     the functions pcre_copy_substring(), pcre_get_substring(),
1237     and pcre_get_substring_list() are provided for extracting
1238     captured substrings as new, separate, zero-terminated
1239 nigel 63 strings. These functions identify substrings by number. The
1240     next section describes functions for extracting named sub-
1241 nigel 41 strings. A substring that contains a binary zero is
1242     correctly extracted and has a further zero added on the end,
1243 nigel 63 but the result is not, of course, a C string.
1244 nigel 41
1245 nigel 63 The first three arguments are the same for all three of
1246     these functions: subject is the subject string which has
1247     just been successfully matched, ovector is a pointer to the
1248     vector of integer offsets that was passed to pcre_exec(),
1249     and stringcount is the number of substrings that were cap-
1250     tured by the match, including the substring that matched the
1251 nigel 41 entire regular expression. This is the value returned by
1252     pcre_exec if it is greater than zero. If pcre_exec()
1253     returned zero, indicating that it ran out of space in ovec-
1254 nigel 47 tor, the value passed as stringcount should be the size of
1255     the vector divided by three.
1256 nigel 41
1257     The functions pcre_copy_substring() and pcre_get_substring()
1258     extract a single substring, whose number is given as string-
1259     number. A value of zero extracts the substring that matched
1260     the entire pattern, while higher values extract the captured
1261     substrings. For pcre_copy_substring(), the string is placed
1262     in buffer, whose length is given by buffersize, while for
1263 nigel 49 pcre_get_substring() a new block of memory is obtained via
1264 nigel 41 pcre_malloc, and its address is returned via stringptr. The
1265     yield of the function is the length of the string, not
1266     including the terminating zero, or one of
1267    
1268     PCRE_ERROR_NOMEMORY (-6)
1269    
1270     The buffer was too small for pcre_copy_substring(), or the
1271     attempt to get memory failed for pcre_get_substring().
1272    
1273     PCRE_ERROR_NOSUBSTRING (-7)
1274    
1275     There is no substring whose number is stringnumber.
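
The behaviour described for pcre_copy_substring() can be
approximated as follows. This is a sketch written from the
description above, not the library's source; the function
name copy_substring_sketch() and the SKETCH_ constants are
hypothetical.

```c
#include <string.h>

#define SKETCH_NOMEMORY    (-6)   /* same numbers as the codes above */
#define SKETCH_NOSUBSTRING (-7)

static int copy_substring_sketch(const char *subject, const int *ovector,
                                 int stringcount, int stringnumber,
                                 char *buffer, int buffersize)
{
    int start, length;
    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_NOSUBSTRING;
    start = ovector[2 * stringnumber];
    /* An unset substring (negative offsets) is copied as an empty string. */
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;
    if (buffersize < length + 1) return SKETCH_NOMEMORY;
    if (length > 0) memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;   /* does not include the terminating zero */
}
```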
1276    
1277     The pcre_get_substring_list() function extracts all avail-
1278     able substrings and builds a list of pointers to them. All
1279     this is done in a single block of memory which is obtained
1280     via pcre_malloc. The address of the memory block is returned
1281     via listptr, which is also the start of the list of string
1282     pointers. The end of the list is marked by a NULL pointer.
1283     The yield of the function is zero if all went well, or
1284    
1285     PCRE_ERROR_NOMEMORY (-6)
1286    
1287     if the attempt to get the memory block failed.
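
The single-block layout described above (a list of pointers,
a terminating NULL, then the string data) can be sketched as
follows. This illustration uses malloc() directly instead of
pcre_malloc, and the function name build_substring_list() is
hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Build one block holding stringcount pointers, a NULL terminator, and
   the zero-terminated copies of the substrings. Returns 0, or -6 if the
   memory cannot be obtained. */
static int build_substring_list(const char *subject, const int *ovector,
                                int stringcount, const char ***listptr)
{
    size_t size = sizeof(char *) * (size_t)(stringcount + 1);
    char **list;
    char *p;
    int i;

    for (i = 0; i < stringcount; i++)   /* unset pairs (-1,-1) add 0 */
        size += (size_t)(ovector[2 * i + 1] - ovector[2 * i]) + 1;

    list = malloc(size);
    if (list == NULL) return -6;

    p = (char *)(list + stringcount + 1);   /* strings follow the pointers */
    for (i = 0; i < stringcount; i++) {
        int start = ovector[2 * i];
        int length = ovector[2 * i + 1] - start;
        if (length > 0) memcpy(p, subject + start, (size_t)length);
        list[i] = p;
        p += length;
        *p++ = '\0';
    }
    list[stringcount] = NULL;
    *listptr = (const char **)list;
    return 0;
}
```

Because everything lives in one block, a single call to
free() on the returned pointer releases the list and all of
its strings.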
1288    
When any of these functions encounters a substring that is
unset, which can happen when capturing subpattern number n+1
matches some part of the subject, but subpattern n has not
been used at all, it returns an empty string. This can be
distinguished from a genuine zero-length substring by
inspecting the appropriate offset in ovector, which is
negative for unset substrings.
1296    
1297 nigel 49 The two convenience functions pcre_free_substring() and
1298     pcre_free_substring_list() can be used to free the memory
1299     returned by a previous call of pcre_get_substring() or
1300     pcre_get_substring_list(), respectively. They do nothing
1301     more than call the function pointed to by pcre_free, which
1302     of course could be called directly from a C program. How-
1303     ever, PCRE is used in some situations where it is linked via
1304     a special interface to another programming language which
1305     cannot use pcre_free directly; it is for these cases that
1306     the functions are provided.
1307 nigel 41
1308    
1309 nigel 63 EXTRACTING CAPTURED SUBSTRINGS BY NAME
1310 nigel 41
1311 nigel 63 int pcre_copy_named_substring(const pcre *code,
1312     const char *subject, int *ovector,
1313     int stringcount, const char *stringname,
1314     char *buffer, int buffersize);
1315 nigel 41
1316 nigel 63 int pcre_get_stringnumber(const pcre *code,
1317     const char *name);
1318 nigel 41
1319 nigel 63 int pcre_get_named_substring(const pcre *code,
1320     const char *subject, int *ovector,
1321     int stringcount, const char *stringname,
1322     const char **stringptr);
1323 nigel 41
To extract a substring by name, you first have to find the
associated number. This can be done by calling
pcre_get_stringnumber(). The first argument is the compiled
pattern, and the second is the name. For example, for this
pattern

  ab(?P<xxx>\d+)...
1331    
1332     the number of the subpattern called "xxx" is 1. Given the
1333     number, you can then extract the substring directly, or use
1334     one of the functions described in the previous section. For
1335     convenience, there are also two functions that do the whole
1336     job.
1337    
1338     Most of the arguments of pcre_copy_named_substring() and
1339     pcre_get_named_substring() are the same as those for the
1340     functions that extract by number, and so are not re-
1341     described here. There are just two differences.
1342    
1343     First, instead of a substring number, a substring name is
1344     given. Second, there is an extra argument, given at the
1345     start, which is a pointer to the compiled pattern. This is
1346     needed in order to gain access to the name-to-number trans-
1347     lation table.
1348    
1349     These functions call pcre_get_stringnumber(), and if it
1350     succeeds, they then call pcre_copy_substring() or
1351     pcre_get_substring(), as appropriate.
1352    
1353     Last updated: 03 February 2003
1354     Copyright (c) 1997-2003 University of Cambridge.
1355     -----------------------------------------------------------------------------
1356    
1357     NAME
1358     PCRE - Perl-compatible regular expressions
1359    
1360    
1361     PCRE CALLOUTS
1362    
1363     int (*pcre_callout)(pcre_callout_block *);
1364    
1365     PCRE provides a feature called "callout", which is a means
1366     of temporarily passing control to the caller of PCRE in the
1367     middle of pattern matching. The caller of PCRE provides an
1368     external function by putting its entry point in the global
1369     variable pcre_callout. By default, this variable contains
1370     NULL, which disables all calling out.
1371    
1372     Within a regular expression, (?C) indicates the points at
1373     which the external function is to be called. Different cal-
1374     lout points can be identified by putting a number less than
1375     256 after the letter C. The default value is zero. For
1376     example, this pattern has two callout points:
1377    
1378     (?C1)9abc(?C2)def
1379    
1380     During matching, when PCRE reaches a callout point (and
1381     pcre_callout is set), the external function is called. Its
1382     only argument is a pointer to a pcre_callout block. This
1383     contains the following variables:
1384    
1385     int version;
1386     int callout_number;
1387     int *offset_vector;
1388     const char *subject;
1389     int subject_length;
1390     int start_match;
1391     int current_position;
1392     int capture_top;
1393     int capture_last;
1394     void *callout_data;
1395    
1396     The version field is an integer containing the version
1397     number of the block format. The current version is zero. The
1398     version number may change in future if additional fields are
1399     added, but the intention is never to remove any of the
1400     existing fields.
1401    
1402     The callout_number field contains the number of the callout,
1403     as compiled into the pattern (that is, the number after ?C).
1404    
1405     The offset_vector field is a pointer to the vector of
1406     offsets that was passed by the caller to pcre_exec(). The
1407     contents can be inspected in order to extract substrings
1408     that have been matched so far, in the same way as for
1409     extracting substrings after a match has completed.
1410     The subject and subject_length fields contain copies of the
1411     values that were passed to pcre_exec().
1412    
1413     The start_match field contains the offset within the subject
1414     at which the current match attempt started. If the pattern
1415     is not anchored, the callout function may be called several
1416     times for different starting points.
1417    
1418     The current_position field contains the offset within the
1419     subject of the current match pointer.
1420    
1421     The capture_top field contains the number of the highest
1422     captured substring so far.
1423    
1424     The capture_last field contains the number of the most
1425     recently captured substring.
1426    
1427     The callout_data field contains a value that is passed to
1428     pcre_exec() by the caller specifically so that it can be
1429     passed back in callouts. It is passed in the callout_data
1430     field of the pcre_extra data structure. If no such data was
1431     passed, the value of callout_data in a pcre_callout block is
1432     NULL. There is a description of the pcre_extra structure in
1433     the pcreapi documentation.
1434    
1435    
1436    
1437     RETURN VALUES
1438    
1439     The callout function returns an integer. If the value is
1440     zero, matching proceeds as normal. If the value is greater
1441     than zero, matching fails at the current point, but back-
1442     tracking to test other possibilities goes ahead, just as if
1443     a lookahead assertion had failed. If the value is less than
1444     zero, the match is abandoned, and pcre_exec() returns the
1445     value.
1446    
1447     Negative values should normally be chosen from the set of
1448     PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1449     forces a standard "no match" failure. The error number
1450     PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1451     it will never be used by PCRE itself.
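
As an illustration of the return value rules above, the following
is a minimal sketch of the shape a callout function might take. So
that it compiles without the PCRE headers, it declares a local
stand-in structure (demo_callout_block) holding the fields listed
earlier; a real program would take the pcre_callout_block declared
in pcre.h and install the function by assigning it to pcre_callout.
The names demo_callout and DEMO_ERROR_CALLOUT, the limit of 1000,
and the use of callout_data as a counter are assumptions made for
the example.

```c
#include <stddef.h>

/* Stand-in for pcre_callout_block, so that the sketch compiles on
   its own. The fields are those listed in the text. */
typedef struct {
  int version;
  int callout_number;
  int *offset_vector;
  const char *subject;
  int subject_length;
  int start_match;
  int current_position;
  int capture_top;
  int capture_last;
  void *callout_data;
} demo_callout_block;

/* Placeholder for PCRE_ERROR_CALLOUT. A real program would use the
   constant from pcre.h rather than this assumed value. */
#define DEMO_ERROR_CALLOUT (-9)

/* callout_data is assumed to point at an int counter owned by the
   caller. Returns 0 to continue, a positive value to fail at this
   point (backtracking still happens), or a negative value to
   abandon the whole match. */
int demo_callout(demo_callout_block *cb)
{
  int *calls = (int *)cb->callout_data;

  if (calls != NULL && ++(*calls) > 1000)
    return DEMO_ERROR_CALLOUT;      /* too many callouts: give up */

  if (cb->callout_number == 2 &&
      cb->current_position == cb->subject_length)
    return 1;                       /* reject this path only */

  return 0;                         /* carry on matching */
}
```

With a pattern such as (?C1)9abc(?C2)def, this function lets the
first callout point pass and rejects the second one only when it is
reached at the very end of the subject.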
1452    
1453     Last updated: 21 January 2003
1454     Copyright (c) 1997-2003 University of Cambridge.
1455     -----------------------------------------------------------------------------
1456    
1457     NAME
1458     PCRE - Perl-compatible regular expressions
1459    
1460    
1461 nigel 41 DIFFERENCES FROM PERL
1462    
1463 nigel 63 This document describes the differences in the ways that
1464     PCRE and Perl handle regular expressions. The differences
1465     described here are with respect to Perl 5.8.
1466 nigel 41
1467 nigel 63 1. PCRE does not allow repeat quantifiers on lookahead
1468 nigel 41 assertions. Perl permits them, but they do not mean what you
1469     might think. For example, (?!a){3} does not assert that the
1470     next three characters are not "a". It just asserts that the
1471     next character is not "a" three times.
1472    
1473 nigel 63 2. Capturing subpatterns that occur inside negative looka-
1474 nigel 41 head assertions are counted, but their entries in the
1475     offsets vector are never set. Perl sets its numerical vari-
1476     ables from any such patterns that are matched before the
1477     assertion fails to match something (thereby succeeding), but
1478     only if the negative lookahead assertion contains just one
1479     branch.
1480    
1481 nigel 63 3. Though binary zero characters are supported in the sub-
1482 nigel 41 ject string, they are not allowed in a pattern string
1483     because it is passed as a normal C string, terminated by
1484     zero. The escape sequence "\0" can be used in the pattern to
1485     represent a binary zero.
1486    
1487 nigel 63 4. The following Perl escape sequences are not supported:
1488     \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1489     mented by Perl's general string-handling and are not part of
1490     its pattern matching engine. If any of these are encountered
1491     by PCRE, an error is generated.
1492 nigel 41
1493 nigel 63 5. PCRE does support the \Q...\E escape for quoting sub-
1494     strings. Characters in between are treated as literals. This
1495     is slightly different from Perl in that $ and @ are also
1496     handled as literals inside the quotes. In Perl, they cause
1497     variable interpolation (but of course PCRE does not have
1498     variables). Note the following examples:
1499 nigel 41
1500 nigel 63   Pattern            PCRE matches   Perl matches
1501 nigel 49
1502 nigel 63   \Qabc$xyz\E        abc$xyz        abc followed by the
1503                                               contents of $xyz
1504            \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1505            \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1506 nigel 41
1507 nigel 63 In PCRE, the \Q...\E sequence is recognized both inside
1508     and outside character classes.
1509 nigel 41
1510 nigel 63 8. Fairly obviously, PCRE does not support the (?{code}) and
1511     (?p{code}) constructions. However, there is some experimen-
1512     tal support for recursive patterns using the non-Perl items
1513     (?R), (?number) and (?P>name). Also, the PCRE "callout"
1514     feature allows an external function to be called during pat-
1515     tern matching.
1516 nigel 41
1517 nigel 63 9. There are some differences that are concerned with the
1518     settings of captured strings when part of a pattern is
1519     repeated. For example, matching "aba" against the pattern
1520     /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
1521     to "b".
1522    
1523 nigel 41 10. PCRE provides some extensions to the Perl regular
1524     expression facilities:
1525    
1526     (a) Although lookbehind assertions must match fixed length
1527     strings, each alternative branch of a lookbehind assertion
1528 nigel 63 can match a different length of string. Perl requires them
1529     all to have the same length.
1530 nigel 41
1531     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1532 nigel 63 set, the $ meta-character matches only at the very end of
1533 nigel 41 the string.
1534    
1535     (c) If PCRE_EXTRA is set, a backslash followed by a letter
1536     with no special meaning is faulted.
1537    
1538 nigel 43 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1539     tion quantifiers is inverted, that is, by default they are
1540     not greedy, but if followed by a question mark they are.
1541 nigel 41
1542     (e) PCRE_ANCHORED can be used to force a pattern to be tried
1543 nigel 63 only at the first matching position in the subject string.
1544 nigel 41
1545 nigel 63 (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
1546     for pcre_exec(), and the PCRE_NO_AUTO_CAPTURE option for
1547     pcre_compile(), have no Perl equivalents.
1548 nigel 41
1549 nigel 63 (g) The (?R), (?number), and (?P>name) constructs allow for
1550     recursive pattern matching (Perl can do this using the
1551     (?p{code}) construct, which PCRE cannot support).
1552 nigel 41
1553 nigel 63 (h) PCRE supports named capturing substrings, using the
1554     Python syntax.
1555 nigel 41
1556 nigel 63 (i) PCRE supports the possessive quantifier "++" syntax,
1557     taken from Sun's Java package.
1558 nigel 43
1559 nigel 63 (j) The (R) condition, for testing recursion, is a PCRE
1560     extension.
1561    
1562     (k) The callout facility is PCRE-specific.
1563    
1564     Last updated: 03 February 2003
1565     Copyright (c) 1997-2003 University of Cambridge.
1566     -----------------------------------------------------------------------------
1567    
1568     NAME
1569     PCRE - Perl-compatible regular expressions
1570    
1571    
1572     PCRE REGULAR EXPRESSION DETAILS
1573    
1574 nigel 41 The syntax and semantics of the regular expressions sup-
1575     ported by PCRE are described below. Regular expressions are
1576     also described in the Perl documentation and in a number of
1577     other books, some of which have copious examples. Jeffrey
1578     Friedl's "Mastering Regular Expressions", published by
1579 nigel 63 O'Reilly, covers them in great detail. The description here
1580     is intended as reference documentation.
1581 nigel 49
1582     The basic operation of PCRE is on strings of bytes. However,
1583 nigel 63 there is also support for UTF-8 character strings. To use
1584     this support you must build PCRE to include UTF-8 support,
1585     and then call pcre_compile() with the PCRE_UTF8 option. How
1586     this affects the pattern matching is mentioned in several
1587     places below. There is also a summary of UTF-8 features in
1588     the section on UTF-8 support in the main pcre page.
1589 nigel 41
1590     A regular expression is a pattern that is matched against a
1591     subject string from left to right. Most characters stand for
1592     themselves in a pattern, and match the corresponding charac-
1593     ters in the subject. As a trivial example, the pattern
1594    
1595     The quick brown fox
1596    
1597     matches a portion of a subject string that is identical to
1598     itself. The power of regular expressions comes from the
1599     ability to include alternatives and repetitions in the pat-
1600     tern. These are encoded in the pattern by the use of meta-
1601     characters, which do not stand for themselves but instead
1602     are interpreted in some special way.
1603    
1604     There are two different sets of meta-characters: those that
1605     are recognized anywhere in the pattern except within square
1606     brackets, and those that are recognized in square brackets.
1607     Outside square brackets, the meta-characters are as follows:
1608    
1609     \ general escape character with several uses
1610 nigel 63 ^ assert start of string (or line, in multiline mode)
1611     $ assert end of string (or line, in multiline mode)
1612 nigel 41 . match any character except newline (by default)
1613     [ start character class definition
1614     | start of alternative branch
1615     ( start subpattern
1616     ) end subpattern
1617     ? extends the meaning of (
1618     also 0 or 1 quantifier
1619     also quantifier minimizer
1620     * 0 or more quantifier
1621     + 1 or more quantifier
1622 nigel 63 also "possessive quantifier"
1623 nigel 41 { start min/max quantifier
1624    
1625     Part of a pattern that is in square brackets is called a
1626     "character class". In a character class the only meta-
1627     characters are:
1628    
1629     \ general escape character
1630     ^ negate the class, but only if the first character
1631     - indicates character range
1632 nigel 63 [ POSIX character class (only if followed by POSIX
1633     syntax)
1634 nigel 41 ] terminates the character class
1635    
1636     The following sections describe the use of each of the
1637     meta-characters.
1638    
1639    
1640 nigel 63 BACKSLASH
1641 nigel 41
1642     The backslash character has several uses. Firstly, if it is
1643     followed by a non-alphameric character, it takes away any
1644     special meaning that character may have. This use of
1645     backslash as an escape character applies both inside and
1646     outside character classes.
1647    
1648 nigel 63 For example, if you want to match a * character, you write
1649     \* in the pattern. This escaping action applies whether or
1650     not the following character would otherwise be interpreted
1651     as a meta-character, so it is always safe to precede a non-
1652     alphameric with backslash to specify that it stands for
1653     itself. In particular, if you want to match a backslash, you
1654     write \\.
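
The rule above, that a backslash before any non-alphameric
character is always safe, suggests a simple way of turning
arbitrary text into a pattern that matches it literally. The
following is a minimal sketch and not part of PCRE; the function
name quote_literal is made up for the example, and it assumes
byte-oriented (non-UTF-8) text in the default C locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies src to dst, putting a backslash before every character
   that is not a letter or a digit. dst must have room for
   2*strlen(src)+1 bytes. Returns the length of the result. */
size_t quote_literal(const char *src, char *dst)
{
  size_t n = 0;

  for (; *src != '\0'; src++) {
    if (!isalnum((unsigned char)*src))
      dst[n++] = '\\';
    dst[n++] = *src;
  }
  dst[n] = '\0';
  return n;
}
```

For example, the text a*b is turned into a\*b, and a single
backslash is turned into \\.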
1655 nigel 41
1656     If a pattern is compiled with the PCRE_EXTENDED option, whi-
1657     tespace in the pattern (other than in a character class) and
1658 nigel 63 characters between a # outside a character class and the
1659 nigel 41 next newline character are ignored. An escaping backslash
1660 nigel 63 can be used to include a whitespace or # character as part
1661 nigel 41 of the pattern.
1662    
1663 nigel 63 If you want to remove the special meaning from a sequence of
1664     characters, you can do so by putting them between \Q and \E.
1665     This is different from Perl in that $ and @ are handled as
1666     literals in \Q...\E sequences in PCRE, whereas in Perl, $
1667     and @ cause variable interpolation. Note the following exam-
1668     ples:
1669    
1670     Pattern            PCRE matches   Perl matches
1671    
1672     \Qabc$xyz\E        abc$xyz        abc followed by the
1674                                       contents of $xyz
1675     \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1676     \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1677    
1678     The \Q...\E sequence is recognized both inside and outside
1679     character classes.
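
The PCRE column of the table above can be reproduced by a small
pre-processing step that rewrites each \Q...\E run as individually
escaped characters. This is only a sketch of the idea, not the way
PCRE itself is implemented; the name expand_quotes is hypothetical,
and treating an unterminated \Q as running to the end of the
pattern is an assumption of the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Rewrites src into dst with every \Q...\E run replaced by the same
   characters, each non-alphameric one preceded by a backslash.
   dst must have room for 2*strlen(src)+1 bytes. Returns the length
   of the result. */
size_t expand_quotes(const char *src, char *dst)
{
  size_t n = 0;
  int quoting = 0;

  while (*src != '\0') {
    if (quoting) {
      if (src[0] == '\\' && src[1] == 'E') {  /* end of quoted run */
        quoting = 0;
        src += 2;
        continue;
      }
      if (!isalnum((unsigned char)*src))
        dst[n++] = '\\';                      /* force a literal */
      dst[n++] = *src++;
    } else if (src[0] == '\\' && src[1] == 'Q') {
      quoting = 1;
      src += 2;
    } else if (src[0] == '\\' && src[1] != '\0') {
      dst[n++] = *src++;                      /* keep other escapes */
      dst[n++] = *src++;
    } else {
      dst[n++] = *src++;
    }
  }
  dst[n] = '\0';
  return n;
}
```

Applied to the three patterns in the table, the function yields
abc\$xyz, abc\\\$xyz, and abc\$xyz respectively, which match the
strings shown in the "PCRE matches" column.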
1680    
1681 nigel 41 A second use of backslash provides a way of encoding non-
1682     printing characters in patterns in a visible manner. There
1683     is no restriction on the appearance of non-printing charac-
1684     ters, apart from the binary zero that terminates a pattern,
1685     but when a pattern is being prepared by text editing, it is
1686     usually easier to use one of the following escape sequences
1687     than the binary character it represents:
1688    
1689 nigel 63 \a alarm, that is, the BEL character (hex 07)
1690     \cx "control-x", where x is any character
1691     \e escape (hex 1B)
1692     \f formfeed (hex 0C)
1693     \n newline (hex 0A)
1694     \r carriage return (hex 0D)
1695     \t tab (hex 09)
1696     \ddd character with octal code ddd, or backreference
1697     \xhh character with hex code hh
1698     \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1699 nigel 41
1700 nigel 63 The precise effect of \cx is as follows: if x is a lower
1701 nigel 41 case letter, it is converted to upper case. Then bit 6 of
1702 nigel 63 the character (hex 40) is inverted. Thus \cz becomes hex
1703     1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
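
The \cx rule just described amounts to two operations on the
character code, which can be sketched as follows. The function name
control_escape is made up for the example, and x is assumed to be a
byte value in the default C locale.

```c
#include <ctype.h>

/* Returns the character code produced by \cx: lower case letters
   are first converted to upper case, then bit 6 (hex 40) is
   inverted. */
int control_escape(int x)
{
  if (islower(x))
    x = toupper(x);
  return x ^ 0x40;
}
```

This gives hex 1A for z, hex 3B for {, and hex 7B for ;, agreeing
with the values quoted above.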
1704 nigel 41
1705 nigel 63 After \x, from zero to two hexadecimal digits are read
1706     (letters can be in upper or lower case). In UTF-8 mode, any
1707     number of hexadecimal digits may appear between \x{ and },
1708     but the value of the character code must be less than 2**31
1709     (that is, the maximum hexadecimal value is 7FFFFFFF). If
1710     characters other than hexadecimal digits appear between \x{
1711     and }, or if there is no terminating }, this form of escape
1712     is not recognized. Instead, the initial \x will be inter-
1713     preted as a basic hexadecimal escape, with no following
1714     digits, giving a byte whose value is zero.
1715 nigel 41
1716 nigel 63 Characters whose value is less than 256 can be defined by
1717     either of the two syntaxes for \x when PCRE is in UTF-8
1718     mode. There is no difference in the way they are handled.
1719     For example, \xdc is exactly the same as \x{dc}.
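
The two forms of \x, including the fall-back when \x{ is not
properly terminated, can be sketched as below. This is an
illustration of the rules as stated here rather than PCRE's own
code: the name parse_x_escape and its interface are assumptions,
and the check that the value is less than 2**31 is left out.

```c
#include <ctype.h>
#include <stddef.h>

static int hex_value(int c)
{
  if (isdigit(c)) return c - '0';
  return tolower(c) - 'a' + 10;
}

/* p points just after "\x". Stores the character value in *value
   and returns the number of pattern characters used after "\x". */
size_t parse_x_escape(const char *p, int utf8, unsigned long *value)
{
  size_t i;

  if (utf8 && p[0] == '{') {
    unsigned long v = 0;
    for (i = 1; isxdigit((unsigned char)p[i]); i++)
      v = v * 16 + (unsigned long)hex_value((unsigned char)p[i]);
    if (p[i] == '}') {
      *value = v;
      return i + 1;                 /* digits plus both braces */
    }
    /* Not a well-formed \x{...}: fall back to the basic form. */
  }

  *value = 0;
  for (i = 0; i < 2 && isxdigit((unsigned char)p[i]); i++)
    *value = *value * 16 + (unsigned long)hex_value((unsigned char)p[i]);
  return i;
}
```

In UTF-8 mode, both dc and {dc} give the value hex DC, whereas
{1g} is not recognized and yields a zero byte with no characters
used, leaving {1g} to be read as ordinary pattern text.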
1720    
1721     After \0 up to two further octal digits are read. In both
1722 nigel 41 cases, if there are fewer than two digits, just those that
1723 nigel 63 are present are used. Thus the sequence \0\x\07 specifies
1724     two binary zeros followed by a BEL character (code value 7).
1725     Make sure you supply two digits after the initial zero if
1726     the character that follows is itself an octal digit.
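
A sketch of the \0 rule, assuming a hypothetical helper named
parse_zero_octal that is handed the text following "\0":

```c
#include <stddef.h>

/* p points just after "\0". Reads up to two further octal digits,
   stores the resulting value in *value, and returns the number of
   digits used. */
size_t parse_zero_octal(const char *p, int *value)
{
  size_t i;

  *value = 0;
  for (i = 0; i < 2 && p[i] >= '0' && p[i] <= '7'; i++)
    *value = *value * 8 + (p[i] - '0');
  return i;
}
```

Given 7 it uses one digit and yields BEL (code value 7); given 113
it uses only the first two digits and yields a tab (code value 9),
leaving the 3 to stand for itself.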
1727 nigel 41
1728     The handling of a backslash followed by a digit other than 0
1729     is complicated. Outside a character class, PCRE reads it
1730     and any following digits as a decimal number. If the number
1731     is less than 10, or if there have been at least that many
1732     previous capturing left parentheses in the expression, the
1733     entire sequence is taken as a back reference. A description
1734     of how this works is given later, following the discussion
1735     of parenthesized subpatterns.
1736    
1737     Inside a character class, or if the decimal number is
1738     greater than 9 and there have not been that many capturing
1739     subpatterns, PCRE re-reads up to three octal digits follow-
1740     ing the backslash, and generates a single byte from the
1741     least significant 8 bits of the value. Any subsequent digits
1742     stand for themselves. For example:
1743    
1744     \040 is another way of writing a space
1745     \40 is the same, provided there are fewer than 40
1746     previous capturing subpatterns
1747     \7 is always a back reference
1748     \11 might be a back reference, or another way of
1749     writing a tab
1750     \011 is always a tab
1751     \0113 is a tab followed by the character "3"
1752 nigel 63 \113 might be a back reference, otherwise the
1753     character with octal code 113
1754     \377 might be a back reference, otherwise
1755     the byte consisting entirely of 1 bits
1756 nigel 41 \81 is either a back reference, or a binary zero
1757     followed by the two characters "8" and "1"
1758    
1759     Note that octal values of 100 or greater must not be intro-
1760     duced by a leading zero, because no more than three octal
1761     digits are ever read.
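
The decision between a back reference and an octal byte value can
be sketched as follows. The structure, the function name
parse_digit_escape, and its arguments are assumptions made for the
example; it follows the rules as stated above and is not taken from
PCRE's source.

```c
#include <stddef.h>

typedef struct {
  int is_backref;    /* 1: back reference, 0: single byte value */
  int value;         /* group number or byte value */
  size_t consumed;   /* pattern characters used after the backslash */
} digit_escape;

/* p points at the first digit after the backslash, which is assumed
   to be in the range 1-9. groups is the number of capturing left
   parentheses seen so far in the pattern. */
digit_escape parse_digit_escape(const char *p, int groups, int in_class)
{
  digit_escape r = {0, 0, 0};
  size_t i;
  int n = 0;

  if (!in_class) {
    for (i = 0; p[i] >= '0' && p[i] <= '9'; i++)
      if (n < 100000)                 /* saturate, avoiding overflow */
        n = n * 10 + (p[i] - '0');
    if (n < 10 || n <= groups) {
      r.is_backref = 1;
      r.value = n;
      r.consumed = i;
      return r;
    }
  }

  /* Re-read up to three octal digits; keep the low 8 bits. */
  n = 0;
  for (i = 0; i < 3 && p[i] >= '0' && p[i] <= '7'; i++)
    n = n * 8 + (p[i] - '0');
  r.value = n & 0xFF;
  r.consumed = i;
  return r;
}
```

With no capturing subpatterns so far, 40 yields the byte value 32
(a space), 7 yields a back reference, and 81 yields a binary zero
with no digits used, so that the 8 and the 1 stand for themselves.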
1762 nigel 43
1763 nigel 63 All the sequences that define a single byte value or a sin-
1764     gle UTF-8 character (in UTF-8 mode) can be used both inside
1765     and outside character classes. In addition, inside a charac-
1766     ter class, the sequence \b is interpreted as the backspace
1767     character (hex 08). Outside a character class it has a dif-
1768     ferent meaning (see below).
1769 nigel 41
1770     The third use of backslash is for specifying generic charac-
1771     ter types:
1772    
1773     \d any decimal digit
1774     \D any character that is not a decimal digit
1775     \s any whitespace character
1776     \S any character that is not a whitespace character
1777     \w any "word" character
1778 nigel 63 \W any "non-word" character
1779 nigel 41
1780     Each pair of escape sequences partitions the complete set of
1781     characters into two disjoint sets. Any given character
1782     matches one, and only one, of each pair.
1783    
1784 nigel 63 In UTF-8 mode, characters with values greater than 255 never
1785     match \d, \s, or \w, and always match \D, \S, and \W.
1786    
1787     For compatibility with Perl, \s does not match the VT char-
1788     acter (code 11). This makes it different from the POSIX
1789     "space" class. The \s characters are HT (9), LF (10), FF
1790     (12), CR (13), and space (32).
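
The difference from the POSIX "space" class can be checked against
the C library's isspace(), which in the default "C" locale accepts
the POSIX set. The sketch below assumes that locale and considers
byte values only; both function names are made up for the example.

```c
#include <ctype.h>

/* The characters matched by \s as listed above:
   HT (9), LF (10), FF (12), CR (13), and space (32). */
int matches_backslash_s(int c)
{
  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Returns 1 if \s and isspace() disagree about c. */
int differs_from_isspace(int c)
{
  return (matches_backslash_s(c) != 0) != (isspace(c) != 0);
}
```

For codes 0 to 127 in the "C" locale, the two disagree only for VT
(code 11).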
1791    
1792 nigel 41 A "word" character is any letter or digit or the underscore
1793     character, that is, any character which can be part of a
1794     Perl "word". The definition of letters and digits is con-
1795     trolled by PCRE's character tables, and may vary if locale-
1796 nigel 63 specific matching is taking place (see "Locale support" in
1797     the pcreapi page). For example, in the "fr" (French) locale,
1798     some character codes greater than 128 are used for accented
1799     letters, and these are matched by \w.
1800 nigel 41
1801     These character type sequences can appear both inside and
1802     outside character classes. They each match one character of
1803     the appropriate type. If the current matching point is at
1804     the end of the subject string, all of them fail, since there
1805     is no character to match.
1806    
1807     The fourth use of backslash is for certain simple asser-
1808     tions. An assertion specifies a condition that has to be met
1809     at a particular point in a match, without consuming any
1810     characters from the subject string. The use of subpatterns
1811     for more complicated assertions is described below. The
1812     backslashed assertions are
1813    
1814 nigel 63 \b matches at a word boundary
1815     \B matches when not at a word boundary
1816     \A matches at start of subject
1817     \Z matches at end of subject or before newline at end
1818     \z matches at end of subject
1819     \G matches at first matching position in subject
1820 nigel 41
1821     These assertions may not appear in character classes (but
1822 nigel 63 note that \b has a different meaning, namely the backspace
1823 nigel 41 character, inside a character class).
1824 nigel 43
1825 nigel 41 A word boundary is a position in the subject string where
1826     the current character and the previous character do not both
1827     match \w or \W (i.e. one matches \w and the other matches
1828     \W), or the start or end of the string if the first or last
1829     character matches \w, respectively.
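
The definition of a word boundary can be expressed compactly by
treating the positions before the start and after the end of the
subject as if they held non-word characters. The following sketch
assumes the default C locale tables; the function names are
hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word_char(int c)
{
  return isalnum(c) || c == '_';
}

/* Returns 1 if \b would be true at offset pos in the subject s of
   length len (0 <= pos <= len), and 0 if \B would be true. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
  int prev = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
  int cur  = pos < len && is_word_char((unsigned char)s[pos]);
  return prev != cur;
}
```

For the subject "ab cd", the boundaries are at offsets 0, 2, 3, and
5.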
1830     The \A, \Z, and \z assertions differ from the traditional
1831     circumflex and dollar (described below) in that they only
1832     ever match at the very start and end of the subject string,
1833 nigel 63 whatever options are set. Thus, they are independent of mul-
1834     tiline mode.
1835    
1836     They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1837     options. If the startoffset argument of pcre_exec() is non-
1838     zero, indicating that matching is to start at a point other
1839     than the beginning of the subject, \A can never match. The
1840 nigel 41 difference between \Z and \z is that \Z matches before a
1841     newline that is the last character of the string as well as
1842     at the end of the string, whereas \z matches only at the
1843     end.
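
The difference between \Z and \z comes down to one extra case, as
this sketch shows. The function names are made up for the example;
pos is the current matching offset and len the subject length.

```c
#include <stddef.h>

/* \z: true only at the very end of the subject. */
int at_small_z(size_t len, size_t pos)
{
  return pos == len;
}

/* \Z: true at the very end, and also just before a newline that is
   the last character of the subject. */
int at_big_z(const char *s, size_t len, size_t pos)
{
  return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

For the subject "abc\n" (length 4), \Z is true at offsets 3 and 4,
but \z is true only at offset 4.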
1844    
1845 nigel 63 The \G assertion is true only when the current matching
1846     position is at the start point of the match, as specified by
1847     the startoffset argument of pcre_exec(). It differs from \A
1848     when the value of startoffset is non-zero. By calling
1849     pcre_exec() multiple times with appropriate arguments, you
1850     can mimic Perl's /g option, and it is in this kind of imple-
1851     mentation where \G can be useful.
1852 nigel 41
1853 nigel 63 Note, however, that PCRE's interpretation of \G, as the
1854     start of the current match, is subtly different from Perl's,
1855     which defines it as the end of the previous match. In Perl,
1856     these can be different when the previously matched string
1857     was empty. Because PCRE does just one match at a time, it
1858     cannot reproduce this behaviour.
1859 nigel 41
1860 nigel 63 If all the alternatives of a pattern begin with \G, the
1861     expression is anchored to the starting match position, and
1862     the "anchored" flag is set in the compiled regular expres-
1863     sion.
1864    
1865    
1866 nigel 41 CIRCUMFLEX AND DOLLAR
1867 nigel 63
1868 nigel 41 Outside a character class, in the default matching mode, the
1869     circumflex character is an assertion which is true only if
1870     the current matching point is at the start of the subject
1871     string. If the startoffset argument of pcre_exec() is non-
1872 nigel 63 zero, circumflex can never match if the PCRE_MULTILINE
1873     option is unset. Inside a character class, circumflex has an
1874     entirely different meaning (see below).
1875 nigel 41
1876     Circumflex need not be the first character of the pattern if
1877     a number of alternatives are involved, but it should be the
1878     first thing in each alternative in which it appears if the
1879     pattern is ever to match that branch. If all possible alter-
1880     natives start with a circumflex, that is, if the pattern is
1881     constrained to match only at the start of the subject, it is
1882     said to be an "anchored" pattern. (There are also other con-
1883     structs that can cause a pattern to be anchored.)
1884    
1885     A dollar character is an assertion which is true only if the
1886     current matching point is at the end of the subject string,
1887     or immediately before a newline character that is the last
1888     character in the string (by default). Dollar need not be the
1889     last character of the pattern if a number of alternatives
1890     are involved, but it should be the last item in any branch
1891     in which it appears. Dollar has no special meaning in a
1892     character class.
1893    
1894     The meaning of dollar can be changed so that it matches only
1895     at the very end of the string, by setting the
1896 nigel 63 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1897     affect the \Z assertion.
1898 nigel 41
1899     The meanings of the circumflex and dollar characters are
1900     changed if the PCRE_MULTILINE option is set. When this is
1901     the case, they match immediately after and immediately
1902 nigel 63 before an internal newline character, respectively, in addi-
1903     tion to matching at the start and end of the subject string.
1904     For example, the pattern /^abc$/ matches the subject string
1905 nigel 41 "def\nabc" in multiline mode, but not otherwise. Conse-
1906     quently, patterns that are anchored in single line mode
1907 nigel 63 because all branches start with ^ are not anchored in multi-
1908     line mode, and a match for circumflex is possible when the
1909 nigel 41 startoffset argument of pcre_exec() is non-zero. The
1910     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1911     set.
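
The way the options change circumflex and dollar can be summarized
in two small predicates. This is a sketch of the rules as described
in this section, with hypothetical function names; the multiline
and dollar_endonly arguments stand for PCRE_MULTILINE and
PCRE_DOLLAR_ENDONLY. Reading "internal newline" as excluding a
newline that ends the subject is an assumption of the sketch.

```c
#include <stddef.h>

/* Would ^ match at offset pos of subject s (length len)? */
int circumflex_matches(const char *s, size_t len, size_t pos,
                       int multiline)
{
  if (pos == 0) return 1;
  return multiline && pos < len && s[pos - 1] == '\n';
}

/* Would $ match at offset pos of subject s (length len)? */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
  if (pos == len) return 1;
  if (multiline) return s[pos] == '\n';   /* dollar_endonly ignored */
  if (dollar_endonly) return 0;
  return pos + 1 == len && s[pos] == '\n';
}
```

For the subject "def\nabc", circumflex matches at offset 4 only in
multiline mode, which is why /^abc$/ matches that subject in
multiline mode but not otherwise.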
1912    
1913     Note that the sequences \A, \Z, and \z can be used to match
1914     the start and end of the subject in both modes, and if all
1915 nigel 53 branches of a pattern start with \A it is always anchored,
1916 nigel 41 whether PCRE_MULTILINE is set or not.
1917    
1918    
1919 nigel 63 FULL STOP (PERIOD, DOT)
1920 nigel 41
1921     Outside a character class, a dot in the pattern matches any
1922     one character in the subject, including a non-printing char-
1923 nigel 63 acter, but not (by default) newline. In UTF-8 mode, a dot
1924     matches any UTF-8 character, which might be more than one
1925     byte long, except (by default) for newline. If the
1926     PCRE_DOTALL option is set, dots match newlines as well. The
1927     handling of dot is entirely independent of the handling of
1928     circumflex and dollar, the only relationship being that they
1929     both involve newline characters. Dot has no special meaning
1930     in a character class.
1931 nigel 41
1932    
1933    
1934 nigel 63 MATCHING A SINGLE BYTE
1935    
1936     Outside a character class, the escape sequence \C matches
1937     any one byte, both in and out of UTF-8 mode. Unlike a dot,
1938     it always matches a newline. The feature is provided in Perl
1939     in order to match individual bytes in UTF-8 mode. Because
1940     it breaks up UTF-8 characters into individual bytes, what
1941     remains in the string may be a malformed UTF-8 string. For
1942     this reason it is best avoided.
1943    
1944     PCRE does not allow \C to appear in lookbehind assertions
1945     (see below), because in UTF-8 mode it makes it impossible to
1946     calculate the length of the lookbehind.
1947    
1948    
1949 nigel 41 SQUARE BRACKETS
1950 nigel 63
1951 nigel 41 An opening square bracket introduces a character class, ter-
1952     minated by a closing square bracket. A closing square
1953     bracket on its own is not special. If a closing square
1954     bracket is required as a member of the class, it should be
1955     the first data character in the class (after an initial cir-
1956     cumflex, if present) or escaped with a backslash.
1957    
1958 nigel 63 A character class matches a single character in the subject.
1959     In UTF-8 mode, the character may occupy more than one byte.
1960     A matched character must be in the set of characters defined
1961     by the class, unless the first character in the class defin-
1962     ition is a circumflex, in which case the subject character
1963     must not be in the set defined by the class. If a circumflex
1964     is actually required as a member of the class, ensure it is
1965     not the first character, or escape it with a backslash.
1966 nigel 41
1967     For example, the character class [aeiou] matches any lower
1968     case vowel, while [^aeiou] matches any character that is not
1969     a lower case vowel. Note that a circumflex is just a con-
1970     venient notation for specifying the characters which are in
1971     the class by enumerating those that are not. It is not an
1972     assertion: it still consumes a character from the subject
1973     string, and fails if the current pointer is at the end of
1974     the string.
1975    
1976 nigel 63 In UTF-8 mode, characters with values greater than 255 can
1977     be included in a class as a literal string of bytes, or by
1978     using the \x{ escaping mechanism.
1979    
1980 nigel 41 When caseless matching is set, any letters in a class
1981     represent both their upper case and lower case versions, so
1982     for example, a caseless [aeiou] matches "A" as well as "a",
1983     and a caseless [^aeiou] does not match "A", whereas a case-
1984 nigel 63 ful version would. PCRE does not support the concept of case
1985     for characters with values greater than 255.
1986 nigel 41 The newline character is never treated in any special way in
1987     character classes, whatever the setting of the PCRE_DOTALL
1988     or PCRE_MULTILINE options is. A class such as [^a] will
1989     always match a newline.
1990    
1991     The minus (hyphen) character can be used to specify a range
1992     of characters in a character class. For example, [d-m]
1993     matches any letter between d and m, inclusive. If a minus
1994     character is required in a class, it must be escaped with a
1995     backslash or appear in a position where it cannot be inter-
1996     preted as indicating a range, typically as the first or last
1997     character in the class.
1998    
1999     It is not possible to have the literal character "]" as the
2000     end character of a range. A pattern such as [W-]46] is
2001     interpreted as a class of two characters ("W" and "-") fol-
2002     lowed by a literal string "46]", so it would match "W46]" or
2003     "-46]". However, if the "]" is escaped with a backslash it
2004     is interpreted as the end of range, so [W-\]46] is inter-
2005     preted as a single class containing a range followed by two
2006     separate characters. The octal or hexadecimal representation
2007     of "]" can also be used to end a range.
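
The rules for circumflex, closing square bracket, and minus inside
a class can be illustrated with a small matcher. It is a sketch
only: it handles literal characters, backslash-escaped literals,
ranges, and a leading circumflex, but none of the other escapes or
POSIX names, and the function name class_matches is made up for the
example.

```c
#include <stddef.h>

/* cls points just after the opening '['. Returns 1 if c is matched
   by the class and 0 if not, setting *end to the character after
   the closing ']'. Returns -1 if the class is not terminated. */
int class_matches(const char *cls, unsigned char c, const char **end)
{
  const char *p = cls;
  int negate = 0, found = 0, first = 1;

  if (*p == '^') { negate = 1; p++; }

  for (;;) {
    unsigned char lo, hi;

    if (*p == '\0') return -1;
    if (*p == ']' && !first) break;        /* closing bracket */
    first = 0;

    if (*p == '\\' && p[1] != '\0') p++;   /* escaped literal */
    lo = hi = (unsigned char)*p++;

    /* A '-' starts a range unless a bare ']' follows it. */
    if (p[0] == '-' && p[1] != ']' && p[1] != '\0') {
      p++;
      if (*p == '\\' && p[1] != '\0') p++;
      hi = (unsigned char)*p++;
    }
    if (c >= lo && c <= hi) found = 1;
  }
  *end = p + 1;
  return found != negate;
}
```

Given the text W-]46] the class ends after the minus, leaving 46]
as literal text, whereas W-\]46] is consumed as a single class that
contains the range from W to ] and the characters 4 and 6.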
2008    
2009 nigel 63 Ranges operate in the collating sequence of character
2010     values. They can also be used for characters specified
2011     numerically, for example [\000-\037]. In UTF-8 mode, ranges
2012     can include characters whose values are greater than 255,
2013     for example [\x{100}-\x{2ff}].
2014 nigel 41
2015 nigel 63 If a range that includes letters is used when caseless
2016     matching is set, it matches the letters in either case. For
2017     example, [W-c] is equivalent to [][\^_`wxyzabc], matched
2018     caselessly, and if character tables for the "fr" locale are
2019     in use, [\xc8-\xcb] matches accented E characters in both
2020     cases.
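
The effect of caseless matching on a range can be sketched by
comparing case-folded values for every character in the range. The
function name is hypothetical, and the default C locale tables are
assumed.

```c
#include <ctype.h>

/* Returns 1 if c is matched by the range lo-hi under caseless
   matching, that is, if some character in the range equals c when
   case is ignored. */
int in_range_caseless(unsigned char lo, unsigned char hi,
                      unsigned char c)
{
  int ch;

  for (ch = lo; ch <= hi; ch++)
    if (tolower(ch) == tolower(c))
      return 1;
  return 0;
}
```

For the range W-c this accepts w, A, C, and ], but not d or V,
in line with the equivalent class given above.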
2021    
2022 nigel 41 The character types \d, \D, \s, \S, \w, and \W may also
2023     appear in a character class, and add the characters that
2024     they match to the class. For example, [\dABCDEF] matches any
2025     hexadecimal digit. A circumflex can conveniently be used
2026     with the upper case character types to specify a more res-
2027     tricted set of characters than the matching lower case type.
2028     For example, the class [^\W_] matches any letter or digit,
2029     but not underscore.
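A minimal sketch of both classes, assuming Python's re module as a
stand-in for PCRE. The re.ASCII flag is an assumption added so that \d
and \W are limited to ASCII, which is closer to PCRE's default
byte-oriented behaviour.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Lower case a-f are not in the first class, because only the
# upper case letters were listed.
hex_hits = [c for c in "09AFGaf" if hex_digit.fullmatch(c)]
# The negated class excludes non-word characters and the underscore.
word_hits = [c for c in "a9_-Z" if letter_or_digit.fullmatch(c)]
```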
2030    
2031     All non-alphanumeric characters other than \, -, ^ (at the
2032     start) and the terminating ] are non-special in character
2033     classes, but it does no harm if they are escaped.
2034    
2035    
2036 nigel 43 POSIX CHARACTER CLASSES
2037    
2038 nigel 63 Perl supports the POSIX notation for character classes,
2039     which uses names enclosed by [: and :] within the enclosing
2040     square brackets. PCRE also supports this notation. For exam-
2041     ple,
2042    
2043 nigel 43 [01[:alpha:]%]
2044    
2045     matches "0", "1", any alphabetic character, or "%". The sup-
2046     ported class names are
2047    
2048     alnum    letters and digits
2049     alpha    letters
2050     ascii    character codes 0 - 127
2051 nigel 63 blank    space or tab only
2052 nigel 43 cntrl    control characters
2053     digit    decimal digits (same as \d)
2054     graph    printing characters, excluding space
2055     lower    lower case letters
2056     print    printing characters, including space
2057     punct    printing characters, excluding letters and digits
2058 nigel 63 space    white space (not quite the same as \s)
2059 nigel 43 upper    upper case letters
2060     word     "word" characters (same as \w)
2061     xdigit   hexadecimal digits
2062    
2063 nigel 63 The "space" characters are HT (9), LF (10), VT (11), FF
2064     (12), CR (13), and space (32). Notice that this list
2065     includes the VT character (code 11). This makes "space" dif-
2066     ferent to \s, which does not include VT (for Perl compati-
2067     bility).
2068 nigel 43
2069 nigel 63 The name "word" is a Perl extension, and "blank" is a GNU
2070     extension from Perl 5.8. Another Perl extension is negation,
2071     which is indicated by a ^ character after the colon. For
2072     example,
2073    
2074 nigel 43 [12[:^digit:]]
2075    
2076     matches "1", "2", or any non-digit. PCRE (and Perl) also
2077 nigel 53 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2078 nigel 43 "collating element", but these are not supported, and an
2079     error is given if they are encountered.
2080    
2081 nigel 63 In UTF-8 mode, characters with values greater than 255 do
2082     not match any of the POSIX character classes.
2083 nigel 43
2084    
2085 nigel 41 VERTICAL BAR
2086 nigel 63
2087 nigel 41 Vertical bar characters are used to separate alternative
2088     patterns. For example, the pattern
2089    
2090     gilbert|sullivan
2091    
2092     matches either "gilbert" or "sullivan". Any number of alter-
2093     natives may appear, and an empty alternative is permitted
2094     (matching the empty string). The matching process tries
2095     each alternative in turn, from left to right, and the first
2096     one that succeeds is used. If the alternatives are within a
2097     subpattern (defined below), "succeeds" means matching the
2098     rest of the main pattern as well as the alternative in the
2099     subpattern.
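The left-to-right rule is easy to observe in practice. The following
is a small sketch using Python's re module, assumed here only because
it follows the same ordered-alternation rule; it is not the PCRE API.

```python
import re

either = re.compile(r"gilbert|sullivan")
names = [m.group() for m in either.finditer("gilbert and sullivan")]

# "a" is tried first and succeeds, so "ab" is never attempted.
leftmost = re.match(r"a|ab", "ab").group()

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|", "") is not None

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c" here, so "ab" is used instead.
inner = re.fullmatch(r"(a|ab)c", "abc").group(1)
```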
2100    
2101    
2102     INTERNAL OPTION SETTING
2103    
2104 nigel 63 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2105     PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2106     within the pattern by a sequence of Perl option letters
2107     enclosed between "(?" and ")". The option letters are
2108    
2109 nigel 41 i for PCRE_CASELESS
2110     m for PCRE_MULTILINE
2111     s for PCRE_DOTALL
2112     x for PCRE_EXTENDED
2113    
2114     For example, (?im) sets caseless, multiline matching. It is
2115     also possible to unset these options by preceding the letter
2116     with a hyphen, and a combined setting and unsetting such as
2117     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2118     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2119     If a letter appears both before and after the hyphen, the
2120     option is unset.
2121    
2122 nigel 63 When an option change occurs at top level (that is, not
2123     inside subpattern parentheses), the change applies to the
2124     remainder of the pattern that follows. If the change is
2125     placed right at the start of a pattern, PCRE extracts it
2126     into the global options (and it will therefore show up in
2127     data extracted by the pcre_fullinfo() function).
2128 nigel 41
2129 nigel 63 An option change within a subpattern affects only that part
2130     of the current pattern that follows it, so
2131 nigel 41
2132     (a(?i)b)c
2133    
2134     matches abc and aBc and no other strings (assuming
2135     PCRE_CASELESS is not used). By this means, options can be
2136     made to have different settings in different parts of the
2137     pattern. Any changes made in one alternative do carry on
2138     into subsequent branches within the same subpattern. For
2139     example,
2140    
2141     (a(?i)b|c)
2142    
2143     matches "ab", "aB", "c", and "C", even though when matching
2144     "C" the first branch is abandoned before the option setting.
2145     This is because the effects of option settings happen at
2146     compile time. There would be some very weird behaviour oth-
2147     erwise.
2148    
2149     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2150     be changed in the same way as the Perl-compatible options by
2151     using the characters U and X respectively. The (?X) flag
2152     setting is special in that it must always occur earlier in
2153     the pattern than any of the additional features it turns on,
2154     even when it is at top level. It is best put at the start.
2155    
2156    
2157 nigel 63 SUBPATTERNS
2158 nigel 41
2159     Subpatterns are delimited by parentheses (round brackets),
2160     which can be nested. Marking part of a pattern as a subpat-
2161     tern does two things:
2162    
2163     1. It localizes a set of alternatives. For example, the pat-
2164     tern
2165    
2166     cat(aract|erpillar|)
2167    
2168     matches one of the words "cat", "cataract", or "caterpil-
2169     lar". Without the parentheses, it would match "cataract",
2170     "erpillar" or the empty string.
2171    
2172     2. It sets up the subpattern as a capturing subpattern (as
2173     defined above). When the whole pattern matches, that por-
2174     tion of the subject string that matched the subpattern is
2175     passed back to the caller via the ovector argument of
2176     pcre_exec(). Opening parentheses are counted from left to
2177     right (starting from 1) to obtain the numbers of the captur-
2178     ing subpatterns.
2179    
2180     For example, if the string "the red king" is matched against
2181     the pattern
2182    
2183     the ((red|white) (king|queen))
2184    
2185     the captured substrings are "red king", "red", and "king",
2186 nigel 53 and are numbered 1, 2, and 3, respectively.
2187 nigel 41
2188     The fact that plain parentheses fulfil two functions is not
2189     always helpful. There are often times when a grouping sub-
2190     pattern is required without a capturing requirement. If an
2191 nigel 63 opening parenthesis is followed by a question mark and a
2192     colon, the subpattern does not do any capturing, and is not
2193     counted when computing the number of any subsequent captur-
2194     ing subpatterns. For example, if the string "the white
2195     queen" is matched against the pattern
2196 nigel 41
2197     the ((?:red|white) (king|queen))
2198    
2199     the captured substrings are "white queen" and "queen", and
2200 nigel 63 are numbered 1 and 2. The maximum number of capturing sub-
2201     patterns is 65535, and the maximum depth of nesting of all
2202     subpatterns, both capturing and non-capturing, is 200.
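The numbering rules can be tried out with the patterns from this
section. This is a sketch under the assumption that Python's re module
is an acceptable stand-in; it numbers groups in the same way, but it is
a different engine and does not use ovector or pcre_exec().

```python
import re

capturing = re.compile(r"the ((red|white) (king|queen))")
non_capturing = re.compile(r"the ((?:red|white) (king|queen))")

# Groups are numbered by the position of their opening parenthesis.
red_groups = capturing.fullmatch("the red king").groups()
# (?: ... ) still groups, but is skipped when numbering.
white_groups = non_capturing.fullmatch("the white queen").groups()

# Parentheses also confine the alternatives to the part after "cat".
cat_words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", w)]
```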
2203 nigel 41
2204     As a convenient shorthand, if any option settings are
2205     required at the start of a non-capturing subpattern, the
2206     option letters may appear between the "?" and the ":". Thus
2207     the two patterns
2208    
2209     (?i:saturday|sunday)
2210     (?:(?i)saturday|sunday)
2211    
2212     match exactly the same set of strings. Because alternative
2213     branches are tried from left to right, and options are not
2214     reset until the end of the subpattern is reached, an option
2215     setting in one branch does affect subsequent branches, so
2216     the above patterns match "SUNDAY" as well as "Saturday".
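A short sketch of the (?i:...) shorthand, assuming Python's re module
for illustration. Only the scoped form is shown, because recent Python
versions reject an unscoped (?i) that is not at the very start of a
pattern; in that respect Python differs from PCRE.

```python
import re

def full(pattern, subject):
    """Hypothetical helper: True if the whole subject matches."""
    return re.fullmatch(pattern, subject) is not None

weekend = r"(?i:saturday|sunday)"
# The option applies to both branches of the group.
weekend_hits = [w for w in ("Saturday", "SUNDAY", "monday")
                if full(weekend, w)]

# Only the "b" is matched caselessly; "a" and "c" stay caseful.
mixed = r"a(?i:b)c"
mixed_hits = [w for w in ("abc", "aBc", "Abc", "abC") if full(mixed, w)]
```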
2217    
2218    
2219 nigel 63 NAMED SUBPATTERNS
2220 nigel 41
2221 nigel 63 Identifying capturing parentheses by number is simple, but
2222     it can be very hard to keep track of the numbers in compli-
2223     cated regular expressions. Furthermore, if an expression is
2224     modified, the numbers may change. To help with the diffi-
2225     culty, PCRE supports the naming of subpatterns, something
2226     that Perl does not provide. The Python syntax (?P<name>...)
2227     is used. Names consist of alphanumeric characters and under-
2228     scores, and must be unique within a pattern.
2229    
2230     Named capturing parentheses are still allocated numbers as
2231     well as names. The PCRE API provides function calls for
2232     extracting the name-to-number translation table from a com-
2233     piled pattern. For further details see the pcreapi documen-
2234     tation.
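Because the (?P<name>...) syntax comes from Python, a Python sketch
shows the idea directly. The groupindex attribute used below belongs
to Python's re module and merely plays the role of the name-to-number
table mentioned above; it is not a PCRE function. The pattern and
group names are made up for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date.fullmatch("2007-02-24")

by_name = match.group("month")    # look the group up by name
by_number = match.group(2)        # the same group, by number
name_to_number = dict(date.groupindex)
```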
2235    
2236    
2237 nigel 41 REPETITION
2238 nigel 63
2239 nigel 41 Repetition is specified by quantifiers, which can follow any
2240     of the following items:
2241    
2242 nigel 63 a literal data character
2243 nigel 41 the . metacharacter
2244 nigel 63 the \C escape sequence
2245     escapes such as \d that match single characters
2246 nigel 41 a character class
2247     a back reference (see next section)
2248 nigel 63 a parenthesized subpattern (unless it is an assertion)
2249 nigel 41
2250     The general repetition quantifier specifies a minimum and
2251     maximum number of permitted matches, by giving the two
2252     numbers in curly brackets (braces), separated by a comma.
2253     The numbers must be less than 65536, and the first must be
2254     less than or equal to the second. For example:
2255    
2256     z{2,4}
2257    
2258     matches "zz", "zzz", or "zzzz". A closing brace on its own
2259     is not a special character. If the second number is omitted,
2260     but the comma is present, there is no upper limit; if the
2261     second number and the comma are both omitted, the quantifier
2262     specifies an exact number of required matches. Thus
2263    
2264     [aeiou]{3,}
2265    
2266     matches at least 3 successive vowels, but may match many
2267     more, while
2268    
2269     \d{8}
2270    
2271     matches exactly 8 digits. An opening curly bracket that
2272     appears in a position where a quantifier is not allowed, or
2273     one that does not match the syntax of a quantifier, is taken
2274     as a literal character. For example, {,6} is not a quantif-
2275     ier, but a literal string of four characters.
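The three quantifier forms can be exercised as follows. The sketch
assumes Python's re module, which agrees with PCRE on these examples.
It is not a complete substitute: Python reads {,6} as {0,6} rather
than as a literal string, so that case is deliberately left out.

```python
import re

def full(pattern, subject):
    """Hypothetical helper: True if the whole subject matches."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions, inclusive.
allowed_lengths = [n for n in range(1, 6) if full(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
eight_digits = full(r"\d{8}", "20070224")
seven_digits = full(r"\d{8}", "2007022")
```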
2276 nigel 63
2277     In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2278     than to individual bytes. Thus, for example, \x{100}{2}
2279     matches two UTF-8 characters, each of which is represented
2280     by a two-byte sequence.
2281    
2282 nigel 41 The quantifier {0} is permitted, causing the expression to
2283     behave as if the previous item and the quantifier were not
2284     present.
2285    
2286     For convenience (and historical compatibility) the three
2287     most common quantifiers have single-character abbreviations:
2288    
2289     * is equivalent to {0,}
2290     + is equivalent to {1,}
2291     ? is equivalent to {0,1}
2292    
2293     It is possible to construct infinite loops by following a
2294     subpattern that can match no characters with a quantifier
2295     that has no upper limit, for example:
2296    
2297     (a?)*
2298    
2299     Earlier versions of Perl and PCRE used to give an error at
2300     compile time for such patterns. However, because there are
2301     cases where this can be useful, such patterns are now
2302     accepted, but if any repetition of the subpattern does in
2303     fact match no characters, the loop is forcibly broken.
2304    
2305     By default, the quantifiers are "greedy", that is, they
2306     match as much as possible (up to the maximum number of per-
2307     mitted times), without causing the rest of the pattern to
2308     fail. The classic example of where this gives problems is in
2309     trying to match comments in C programs. These appear between
2310     the sequences /* and */ and within the sequence, individual
2311     * and / characters may appear. An attempt to match C com-
2312     ments by applying the pattern
2313    
2314     /\*.*\*/
2315    
2316     to the string
2317    
2318     /* first comment */ not comment /* second comment */
2319    
2320 nigel 51 fails, because it matches the entire string owing to the
2321 nigel 41 greediness of the .* item.
2322    
2323 nigel 47 However, if a quantifier is followed by a question mark, it
2324     ceases to be greedy, and instead matches the minimum number
2325     of times possible, so the pattern
2326 nigel 41
2327     /\*.*?\*/
2328    
2329     does the right thing with the C comments. The meaning of the
2330     various quantifiers is not otherwise changed, just the pre-
2331     ferred number of matches. Do not confuse this use of ques-
2332     tion mark with its use as a quantifier in its own right.
2333     Because it has two uses, it can sometimes appear doubled, as
2334     in
2335    
2336     \d??\d
2337    
2338     which matches one digit by preference, but can match two if
2339     that is the only way the rest of the pattern matches.
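Greedy and minimizing repeats can be compared side by side. The
sketch below assumes Python's re module, whose ? suffix on a
quantifier has the same meaning; the subject string is the C comment
example from this section.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", source)
# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the
# match (here, reaching the end of the subject) demands it.
preferred = re.match(r"\d??\d", "42").group()
forced = re.fullmatch(r"\d??\d", "42").group()
```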
2340    
2341     If the PCRE_UNGREEDY option is set (an option which is not
2342 nigel 47 available in Perl), the quantifiers are not greedy by
2343 nigel 41 default, but individual ones can be made greedy by following
2344     them with a question mark. In other words, it inverts the
2345     default behaviour.
2346    
2347     When a parenthesized subpattern is quantified with a minimum
2348     repeat count that is greater than 1 or with a limited max-
2349     imum, more store is required for the compiled pattern, in
2350     proportion to the size of the minimum or maximum.

2351     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2352     option (equivalent to Perl's /s) is set, thus allowing the .
2353 nigel 47 to match newlines, the pattern is implicitly anchored,
2354 nigel 41 because whatever follows will be tried against every charac-
2355     ter position in the subject string, so there is no point in
2356     retrying the overall match at any position after the first.
2357 nigel 63 PCRE normally treats such a pattern as though it were pre-
2358     ceded by \A.
2359 nigel 41
2360 nigel 63 In cases where it is known that the subject string contains
2361     no newlines, it is worth setting PCRE_DOTALL in order to
2362     obtain this optimization, or alternatively using ^ to indi-
2363     cate anchoring explicitly.
2364    
2365     However, there is one situation where the optimization can-
2366     not be used. When .* is inside capturing parentheses that
2367     are the subject of a backreference elsewhere in the pattern,
2368     a match at the start may fail, and a later one succeed. Con-
2369     sider, for example:
2370    
2371     (.*)abc\1
2372    
2373     If the subject is "xyz123abc123" the match point is the
2374     fourth character. For this reason, such a pattern is not
2375     implicitly anchored.
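The match point can be confirmed with a quick sketch. Python's re
module is assumed here for illustration; it says nothing about PCRE's
internal anchoring optimization and only shows where the match starts.

```python
import re

found = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = found.start()   # zero-based offset of the match
captured = found.group(1)      # what (.*) finally held
```

An offset of 3 is the fourth character, so trying the pattern only at
the start of the subject would have missed the match.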
2376    
2377 nigel 41 When a capturing subpattern is repeated, the value captured
2378     is the substring that matched the final iteration. For exam-
2379     ple, after
2380    
2381     (tweedle[dume]{3}\s*)+
2382    
2383     has matched "tweedledum tweedledee" the value of the cap-
2384     tured substring is "tweedledee". However, if there are
2385     nested capturing subpatterns, the corresponding captured
2386     values may have been set in previous iterations. For exam-
2387     ple, after
2388    
2389     /(a|(b))+/
2390    
2391     matches "aba" the value of the second captured substring is
2392     "b".
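Both cases can be observed in a small sketch, assuming Python's re
module, which keeps captured values across iterations in the same way
for these two patterns.

```python
import re

# The outer group is overwritten on each iteration, so only the
# final "tweedledee" is left in group 1.
last_iteration = re.fullmatch(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee").group(1)

# Group 2 was set on the second iteration ("b") and is not cleared
# when the third iteration takes the "a" branch.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```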
2393    
2394    
2395 nigel 63 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2396 nigel 41
2397 nigel 63 With both maximizing and minimizing repetition, failure of
2398     what follows normally causes the repeated item to be re-
2399     evaluated to see if a different number of repeats allows the
2400     rest of the pattern to match. Sometimes it is useful to
2401     prevent this, either to change the nature of the match, or
2402     to cause it to fail earlier than it otherwise might, when the
2403     author of the pattern knows there is no point in carrying
2404     on.
2405 nigel 53
2406 nigel 63 Consider, for example, the pattern \d+foo when applied to
2407     the subject line
2408 nigel 53
2409 nigel 63 123456bar
2410 nigel 53
2411 nigel 63 After matching all 6 digits and then failing to match "foo",
2412     the normal action of the matcher is to try again with only 5
2413     digits matching the \d+ item, and then with 4, and so on,
2414     before ultimately failing. "Atomic grouping" (a term taken
2415     from Jeffrey Friedl's book) provides the means for specify-
2416     ing that once a subpattern has matched, it is not to be re-
2417     evaluated in this way.
2418 nigel 53
2419 nigel 63 If we use atomic grouping for the previous example, the
2420     matcher would give up immediately on failing to match "foo"
2421     the first time. The notation is a kind of special
2422     parenthesis, starting with (?> as in this example:
2423 nigel 53
2424 nigel 63 (?>\d+)foo
2425 nigel 53
2426 nigel 63 This kind of parenthesis "locks up" the part of the pattern
2427     it contains once it has matched, and a failure further into
2428     the pattern is prevented from backtracking into it. Back-
2429     tracking past it to previous items, however, works as nor-
2430     mal.
2431 nigel 53
2432 nigel 63 An alternative description is that a subpattern of this type
2433     matches the string of characters that an identical stan-
2434     dalone pattern would match, if anchored at the current point
2435     in the subject string.
2436    
2437     Atomic grouping subpatterns are not capturing subpatterns.
2438     Simple cases such as the above example can be thought of as
2439     a maximizing repeat that must swallow everything it can. So,
2440     while both \d+ and \d+? are prepared to adjust the number of
2441     digits they match in order to make the rest of the pattern
2442     match, (?>\d+) can only match an entire sequence of digits.
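The "must swallow everything" behaviour can be sketched even without
the (?> syntax. Python's re module is assumed here, and since it
accepts (?>...) only from version 3.11 onwards, the example emulates
the atomic group with a lookahead capture followed by a back
reference. This is a swapped-in technique for illustration, not the
PCRE notation, and it costs an extra capturing group.

```python
import re

# \d+ is willing to give back its last digit so that \d can match.
ordinary = re.compile(r"\d+\d")
# (?=(\d+))\1 behaves like (?>\d+): the lookahead fixes the run of
# digits once, and \1 then consumes exactly that run.
atomic_like = re.compile(r"(?=(\d+))\1\d")

ordinary_match = ordinary.search("123456")
atomic_match = atomic_like.search("123456")
```

The second search finds nothing, because at every starting position
the digits are taken as a whole and no digit is left for the final \d.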
2443    
2444     Atomic groups in general can of course contain arbitrarily
2445     complicated subpatterns, and can be nested. However, when
2446     the subpattern for an atomic group is just a single repeated
2447     item, as in the example above, a simpler notation, called a
2448     "possessive quantifier" can be used. This consists of an
2449     additional + character following a quantifier. Using this
2450     notation, the previous example can be rewritten as
2451    
2452     \d++foo
2453    
2454     Possessive quantifiers are always greedy; the setting of the
2455     PCRE_UNGREEDY option is ignored. They are a convenient nota-
2456     tion for the simpler forms of atomic group. However, there
2457     is no difference in the meaning or processing of a posses-
2458     sive quantifier and the equivalent atomic group.
2459    
2460     The possessive quantifier syntax is an extension to the Perl
2461     syntax. It originates in Sun's Java package.
2462    
2463     When a pattern contains an unlimited repeat inside a subpat-
2464     tern that can itself be repeated an unlimited number of
2465     times, the use of an atomic group is the only way to avoid
2466     some failing matches taking a very long time indeed. The
2467     pattern
2468    
2469     (\D+|<\d+>)*[!?]
2470    
2471     matches an unlimited number of substrings that either con-
2472     sist of non-digits, or digits enclosed in <>, followed by
2473     either ! or ?. When it matches, it runs quickly. However, if
2474     it is applied to
2475    
2476     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2477    
2478     it takes a long time before reporting failure. This is
2479     because the string can be divided between the two repeats in
2480     a large number of ways, and all have to be tried. (The exam-
2481     ple used [!?] rather than a single character at the end,
2482     because both PCRE and Perl have an optimization that allows
2483     for fast failure when a single character is used. They
2484     remember the last single character that is required for a
2485     match, and fail early if it is not present in the string.)
2486     If the pattern is changed to
2487    
2488     ((?>\D+)|<\d+>)*[!?]
2489    
2490     sequences of non-digits cannot be broken, and failure hap-
2491     pens quickly.
2492    
2493    
2494     BACK REFERENCES
2495    
2496     Outside a character class, a backslash followed by a digit
2497     greater than 0 (and possibly further digits) is a back
2498     reference to a capturing subpattern earlier (that is, to its
2499 nigel 41 left) in the pattern, provided there have been that many
2500     previous capturing left parentheses.
2501    
2502     However, if the decimal number following the backslash is
2503     less than 10, it is always taken as a back reference, and
2504     causes an error only if there are not that many capturing
2505     left parentheses in the entire pattern. In other words, the
2506     parentheses that are referenced need not be to the left of
2507     the reference for numbers less than 10. See the section
2508     entitled "Backslash" above for further details of the han-
2509     dling of digits following a backslash.
2510    
2511     A back reference matches whatever actually matched the cap-
2512     turing subpattern in the current subject string, rather than
2513 nigel 63 anything matching the subpattern itself (see "Subpatterns as
2514     subroutines" below for a way of doing that). So the pattern
2515 nigel 41
2516     (sens|respons)e and \1ibility
2517    
2518     matches "sense and sensibility" and "response and responsi-
2519     bility", but not "sense and responsibility". If caseful
2520 nigel 47 matching is in force at the time of the back reference, the
2521     case of letters is relevant. For example,
2522 nigel 41
2523     ((?i)rah)\s+\1
2524    
2525     matches "rah rah" and "RAH RAH", but not "RAH rah", even
2526     though the original capturing subpattern is matched case-
2527     lessly.
2528    
2529 nigel 63 Back references to named subpatterns use the Python syntax
2530     (?P=name). We could rewrite the above example as follows:
2531    
2532     (?P<p1>(?i)rah)\s+(?P=p1)
2533    
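The numbered and named forms can be compared in a short sketch. It
assumes Python's re module, which shares the (?P<name>...) and
(?P=name) syntax. One adjustment is made for Python only: the caseless
option is written in the scoped form (?i:rah), because recent Python
versions reject (?i) in the middle of a pattern.

```python
import re

def full(pattern, subject):
    """Hypothetical helper: True if the whole subject matches."""
    return re.fullmatch(pattern, subject) is not None

austen = r"(sens|respons)e and \1ibility"
rah_numbered = r"((?i:rah))\s+\1"
rah_named = r"(?P<p1>(?i:rah))\s+(?P=p1)"

subjects = ("rah rah", "RAH RAH", "RAH rah")
numbered_hits = [s for s in subjects if full(rah_numbered, s)]
named_hits = [s for s in subjects if full(rah_named, s)]
```

The back reference repeats the text that was actually captured, so
"RAH rah" is rejected by both forms even though the group itself is
matched caselessly.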
2534 nigel 41 There may be more than one back reference to the same sub-
2535     pattern. If a subpattern has not actually been used in a
2536 nigel 47 particular match, any back references to it always fail. For
2537     example, the pattern
2538 nigel 41
2539     (a|(bc))\2
2540    
2541     always fails if it starts to match "a" rather than "bc".
2542 nigel 63 Because there may be many capturing parentheses in a pat-
2543     tern, all digits following the backslash are taken as part
2544     of a potential back reference number. If the pattern contin-
2545     ues with a digit character, some delimiter must be used to
2546     terminate the back reference. If the PCRE_EXTENDED option is
2547     set, this can be whitespace. Otherwise an empty comment can
2548     be used.
2549 nigel 41
2550     A back reference that occurs inside the parentheses to which
2551     it refers fails when the subpattern is first used, so, for
2552     example, (a\1) never matches. However, such references can
2553 nigel 49 be useful inside repeated subpatterns. For example, the pat-
2554     tern
2555 nigel 41
2556     (a|b\1)+
2557    
2558 nigel 49 matches any number of "a"s and also "aba", "ababbaa" etc. At
2559 nigel 41 each iteration of the subpattern, the back reference matches
2560 nigel 53 the character string corresponding to the previous itera-
2561     tion. In order for this to work, the pattern must be such
2562     that the first iteration does not need to match the back
2563     reference. This can be done using alternation, as in the
2564     example above, or by a quantifier with a minimum of zero.
2565 nigel 41
2566    
2567 nigel 63 ASSERTIONS
2568 nigel 41
2569     An assertion is a test on the characters following or
2570     preceding the current matching point that does not actually
2571     consume any characters. The simple assertions coded as \b,
2572 nigel 63 \B, \A, \G, \Z, \z, ^ and $ are described above. More com-
2573     plicated assertions are coded as subpatterns. There are two
2574 nigel 41 kinds: those that look ahead of the current position in the
2575     subject string, and those that look behind it.
2576 nigel 43
2577 nigel 41 An assertion subpattern is matched in the normal way, except
2578     that it does not cause the current matching position to be
2579     changed. Lookahead assertions start with (?= for positive
2580     assertions and (?! for negative assertions. For example,
2581    
2582     \w+(?=;)
2583    
2584     matches a word followed by a semicolon, but does not include
2585     the semicolon in the match, and
2586    
2587     foo(?!bar)
2588    
2589     matches any occurrence of "foo" that is not followed by
2590     "bar". Note that the apparently similar pattern
2591    
2592     (?!foo)bar
2593    
2594     does not find an occurrence of "bar" that is preceded by
2595     something other than "foo"; it finds any occurrence of "bar"
2596     whatsoever, because the assertion (?!foo) is always true
2597     when the next three characters are "bar". A lookbehind
2598     assertion is needed to achieve this effect.
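The three lookahead patterns behave as described when tried out. The
sketch assumes Python's re module, which uses the same (?= and (?!
syntax; the subject strings are invented for the example.

```python
import re

# The semicolon is required but is not part of the match.
before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the "foo" that is not followed by "bar" is reported.
foo_offsets = [m.start()
               for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead, not behind, so every "bar" is found,
# including the one that follows "foo".
bar_offsets = [m.start()
               for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]
```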
2599    
2600 nigel 63 If you want to force a matching failure at some point in a
2601     pattern, the most convenient way to do it is with (?!)
2602     because an empty string always matches, so an assertion that
2603     requires there not to be an empty string must always fail.
2604    
2605 nigel 41 Lookbehind assertions start with (?<= for positive asser-
2606     tions and (?<! for negative assertions. For example,
2607    
2608     (?<!foo)bar
2609    
2610     does find an occurrence of "bar" that is not preceded by
2611     "foo". The contents of a lookbehind assertion are restricted
2612     such that all the strings it matches must have a fixed
2613     length. However, if there are several alternatives, they do
2614     not all have to have the same fixed length. Thus
2615    
2616     (?<=bullock|donkey)
2617    
2618     is permitted, but
2619    
2620     (?<!dogs?|cats?)
2621    
2622     causes an error at compile time. Branches that match dif-
2623     ferent length strings are permitted only at the top level of
2624     a lookbehind assertion. This is an extension compared with
2625 nigel 63 Perl (at least for 5.8), which requires all branches to
2626     match the same length of string. An assertion such as
2627 nigel 41
2628     (?<=ab(c|de))
2629    
2630     is not permitted, because its single top-level branch can
2631     match two different lengths, but it is acceptable if rewrit-
2632     ten to use two top-level branches:
2633    
2634     (?<=abc|abde)
2635    
2636     The implementation of lookbehind assertions is, for each
2637     alternative, to temporarily move the current position back
2638     by the fixed width and then try to match. If there are
2639     insufficient characters before the current position, the
2640 nigel 63 match is deemed to fail.
2641 nigel 41
2642 nigel 63 PCRE does not allow the \C escape (which matches a single
2643     byte in UTF-8 mode) to appear in lookbehind assertions,
2644     because it makes it impossible to calculate the length of
2645     the lookbehind.
2646    
2647     Atomic groups can be used in conjunction with lookbehind
2648     assertions to specify efficient matching at the end of the
2649     subject string. Consider a simple pattern such as
2650    
2651     abcd$
2652    
2653     when applied to a long string that does not match. Because
2654     matching proceeds from left to right, PCRE will look for
2655     each "a" in the subject and then see if what follows matches
2656     the rest of the pattern. If the pattern is specified as
2657    
2658     ^.*abcd$
2659    
2660     the initial .* matches the entire string at first, but when
2661     this fails (because there is no following "a"), it back-
2662     tracks to match all but the last character, then all but the
2663     last two characters, and so on. Once again the search for
2664     "a" covers the entire string, from right to left, so we are
2665     no better off. However, if the pattern is written as
2666    
2667     ^(?>.*)(?<=abcd)
2668    
2669     or, equivalently,
2670    
2671     ^.*+(?<=abcd)
2672    
2673     there can be no backtracking for the .* item; it can match
2674     only the entire string. The subsequent lookbehind assertion
2675     does a single test on the last four characters. If it fails,
2676     the match fails immediately. For long strings, this approach
2677     makes a significant difference to the processing time.
2678    
2679 nigel 41 Several assertions (of any sort) may occur in succession.
2680     For example,
2681    
2682     (?<=\d{3})(?<!999)foo
2683    
2684     matches "foo" preceded by three digits that are not "999".
2685     Notice that each of the assertions is applied independently
2686     at the same point in the subject string. First there is a
2687 nigel 47 check that the previous three characters are all digits, and
2688 nigel 41 then there is a check that the same three characters are not
2689     "999". This pattern does not match "foo" preceded by six
2690     characters, the first three of which are digits and the last three
2691     of which are not "999". For example, it doesn't match
2692     "123abcfoo". A pattern to do that is
2693    
2694     (?<=\d{3}...)(?<!999)foo
2695    
2696     This time the first assertion looks at the preceding six
2697     characters, checking that the first three are digits, and
2698     then the second assertion checks that the preceding three
2699     characters are not "999".
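A sketch of the two patterns, assuming Python's re module. Python is
stricter than PCRE about lookbehind (every alternative must have the
same width), but both patterns here have a single fixed width, so the
comparison holds.

```python
import re

def finds(pattern, subject):
    """Hypothetical helper: True if the pattern matches anywhere."""
    return re.search(pattern, subject) is not None

# Both assertions inspect the same three preceding characters.
three_back = r"(?<=\d{3})(?<!999)foo"
# The first assertion now reaches back six characters.
six_back = r"(?<=\d{3}...)(?<!999)foo"

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
three_back_hits = [s for s in subjects if finds(three_back, s)]
six_back_hits = [s for s in subjects if finds(six_back, s)]
```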
2700    
2701     Assertions can be nested in any combination. For example,
2702    
2703     (?<=(?<!foo)bar)baz
2704    
2705     matches an occurrence of "baz" that is preceded by "bar"
2706     which in turn is not preceded by "foo", while
2707    
2708     (?<=\d{3}(?!999)...)foo
2709    
2710     is another pattern which matches "foo" preceded by three
2711     digits and any three characters that are not "999".
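Nested assertions can be checked in the same way. This is a sketch
that assumes Python's re module, which accepts both patterns because
each lookbehind has a fixed width; the subject strings are invented
for the example.

```python
import re

def finds(pattern, subject):
    """Hypothetical helper: True if the pattern matches anywhere."""
    return re.search(pattern, subject) is not None

# "baz" preceded by "bar", where that "bar" is not preceded by "foo".
bar_without_foo = r"(?<=(?<!foo)bar)baz"
# "foo" preceded by three digits and then three characters that
# are not "999".
digits_then_other = r"(?<=\d{3}(?!999)...)foo"

baz_hits = [s for s in ("xyzbarbaz", "foobarbaz", "barbaz")
            if finds(bar_without_foo, s)]
foo_hits = [s for s in ("123abcfoo", "123999foo", "abc123foo")
            if finds(digits_then_other, s)]
```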
2712    
2713     Assertion subpatterns are not capturing subpatterns, and may
2714     not be repeated, because it makes no sense to assert the
2715     same thing several times. If any kind of assertion contains
2716     capturing subpatterns within it, these are counted for the
2717     purposes of numbering the capturing subpatterns in the whole
2718     pattern. However, substring capturing is carried out only
2719     for positive assertions, because it does not make sense for
2720     negative assertions.
2721    
2722    
2723 nigel 63 CONDITIONAL SUBPATTERNS
2724 nigel 41
2725     It is possible to cause the matching process to obey a sub-
2726     pattern conditionally or to choose between two alternative
2727     subpatterns, depending on the result of an assertion, or
2728     whether a previous capturing subpattern matched or not. The
2729     two possible forms of conditional subpattern are
2730    
2731     (?(condition)yes-pattern)
2732     (?(condition)yes-pattern|no-pattern)
2733    
2734     If the condition is satisfied, the yes-pattern is used; oth-
2735     erwise the no-pattern (if present) is used. If there are
2736     more than two alternatives in the subpattern, a compile-time
2737     error occurs.
2738    
2739 nigel 63 There are three kinds of condition. If the text between the
2740 nigel 47 parentheses consists of a sequence of digits, the condition
2741     is satisfied if the capturing subpattern of that number has
2742 nigel 51 previously matched. The number must be greater than zero.
2743     Consider the following pattern, which contains non-
2744     significant white space to make it more readable (assume the
2745     PCRE_EXTENDED option) and to divide it into three parts for
2746     ease of discussion:
2747 nigel 41
2748     ( \( )? [^()]+ (?(1) \) )
2749    
2750     The first part matches an optional opening parenthesis, and
2751     if that character is present, sets it as the first captured
2752     substring. The second part matches one or more characters
2753     that are not parentheses. The third part is a conditional
2754     subpattern that tests whether the first set of parentheses
2755     matched or not. If they did, that is, if subject started
2756     with an opening parenthesis, the condition is true, and so
2757     the yes-pattern is executed and a closing parenthesis is
2758     required. Otherwise, since no-pattern is not present, the
2759     subpattern matches nothing. In other words, this pattern
2760     matches a sequence of non-parentheses, optionally enclosed
2761     in parentheses.
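
     The three parts can be mirrored by a small hand-written
     matcher. This is only a sketch under stated assumptions: the
     function name is hypothetical, and the attempt is made at the
     start of the subject only, whereas PCRE would go on to try
     later starting offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Tries
   ( \( )? [^()]+ (?(1) \) )  at the start of s only.
   Returns the number of characters matched, or -1 for no match. */
int match_optional_parens(const char *s)
{
    size_t i = 0;
    size_t run;
    int opened = 0;          /* plays the role of "group 1 has matched" */

    assert(s != NULL);

    /* Part 1: ( \( )?  an optional opening parenthesis. */
    if (s[i] == '(') {
        opened = 1;
        i++;
    }

    /* Part 2: [^()]+  one or more characters that are not parentheses. */
    run = strcspn(s + i, "()");
    if (run == 0)
        return -1;
    i += run;

    /* Part 3: (?(1) \) )  a closing parenthesis is required only if
       the opening one was matched. */
    if (opened) {
        if (s[i] != ')')
            return -1;
        i++;
    }
    return (int)i;
}
```

     For "(abc)" the result is 5 and for "abc" it is 3, while
     "(abc" gives -1 because the condition demands a closing
     parenthesis that is not there.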

     If the condition is the string (R), it is satisfied if a
     recursive call to the pattern or subpattern has been made.
     At "top level", the condition is false. This is a PCRE
     extension. Recursive patterns are described in the next
     section.

     If the condition is not a sequence of digits or (R), it must
     be an assertion. This may be a positive or negative
     lookahead or lookbehind assertion. Consider this pattern,
     again containing non-significant white space, and with the
     two alternatives on the second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

     The condition is a positive lookahead assertion that matches
     an optional sequence of non-letters followed by a letter. In
     other words, it tests for the presence of at least one
     letter in the subject. If a letter is found, the subject is
     matched against the first alternative; otherwise it is
     matched against the second. This pattern matches strings in
     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
     letters and dd are digits.
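
     A hand-written equivalent makes the order of events visible:
     the lookahead is evaluated first, and only the alternative
     it selects is then tried. The names below are hypothetical,
     the match is attempted at the start of the subject only, and
     "letter" is taken to mean the ASCII range a-z as in the
     pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the n characters at s are all digits. Stops at the first
   mismatch, so it never reads past the terminating zero. */
static int all_digits(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Same idea for the class [a-z]. */
static int all_lower(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (s[i] < 'a' || s[i] > 'z')
            return 0;
    return 1;
}

/* Hypothetical helper, not a PCRE function. Returns the length matched
   at the start of s (9 or 8), or -1 for no match. */
int match_code_at_start(const char *s)
{
    const char *p;
    int letter_ahead = 0;

    assert(s != NULL);

    /* Condition (?=[^a-z]*[a-z]): is there a letter anywhere ahead? */
    for (p = s; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'z') {
            letter_ahead = 1;
            break;
        }
    }

    /* Both alternatives begin with \d{2}- */
    if (!all_digits(s, 2) || s[2] != '-')
        return -1;

    if (letter_ahead) {
        /* yes-pattern: \d{2}-[a-z]{3}-\d{2} */
        if (!all_lower(s + 3, 3) || s[6] != '-' || !all_digits(s + 7, 2))
            return -1;
        return 9;
    }

    /* no-pattern: \d{2}-\d{2}-\d{2} */
    if (!all_digits(s + 3, 2) || s[5] != '-' || !all_digits(s + 6, 2))
        return -1;
    return 8;
}
```

     Note that "12-34-56x" is rejected by this sketch: the
     trailing letter makes the condition true, so only the first
     alternative is tried, and it fails on "34".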


COMMENTS

     The sequence (?# marks the start of a comment which
     continues up to the next closing parenthesis. Nested
     parentheses are not permitted. The characters that make up a
     comment play no part in the pattern matching at all.

     If the PCRE_EXTENDED option is set, an unescaped # character
     outside a character class introduces a comment that
     continues up to the next newline character in the pattern.


RECURSIVE PATTERNS

     Consider the problem of matching a string in parentheses,
     allowing for unlimited nested parentheses. Without the use
     of recursion, the best that can be done is to use a pattern
     that matches up to some fixed depth of nesting. It is not
     possible to handle an arbitrary nesting depth. Perl has
     provided an experimental facility that allows regular
     expressions to recurse (amongst other things). It does this
     by interpolating Perl code in the expression at run time,
     and the code can refer to the expression itself. A Perl
     pattern to solve the parentheses problem can be created like
     this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

     The (?p{...}) item interpolates Perl code at run time, and
     in this case refers recursively to the pattern in which it
     appears. Obviously, PCRE cannot support the interpolation of
     Perl code. Instead, it supports some special syntax for
     recursion of the entire pattern, and also for individual
     subpattern recursion.

     The special item that consists of (? followed by a number
     greater than zero and a closing parenthesis is a recursive
     call of the subpattern of the given number, provided that it
     occurs inside that subpattern. (If not, it is a "subroutine"
     call, which is described in the next section.) The special
     item (?R) is a recursive call of the entire regular
     expression.

     For example, this PCRE pattern solves the nested parentheses
     problem (assume the PCRE_EXTENDED option is set so that
     white space is ignored):

       \( ( (?>[^()]+) | (?R) )* \)

     First it matches an opening parenthesis. Then it matches any
     number of substrings which can either be a sequence of
     non-parentheses, or a recursive match of the pattern itself
     (that is a correctly parenthesized substring). Finally
     there is a closing parenthesis.
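
     The structure of that pattern maps directly onto a recursive
     function. The following is a minimal sketch, not PCRE code:
     the function name is hypothetical, and the match is
     attempted only at the start of the string. Each run of
     non-parentheses is consumed once and never given back, which
     corresponds to the atomic group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Tries
   \( ( (?>[^()]+) | (?R) )* \)  at the start of s.
   Returns the number of characters matched, or 0 for no match
   (a successful match is always at least 2 characters long). */
size_t match_parens(const char *s)
{
    size_t i;

    assert(s != NULL);

    if (s[0] != '(')                 /* \(  opening parenthesis */
        return 0;
    i = 1;

    for (;;) {                       /* ( ... | ... )*  */
        size_t run = strcspn(s + i, "()");
        if (run > 0) {               /* (?>[^()]+)  non-parentheses */
            i += run;
            continue;
        }
        if (s[i] == '(') {           /* (?R)  recursive match */
            size_t inner = match_parens(s + i);
            if (inner == 0)
                return 0;
            i += inner;
            continue;
        }
        break;                       /* neither alternative applies */
    }

    if (s[i] != ')')                 /* \)  closing parenthesis */
        return 0;
    return i + 1;
}
```

     Applied to "(ab(cd)ef)" the function returns 10, the whole
     string. Applied to "(a(b)" it returns 0, because the outer
     parenthesis is never closed.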

     If this were part of a larger pattern, you would not want to
     recurse the entire pattern, so instead you could use this:

       ( \( ( (?>[^()]+) | (?1) )* \) )

     We have put the pattern into parentheses, and caused the
     recursion to refer to them instead of the whole pattern. In
     a larger pattern, keeping track of parenthesis numbers can
     be tricky. It may be more convenient to use named
     parentheses instead. For this, PCRE uses (?P>name), which is
     an extension to the Python syntax that PCRE uses for named
     parentheses (Perl does not provide named parentheses). We
     could rewrite the above example as follows:

       (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

     This particular example pattern contains nested unlimited
     repeats, and so the use of atomic grouping for matching
     strings of non-parentheses is important when applying the
     pattern to strings that do not match. For example, when this
     pattern is applied to

       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

     it yields "no match" quickly. However, if atomic grouping is
     not used, the match runs for a very long time indeed because
     there are so many different ways the + and * repeats can
     carve up the subject, and all have to be tested before
     failure can be reported.

     At the end of a match, the values set for any capturing
     subpatterns are those from the outermost level of the
     recursion at which the subpattern value is set. If you want
     to obtain intermediate values, a callout function can be
     used (see below and the pcrecallout documentation). If the
     pattern above is matched against

       (ab(cd)ef)

     the value for the capturing parentheses is "ef", which is
     the last value taken on at the top level. If additional
     parentheses are added, giving

       \( ( ( (?>[^()]+) | (?R) )* ) \)
          ^                        ^

     the string they capture is "ab(cd)ef", the contents of the
     top level parentheses. If there are more than 15 capturing
     parentheses in a pattern, PCRE has to obtain extra memory to
     store data during a recursion, which it does by using
     pcre_malloc, freeing it via pcre_free afterwards. If no
     memory can be obtained, the match fails with the
     PCRE_ERROR_NOMEMORY error.

     Do not confuse the (?R) item with the condition (R), which
     tests for recursion. Consider this pattern, which matches
     text in angle brackets, allowing for arbitrary nesting. Only
     digits are allowed in nested brackets (that is, when
     recursing), whereas any characters are permitted at the
     outer level.

       < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

     In this pattern, (?(R) is the start of a conditional
     subpattern, with two different alternatives for the
     recursive and non-recursive cases. The (?R) item is the
     actual recursive call.


SUBPATTERNS AS SUBROUTINES

     If the syntax for a recursive subpattern reference (either
     by number or by name) is used outside the parentheses to
     which it refers, it operates like a subroutine in a
     programming language. An earlier example pointed out that
     the pattern

       (sens|respons)e and \1ibility

     matches "sense and sensibility" and "response and
     responsibility", but not "sense and responsibility". If
     instead the pattern

       (sens|respons)e and (?1)ibility

     is used, it does match "sense and responsibility" as well as
     the other two strings. Such references must, however, follow
     the subpattern to which they refer.
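
     The difference can be sketched in plain C: a back reference
     compares against the text that group 1 captured, whereas a
     subroutine call runs the group's own alternatives again. All
     names below are hypothetical, and both functions test only
     the start of the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The body of group 1, (sens|respons): returns the length matched at s,
   or 0 if neither alternative is present. */
static size_t stem_length(const char *s)
{
    if (strncmp(s, "sens", 4) == 0)
        return 4;
    if (strncmp(s, "respons", 7) == 0)
        return 7;
    return 0;
}

/* Shared driver. reuse_text != 0 models \1 (same text again);
   reuse_text == 0 models (?1) (same subpattern again). */
static int match_title(const char *s, int reuse_text)
{
    size_t first, second;
    const char *p;

    assert(s != NULL);

    first = stem_length(s);
    if (first == 0 || strncmp(s + first, "e and ", 6) != 0)
        return 0;
    p = s + first + 6;

    if (reuse_text) {
        /* \1 : the captured characters must appear again verbatim. */
        if (strncmp(p, s, first) != 0)
            return 0;
        second = first;
    } else {
        /* (?1) : either alternative of the group may match here. */
        second = stem_length(p);
        if (second == 0)
            return 0;
    }
    return strncmp(p + second, "ibility", 7) == 0;
}

/* (sens|respons)e and \1ibility */
int match_with_backreference(const char *s) { return match_title(s, 1); }

/* (sens|respons)e and (?1)ibility */
int match_with_subroutine(const char *s) { return match_title(s, 0); }
```

     Only match_with_subroutine() accepts "sense and
     responsibility"; both functions accept "sense and
     sensibility".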


CALLOUTS

     Perl has a feature whereby using the sequence (?{...})
     causes arbitrary Perl code to be obeyed in the middle of
     matching a regular expression. This makes it possible,
     amongst other things, to extract different substrings that
     match the same pair of parentheses when there is a
     repetition.

     PCRE provides a similar feature, but of course it cannot
     obey arbitrary Perl code. The feature is called "callout".
     The caller of PCRE provides an external function by putting
     its entry point in the global variable pcre_callout. By
     default, this variable contains NULL, which disables all
     calling out.

     Within a regular expression, (?C) indicates the points at
     which the external function is to be called. If you want to
     identify different callout points, you can put a number less
     than 256 after the letter C. The default value is zero. For
     example, this pattern has two callout points:

       (?C1)9abc(?C2)def

     During matching, when PCRE reaches a callout point (and
     pcre_callout is set), the external function is called. It is
     provided with the number of the callout, and, optionally,
     one item of data originally supplied by the caller of
     pcre_exec(). The callout function may cause matching to
     backtrack, or to fail altogether. A complete description of
     the interface to the callout function is given in the
     pcrecallout documentation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions


PCRE PERFORMANCE

     Certain items that may appear in regular expression patterns
     are more efficient than others. It is more efficient to use
     a character class like [aeiou] than a set of alternatives
     such as (a|e|i|o|u). In general, the simplest construction
     that provides the required behaviour is usually the most
     efficient. Jeffrey Friedl's book contains a lot of
     discussion about optimizing regular expressions for
     efficient performance.

     When a pattern begins with .* not in parentheses, or in
     parentheses that are not the subject of a backreference, and
     the PCRE_DOTALL option is set, the pattern is implicitly
     anchored by PCRE, since it can match only at the start of a
     subject string. However, if PCRE_DOTALL is not set, PCRE
     cannot make this optimization, because the . metacharacter
     does not then match a newline, and if the subject string
     contains newlines, the pattern may match from the character
     immediately following one of them instead of from the very
     start. For example, the pattern

       .*second

     matches the subject "first\nand second" (where \n stands for
     a newline character), with the match starting at the seventh
     character. In order to do this, PCRE has to retry the match
     starting after every newline in the subject.

     If you are using such a pattern with subject strings that do
     not contain newlines, the best performance is obtained by
     setting PCRE_DOTALL, or starting the pattern with ^.* to
     indicate explicit anchoring. That saves PCRE from having to
     scan along the subject looking for a newline to restart at.

     Beware of patterns that contain nested indefinite repeats.
     These can take a long time to run when applied to a string
     that does not match. Consider the pattern fragment

       (a+)*

     This can match "aaaa" in 16 different ways, and this number
     increases very rapidly as the string gets longer. (The *
     repeat can match 0, 1, 2, 3, or 4 times, and for each of
     those cases other than 0 or 4, the + repeats can match
     different numbers of times.) When the remainder of the
     pattern is such that the entire match is going to fail, PCRE
     has in principle to try every possible variation, and this
     can take an extremely long time.
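
     The growth can be checked by brute-force enumeration. The
     sketch below is not PCRE code and the function name is
     hypothetical. It counts every way in which (a+)* can consume
     some leading part of a run of "a" characters, including the
     empty match in which the * repeat matches zero times. Under
     that reading the count doubles with each extra character.

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function. Counts the ways (a+)* can
   match at the start of a run of `remaining` 'a' characters: at each
   step the * repeat either stops, or a+ takes k of the remaining
   characters and the count continues from there. */
unsigned long count_ways(unsigned int remaining)
{
    unsigned long total = 1;           /* the * repeat stops here */
    unsigned int k;

    /* The walk is exponential by design; keep the input small. */
    assert(remaining < 32);

    for (k = 1; k <= remaining; k++)   /* the next a+ takes k characters */
        total += count_ways(remaining - k);
    return total;
}
```

     count_ways(4) is 16, and count_ways(20) is already 1048576,
     which is why a failing match against a long run of "a"
     characters can take so long.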

     An optimization catches some of the simpler cases such as

       (a+)*b

     where a literal character follows. Before embarking on the
     standard matching procedure, PCRE checks that there is a "b"
     later in the subject string, and if there is not, it fails
     the match immediately. However, when there is no following
     literal this optimization cannot be used. You can see the
     difference by comparing the behaviour of

       (a+)*\d

     with the pattern above. The pattern (a+)*b gives a failure
     almost instantly when applied to a whole line of "a"
     characters, whereas (a+)*\d takes an appreciable time with
     strings longer than about 20 characters.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions.


SYNOPSIS OF POSIX API
     #include <pcreposix.h>

     int regcomp(regex_t *preg, const char *pattern,
          int cflags);

     int regexec(regex_t *preg, const char *string,
          size_t nmatch, regmatch_t pmatch[], int eflags);

     size_t regerror(int errcode, const regex_t *preg,
          char *errbuf, size_t errbuf_size);

     void regfree(regex_t *preg);


DESCRIPTION

     This set of functions provides a POSIX-style API to the PCRE
     regular expression package. See the pcreapi documentation
     for a description of the native API, which contains
     additional functionality.

     The functions described here are just wrapper functions that
     ultimately call the PCRE native API. Their prototypes are
     defined in the pcreposix.h header file, and on Unix systems
     the library itself is called pcreposix.a, so can be accessed
     by adding -lpcreposix to the command for linking an
     application which uses them. Because the POSIX functions
     call the native ones, it is also necessary to add -lpcre.

     I have implemented only those option bits that can be
     reasonably mapped to PCRE native options. In addition, the
     options REG_EXTENDED and REG_NOSUB are defined with the
     value zero. They have no effect, but since programs that are
     written to the POSIX interface often use them, this makes it
     easier to slot in PCRE as a replacement library. Other POSIX
     options are not even defined.

     When PCRE is called via these functions, it is only the API
     that is POSIX-like in style. The syntax and semantics of the
     regular expressions themselves are still those of Perl,
     subject to the setting of various PCRE options, as described
     below.

     The header for these functions is supplied as pcreposix.h to
     avoid any potential clash with other POSIX libraries. It
     can, of course, be renamed or aliased as regex.h, which is
     the "correct" name. It provides two structure types, regex_t
     for compiled internal forms, and regmatch_t for returning
     captured substrings. It also defines some constants whose
     names start with "REG_"; these are used for setting options
     and identifying error codes.


COMPILING A PATTERN

     The function regcomp() is called to compile a pattern into
     an internal form. The pattern is a C string terminated by a
     binary zero, and is passed in the argument pattern. The preg
     argument is a pointer to a regex_t structure which is used
     as a base for storing information about the compiled
     expression.

     The argument cflags is either zero, or contains one or more
     of the bits defined by the following macros:

       REG_ICASE

     The PCRE_CASELESS option is set when the expression is
     passed for compilation to the native function.

       REG_NEWLINE

     The PCRE_MULTILINE option is set when the expression is
     passed for compilation to the native function. Note that
     this does not mimic the defined POSIX behaviour for
     REG_NEWLINE (see the following section).

     In the absence of these flags, no options are passed to the
     native function. This means that the regex is compiled with
     PCRE default semantics. In particular, the way it handles
     newline characters in the subject string is the Perl way,
     not the POSIX way. Note that setting PCRE_MULTILINE has only
     some of the effects specified for REG_NEWLINE. It does not
     affect the way newlines are matched by . (they aren't) or by
     a negative class such as [^a] (they are).

     The yield of regcomp() is zero on success, and non-zero
     otherwise. The preg structure is filled in on success, and
     one member of the structure is public: re_nsub contains the
     number of capturing subpatterns in the regular expression.
     Various error codes are defined in the header file.


MATCHING NEWLINE CHARACTERS

     This area is not simple, because POSIX and Perl take
     different views of things. It is not possible to get PCRE to
     obey POSIX semantics, but then PCRE was never intended to be
     a POSIX engine. The following table lists the different
     possibilities for matching newline characters in PCRE:

                                Default   Change with

       . matches newline        no        PCRE_DOTALL
       newline matches [^a]     yes       not changeable
       $ matches \n at end      yes       PCRE_DOLLARENDONLY
       $ matches \n in middle   no        PCRE_MULTILINE
       ^ matches \n in middle   no        PCRE_MULTILINE

     This is the equivalent table for POSIX:

                                Default   Change with

       . matches newline        yes       REG_NEWLINE
       newline matches [^a]     yes       REG_NEWLINE
       $ matches \n at end      no        REG_NEWLINE
       $ matches \n in middle   no        REG_NEWLINE
       ^ matches \n in middle   no        REG_NEWLINE

     PCRE's behaviour is the same as Perl's, except that there is
     no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
     and Perl, there is no way to stop newline from matching
     [^a].

     The default POSIX newline handling can be obtained by
     setting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no
     way to make PCRE behave exactly as for the REG_NEWLINE
     action.


MATCHING A PATTERN

     The function regexec() is called to match a pre-compiled
     pattern preg against a given string, which is terminated by
     a zero byte, subject to the options in eflags. These can be:

       REG_NOTBOL

     The PCRE_NOTBOL option is set when calling the underlying
     PCRE matching function.

       REG_NOTEOL

     The PCRE_NOTEOL option is set when calling the underlying
     PCRE matching function.

     The portion of the string that was matched, and also any
     captured substrings, are returned via the pmatch argument,
     which points to an array of nmatch structures of type
     regmatch_t, containing the members rm_so and rm_eo. These
     contain the offset to the first character of each substring
     and the offset to the first character after the end of each
     substring, respectively. The 0th element of the vector
     relates to the entire portion of string that was matched;
     subsequent elements relate to the capturing subpatterns of
     the regular expression. Unused entries in the array have
     both structure members set to -1.

     A successful match yields a zero return; various error codes
     are defined in the header file, of which REG_NOMATCH is the
     "expected" failure code.
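
     The compile, match and free sequence might be wrapped as
     follows. This is a sketch under stated assumptions: it
     includes the system's <regex.h> so that it builds without
     PCRE installed, and the helper name first_match is
     hypothetical. With PCRE the include would be pcreposix.h and
     the program would be linked with -lpcreposix -lpcre; that
     combination has not been verified here. REG_EXTENDED is
     passed because a system library needs it to treat | and ( )
     as operators, while the text above says pcreposix.h defines
     it as zero.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper. Compiles pattern, matches it once against
   subject, and frees the compiled form again. Offsets are written to
   pmatch[0..nmatch-1]. Returns 0 on a match, REG_NOMATCH if there is
   none, or another non-zero code if compiling or matching failed. */
int first_match(const char *pattern, const char *subject,
                regmatch_t pmatch[], size_t nmatch)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;                /* nothing to free: compile failed */

    rc = regexec(&re, subject, nmatch, pmatch, 0);
    regfree(&re);                 /* re must not be used after this */
    return rc;
}
```

     Called with the pattern "(cat|dog) sat", the subject "the
     cat sat on the mat" and a two-element array, the helper
     returns 0; element 0 then holds the offsets 4 and 11 for the
     whole match, and element 1 holds 4 and 7 for the first
     capturing subpattern.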


ERROR MESSAGES

     The regerror() function maps a non-zero errorcode from
     either regcomp() or regexec() to a printable message. If
     preg is not NULL, the error should have arisen from the use
     of that structure. A message terminated by a binary zero is
     placed in errbuf. The length of the message, including the
     zero, is limited to errbuf_size. The yield of the function
     is the size of buffer needed to hold the whole message.
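
     Because the yield is the size needed, a common idiom is to
     call regerror() twice: once to learn the size and once to
     fetch the text. The sketch below again assumes the system's
     <regex.h> in place of pcreposix.h, and the helper name is
     hypothetical. Passing a zero errbuf_size on the first call
     follows the POSIX definition of regerror(); whether every
     version of the PCRE wrapper accepts a NULL buffer in that
     case has not been checked here.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper. Returns a malloc'd copy of the error message
   produced when pattern fails to compile. Returns NULL if the pattern
   compiles without error, or if no memory is available. The caller
   frees the result with free(). */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t needed;
    char *msg;
    int rc;

    assert(pattern != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);             /* compiled fine: no message */
        return NULL;
    }

    /* First call: ask only for the buffer size, terminating zero
       included. */
    needed = regerror(rc, &re, NULL, 0);

    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &re, msg, needed);   /* second call: get the text */
    return msg;
}
```

     For a pattern such as "(unclosed" the helper returns a
     non-empty message; the exact wording depends on the library
     in use.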


STORAGE

     Compiling a regular expression causes memory to be allocated
     and associated with the preg structure. The function
     regfree() frees all such memory, after which preg may no
     longer be used as a compiled expression.


AUTHOR

     Philip Hazel <ph10@cam.ac.uk>
     University Computing Service,
     Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions


PCRE SAMPLE PROGRAM

     A simple, complete demonstration program, to get you started
     with using PCRE, is supplied in the file pcredemo.c in the
     PCRE distribution.

     The program compiles the regular expression that is its
     first argument, and matches it against the subject string in
     its second argument. No PCRE options are set, and default
     character tables are used. If matching succeeds, the program
     outputs the portion of the subject that matched, together
     with the contents of any captured substrings.

     If the -g option is given on the command line, the program
     then goes on to check for further matches of the same
     regular expression in the same subject string. The logic is
     a little bit tricky because of the possibility of matching
     an empty string. Comments in the code explain what is going
     on.

     On a Unix system that has PCRE installed in /usr/local, you
     can compile the demonstration program using a command like
     this:

       gcc -o pcredemo pcredemo.c -I/usr/local/include \
           -L/usr/local/lib -lpcre

     Then you can run simple tests like this:

       ./pcredemo 'cat|dog' 'the cat sat on the mat'
       ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

     Note that there is a much more comprehensive test program,
     called pcretest, which supports many more facilities for
     testing regular expressions and the PCRE library. The
     pcredemo program is provided as a simple coding example.

     On some operating systems (e.g. Solaris) you may get an
     error like this when you try to run pcredemo:

       ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
       file or directory

     This is caused by the way shared library support works on
     those systems. You need to add

       -R/usr/local/lib

     to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
