Contents of /code/trunk/doc/pcre.txt

Revision 69
Sat Feb 24 21:40:18 2007 UTC (6 years, 3 months ago) by nigel
File MIME type: text/plain
File size: 142596 byte(s)
Load pcre-4.3 into code/trunk.

1 nigel 63 This file contains a concatenation of the PCRE man pages, converted to plain
2     text format for ease of searching with a text editor, or for use on systems
3     that do not have a man page processor. The small individual files that give
4     synopses of each function in the library have not been included. There are
5     separate text files for the pcregrep and pcretest commands.
6     -----------------------------------------------------------------------------
7    
8 nigel 41 NAME
9 nigel 63 PCRE - Perl-compatible regular expressions
10 nigel 41
11    
12 nigel 63 DESCRIPTION
13 nigel 41
14 nigel 63 The PCRE library is a set of functions that implement regu-
15     lar expression pattern matching using the same syntax and
16     semantics as Perl, with just a few differences. The current
17     implementation of PCRE (release 4.x) corresponds approxi-
18     mately with Perl 5.8, including support for UTF-8 encoded
19     strings. However, this support has to be explicitly
20     enabled; it is not the default.
21    
22     PCRE is written in C and released as a C library. However, a
23     number of people have written wrappers and interfaces of
24     various kinds. A C++ class is included in these contribu-
25     tions, which can be found in the Contrib directory at the
26     primary FTP site, which is:
27    
28     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
29    
30     Details of exactly which Perl regular expression features
31     are and are not supported by PCRE are given in separate
32     documents. See the pcrepattern and pcrecompat pages.
33    
34     Some features of PCRE can be included, excluded, or changed
35     when the library is built. The pcre_config() function makes
36     it possible for a client to discover which features are
37     available. Documentation about building PCRE for various
38     operating systems can be found in the README file in the
39     source distribution.
40    
41    
42     USER DOCUMENTATION
43    
44     The user documentation for PCRE has been split up into a
45     number of different sections. In the "man" format, each of
46     these is a separate "man page". In the HTML format, each is
47     a separate page, linked from the index page. In the plain
48     text format, all the sections are concatenated, for ease of
49     searching. The sections are as follows:
50    
  pcre          this document
  pcreapi       details of PCRE's native API
  pcrebuild     options for building PCRE
  pcrecallout   details of the callout feature
  pcrecompat    discussion of Perl compatibility
  pcregrep      description of the pcregrep command
  pcrepattern   syntax and semantics of supported
                regular expressions
  pcreperform   discussion of performance issues
  pcreposix     the POSIX-compatible API
  pcresample    discussion of the sample program
  pcretest      the pcretest testing command
63    
64     In addition, in the "man" and HTML formats, there is a short
65     page for each library function, listing its arguments and
66     results.
67    
68    
69     LIMITATIONS
70    
71     There are some size limitations in PCRE but it is hoped that
72     they will never in practice be relevant.
73    
74     The maximum length of a compiled pattern is 65539 (sic)
75     bytes if PCRE is compiled with the default internal linkage
76     size of 2. If you want to process regular expressions that
77     are truly enormous, you can compile PCRE with an internal
78     linkage size of 3 or 4 (see the README file in the source
79     distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However,
81     the speed of execution will be slower.
82    
83     All values in repeating quantifiers must be less than 65536.
84     The maximum number of capturing subpatterns is 65535.
85    
86     There is no limit to the number of non-capturing subpat-
87     terns, but the maximum depth of nesting of all kinds of
88     parenthesized subpattern, including capturing subpatterns,
89     assertions, and other types of subpattern, is 200.
90    
91     The maximum length of a subject string is the largest posi-
92     tive number that an integer variable can hold. However, PCRE
93     uses recursion to handle subpatterns and indefinite repeti-
94     tion. This means that the available stack space may limit
95     the size of a subject string that can be processed by cer-
96     tain patterns.
97    
98    
99     UTF-8 SUPPORT
100    
101     Starting at release 3.3, PCRE has had some support for char-
102     acter strings encoded in the UTF-8 format. For release 4.0
103     this has been greatly extended to cover most common require-
104     ments.
105    
In order to process UTF-8 strings, you must build PCRE to
107     include UTF-8 support in the code, and, in addition, you
108     must call pcre_compile() with the PCRE_UTF8 option flag.
109     When you do this, both the pattern and any subject strings
110     that are matched against it are treated as UTF-8 strings
111     instead of just strings of bytes.
112    
113     If you compile PCRE with UTF-8 support, but do not use it at
114     run time, the library will be a bit bigger, but the addi-
115     tional run time overhead is limited to testing the PCRE_UTF8
116     flag in several places, so should not be very large.
117    
118     The following comments apply when PCRE is running in UTF-8
119     mode:
120    
121     1. PCRE assumes that the strings it is given contain valid
122     UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
123     you pass invalid UTF-8 strings to PCRE, the results are
124     undefined.
125    
126     2. In a pattern, the escape sequence \x{...}, where the con-
127     tents of the braces is a string of hexadecimal digits, is
128     interpreted as a UTF-8 character whose code number is the
129     given hexadecimal number, for example: \x{1234}. If a non-
130     hexadecimal digit appears between the braces, the item is
131     not recognized. This escape sequence can be used either as
132     a literal, or within a character class.
133    
134     3. The original hexadecimal escape sequence, \xhh, matches a
135     two-byte UTF-8 character if the value is greater than 127.
136    
137     4. Repeat quantifiers apply to complete UTF-8 characters,
138     not to individual bytes, for example: \x{100}{3}.
139    
140     5. The dot metacharacter matches one UTF-8 character instead
141     of a single byte.
142    
143     6. The escape sequence \C can be used to match a single byte
144     in UTF-8 mode, but its use can lead to some strange effects.
145    
146     7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W
147     correctly test characters of any code value, but the charac-
148     ters that PCRE recognizes as digits, spaces, or word charac-
149     ters remain the same set as before, all with values less
150     than 256.
151    
152     8. Case-insensitive matching applies only to characters
153     whose values are less than 256. PCRE does not support the
154     notion of "case" for higher-valued characters.
155    
156     9. PCRE does not support the use of Unicode tables and pro-
157     perties or the Perl escapes \p, \P, and \X.
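
The relationship between a code number and its UTF-8 bytes (items 2
and 3 above) can be illustrated with a small helper. This is a
sketch only: utf8_encode() is a hypothetical name, not a PCRE
function, and it assumes the modern UTF-8 range that ends at
0x10FFFF and does not reject surrogate values.

```c
#include <stddef.h>

/* Encode one code point as UTF-8 into out[].
   Returns the number of bytes written, or 0 if cp is above 0x10FFFF.
   Illustrative helper only; this is not part of the PCRE API. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this helper, the code number in \x{1234} becomes the three
bytes E1 88 B4, and a value above 127 such as \xe9 becomes the two
bytes C3 A9, which is why \xhh can match a two-byte character in
UTF-8 mode.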
158    
159    
160     AUTHOR
161    
162     Philip Hazel <ph10@cam.ac.uk>
163     University Computing Service,
164     Cambridge CB2 3QG, England.
165     Phone: +44 1223 334714
166    
167     Last updated: 04 February 2003
168     Copyright (c) 1997-2003 University of Cambridge.
169     -----------------------------------------------------------------------------
170    
171     NAME
172     PCRE - Perl-compatible regular expressions
173    
174    
175     PCRE BUILD-TIME OPTIONS
176    
177     This document describes the optional features of PCRE that
178     can be selected when the library is compiled. They are all
179     selected, or deselected, by providing options to the config-
180     ure script which is run before the make command. The com-
181     plete list of options for configure (which includes the
182     standard ones such as the selection of the installation
183     directory) can be obtained by running
184    
185     ./configure --help
186    
187     The following sections describe certain options whose names
188     begin with --enable or --disable. These settings specify
189     changes to the defaults for the configure command. Because
190     of the way that configure works, --enable and --disable
191     always come in pairs, so the complementary option always
192     exists as well, but as it specifies the default, it is not
193     described.
194    
195    
196     UTF-8 SUPPORT
197    
198     To build PCRE with support for UTF-8 character strings, add
199    
200     --enable-utf8
201    
202     to the configure command. Of itself, this does not make PCRE
203     treat strings as UTF-8. As well as compiling PCRE with this
option, you also have to set the PCRE_UTF8 option when
205     you call the pcre_compile() function.
206    
207    
208     CODE VALUE OF NEWLINE
209    
210     By default, PCRE treats character 10 (linefeed) as the new-
211     line character. This is the normal newline character on
212     Unix-like systems. You can compile PCRE to use character 13
213     (carriage return) instead by adding
214    
215     --enable-newline-is-cr
216    
217     to the configure command. For completeness there is also a
218     --enable-newline-is-lf option, which explicitly specifies
219     linefeed as the newline character.
220    
221    
222     BUILDING SHARED AND STATIC LIBRARIES
223    
224     The PCRE building process uses libtool to build both shared
225     and static Unix libraries by default. You can suppress one
226     of these by adding one of
227    
228     --disable-shared
229     --disable-static
230    
231     to the configure command, as required.
232    
233    
234     POSIX MALLOC USAGE
235    
236     When PCRE is called through the POSIX interface (see the
237     pcreposix documentation), additional working storage is
238     required for holding the pointers to capturing substrings
239     because PCRE requires three integers per substring, whereas
240     the POSIX interface provides only two. If the number of
241     expected substrings is small, the wrapper function uses
242     space on the stack, because this is faster than using mal-
243     loc() for each call. The default threshold above which the
244     stack is no longer used is 10; it can be changed by adding a
245     setting such as
246    
247     --with-posix-malloc-threshold=20
248    
249     to the configure command.
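
The stack-or-malloc choice described above might look roughly like
the following. This is a sketch under stated assumptions, not the
wrapper's real code: run_with_ovector() and SMALL_THRESHOLD are
hypothetical names, and the matching step itself is elided.

```c
#include <stdlib.h>

#define SMALL_THRESHOLD 10   /* mirrors the documented default of 10 */

/* Choose working storage for nmatch substrings, three ints each.
   Returns 0 if the stack buffer was used, 1 if malloc() was used,
   or -1 if the allocation failed. Hypothetical helper. */
int run_with_ovector(size_t nmatch)
{
    int small[SMALL_THRESHOLD * 3];
    int *ovector = small;
    int used_heap = 0;

    if (nmatch > SMALL_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof *ovector);
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    }

    /* A real wrapper would run the match here and then copy two of
       every three integers back to the caller's POSIX structures. */
    ovector[0] = 0;

    if (used_heap)
        free(ovector);
    return used_heap;
}
```

Raising the threshold trades a larger stack frame on every call for
fewer trips through malloc().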
250    
251    
252     LIMITING PCRE RESOURCE USAGE
253    
254     Internally, PCRE has a function called match() which it
255     calls repeatedly (possibly recursively) when performing a
256     matching operation. By limiting the number of times this
257     function may be called, a limit can be placed on the
258     resources used by a single call to pcre_exec(). The limit
259     can be changed at run time, as described in the pcreapi
260     documentation. The default is 10 million, but this can be
261     changed by adding a setting such as
262    
263     --with-match-limit=500000
264    
265     to the configure command.
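
The idea of bounding work by counting calls to a recursive matching
function can be shown with a toy matcher. This is only a sketch:
limited_match() is a hypothetical name, it understands nothing but
literal characters and a "*" wildcard that matches any run of
characters, and it is far simpler than PCRE's own match() function.

```c
/* Toy backtracking matcher with a call budget.
   p is a pattern in which '*' matches any run of characters.
   Returns 1 on a full match, 0 on no match, and -1 if more than
   limit calls were needed. *calls must be 0 on entry. */
int limited_match(const char *p, const char *s, long *calls, long limit)
{
    if (++*calls > limit)
        return -1;                 /* budget exhausted: give up */

    if (*p == '\0')
        return *s == '\0';

    if (*p == '*') {
        /* Try every possible length for the wildcard. */
        for (;; s++) {
            int r = limited_match(p + 1, s, calls, limit);
            if (r != 0)
                return r;          /* match found, or limit hit */
            if (*s == '\0')
                return 0;
        }
    }

    if (*s != '\0' && *p == *s)
        return limited_match(p + 1, s + 1, calls, limit);
    return 0;
}
```

A caller that sees -1 knows that the search was abandoned rather
than completed, which is the same distinction a match limit lets a
PCRE client draw.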
266    
267    
268     HANDLING VERY LARGE PATTERNS
269    
270     Within a compiled pattern, offset values are used to point
271     from one part to another (for example, from an opening
272     parenthesis to an alternation metacharacter). By default
273     two-byte values are used for these offsets, leading to a
274     maximum size for a compiled pattern of around 64K. This is
275     sufficient to handle all but the most gigantic patterns.
276     Nevertheless, some people do want to process enormous pat-
277     terns, so it is possible to compile PCRE to use three-byte
278     or four-byte offsets by adding a setting such as
279    
280     --with-link-size=3
281    
282     to the configure command. The value given must be 2, 3, or
283     4. Using longer offsets slows down the operation of PCRE
284     because it has to load additional bytes when handling them.
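
The effect of the link size on the reachable range is simple
arithmetic, sketched below. link_offset_range() is a hypothetical
helper, and the figures are orders of magnitude only; the exact
limit on compiled pattern length is stated elsewhere in this
documentation.

```c
/* Number of distinct offset values that fit in link_size bytes.
   Returns 0 unless link_size is 2, 3, or 4. Hypothetical helper. */
unsigned long long link_offset_range(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}
```

A link size of 2 gives 65536 values (the "around 64K" above), 3
gives 16777216, and 4 gives 4294967296.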
285    
286     If you build PCRE with an increased link size, test 2 (and
287     test 5 if you are using UTF-8) will fail. Part of the output
288     of these tests is a representation of the compiled pattern,
289     and this changes with the link size.
290    
291     Last updated: 21 January 2003
292     Copyright (c) 1997-2003 University of Cambridge.
293     -----------------------------------------------------------------------------
294    
295     NAME
296     PCRE - Perl-compatible regular expressions
297    
298    
299     SYNOPSIS OF PCRE API
300    
301 nigel 41 #include <pcre.h>
302    
303     pcre *pcre_compile(const char *pattern, int options,
304     const char **errptr, int *erroffset,
305     const unsigned char *tableptr);
306    
307     pcre_extra *pcre_study(const pcre *code, int options,
308     const char **errptr);
309    
310     int pcre_exec(const pcre *code, const pcre_extra *extra,
311     const char *subject, int length, int startoffset,
312     int options, int *ovector, int ovecsize);
313    
314 nigel 63 int pcre_copy_named_substring(const pcre *code,
315     const char *subject, int *ovector,
316     int stringcount, const char *stringname,
317     char *buffer, int buffersize);
318    
319 nigel 41 int pcre_copy_substring(const char *subject, int *ovector,
320     int stringcount, int stringnumber, char *buffer,
321     int buffersize);
322    
323 nigel 63 int pcre_get_named_substring(const pcre *code,
324     const char *subject, int *ovector,
325     int stringcount, const char *stringname,
326     const char **stringptr);
327    
328     int pcre_get_stringnumber(const pcre *code,
329     const char *name);
330    
331 nigel 41 int pcre_get_substring(const char *subject, int *ovector,
332     int stringcount, int stringnumber,
333     const char **stringptr);
334    
335     int pcre_get_substring_list(const char *subject,
336     int *ovector, int stringcount, const char ***listptr);
337    
338 nigel 49 void pcre_free_substring(const char *stringptr);
339    
340     void pcre_free_substring_list(const char **stringptr);
341    
342 nigel 41 const unsigned char *pcre_maketables(void);
343    
344 nigel 43 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
345     int what, void *where);
346    
347 nigel 63
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
349    
350 nigel 63 int pcre_config(int what, void *where);
351    
352 nigel 41 char *pcre_version(void);
353    
354     void *(*pcre_malloc)(size_t);
355    
356     void (*pcre_free)(void *);
357    
358 nigel 63 int (*pcre_callout)(pcre_callout_block *);
359 nigel 41
360    
361 nigel 63 PCRE API
362 nigel 41
363     PCRE has its own native API, which is described in this
364     document. There is also a set of wrapper functions that
365 nigel 43 correspond to the POSIX regular expression API. These are
366     described in the pcreposix documentation.
367    
368 nigel 41 The native API function prototypes are defined in the header
369     file pcre.h, and on Unix systems the library itself is
370     called libpcre.a, so can be accessed by adding -lpcre to the
371 nigel 43 command for linking an application which calls it. The
372     header file defines the macros PCRE_MAJOR and PCRE_MINOR to
373     contain the major and minor release numbers for the library.
374     Applications can use these to include support for different
375     releases.
376 nigel 41
377     The functions pcre_compile(), pcre_study(), and pcre_exec()
378 nigel 53 are used for compiling and matching regular expressions. A
379     sample program that demonstrates the simplest way of using
380 nigel 63 them is given in the file pcredemo.c. The pcresample docu-
381     mentation describes how to run it.
382 nigel 49
383 nigel 63 There are convenience functions for extracting captured sub-
384     strings from a matched subject string. They are:
385    
386     pcre_copy_substring()
387     pcre_copy_named_substring()
388     pcre_get_substring()
389     pcre_get_named_substring()
390     pcre_get_substring_list()
391    
392     pcre_free_substring() and pcre_free_substring_list() are
393     also provided, to free the memory used for extracted
394 nigel 49 strings.
395 nigel 41
396 nigel 49 The function pcre_maketables() is used (optionally) to build
397     a set of character tables in the current locale for passing
398     to pcre_compile().
399    
400 nigel 43 The function pcre_fullinfo() is used to find out information
401     about a compiled pattern; pcre_info() is an obsolete version
402     which returns only some of the available information, but is
403     retained for backwards compatibility. The function
404     pcre_version() returns a pointer to a string containing the
405     version of PCRE and its date of release.
406 nigel 41
407     The global variables pcre_malloc and pcre_free initially
408     contain the entry points of the standard malloc() and free()
409     functions respectively. PCRE calls the memory management
410     functions via these variables, so a calling program can
411     replace them if it wishes to intercept the calls. This
412     should be done before calling any PCRE functions.
413    
414 nigel 63 The global variable pcre_callout initially contains NULL. It
415     can be set by the caller to a "callout" function, which PCRE
416     will then call at specified points during a matching opera-
417     tion. Details are given in the pcrecallout documentation.
418 nigel 41
419    
420 nigel 63 MULTITHREADING
421    
422 nigel 53 The PCRE functions can be used in multi-threading applica-
423     tions, with the proviso that the memory management functions
424 nigel 63 pointed to by pcre_malloc and pcre_free, and the callout
425     function pointed to by pcre_callout, are shared by all
426 nigel 53 threads.
427 nigel 41
428     The compiled form of a regular expression is not altered
429     during matching, so the same compiled pattern can safely be
430     used by several threads at once.
431    
432    
433 nigel 63 CHECKING BUILD-TIME OPTIONS
434 nigel 41
435 nigel 63 int pcre_config(int what, void *where);
436    
437     The function pcre_config() makes it possible for a PCRE
438     client to discover which optional features have been com-
439     piled into the PCRE library. The pcrebuild documentation has
440     more details about these optional features.
441    
442     The first argument for pcre_config() is an integer, specify-
443     ing which information is required; the second argument is a
444     pointer to a variable into which the information is placed.
445     The following information is available:
446    
447     PCRE_CONFIG_UTF8
448    
449     The output is an integer that is set to one if UTF-8 support
450     is available; otherwise it is set to zero.
451    
452     PCRE_CONFIG_NEWLINE
453    
454     The output is an integer that is set to the value of the
455     code that is used for the newline character. It is either
456     linefeed (10) or carriage return (13), and should normally
457     be the standard character for your operating system.
458    
459     PCRE_CONFIG_LINK_SIZE
460    
461     The output is an integer that contains the number of bytes
462     used for internal linkage in compiled regular expressions.
463     The value is 2, 3, or 4. Larger values allow larger regular
464     expressions to be compiled, at the expense of slower match-
465     ing. The default value of 2 is sufficient for all but the
466     most massive patterns, since it allows the compiled pattern
467     to be up to 64K in size.
468    
469     PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
470    
471     The output is an integer that contains the threshold above
472     which the POSIX interface uses malloc() for output vectors.
473     Further details are given in the pcreposix documentation.
474    
475     PCRE_CONFIG_MATCH_LIMIT
476    
477     The output is an integer that gives the default limit for
478     the number of internal matching function calls in a
479     pcre_exec() execution. Further details are given with
480     pcre_exec() below.
481    
482    
483 nigel 41 COMPILING A PATTERN
484 nigel 63
485     pcre *pcre_compile(const char *pattern, int options,
486     const char **errptr, int *erroffset,
487     const unsigned char *tableptr);
488    
489 nigel 41 The function pcre_compile() is called to compile a pattern
490     into an internal form. The pattern is a C string terminated
491     by a binary zero, and is passed in the argument pattern. A
492     pointer to a single block of memory that is obtained via
493     pcre_malloc is returned. This contains the compiled code and
494 nigel 53 related data. The pcre type is defined for the returned
495     block; this is a typedef for a structure whose contents are
496     not externally defined. It is up to the caller to free the
497     memory when it is no longer required.
498 nigel 41
499 nigel 53 Although the compiled code of a PCRE regex is relocatable,
500     that is, it does not depend on memory location, the complete
501     pcre data block is not fully relocatable, because it con-
502     tains a copy of the tableptr argument, which is an address
(see below).

The options argument contains independent bits that affect
505     the compilation. It should be zero if no options are
506     required. Some of the options, in particular, those that are
507     compatible with Perl, can also be set and unset from within
508     the pattern (see the detailed description of regular expres-
509 nigel 63 sions in the pcrepattern documentation). For these options,
510     the contents of the options argument specifies their initial
511     settings at the start of compilation and execution. The
512     PCRE_ANCHORED option can be set at the time of matching as
513     well as at compile time.
514 nigel 41
515     If errptr is NULL, pcre_compile() returns NULL immediately.
516     Otherwise, if compilation of a pattern fails, pcre_compile()
517     returns NULL, and sets the variable pointed to by errptr to
518     point to a textual error message. The offset from the start
519     of the pattern to the character where the error was
520     discovered is placed in the variable pointed to by
521     erroffset, which must not be NULL. If it is, an immediate
522     error is given.
523    
524     If the final argument, tableptr, is NULL, PCRE uses a
525     default set of character tables which are built when it is
526     compiled, using the default C locale. Otherwise, tableptr
527     must be the result of a call to pcre_maketables(). See the
528     section on locale support below.
529    
530 nigel 53 This code fragment shows a typical straightforward call to
531     pcre_compile():
532    
533     pcre *re;
534     const char *error;
535     int erroffset;
536     re = pcre_compile(
537     "^A.*Z", /* the pattern */
538     0, /* default options */
539     &error, /* for error message */
540     &erroffset, /* for error offset */
541     NULL); /* use default character tables */
542    
543 nigel 63 The following option bits are defined:
544 nigel 41
545     PCRE_ANCHORED
546    
547     If this bit is set, the pattern is forced to be "anchored",
548 nigel 63 that is, it is constrained to match only at the first match-
549     ing point in the string which is being searched (the "sub-
550     ject string"). This effect can also be achieved by appropri-
551     ate constructs in the pattern itself, which is the only way
552     to do it in Perl.
553 nigel 41
554     PCRE_CASELESS
555    
556     If this bit is set, letters in the pattern match both upper
557     and lower case letters. It is equivalent to Perl's /i
558 nigel 63 option, and it can be changed within a pattern by a (?i)
559     option setting.
560 nigel 41
561     PCRE_DOLLAR_ENDONLY
562    
563     If this bit is set, a dollar metacharacter in the pattern
564     matches only at the end of the subject string. Without this
565     option, a dollar also matches immediately before the final
566     character if it is a newline (but not before any other new-
567     lines). The PCRE_DOLLAR_ENDONLY option is ignored if
568     PCRE_MULTILINE is set. There is no equivalent to this option
569 nigel 63 in Perl, and no way to set it within a pattern.
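
The two behaviours of dollar outside multiline mode can be written
as a small predicate. This is a sketch that restates the rule
above, not PCRE's code: dollar_matches_at() is a hypothetical name,
and the newline character is assumed to be linefeed, the default.

```c
/* Returns 1 if a dollar metacharacter may match at offset pos of a
   subject of the given length when PCRE_MULTILINE is not set.
   dollar_endonly is nonzero when PCRE_DOLLAR_ENDONLY is in effect. */
int dollar_matches_at(const char *subject, int length, int pos,
                      int dollar_endonly)
{
    if (pos == length)
        return 1;                  /* very end of the subject */
    if (!dollar_endonly && pos == length - 1 && subject[pos] == '\n')
        return 1;                  /* just before a final newline */
    return 0;
}
```

For the subject "abc\n" (length 4), offset 3 is accepted by default
but rejected when PCRE_DOLLAR_ENDONLY is set; offset 4 is accepted
in both cases.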
570 nigel 41
571     PCRE_DOTALL
572    
If this bit is set, a dot metacharacter in the pattern
574     matches all characters, including newlines. Without it, new-
575     lines are excluded. This option is equivalent to Perl's /s
576 nigel 63 option, and it can be changed within a pattern by a (?s)
577     option setting. A negative class such as [^a] always matches
578     a newline character, independent of the setting of this
579     option.
580 nigel 41
581     PCRE_EXTENDED
582    
583     If this bit is set, whitespace data characters in the pat-
584     tern are totally ignored except when escaped or inside a
585 nigel 63 character class. Whitespace does not include the VT charac-
586     ter (code 11). In addition, characters between an unescaped
587     # outside a character class and the next newline character,
588 nigel 41 inclusive, are also ignored. This is equivalent to Perl's /x
589 nigel 63 option, and it can be changed within a pattern by a (?x)
590     option setting.
591    
592     This option makes it possible to include comments inside
593     complicated patterns. Note, however, that this applies only
594     to data characters. Whitespace characters may never appear
595 nigel 41 within special character sequences in a pattern, for example
596 nigel 63 within the sequence (?( which introduces a conditional sub-
597 nigel 41 pattern.
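
To make the rule concrete, the following sketch removes from a
pattern the characters that PCRE_EXTENDED would ignore.
strip_extended() is a hypothetical helper, not a PCRE function, and
it simplifies character class parsing: it treats the first
unescaped ] as the end of a class, so a literal ] at the start of a
class and POSIX names such as [:alpha:] are not handled properly.

```c
/* Copy pattern to out, dropping unescaped whitespace (not VT) and
   #-comments that appear outside character classes.
   out must have room for strlen(pattern) + 1 bytes. */
void strip_extended(const char *pattern, char *out)
{
    const char *p = pattern;
    int in_class = 0;

    while (*p != '\0') {
        char c = *p;

        if (c == '\\' && p[1] != '\0') {   /* escaped: keep both */
            *out++ = *p++;
            *out++ = *p++;
        } else if (in_class) {             /* inside [...]: keep all */
            if (c == ']')
                in_class = 0;
            *out++ = *p++;
        } else if (c == '[') {
            in_class = 1;
            *out++ = *p++;
        } else if (c == ' ' || c == '\t' || c == '\n' ||
                   c == '\f' || c == '\r') {
            p++;                           /* ignored whitespace */
        } else if (c == '#') {             /* comment up to newline */
            while (*p != '\0' && *p != '\n')
                p++;
            if (*p == '\n')
                p++;
        } else {
            *out++ = *p++;
        }
    }
    *out = '\0';
}
```

Applied to "a b # note\n c" this yields "abc", while a pattern such
as "[ #]\ x" is left unchanged because its space and # are either
inside a class or escaped.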
598    
599     PCRE_EXTRA
600    
601 nigel 43 This option was invented in order to turn on additional
602     functionality of PCRE that is incompatible with Perl, but it
603     is currently of very little use. When set, any backslash in
604     a pattern that is followed by a letter that has no special
605     meaning causes an error, thus reserving these combinations
606     for future expansion. By default, as in Perl, a backslash
607     followed by a letter with no special meaning is treated as a
608     literal. There are at present no other features controlled
609     by this option. It can also be set by a (?X) option setting
610     within a pattern.
611 nigel 41
612     PCRE_MULTILINE
613    
614     By default, PCRE treats the subject string as consisting of
615     a single "line" of characters (even if it actually contains
616     several newlines). The "start of line" metacharacter (^)
617     matches only at the start of the string, while the "end of
618     line" metacharacter ($) matches only at the end of the
619     string, or before a terminating newline (unless
620     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
621    
When PCRE_MULTILINE is set, the "start of line" and "end
623 nigel 43 of line" constructs match immediately following or immedi-
624     ately before any newline in the subject string, respec-
625     tively, as well as at the very start and end. This is
626 nigel 63 equivalent to Perl's /m option, and it can be changed within
627     a pattern by a (?m) option setting. If there are no "\n"
628     characters in a subject string, or no occurrences of ^ or $
629     in a pattern, setting PCRE_MULTILINE has no effect.
630 nigel 41
631 nigel 63 PCRE_NO_AUTO_CAPTURE
632    
633     If this option is set, it disables the use of numbered cap-
634     turing parentheses in the pattern. Any opening parenthesis
635     that is not followed by ? behaves as if it were followed by
636     ?: but named parentheses can still be used for capturing
637     (and they acquire numbers in the usual way). There is no
638     equivalent of this option in Perl.
639    
640 nigel 41 PCRE_UNGREEDY
641    
642     This option inverts the "greediness" of the quantifiers so
643     that they are not greedy by default, but become greedy if
644     followed by "?". It is not compatible with Perl. It can also
645     be set by a (?U) option setting within the pattern.
646    
647 nigel 49 PCRE_UTF8
648 nigel 41
649 nigel 49 This option causes PCRE to regard both the pattern and the
650 nigel 63 subject as strings of UTF-8 characters instead of single-
651     byte character strings. However, it is available only if
652     PCRE has been built to include UTF-8 support. If not, the
653     use of this option provokes an error. Details of how this
654     option changes the behaviour of PCRE are given in the sec-
655     tion on UTF-8 support in the main pcre page.
656 nigel 41
657 nigel 49
658 nigel 63 STUDYING A PATTERN
659 nigel 49
660 nigel 63 pcre_extra *pcre_study(const pcre *code, int options,
661     const char **errptr);
662    
663 nigel 41 When a pattern is going to be used several times, it is
664     worth spending more time analyzing it in order to speed up
665     the time taken for matching. The function pcre_study() takes
666 nigel 63 a pointer to a compiled pattern as its first argument. If
studying the pattern produces additional information that
668     will help speed up matching, pcre_study() returns a pointer
669     to a pcre_extra block, in which the study_data field points
670     to the results of the study.
671 nigel 41
672 nigel 63 The returned value from a pcre_study() can be passed
673     directly to pcre_exec(). However, the pcre_extra block also
674     contains other fields that can be set by the caller before
675     the block is passed; these are described below. If studying
676     the pattern does not produce any additional information,
677     pcre_study() returns NULL. In that circumstance, if the cal-
678     ling program wants to pass some of the other fields to
679     pcre_exec(), it must set up its own pcre_extra block.
680    
681 nigel 41 The second argument contains option bits. At present, no
682     options are defined for pcre_study(), and this argument
683     should always be zero.
684    
685 nigel 63 The third argument for pcre_study() is a pointer for an
686     error message. If studying succeeds (even if no data is
687     returned), the variable it points to is set to NULL. Other-
688     wise it points to a textual error message. You should there-
689     fore test the error pointer for NULL after calling
690     pcre_study(), to be sure that it has run successfully.
691 nigel 41
692 nigel 53 This is a typical call to pcre_study():
693    
694     pcre_extra *pe;
695     pe = pcre_study(
696     re, /* result of pcre_compile() */
697     0, /* no options exist */
698     &error); /* set to NULL or points to a message */
699    
700 nigel 41 At present, studying a pattern is useful only for non-
701     anchored patterns that do not have a single fixed starting
702     character. A bitmap of possible starting characters is
703     created.
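
How a bitmap of starting characters can speed up a search is
sketched below. The type and function names are hypothetical and
the layout is not claimed to be the one PCRE uses; the sketch only
shows the technique of skipping subject positions whose first byte
cannot begin a match.

```c
#include <string.h>

/* A 256-bit set with one bit per possible byte value. */
typedef struct { unsigned char bits[32]; } start_bitmap;

void bitmap_clear(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

void bitmap_add(start_bitmap *bm, unsigned char c)
{
    bm->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

int bitmap_has(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* Return the first offset at or after start whose byte is in the
   bitmap, or length if there is none. */
int first_candidate(const start_bitmap *bm, const char *subject,
                    int length, int start)
{
    int i = start;
    while (i < length && !bitmap_has(bm, (unsigned char)subject[i]))
        i++;
    return i;
}
```

For a pattern such as cat|dog the bitmap would hold 'c' and 'd', so
in the subject "xx dog cat" a full match attempt is needed only at
offsets 3 and 7.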
704    
705    
706 nigel 63 LOCALE SUPPORT
707 nigel 41
708     PCRE handles caseless matching, and determines whether char-
709     acters are letters, digits, or whatever, by reference to a
710 nigel 63 set of tables. When running in UTF-8 mode, this applies only
711     to characters with codes less than 256. The library contains
712     a default set of tables that is created in the default C
713     locale when PCRE is compiled. This is used when the final
714     argument of pcre_compile() is NULL, and is sufficient for
715     many applications.
716 nigel 41
717     An alternative set of tables can, however, be supplied. Such
718     tables are built by calling the pcre_maketables() function,
719     which has no arguments, in the relevant locale. The result
720     can then be passed to pcre_compile() as often as necessary.
721     For example, to build and use tables that are appropriate
722     for the French locale (where accented characters with codes
723     greater than 128 are treated as letters), the following code
724     could be used:
725    
726     setlocale(LC_CTYPE, "fr");
727     tables = pcre_maketables();
728     re = pcre_compile(..., tables);
729    
730     The tables are built in memory that is obtained via
731     pcre_malloc. The pointer that is passed to pcre_compile is
732     saved with the compiled pattern, and the same tables are
733 nigel 63 used via this pointer by pcre_study() and pcre_exec(). Thus,
734 nigel 41 for any single pattern, compilation, studying and matching
735     all happen in the same locale, but different patterns can be
736     compiled in different locales. It is the caller's responsi-
737     bility to ensure that the memory containing the tables
738     remains available for as long as it is needed.
739    
740    
741 nigel 63 INFORMATION ABOUT A PATTERN
742 nigel 41
743 nigel 63 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
744     int what, void *where);
745    
The pcre_fullinfo() function returns information about a
compiled pattern. It replaces the obsolete pcre_info() function,
which is nevertheless retained for backwards compatibility (and
is documented below).
750 nigel 41
751 nigel 43 The first argument for pcre_fullinfo() is a pointer to the
752     compiled pattern. The second argument is the result of
753     pcre_study(), or NULL if the pattern was not studied. The
754     third argument specifies which piece of information is
755 nigel 63 required, and the fourth argument is a pointer to a variable
756     to receive the data. The yield of the function is zero for
757     success, or one of the following negative numbers:
758 nigel 43
759 nigel 41 PCRE_ERROR_NULL the argument code was NULL
760 nigel 43 the argument where was NULL
761 nigel 41 PCRE_ERROR_BADMAGIC the "magic number" was not found
762 nigel 43 PCRE_ERROR_BADOPTION the value of what was invalid
763 nigel 41
764 nigel 53 Here is a typical call of pcre_fullinfo(), to obtain the
765     length of the compiled pattern:
766    
  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */
774    
775 nigel 43 The possible values for the third argument are defined in
776     pcre.h, and are as follows:
777    
778 nigel 63 PCRE_INFO_BACKREFMAX
779 nigel 43
780 nigel 63 Return the number of the highest back reference in the pat-
781     tern. The fourth argument should point to an int variable.
782     Zero is returned if there are no back references.
783 nigel 41
784 nigel 43 PCRE_INFO_CAPTURECOUNT
785    
786     Return the number of capturing subpatterns in the pattern.
787     The fourth argument should point to an int variable.
788    
789 nigel 63 PCRE_INFO_FIRSTBYTE
790 nigel 43
791 nigel 63 Return information about the first byte of any matched
792     string, for a non-anchored pattern. (This option used to be
793     called PCRE_INFO_FIRSTCHAR; the old name is still recognized
794     for backwards compatibility.)
795 nigel 43
796 nigel 63 If there is a fixed first byte, e.g. from a pattern such as
797 nigel 47 (cat|cow|coyote), it is returned in the integer pointed to
798     by where. Otherwise, if either
799 nigel 41
800     (a) the pattern was compiled with the PCRE_MULTILINE option,
801     and every branch starts with "^", or
802    
803     (b) every branch of the pattern starts with ".*" and
804     PCRE_DOTALL is not set (if it were set, the pattern would be
805     anchored),
806 nigel 43
807 nigel 47 -1 is returned, indicating that the pattern matches only at
808 nigel 63 the start of a subject string or after any newline within
809     the string. Otherwise -2 is returned. For anchored patterns,
810     -2 is returned.
811 nigel 41
812 nigel 43 PCRE_INFO_FIRSTTABLE
813 nigel 41
814 nigel 43 If the pattern was studied, and this resulted in the con-
815 nigel 63 struction of a 256-bit table indicating a fixed set of bytes
816     for the first byte in any matching string, a pointer to the
817     table is returned. Otherwise NULL is returned. The fourth
818     argument should point to an unsigned char * variable.
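
As a rough illustration of how such a table might be consulted,
the sketch below tests whether a given byte is marked as a
possible first byte. The bit layout used here (bit c & 7 of byte
c / 8) is an assumption that this text does not specify, and
first_byte_possible() is a hypothetical helper, not part of the
PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if byte c is marked in a 256-bit
   (32-byte) table such as the one PCRE_INFO_FIRSTTABLE yields,
   and 0 otherwise. Assumed layout: bit (c & 7) of byte (c / 8). */
int first_byte_possible(const unsigned char *table, unsigned char c)
{
  assert(table != NULL);
  return (table[c / 8] >> (c & 7)) & 1;
}
```

A caller would first check that the pointer obtained from
pcre_fullinfo() is not NULL, since no table exists for a pattern
that was not studied or for which studying produced nothing.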
819 nigel 41
820 nigel 43 PCRE_INFO_LASTLITERAL
821    
822 nigel 65 Return the value of the rightmost literal byte that must
823     exist in any matched string, other than at its start, if
824     such a byte has been recorded. The fourth argument should
825     point to an int variable. If there is no such byte, -1 is
826     returned. For anchored patterns, a last literal byte is
827     recorded only if it follows something of variable length.
828     For example, for the pattern /^a\d+z\d+/ the returned value
829     is "z", but for /^a\dz\d/ the returned value is -1.
830 nigel 43
831 nigel 63 PCRE_INFO_NAMECOUNT
832     PCRE_INFO_NAMEENTRYSIZE
833     PCRE_INFO_NAMETABLE
834    
835     PCRE supports the use of named as well as numbered capturing
836     parentheses. The names are just an additional way of identi-
837     fying the parentheses, which still acquire a number. A
838     caller that wants to extract data from a named subpattern
839     must convert the name to a number in order to access the
840     correct pointers in the output vector (described with
841     pcre_exec() below). In order to do this, it must first use
842     these three values to obtain the name-to-number mapping
843     table for the pattern.
844    
845     The map consists of a number of fixed-size entries.
846     PCRE_INFO_NAMECOUNT gives the number of entries, and
847     PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
848     of these return an int value. The entry size depends on the
849     length of the longest name. PCRE_INFO_NAMETABLE returns a
850     pointer to the first entry of the table (a pointer to char).
851     The first two bytes of each entry are the number of the cap-
852     turing parenthesis, most significant byte first. The rest of
853     the entry is the corresponding name, zero terminated. The
854     names are in alphabetical order. For example, consider the
855     following pattern (assume PCRE_EXTENDED is set, so white
856     space - including newlines - is ignored):
857    
858     (?P<date> (?P<year>(\d\d)?\d\d) -
859     (?P<month>\d\d) - (?P<day>\d\d) )
860    
There are four named subpatterns, so the table has four entries,
and each entry in the table is eight bytes long. The table is as
follows, with non-printing bytes shown in hex, and undefined
bytes shown as ??:
865    
866     00 01 d a t e 00 ??
867     00 05 d a y 00 ?? ??
868     00 04 m o n t h 00
869     00 02 y e a r 00 ??
870    
871     When writing code to extract data from named subpatterns,
872     remember that the length of each entry may be different for
873     each compiled pattern.
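
The table layout described above can be walked by hand, as in
the following sketch. It is only an illustration under the
layout stated in this section; name_to_number() is a
hypothetical helper, and in a real program the library function
pcre_get_stringnumber() (described later) does this job. The
three inputs are assumed to come from PCRE_INFO_NAMETABLE,
PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: look up a subpattern name in a name table
   laid out as described above. Returns the subpattern number, or
   -1 if the name is not present. */
int name_to_number(const char *table, int count, int entry_size,
                   const char *name)
{
  int i;
  assert(table != NULL && name != NULL);
  for (i = 0; i < count; i++) {
    const unsigned char *entry =
      (const unsigned char *)table + (size_t)i * entry_size;
    /* First two bytes: group number, most significant byte first. */
    int number = (entry[0] << 8) | entry[1];
    /* The rest of the entry is the zero-terminated name. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return number;
  }
  return -1;
}
```

A linear scan keeps the sketch short. Because the names are
stored in alphabetical order, a binary search would also work
for patterns with many named subpatterns.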
874    
875     PCRE_INFO_OPTIONS
876    
877     Return a copy of the options with which the pattern was com-
878     piled. The fourth argument should point to an unsigned long
879     int variable. These option bits are those specified in the
880     call to pcre_compile(), modified by any top-level option
881     settings within the pattern itself.
882    
883     A pattern is automatically anchored by PCRE if all of its
884     top-level alternatives begin with one of the following:
885    
886     ^ unless PCRE_MULTILINE is set
887     \A always
888     \G always
889     .* if PCRE_DOTALL is set and there are no back
890     references to the subpattern in which .* appears
891    
892     For such patterns, the PCRE_ANCHORED bit is set in the
893     options returned by pcre_fullinfo().
894    
895     PCRE_INFO_SIZE
896    
897     Return the size of the compiled pattern, that is, the value
898     that was passed as the argument to pcre_malloc() when PCRE
899     was getting memory in which to place the compiled data. The
900     fourth argument should point to a size_t variable.
901    
902     PCRE_INFO_STUDYSIZE
903    
904     Returns the size of the data block pointed to by the
905     study_data field in a pcre_extra block. That is, it is the
906     value that was passed to pcre_malloc() when PCRE was getting
907     memory into which to place the data created by pcre_study().
908     The fourth argument should point to a size_t variable.
909    
910    
911     OBSOLETE INFO FUNCTION
912    
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
914    
915 nigel 43 The pcre_info() function is now obsolete because its inter-
916     face is too restrictive to return all the available data
917     about a compiled pattern. New programs should use
918     pcre_fullinfo() instead. The yield of pcre_info() is the
919     number of capturing subpatterns, or one of the following
920     negative numbers:
921    
922     PCRE_ERROR_NULL the argument code was NULL
923     PCRE_ERROR_BADMAGIC the "magic number" was not found
924    
925     If the optptr argument is not NULL, a copy of the options
926     with which the pattern was compiled is placed in the integer
927     it points to (see PCRE_INFO_OPTIONS above).
928    
929     If the pattern is not anchored and the firstcharptr argument
930     is not NULL, it is used to pass back information about the
931     first character of any matched string (see
932 nigel 63 PCRE_INFO_FIRSTBYTE above).
933 nigel 43
934    
935 nigel 41 MATCHING A PATTERN
936 nigel 53
937 nigel 63 int pcre_exec(const pcre *code, const pcre_extra *extra,
938     const char *subject, int length, int startoffset,
939     int options, int *ovector, int ovecsize);
940 nigel 53
941 nigel 63 The function pcre_exec() is called to match a subject string
942 nigel 41 against a pre-compiled pattern, which is passed in the code
943     argument. If the pattern has been studied, the result of the
944 nigel 63 study should be passed in the extra argument.
945 nigel 41
946 nigel 53 Here is an example of a simple call to pcre_exec():
947    
948     int rc;
949     int ovector[30];
950     rc = pcre_exec(
951     re, /* result of pcre_compile() */
952     NULL, /* we didn't study the pattern */
953     "some string", /* the subject string */
954     11, /* the length of the subject string */
955     0, /* start at offset 0 in the subject */
956     0, /* default options */
957     ovector, /* vector for substring information */
958     30); /* number of elements in the vector */
959    
960 nigel 63 If the extra argument is not NULL, it must point to a
961     pcre_extra data block. The pcre_study() function returns
962     such a block (when it doesn't return NULL), but you can also
963     create one for yourself, and pass additional information in
964     it. The fields in the block are as follows:
965    
966     unsigned long int flags;
967     void *study_data;
968     unsigned long int match_limit;
969     void *callout_data;
970    
971     The flags field is a bitmap that specifies which of the
972     other fields are set. The flag bits are:
973    
974     PCRE_EXTRA_STUDY_DATA
975     PCRE_EXTRA_MATCH_LIMIT
976     PCRE_EXTRA_CALLOUT_DATA
977    
978     Other flag bits should be set to zero. The study_data field
979     is set in the pcre_extra block that is returned by
980     pcre_study(), together with the appropriate flag bit. You
981     should not set this yourself, but you can add to the block
982     by setting the other fields.
983    
984     The match_limit field provides a means of preventing PCRE
985     from using up a vast amount of resources when running pat-
986     terns that are not going to match, but which have a very
987     large number of possibilities in their search trees. The
988     classic example is the use of nested unlimited repeats.
989     Internally, PCRE uses a function called match() which it
990     calls repeatedly (sometimes recursively). The limit is
991     imposed on the number of times this function is called dur-
992     ing a match, which has the effect of limiting the amount of
993     recursion and backtracking that can take place. For patterns
994     that are not anchored, the count starts from zero for each
995     position in the subject string.
996    
The default limit for the library can be set when PCRE is built;
if it is not changed at build time, the default is 10 million,
which handles all but the most extreme cases. You can reduce the
default by supplying pcre_exec() with a pcre_extra block in which
match_limit is set to a smaller value, and PCRE_EXTRA_MATCH_LIMIT
is set in the flags field. If the limit is exceeded, pcre_exec()
returns PCRE_ERROR_MATCHLIMIT.
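
The following sketch shows one way the block might be prepared
before calling pcre_exec(). So that it compiles without pcre.h,
it uses a stand-in structure, sketch_extra, that mirrors the
four fields listed above, and a stand-in flag value; both are
assumptions made for illustration, and real code would use
pcre_extra and PCRE_EXTRA_MATCH_LIMIT from pcre.h. The helper
with_match_limit() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles without pcre.h. The field list
   mirrors the pcre_extra block shown above; the flag value is an
   assumption and should come from pcre.h in real code. */
typedef struct sketch_extra {
  unsigned long int flags;
  void *study_data;
  unsigned long int match_limit;
  void *callout_data;
} sketch_extra;

#define SKETCH_EXTRA_MATCH_LIMIT 0x0002

/* Hypothetical helper: choose the block to pass to pcre_exec().
   If studying produced a block, the limit is added to it and its
   study data is left alone; otherwise the caller's own block is
   cleared and used instead. */
sketch_extra *with_match_limit(sketch_extra *studied, sketch_extra *own,
                               unsigned long int limit)
{
  sketch_extra *block = studied;
  assert(own != NULL);
  if (block == NULL) {
    block = own;
    block->flags = 0;
    block->study_data = NULL;
    block->callout_data = NULL;
  }
  block->match_limit = limit;
  block->flags |= SKETCH_EXTRA_MATCH_LIMIT;
  return block;
}
```

The point of the design is that the flag bit and the field are
always set together, and that a block returned by pcre_study()
is extended rather than replaced.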
1005    
The callout_data field is used in conjunction with the "callout"
feature, which is described in the pcrecallout documentation.
1009    
The PCRE_ANCHORED option can be passed in the options argument,
whose unused bits must be zero. This limits pcre_exec() to
matching at the first matching position. However, if a pattern
was compiled with PCRE_ANCHORED, or turned out to be anchored by
virtue of its contents, it cannot be made unanchored at matching
time.
1016 nigel 41
1017     There are also three further options that can be set only at
1018     matching time:
1019    
1020     PCRE_NOTBOL
1021    
1022     The first character of the string is not the beginning of a
1023     line, so the circumflex metacharacter should not match
1024     before it. Setting this without PCRE_MULTILINE (at compile
1025     time) causes circumflex never to match.
1026    
1027     PCRE_NOTEOL
1028    
1029     The end of the string is not the end of a line, so the dol-
1030     lar metacharacter should not match it nor (except in multi-
1031     line mode) a newline immediately before it. Setting this
1032     without PCRE_MULTILINE (at compile time) causes dollar never
1033     to match.
1034    
1035     PCRE_NOTEMPTY
1036    
1037     An empty string is not considered to be a valid match if
1038     this option is set. If there are alternatives in the pat-
1039     tern, they are tried. If all the alternatives match the
1040     empty string, the entire match fails. For example, if the
1041     pattern
1042    
1043     a?b?
1044    
1045     is applied to a string not beginning with "a" or "b", it
1046     matches the empty string at the start of the subject. With
1047     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1048     further into the string for occurrences of "a" or "b".
1049    
1050     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1051     make a special case of a pattern match of the empty string
1052     within its split() function, and when using the /g modifier.
1053     It is possible to emulate Perl's behaviour after matching a
1054     null string by first trying the match again at the same
1055     offset with PCRE_NOTEMPTY set, and then if that fails by
1056     advancing the starting offset (see below) and trying an
1057     ordinary match again.
1058    
The subject string is passed to pcre_exec() as a pointer in
subject, a length in length, and a starting offset in
startoffset. Unlike the pattern string, the subject may contain
binary zero bytes. When the starting offset is zero, the search
for a match starts at the beginning of the subject, and this is
by far the most common case.
1065 nigel 41
1066 nigel 63 If the pattern was compiled with the PCRE_UTF8 option, the
1067     subject must be a sequence of bytes that is a valid UTF-8
1068     string. If an invalid UTF-8 string is passed, PCRE's
1069     behaviour is not defined.
1070    
1071 nigel 41 A non-zero starting offset is useful when searching for
1072     another match in the same subject by calling pcre_exec()
1073     again after a previous success. Setting startoffset differs
1074     from just passing over a shortened string and setting
1075     PCRE_NOTBOL in the case of a pattern that begins with any
1076     kind of lookbehind. For example, consider the pattern
1077    
1078     \Biss\B
1079    
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a
word boundary.) When applied to the string "Mississippi" the
first call to pcre_exec() finds the first occurrence. If
pcre_exec() is called again with just the remainder of the
subject, namely "issippi", it does not match, because \B is
always false at the start of the subject, which is deemed to be
a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the
starting point to discover that it is preceded by a letter.
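
The difference can be reproduced without PCRE by a small
hand-written stand-in for this one pattern. The sketch below is
not how PCRE matches; is_word() and find_inner_iss() are
hypothetical helpers that only mimic the behaviour of \Biss\B,
treating letters, digits and underscore as word characters and
anything outside the subject as a non-word position.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if pos lies inside the subject and holds a word
   character. Positions outside the subject count as non-word,
   which is why the very start is always a word boundary. */
int is_word(const char *s, int length, int pos)
{
  if (pos < 0 || pos >= length) return 0;
  return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* Hand-written stand-in for the pattern \Biss\B (not PCRE itself).
   Returns the offset of the first match at or after start, or -1. */
int find_inner_iss(const char *s, int length, int start)
{
  int i;
  assert(s != NULL);
  for (i = start; i + 3 <= length; i++) {
    if (s[i] != 'i' || s[i + 1] != 's' || s[i + 2] != 's') continue;
    /* \B before "iss": 'i' is a word character, so the byte
       before it must be one too. This looks behind position i. */
    if (!is_word(s, length, i - 1)) continue;
    /* \B after "iss": the byte after the final 's' must be a
       word character as well. */
    if (!is_word(s, length, i + 3)) continue;
    return i;
  }
  return -1;
}
```

Called on the whole subject with start set to 4, the helper can
see the letter before the starting point. Called on a pointer
advanced by 4 with start set to 0, it cannot, and the match at
that position is rejected.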
1092    
1093     If a non-zero starting offset is passed when the pattern is
1094     anchored, one attempt to match at the given offset is tried.
1095     This can only succeed if the pattern does not require the
1096     match to be at the start of the subject.
1097    
1098     In general, a pattern matches a certain portion of the sub-
1099     ject, and in addition, further substrings from the subject
1100     may be picked out by parts of the pattern. Following the
1101     usage in Jeffrey Friedl's book, this is called "capturing"
1102     in what follows, and the phrase "capturing subpattern" is
1103     used for a fragment of a pattern that picks out a substring.
1104     PCRE supports several other kinds of parenthesized subpat-
1105     tern that do not cause substrings to be captured.
1106    
1107     Captured substrings are returned to the caller via a vector
1108     of integer offsets whose address is passed in ovector. The
1109     number of elements in the vector is passed in ovecsize. The
1110     first two-thirds of the vector is used to pass back captured
1111     substrings, each substring using a pair of integers. The
1112     remaining third of the vector is used as workspace by
1113     pcre_exec() while matching capturing subpatterns, and is not
1114     available for passing back information. The length passed in
1115     ovecsize should always be a multiple of three. If it is not,
1116     it is rounded down.
1117    
When a match has been successful, information about captured
substrings is returned in pairs of integers, starting at the
beginning of ovector, and continuing up to two-thirds of its
length at the most. The first element of a pair is set to the
offset of the first character in a substring, and the second is
set to the offset of the first character after the end of a
substring. The first pair, ovector[0] and ovector[1], identifies
the portion of the subject string matched by the entire pattern.
The next pair is used for the first capturing subpattern, and so
on. The value returned by pcre_exec() is the number of pairs
that have been set. If there are no capturing subpatterns, the
return value from a successful match is 1, indicating that just
the first pair of offsets has been set.
1132 nigel 65
1133 nigel 41 Some convenience functions are provided for extracting the
1134     captured substrings as separate strings. These are described
1135     in the following section.
1136    
It is possible for a capturing subpattern number n+1 to match
some part of the subject when subpattern n has not been used at
all. For example, if the string "abc" is matched against the
pattern (a|(z))(bc) subpatterns 1 and 3 are matched, but 2 is
not. When this happens, both offset values corresponding to the
unused subpattern are set to -1.
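
A minimal sketch of reading the offset pairs, including the
unset case, might look like this. capture_length() is a
hypothetical helper; it assumes that ovector was filled in by a
successful call of pcre_exec() and that pairs is the positive
value that call returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of captured substring n, given the
   ovector filled in by a successful match and the number of pairs
   that were set. Returns -1 if n is out of range or if the
   subpattern did not take part in the match. */
int capture_length(const int *ovector, int pairs, int n)
{
  int start, end;
  assert(ovector != NULL);
  if (n < 0 || n >= pairs) return -1;
  start = ovector[2 * n];      /* offset of the first character */
  end = ovector[2 * n + 1];    /* offset just past the last one */
  if (start < 0 || end < 0) return -1;   /* unset: both are -1 */
  return end - start;
}
```

For the "abc" example above the vector would begin 0, 3, 0, 1,
-1, -1, 1, 3, so the helper yields 3, 1, -1 and 2 for
substrings 0 to 3.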
1143    
1144     If a capturing subpattern is matched repeatedly, it is the
1145     last portion of the string that it matched that gets
1146     returned.
1147    
1148     If the vector is too small to hold all the captured sub-
1149     strings, it is used as far as possible (up to two-thirds of
1150     its length), and the function returns a value of zero. In
1151     particular, if the substring offsets are not of interest,
1152     pcre_exec() may be called with ovector passed as NULL and
1153     ovecsize as zero. However, if the pattern contains back
1154     references and the ovector isn't big enough to remember the
1155     related substrings, PCRE has to get additional memory for
1156     use during matching. Thus it is usually advisable to supply
1157     an ovector.
1158    
Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT (or the
obsolete pcre_info()) can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size
for ovector that will allow for n captured substrings, in
addition to the offsets of the substring matched by the whole
pattern, is (n+1)*3.
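
The sizing arithmetic can be written down as two small
functions. Both names are hypothetical; the sketch only restates
the rules given in this section (a third of the vector is
workspace, and a size that is not a multiple of three is rounded
down).

```c
#include <assert.h>

/* Smallest ovecsize that allows for n captured substrings plus
   the pair for the whole match: (n+1)*3. */
int ovector_size_for(int n)
{
  assert(n >= 0);
  return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given
   ovecsize. The size is rounded down to a multiple of three and
   two-thirds of that holds pairs of integers, which works out
   to ovecsize / 3 with integer division. */
int ovector_pair_capacity(int ovecsize)
{
  assert(ovecsize >= 0);
  return ovecsize / 3;
}
```

For instance, the 30-element vector used in the earlier example
call has room for the whole match and nine captured substrings.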
1164 nigel 41
1165     If pcre_exec() fails, it returns a negative number. The fol-
1166     lowing are defined in the header file:
1167    
1168     PCRE_ERROR_NOMATCH (-1)
1169    
1170     The subject string did not match the pattern.
1171    
1172     PCRE_ERROR_NULL (-2)
1173    
1174     Either code or subject was passed as NULL, or ovector was
1175     NULL and ovecsize was not zero.
1176    
1177     PCRE_ERROR_BADOPTION (-3)
1178    
1179     An unrecognized bit was set in the options argument.
1180    
1181     PCRE_ERROR_BADMAGIC (-4)
1182    
1183     PCRE stores a 4-byte "magic number" at the start of the com-
1184     piled code, to catch the case when it is passed a junk
1185     pointer. This is the error it gives when the magic number
1186     isn't present.
1187    
1188     PCRE_ERROR_UNKNOWN_NODE (-5)
1189    
1190     While running the pattern match, an unknown item was encoun-
1191     tered in the compiled pattern. This error could be caused by
1192     a bug in PCRE or by overwriting of the compiled pattern.
1193    
1194     PCRE_ERROR_NOMEMORY (-6)
1195    
1196     If a pattern contains back references, but the ovector that
1197     is passed to pcre_exec() is not big enough to remember the
1198     referenced substrings, PCRE gets a block of memory at the
1199     start of matching to use for this purpose. If the call via
1200     pcre_malloc() fails, this error is given. The memory is
1201     freed at the end of matching.
1202    
1203 nigel 63 PCRE_ERROR_NOSUBSTRING (-7)
1204 nigel 41
1205 nigel 63 This error is used by the pcre_copy_substring(),
1206     pcre_get_substring(), and pcre_get_substring_list() func-
1207     tions (see below). It is never returned by pcre_exec().
1208 nigel 41
1209 nigel 63 PCRE_ERROR_MATCHLIMIT (-8)
1210 nigel 53
1211 nigel 63 The recursion and backtracking limit, as specified by the
1212     match_limit field in a pcre_extra structure (or defaulted)
1213     was reached. See the description above.
1214    
1215     PCRE_ERROR_CALLOUT (-9)
1216    
1217     This error is never generated by pcre_exec() itself. It is
1218     provided for use by callout functions that want to yield a
1219     distinctive error code. See the pcrecallout documentation
1220     for details.
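
A caller usually needs to separate four outcomes: a match, a
match whose offsets did not all fit in ovector, an ordinary
failure to match, and a real error. The sketch below is one
possible way to do that. The enum and classify_exec_result() are
hypothetical, and the two error values are copied from the list
above so that the sketch compiles without pcre.h.

```c
#include <assert.h>

/* Values copied from the list above; if pcre.h has already been
   included, its own definitions are used instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH     (-1)
#endif
#ifndef PCRE_ERROR_MATCHLIMIT
#define PCRE_ERROR_MATCHLIMIT  (-8)
#endif

enum exec_outcome {
  EXEC_MATCHED,       /* rc pairs of offsets were set */
  EXEC_VECTOR_FULL,   /* matched, but ovector was too small */
  EXEC_NO_MATCH,      /* the subject did not match */
  EXEC_LIMIT_HIT,     /* match_limit was reached */
  EXEC_FAILED         /* any other negative return value */
};

/* Hypothetical helper: classify the return value of pcre_exec(). */
enum exec_outcome classify_exec_result(int rc)
{
  if (rc > 0) return EXEC_MATCHED;
  if (rc == 0) return EXEC_VECTOR_FULL;
  switch (rc) {
    case PCRE_ERROR_NOMATCH:    return EXEC_NO_MATCH;
    case PCRE_ERROR_MATCHLIMIT: return EXEC_LIMIT_HIT;
    default:                    return EXEC_FAILED;
  }
}
```

Treating PCRE_ERROR_NOMATCH separately from the other negative
values is a design choice of the sketch: not matching is an
expected result, whereas the remaining codes point to a
programming or resource problem.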
1221    
1222    
1223     EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1224    
1225     int pcre_copy_substring(const char *subject, int *ovector,
1226     int stringcount, int stringnumber, char *buffer,
1227     int buffersize);
1228    
1229     int pcre_get_substring(const char *subject, int *ovector,
1230     int stringcount, int stringnumber,
1231     const char **stringptr);
1232    
1233     int pcre_get_substring_list(const char *subject,
1234     int *ovector, int stringcount, const char ***listptr);
1235    
1236 nigel 41 Captured substrings can be accessed directly by using the
1237     offsets returned by pcre_exec() in ovector. For convenience,
1238     the functions pcre_copy_substring(), pcre_get_substring(),
1239     and pcre_get_substring_list() are provided for extracting
1240     captured substrings as new, separate, zero-terminated
1241 nigel 63 strings. These functions identify substrings by number. The
1242     next section describes functions for extracting named sub-
1243 nigel 41 strings. A substring that contains a binary zero is
1244     correctly extracted and has a further zero added on the end,
1245 nigel 63 but the result is not, of course, a C string.
1246 nigel 41
The first three arguments are the same for all three of these
functions: subject is the subject string which has just been
successfully matched, ovector is a pointer to the vector of
integer offsets that was passed to pcre_exec(), and stringcount
is the number of substrings that were captured by the match,
including the substring that matched the entire regular
expression. This is the value returned by pcre_exec() if it is
greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount
should be the size of the vector divided by three.
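
The rule for choosing stringcount can be sketched as follows.
stringcount_for() is a hypothetical helper, and the -1 it
returns for a failed match is a convention of this sketch only.

```c
#include <assert.h>

/* Hypothetical helper: the stringcount to pass to the substring
   functions, given the return value of pcre_exec() and the
   ovecsize that was passed to it. Returns -1 if the match
   failed, in which case there is nothing to extract. */
int stringcount_for(int rc, int ovecsize)
{
  assert(ovecsize >= 0);
  if (rc > 0) return rc;              /* number of pairs set */
  if (rc == 0) return ovecsize / 3;   /* vector was filled up */
  return -1;                          /* no match, or an error */
}
```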
1258 nigel 41
The functions pcre_copy_substring() and pcre_get_substring()
extract a single substring, whose number is given as
stringnumber. A value of zero extracts the substring that
matched the entire pattern, while higher values extract the
captured substrings. For pcre_copy_substring(), the string is
placed in buffer, whose length is given by buffersize, while for
pcre_get_substring() a new block of memory is obtained via
pcre_malloc, and its address is returned via stringptr. The
yield of the function is the length of the string, not including
the terminating zero, or one of
1269    
1270     PCRE_ERROR_NOMEMORY (-6)
1271    
1272     The buffer was too small for pcre_copy_substring(), or the
1273     attempt to get memory failed for pcre_get_substring().
1274    
1275     PCRE_ERROR_NOSUBSTRING (-7)
1276    
1277     There is no substring whose number is stringnumber.
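
To make the behaviour concrete, here is a hand-written sketch of
copying a captured substring into a caller-supplied buffer. It
follows the description above but is not the library's code;
copy_capture() is a hypothetical name, and the two error values
are copied from this section so that the sketch compiles without
pcre.h.

```c
#include <assert.h>
#include <string.h>

/* Values copied from the text above; if pcre.h has already been
   included, its own definitions are used instead. */
#ifndef PCRE_ERROR_NOMEMORY
#define PCRE_ERROR_NOMEMORY     (-6)
#endif
#ifndef PCRE_ERROR_NOSUBSTRING
#define PCRE_ERROR_NOSUBSTRING  (-7)
#endif

/* Hypothetical helper: copy captured substring n into buffer and
   add a terminating zero. Returns the length copied (excluding
   the terminating zero) or one of the two error values. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int n, char *buffer, int buffersize)
{
  int length;
  assert(subject != NULL && ovector != NULL && buffer != NULL);
  if (n < 0 || n >= stringcount) return PCRE_ERROR_NOSUBSTRING;
  /* An unset pair holds -1 and -1, so its length is zero. */
  length = ovector[2 * n + 1] - ovector[2 * n];
  if (buffersize < length + 1) return PCRE_ERROR_NOMEMORY;
  if (length > 0)
    memcpy(buffer, subject + ovector[2 * n], (size_t)length);
  buffer[length] = 0;
  return length;
}
```

Note that the buffer must have room for the terminating zero as
well as the substring, so a three-character match needs a buffer
of at least four bytes.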
1278    
1279     The pcre_get_substring_list() function extracts all avail-
1280     able substrings and builds a list of pointers to them. All
1281     this is done in a single block of memory which is obtained
1282     via pcre_malloc. The address of the memory block is returned
1283     via listptr, which is also the start of the list of string
1284     pointers. The end of the list is marked by a NULL pointer.
1285     The yield of the function is zero if all went well, or
1286    
1287     PCRE_ERROR_NOMEMORY (-6)
1288    
1289     if the attempt to get the memory block failed.
1290    
When any of these functions encounters a substring that is
unset, which can happen when capturing subpattern number n+1
matches some part of the subject, but subpattern n has not been
used at all, it returns an empty string. This can be
distinguished from a genuine zero-length substring by inspecting
the appropriate offset in ovector, which is negative for unset
substrings.
1298    
1299 nigel 49 The two convenience functions pcre_free_substring() and
1300     pcre_free_substring_list() can be used to free the memory
1301     returned by a previous call of pcre_get_substring() or
1302     pcre_get_substring_list(), respectively. They do nothing
1303     more than call the function pointed to by pcre_free, which
1304     of course could be called directly from a C program. How-
1305     ever, PCRE is used in some situations where it is linked via
1306     a special interface to another programming language which
1307     cannot use pcre_free directly; it is for these cases that
1308     the functions are provided.
1309 nigel 41
1310    
1311 nigel 63 EXTRACTING CAPTURED SUBSTRINGS BY NAME
1312 nigel 41
1313 nigel 63 int pcre_copy_named_substring(const pcre *code,
1314     const char *subject, int *ovector,
1315     int stringcount, const char *stringname,
1316     char *buffer, int buffersize);
1317 nigel 41
1318 nigel 63 int pcre_get_stringnumber(const pcre *code,
1319     const char *name);
1320 nigel 41
1321 nigel 63 int pcre_get_named_substring(const pcre *code,
1322     const char *subject, int *ovector,
1323     int stringcount, const char *stringname,
1324     const char **stringptr);
1325 nigel 41
To extract a substring by name, you first have to find the
associated number. This can be done by calling
pcre_get_stringnumber(). The first argument is the compiled
pattern, and the second is the name. For example, for this
pattern
1331 nigel 41
  ab(?P<xxx>\d+)...
1333    
1334     the number of the subpattern called "xxx" is 1. Given the
1335     number, you can then extract the substring directly, or use
1336     one of the functions described in the previous section. For
1337     convenience, there are also two functions that do the whole
1338     job.
1339    
1340     Most of the arguments of pcre_copy_named_substring() and
1341     pcre_get_named_substring() are the same as those for the
1342     functions that extract by number, and so are not re-
1343     described here. There are just two differences.
1344    
1345     First, instead of a substring number, a substring name is
1346     given. Second, there is an extra argument, given at the
1347     start, which is a pointer to the compiled pattern. This is
1348     needed in order to gain access to the name-to-number trans-
1349     lation table.
1350    
1351     These functions call pcre_get_stringnumber(), and if it
1352     succeeds, they then call pcre_copy_substring() or
1353     pcre_get_substring(), as appropriate.
1354    
1355     Last updated: 03 February 2003
1356     Copyright (c) 1997-2003 University of Cambridge.
1357     -----------------------------------------------------------------------------
1358    
1359     NAME
1360     PCRE - Perl-compatible regular expressions
1361    
1362    
1363     PCRE CALLOUTS
1364    
1365     int (*pcre_callout)(pcre_callout_block *);
1366    
1367     PCRE provides a feature called "callout", which is a means
1368     of temporarily passing control to the caller of PCRE in the
1369     middle of pattern matching. The caller of PCRE provides an
1370     external function by putting its entry point in the global
1371     variable pcre_callout. By default, this variable contains
1372     NULL, which disables all calling out.
1373    
1374     Within a regular expression, (?C) indicates the points at
1375     which the external function is to be called. Different cal-
1376     lout points can be identified by putting a number less than
1377     256 after the letter C. The default value is zero. For
1378     example, this pattern has two callout points:
1379    
1380     (?C1)9abc(?C2)def
1381    
1382     During matching, when PCRE reaches a callout point (and
1383     pcre_callout is set), the external function is called. Its
1384     only argument is a pointer to a pcre_callout block. This
1385     contains the following variables:
1386    
1387     int version;
1388     int callout_number;
1389     int *offset_vector;
1390     const char *subject;
1391     int subject_length;
1392     int start_match;
1393     int current_position;
1394     int capture_top;
1395     int capture_last;
1396     void *callout_data;
1397    
1398     The version field is an integer containing the version
1399     number of the block format. The current version is zero. The
1400     version number may change in future if additional fields are
1401     added, but the intention is never to remove any of the
1402     existing fields.
1403    
1404     The callout_number field contains the number of the callout,
1405     as compiled into the pattern (that is, the number after ?C).
1406    
1407     The offset_vector field is a pointer to the vector of
1408     offsets that was passed by the caller to pcre_exec(). The
1409     contents can be inspected in order to extract substrings
1410     that have been matched so far, in the same way as for
extracting substrings after a match has completed.

The subject and subject_length fields contain copies of the
values that were passed to pcre_exec().
1414    
1415     The start_match field contains the offset within the subject
1416     at which the current match attempt started. If the pattern
1417     is not anchored, the callout function may be called several
1418     times for different starting points.
1419    
1420     The current_position field contains the offset within the
1421     subject of the current match pointer.
1422    
1423     The capture_top field contains the number of the highest
1424     captured substring so far.
1425    
1426     The capture_last field contains the number of the most
1427     recently captured substring.
1428    
1429     The callout_data field contains a value that is passed to
1430     pcre_exec() by the caller specifically so that it can be
passed back in callouts. It is passed in the callout_data
field of the pcre_extra data structure. If no such data was
1433     passed, the value of callout_data in a pcre_callout block is
1434     NULL. There is a description of the pcre_extra structure in
1435     the pcreapi documentation.
1436    
1437    
1438    
1439     RETURN VALUES
1440    
1441     The callout function returns an integer. If the value is
1442     zero, matching proceeds as normal. If the value is greater
1443     than zero, matching fails at the current point, but back-
1444     tracking to test other possibilities goes ahead, just as if
1445     a lookahead assertion had failed. If the value is less than
1446     zero, the match is abandoned, and pcre_exec() returns the
1447     value.
1448    
1449     Negative values should normally be chosen from the set of
1450     PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1451     forces a standard "no match" failure. The error number
1452     PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1453     it will never be used by PCRE itself.
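
The three kinds of return value can be sketched as follows.
This is only an illustration: demo_callout_block is a
hypothetical stand-in that copies the field list given above
(real code would include pcre.h and use pcre_callout_block),
DEMO_ERROR_CALLOUT assumes that PCRE_ERROR_CALLOUT is -9 in
pcre.h, and the policy applied to callouts 1 and 2 is invented
for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pcre_callout_block, with the same fields as
   the list in the text. Real code would include <pcre.h> instead. */
typedef struct {
    int         version;
    int         callout_number;
    int        *offset_vector;
    const char *subject;
    int         subject_length;
    int         start_match;
    int         current_position;
    int         capture_top;
    int         capture_last;
    void       *callout_data;
} demo_callout_block;

/* Assumed value of PCRE_ERROR_CALLOUT; check it against your pcre.h. */
#define DEMO_ERROR_CALLOUT (-9)

/* Invented policy, for illustration only:
     callout 1: always let matching proceed;
     callout 2: fail at this point (so that backtracking happens) unless
                at least 4 characters have been matched so far;
     any other callout number: abandon the whole match. */
int demo_callout(demo_callout_block *block)
{
    int matched_so_far;

    assert(block != NULL);
    matched_so_far = block->current_position - block->start_match;

    switch (block->callout_number) {
    case 1:
        return 0;                           /* zero: proceed as normal */
    case 2:
        return matched_so_far < 4 ? 1 : 0;  /* > 0: fail here, backtrack */
    default:
        return DEMO_ERROR_CALLOUT;          /* < 0: match is abandoned */
    }
}
```

With the real library, a function of this shape would take a
pcre_callout_block pointer and be installed by assigning its
address to the global variable pcre_callout.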
1454    
1455     Last updated: 21 January 2003
1456     Copyright (c) 1997-2003 University of Cambridge.
1457     -----------------------------------------------------------------------------
1458    
1459     NAME
1460     PCRE - Perl-compatible regular expressions
1461    
1462    
1463 nigel 41 DIFFERENCES FROM PERL
1464    
1465 nigel 63 This document describes the differences in the ways that
1466     PCRE and Perl handle regular expressions. The differences
1467     described here are with respect to Perl 5.8.
1468 nigel 41
1469 nigel 63 1. PCRE does not allow repeat quantifiers on lookahead
1470 nigel 41 assertions. Perl permits them, but they do not mean what you
1471     might think. For example, (?!a){3} does not assert that the
1472     next three characters are not "a". It just asserts that the
1473     next character is not "a" three times.
1474    
1475 nigel 63 2. Capturing subpatterns that occur inside negative looka-
1476 nigel 41 head assertions are counted, but their entries in the
1477     offsets vector are never set. Perl sets its numerical vari-
1478     ables from any such patterns that are matched before the
1479     assertion fails to match something (thereby succeeding), but
1480     only if the negative lookahead assertion contains just one
1481     branch.
1482    
1483 nigel 63 3. Though binary zero characters are supported in the sub-
1484 nigel 41 ject string, they are not allowed in a pattern string
1485     because it is passed as a normal C string, terminated by
1486     zero. The escape sequence "\0" can be used in the pattern to
1487     represent a binary zero.
1488    
1489 nigel 63 4. The following Perl escape sequences are not supported:
1490     \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1491     mented by Perl's general string-handling and are not part of
1492     its pattern matching engine. If any of these are encountered
1493     by PCRE, an error is generated.
1494 nigel 41
1495 nigel 63 5. PCRE does support the \Q...\E escape for quoting sub-
1496     strings. Characters in between are treated as literals. This
1497     is slightly different from Perl in that $ and @ are also
1498     handled as literals inside the quotes. In Perl, they cause
1499     variable interpolation (but of course PCRE does not have
1500     variables). Note the following examples:
1501 nigel 41
Pattern            PCRE matches      Perl matches

\Qabc$xyz\E        abc$xyz           abc followed by the
                                       contents of $xyz
\Qabc\$xyz\E       abc\$xyz          abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1508 nigel 41
1509 nigel 63 In PCRE, the \Q...\E mechanism is not recognized inside a
1510     character class.
1511 nigel 41
6. Fairly obviously, PCRE does not support the (?{code}) and
(?p{code}) constructions. However, there is some experimental
support for recursive patterns using the non-Perl items
(?R), (?number) and (?P>name). Also, the PCRE "callout"
feature allows an external function to be called during
pattern matching.

7. There are some differences that are concerned with the
settings of captured strings when part of a pattern is
repeated. For example, matching "aba" against the pattern
/^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
to "b".

8. PCRE provides some extensions to the Perl regular
1526     expression facilities:
1527    
1528     (a) Although lookbehind assertions must match fixed length
1529     strings, each alternative branch of a lookbehind assertion
1530 nigel 63 can match a different length of string. Perl requires them
1531     all to have the same length.
1532 nigel 41
1533     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1534 nigel 63 set, the $ meta-character matches only at the very end of
1535 nigel 41 the string.
1536    
1537     (c) If PCRE_EXTRA is set, a backslash followed by a letter
1538     with no special meaning is faulted.
1539    
1540 nigel 43 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1541     tion quantifiers is inverted, that is, by default they are
1542     not greedy, but if followed by a question mark they are.
1543 nigel 41
1544     (e) PCRE_ANCHORED can be used to force a pattern to be tried
1545 nigel 63 only at the first matching position in the subject string.
1546 nigel 41
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
for pcre_exec(), and the PCRE_NO_AUTO_CAPTURE option for
pcre_compile(), have no Perl equivalents.
1550 nigel 41
(g) The (?R), (?number), and (?P>name) constructs allow for
recursive pattern matching (Perl can do this using the
(?p{code}) construct, which PCRE cannot support).
1554 nigel 41
1555 nigel 63 (h) PCRE supports named capturing substrings, using the
1556     Python syntax.
1557 nigel 41
1558 nigel 63 (i) PCRE supports the possessive quantifier "++" syntax,
1559     taken from Sun's Java package.
1560 nigel 43
1561 nigel 63 (j) The (R) condition, for testing recursion, is a PCRE
1562     extension.
1563    
1564     (k) The callout facility is PCRE-specific.
1565    
1566     Last updated: 03 February 2003
1567     Copyright (c) 1997-2003 University of Cambridge.
1568     -----------------------------------------------------------------------------
1569    
1570     NAME
1571     PCRE - Perl-compatible regular expressions
1572    
1573    
1574     PCRE REGULAR EXPRESSION DETAILS
1575    
1576 nigel 41 The syntax and semantics of the regular expressions sup-
1577     ported by PCRE are described below. Regular expressions are
1578     also described in the Perl documentation and in a number of
1579     other books, some of which have copious examples. Jeffrey
1580     Friedl's "Mastering Regular Expressions", published by
1581 nigel 63 O'Reilly, covers them in great detail. The description here
1582     is intended as reference documentation.
1583 nigel 49
1584     The basic operation of PCRE is on strings of bytes. However,
1585 nigel 63 there is also support for UTF-8 character strings. To use
1586     this support you must build PCRE to include UTF-8 support,
1587     and then call pcre_compile() with the PCRE_UTF8 option. How
1588     this affects the pattern matching is mentioned in several
1589     places below. There is also a summary of UTF-8 features in
1590     the section on UTF-8 support in the main pcre page.
1591 nigel 41
1592     A regular expression is a pattern that is matched against a
1593     subject string from left to right. Most characters stand for
1594     themselves in a pattern, and match the corresponding charac-
1595     ters in the subject. As a trivial example, the pattern
1596    
1597     The quick brown fox
1598    
1599     matches a portion of a subject string that is identical to
1600     itself. The power of regular expressions comes from the
1601     ability to include alternatives and repetitions in the pat-
1602     tern. These are encoded in the pattern by the use of meta-
1603     characters, which do not stand for themselves but instead
1604     are interpreted in some special way.
1605    
1606     There are two different sets of meta-characters: those that
1607     are recognized anywhere in the pattern except within square
1608     brackets, and those that are recognized in square brackets.
1609     Outside square brackets, the meta-characters are as follows:
1610    
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
1626    
1627     Part of a pattern that is in square brackets is called a
1628     "character class". In a character class the only meta-
1629     characters are:
1630    
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
1637    
1638     The following sections describe the use of each of the
1639     meta-characters.
1640    
1641    
1642 nigel 63 BACKSLASH
1643 nigel 41
1644     The backslash character has several uses. Firstly, if it is
1645     followed by a non-alphameric character, it takes away any
1646     special meaning that character may have. This use of
1647     backslash as an escape character applies both inside and
1648     outside character classes.
1649    
1650 nigel 63 For example, if you want to match a * character, you write
1651     \* in the pattern. This escaping action applies whether or
1652     not the following character would otherwise be interpreted
1653     as a meta-character, so it is always safe to precede a non-
1654     alphameric with backslash to specify that it stands for
1655     itself. In particular, if you want to match a backslash, you
1656     write \\.
1657 nigel 41
1658     If a pattern is compiled with the PCRE_EXTENDED option, whi-
1659     tespace in the pattern (other than in a character class) and
1660 nigel 63 characters between a # outside a character class and the
1661 nigel 41 next newline character are ignored. An escaping backslash
1662 nigel 63 can be used to include a whitespace or # character as part
1663 nigel 41 of the pattern.
1664    
1665 nigel 63 If you want to remove the special meaning from a sequence of
1666     characters, you can do so by putting them between \Q and \E.
1667     This is different from Perl in that $ and @ are handled as
1668     literals in \Q...\E sequences in PCRE, whereas in Perl, $
1669     and @ cause variable interpolation. Note the following exam-
1670     ples:
1671    
Pattern            PCRE matches      Perl matches

\Qabc$xyz\E        abc$xyz           abc followed by the
                                       contents of $xyz
\Qabc\$xyz\E       abc\$xyz          abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1679    
1680     The \Q...\E sequence is recognized both inside and outside
1681     character classes.
1682    
1683 nigel 41 A second use of backslash provides a way of encoding non-
1684     printing characters in patterns in a visible manner. There
1685     is no restriction on the appearance of non-printing charac-
1686     ters, apart from the binary zero that terminates a pattern,
1687     but when a pattern is being prepared by text editing, it is
1688     usually easier to use one of the following escape sequences
1689     than the binary character it represents:
1690    
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
1701 nigel 41
1702 nigel 63 The precise effect of \cx is as follows: if x is a lower
1703 nigel 41 case letter, it is converted to upper case. Then bit 6 of
1704 nigel 63 the character (hex 40) is inverted. Thus \cz becomes hex
1705     1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
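
The \cx rule is small enough to write out directly. The
following is a minimal sketch, not PCRE's own code; the function
name is made up for the example, and only ASCII input is
handled.

```c
#include <assert.h>
#include <ctype.h>

/* Value produced by \cx for an ASCII character x, following the rule in
   the text: fold a lower case letter to upper case, then invert bit 6
   (hex 40) of the character. */
int demo_control_escape_value(int x)
{
    assert(x >= 0 && x <= 127);   /* this sketch handles ASCII only */
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

For the three cases mentioned above, the function gives hex 1A
for "z", hex 3B for "{", and hex 7B for ";".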
1706 nigel 41
1707 nigel 63 After \x, from zero to two hexadecimal digits are read
1708     (letters can be in upper or lower case). In UTF-8 mode, any
1709     number of hexadecimal digits may appear between \x{ and },
1710     but the value of the character code must be less than 2**31
1711     (that is, the maximum hexadecimal value is 7FFFFFFF). If
1712     characters other than hexadecimal digits appear between \x{
1713     and }, or if there is no terminating }, this form of escape
1714     is not recognized. Instead, the initial \x will be inter-
1715     preted as a basic hexadecimal escape, with no following
1716     digits, giving a byte whose value is zero.
1717 nigel 41
1718 nigel 63 Characters whose value is less than 256 can be defined by
1719     either of the two syntaxes for \x when PCRE is in UTF-8
1720     mode. There is no difference in the way they are handled.
1721     For example, \xdc is exactly the same as \x{dc}.
1722    
1723     After \0 up to two further octal digits are read. In both
1724 nigel 41 cases, if there are fewer than two digits, just those that
1725 nigel 63 are present are used. Thus the sequence \0\x\07 specifies
1726     two binary zeros followed by a BEL character (code value 7).
1727     Make sure you supply two digits after the initial zero if
1728     the character that follows is itself an octal digit.
1729 nigel 41
1730     The handling of a backslash followed by a digit other than 0
1731     is complicated. Outside a character class, PCRE reads it
1732     and any following digits as a decimal number. If the number
1733     is less than 10, or if there have been at least that many
1734     previous capturing left parentheses in the expression, the
1735     entire sequence is taken as a back reference. A description
1736     of how this works is given later, following the discussion
1737     of parenthesized subpatterns.
1738    
1739     Inside a character class, or if the decimal number is
1740     greater than 9 and there have not been that many capturing
1741     subpatterns, PCRE re-reads up to three octal digits follow-
1742     ing the backslash, and generates a single byte from the
1743     least significant 8 bits of the value. Any subsequent digits
1744     stand for themselves. For example:
1745    
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"
1760    
1761     Note that octal values of 100 or greater must not be intro-
1762     duced by a leading zero, because no more than three octal
1763     digits are ever read.
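
The decision between a back reference and an octal byte can be
sketched as below. This follows the rules as stated in this
section and is not taken from the PCRE source; all of the names
(demo_escape, demo_read_digit_escape, and so on) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_BACKREF, DEMO_BYTE } demo_kind;

typedef struct {
    demo_kind kind;      /* back reference, or a single byte value */
    int       value;     /* group number, or byte value 0..255 */
    size_t    consumed;  /* number of digits used after the backslash */
} demo_escape;

/* Interpret the digits that follow a backslash.
     digits       points just past the backslash, at a decimal digit
     groups_seen  capturing left parentheses seen so far in the pattern
     in_class     non-zero when inside a character class */
demo_escape demo_read_digit_escape(const char *digits, int groups_seen,
                                   int in_class)
{
    demo_escape r;
    size_t i = 0;
    int n = 0;

    assert(digits != NULL && digits[0] >= '0' && digits[0] <= '9');

    /* \0 is always octal. Otherwise, outside a class, first read a decimal
       number (at most 9 digits here, so that an int cannot overflow). */
    if (digits[0] != '0' && !in_class) {
        while (i < 9 && digits[i] >= '0' && digits[i] <= '9') {
            n = n * 10 + (digits[i] - '0');
            i++;
        }
        if (n < 10 || n <= groups_seen) {
            r.kind = DEMO_BACKREF;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }

    /* Re-read up to three octal digits and keep the least significant
       8 bits. If the first digit is 8 or 9, no digits are used and the
       result is a binary zero. */
    n = 0;
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        n = n * 8 + (digits[i] - '0');
    r.kind = DEMO_BYTE;
    r.value = n & 0xFF;
    r.consumed = i;
    return r;
}
```

Any digits beyond the consumed count are left for the caller
to treat as literal characters.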
1764 nigel 43
1765 nigel 63 All the sequences that define a single byte value or a sin-
1766     gle UTF-8 character (in UTF-8 mode) can be used both inside
1767     and outside character classes. In addition, inside a charac-
1768     ter class, the sequence \b is interpreted as the backspace
1769     character (hex 08). Outside a character class it has a dif-
1770     ferent meaning (see below).
1771 nigel 41
1772     The third use of backslash is for specifying generic charac-
1773     ter types:
1774    
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
1781 nigel 41
1782     Each pair of escape sequences partitions the complete set of
1783     characters into two disjoint sets. Any given character
1784     matches one, and only one, of each pair.
1785    
1786 nigel 63 In UTF-8 mode, characters with values greater than 255 never
1787     match \d, \s, or \w, and always match \D, \S, and \W.
1788    
For compatibility with Perl, \s does not match the VT character
(code 11). This makes it different from the POSIX
1791     "space" class. The \s characters are HT (9), LF (10), FF
1792     (12), CR (13), and space (32).
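
The difference between \s and the POSIX "space" class can be
checked with a few lines of C. In this sketch the function
names are invented, and isspace() in the default "C" locale is
used as a stand-in for the POSIX class; PCRE itself consults
its own character tables.

```c
#include <assert.h>
#include <ctype.h>

/* Characters matched by \s according to the list in the text:
   HT (9), LF (10), FF (12), CR (13), and space (32). VT (11) is absent. */
int demo_is_backslash_s(int c)
{
    assert(c >= 0 && c <= 255);
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Stand-in for the POSIX "space" class: isspace() in the "C" locale,
   which also accepts VT (11). */
int demo_is_posix_space(int c)
{
    assert(c >= 0 && c <= 255);
    return isspace(c) != 0;
}
```

Under those assumptions the two functions disagree only for
code 11.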
1793    
1794 nigel 41 A "word" character is any letter or digit or the underscore
1795     character, that is, any character which can be part of a
1796     Perl "word". The definition of letters and digits is con-
1797     trolled by PCRE's character tables, and may vary if locale-
1798 nigel 63 specific matching is taking place (see "Locale support" in
1799     the pcreapi page). For example, in the "fr" (French) locale,
1800     some character codes greater than 128 are used for accented
1801     letters, and these are matched by \w.
1802 nigel 41
1803     These character type sequences can appear both inside and
1804     outside character classes. They each match one character of
1805     the appropriate type. If the current matching point is at
1806     the end of the subject string, all of them fail, since there
1807     is no character to match.
1808    
1809     The fourth use of backslash is for certain simple asser-
1810     tions. An assertion specifies a condition that has to be met
1811     at a particular point in a match, without consuming any
1812     characters from the subject string. The use of subpatterns
1813     for more complicated assertions is described below. The
1814     backslashed assertions are
1815    
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject
1822 nigel 41
1823     These assertions may not appear in character classes (but
1824 nigel 63 note that \b has a different meaning, namely the backspace
1825 nigel 41 character, inside a character class).
1826 nigel 43
1827 nigel 41 A word boundary is a position in the subject string where
1828     the current character and the previous character do not both
1829     match \w or \W (i.e. one matches \w and the other matches
1830     \W), or the start or end of the string if the first or last
1831     character matches \w, respectively.
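
The definition of a word boundary can be written as a small
test on the characters either side of a position. This is a
sketch only: the function names are made up, and isalnum() in
the default "C" locale stands in for PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* \w in this sketch: a letter, a digit, or underscore. */
static int demo_is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Non-zero if position pos (0..len) in the subject is a word boundary,
   that is, if exactly one of the previous and current characters is a
   word character. Positions beyond either end of the string count as
   non-word characters. */
int demo_at_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev_is_word, cur_is_word;

    assert(subject != NULL && pos <= len);
    prev_is_word = pos > 0 &&
                   demo_is_word_char((unsigned char)subject[pos - 1]);
    cur_is_word  = pos < len &&
                   demo_is_word_char((unsigned char)subject[pos]);
    return prev_is_word != cur_is_word;
}
```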

The \A, \Z, and \z assertions differ from the traditional
1833     circumflex and dollar (described below) in that they only
1834     ever match at the very start and end of the subject string,
1835 nigel 63 whatever options are set. Thus, they are independent of mul-
1836     tiline mode.
1837    
1838     They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1839     options. If the startoffset argument of pcre_exec() is non-
1840     zero, indicating that matching is to start at a point other
1841     than the beginning of the subject, \A can never match. The
1842 nigel 41 difference between \Z and \z is that \Z matches before a
1843     newline that is the last character of the string as well as
1844     at the end of the string, whereas \z matches only at the
1845     end.
1846    
1847 nigel 63 The \G assertion is true only when the current matching
1848     position is at the start point of the match, as specified by
1849     the startoffset argument of pcre_exec(). It differs from \A
1850     when the value of startoffset is non-zero. By calling
1851     pcre_exec() multiple times with appropriate arguments, you
1852     can mimic Perl's /g option, and it is in this kind of imple-
1853     mentation where \G can be useful.
1854 nigel 41
1855 nigel 63 Note, however, that PCRE's interpretation of \G, as the
1856     start of the current match, is subtly different from Perl's,
1857     which defines it as the end of the previous match. In Perl,
1858     these can be different when the previously matched string
1859     was empty. Because PCRE does just one match at a time, it
1860     cannot reproduce this behaviour.
1861 nigel 41
1862 nigel 63 If all the alternatives of a pattern begin with \G, the
1863     expression is anchored to the starting match position, and
1864     the "anchored" flag is set in the compiled regular expres-
1865     sion.
1866    
1867    
1868 nigel 41 CIRCUMFLEX AND DOLLAR
1869 nigel 63
1870 nigel 41 Outside a character class, in the default matching mode, the
1871     circumflex character is an assertion which is true only if
1872     the current matching point is at the start of the subject
1873     string. If the startoffset argument of pcre_exec() is non-
1874 nigel 63 zero, circumflex can never match if the PCRE_MULTILINE
1875     option is unset. Inside a character class, circumflex has an
1876     entirely different meaning (see below).
1877 nigel 41
1878     Circumflex need not be the first character of the pattern if
1879     a number of alternatives are involved, but it should be the
1880     first thing in each alternative in which it appears if the
1881     pattern is ever to match that branch. If all possible alter-
1882     natives start with a circumflex, that is, if the pattern is
1883     constrained to match only at the start of the subject, it is
1884     said to be an "anchored" pattern. (There are also other con-
1885     structs that can cause a pattern to be anchored.)
1886    
1887     A dollar character is an assertion which is true only if the
1888     current matching point is at the end of the subject string,
1889     or immediately before a newline character that is the last
1890     character in the string (by default). Dollar need not be the
1891     last character of the pattern if a number of alternatives
1892     are involved, but it should be the last item in any branch
1893     in which it appears. Dollar has no special meaning in a
1894     character class.
1895    
1896     The meaning of dollar can be changed so that it matches only
1897     at the very end of the string, by setting the
1898 nigel 63 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1899     affect the \Z assertion.
1900 nigel 41
1901     The meanings of the circumflex and dollar characters are
1902     changed if the PCRE_MULTILINE option is set. When this is
1903     the case, they match immediately after and immediately
1904 nigel 63 before an internal newline character, respectively, in addi-
1905     tion to matching at the start and end of the subject string.
1906     For example, the pattern /^abc$/ matches the subject string
1907 nigel 41 "def\nabc" in multiline mode, but not otherwise. Conse-
1908     quently, patterns that are anchored in single line mode
1909 nigel 63 because all branches start with ^ are not anchored in multi-
1910     line mode, and a match for circumflex is possible when the
1911 nigel 41 startoffset argument of pcre_exec() is non-zero. The
1912     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1913     set.
1914    
1915     Note that the sequences \A, \Z, and \z can be used to match
1916     the start and end of the subject in both modes, and if all
1917 nigel 53 branches of a pattern start with \A it is always anchored,
1918 nigel 41 whether PCRE_MULTILINE is set or not.
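
The positions at which circumflex and dollar can match, as
described in this section, can be summarized in code. This is
an illustrative sketch with invented function names; it ignores
the startoffset argument and the PCRE_NOTBOL and PCRE_NOTEOL
options, and it assumes that a newline ending the subject does
not count as an "internal" newline for circumflex.

```c
#include <assert.h>
#include <stddef.h>

/* Can ^ match at position pos (0..len)? */
int demo_circumflex_matches(const char *s, size_t len, size_t pos,
                            int multiline)
{
    assert(s != NULL && pos <= len);
    if (pos == 0)
        return 1;                      /* start of the subject */
    /* Multiline mode: also immediately after an internal newline. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Can $ match at position pos (0..len)? */
int demo_dollar_matches(const char *s, size_t len, size_t pos,
                        int multiline, int dollar_endonly)
{
    assert(s != NULL && pos <= len);
    if (pos == len)
        return 1;                      /* very end of the subject */
    if (multiline)
        return s[pos] == '\n';         /* PCRE_DOLLAR_ENDONLY is ignored */
    if (dollar_endonly)
        return 0;                      /* only the very end is allowed */
    /* Default: immediately before a newline that is the last character. */
    return pos == len - 1 && s[pos] == '\n';
}
```

For the subject "def\nabc", circumflex can match at offset 4
only in multiline mode, which is why /^abc$/ matches there in
multiline mode but not otherwise.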
1919    
1920    
1921 nigel 63 FULL STOP (PERIOD, DOT)
1922 nigel 41
1923     Outside a character class, a dot in the pattern matches any
1924     one character in the subject, including a non-printing char-
1925 nigel 63 acter, but not (by default) newline. In UTF-8 mode, a dot
1926     matches any UTF-8 character, which might be more than one
1927     byte long, except (by default) for newline. If the
1928     PCRE_DOTALL option is set, dots match newlines as well. The
1929     handling of dot is entirely independent of the handling of
1930     circumflex and dollar, the only relationship being that they
1931     both involve newline characters. Dot has no special meaning
1932     in a character class.
1933 nigel 41
1934    
1935    
1936 nigel 63 MATCHING A SINGLE BYTE
1937    
1938     Outside a character class, the escape sequence \C matches
1939     any one byte, both in and out of UTF-8 mode. Unlike a dot,
1940     it always matches a newline. The feature is provided in Perl
1941     in order to match individual bytes in UTF-8 mode. Because
1942     it breaks up UTF-8 characters into individual bytes, what
1943     remains in the string may be a malformed UTF-8 string. For
1944     this reason it is best avoided.
1945    
1946     PCRE does not allow \C to appear in lookbehind assertions
1947     (see below), because in UTF-8 mode it makes it impossible to
1948     calculate the length of the lookbehind.
1949    
1950    
1951 nigel 41 SQUARE BRACKETS
1952 nigel 63
1953 nigel 41 An opening square bracket introduces a character class, ter-
1954     minated by a closing square bracket. A closing square
1955     bracket on its own is not special. If a closing square
1956     bracket is required as a member of the class, it should be
1957     the first data character in the class (after an initial cir-
1958     cumflex, if present) or escaped with a backslash.
1959    
1960 nigel 63 A character class matches a single character in the subject.
1961     In UTF-8 mode, the character may occupy more than one byte.
1962     A matched character must be in the set of characters defined
1963     by the class, unless the first character in the class defin-
1964     ition is a circumflex, in which case the subject character
1965     must not be in the set defined by the class. If a circumflex
1966     is actually required as a member of the class, ensure it is
1967     not the first character, or escape it with a backslash.
1968 nigel 41
1969     For example, the character class [aeiou] matches any lower
1970     case vowel, while [^aeiou] matches any character that is not
1971     a lower case vowel. Note that a circumflex is just a con-
1972     venient notation for specifying the characters which are in
1973     the class by enumerating those that are not. It is not an
1974     assertion: it still consumes a character from the subject
1975     string, and fails if the current pointer is at the end of
1976     the string.
1977    
1978 nigel 63 In UTF-8 mode, characters with values greater than 255 can
1979     be included in a class as a literal string of bytes, or by
1980     using the \x{ escaping mechanism.
1981    
1982 nigel 41 When caseless matching is set, any letters in a class
1983     represent both their upper case and lower case versions, so
1984     for example, a caseless [aeiou] matches "A" as well as "a",
1985     and a caseless [^aeiou] does not match "A", whereas a case-
1986 nigel 63 ful version would. PCRE does not support the concept of case
for characters with values greater than 255.

The newline character is never treated in any special way in
1989     character classes, whatever the setting of the PCRE_DOTALL
1990     or PCRE_MULTILINE options is. A class such as [^a] will
1991     always match a newline.
1992    
1993     The minus (hyphen) character can be used to specify a range
1994     of characters in a character class. For example, [d-m]
1995     matches any letter between d and m, inclusive. If a minus
1996     character is required in a class, it must be escaped with a
1997     backslash or appear in a position where it cannot be inter-
1998     preted as indicating a range, typically as the first or last
1999     character in the class.
2000    
2001     It is not possible to have the literal character "]" as the
2002     end character of a range. A pattern such as [W-]46] is
2003     interpreted as a class of two characters ("W" and "-") fol-
2004     lowed by a literal string "46]", so it would match "W46]" or
2005     "-46]". However, if the "]" is escaped with a backslash it
2006     is interpreted as the end of range, so [W-\]46] is inter-
2007     preted as a single class containing a range followed by two
2008     separate characters. The octal or hexadecimal representation
2009     of "]" can also be used to end a range.
2010    
2011 nigel 63 Ranges operate in the collating sequence of character
2012     values. They can also be used for characters specified
2013     numerically, for example [\000-\037]. In UTF-8 mode, ranges
2014     can include characters whose values are greater than 255,
2015     for example [\x{100}-\x{2ff}].
2016 nigel 41
2017 nigel 63 If a range that includes letters is used when caseless
2018     matching is set, it matches the letters in either case. For
example, [W-c] is equivalent to [][\\^_`wxyzabc], matched
2020     caselessly, and if character tables for the "fr" locale are
2021     in use, [\xc8-\xcb] matches accented E characters in both
2022     cases.
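
The effect of caseless matching on a range can be sketched as
a membership test. The function below is hypothetical and
handles ASCII only, using the "C" locale's case mapping rather
than PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>

/* Non-zero if ASCII character c is matched by the range lo-hi. When
   caseless is set, c also matches if its other-case form lies in the
   range. */
int demo_in_range(int c, int lo, int hi, int caseless)
{
    assert(c >= 0 && c <= 127 && lo <= hi);
    if (c >= lo && c <= hi)
        return 1;
    if (!caseless)
        return 0;
    if (islower(c))
        c = toupper(c);
    else if (isupper(c))
        c = tolower(c);
    return c >= lo && c <= hi;
}
```

For the range W-c this accepts W to Z, the punctuation
characters between Z and a, and a to c; with caseless set it
also accepts w to z and A to C.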
2023    
2024 nigel 41 The character types \d, \D, \s, \S, \w, and \W may also
2025     appear in a character class, and add the characters that
2026     they match to the class. For example, [\dABCDEF] matches any
2027     hexadecimal digit. A circumflex can conveniently be used
2028     with the upper case character types to specify a more res-
2029     tricted set of characters than the matching lower case type.
2030     For example, the class [^\W_] matches any letter or digit,
2031     but not underscore.
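As a rough illustration of the two classes mentioned above, the following sketch uses Python's re module (an assumption for the sake of a runnable example, not PCRE itself). The re.ASCII flag is set so that \w and \W cover only ASCII letters, digits, and underscore; the variable names are hypothetical.

```python
import re

# [^\W_] : everything that is a "word" character, minus the underscore.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)
# [\dABCDEF] : the digits added by \d, plus the upper case hex letters.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

kept = [c for c in "aZ9_-" if letter_or_digit.fullmatch(c)]
hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]
```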
2032    
2033     All non-alphanumeric characters other than \, -, ^ (at the
2034     start) and the terminating ] are non-special in character
2035     classes, but it does no harm if they are escaped.
2036    
2037    
2038 nigel 43 POSIX CHARACTER CLASSES
2039    
2040 nigel 63 Perl supports the POSIX notation for character classes,
2041     which uses names enclosed by [: and :] within the enclosing
2042     square brackets. PCRE also supports this notation. For exam-
2043     ple,
2044    
2045 nigel 43 [01[:alpha:]%]
2046    
2047     matches "0", "1", any alphabetic character, or "%". The sup-
2048     ported class names are
2049    
2050     alnum letters and digits
2051     alpha letters
2052     ascii character codes 0 - 127
2053 nigel 63 blank space or tab only
2054 nigel 43 cntrl control characters
2055     digit decimal digits (same as \d)
2056     graph printing characters, excluding space
2057     lower lower case letters
2058     print printing characters, including space
2059     punct printing characters, excluding letters and digits
2060 nigel 63 space white space (not quite the same as \s)
2061 nigel 43 upper upper case letters
2062     word "word" characters (same as \w)
2063     xdigit hexadecimal digits
2064    
2065 nigel 63 The "space" characters are HT (9), LF (10), VT (11), FF
2066     (12), CR (13), and space (32). Notice that this list
2067     includes the VT character (code 11). This makes "space" dif-
2068     ferent to \s, which does not include VT (for Perl compati-
2069     bility).
2070 nigel 43
2071 nigel 63 The name "word" is a Perl extension, and "blank" is a GNU
2072     extension from Perl 5.8. Another Perl extension is negation,
2073     which is indicated by a ^ character after the colon. For
2074     example,
2075    
2076 nigel 43 [12[:^digit:]]
2077    
2078     matches "1", "2", or any non-digit. PCRE (and Perl) also
2079 nigel 53 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2080 nigel 43 "collating element", but these are not supported, and an
2081     error is given if they are encountered.
2082    
2083 nigel 63 In UTF-8 mode, characters with values greater than 255 do
2084     not match any of the POSIX character classes.
2085 nigel 43
2086    
2087 nigel 41 VERTICAL BAR
2088 nigel 63
2089 nigel 41 Vertical bar characters are used to separate alternative
2090     patterns. For example, the pattern
2091    
2092     gilbert|sullivan
2093    
2094     matches either "gilbert" or "sullivan". Any number of alter-
2095     natives may appear, and an empty alternative is permitted
2096     (matching the empty string). The matching process tries
2097     each alternative in turn, from left to right, and the first
2098     one that succeeds is used. If the alternatives are within a
2099     subpattern (defined below), "succeeds" means matching the
2100     rest of the main pattern as well as the alternative in the
2101     subpattern.
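The left-to-right rule, and the wider meaning of "succeeds" inside a subpattern, can be seen in a short sketch. Python's re module is used here as an assumed stand-in for PCRE (it follows the same ordering rule), and the variable names are hypothetical.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "a" alone is rejected because "c" cannot follow it,
# so the second alternative is tried and used.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|", "") is not None
```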
2102    
2103    
2104     INTERNAL OPTION SETTING
2105    
2106 nigel 63 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2107     PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2108     within the pattern by a sequence of Perl option letters
2109     enclosed between "(?" and ")". The option letters are
2110    
2111 nigel 41 i for PCRE_CASELESS
2112     m for PCRE_MULTILINE
2113     s for PCRE_DOTALL
2114     x for PCRE_EXTENDED
2115    
2116     For example, (?im) sets caseless, multiline matching. It is
2117     also possible to unset these options by preceding the letter
2118     with a hyphen, and a combined setting and unsetting such as
2119     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2120     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2121     If a letter appears both before and after the hyphen, the
2122     option is unset.
2123    
2124 nigel 63 When an option change occurs at top level (that is, not
2125     inside subpattern parentheses), the change applies to the
2126     remainder of the pattern that follows. If the change is
2127     placed right at the start of a pattern, PCRE extracts it
2128     into the global options (and it will therefore show up in
2129     data extracted by the pcre_fullinfo() function).
2130 nigel 41
2131 nigel 63 An option change within a subpattern affects only that part
2132     of the current pattern that follows it, so
2133 nigel 41
2134     (a(?i)b)c
2135    
2136     matches abc and aBc and no other strings (assuming
2137     PCRE_CASELESS is not used). By this means, options can be
2138     made to have different settings in different parts of the
2139     pattern. Any changes made in one alternative do carry on
2140     into subsequent branches within the same subpattern. For
2141     example,
2142    
2143     (a(?i)b|c)
2144    
2145     matches "ab", "aB", "c", and "C", even though when matching
2146     "C" the first branch is abandoned before the option setting.
2147     This is because the effects of option settings happen at
2148     compile time. There would be some very weird behaviour oth-
2149     erwise.
2150    
2151     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2152     be changed in the same way as the Perl-compatible options by
2153     using the characters U and X respectively. The (?X) flag
2154     setting is special in that it must always occur earlier in
2155     the pattern than any of the additional features it turns on,
2156     even when it is at top level. It is best put at the start.
2157    
2158    
2159 nigel 63 SUBPATTERNS
2160 nigel 41
2161     Subpatterns are delimited by parentheses (round brackets),
2162     which can be nested. Marking part of a pattern as a subpat-
2163     tern does two things:
2164    
2165     1. It localizes a set of alternatives. For example, the pat-
2166     tern
2167    
2168     cat(aract|erpillar|)
2169    
2170     matches one of the words "cat", "cataract", or "caterpil-
2171     lar". Without the parentheses, it would match "cataract",
2172     "erpillar" or the empty string.
2173    
2174     2. It sets up the subpattern as a capturing subpattern (as
2175     defined above). When the whole pattern matches, that por-
2176     tion of the subject string that matched the subpattern is
2177     passed back to the caller via the ovector argument of
2178     pcre_exec(). Opening parentheses are counted from left to
2179     right (starting from 1) to obtain the numbers of the captur-
2180     ing subpatterns.
2181    
2182     For example, if the string "the red king" is matched against
2183     the pattern
2184    
2185     the ((red|white) (king|queen))
2186    
2187     the captured substrings are "red king", "red", and "king",
2188 nigel 53 and are numbered 1, 2, and 3, respectively.
2189 nigel 41
2190     The fact that plain parentheses fulfil two functions is not
2191     always helpful. There are often times when a grouping sub-
2192     pattern is required without a capturing requirement. If an
2193 nigel 63 opening parenthesis is followed by a question mark and a
2194     colon, the subpattern does not do any capturing, and is not
2195     counted when computing the number of any subsequent captur-
2196     ing subpatterns. For example, if the string "the white
2197     queen" is matched against the pattern
2198 nigel 41
2199     the ((?:red|white) (king|queen))
2200    
2201     the captured substrings are "white queen" and "queen", and
2202 nigel 63 are numbered 1 and 2. The maximum number of capturing sub-
2203     patterns is 65535, and the maximum depth of nesting of all
2204     subpatterns, both capturing and non-capturing, is 200.
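The numbering of captured substrings in the two examples above can be reproduced with the following sketch. It assumes Python's re module as the engine (not PCRE, where the substrings come back through the ovector argument of pcre_exec()); the variable names are hypothetical.

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# Tuples are ordered by opening parenthesis, starting from subpattern 1.
captured = capturing.groups()
captured_nc = non_capturing.groups()
```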
2205 nigel 41
2206     As a convenient shorthand, if any option settings are
2207     required at the start of a non-capturing subpattern, the
2208     option letters may appear between the "?" and the ":". Thus
2209     the two patterns
2210    
2211     (?i:saturday|sunday)
2212     (?:(?i)saturday|sunday)
2213    
2214     match exactly the same set of strings. Because alternative
2215     branches are tried from left to right, and options are not
2216     reset until the end of the subpattern is reached, an option
2217     setting in one branch does affect subsequent branches, so
2218     the above patterns match "SUNDAY" as well as "Saturday".
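A minimal sketch of the (?i:...) shorthand follows. It assumes Python's re module, which accepts this scoped form (Python 3.6 or later) but, in recent versions, rejects a bare (?i) in the middle of a pattern, so only the shorthand is shown. The variable names are hypothetical.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
results = {s: bool(days.fullmatch(s)) for s in ("SUNDAY", "Saturday", "monday")}

# The option applies only inside the group: "b" is caseless, "a" and "c" are not.
mixed = re.compile(r"a(?i:b)c")
mixed_hits = [s for s in ("abc", "aBc", "Abc", "abC") if mixed.fullmatch(s)]
```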
2219    
2220    
2221 nigel 63 NAMED SUBPATTERNS
2222 nigel 41
2223 nigel 63 Identifying capturing parentheses by number is simple, but
2224     it can be very hard to keep track of the numbers in compli-
2225     cated regular expressions. Furthermore, if an expression is
2226     modified, the numbers may change. To help with the diffi-
2227     culty, PCRE supports the naming of subpatterns, something
2228     that Perl does not provide. The Python syntax (?P<name>...)
2229     is used. Names consist of alphanumeric characters and under-
2230     scores, and must be unique within a pattern.
2231    
2232     Named capturing parentheses are still allocated numbers as
2233     well as names. The PCRE API provides function calls for
2234     extracting the name-to-number translation table from a com-
2235     piled pattern. For further details see the pcreapi documen-
2236     tation.
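Because the syntax is borrowed from Python, the idea can be sketched directly with Python's re module. This is an illustration only: the groupindex attribute is Python's way of exposing the name-to-number table and is not the PCRE API, and the pattern and names are hypothetical.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2007-02")

# Named subpatterns keep their numbers; the table maps each name to its number.
name_to_number = dict(date.groupindex)
by_name = m.group("year")
by_number = m.group(1)
```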
2237    
2238    
2239 nigel 41 REPETITION
2240 nigel 63
2241 nigel 41 Repetition is specified by quantifiers, which can follow any
2242     of the following items:
2243    
2244 nigel 63 a literal data character
2245 nigel 41 the . metacharacter
2246 nigel 63 the \C escape sequence
2247     escapes such as \d that match single characters
2248 nigel 41 a character class
2249     a back reference (see next section)
2250 nigel 63 a parenthesized subpattern (unless it is an assertion)
2251 nigel 41
2252     The general repetition quantifier specifies a minimum and
2253     maximum number of permitted matches, by giving the two
2254     numbers in curly brackets (braces), separated by a comma.
2255     The numbers must be less than 65536, and the first must be
2256     less than or equal to the second. For example:
2257    
2258     z{2,4}
2259    
2260     matches "zz", "zzz", or "zzzz". A closing brace on its own
2261     is not a special character. If the second number is omitted,
2262     but the comma is present, there is no upper limit; if the
2263     second number and the comma are both omitted, the quantifier
2264     specifies an exact number of required matches. Thus
2265    
2266     [aeiou]{3,}
2267    
2268     matches at least 3 successive vowels, but may match many
2269     more, while
2270    
2271     \d{8}
2272    
2273     matches exactly 8 digits. An opening curly bracket that
2274     appears in a position where a quantifier is not allowed, or
2275     one that does not match the syntax of a quantifier, is taken
2276     as a literal character. For example, {,6} is not a quantif-
2277     ier, but a literal string of four characters.
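The three quantifier examples above can be tried with the sketch below, which assumes Python's re module as the engine. The {,6} case is deliberately left out because Python, unlike PCRE, treats it as a quantifier. The helper name matched_prefix is hypothetical.

```python
import re

def matched_prefix(pattern, subject):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

at_most_four = matched_prefix(r"z{2,4}", "zzzzz")
too_few = matched_prefix(r"z{2,4}", "z")
vowels = matched_prefix(r"[aeiou]{3,}", "aeioux")
eight_digits = matched_prefix(r"\d{8}", "123456789")
```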
2278 nigel 63
2279     In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2280     than to individual bytes. Thus, for example, \x{100}{2}
2281     matches two UTF-8 characters, each of which is represented
2282     by a two-byte sequence.
2283    
2284 nigel 41 The quantifier {0} is permitted, causing the expression to
2285     behave as if the previous item and the quantifier were not
2286     present.
2287    
2288     For convenience (and historical compatibility) the three
2289     most common quantifiers have single-character abbreviations:
2290    
2291     * is equivalent to {0,}
2292     + is equivalent to {1,}
2293     ? is equivalent to {0,1}
2294    
2295     It is possible to construct infinite loops by following a
2296     subpattern that can match no characters with a quantifier
2297     that has no upper limit, for example:
2298    
2299     (a?)*
2300    
2301     Earlier versions of Perl and PCRE used to give an error at
2302     compile time for such patterns. However, because there are
2303     cases where this can be useful, such patterns are now
2304     accepted, but if any repetition of the subpattern does in
2305     fact match no characters, the loop is forcibly broken.
2306    
2307     By default, the quantifiers are "greedy", that is, they
2308     match as much as possible (up to the maximum number of per-
2309     mitted times), without causing the rest of the pattern to
2310     fail. The classic example of where this gives problems is in
2311     trying to match comments in C programs. These appear between
2312     the sequences /* and */ and within the sequence, individual
2313     * and / characters may appear. An attempt to match C com-
2314     ments by applying the pattern
2315    
2316     /\*.*\*/
2317    
2318     to the string
2319    
2320     /* first comment */ not comment /* second comment */
2321    
2322 nigel 51 fails, because it matches the entire string owing to the
2323 nigel 41 greediness of the .* item.
2324    
2325 nigel 47 However, if a quantifier is followed by a question mark, it
2326     ceases to be greedy, and instead matches the minimum number
2327     of times possible, so the pattern
2328 nigel 41
2329     /\*.*?\*/
2330    
2331     does the right thing with the C comments. The meaning of the
2332     various quantifiers is not otherwise changed, just the pre-
2333     ferred number of matches. Do not confuse this use of ques-
2334     tion mark with its use as a quantifier in its own right.
2335     Because it has two uses, it can sometimes appear doubled, as
2336     in
2337    
2338     \d??\d
2339    
2340     which matches one digit by preference, but can match two if
2341     that is the only way the rest of the pattern matches.
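The difference between the greedy and minimizing forms can be seen in a short sketch. It assumes Python's re module as the engine, which handles these particular patterns in the same way; the variable names are hypothetical.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

greedy = re.search(r"/\*.*\*/", subject).group(0)   # runs to the last "*/"
lazy = re.search(r"/\*.*?\*/", subject).group(0)    # stops at the first "*/"

# \d??\d prefers one digit, but takes two when the whole string must match.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```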
2342    
2343     If the PCRE_UNGREEDY option is set (an option which is not
2344 nigel 47 available in Perl), the quantifiers are not greedy by
2345 nigel 41 default, but individual ones can be made greedy by following
2346     them with a question mark. In other words, it inverts the
2347     default behaviour.
2348    
2349     When a parenthesized subpattern is quantified with a minimum
2350     repeat count that is greater than 1 or with a limited max-
2351     imum, more store is required for the compiled pattern, in
2352     proportion to the size of the minimum or maximum.

2353     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2354     option (equivalent to Perl's /s) is set, thus allowing the .
2355 nigel 47 to match newlines, the pattern is implicitly anchored,
2356 nigel 41 because whatever follows will be tried against every charac-
2357     ter position in the subject string, so there is no point in
2358     retrying the overall match at any position after the first.
2359 nigel 63 PCRE normally treats such a pattern as though it were pre-
2360     ceded by \A.
2361 nigel 41
2362 nigel 63 In cases where it is known that the subject string contains
2363     no newlines, it is worth setting PCRE_DOTALL in order to
2364     obtain this optimization, or alternatively using ^ to indi-
2365     cate anchoring explicitly.
2366    
2367     However, there is one situation where the optimization can-
2368     not be used. When .* is inside capturing parentheses that
2369     are the subject of a backreference elsewhere in the pattern,
2370     a match at the start may fail, and a later one succeed. Con-
2371     sider, for example:
2372    
2373     (.*)abc\1
2374    
2375     If the subject is "xyz123abc123" the match point is the
2376     fourth character. For this reason, such a pattern is not
2377     implicitly anchored.
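The match point in this example can be confirmed with a small sketch, assuming Python's re module as the engine (it does not document the same anchoring optimization, but the matching result is the same). Offsets are counted from zero, so the fourth character is offset 3. The variable names are hypothetical.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()   # zero-based offset of the match point
captured = m.group(1)      # the text that \1 had to repeat
```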
2378    
2379 nigel 41 When a capturing subpattern is repeated, the value captured
2380     is the substring that matched the final iteration. For exam-
2381     ple, after
2382    
2383     (tweedle[dume]{3}\s*)+
2384    
2385     has matched "tweedledum tweedledee" the value of the cap-
2386     tured substring is "tweedledee". However, if there are
2387     nested capturing subpatterns, the corresponding captured
2388     values may have been set in previous iterations. For exam-
2389     ple, after
2390    
2391     /(a|(b))+/
2392    
2393     matches "aba" the value of the second captured substring is
2394     "b".
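Both cases can be observed in the following sketch, which assumes Python's re module as the engine; for these two patterns it reports the same captured values as described above. The variable names are hypothetical.

```python
import re

# Only the final iteration of the repeated group is kept.
last = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee").group(1)

# The inner group keeps the value set in the second iteration ("b"),
# although the third and final iteration matched "a".
nested = re.match(r"(a|(b))+", "aba").groups()
```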
2395    
2396    
2397 nigel 63 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2398 nigel 41
2399 nigel 63 With both maximizing and minimizing repetition, failure of
2400     what follows normally causes the repeated item to be re-
2401     evaluated to see if a different number of repeats allows the
2402     rest of the pattern to match. Sometimes it is useful to
2403     prevent this, either to change the nature of the match, or
2404     to cause it to fail earlier than it otherwise might, when the
2405     author of the pattern knows there is no point in carrying
2406     on.
2407 nigel 53
2408 nigel 63 Consider, for example, the pattern \d+foo when applied to
2409     the subject line
2410 nigel 53
2411 nigel 63 123456bar
2412 nigel 53
2413 nigel 63 After matching all 6 digits and then failing to match "foo",
2414     the normal action of the matcher is to try again with only 5
2415     digits matching the \d+ item, and then with 4, and so on,
2416     before ultimately failing. "Atomic grouping" (a term taken
2417     from Jeffrey Friedl's book) provides the means for specify-
2418     ing that once a subpattern has matched, it is not to be re-
2419     evaluated in this way.
2420 nigel 53
2421 nigel 63 If we use atomic grouping for the previous example, the
2422     matcher would give up immediately on failing to match "foo"
2423     the first time. The notation is a kind of special
2424     parenthesis, starting with (?> as in this example:
2425 nigel 53
2426 nigel 63 (?>\d+)foo
2427 nigel 53
2428 nigel 63 This kind of parenthesis "locks up" the part of the pattern
2429     it contains once it has matched, and a failure further into
2430     the pattern is prevented from backtracking into it. Back-
2431     tracking past it to previous items, however, works as nor-
2432     mal.
2433 nigel 53
2434 nigel 63 An alternative description is that a subpattern of this type
2435     matches the string of characters that an identical stan-
2436     dalone pattern would match, if anchored at the current point
2437     in the subject string.
2438    
2439     Atomic grouping subpatterns are not capturing subpatterns.
2440     Simple cases such as the above example can be thought of as
2441     a maximizing repeat that must swallow everything it can. So,
2442     while both \d+ and \d+? are prepared to adjust the number of
2443     digits they match in order to make the rest of the pattern
2444     match, (?>\d+) can only match an entire sequence of digits.
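The effect of "locking up" a repeat can be sketched as follows. The example assumes Python's re module, where (?>...) is available only from Python 3.11 onwards, so the atomic group is emulated here by the well-known combination of a lookahead and a back reference: the lookahead captures the whole run of digits and \1 then consumes it as a fixed string. The patterns and names are hypothetical illustrations.

```python
import re

# Ordinary repeat: \d+ gives back the final "5" so the literal 5 can match.
backtracking = re.compile(r"\d+5")

# Emulated (?>\d+)5. The (?:5) keeps the digit apart from the back reference.
atomic = re.compile(r"(?=(\d+))\1(?:5)")

# Emulated (?>\d+)foo still matches when nothing has to be given back.
atomic_foo = re.compile(r"(?=(\d+))\1foo")

plain_result = backtracking.match("12345").group(0)
atomic_result = atomic.match("12345")
foo_result = atomic_foo.match("123foo").group(0)
```

The atomic form cannot match "12345" because the digit run always swallows the final "5", leaving nothing for the literal that follows.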
2445    
2446     Atomic groups in general can of course contain arbitrarily
2447     complicated subpatterns, and can be nested. However, when
2448     the subpattern for an atomic group is just a single repeated
2449     item, as in the example above, a simpler notation, called a
2450     "possessive quantifier" can be used. This consists of an
2451     additional + character following a quantifier. Using this
2452     notation, the previous example can be rewritten as
2453    
2454     \d++foo
2455    
2456     Possessive quantifiers are always greedy; the setting of the
2457     PCRE_UNGREEDY option is ignored. They are a convenient nota-
2458     tion for the simpler forms of atomic group. However, there
2459     is no difference in the meaning or processing of a posses-
2460     sive quantifier and the equivalent atomic group.
2461    
2462     The possessive quantifier syntax is an extension to the Perl
2463     syntax. It originates in Sun's Java package.
2464    
2465     When a pattern contains an unlimited repeat inside a subpat-
2466     tern that can itself be repeated an unlimited number of
2467     times, the use of an atomic group is the only way to avoid
2468     some failing matches taking a very long time indeed. The
2469     pattern
2470    
2471     (\D+|<\d+>)*[!?]
2472    
2473     matches an unlimited number of substrings that either con-
2474     sist of non-digits, or digits enclosed in <>, followed by
2475     either ! or ?. When it matches, it runs quickly. However, if
2476     it is applied to
2477    
2478     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2479    
2480     it takes a long time before reporting failure. This is
2481     because the string can be divided between the two repeats in
2482     a large number of ways, and all have to be tried. (The exam-
2483     ple used [!?] rather than a single character at the end,
2484     because both PCRE and Perl have an optimization that allows
2485     for fast failure when a single character is used. They
2486     remember the last single character that is required for a
2487     match, and fail early if it is not present in the string.)
2488     If the pattern is changed to
2489    
2490     ((?>\D+)|<\d+>)*[!?]
2491    
2492     sequences of non-digits cannot be broken, and failure hap-
2493     pens quickly.
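A sketch of the fast-failing form is given below. It assumes Python's re module and, so that it also runs on versions before 3.11 (which lack (?>...)), emulates the atomic group with a lookahead plus a back reference. The unprotected pattern is deliberately not run, since it may take an impractically long time on this subject. The names are hypothetical.

```python
import re

# Emulation of ((?>\D+)|<\d+>)*[!?] : each run of non-digits is taken whole.
atomic_like = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

subject = "a" * 52
fails_fast = atomic_like.match(subject) is None

# The alternation itself can still be retried, so this subject matches.
still_matches = atomic_like.match("<123>!").group(0)
```

Note that making \D+ atomic also changes what can match: a run of non-digits now swallows a trailing "!" or "?" as well, so a subject such as "abc!" no longer matches.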
2494    
2495    
2496     BACK REFERENCES
2497    
2498     Outside a character class, a backslash followed by a digit
2499     greater than 0 (and possibly further digits) is a back
2500     reference to a capturing subpattern earlier (that is, to its
2501 nigel 41 left) in the pattern, provided there have been that many
2502     previous capturing left parentheses.
2503    
2504     However, if the decimal number following the backslash is
2505     less than 10, it is always taken as a back reference, and
2506     causes an error only if there are not that many capturing
2507     left parentheses in the entire pattern. In other words, the
2508     parentheses that are referenced need not be to the left of
2509     the reference for numbers less than 10. See the section
2510     entitled "Backslash" above for further details of the han-
2511     dling of digits following a backslash.
2512    
2513     A back reference matches whatever actually matched the cap-
2514     turing subpattern in the current subject string, rather than
2515 nigel 63 anything matching the subpattern itself (see "Subpatterns as
2516     subroutines" below for a way of doing that). So the pattern
2517 nigel 41
2518     (sens|respons)e and \1ibility
2519    
2520     matches "sense and sensibility" and "response and responsi-
2521     bility", but not "sense and responsibility". If caseful
2522 nigel 47 matching is in force at the time of the back reference, the
2523     case of letters is relevant. For example,
2524 nigel 41
2525     ((?i)rah)\s+\1
2526    
2527     matches "rah rah" and "RAH RAH", but not "RAH rah", even
2528     though the original capturing subpattern is matched case-
2529     lessly.
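Both back reference examples can be tried with the sketch below, which assumes Python's re module as the engine. Recent Python versions reject (?i) in the middle of a pattern, so the second example is written with the scoped form (?i:rah), which should be equivalent here. The variable names are hypothetical.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
pair_hits = [s for s in ("sense and sensibility",
                         "response and responsibility",
                         "sense and responsibility") if pair.fullmatch(s)]

# The group is matched caselessly, but \1 is compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]
```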
2530    
2531 nigel 63 Back references to named subpatterns use the Python syntax
2532     (?P=name). We could rewrite the above example as follows:
2533    
2534     (?P<p1>(?i)rah)\s+(?P=p1)
2535    
2536 nigel 41 There may be more than one back reference to the same sub-
2537     pattern. If a subpattern has not actually been used in a
2538 nigel 47 particular match, any back references to it always fail. For
2539     example, the pattern
2540 nigel 41
2541     (a|(bc))\2
2542    
2543     always fails if it starts to match "a" rather than "bc".
2544 nigel 63 Because there may be many capturing parentheses in a pat-
2545     tern, all digits following the backslash are taken as part
2546     of a potential back reference number. If the pattern contin-
2547     ues with a digit character, some delimiter must be used to
2548     terminate the back reference. If the PCRE_EXTENDED option is
2549     set, this can be whitespace. Otherwise an empty comment can
2550     be used.
2551 nigel 41
2552     A back reference that occurs inside the parentheses to which
2553     it refers fails when the subpattern is first used, so, for
2554     example, (a\1) never matches. However, such references can
2555 nigel 49 be useful inside repeated subpatterns. For example, the pat-
2556     tern
2557 nigel 41
2558     (a|b\1)+
2559    
2560 nigel 49 matches any number of "a"s and also "aba", "ababbaa" etc. At
2561 nigel 41 each iteration of the subpattern, the back reference matches
2562 nigel 53 the character string corresponding to the previous itera-
2563     tion. In order for this to work, the pattern must be such
2564     that the first iteration does not need to match the back
2565     reference. This can be done using alternation, as in the
2566     example above, or by a quantifier with a minimum of zero.
2567 nigel 41
2568    
2569 nigel 63 ASSERTIONS
2570 nigel 41
2571     An assertion is a test on the characters following or
2572     preceding the current matching point that does not actually
2573     consume any characters. The simple assertions coded as \b,
2574 nigel 63 \B, \A, \G, \Z, \z, ^ and $ are described above. More com-
2575     plicated assertions are coded as subpatterns. There are two
2576 nigel 41 kinds: those that look ahead of the current position in the
2577     subject string, and those that look behind it.
2578 nigel 43
2579 nigel 41 An assertion subpattern is matched in the normal way, except
2580     that it does not cause the current matching position to be
2581     changed. Lookahead assertions start with (?= for positive
2582     assertions and (?! for negative assertions. For example,
2583    
2584     \w+(?=;)
2585    
2586     matches a word followed by a semicolon, but does not include
2587     the semicolon in the match, and
2588    
2589     foo(?!bar)
2590    
2591     matches any occurrence of "foo" that is not followed by
2592     "bar". Note that the apparently similar pattern
2593    
2594     (?!foo)bar
2595    
2596     does not find an occurrence of "bar" that is preceded by
2597     something other than "foo"; it finds any occurrence of "bar"
2598     whatsoever, because the assertion (?!foo) is always true
2599     when the next three characters are "bar". A lookbehind
2600     assertion is needed to achieve this effect.
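The three lookahead patterns above behave as follows in a small sketch that assumes Python's re module as the engine. The subject strings and variable names are hypothetical.

```python
import re

# The semicolon is tested but not included in the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha beta; gamma").group(0)

# Only the "foo" that is not followed by "bar" is reported (by offset).
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]
```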
2601    
2602 nigel 63 If you want to force a matching failure at some point in a
2603     pattern, the most convenient way to do it is with (?!)
2604     because an empty string always matches, so an assertion that
2605     requires there not to be an empty string must always fail.
2606    
2607 nigel 41 Lookbehind assertions start with (?<= for positive asser-
2608     tions and (?<! for negative assertions. For example,
2609    
2610     (?<!foo)bar
2611    
2612     does find an occurrence of "bar" that is not preceded by
2613     "foo". The contents of a lookbehind assertion are restricted
2614     such that all the strings it matches must have a fixed
2615     length. However, if there are several alternatives, they do
2616     not all have to have the same fixed length. Thus
2617    
2618     (?<=bullock|donkey)
2619    
2620     is permitted, but
2621    
2622     (?<!dogs?|cats?)
2623    
2624     causes an error at compile time. Branches that match dif-
2625     ferent length strings are permitted only at the top level of
2626     a lookbehind assertion. This is an extension compared with
2627 nigel 63 Perl (at least for 5.8), which requires all branches to
2628     match the same length of string. An assertion such as
2629 nigel 41
2630     (?<=ab(c|de))
2631    
2632     is not permitted, because its single top-level branch can
2633     match two different lengths, but it is acceptable if rewrit-
2634     ten to use two top-level branches:
2635    
2636     (?<=abc|abde)
2637    
2638     The implementation of lookbehind assertions is, for each
2639     alternative, to temporarily move the current position back
2640     by the fixed width and then try to match. If there are
2641     insufficient characters before the current position, the
2642 nigel 63 match is deemed to fail.
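A minimal lookbehind sketch follows, assuming Python's re module. That engine is stricter than PCRE: it requires every alternative in a lookbehind to have the same width, so (?<=bullock|donkey) is rejected there, and the sketch splits it into two assertions instead. The " cart" suffix and the variable names are hypothetical.

```python
import re

# Only the "bar" that is not preceded by "foo" is reported (by offset).
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xbar")]

# Two fixed-width lookbehinds standing in for (?<=bullock|donkey).
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")
animal_hits = [s for s in ("bullock cart", "donkey cart", "horse cart")
               if after_animal.search(s)]

# Too few characters before the current position: the assertion fails.
too_short = re.search(r"(?<=abc)d", "d")
```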
2643 nigel 41
2644 nigel 63 PCRE does not allow the \C escape (which matches a single
2645     byte in UTF-8 mode) to appear in lookbehind assertions,
2646     because it makes it impossible to calculate the length of
2647     the lookbehind.
2648    
2649     Atomic groups can be used in conjunction with lookbehind
2650     assertions to specify efficient matching at the end of the
2651     subject string. Consider a simple pattern such as
2652    
2653     abcd$
2654    
2655     when applied to a long string that does not match. Because
2656     matching proceeds from left to right, PCRE will look for
2657     each "a" in the subject and then see if what follows matches
2658     the rest of the pattern. If the pattern is specified as
2659    
2660     ^.*abcd$
2661    
2662     the initial .* matches the entire string at first, but when
2663     this fails (because there is no following "a"), it back-
2664     tracks to match all but the last character, then all but the
2665     last two characters, and so on. Once again the search for
2666     "a" covers the entire string, from right to left, so we are
2667     no better off. However, if the pattern is written as
2668    
2669     ^(?>.*)(?<=abcd)
2670    
2671     or, equivalently,
2672    
2673     ^.*+(?<=abcd)
2674    
2675     there can be no backtracking for the .* item; it can match
2676     only the entire string. The subsequent lookbehind assertion
2677     does a single test on the last four characters. If it fails,
2678     the match fails immediately. For long strings, this approach
2679     makes a significant difference to the processing time.
2680    
2681 nigel 41 Several assertions (of any sort) may occur in succession.
2682     For example,
2683    
2684     (?<=\d{3})(?<!999)foo
2685    
2686     matches "foo" preceded by three digits that are not "999".
2687     Notice that each of the assertions is applied independently
2688     at the same point in the subject string. First there is a
2689 nigel 47 check that the previous three characters are all digits, and
2690 nigel 41 then there is a check that the same three characters are not
2691     "999". This pattern does not match "foo" preceded by six
2692     characters, the first three of which are digits and the last three
2693     of which are not "999". For example, it doesn't match
2694     "123abcfoo". A pattern to do that is
2695    
2696     (?<=\d{3}...)(?<!999)foo
2697    
2698     This time the first assertion looks at the preceding six
2699     characters, checking that the first three are digits, and
2700     then the second assertion checks that the preceding three
2701     characters are not "999".
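The difference between the two patterns can be checked with the sketch below, which assumes Python's re module as the engine (both lookbehinds are fixed-width, so it accepts them). The subject strings and variable names are hypothetical.

```python
import re

# Both assertions inspect the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
three_hits = [s for s in subjects if three_back.search(s)]
six_hits = [s for s in subjects if six_back.search(s)]
```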

     Assertions can be nested in any combination. For example,

       (?<=(?<!foo)bar)baz

     matches an occurrence of "baz" that is preceded by "bar"
     which in turn is not preceded by "foo", while

       (?<=\d{3}(?!999)...)foo

     is another pattern which matches "foo" preceded by three
     digits and any three characters that are not "999".

     Assertion subpatterns are not capturing subpatterns, and may
     not be repeated, because it makes no sense to assert the
     same thing several times. If any kind of assertion contains
     capturing subpatterns within it, these are counted for the
     purposes of numbering the capturing subpatterns in the whole
     pattern. However, substring capturing is carried out only
     for positive assertions, because it does not make sense for
     negative assertions.


CONDITIONAL SUBPATTERNS

     It is possible to cause the matching process to obey a sub-
     pattern conditionally or to choose between two alternative
     subpatterns, depending on the result of an assertion, or
     whether a previous capturing subpattern matched or not. The
     two possible forms of conditional subpattern are

       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)

     If the condition is satisfied, the yes-pattern is used; oth-
     erwise the no-pattern (if present) is used. If there are
     more than two alternatives in the subpattern, a compile-time
     error occurs.

     There are three kinds of condition. If the text between the
     parentheses consists of a sequence of digits, the condition
     is satisfied if the capturing subpattern of that number has
     previously matched. The number must be greater than zero.
     Consider the following pattern, which contains non-
     significant white space to make it more readable (assume the
     PCRE_EXTENDED option) and to divide it into three parts for
     ease of discussion:

       ( \( )?    [^()]+    (?(1) \) )

     The first part matches an optional opening parenthesis, and
     if that character is present, sets it as the first captured
     substring. The second part matches one or more characters
     that are not parentheses. The third part is a conditional
     subpattern that tests whether the first set of parentheses
     matched or not. If they did, that is, if the subject started
     with an opening parenthesis, the condition is true, and so
     the yes-pattern is executed and a closing parenthesis is
     required. Otherwise, since no-pattern is not present, the
     subpattern matches nothing. In other words, this pattern
     matches a sequence of non-parentheses, optionally enclosed
     in parentheses.
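
     The three parts of this pattern translate almost line for
     line into ordinary code. The C sketch below is a hand-written
     counterpart and not PCRE's implementation. The function name
     match_optional_parens() is invented for the illustration,
     and the match is tried only at the start of the subject,
     whereas PCRE would also try later starting points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written counterpart of  ( \( )?  [^()]+  (?(1) \) )
   tried at the start of s only. Returns the length of the match, or 0
   if there is none (a real match is never empty, because [^()]+ needs
   at least one character). */
size_t match_optional_parens(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    int group1_matched = 0;

    if (*p == '(') {                  /* ( \( )?  optional, capturing */
        group1_matched = 1;
        p++;
    }

    size_t run = strcspn(p, "()");    /* [^()]+  one or more non-parens */
    if (run == 0)
        return 0;
    p += run;

    if (group1_matched) {             /* (?(1) \) )  yes-pattern only */
        if (*p != ')')
            return 0;
        p++;
    }                                 /* no no-pattern: match nothing */

    return (size_t)(p - s);
}
```

     Under these assumptions "(abc)" gives a length of 5 and
     "abc)" a length of 3, because the closing parenthesis is
     required only when the opening one was captured.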

     If the condition is the string (R), it is satisfied if a
     recursive call to the pattern or subpattern has been made.
     At "top level", the condition is false. This is a PCRE
     extension. Recursive patterns are described below.

     If the condition is not a sequence of digits or (R), it must
     be an assertion. This may be a positive or negative looka-
     head or lookbehind assertion. Consider this pattern, again
     containing non-significant white space, and with the two
     alternatives on the second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

     The condition is a positive lookahead assertion that matches
     an optional sequence of non-letters followed by a letter. In
     other words, it tests for the presence of at least one
     letter in the subject. If a letter is found, the subject is
     matched against the first alternative; otherwise it is
     matched against the second. This pattern matches strings in
     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
     letters and dd are digits.


COMMENTS

     The sequence (?# marks the start of a comment which contin-
     ues up to the next closing parenthesis. Nested parentheses
     are not permitted. The characters that make up a comment
     play no part in the pattern matching at all.

     If the PCRE_EXTENDED option is set, an unescaped # character
     outside a character class introduces a comment that contin-
     ues up to the next newline character in the pattern.


RECURSIVE PATTERNS

     Consider the problem of matching a string in parentheses,
     allowing for unlimited nested parentheses. Without the use
     of recursion, the best that can be done is to use a pattern
     that matches up to some fixed depth of nesting. It is not
     possible to handle an arbitrary nesting depth. Perl has pro-
     vided an experimental facility that allows regular expres-
     sions to recurse (amongst other things). It does this by
     interpolating Perl code in the expression at run time, and
     the code can refer to the expression itself. A Perl pattern
     to solve the parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

     The (?p{...}) item interpolates Perl code at run time, and
     in this case refers recursively to the pattern in which it
     appears. Obviously, PCRE cannot support the interpolation of
     Perl code. Instead, it supports some special syntax for
     recursion of the entire pattern, and also for individual
     subpattern recursion.

     The special item that consists of (? followed by a number
     greater than zero and a closing parenthesis is a recursive
     call of the subpattern of the given number, provided that it
     occurs inside that subpattern. (If not, it is a "subroutine"
     call, which is described in the next section.) The special
     item (?R) is a recursive call of the entire regular expres-
     sion.

     For example, this PCRE pattern solves the nested parentheses
     problem (assume the PCRE_EXTENDED option is set so that
     white space is ignored):

       \( ( (?>[^()]+) | (?R) )* \)

     First it matches an opening parenthesis. Then it matches any
     number of substrings which can either be a sequence of non-
     parentheses, or a recursive match of the pattern itself
     (that is, a correctly parenthesized substring). Finally
     there is a closing parenthesis.
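
     Read as a procedure, the pattern is a small recursive-descent
     matcher. The C sketch below is a hand-written counterpart
     intended only to show what the recursion does. It is not
     PCRE code, and the name match_parens() is invented for the
     illustration. The match is tried at the start of the subject
     only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written counterpart of  \( ( (?>[^()]+) | (?R) )* \)
   Returns a pointer just past the matched text, or NULL if s does not
   start with a correctly parenthesized substring. */
const char *match_parens(const char *s)
{
    assert(s != NULL);
    if (*s != '(')                       /* \(  opening parenthesis */
        return NULL;
    s++;
    for (;;) {                           /* ( ... | ... )*  */
        size_t run = strcspn(s, "()");   /* (?>[^()]+)  never given back */
        if (run > 0) {
            s += run;
        } else if (*s == '(') {          /* (?R)  the whole pattern again */
            s = match_parens(s);
            if (s == NULL)
                return NULL;
        } else {
            break;                       /* neither alternative applies */
        }
    }
    return (*s == ')') ? s + 1 : NULL;   /* \)  closing parenthesis */
}
```

     Called on "(ab(cd)ef)" the function consumes all ten
     characters; called on "(a(b)" it returns NULL, because the
     outer parenthesis is never closed.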

     If this were part of a larger pattern, you would not want to
     recurse the entire pattern, so instead you could use this:

       ( \( ( (?>[^()]+) | (?1) )* \) )

     We have put the pattern into parentheses, and caused the
     recursion to refer to them instead of the whole pattern. In
     a larger pattern, keeping track of parenthesis numbers can
     be tricky. It may be more convenient to use named
     parentheses instead. For this, PCRE uses (?P>name), which is
     an extension to the Python syntax that PCRE uses for named
     parentheses (Perl does not provide named parentheses). We
     could rewrite the above example as follows:

       (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

     This particular example pattern contains nested unlimited
     repeats, and so the use of atomic grouping for matching
     strings of non-parentheses is important when applying the
     pattern to strings that do not match. For example, when this
     pattern is applied to

       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

     it yields "no match" quickly. However, if atomic grouping is
     not used, the match runs for a very long time indeed because
     there are so many different ways the + and * repeats can
     carve up the subject, and all have to be tested before
     failure can be reported.

     At the end of a match, the values set for any capturing sub-
     patterns are those from the outermost level of the recursion
     at which the subpattern value is set. If you want to obtain
     intermediate values, a callout function can be used (see
     below and the pcrecallout documentation). If the pattern
     above is matched against

       (ab(cd)ef)

     the value for the capturing parentheses is "ef", which is
     the last value taken on at the top level. If additional
     parentheses are added, giving

       \( ( ( (?>[^()]+) | (?R) )* ) \)
          ^                        ^

     the string they capture is "ab(cd)ef", the contents of the
     top level parentheses. If there are more than 15 capturing
     parentheses in a pattern, PCRE has to obtain extra memory to
     store data during a recursion, which it does by using
     pcre_malloc, freeing it via pcre_free afterwards. If no
     memory can be obtained, the match fails with the
     PCRE_ERROR_NOMEMORY error.

     Do not confuse the (?R) item with the condition (R), which
     tests for recursion. Consider this pattern, which matches
     text in angle brackets, allowing for arbitrary nesting. Only
     digits are allowed in nested brackets (that is, when recurs-
     ing), whereas any characters are permitted at the outer
     level.

       < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

     In this pattern, (?(R) is the start of a conditional subpat-
     tern, with two different alternatives for the recursive and
     non-recursive cases. The (?R) item is the actual recursive
     call.


SUBPATTERNS AS SUBROUTINES

     If the syntax for a recursive subpattern reference (either
     by number or by name) is used outside the parentheses to
     which it refers, it operates like a subroutine in a program-
     ming language. An earlier example pointed out that the pat-
     tern

       (sens|respons)e and \1ibility

     matches "sense and sensibility" and "response and responsi-
     bility", but not "sense and responsibility". If instead the
     pattern

       (sens|respons)e and (?1)ibility

     is used, it does match "sense and responsibility" as well as
     the other two strings. Such references must, however, follow
     the subpattern to which they refer.
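
     The distinction is that a back reference repeats the text
     that was captured, while a subroutine call runs the
     subpattern again. The C sketch below spells out both by
     hand. It is an illustration rather than PCRE code, the
     function names are invented, and for simplicity it requires
     the whole subject to match, which the patterns above do not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Subpattern 1, (sens|respons): length matched at s, or 0 if neither
   alternative fits. */
static size_t group1(const char *s)
{
    if (strncmp(s, "sens", 4) == 0)
        return 4;
    if (strncmp(s, "respons", 7) == 0)
        return 7;
    return 0;
}

/* (sens|respons)e and \1ibility
   The back reference must repeat exactly what group 1 captured. */
int matches_with_backref(const char *s)
{
    assert(s != NULL);
    size_t n = group1(s);
    if (n == 0)
        return 0;
    const char *p = s + n;
    if (strncmp(p, "e and ", 6) != 0)
        return 0;
    p += 6;
    if (strncmp(p, s, n) != 0)        /* \1: the same characters again */
        return 0;
    p += n;
    return strcmp(p, "ibility") == 0;
}

/* (sens|respons)e and (?1)ibility
   The subroutine call tries both alternatives afresh. */
int matches_with_subroutine(const char *s)
{
    assert(s != NULL);
    size_t n = group1(s);
    if (n == 0)
        return 0;
    const char *p = s + n;
    if (strncmp(p, "e and ", 6) != 0)
        return 0;
    p += 6;
    size_t m = group1(p);             /* (?1): run the subpattern again */
    if (m == 0)
        return 0;
    p += m;
    return strcmp(p, "ibility") == 0;
}
```

     Only matches_with_subroutine() accepts "sense and
     responsibility"; both functions accept the other two
     strings.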


CALLOUTS

     Perl has a feature whereby using the sequence (?{...})
     causes arbitrary Perl code to be obeyed in the middle of
     matching a regular expression. This makes it possible,
     amongst other things, to extract different substrings that
     match the same pair of parentheses when there is a repeti-
     tion.

     PCRE provides a similar feature, but of course it cannot
     obey arbitrary Perl code. The feature is called "callout".
     The caller of PCRE provides an external function by putting
     its entry point in the global variable pcre_callout. By
     default, this variable contains NULL, which disables all
     calling out.

     Within a regular expression, (?C) indicates the points at
     which the external function is to be called. If you want to
     identify different callout points, you can put a number less
     than 256 after the letter C. The default value is zero. For
     example, this pattern has two callout points:

       (?C1)9abc(?C2)def

     During matching, when PCRE reaches a callout point (and
     pcre_callout is set), the external function is called. It is
     provided with the number of the callout, and, optionally,
     one item of data originally supplied by the caller of
     pcre_exec(). The callout function may cause matching to
     backtrack, or to fail altogether. A complete description of
     the interface to the callout function is given in the
     pcrecallout documentation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions


PCRE PERFORMANCE

     Certain items that may appear in regular expression patterns
     are more efficient than others. It is more efficient to use
     a character class like [aeiou] than a set of alternatives
     such as (a|e|i|o|u). In general, the simplest construction
     that provides the required behaviour is usually the most
     efficient. Jeffrey Friedl's book contains a lot of discus-
     sion about optimizing regular expressions for efficient per-
     formance.

     When a pattern begins with .* not in parentheses, or in
     parentheses that are not the subject of a backreference, and
     the PCRE_DOTALL option is set, the pattern is implicitly
     anchored by PCRE, since it can match only at the start of a
     subject string. However, if PCRE_DOTALL is not set, PCRE
     cannot make this optimization, because the . metacharacter
     does not then match a newline, and if the subject string
     contains newlines, the pattern may match from the character
     immediately following one of them instead of from the very
     start. For example, the pattern

       .*second

     matches the subject "first\nand second" (where \n stands for
     a newline character), with the match starting at the seventh
     character. In order to do this, PCRE has to retry the match
     starting after every newline in the subject.

     If you are using such a pattern with subject strings that do
     not contain newlines, the best performance is obtained by
     setting PCRE_DOTALL, or starting the pattern with ^.* to
     indicate explicit anchoring. That saves PCRE from having to
     scan along the subject looking for a newline to restart at.

     Beware of patterns that contain nested indefinite repeats.
     These can take a long time to run when applied to a string
     that does not match. Consider the pattern fragment

       (a+)*

     This can match "aaaa" in 16 different ways, and this number
     increases very rapidly as the string gets longer. (The *
     repeat can match 0, 1, 2, 3, or 4 times, and for each of
     those cases other than 0 or 4, the + repeats can match
     different numbers of times.) When the remainder of the
     pattern is such that the entire match is going to fail, PCRE
     has in principle to try every possible variation, and this
     can take an extremely long time.

     An optimization catches some of the more simple cases such
     as

       (a+)*b

     where a literal character follows. Before embarking on the
     standard matching procedure, PCRE checks that there is a "b"
     later in the subject string, and if there is not, it fails
     the match immediately. However, when there is no following
     literal this optimization cannot be used. You can see the
     difference by comparing the behaviour of

       (a+)*\d

     with the pattern above. The pattern above gives a failure
     almost instantly when applied to a whole line of "a"
     characters, whereas (a+)*\d takes an appreciable time with
     strings longer than about 20 characters.
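
     The growth in the number of ways the (a+)* fragment can
     match is easy to check by brute force. The C sketch below
     counts every ordered split of the consumed characters into
     non-empty runs, one run per iteration of the group. The
     function names are invented for the illustration, and the
     counting convention (every prefix length, including the
     empty match, is counted) is an assumption of the sketch.

```c
#include <assert.h>

/* Number of ways (a+)* can consume exactly n characters: each way is an
   ordered split of n into non-empty runs of "a". */
unsigned long ways_to_consume(unsigned n)
{
    if (n == 0)
        return 1;                       /* zero iterations of the group */
    unsigned long total = 0;
    for (unsigned first = 1; first <= n; first++)
        total += ways_to_consume(n - first);  /* first run takes `first` */
    return total;
}

/* Number of ways (a+)* can match at the start of a subject made of len
   "a" characters, summed over every length at which it could stop. */
unsigned long ways_to_match(unsigned len)
{
    assert(len <= 24);  /* keeps the run time and the result in range */
    unsigned long total = 0;
    for (unsigned k = 0; k <= len; k++)
        total += ways_to_consume(k);
    return total;
}
```

     By this way of counting the total doubles with each extra
     character, so that four characters give 16 ways and twenty
     characters give over a million, which is why a failing match
     can take so long when every variation has to be tried.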

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions.


SYNOPSIS OF POSIX API
     #include <pcreposix.h>

     int regcomp(regex_t *preg, const char *pattern,
          int cflags);

     int regexec(regex_t *preg, const char *string,
          size_t nmatch, regmatch_t pmatch[], int eflags);

     size_t regerror(int errcode, const regex_t *preg,
          char *errbuf, size_t errbuf_size);

     void regfree(regex_t *preg);


DESCRIPTION

     This set of functions provides a POSIX-style API to the PCRE
     regular expression package. See the pcreapi documentation
     for a description of the native API, which contains addi-
     tional functionality.

     The functions described here are just wrapper functions that
     ultimately call the PCRE native API. Their prototypes are
     defined in the pcreposix.h header file, and on Unix systems
     the library itself is called pcreposix.a, so can be accessed
     by adding -lpcreposix to the command for linking an applica-
     tion which uses them. Because the POSIX functions call the
     native ones, it is also necessary to add -lpcre.

     I have implemented only those option bits that can be rea-
     sonably mapped to PCRE native options. In addition, the
     options REG_EXTENDED and REG_NOSUB are defined with the
     value zero. They have no effect, but since programs that are
     written to the POSIX interface often use them, this makes it
     easier to slot in PCRE as a replacement library. Other POSIX
     options are not even defined.

     When PCRE is called via these functions, it is only the API
     that is POSIX-like in style. The syntax and semantics of the
     regular expressions themselves are still those of Perl, sub-
     ject to the setting of various PCRE options, as described
     below. "POSIX-like in style" means that the API approximates
     to the POSIX definition; it is not fully POSIX-compatible,
     and in multi-byte encoding domains it is probably even less
     compatible.

     The header for these functions is supplied as pcreposix.h to
     avoid any potential clash with other POSIX libraries. It
     can, of course, be renamed or aliased as regex.h, which is
     the "correct" name. It provides two structure types, regex_t
     for compiled internal forms, and regmatch_t for returning
     captured substrings. It also defines some constants whose
     names start with "REG_"; these are used for setting options
     and identifying error codes.


COMPILING A PATTERN

     The function regcomp() is called to compile a pattern into
     an internal form. The pattern is a C string terminated by a
     binary zero, and is passed in the argument pattern. The preg
     argument is a pointer to a regex_t structure which is used
     as a base for storing information about the compiled expres-
     sion.

     The argument cflags is either zero, or contains one or more
     of the bits defined by the following macros:

       REG_ICASE

     The PCRE_CASELESS option is set when the expression is
     passed for compilation to the native function.

       REG_NEWLINE

     The PCRE_MULTILINE option is set when the expression is
     passed for compilation to the native function. Note that
     this does not mimic the defined POSIX behaviour for
     REG_NEWLINE (see the following section).

     In the absence of these flags, no options are passed to the
     native function. This means that the regex is compiled with
     PCRE default semantics. In particular, the way it handles
     newline characters in the subject string is the Perl way,
     not the POSIX way. Note that setting PCRE_MULTILINE has only
     some of the effects specified for REG_NEWLINE. It does not
     affect the way newlines are matched by . (they aren't) or by
     a negative class such as [^a] (they are).

     The yield of regcomp() is zero on success, and non-zero oth-
     erwise. The preg structure is filled in on success, and one
     member of the structure is public: re_nsub contains the
     number of capturing subpatterns in the regular expression.
     Various error codes are defined in the header file.


MATCHING NEWLINE CHARACTERS

     This area is not simple, because POSIX and Perl take dif-
     ferent views of things. It is not possible to get PCRE to
     obey POSIX semantics, but then PCRE was never intended to be
     a POSIX engine. The following table lists the different pos-
     sibilities for matching newline characters in PCRE:

                                 Default   Change with

       . matches newline         no        PCRE_DOTALL
       newline matches [^a]      yes       not changeable
       $ matches \n at end       yes       PCRE_DOLLARENDONLY
       $ matches \n in middle    no        PCRE_MULTILINE
       ^ matches \n in middle    no        PCRE_MULTILINE

     This is the equivalent table for POSIX:

                                 Default   Change with

       . matches newline         yes       REG_NEWLINE
       newline matches [^a]      yes       REG_NEWLINE
       $ matches \n at end       no        REG_NEWLINE
       $ matches \n in middle    no        REG_NEWLINE
       ^ matches \n in middle    no        REG_NEWLINE

     PCRE's behaviour is the same as Perl's, except that there is
     no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
     and Perl, there is no way to stop newline from matching
     [^a].

     The default POSIX newline handling can be obtained by set-
     ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
     to make PCRE behave exactly as for the REG_NEWLINE action.
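
     The first row of each table can be probed with a few lines
     of C. The sketch below includes the system <regex.h>, so as
     written it exercises a POSIX library; built against
     pcreposix.h instead, the same call would exercise the PCRE
     wrapper. The function name dot_matches_newline() is invented
     for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: system header; PCRE: <pcreposix.h> */

/* Returns 1 if the pattern a.b matches the subject "a\nb" when compiled
   with the given cflags, 0 if it does not, and -1 on a compile error. */
int dot_matches_newline(int cflags)
{
    regex_t re;
    assert((cflags & ~(REG_NEWLINE | REG_ICASE)) == 0);

    if (regcomp(&re, "a.b", cflags) != 0)
        return -1;

    /* No substring offsets are wanted, so nmatch is 0 and pmatch NULL. */
    int rc = regexec(&re, "a\nb", 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

     Going by the tables above, a POSIX library is expected to
     answer 1 for cflags of zero, and PCRE's wrapper 0. With
     REG_NEWLINE both are expected to answer 0, since the wrapper
     maps that flag to PCRE_MULTILINE, which leaves the dot
     unchanged.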


MATCHING A PATTERN

     The function regexec() is called to match a pre-compiled
     pattern preg against a given string, which is terminated by
     a zero byte, subject to the options in eflags. These can be:

       REG_NOTBOL

     The PCRE_NOTBOL option is set when calling the underlying
     PCRE matching function.

       REG_NOTEOL

     The PCRE_NOTEOL option is set when calling the underlying
     PCRE matching function.

     The portion of the string that was matched, and also any
     captured substrings, are returned via the pmatch argument,
     which points to an array of nmatch structures of type
     regmatch_t, containing the members rm_so and rm_eo. These
     contain the offset to the first character of each substring
     and the offset to the first character after the end of each
     substring, respectively. The 0th element of the vector
     relates to the entire portion of string that was matched;
     subsequent elements relate to the capturing subpatterns of
     the regular expression. Unused entries in the array have
     both structure members set to -1.

     A successful match yields a zero return; various error codes
     are defined in the header file, of which REG_NOMATCH is the
     "expected" failure code.


ERROR MESSAGES

     The regerror() function maps a non-zero error code from
     either regcomp() or regexec() to a printable message. If
     preg is not NULL, the error should have arisen from the use
     of that structure. A message terminated by a binary zero is
     placed in errbuf. The length of the message, including the
     zero, is limited to errbuf_size. The yield of the function
     is the size of buffer needed to hold the whole message.
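
     Because the yield is the size needed for the whole message,
     regerror() can be called twice: once to find the size and
     once to fetch the text. The sketch below does this. Calling
     it with a zero-length buffer follows the POSIX definition of
     regerror(); that the wrapper treats such a call the same way
     is an assumption here. The sketch includes the system
     <regex.h>, and the function name compile_error_message() is
     invented for the illustration.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header; PCRE: <pcreposix.h> */
#include <stdlib.h>

/* Try to compile pattern. Returns NULL if it compiles (or if memory runs
   out), otherwise a malloc'd copy of the error message, which the caller
   must free(). */
char *compile_error_message(const char *pattern)
{
    assert(pattern != NULL);
    regex_t re;

    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: no buffer, just ask how much space is needed
       (the count includes the terminating zero). */
    size_t needed = regerror(rc, &re, NULL, 0);

    char *msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &re, msg, needed);   /* second call: fetch the text */
    return msg;
}
```

     The wording of the message depends on the library in use, so
     a caller should treat it as text for display only.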


STORAGE

     Compiling a regular expression causes memory to be allocated
     and associated with the preg structure. The function
     regfree() frees all such memory, after which preg may no
     longer be used as a compiled expression.


AUTHOR

     Philip Hazel <ph10@cam.ac.uk>
     University Computing Service,
     Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions


PCRE SAMPLE PROGRAM

     A simple, complete demonstration program, to get you started
     with using PCRE, is supplied in the file pcredemo.c in the
     PCRE distribution.

     The program compiles the regular expression that is its
     first argument, and matches it against the subject string in
     its second argument. No PCRE options are set, and default
     character tables are used. If matching succeeds, the program
     outputs the portion of the subject that matched, together
     with the contents of any captured substrings.

     If the -g option is given on the command line, the program
     then goes on to check for further matches of the same regu-
     lar expression in the same subject string. The logic is a
     little bit tricky because of the possibility of matching an
     empty string. Comments in the code explain what is going on.

     On a Unix system that has PCRE installed in /usr/local, you
     can compile the demonstration program using a command like
     this:

       gcc -o pcredemo pcredemo.c -I/usr/local/include \
           -L/usr/local/lib -lpcre

     Then you can run simple tests like this:

       ./pcredemo 'cat|dog' 'the cat sat on the mat'
       ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

     Note that there is a much more comprehensive test program,
     called pcretest, which supports many more facilities for
     testing regular expressions and the PCRE library. The
     pcredemo program is provided as a simple coding example.

     On some operating systems (e.g. Solaris) you may get an
     error like this when you try to run pcredemo:

       ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory

     This is caused by the way shared library support works on
     those systems. You need to add

       -R/usr/local/lib

     to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
