Contents of /code/trunk/doc/pcre.txt

Revision 71
Sat Feb 24 21:40:24 2007 UTC by nigel
File MIME type: text/plain
File size: 144666 byte(s)
Load pcre-4.4 into code/trunk.

This file contains a concatenation of the PCRE man pages, converted to
plain text format for ease of searching with a text editor, or for use
on systems that do not have a man page processor. The small individual
files that give synopses of each function in the library have not been
included. There are separate text files for the pcregrep and pcretest
commands.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


DESCRIPTION

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl, with just a few differences. The current implementation of PCRE
(release 4.x) corresponds approximately with Perl 5.8, including
support for UTF-8 encoded strings. However, this support has to be
explicitly enabled; it is not the default.

PCRE is written in C and released as a C library. However, a number of
people have written wrappers and interfaces of various kinds. A C++
class is included in these contributions, which can be found in the
Contrib directory at the primary FTP site, which is:

    ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the
pcrepattern and pcrecompat pages.

Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. Documentation about
building PCRE for various operating systems can be found in the README
file in the source distribution.


USER DOCUMENTATION

The user documentation for PCRE has been split up into a number of
different sections. In the "man" format, each of these is a separate
"man page". In the HTML format, each is a separate page, linked from
the index page. In the plain text format, all the sections are
concatenated, for ease of searching. The sections are as follows:

    pcre          this document
    pcreapi       details of PCRE's native API
    pcrebuild     options for building PCRE
    pcrecallout   details of the callout feature
    pcrecompat    discussion of Perl compatibility
    pcregrep      description of the pcregrep command
    pcrepattern   syntax and semantics of supported
                  regular expressions
    pcreperform   discussion of performance issues
    pcreposix     the POSIX-compatible API
    pcresample    discussion of the sample program
    pcretest      the pcretest testing command

In addition, in the "man" and HTML formats, there is a short page for
each library function, listing its arguments and results.


LIMITATIONS

There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.

The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed of
execution will be slower.

All values in repeating quantifiers must be less than 65536. The
maximum number of capturing subpatterns is 65535.

There is no limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized subpattern,
including capturing subpatterns, assertions, and other types of
subpattern, is 200.

The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, PCRE uses recursion to
handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can
be processed by certain patterns.


UTF-8 SUPPORT

Starting at release 3.3, PCRE has had some support for character
strings encoded in the UTF-8 format. For release 4.0 this has been
greatly extended to cover most common requirements.

In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag. When you do this, both the pattern and
any subject strings that are matched against it are treated as UTF-8
strings instead of just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag in several places, so should
not be very large.

The following comments apply when PCRE is running in UTF-8 mode:

1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
subjects are checked for validity on entry to the relevant functions.
If an invalid UTF-8 string is passed, an error return is given. In some
situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance. If
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
PCRE assumes that the pattern or subject it is given (respectively)
contains only valid UTF-8 codes. In this case, it does not diagnose an
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
crash.

2. In a pattern, the escape sequence \x{...}, where the contents of the
braces are a string of hexadecimal digits, is interpreted as a UTF-8
character whose code number is the given hexadecimal number, for
example: \x{1234}. If a non-hexadecimal digit appears between the
braces, the item is not recognized. This escape sequence can be used
either as a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte
UTF-8 character if the value is greater than 127.

4. Repeat quantifiers apply to complete UTF-8 characters, not to
individual bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead of a
single byte.

6. The escape sequence \C can be used to match a single byte in UTF-8
mode, but its use can lead to some strange effects.

7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as
before, all with values less than 256.

8. Case-insensitive matching applies only to characters whose values
are less than 256. PCRE does not support the notion of "case" for
higher-valued characters.

9. PCRE does not support the use of Unicode tables and properties or
the Perl escapes \p, \P, and \X.
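
Items 2 to 5 above are easier to follow with the byte layout in view.
The following is a minimal sketch, in plain C and independent of PCRE,
of how a code point such as \x{1234} becomes a multi-byte UTF-8
sequence and of how characters, as opposed to bytes, can be counted.
The names utf8_encode and utf8_char_count are illustrative assumptions,
not part of the PCRE API.

```c
#include <stddef.h>

/* Encode code point cp as UTF-8 into out, which must have room for
   4 bytes. Returns the number of bytes written, or 0 if cp is above
   0x10FFFF. */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count characters in a valid UTF-8 string of nbytes bytes. Every byte
   that is not a continuation byte (10xxxxxx) starts a new character. */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t count = 0, i;
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

With this encoding, \x{1234} occupies three bytes (E1 88 B4), while a
value above 127 such as \xE9 occupies two (C3 A9). This is why the dot
and the repeat quantifiers have to work on whole characters rather than
bytes in UTF-8 mode.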


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


PCRE BUILD-TIME OPTIONS

This document describes the optional features of PCRE that can be
selected when the library is compiled. They are all selected, or
deselected, by providing options to the configure script which is run
before the make command. The complete list of options for configure
(which includes the standard ones such as the selection of the
installation directory) can be obtained by running

    ./configure --help

The following sections describe certain options whose names begin with
--enable or --disable. These settings specify changes to the defaults
for the configure command. Because of the way that configure works,
--enable and --disable always come in pairs, so the complementary
option always exists as well, but as it specifies the default, it is
not described.


UTF-8 SUPPORT

To build PCRE with support for UTF-8 character strings, add

    --enable-utf8

to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
function.


CODE VALUE OF NEWLINE

By default, PCRE treats character 10 (linefeed) as the newline
character. This is the normal newline character on Unix-like systems.
You can compile PCRE to use character 13 (carriage return) instead by
adding

    --enable-newline-is-cr

to the configure command. For completeness there is also a
--enable-newline-is-lf option, which explicitly specifies linefeed as
the newline character.


BUILDING SHARED AND STATIC LIBRARIES

The PCRE building process uses libtool to build both shared and static
Unix libraries by default. You can suppress one of these by adding one
of

    --disable-shared
    --disable-static

to the configure command, as required.


POSIX MALLOC USAGE

When PCRE is called through the POSIX interface (see the pcreposix
documentation), additional working storage is required for holding the
pointers to capturing substrings because PCRE requires three integers
per substring, whereas the POSIX interface provides only two. If the
number of expected substrings is small, the wrapper function uses space
on the stack, because this is faster than using malloc() for each call.
The default threshold above which the stack is no longer used is 10; it
can be changed by adding a setting such as

    --with-posix-malloc-threshold=20

to the configure command.
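
The stack-or-malloc() decision described above can be sketched as
follows. This is an illustration of the idea only, not the code of the
PCRE POSIX wrapper: the macro POSIX_MALLOC_THRESHOLD and the function
match_with_workspace are hypothetical names, and the matching step
itself is left out.

```c
#include <stdlib.h>
#include <string.h>

#define POSIX_MALLOC_THRESHOLD 10   /* hypothetical; mirrors the default */

/* Obtain working space of three integers per expected substring.
   Returns 0 if the stack buffer was used, 1 if malloc() was used,
   and -1 if malloc() failed. */
int match_with_workspace(size_t nmatch)
{
    int stack_vector[POSIX_MALLOC_THRESHOLD * 3];
    int *ovector = stack_vector;
    int used_heap = 0;

    if (nmatch > POSIX_MALLOC_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    }

    /* A real wrapper would run the match here and then copy the start
       and end offset of each substring into the caller's array. */
    memset(ovector, 0, nmatch * 3 * sizeof(int));

    if (used_heap)
        free(ovector);
    return used_heap;
}
```

With the default threshold, a request for up to 10 substrings stays on
the stack and a request for 11 or more goes to the heap.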


LIMITING PCRE RESOURCE USAGE

Internally, PCRE has a function called match() which it calls
repeatedly (possibly recursively) when performing a matching operation.
By limiting the number of times this function may be called, a limit
can be placed on the resources used by a single call to pcre_exec().
The limit can be changed at run time, as described in the pcreapi
documentation. The default is 10 million, but this can be changed by
adding a setting such as

    --with-match-limit=500000

to the configure command.
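
To show why counting calls bounds the work done, here is a toy
recursive matcher with a call budget. It is a sketch under stated
assumptions and has nothing to do with PCRE's internal match()
function: the pattern language (literal characters plus '*' for any
run of characters), the type match_budget, and the error code
MATCH_LIMIT_EXCEEDED are all invented for the illustration.

```c
#define MATCH_LIMIT_EXCEEDED (-1)   /* hypothetical error code */

typedef struct {
    unsigned long calls;   /* calls made so far */
    unsigned long limit;   /* maximum number of calls allowed */
} match_budget;

/* Returns 1 on a match, 0 on no match, and MATCH_LIMIT_EXCEEDED when
   the budget runs out before an answer is found. */
int toy_match(const char *pat, const char *subj, match_budget *b)
{
    if (++b->calls > b->limit)
        return MATCH_LIMIT_EXCEEDED;

    if (*pat == '\0')
        return *subj == '\0';

    if (*pat == '*') {
        /* Try every possible length for the star, shortest first. */
        for (;;) {
            int rc = toy_match(pat + 1, subj, b);
            if (rc != 0)
                return rc;        /* a match, or the limit was hit */
            if (*subj == '\0')
                return 0;
            subj++;
        }
    }

    if (*subj != '\0' && *pat == *subj)
        return toy_match(pat + 1, subj + 1, b);
    return 0;
}
```

A pattern with several stars applied to a long subject that cannot
match makes a very large number of calls; with a budget in place the
function gives up with an error instead of running on.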


HANDLING VERY LARGE PATTERNS

Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default two-byte values are used for
these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic
patterns. Nevertheless, some people do want to process enormous
patterns, so it is possible to compile PCRE to use three-byte or
four-byte offsets by adding a setting such as

    --with-link-size=3

to the configure command. The value given must be 2, 3, or 4. Using
longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
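
The trade-off between offset width and pattern size can be made
concrete with a small sketch. It assumes offsets are stored most
significant byte first; the function names are hypothetical and the
code is not taken from PCRE.

```c
/* Store value in size bytes (2, 3, or 4), most significant byte
   first. Bits that do not fit are silently dropped. */
void put_offset(unsigned char *p, int size, unsigned long value)
{
    int i;
    for (i = size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Read back an offset stored by put_offset(). One load and one shift
   per byte, so wider offsets cost more work each time they are read. */
unsigned long get_offset(const unsigned char *p, int size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in size bytes. */
unsigned long max_offset(int size)
{
    return (size >= 4) ? 0xFFFFFFFFUL : (1UL << (8 * size)) - 1;
}
```

Two bytes can address offsets up to 65535, which matches the "around
64K" figure above; three bytes raise that to 16777215.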

If you build PCRE with an increased link size, test 2 (and test 5 if
you are using UTF-8) will fail. Part of the output of these tests is a
representation of the compiled pattern, and this changes with the link
size.

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


SYNOPSIS OF PCRE API

    #include <pcre.h>

    pcre *pcre_compile(const char *pattern, int options,
         const char **errptr, int *erroffset,
         const unsigned char *tableptr);

    pcre_extra *pcre_study(const pcre *code, int options,
         const char **errptr);

    int pcre_exec(const pcre *code, const pcre_extra *extra,
         const char *subject, int length, int startoffset,
         int options, int *ovector, int ovecsize);

    int pcre_copy_named_substring(const pcre *code,
         const char *subject, int *ovector,
         int stringcount, const char *stringname,
         char *buffer, int buffersize);

    int pcre_copy_substring(const char *subject, int *ovector,
         int stringcount, int stringnumber, char *buffer,
         int buffersize);

    int pcre_get_named_substring(const pcre *code,
         const char *subject, int *ovector,
         int stringcount, const char *stringname,
         const char **stringptr);

    int pcre_get_stringnumber(const pcre *code,
         const char *name);

    int pcre_get_substring(const char *subject, int *ovector,
         int stringcount, int stringnumber,
         const char **stringptr);

    int pcre_get_substring_list(const char *subject,
         int *ovector, int stringcount, const char ***listptr);

    void pcre_free_substring(const char *stringptr);

    void pcre_free_substring_list(const char **stringptr);

    const unsigned char *pcre_maketables(void);

    int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
         int what, void *where);

    int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

    int pcre_config(int what, void *where);

    char *pcre_version(void);

    void *(*pcre_malloc)(size_t);

    void (*pcre_free)(void *);

    int (*pcre_callout)(pcre_callout_block *);


PCRE API

PCRE has its own native API, which is described in this document. There
is also a set of wrapper functions that correspond to the POSIX regular
expression API. These are described in the pcreposix documentation.

The native API function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre.a, so
it can be accessed by adding -lpcre to the command for linking an
application which calls it. The header file defines the macros
PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include support
for different releases.

The functions pcre_compile(), pcre_study(), and pcre_exec() are used
for compiling and matching regular expressions. A sample program that
demonstrates the simplest way of using them is given in the file
pcredemo.c. The pcresample documentation describes how to run it.

There are convenience functions for extracting captured substrings from
a matched subject string. They are:

    pcre_copy_substring()
    pcre_copy_named_substring()
    pcre_get_substring()
    pcre_get_named_substring()
    pcre_get_substring_list()

pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.

The function pcre_maketables() is used (optionally) to build a set of
character tables in the current locale for passing to pcre_compile().

The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version which returns only
some of the available information, but is retained for backwards
compatibility. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.

The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions
respectively. PCRE calls the memory management functions via these
variables, so a calling program can replace them if it wishes to
intercept the calls. This should be done before calling any PCRE
functions.

The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the
pcrecallout documentation.


MULTITHREADING

The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc and pcre_free, and the callout function pointed to by
pcre_callout, are shared by all threads.

The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by several
threads at once.


CHECKING BUILD-TIME OPTIONS

    int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE client to
discover which optional features have been compiled into the PCRE
library. The pcrebuild documentation has more details about these
optional features.

The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The following information is
available:

  PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is
available; otherwise it is set to zero.

  PCRE_CONFIG_NEWLINE

The output is an integer that is set to the value of the code that is
used for the newline character. It is either linefeed (10) or carriage
return (13), and should normally be the standard character for your
operating system.

  PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.

  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

  PCRE_CONFIG_MATCH_LIMIT

The output is an integer that gives the default limit for the number of
internal matching function calls in a pcre_exec() execution. Further
details are given with pcre_exec() below.


COMPILING A PATTERN

    pcre *pcre_compile(const char *pattern, int options,
         const char **errptr, int *erroffset,
         const unsigned char *tableptr);

The function pcre_compile() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero,
and is passed in the argument pattern. A pointer to a single block of
memory that is obtained via pcre_malloc is returned. This contains the
compiled code and related data. The pcre type is defined for the
returned block; this is a typedef for a structure whose contents are
not externally defined. It is up to the caller to free the memory when
it is no longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it contains a copy of the tableptr argument,
which is an address (see below).

The options argument contains independent bits that affect the
compilation. It should be zero if no options are required. Some of the
options, in particular, those that are compatible with Perl, can also
be set and unset from within the pattern (see the detailed description
of regular expressions in the pcrepattern documentation). For these
options, the contents of the options argument specifies their initial
settings at the start of compilation and execution. The PCRE_ANCHORED
option can be set at the time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error
message. The offset from the start of the pattern to the character
where the error was discovered is placed in the variable pointed to by
erroffset, which must not be NULL. If it is, an immediate error is
given.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the default
C locale. Otherwise, tableptr must be the result of a call to
pcre_maketables(). See the section on locale support below.

This code fragment shows a typical straightforward call to
pcre_compile():

    pcre *re;
    const char *error;
    int erroffset;
    re = pcre_compile(
      "^A.*Z",          /* the pattern */
      0,                /* default options */
      &error,           /* for error message */
      &erroffset,       /* for error offset */
      NULL);            /* use default character tables */

The following option bits are defined:

  PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
which is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.

  PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting.

  PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before the final character if it is a newline (but
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
ignored if PCRE_MULTILINE is set. There is no equivalent to this option
in Perl, and no way to set it within a pattern.

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded. This
option is equivalent to Perl's /s option, and it can be changed within
a pattern by a (?s) option setting. A negative class such as [^a]
always matches a newline character, independent of the setting of this
option.

  PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class.
Whitespace does not include the VT character (code 11). In addition,
characters between an unescaped # outside a character class and the
next newline character, inclusive, are also ignored. This is equivalent
to Perl's /x option, and it can be changed within a pattern by a (?x)
option setting.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.

  PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. There are at present no other features controlled by this
option. It can also be set by a (?X) option setting within a pattern.

  PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
"line" of characters (even if it actually contains several newlines).
The "start of line" metacharacter (^) matches only at the start of the
string, while the "end of line" metacharacter ($) matches only at the
end of the string, or before a terminating newline (unless
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any
newline in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no "\n"
characters in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE_MULTILINE has no effect.

  PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not
followed by ? behaves as if it were followed by ?: but named
parentheses can still be used for capturing (and they acquire numbers
in the usual way). There is no equivalent of this option in Perl.

  PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

  PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only if PCRE has been built to include UTF-8
support. If not, the use of this option provokes an error. Details of
how this option changes the behaviour of PCRE are given in the section
on UTF-8 support in the main pcre page.

  PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. If an invalid UTF-8 sequence of bytes is found,
pcre_compile() returns an error. If you already know that your pattern
is valid, and you want to skip this check for performance reasons, you
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
passing an invalid UTF-8 string as a pattern is undefined. It may cause
your program to crash. Note that there is a similar option for
suppressing the checking of subject strings passed to pcre_exec().
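
A caller who intends to set PCRE_NO_UTF8_CHECK needs some other
assurance that the string is valid. The following is a simplified
structural check in plain C, offered as a sketch only: the name
utf8_is_valid is hypothetical, the rules PCRE itself applies may
differ, and this version does not reject every overlong form or
surrogate code point.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is structurally well-formed UTF-8, that is,
   every lead byte is followed by the right number of continuation
   bytes (10xxxxxx). Returns 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k;

        if (c < 0x80)
            extra = 0;                  /* single-byte character */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                  /* two-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                  /* three-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                  /* four-byte sequence */
        else
            return 0;   /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return 0;                   /* sequence is truncated */

        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;

        i += extra + 1;
    }
    return 1;
}
```

Running a check of this kind once, when a string enters the program,
is what makes it reasonable to skip the repeated checks on every later
call.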
678    
679    
680    
681 nigel 63 STUDYING A PATTERN
682 nigel 49
683 nigel 63 pcre_extra *pcre_study(const pcre *code, int options,
684     const char **errptr);
685    
686 nigel 41 When a pattern is going to be used several times, it is
687     worth spending more time analyzing it in order to speed up
688     the time taken for matching. The function pcre_study() takes
689 nigel 63 a pointer to a compiled pattern as its first argument. If
690     studing the pattern produces additional information that
691     will help speed up matching, pcre_study() returns a pointer
692     to a pcre_extra block, in which the study_data field points
693     to the results of the study.
694 nigel 41
695 nigel 63 The returned value from a pcre_study() can be passed
696     directly to pcre_exec(). However, the pcre_extra block also
697     contains other fields that can be set by the caller before
698     the block is passed; these are described below. If studying
699     the pattern does not produce any additional information,
700     pcre_study() returns NULL. In that circumstance, if the cal-
701     ling program wants to pass some of the other fields to
702     pcre_exec(), it must set up its own pcre_extra block.
703    
704 nigel 41 The second argument contains option bits. At present, no
705     options are defined for pcre_study(), and this argument
706     should always be zero.
707    
708 nigel 63 The third argument for pcre_study() is a pointer for an
709     error message. If studying succeeds (even if no data is
710     returned), the variable it points to is set to NULL. Other-
711     wise it points to a textual error message. You should there-
712     fore test the error pointer for NULL after calling
713     pcre_study(), to be sure that it has run successfully.
714 nigel 41
715 nigel 53 This is a typical call to pcre_study():
716    
    pcre_extra *pe;
    pe = pcre_study(
      re,             /* result of pcre_compile() */
      0,              /* no options exist */
      &error);        /* set to NULL or points to a message */
722    
723 nigel 41 At present, studying a pattern is useful only for non-
724     anchored patterns that do not have a single fixed starting
725     character. A bitmap of possible starting characters is
726     created.
727    
728    
729 nigel 63 LOCALE SUPPORT
730 nigel 41
731     PCRE handles caseless matching, and determines whether char-
732     acters are letters, digits, or whatever, by reference to a
733 nigel 63 set of tables. When running in UTF-8 mode, this applies only
734     to characters with codes less than 256. The library contains
735     a default set of tables that is created in the default C
736     locale when PCRE is compiled. This is used when the final
737     argument of pcre_compile() is NULL, and is sufficient for
738     many applications.
739 nigel 41
740     An alternative set of tables can, however, be supplied. Such
741     tables are built by calling the pcre_maketables() function,
742     which has no arguments, in the relevant locale. The result
743     can then be passed to pcre_compile() as often as necessary.
744     For example, to build and use tables that are appropriate
745     for the French locale (where accented characters with codes
746     greater than 128 are treated as letters), the following code
747     could be used:
748    
    setlocale(LC_CTYPE, "fr");
    tables = pcre_maketables();
    re = pcre_compile(..., tables);
752    
753     The tables are built in memory that is obtained via
754     pcre_malloc. The pointer that is passed to pcre_compile is
755     saved with the compiled pattern, and the same tables are
756 nigel 63 used via this pointer by pcre_study() and pcre_exec(). Thus,
757 nigel 41 for any single pattern, compilation, studying and matching
758     all happen in the same locale, but different patterns can be
759     compiled in different locales. It is the caller's responsi-
760     bility to ensure that the memory containing the tables
761     remains available for as long as it is needed.
762    
763    
764 nigel 63 INFORMATION ABOUT A PATTERN
765 nigel 41
766 nigel 63 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
767     int what, void *where);
768    
The pcre_fullinfo() function returns information about a
compiled pattern. It replaces the obsolete pcre_info()
function, which is nevertheless retained for backwards
compatibility (and is documented below).

The first argument for pcre_fullinfo() is a pointer to the
compiled pattern. The second argument is the result of
pcre_study(), or NULL if the pattern was not studied. The
third argument specifies which piece of information is
required, and the fourth argument is a pointer to a variable
to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
780 nigel 43
    PCRE_ERROR_NULL       the argument code was NULL
                          the argument where was NULL
    PCRE_ERROR_BADMAGIC   the "magic number" was not found
    PCRE_ERROR_BADOPTION  the value of what was invalid
785 nigel 41
786 nigel 53 Here is a typical call of pcre_fullinfo(), to obtain the
787     length of the compiled pattern:
788    
    int rc;
    size_t length;
    rc = pcre_fullinfo(
      re,               /* result of pcre_compile() */
      pe,               /* result of pcre_study(), or NULL */
      PCRE_INFO_SIZE,   /* what is required */
      &length);         /* where to put the data (a size_t) */
796    
797 nigel 43 The possible values for the third argument are defined in
798     pcre.h, and are as follows:
799    
800 nigel 63 PCRE_INFO_BACKREFMAX
801 nigel 43
802 nigel 63 Return the number of the highest back reference in the pat-
803     tern. The fourth argument should point to an int variable.
804     Zero is returned if there are no back references.
805 nigel 41
806 nigel 43 PCRE_INFO_CAPTURECOUNT
807    
808     Return the number of capturing subpatterns in the pattern.
809     The fourth argument should point to an int variable.
810    
811 nigel 63 PCRE_INFO_FIRSTBYTE
812 nigel 43
813 nigel 63 Return information about the first byte of any matched
814     string, for a non-anchored pattern. (This option used to be
815     called PCRE_INFO_FIRSTCHAR; the old name is still recognized
816     for backwards compatibility.)
817 nigel 43
818 nigel 63 If there is a fixed first byte, e.g. from a pattern such as
819 nigel 47 (cat|cow|coyote), it is returned in the integer pointed to
820     by where. Otherwise, if either
821 nigel 41
822     (a) the pattern was compiled with the PCRE_MULTILINE option,
823     and every branch starts with "^", or
824    
825     (b) every branch of the pattern starts with ".*" and
826     PCRE_DOTALL is not set (if it were set, the pattern would be
827     anchored),
828 nigel 43
829 nigel 47 -1 is returned, indicating that the pattern matches only at
830 nigel 63 the start of a subject string or after any newline within
831     the string. Otherwise -2 is returned. For anchored patterns,
832     -2 is returned.
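
The three kinds of result can be told apart as in the
following sketch. It is standalone C that does not call
PCRE: the argument stands for the integer that
pcre_fullinfo() stored via where, and the enum and function
names are illustrative assumptions, not part of the library.

```c
#include <assert.h>

/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_FIXED,        /* value is the fixed first byte          */
    FIRST_LINE_START,   /* -1: matches only at start or after \n  */
    FIRST_UNKNOWN       /* -2: no information, or anchored        */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_FIXED;
    if (value == -1)
        return FIRST_LINE_START;
    assert(value == -2);    /* the only other documented result */
    return FIRST_UNKNOWN;
}
```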
833 nigel 41
834 nigel 43 PCRE_INFO_FIRSTTABLE
835 nigel 41
836 nigel 43 If the pattern was studied, and this resulted in the con-
837 nigel 63 struction of a 256-bit table indicating a fixed set of bytes
838     for the first byte in any matching string, a pointer to the
839     table is returned. Otherwise NULL is returned. The fourth
840     argument should point to an unsigned char * variable.
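
A caller might test whether a given byte can start a match
roughly as follows. This is a sketch only: the text above
does not specify the bit order within the table, so the
layout used here (bit c%8 of byte c/8) is an assumption, and
first_table_allows() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if byte c is marked in the 256-bit (32-byte)
   table. ASSUMPTION: byte c is represented by bit (c % 8) of
   table[c / 8]; check this against your PCRE release. */
int first_table_allows(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] >> (c % 8)) & 1;
}
```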
841 nigel 41
842 nigel 43 PCRE_INFO_LASTLITERAL
843    
844 nigel 65 Return the value of the rightmost literal byte that must
845     exist in any matched string, other than at its start, if
846     such a byte has been recorded. The fourth argument should
847     point to an int variable. If there is no such byte, -1 is
848     returned. For anchored patterns, a last literal byte is
849     recorded only if it follows something of variable length.
850     For example, for the pattern /^a\d+z\d+/ the returned value
851     is "z", but for /^a\dz\d/ the returned value is -1.
852 nigel 43
853 nigel 63 PCRE_INFO_NAMECOUNT
854     PCRE_INFO_NAMEENTRYSIZE
855     PCRE_INFO_NAMETABLE
856    
857     PCRE supports the use of named as well as numbered capturing
858     parentheses. The names are just an additional way of identi-
859     fying the parentheses, which still acquire a number. A
860     caller that wants to extract data from a named subpattern
861     must convert the name to a number in order to access the
862     correct pointers in the output vector (described with
863     pcre_exec() below). In order to do this, it must first use
864     these three values to obtain the name-to-number mapping
865     table for the pattern.
866    
867     The map consists of a number of fixed-size entries.
868     PCRE_INFO_NAMECOUNT gives the number of entries, and
869     PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
870     of these return an int value. The entry size depends on the
871     length of the longest name. PCRE_INFO_NAMETABLE returns a
872     pointer to the first entry of the table (a pointer to char).
873     The first two bytes of each entry are the number of the cap-
874     turing parenthesis, most significant byte first. The rest of
875     the entry is the corresponding name, zero terminated. The
876     names are in alphabetical order. For example, consider the
877     following pattern (assume PCRE_EXTENDED is set, so white
878     space - including newlines - is ignored):
879    
880     (?P<date> (?P<year>(\d\d)?\d\d) -
881     (?P<month>\d\d) - (?P<day>\d\d) )
882    
There are four named subpatterns, so the table has four
entries, and each entry in the table is eight bytes long.
The table is as follows, with non-printing bytes shown in
hex, and undefined bytes shown as ??:

    00 01 d  a  t  e  00 ??
    00 05 d  a  y  00 ?? ??
    00 04 m  o  n  t  h  00
    00 02 y  e  a  r  00 ??
892    
893     When writing code to extract data from named subpatterns,
894     remember that the length of each entry may be different for
895     each compiled pattern.
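
The layout described above can be searched as in the
following sketch. It is standalone C that does not call
PCRE: table, count and entry_size stand for the values
obtained with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and
PCRE_INFO_NAMEENTRYSIZE, and find_group_number() is an
illustrative name, not a library function.

```c
#include <assert.h>
#include <string.h>

/* Look up 'name' in a name table laid out as described above.
   Returns the capturing parenthesis number, or -1 if the name
   is not present. */
int find_group_number(const unsigned char *table, int count,
                      int entry_size, const char *name)
{
    int i;
    assert(entry_size > 2);   /* two number bytes, name, zero byte */
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + i * entry_size;
        /* The zero-terminated name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];   /* MSB first */
    }
    return -1;
}
```

For the four-entry table shown above, looking up "month"
yields 4 and "day" yields 5. Because the names are in
alphabetical order a binary search would also work; the
linear scan is used here only for brevity.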
896    
897     PCRE_INFO_OPTIONS
898    
899     Return a copy of the options with which the pattern was com-
900     piled. The fourth argument should point to an unsigned long
901     int variable. These option bits are those specified in the
902     call to pcre_compile(), modified by any top-level option
903     settings within the pattern itself.
904    
905     A pattern is automatically anchored by PCRE if all of its
906     top-level alternatives begin with one of the following:
907    
    ^     unless PCRE_MULTILINE is set
    \A    always
    \G    always
    .*    if PCRE_DOTALL is set and there are no back
            references to the subpattern in which .* appears
913    
914     For such patterns, the PCRE_ANCHORED bit is set in the
915     options returned by pcre_fullinfo().
916    
917     PCRE_INFO_SIZE
918    
919     Return the size of the compiled pattern, that is, the value
920     that was passed as the argument to pcre_malloc() when PCRE
921     was getting memory in which to place the compiled data. The
922     fourth argument should point to a size_t variable.
923    
924     PCRE_INFO_STUDYSIZE
925    
926     Returns the size of the data block pointed to by the
927     study_data field in a pcre_extra block. That is, it is the
928     value that was passed to pcre_malloc() when PCRE was getting
929     memory into which to place the data created by pcre_study().
930     The fourth argument should point to a size_t variable.
931    
932    
933     OBSOLETE INFO FUNCTION
934    
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
936    
937 nigel 43 The pcre_info() function is now obsolete because its inter-
938     face is too restrictive to return all the available data
939     about a compiled pattern. New programs should use
940     pcre_fullinfo() instead. The yield of pcre_info() is the
941     number of capturing subpatterns, or one of the following
942     negative numbers:
943    
944     PCRE_ERROR_NULL the argument code was NULL
945     PCRE_ERROR_BADMAGIC the "magic number" was not found
946    
947     If the optptr argument is not NULL, a copy of the options
948     with which the pattern was compiled is placed in the integer
949     it points to (see PCRE_INFO_OPTIONS above).
950    
951     If the pattern is not anchored and the firstcharptr argument
952     is not NULL, it is used to pass back information about the
953     first character of any matched string (see
954 nigel 63 PCRE_INFO_FIRSTBYTE above).
955 nigel 43
956    
957 nigel 41 MATCHING A PATTERN
958 nigel 53
959 nigel 63 int pcre_exec(const pcre *code, const pcre_extra *extra,
960     const char *subject, int length, int startoffset,
961     int options, int *ovector, int ovecsize);
962 nigel 53
963 nigel 63 The function pcre_exec() is called to match a subject string
964 nigel 41 against a pre-compiled pattern, which is passed in the code
965     argument. If the pattern has been studied, the result of the
966 nigel 63 study should be passed in the extra argument.
967 nigel 41
968 nigel 53 Here is an example of a simple call to pcre_exec():
969    
    int rc;
    int ovector[30];
    rc = pcre_exec(
      re,             /* result of pcre_compile() */
      NULL,           /* we didn't study the pattern */
      "some string",  /* the subject string */
      11,             /* the length of the subject string */
      0,              /* start at offset 0 in the subject */
      0,              /* default options */
      ovector,        /* vector for substring information */
      30);            /* number of elements in the vector */
981    
982 nigel 63 If the extra argument is not NULL, it must point to a
983     pcre_extra data block. The pcre_study() function returns
984     such a block (when it doesn't return NULL), but you can also
985     create one for yourself, and pass additional information in
986     it. The fields in the block are as follows:
987    
988     unsigned long int flags;
989     void *study_data;
990     unsigned long int match_limit;
991     void *callout_data;
992    
993     The flags field is a bitmap that specifies which of the
994     other fields are set. The flag bits are:
995    
996     PCRE_EXTRA_STUDY_DATA
997     PCRE_EXTRA_MATCH_LIMIT
998     PCRE_EXTRA_CALLOUT_DATA
999    
1000     Other flag bits should be set to zero. The study_data field
1001     is set in the pcre_extra block that is returned by
1002     pcre_study(), together with the appropriate flag bit. You
1003     should not set this yourself, but you can add to the block
1004     by setting the other fields.
1005    
1006     The match_limit field provides a means of preventing PCRE
1007     from using up a vast amount of resources when running pat-
1008     terns that are not going to match, but which have a very
1009     large number of possibilities in their search trees. The
1010     classic example is the use of nested unlimited repeats.
1011     Internally, PCRE uses a function called match() which it
1012     calls repeatedly (sometimes recursively). The limit is
1013     imposed on the number of times this function is called dur-
1014     ing a match, which has the effect of limiting the amount of
1015     recursion and backtracking that can take place. For patterns
1016     that are not anchored, the count starts from zero for each
1017     position in the subject string.
1018    
The default limit for the library can be set when PCRE is
built; unless changed at build time, it is 10 million, which
handles all but the most extreme cases. You can reduce the
default by supplying pcre_exec() with a pcre_extra block in
which match_limit is set to a smaller value, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
limit is exceeded, pcre_exec() returns
PCRE_ERROR_MATCHLIMIT.
1027    
The callout_data field is used in conjunction with the
"callout" feature, which is described in the pcrecallout
documentation.
1031    
The PCRE_ANCHORED option can be passed in the options
argument, whose unused bits must be zero. This limits
pcre_exec() to matching at the first matching position.
However, if a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it
cannot be made unanchored at matching time.
1038 nigel 41
1039 nigel 71 When PCRE_UTF8 was set at compile time, the validity of the
1040     subject as a UTF-8 string is automatically checked. If an
1041     invalid UTF-8 sequence of bytes is found, pcre_exec()
1042     returns the error PCRE_ERROR_BADUTF8. If you already know
1043     that your subject is valid, and you want to skip this check
1044     for performance reasons, you can set the PCRE_NO_UTF8_CHECK
1045     option when calling pcre_exec(). When this option is set,
1046     the effect of passing an invalid UTF-8 string as a subject
1047     is undefined. It may cause your program to crash.
1048    
1049 nigel 41 There are also three further options that can be set only at
1050     matching time:
1051    
1052     PCRE_NOTBOL
1053    
1054     The first character of the string is not the beginning of a
1055     line, so the circumflex metacharacter should not match
1056     before it. Setting this without PCRE_MULTILINE (at compile
1057     time) causes circumflex never to match.
1058    
1059     PCRE_NOTEOL
1060    
1061     The end of the string is not the end of a line, so the dol-
1062     lar metacharacter should not match it nor (except in multi-
1063     line mode) a newline immediately before it. Setting this
1064     without PCRE_MULTILINE (at compile time) causes dollar never
1065     to match.
1066    
1067     PCRE_NOTEMPTY
1068    
1069     An empty string is not considered to be a valid match if
1070     this option is set. If there are alternatives in the pat-
1071     tern, they are tried. If all the alternatives match the
1072     empty string, the entire match fails. For example, if the
1073     pattern
1074    
1075     a?b?
1076    
1077     is applied to a string not beginning with "a" or "b", it
1078     matches the empty string at the start of the subject. With
1079     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1080     further into the string for occurrences of "a" or "b".
1081    
1082     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1083     make a special case of a pattern match of the empty string
1084     within its split() function, and when using the /g modifier.
1085     It is possible to emulate Perl's behaviour after matching a
1086     null string by first trying the match again at the same
1087     offset with PCRE_NOTEMPTY set, and then if that fails by
1088     advancing the starting offset (see below) and trying an
1089     ordinary match again.
1090    
The subject string is passed to pcre_exec() as a pointer in
subject, a length in length, and a starting offset in
startoffset. Unlike the pattern string, the subject may
contain binary zero bytes. When the starting offset is zero,
the search for a match starts at the beginning of the
subject, and this is by far the most common case.
1097 nigel 41
If the pattern was compiled with the PCRE_UTF8 option, the
subject must be a sequence of bytes that is a valid UTF-8
string. An invalid string is rejected with
PCRE_ERROR_BADUTF8 unless PCRE_NO_UTF8_CHECK is set, in
which case PCRE's behaviour is not defined.
1102    
1103 nigel 41 A non-zero starting offset is useful when searching for
1104     another match in the same subject by calling pcre_exec()
1105     again after a previous success. Setting startoffset differs
1106     from just passing over a shortened string and setting
1107     PCRE_NOTBOL in the case of a pattern that begins with any
1108     kind of lookbehind. For example, consider the pattern
1109    
1110     \Biss\B
1111    
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a
word boundary.) When applied to the string "Mississippi" the
first call to pcre_exec() finds the first occurrence. If
pcre_exec() is called again with just the remainder of the
subject, namely "issippi", it does not match, because \B is
always false at the start of the subject, which is deemed to
be a word boundary. However, if pcre_exec() is passed the
entire string again, but with startoffset set to 4, it finds
the second occurrence of "iss" because it is able to look
behind the starting point to discover that it is preceded by
a letter.
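
The difference comes down to what is visible before the
starting point. The following standalone sketch does not
call PCRE; it reimplements only the \B test, for ASCII word
characters, on the assumption that a position outside the
buffer counts as a non-word character. The function names
are illustrative.

```c
#include <assert.h>
#include <ctype.h>

/* 1 if the byte at pos is a letter, digit or underscore;
   0 if it is not, or if pos lies outside the subject. */
static int is_word_byte(const char *s, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* Nonzero if pos is NOT a word boundary, i.e. where \B matches. */
int not_word_boundary(const char *subject, int length, int pos)
{
    assert(pos >= 0 && pos <= length);
    return is_word_byte(subject, length, pos - 1) ==
           is_word_byte(subject, length, pos);
}
```

With the whole subject "Mississippi" and position 4 the test
succeeds, because the preceding "s" can be inspected. With
the shortened subject "issippi" and position 0 it fails,
because nothing precedes the first "i".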
1124    
1125     If a non-zero starting offset is passed when the pattern is
1126     anchored, one attempt to match at the given offset is tried.
1127     This can only succeed if the pattern does not require the
1128     match to be at the start of the subject.
1129    
In general, a pattern matches a certain portion of the
subject, and in addition, further substrings from the
subject may be picked out by parts of the pattern. Following
the usage in Jeffrey Friedl's book, this is called
"capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks
out a substring. PCRE supports several other kinds of
parenthesized subpattern that do not cause substrings to be
captured.

Captured substrings are returned to the caller via a vector
of integer offsets whose address is passed in ovector. The
number of elements in the vector is passed in ovecsize. The
first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The
remaining third of the vector is used as workspace by
pcre_exec() while matching capturing subpatterns, and is not
available for passing back information. The length passed in
ovecsize should always be a multiple of three. If it is not,
it is rounded down.
1148    
When a match has been successful, information about captured
substrings is returned in pairs of integers, starting at the
beginning of ovector, and continuing up to two-thirds of its
length at the most. The first element of a pair is set to
the offset of the first character in a substring, and the
second is set to the offset of the first character after the
end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string
matched by the entire pattern. The next pair is used for the
first capturing subpattern, and so on. The value returned by
pcre_exec() is the number of pairs that have been set. If
there are no capturing subpatterns, the return value from a
successful match is 1, indicating that just the first pair
of offsets has been set.
1163 nigel 65
1164 nigel 41 Some convenience functions are provided for extracting the
1165     captured substrings as separate strings. These are described
1166     in the following section.
1167    
It is possible for capturing subpattern number n+1 to
match some part of the subject when subpattern n has not
been used at all. For example, if the string "abc" is
matched against the pattern (a|(z))(bc) subpatterns 1 and 3
are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.
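
Code that reads the offset pairs directly therefore has to
allow for unset entries. A minimal standalone sketch (it
does not call PCRE; captured_length() is an illustrative
name) might look like this, where ovector and rc stand for
the vector filled in by pcre_exec() and its positive return
value:

```c
#include <assert.h>

/* Length of captured substring n, or -1 if that subpattern is
   unset or its pair was not returned by the match. */
int captured_length(const int *ovector, int rc, int n)
{
    assert(rc > 0);                 /* only valid after a successful match */
    if (n < 0 || n >= rc)
        return -1;                  /* pair n was not set */
    if (ovector[2 * n] < 0)
        return -1;                  /* subpattern n did not take part */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For the "abc" example above the return value is 4 and the
pairs are (0,3), (0,1), (-1,-1) and (1,3), so the lengths
are 3, 1, -1 (unset) and 2.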
1174    
1175     If a capturing subpattern is matched repeatedly, it is the
1176     last portion of the string that it matched that gets
1177     returned.
1178    
1179     If the vector is too small to hold all the captured sub-
1180     strings, it is used as far as possible (up to two-thirds of
1181     its length), and the function returns a value of zero. In
1182     particular, if the substring offsets are not of interest,
1183     pcre_exec() may be called with ovector passed as NULL and
1184     ovecsize as zero. However, if the pattern contains back
1185     references and the ovector isn't big enough to remember the
1186     related substrings, PCRE has to get additional memory for
1187     use during matching. Thus it is usually advisable to supply
1188     an ovector.
1189    
Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT (or the
obsolete pcre_info()) can be used to find out how many
capturing subpatterns there are in a compiled pattern. The
smallest size for ovector that will allow for n captured
substrings, in addition to the offsets of the substring
matched by the whole pattern, is (n+1)*3.
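
The arithmetic can be written out as two small helpers. This
is a standalone sketch; the function names are illustrative
and are not part of the library.

```c
#include <assert.h>

/* Smallest ovecsize that can return n captured substrings plus
   the whole match: (n + 1) pairs occupy two-thirds of the
   vector, and the remaining third is workspace. */
int min_ovecsize(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize ints can
   return; a size that is not a multiple of 3 is rounded down. */
int max_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A vector of 30 integers, as in the earlier example call, can
therefore return the whole match and up to nine captured
substrings.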
1195 nigel 41
1196     If pcre_exec() fails, it returns a negative number. The fol-
1197     lowing are defined in the header file:
1198    
1199     PCRE_ERROR_NOMATCH (-1)
1200    
1201     The subject string did not match the pattern.
1202    
1203     PCRE_ERROR_NULL (-2)
1204    
1205     Either code or subject was passed as NULL, or ovector was
1206     NULL and ovecsize was not zero.
1207    
1208     PCRE_ERROR_BADOPTION (-3)
1209    
1210     An unrecognized bit was set in the options argument.
1211    
1212     PCRE_ERROR_BADMAGIC (-4)
1213    
1214     PCRE stores a 4-byte "magic number" at the start of the com-
1215     piled code, to catch the case when it is passed a junk
1216     pointer. This is the error it gives when the magic number
1217     isn't present.
1218    
1219     PCRE_ERROR_UNKNOWN_NODE (-5)
1220    
1221     While running the pattern match, an unknown item was encoun-
1222     tered in the compiled pattern. This error could be caused by
1223     a bug in PCRE or by overwriting of the compiled pattern.
1224    
1225     PCRE_ERROR_NOMEMORY (-6)
1226    
1227     If a pattern contains back references, but the ovector that
1228     is passed to pcre_exec() is not big enough to remember the
1229     referenced substrings, PCRE gets a block of memory at the
1230     start of matching to use for this purpose. If the call via
1231     pcre_malloc() fails, this error is given. The memory is
1232     freed at the end of matching.
1233    
1234 nigel 63 PCRE_ERROR_NOSUBSTRING (-7)
1235 nigel 41
1236 nigel 63 This error is used by the pcre_copy_substring(),
1237     pcre_get_substring(), and pcre_get_substring_list() func-
1238     tions (see below). It is never returned by pcre_exec().
1239 nigel 41
1240 nigel 63 PCRE_ERROR_MATCHLIMIT (-8)
1241 nigel 53
1242 nigel 63 The recursion and backtracking limit, as specified by the
1243     match_limit field in a pcre_extra structure (or defaulted)
1244     was reached. See the description above.
1245    
1246     PCRE_ERROR_CALLOUT (-9)
1247    
1248     This error is never generated by pcre_exec() itself. It is
1249     provided for use by callout functions that want to yield a
1250     distinctive error code. See the pcrecallout documentation
1251     for details.
1252    
1253 nigel 71 PCRE_ERROR_BADUTF8 (-10)
1254 nigel 63
1255 nigel 71 A string that contains an invalid UTF-8 byte sequence was
1256     passed as a subject.
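
When logging failures it can be convenient to turn these
codes back into their names. The sketch below is standalone
C that relies only on the numeric values listed above; the
table and the function exec_error_name() are illustrative
and are not provided by PCRE.

```c
#include <assert.h>

/* Entry i holds the name of error code -(i + 1), following the
   numeric values listed above. */
static const char *const exec_error_names[] = {
    "PCRE_ERROR_NOMATCH",       /*  -1 */
    "PCRE_ERROR_NULL",          /*  -2 */
    "PCRE_ERROR_BADOPTION",     /*  -3 */
    "PCRE_ERROR_BADMAGIC",      /*  -4 */
    "PCRE_ERROR_UNKNOWN_NODE",  /*  -5 */
    "PCRE_ERROR_NOMEMORY",      /*  -6 */
    "PCRE_ERROR_NOSUBSTRING",   /*  -7 */
    "PCRE_ERROR_MATCHLIMIT",    /*  -8 */
    "PCRE_ERROR_CALLOUT",       /*  -9 */
    "PCRE_ERROR_BADUTF8"        /* -10 */
};

/* Name for a negative pcre_exec() return value. */
const char *exec_error_name(int rc)
{
    int count = (int)(sizeof exec_error_names /
                      sizeof exec_error_names[0]);
    assert(rc < 0);             /* non-negative values are not errors */
    if (rc < -count)
        return "unknown error";
    return exec_error_names[-rc - 1];
}
```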
1257    
1258    
1259 nigel 63 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1260    
1261     int pcre_copy_substring(const char *subject, int *ovector,
1262     int stringcount, int stringnumber, char *buffer,
1263     int buffersize);
1264    
1265     int pcre_get_substring(const char *subject, int *ovector,
1266     int stringcount, int stringnumber,
1267     const char **stringptr);
1268    
1269     int pcre_get_substring_list(const char *subject,
1270     int *ovector, int stringcount, const char ***listptr);
1271    
1272 nigel 41 Captured substrings can be accessed directly by using the
1273     offsets returned by pcre_exec() in ovector. For convenience,
1274     the functions pcre_copy_substring(), pcre_get_substring(),
1275     and pcre_get_substring_list() are provided for extracting
1276     captured substrings as new, separate, zero-terminated
1277 nigel 63 strings. These functions identify substrings by number. The
1278     next section describes functions for extracting named sub-
1279 nigel 41 strings. A substring that contains a binary zero is
1280     correctly extracted and has a further zero added on the end,
1281 nigel 63 but the result is not, of course, a C string.
1282 nigel 41
The first three arguments are the same for all three of
these functions: subject is the subject string which has
just been successfully matched, ovector is a pointer to the
vector of integer offsets that was passed to pcre_exec(),
and stringcount is the number of substrings that were
captured by the match, including the substring that matched
the entire regular expression. This is the value returned by
pcre_exec() if it is greater than zero. If pcre_exec()
returned zero, indicating that it ran out of space in
ovector, the value passed as stringcount should be the size
of the vector divided by three.

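
The rule for choosing stringcount can be sketched as follows.
This is standalone C that does not call PCRE;
effective_stringcount() is an illustrative name, and
returning 0 for a failed match is a convention of this
sketch rather than something the library defines.

```c
#include <assert.h>

/* Value to pass as stringcount to the substring functions.
   rc is the return value of pcre_exec(); ovecsize is the
   number of elements in the vector that was passed to it. */
int effective_stringcount(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* number of pairs actually set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was filled completely */
    return 0;                   /* match failed: nothing to extract */
}
```
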
The functions pcre_copy_substring() and pcre_get_substring()
extract a single substring, whose number is given as
stringnumber. A value of zero extracts the substring that
matched the entire pattern, while higher values extract the
captured substrings. For pcre_copy_substring(), the string
is placed in buffer, whose length is given by buffersize,
while for pcre_get_substring() a new block of memory is
obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the
string, not including the terminating zero, or one of
1304    
1305     PCRE_ERROR_NOMEMORY (-6)
1306    
1307     The buffer was too small for pcre_copy_substring(), or the
1308     attempt to get memory failed for pcre_get_substring().
1309    
1310     PCRE_ERROR_NOSUBSTRING (-7)
1311    
1312     There is no substring whose number is stringnumber.
1313    
1314     The pcre_get_substring_list() function extracts all avail-
1315     able substrings and builds a list of pointers to them. All
1316     this is done in a single block of memory which is obtained
1317     via pcre_malloc. The address of the memory block is returned
1318     via listptr, which is also the start of the list of string
1319     pointers. The end of the list is marked by a NULL pointer.
1320     The yield of the function is zero if all went well, or
1321    
1322     PCRE_ERROR_NOMEMORY (-6)
1323    
1324     if the attempt to get the memory block failed.
1325    
When any of these functions encounters a substring that is
unset, which can happen when capturing subpattern number n+1
matches some part of the subject, but subpattern n has not
been used at all, it returns an empty string. This can be
distinguished from a genuine zero-length substring by
inspecting the appropriate offset in ovector, which is
negative for unset substrings.
1333    
1334 nigel 49 The two convenience functions pcre_free_substring() and
1335     pcre_free_substring_list() can be used to free the memory
1336     returned by a previous call of pcre_get_substring() or
1337     pcre_get_substring_list(), respectively. They do nothing
1338     more than call the function pointed to by pcre_free, which
1339     of course could be called directly from a C program. How-
1340     ever, PCRE is used in some situations where it is linked via
1341     a special interface to another programming language which
1342     cannot use pcre_free directly; it is for these cases that
1343     the functions are provided.
1344 nigel 41
1345    
1346 nigel 63 EXTRACTING CAPTURED SUBSTRINGS BY NAME
1347 nigel 41
1348 nigel 63 int pcre_copy_named_substring(const pcre *code,
1349     const char *subject, int *ovector,
1350     int stringcount, const char *stringname,
1351     char *buffer, int buffersize);
1352 nigel 41
1353 nigel 63 int pcre_get_stringnumber(const pcre *code,
1354     const char *name);
1355 nigel 41
1356 nigel 63 int pcre_get_named_substring(const pcre *code,
1357     const char *subject, int *ovector,
1358     int stringcount, const char *stringname,
1359     const char **stringptr);
1360 nigel 41
To extract a substring by name, you first have to find the
associated number. This can be done by calling
pcre_get_stringnumber(). The first argument is the compiled
pattern, and the second is the name. For example, for this
pattern
1366 nigel 41
    ab(?P<xxx>\d+)...
1368    
1369     the number of the subpattern called "xxx" is 1. Given the
1370     number, you can then extract the substring directly, or use
1371     one of the functions described in the previous section. For
1372     convenience, there are also two functions that do the whole
1373     job.
1374    
1375     Most of the arguments of pcre_copy_named_substring() and
1376     pcre_get_named_substring() are the same as those for the
1377     functions that extract by number, and so are not re-
1378     described here. There are just two differences.
1379    
1380     First, instead of a substring number, a substring name is
1381     given. Second, there is an extra argument, given at the
1382     start, which is a pointer to the compiled pattern. This is
1383     needed in order to gain access to the name-to-number trans-
1384     lation table.
1385    
1386     These functions call pcre_get_stringnumber(), and if it
1387     succeeds, they then call pcre_copy_substring() or
1388     pcre_get_substring(), as appropriate.
1389    
1390 nigel 71 Last updated: 20 August 2003
1391 nigel 63 Copyright (c) 1997-2003 University of Cambridge.
1392     -----------------------------------------------------------------------------
1393    
1394     NAME
1395     PCRE - Perl-compatible regular expressions
1396    
1397    
1398     PCRE CALLOUTS
1399    
1400     int (*pcre_callout)(pcre_callout_block *);
1401    
1402     PCRE provides a feature called "callout", which is a means
1403     of temporarily passing control to the caller of PCRE in the
1404     middle of pattern matching. The caller of PCRE provides an
1405     external function by putting its entry point in the global
1406     variable pcre_callout. By default, this variable contains
1407     NULL, which disables all calling out.
1408    
1409     Within a regular expression, (?C) indicates the points at
1410     which the external function is to be called. Different cal-
1411     lout points can be identified by putting a number less than
1412     256 after the letter C. The default value is zero. For
1413     example, this pattern has two callout points:
1414    
1415     (?C1)\dabc(?C2)def
1416    
1417     During matching, when PCRE reaches a callout point (and
1418     pcre_callout is set), the external function is called. Its
1419     only argument is a pointer to a pcre_callout_block. This
1420     contains the following variables:
1421    
1422     int version;
1423     int callout_number;
1424     int *offset_vector;
1425     const char *subject;
1426     int subject_length;
1427     int start_match;
1428     int current_position;
1429     int capture_top;
1430     int capture_last;
1431     void *callout_data;
1432    
1433     The version field is an integer containing the version
1434     number of the block format. The current version is zero. The
1435     version number may change in future if additional fields are
1436     added, but the intention is never to remove any of the
1437     existing fields.
1438    
1439     The callout_number field contains the number of the callout,
1440     as compiled into the pattern (that is, the number after ?C).
1441    
1442     The offset_vector field is a pointer to the vector of
1443     offsets that was passed by the caller to pcre_exec(). The
1444     contents can be inspected in order to extract substrings
1445     that have been matched so far, in the same way as for
1446     extracting substrings after a match has completed.

1447     The subject and subject_length fields contain copies of the
1448     values that were passed to pcre_exec().
1449    
1450     The start_match field contains the offset within the subject
1451     at which the current match attempt started. If the pattern
1452     is not anchored, the callout function may be called several
1453     times for different starting points.
1454    
1455     The current_position field contains the offset within the
1456     subject of the current match pointer.
1457    
1458 nigel 71 The capture_top field contains one more than the number of
1459     the highest numbered captured substring so far. If no sub-
1460     strings have been captured, the value of capture_top is one.
1461 nigel 63
1462     The capture_last field contains the number of the most
1463     recently captured substring.
1464    
1465     The callout_data field contains a value that is passed to
1466     pcre_exec() by the caller specifically so that it can be
1467     passed back in callouts. It is passed in the callout_data
1468     field of the pcre_extra data structure. If no such data was
1469     passed, the value of callout_data in a pcre_callout_block is
1470     NULL. There is a description of the pcre_extra structure in
1471     the pcreapi documentation.
1472    
1473    
1474    
1475     RETURN VALUES
1476    
1477     The callout function returns an integer. If the value is
1478     zero, matching proceeds as normal. If the value is greater
1479     than zero, matching fails at the current point, but back-
1480     tracking to test other possibilities goes ahead, just as if
1481     a lookahead assertion had failed. If the value is less than
1482     zero, the match is abandoned, and pcre_exec() returns the
1483     value.
1484    
1485     Negative values should normally be chosen from the set of
1486     PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1487     forces a standard "no match" failure. The error number
1488     PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1489     it will never be used by PCRE itself.
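The three kinds of return value can be illustrated with a small callout function. This is a sketch under stated assumptions, not the library's own code: the struct below is a local stand-in that mirrors the fields listed earlier (real code would include pcre.h and use pcre_callout_block), the policy struct passed through callout_data is hypothetical, and SKETCH_ERROR_CALLOUT is a placeholder for PCRE_ERROR_CALLOUT.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mirroring the documented fields; real code would use
   pcre_callout_block from pcre.h instead. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} callout_block_sketch;

/* Placeholder negative code; real code would return PCRE_ERROR_CALLOUT. */
#define SKETCH_ERROR_CALLOUT (-9)

/* Hypothetical caller-defined data handed back through callout_data. */
typedef struct {
    int max_position;  /* callout 2 refuses to match past this offset */
    int calls;         /* number of callouts seen so far */
    int max_calls;     /* abandon the whole match after this many */
} callout_policy;

int sketch_callout(callout_block_sketch *cb)
{
    callout_policy *policy;

    assert(cb != NULL);
    policy = (callout_policy *)cb->callout_data;

    if (policy == NULL)
        return 0;                      /* no data supplied: proceed as normal */
    if (++policy->calls > policy->max_calls)
        return SKETCH_ERROR_CALLOUT;   /* < 0: match abandoned with this value */
    if (cb->callout_number == 2 &&
        cb->current_position > policy->max_position)
        return 1;                      /* > 0: fail here, backtracking goes on */
    return 0;                          /* 0: matching proceeds */
}
```

With the real library, a function of this shape taking a pcre_callout_block pointer would be installed by storing its entry point in pcre_callout.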
1490    
1491     Last updated: 21 January 2003
1492     Copyright (c) 1997-2003 University of Cambridge.
1493     -----------------------------------------------------------------------------
1494    
1495     NAME
1496     PCRE - Perl-compatible regular expressions
1497    
1498    
1499 nigel 41 DIFFERENCES FROM PERL
1500    
1501 nigel 63 This document describes the differences in the ways that
1502     PCRE and Perl handle regular expressions. The differences
1503     described here are with respect to Perl 5.8.
1504 nigel 41
1505 nigel 63 1. PCRE does not allow repeat quantifiers on lookahead
1506 nigel 41 assertions. Perl permits them, but they do not mean what you
1507     might think. For example, (?!a){3} does not assert that the
1508     next three characters are not "a". It just asserts that the
1509     next character is not "a" three times.
1510    
1511 nigel 63 2. Capturing subpatterns that occur inside negative looka-
1512 nigel 41 head assertions are counted, but their entries in the
1513     offsets vector are never set. Perl sets its numerical vari-
1514     ables from any such patterns that are matched before the
1515     assertion fails to match something (thereby succeeding), but
1516     only if the negative lookahead assertion contains just one
1517     branch.
1518    
1519 nigel 63 3. Though binary zero characters are supported in the sub-
1520 nigel 41 ject string, they are not allowed in a pattern string
1521     because it is passed as a normal C string, terminated by
1522     zero. The escape sequence "\0" can be used in the pattern to
1523     represent a binary zero.
1524    
1525 nigel 63 4. The following Perl escape sequences are not supported:
1526     \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1527     mented by Perl's general string-handling and are not part of
1528     its pattern matching engine. If any of these are encountered
1529     by PCRE, an error is generated.
1530 nigel 41
1531 nigel 63 5. PCRE does support the \Q...\E escape for quoting sub-
1532     strings. Characters in between are treated as literals. This
1533     is slightly different from Perl in that $ and @ are also
1534     handled as literals inside the quotes. In Perl, they cause
1535     variable interpolation (but of course PCRE does not have
1536     variables). Note the following examples:
1537 nigel 41
1538 nigel 63 Pattern            PCRE matches    Perl matches
1539 nigel 49
1540 nigel 63 \Qabc$xyz\E        abc$xyz         abc followed by the
1541                                                contents of $xyz
1542          \Qabc\$xyz\E       abc\$xyz        abc\$xyz
1543          \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
1544 nigel 41
1545 nigel 63 In PCRE, the \Q...\E sequence is recognized both inside and
1546     outside character classes.
1547 nigel 41
1548 nigel 63 6. Fairly obviously, PCRE does not support the (?{code}) and
1549     (?p{code}) constructions. However, there is some experimen-
1550     tal support for recursive patterns using the non-Perl items
1551     (?R), (?number) and (?P>name). Also, the PCRE "callout"
1552     feature allows an external function to be called during pat-
1553     tern matching.
1554 nigel 41
1555 nigel 63 7. There are some differences that are concerned with the
1556     settings of captured strings when part of a pattern is
1557     repeated. For example, matching "aba" against the pattern
1558     /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
1559     to "b".
1560    
1561 nigel 41 8. PCRE provides some extensions to the Perl regular
1562     expression facilities:
1563    
1564     (a) Although lookbehind assertions must match fixed length
1565     strings, each alternative branch of a lookbehind assertion
1566 nigel 63 can match a different length of string. Perl requires them
1567     all to have the same length.
1568 nigel 41
1569     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1570 nigel 63 set, the $ meta-character matches only at the very end of
1571 nigel 41 the string.
1572    
1573     (c) If PCRE_EXTRA is set, a backslash followed by a letter
1574     with no special meaning is faulted.
1575    
1576 nigel 43 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1577     tion quantifiers is inverted, that is, by default they are
1578     not greedy, but if followed by a question mark they are.
1579 nigel 41
1580     (e) PCRE_ANCHORED can be used to force a pattern to be tried
1581 nigel 63 only at the first matching position in the subject string.
1582 nigel 41
1583 nigel 63 (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
1584     for pcre_exec() and the PCRE_NO_AUTO_CAPTURE option for
1585     pcre_compile() have no Perl equivalents.
1586 nigel 41
1587 nigel 63 (g) The (?R), (?number), and (?P>name) constructs allow for
1588     recursive pattern matching (Perl can do this using the
1589     (?p{code}) construct, which PCRE cannot support.)
1590 nigel 41
1591 nigel 63 (h) PCRE supports named capturing substrings, using the
1592     Python syntax.
1593 nigel 41
1594 nigel 63 (i) PCRE supports the possessive quantifier "++" syntax,
1595     taken from Sun's Java package.
1596 nigel 43
1597 nigel 63 (j) The (R) condition, for testing recursion, is a PCRE
1598     extension.
1599    
1600     (k) The callout facility is PCRE-specific.
1601    
1602     Last updated: 03 February 2003
1603     Copyright (c) 1997-2003 University of Cambridge.
1604     -----------------------------------------------------------------------------
1605    
1606     NAME
1607     PCRE - Perl-compatible regular expressions
1608    
1609    
1610     PCRE REGULAR EXPRESSION DETAILS
1611    
1612 nigel 41 The syntax and semantics of the regular expressions sup-
1613     ported by PCRE are described below. Regular expressions are
1614     also described in the Perl documentation and in a number of
1615     other books, some of which have copious examples. Jeffrey
1616     Friedl's "Mastering Regular Expressions", published by
1617 nigel 63 O'Reilly, covers them in great detail. The description here
1618     is intended as reference documentation.
1619 nigel 49
1620     The basic operation of PCRE is on strings of bytes. However,
1621 nigel 63 there is also support for UTF-8 character strings. To use
1622     this support you must build PCRE to include UTF-8 support,
1623     and then call pcre_compile() with the PCRE_UTF8 option. How
1624     this affects the pattern matching is mentioned in several
1625     places below. There is also a summary of UTF-8 features in
1626     the section on UTF-8 support in the main pcre page.
1627 nigel 41
1628     A regular expression is a pattern that is matched against a
1629     subject string from left to right. Most characters stand for
1630     themselves in a pattern, and match the corresponding charac-
1631     ters in the subject. As a trivial example, the pattern
1632    
1633     The quick brown fox
1634    
1635     matches a portion of a subject string that is identical to
1636     itself. The power of regular expressions comes from the
1637     ability to include alternatives and repetitions in the pat-
1638     tern. These are encoded in the pattern by the use of meta-
1639     characters, which do not stand for themselves but instead
1640     are interpreted in some special way.
1641    
1642     There are two different sets of meta-characters: those that
1643     are recognized anywhere in the pattern except within square
1644     brackets, and those that are recognized in square brackets.
1645     Outside square brackets, the meta-characters are as follows:
1646    
1647          \      general escape character with several uses
1648 nigel 63 ^      assert start of string (or line, in multiline mode)
1649          $      assert end of string (or line, in multiline mode)
1650 nigel 41 .      match any character except newline (by default)
1651          [      start character class definition
1652          |      start of alternative branch
1653          (      start subpattern
1654          )      end subpattern
1655          ?      extends the meaning of (
1656                 also 0 or 1 quantifier
1657                 also quantifier minimizer
1658          *      0 or more quantifier
1659          +      1 or more quantifier
1660 nigel 63        also "possessive quantifier"
1661 nigel 41 {      start min/max quantifier
1662    
1663     Part of a pattern that is in square brackets is called a
1664     "character class". In a character class the only meta-
1665     characters are:
1666    
1667          \      general escape character
1668          ^      negate the class, but only if the first character
1669          -      indicates character range
1670 nigel 63 [      POSIX character class (only if followed by POSIX
1671                   syntax)
1672 nigel 41 ]      terminates the character class
1673    
1674     The following sections describe the use of each of the
1675     meta-characters.
1676    
1677    
1678 nigel 63 BACKSLASH
1679 nigel 41
1680     The backslash character has several uses. Firstly, if it is
1681     followed by a non-alphameric character, it takes away any
1682     special meaning that character may have. This use of
1683     backslash as an escape character applies both inside and
1684     outside character classes.
1685    
1686 nigel 63 For example, if you want to match a * character, you write
1687     \* in the pattern. This escaping action applies whether or
1688     not the following character would otherwise be interpreted
1689     as a meta-character, so it is always safe to precede a non-
1690     alphameric with backslash to specify that it stands for
1691     itself. In particular, if you want to match a backslash, you
1692     write \\.
1693 nigel 41
1694     If a pattern is compiled with the PCRE_EXTENDED option, whi-
1695     tespace in the pattern (other than in a character class) and
1696 nigel 63 characters between a # outside a character class and the
1697 nigel 41 next newline character are ignored. An escaping backslash
1698 nigel 63 can be used to include a whitespace or # character as part
1699 nigel 41 of the pattern.
1700    
1701 nigel 63 If you want to remove the special meaning from a sequence of
1702     characters, you can do so by putting them between \Q and \E.
1703     This is different from Perl in that $ and @ are handled as
1704     literals in \Q...\E sequences in PCRE, whereas in Perl, $
1705     and @ cause variable interpolation. Note the following exam-
1706     ples:
1707    
1708     Pattern            PCRE matches    Perl matches
1709    
1710     \Qabc$xyz\E        abc$xyz         abc followed by the
1712                                           contents of $xyz
1713     \Qabc\$xyz\E       abc\$xyz        abc\$xyz
1714     \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
1715    
1716     The \Q...\E sequence is recognized both inside and outside
1717     character classes.
1718    
1719 nigel 41 A second use of backslash provides a way of encoding non-
1720     printing characters in patterns in a visible manner. There
1721     is no restriction on the appearance of non-printing charac-
1722     ters, apart from the binary zero that terminates a pattern,
1723     but when a pattern is being prepared by text editing, it is
1724     usually easier to use one of the following escape sequences
1725     than the binary character it represents:
1726    
1727 nigel 63 \a alarm, that is, the BEL character (hex 07)
1728     \cx "control-x", where x is any character
1729     \e escape (hex 1B)
1730     \f formfeed (hex 0C)
1731     \n newline (hex 0A)
1732     \r carriage return (hex 0D)
1733     \t tab (hex 09)
1734     \ddd character with octal code ddd, or backreference
1735     \xhh character with hex code hh
1736     \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1737 nigel 41
1738 nigel 63 The precise effect of \cx is as follows: if x is a lower
1739 nigel 41 case letter, it is converted to upper case. Then bit 6 of
1740 nigel 63 the character (hex 40) is inverted. Thus \cz becomes hex
1741     1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
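The rule for \cx can be written out directly. The following is a minimal sketch that covers ASCII characters only; the function name is hypothetical and is not taken from the PCRE source.

```c
#include <assert.h>
#include <ctype.h>

/* Value produced by \cx for an ASCII character x: a lower case letter
   is first converted to upper case, then bit 6 (hex 40) is inverted. */
int control_escape_value(int x)
{
    assert(x >= 0 && x < 128);   /* the sketch handles ASCII only */
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

Applied to the examples in the text, 'z' gives hex 1A, '{' gives hex 3B, and ';' gives hex 7B.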
1742 nigel 41
1743 nigel 63 After \x, from zero to two hexadecimal digits are read
1744     (letters can be in upper or lower case). In UTF-8 mode, any
1745     number of hexadecimal digits may appear between \x{ and },
1746     but the value of the character code must be less than 2**31
1747     (that is, the maximum hexadecimal value is 7FFFFFFF). If
1748     characters other than hexadecimal digits appear between \x{
1749     and }, or if there is no terminating }, this form of escape
1750     is not recognized. Instead, the initial \x will be inter-
1751     preted as a basic hexadecimal escape, with no following
1752     digits, giving a byte whose value is zero.
1753 nigel 41
1754 nigel 63 Characters whose value is less than 256 can be defined by
1755     either of the two syntaxes for \x when PCRE is in UTF-8
1756     mode. There is no difference in the way they are handled.
1757     For example, \xdc is exactly the same as \x{dc}.
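The basic \xhh form can be sketched as a small reader that takes zero, one, or two hexadecimal digits. The function names are hypothetical, and the sketch deliberately does not parse the \x{...} form; it only shows what the basic form yields, including the zero byte that results when no hexadecimal digit follows.

```c
#include <assert.h>
#include <ctype.h>

/* Numeric value of one hexadecimal digit (caller guarantees isxdigit). */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* s points just past "\x". Reads from zero to two hexadecimal digits,
   stores how many were used in *consumed, and returns the byte value. */
int read_hex_escape(const char *s, int *consumed)
{
    int value = 0, n = 0;

    assert(s != 0 && consumed != 0);
    while (n < 2 && isxdigit((unsigned char)s[n])) {
        value = value * 16 + hex_digit_value((unsigned char)s[n]);
        n++;
    }
    *consumed = n;
    return value;   /* zero when no digits follow */
}
```

Given the text after \x{zz}, the reader consumes nothing and returns zero, which corresponds to the fallback described above for a malformed \x{ sequence.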
1758    
1759     After \0 up to two further octal digits are read. In both
1760 nigel 41 cases, if there are fewer than two digits, just those that
1761 nigel 63 are present are used. Thus the sequence \0\x\07 specifies
1762     two binary zeros followed by a BEL character (code value 7).
1763     Make sure you supply two digits after the initial zero if
1764     the character that follows is itself an octal digit.
1765 nigel 41
1766     The handling of a backslash followed by a digit other than 0
1767     is complicated. Outside a character class, PCRE reads it
1768     and any following digits as a decimal number. If the number
1769     is less than 10, or if there have been at least that many
1770     previous capturing left parentheses in the expression, the
1771     entire sequence is taken as a back reference. A description
1772     of how this works is given later, following the discussion
1773     of parenthesized subpatterns.
1774    
1775     Inside a character class, or if the decimal number is
1776     greater than 9 and there have not been that many capturing
1777     subpatterns, PCRE re-reads up to three octal digits follow-
1778     ing the backslash, and generates a single byte from the
1779     least significant 8 bits of the value. Any subsequent digits
1780     stand for themselves. For example:
1781    
1782          \040   is another way of writing a space
1783          \40    is the same, provided there are fewer than 40
1784                    previous capturing subpatterns
1785          \7     is always a back reference
1786          \11    might be a back reference, or another way of
1787                    writing a tab
1788          \011   is always a tab
1789          \0113  is a tab followed by the character "3"
1790 nigel 63 \113   might be a back reference, otherwise the
1791                    character with octal code 113
1792          \377   might be a back reference, otherwise
1793                    the byte consisting entirely of 1 bits
1794 nigel 41 \81    is either a back reference, or a binary zero
1795                    followed by the two characters "8" and "1"
1796    
1797     Note that octal values of 100 or greater must not be intro-
1798     duced by a leading zero, because no more than three octal
1799     digits are ever read.
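The decision between a back reference and an octal byte, for a backslash followed by a digit other than 0, can be sketched as follows. This is an illustration of the rules as stated above, not PCRE's own code: the type and function names are hypothetical, the count of previous capturing parentheses is supplied by the caller, and there is no guard against overflow for very long digit runs.

```c
#include <assert.h>
#include <ctype.h>

typedef struct {
    int is_backref;  /* 1: back reference, 0: single byte value */
    int value;       /* group number, or byte value */
    int consumed;    /* digits used after the backslash */
} digit_escape;

/* digits points just past the backslash, at a digit from 1 to 9. */
digit_escape read_digit_escape(const char *digits, int in_class,
                               int groups_so_far)
{
    digit_escape r = {0, 0, 0};
    int n = 0, len = 0;

    assert(digits != 0 && digits[0] >= '1' && digits[0] <= '9');

    /* First reading: the whole run of digits as a decimal number. */
    while (isdigit((unsigned char)digits[len])) {
        n = n * 10 + (digits[len] - '0');
        len++;
    }
    if (!in_class && (n < 10 || n <= groups_so_far)) {
        r.is_backref = 1;
        r.value = n;
        r.consumed = len;
        return r;
    }

    /* Second reading: up to three octal digits, keeping the low 8 bits.
       Any digits left over stand for themselves. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (digits[r.consumed] - '0');
        r.consumed++;
    }
    r.value &= 0xFF;
    return r;
}
```

For \81 with no previous capturing subpatterns, the second reading finds no octal digit, so the result is a zero byte with nothing consumed, and "8" and "1" remain as literal characters.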
1800 nigel 43
1801 nigel 63 All the sequences that define a single byte value or a sin-
1802     gle UTF-8 character (in UTF-8 mode) can be used both inside
1803     and outside character classes. In addition, inside a charac-
1804     ter class, the sequence \b is interpreted as the backspace
1805     character (hex 08). Outside a character class it has a dif-
1806     ferent meaning (see below).
1807 nigel 41
1808     The third use of backslash is for specifying generic charac-
1809     ter types:
1810    
1811     \d any decimal digit
1812     \D any character that is not a decimal digit
1813     \s any whitespace character
1814     \S any character that is not a whitespace character
1815     \w any "word" character
1816 nigel 63 \W any "non-word" character
1817 nigel 41
1818     Each pair of escape sequences partitions the complete set of
1819     characters into two disjoint sets. Any given character
1820     matches one, and only one, of each pair.
1821    
1822 nigel 63 In UTF-8 mode, characters with values greater than 255 never
1823     match \d, \s, or \w, and always match \D, \S, and \W.
1824    
1825     For compatibility with Perl, \s does not match the VT char-
1826     acter (code 11). This makes it different from the POSIX
1827     "space" class. The \s characters are HT (9), LF (10), FF
1828     (12), CR (13), and space (32).
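The difference from the POSIX "space" class can be checked with a short sketch. The function names are hypothetical, and the comparison uses the C library's isspace() in the default "C" locale as a stand-in for the POSIX class.

```c
#include <assert.h>
#include <ctype.h>

/* Characters matched by \s as listed above: HT (9), LF (10), FF (12),
   CR (13) and space (32). VT (11) is deliberately absent. */
int matches_backslash_s(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Non-zero when \s and isspace() disagree about character c. */
int differs_from_c_isspace(int c)
{
    assert(c >= 0 && c <= 255);   /* isspace() needs an unsigned char value */
    return matches_backslash_s(c) != (isspace(c) != 0);
}
```

In the "C" locale the two disagree only for code 11, the VT character.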
1829    
1830 nigel 41 A "word" character is any letter or digit or the underscore
1831     character, that is, any character which can be part of a
1832     Perl "word". The definition of letters and digits is con-
1833     trolled by PCRE's character tables, and may vary if locale-
1834 nigel 63 specific matching is taking place (see "Locale support" in
1835     the pcreapi page). For example, in the "fr" (French) locale,
1836     some character codes greater than 128 are used for accented
1837     letters, and these are matched by \w.
1838 nigel 41
1839     These character type sequences can appear both inside and
1840     outside character classes. They each match one character of
1841     the appropriate type. If the current matching point is at
1842     the end of the subject string, all of them fail, since there
1843     is no character to match.
1844    
1845     The fourth use of backslash is for certain simple asser-
1846     tions. An assertion specifies a condition that has to be met
1847     at a particular point in a match, without consuming any
1848     characters from the subject string. The use of subpatterns
1849     for more complicated assertions is described below. The
1850     backslashed assertions are
1851    
1852 nigel 63 \b matches at a word boundary
1853     \B matches when not at a word boundary
1854     \A matches at start of subject
1855     \Z matches at end of subject or before newline at end
1856     \z matches at end of subject
1857     \G matches at first matching position in subject
1858 nigel 41
1859     These assertions may not appear in character classes (but
1860 nigel 63 note that \b has a different meaning, namely the backspace
1861 nigel 41 character, inside a character class).
1862 nigel 43
1863 nigel 41 A word boundary is a position in the subject string where
1864     the current character and the previous character do not both
1865     match \w or \W (i.e. one matches \w and the other matches
1866     \W), or the start or end of the string if the first or last
1867     character matches \w, respectively.
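The word boundary definition can be expressed as a comparison of the characters on either side of a position. This is a sketch only: the function names are hypothetical, and \w is approximated with isalnum() plus underscore rather than PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>

/* Approximation of \w: a letter, a digit, or the underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Is there a word boundary immediately before subject[pos]?
   pos may range from 0 (start) to length (end). */
int at_word_boundary(const char *subject, int length, int pos)
{
    int prev_is_word, cur_is_word;

    assert(subject != 0 && pos >= 0 && pos <= length);
    /* Outside the string counts as "not a word character", which gives
       the start and end rules described above. */
    prev_is_word = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    cur_is_word = pos < length && is_word_char((unsigned char)subject[pos]);
    return prev_is_word != cur_is_word;
}
```

For the subject "ab cd", the boundaries fall at offsets 0, 2, 3 and 5.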

1868     The \A, \Z, and \z assertions differ from the traditional
1869     circumflex and dollar (described below) in that they only
1870     ever match at the very start and end of the subject string,
1871 nigel 63 whatever options are set. Thus, they are independent of mul-
1872     tiline mode.
1873    
1874     They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1875     options. If the startoffset argument of pcre_exec() is non-
1876     zero, indicating that matching is to start at a point other
1877     than the beginning of the subject, \A can never match. The
1878 nigel 41 difference between \Z and \z is that \Z matches before a
1879     newline that is the last character of the string as well as
1880     at the end of the string, whereas \z matches only at the
1881     end.
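The difference between \Z and \z comes down to one extra case, as the following sketch shows. The function names are hypothetical; positions are byte offsets into the subject, with length meaning the very end.

```c
#include <assert.h>

/* \A: only at the very start of the subject. */
int at_backslash_A(int pos)
{
    return pos == 0;
}

/* \z: only at the very end of the subject. */
int at_backslash_z(int length, int pos)
{
    return pos == length;
}

/* \Z: at the very end, or just before a newline that is the last
   character of the subject. */
int at_backslash_Z(const char *subject, int length, int pos)
{
    assert(subject != 0 && pos >= 0 && pos <= length);
    return pos == length ||
           (pos == length - 1 && subject[pos] == '\n');
}
```

For the four-byte subject "abc\n", \Z holds at offsets 3 and 4, while \z holds only at offset 4.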
1882    
1883 nigel 63 The \G assertion is true only when the current matching
1884     position is at the start point of the match, as specified by
1885     the startoffset argument of pcre_exec(). It differs from \A
1886     when the value of startoffset is non-zero. By calling
1887     pcre_exec() multiple times with appropriate arguments, you
1888     can mimic Perl's /g option, and it is in this kind of imple-
1889     mentation where \G can be useful.
1890 nigel 41
1891 nigel 63 Note, however, that PCRE's interpretation of \G, as the
1892     start of the current match, is subtly different from Perl's,
1893     which defines it as the end of the previous match. In Perl,
1894     these can be different when the previously matched string
1895     was empty. Because PCRE does just one match at a time, it
1896     cannot reproduce this behaviour.
1897 nigel 41
1898 nigel 63 If all the alternatives of a pattern begin with \G, the
1899     expression is anchored to the starting match position, and
1900     the "anchored" flag is set in the compiled regular expres-
1901     sion.
1902    
1903    
1904 nigel 41 CIRCUMFLEX AND DOLLAR
1905 nigel 63
1906 nigel 41 Outside a character class, in the default matching mode, the
1907     circumflex character is an assertion which is true only if
1908     the current matching point is at the start of the subject
1909     string. If the startoffset argument of pcre_exec() is non-
1910 nigel 63 zero, circumflex can never match if the PCRE_MULTILINE
1911     option is unset. Inside a character class, circumflex has an
1912     entirely different meaning (see below).
1913 nigel 41
1914     Circumflex need not be the first character of the pattern if
1915     a number of alternatives are involved, but it should be the
1916     first thing in each alternative in which it appears if the
1917     pattern is ever to match that branch. If all possible alter-
1918     natives start with a circumflex, that is, if the pattern is
1919     constrained to match only at the start of the subject, it is
1920     said to be an "anchored" pattern. (There are also other con-
1921     structs that can cause a pattern to be anchored.)
1922    
1923     A dollar character is an assertion which is true only if the
1924     current matching point is at the end of the subject string,
1925     or immediately before a newline character that is the last
1926     character in the string (by default). Dollar need not be the
1927     last character of the pattern if a number of alternatives
1928     are involved, but it should be the last item in any branch
1929     in which it appears. Dollar has no special meaning in a
1930     character class.
1931    
1932     The meaning of dollar can be changed so that it matches only
1933     at the very end of the string, by setting the
1934 nigel 63 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1935     affect the \Z assertion.
1936 nigel 41
1937     The meanings of the circumflex and dollar characters are
1938     changed if the PCRE_MULTILINE option is set. When this is
1939     the case, they match immediately after and immediately
1940 nigel 63 before an internal newline character, respectively, in addi-
1941     tion to matching at the start and end of the subject string.
1942     For example, the pattern /^abc$/ matches the subject string
1943 nigel 41 "def\nabc" in multiline mode, but not otherwise. Conse-
1944     quently, patterns that are anchored in single line mode
1945 nigel 63 because all branches start with ^ are not anchored in multi-
1946     line mode, and a match for circumflex is possible when the
1947 nigel 41 startoffset argument of pcre_exec() is non-zero. The
1948     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1949     set.
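The effect of the options on circumflex and dollar can be summarized as two position tests. This is a sketch under stated assumptions, not PCRE's matching code: the function names and the plain int flags standing in for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY are hypothetical, "internal newline" is read as a newline that is not the final character of the subject, and PCRE_NOTBOL, PCRE_NOTEOL and startoffset are ignored.

```c
#include <assert.h>

/* Does ^ match at offset pos (0 <= pos <= length)? */
int circumflex_matches(const char *subject, int length, int pos,
                       int multiline)
{
    assert(subject != 0 && pos >= 0 && pos <= length);
    if (pos == 0)
        return 1;                       /* start of the subject */
    /* Assumption: only a newline with something after it is "internal". */
    return multiline && pos < length && subject[pos - 1] == '\n';
}

/* Does $ match at offset pos (0 <= pos <= length)? */
int dollar_matches(const char *subject, int length, int pos,
                   int multiline, int dollar_endonly)
{
    assert(subject != 0 && pos >= 0 && pos <= length);
    if (pos == length)
        return 1;                       /* very end of the subject */
    if (subject[pos] != '\n')
        return 0;
    if (multiline)
        return 1;                       /* before any newline; ENDONLY ignored */
    return !dollar_endonly && pos == length - 1;  /* before a final newline */
}
```

With the subject "def\nabc", circumflex matches at offset 4 and dollar at offset 3 only when the multiline flag is set, which is why /^abc$/ matches that subject in multiline mode but not otherwise.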
1950    
1951     Note that the sequences \A, \Z, and \z can be used to match
1952     the start and end of the subject in both modes, and if all
1953 nigel 53 branches of a pattern start with \A it is always anchored,
1954 nigel 41 whether PCRE_MULTILINE is set or not.
1955    
1956    
1957 nigel 63 FULL STOP (PERIOD, DOT)
1958 nigel 41
1959     Outside a character class, a dot in the pattern matches any
1960     one character in the subject, including a non-printing char-
1961 nigel 63 acter, but not (by default) newline. In UTF-8 mode, a dot
1962     matches any UTF-8 character, which might be more than one
1963     byte long, except (by default) for newline. If the
1964     PCRE_DOTALL option is set, dots match newlines as well. The
1965     handling of dot is entirely independent of the handling of
1966     circumflex and dollar, the only relationship being that they
1967     both involve newline characters. Dot has no special meaning
1968     in a character class.
1969 nigel 41
1970    
1971    
1972 nigel 63 MATCHING A SINGLE BYTE
1973    
1974     Outside a character class, the escape sequence \C matches
1975     any one byte, both in and out of UTF-8 mode. Unlike a dot,
1976     it always matches a newline. The feature is provided in Perl
1977     in order to match individual bytes in UTF-8 mode. Because
1978     it breaks up UTF-8 characters into individual bytes, what
1979     remains in the string may be a malformed UTF-8 string. For
1980     this reason it is best avoided.
1981    
1982     PCRE does not allow \C to appear in lookbehind assertions
1983     (see below), because in UTF-8 mode it makes it impossible to
1984     calculate the length of the lookbehind.
1985    
1986    
1987 nigel 41 SQUARE BRACKETS
1988 nigel 63
1989 nigel 41 An opening square bracket introduces a character class, ter-
1990     minated by a closing square bracket. A closing square
1991     bracket on its own is not special. If a closing square
1992     bracket is required as a member of the class, it should be
1993     the first data character in the class (after an initial cir-
1994     cumflex, if present) or escaped with a backslash.
1995    
1996 nigel 63 A character class matches a single character in the subject.
1997     In UTF-8 mode, the character may occupy more than one byte.
1998     A matched character must be in the set of characters defined
1999     by the class, unless the first character in the class defin-
2000     ition is a circumflex, in which case the subject character
2001     must not be in the set defined by the class. If a circumflex
2002     is actually required as a member of the class, ensure it is
2003     not the first character, or escape it with a backslash.
2004 nigel 41
2005     For example, the character class [aeiou] matches any lower
2006     case vowel, while [^aeiou] matches any character that is not
2007     a lower case vowel. Note that a circumflex is just a con-
2008     venient notation for specifying the characters which are in
2009     the class by enumerating those that are not. It is not an
2010     assertion: it still consumes a character from the subject
2011     string, and fails if the current pointer is at the end of
2012     the string.
2013    
2014 nigel 63 In UTF-8 mode, characters with values greater than 255 can
2015     be included in a class as a literal string of bytes, or by
2016     using the \x{ escaping mechanism.
2017    
2018 nigel 41 When caseless matching is set, any letters in a class
2019     represent both their upper case and lower case versions, so
2020     for example, a caseless [aeiou] matches "A" as well as "a",
2021     and a caseless [^aeiou] does not match "A", whereas a case-
2022 nigel 63 ful version would. PCRE does not support the concept of case
2023     for characters with values greater than 255.

2024 nigel 41 The newline character is never treated in any special way in
2025     character classes, whatever the setting of the PCRE_DOTALL
2026     or PCRE_MULTILINE options is. A class such as [^a] will
2027     always match a newline.
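
The points above about negated classes can be checked with a
short sketch. It uses Python's re module as a stand-in for PCRE
(an assumption: the two are separate engines, but they agree on
this basic class syntax). The variable names are illustrative
only.

```python
import re

# Python's re module stands in for PCRE here (assumption: both
# engines agree on basic character-class syntax).
lower_vowel = re.compile(r"[aeiou]")
not_lower_vowel = re.compile(r"[^aeiou]")

# A negated class still has to consume one character, so it
# cannot match at the very end of the subject.
at_end = re.compile(r"b[^aeiou]").search("ab")

# A negated class matches a newline without any DOTALL-style flag.
newline_hit = re.search(r"[^a]", "\n")
```

Here at_end is None because nothing follows the "b", while
newline_hit is a successful match on the newline character.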
2028    
2029     The minus (hyphen) character can be used to specify a range
2030     of characters in a character class. For example, [d-m]
2031     matches any letter between d and m, inclusive. If a minus
2032     character is required in a class, it must be escaped with a
2033     backslash or appear in a position where it cannot be inter-
2034     preted as indicating a range, typically as the first or last
2035     character in the class.
2036    
2037     It is not possible to have the literal character "]" as the
2038     end character of a range. A pattern such as [W-]46] is
2039     interpreted as a class of two characters ("W" and "-") fol-
2040     lowed by a literal string "46]", so it would match "W46]" or
2041     "-46]". However, if the "]" is escaped with a backslash it
2042     is interpreted as the end of range, so [W-\]46] is inter-
2043     preted as a single class containing a range followed by two
2044     separate characters. The octal or hexadecimal representation
2045     of "]" can also be used to end a range.
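
A minimal sketch of the two interpretations, again using Python's
re module as a stand-in for PCRE (an assumption; Python happens to
parse these two classes the same way). The variable names are
hypothetical.

```python
import re

# "]" cannot end a range unescaped: [W-] is a two-character class
# followed by the literal text "46]".
two_char_class = re.compile(r"[W-]46]")

# Escaped, "]" ends the range W-], and "4" and "6" join the class.
range_class = re.compile(r"[W-\]46]")
```

The first pattern matches the four-character subjects "W46]" and
"-46]"; the second matches one character from the range W to ],
or "4", or "6".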
2046    
2047 nigel 63 Ranges operate in the collating sequence of character
2048     values. They can also be used for characters specified
2049     numerically, for example [\000-\037]. In UTF-8 mode, ranges
2050     can include characters whose values are greater than 255,
2051     for example [\x{100}-\x{2ff}].
2052 nigel 41
2053 nigel 63 If a range that includes letters is used when caseless
2054     matching is set, it matches the letters in either case. For
2055     example, [W-c] is equivalent to [][\\^_`wxyzabc], matched
2056     caselessly, and if character tables for the "fr" locale are
2057     in use, [\xc8-\xcb] matches accented E characters in both
2058     cases.
2059    
2060 nigel 41 The character types \d, \D, \s, \S, \w, and \W may also
2061     appear in a character class, and add the characters that
2062     they match to the class. For example, [\dABCDEF] matches any
2063     hexadecimal digit. A circumflex can conveniently be used
2064     with the upper case character types to specify a more res-
2065     tricted set of characters than the matching lower case type.
2066     For example, the class [^\W_] matches any letter or digit,
2067     but not underscore.
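
The two classes mentioned above can be tried out as follows. This
is a sketch using Python's re module rather than PCRE itself; the
re.ASCII flag is a Python-specific assumption that keeps \d and \W
to the ASCII range, which is closer to PCRE's default behaviour.

```python
import re

# re.ASCII restricts \d and \W to ASCII, approximating PCRE
# without UTF-8 mode (this flag is Python-specific).
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]+", re.ASCII)

# Underscore and space both end a run of letters and digits.
words = letter_or_digit.findall("foo_bar42 baz")
```

With this input, words holds the runs "foo", "bar42" and "baz".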
2068    
2069     All non-alphanumeric characters other than \, -, ^ (at the
2070     start) and the terminating ] are non-special in character
2071     classes, but it does no harm if they are escaped.
2072    
2073    
2074 nigel 43 POSIX CHARACTER CLASSES
2075    
2076 nigel 63 Perl supports the POSIX notation for character classes,
2077     which uses names enclosed by [: and :] within the enclosing
2078     square brackets. PCRE also supports this notation. For exam-
2079     ple,
2080    
2081 nigel 43 [01[:alpha:]%]
2082    
2083     matches "0", "1", any alphabetic character, or "%". The sup-
2084     ported class names are
2085    
2086     alnum letters and digits
2087     alpha letters
2088     ascii character codes 0 - 127
2089 nigel 63 blank space or tab only
2090 nigel 43 cntrl control characters
2091     digit decimal digits (same as \d)
2092     graph printing characters, excluding space
2093     lower lower case letters
2094     print printing characters, including space
2095     punct printing characters, excluding letters and digits
2096 nigel 63 space white space (not quite the same as \s)
2097 nigel 43 upper upper case letters
2098     word "word" characters (same as \w)
2099     xdigit hexadecimal digits
2100    
2101 nigel 63 The "space" characters are HT (9), LF (10), VT (11), FF
2102     (12), CR (13), and space (32). Notice that this list
2103     includes the VT character (code 11). This makes "space" dif-
2104     ferent to \s, which does not include VT (for Perl compati-
2105     bility).
2106 nigel 43
2107 nigel 63 The name "word" is a Perl extension, and "blank" is a GNU
2108     extension from Perl 5.8. Another Perl extension is negation,
2109     which is indicated by a ^ character after the colon. For
2110     example,
2111    
2112 nigel 43 [12[:^digit:]]
2113    
2114     matches "1", "2", or any non-digit. PCRE (and Perl) also
2115 nigel 53 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2116 nigel 43 "collating element", but these are not supported, and an
2117     error is given if they are encountered.
2118    
2119 nigel 63 In UTF-8 mode, characters with values greater than 255 do
2120     not match any of the POSIX character classes.
2121 nigel 43
2122    
2123 nigel 41 VERTICAL BAR
2124 nigel 63
2125 nigel 41 Vertical bar characters are used to separate alternative
2126     patterns. For example, the pattern
2127    
2128     gilbert|sullivan
2129    
2130     matches either "gilbert" or "sullivan". Any number of alter-
2131     natives may appear, and an empty alternative is permitted
2132     (matching the empty string). The matching process tries
2133     each alternative in turn, from left to right, and the first
2134     one that succeeds is used. If the alternatives are within a
2135     subpattern (defined below), "succeeds" means matching the
2136     rest of the main pattern as well as the alternative in the
2137     subpattern.
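
The left-to-right rule is easy to observe in a small sketch. It
uses Python's re module as a stand-in for PCRE (an assumption;
both are backtracking matchers that try alternatives in order).
The second and third patterns are made up for the illustration.

```python
import re

alternatives = re.compile(r"gilbert|sullivan")

# Alternatives are tried left to right; the first that succeeds
# wins, even when a later one would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# An empty alternative is allowed and matches the empty string.
empty_ok = re.fullmatch(r"yes|", "")
```

Note that first_wins is "a", not "ab": the matcher does not look
for the longest alternative.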
2138    
2139    
2140     INTERNAL OPTION SETTING
2141    
2142 nigel 63 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2143     PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2144     within the pattern by a sequence of Perl option letters
2145     enclosed between "(?" and ")". The option letters are
2146    
2147 nigel 41 i for PCRE_CASELESS
2148     m for PCRE_MULTILINE
2149     s for PCRE_DOTALL
2150     x for PCRE_EXTENDED
2151    
2152     For example, (?im) sets caseless, multiline matching. It is
2153     also possible to unset these options by preceding the letter
2154     with a hyphen, and a combined setting and unsetting such as
2155     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2156     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2157     If a letter appears both before and after the hyphen, the
2158     option is unset.
2159    
2160 nigel 63 When an option change occurs at top level (that is, not
2161     inside subpattern parentheses), the change applies to the
2162     remainder of the pattern that follows. If the change is
2163     placed right at the start of a pattern, PCRE extracts it
2164     into the global options (and it will therefore show up in
2165     data extracted by the pcre_fullinfo() function).
2166 nigel 41
2167 nigel 63 An option change within a subpattern affects only that part
2168     of the current pattern that follows it, so
2169 nigel 41
2170     (a(?i)b)c
2171    
2172     matches abc and aBc and no other strings (assuming
2173     PCRE_CASELESS is not used). By this means, options can be
2174     made to have different settings in different parts of the
2175     pattern. Any changes made in one alternative do carry on
2176     into subsequent branches within the same subpattern. For
2177     example,
2178    
2179     (a(?i)b|c)
2180    
2181     matches "ab", "aB", "c", and "C", even though when matching
2182     "C" the first branch is abandoned before the option setting.
2183     This is because the effects of option settings happen at
2184     compile time. There would be some very weird behaviour oth-
2185     erwise.
2186    
2187     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2188     be changed in the same way as the Perl-compatible options by
2189     using the characters U and X respectively. The (?X) flag
2190     setting is special in that it must always occur earlier in
2191     the pattern than any of the additional features it turns on,
2192     even when it is at top level. It is best put at the start.
2193    
2194    
2195 nigel 63 SUBPATTERNS
2196 nigel 41
2197     Subpatterns are delimited by parentheses (round brackets),
2198     which can be nested. Marking part of a pattern as a subpat-
2199     tern does two things:
2200    
2201     1. It localizes a set of alternatives. For example, the pat-
2202     tern
2203    
2204     cat(aract|erpillar|)
2205    
2206     matches one of the words "cat", "cataract", or "caterpil-
2207     lar". Without the parentheses, it would match "cataract",
2208     "erpillar" or the empty string.
2209    
2210     2. It sets up the subpattern as a capturing subpattern (as
2211     defined above). When the whole pattern matches, that por-
2212     tion of the subject string that matched the subpattern is
2213     passed back to the caller via the ovector argument of
2214     pcre_exec(). Opening parentheses are counted from left to
2215     right (starting from 1) to obtain the numbers of the captur-
2216     ing subpatterns.
2217    
2218     For example, if the string "the red king" is matched against
2219     the pattern
2220    
2221     the ((red|white) (king|queen))
2222    
2223     the captured substrings are "red king", "red", and "king",
2224 nigel 53 and are numbered 1, 2, and 3, respectively.
2225 nigel 41
2226     The fact that plain parentheses fulfil two functions is not
2227     always helpful. There are often times when a grouping sub-
2228     pattern is required without a capturing requirement. If an
2229 nigel 63 opening parenthesis is followed by a question mark and a
2230     colon, the subpattern does not do any capturing, and is not
2231     counted when computing the number of any subsequent captur-
2232     ing subpatterns. For example, if the string "the white
2233     queen" is matched against the pattern
2234 nigel 41
2235     the ((?:red|white) (king|queen))
2236    
2237     the captured substrings are "white queen" and "queen", and
2238 nigel 63 are numbered 1 and 2. The maximum number of capturing sub-
2239     patterns is 65535, and the maximum depth of nesting of all
2240     subpatterns, both capturing and non-capturing, is 200.
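
The numbering in the two examples above can be reproduced with
Python's re module, used here as a stand-in for PCRE (an
assumption; group numbering follows the same rule in both). In
PCRE itself the substrings would come back through the ovector
argument of pcre_exec() rather than a match object.

```python
import re

# Three capturing groups, numbered by opening parenthesis.
capturing = re.match(r"the ((red|white) (king|queen))",
                     "the red king")

# (?:...) groups without capturing, so only two groups remain.
grouping_only = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")
```

capturing.groups() yields "red king", "red" and "king";
grouping_only.groups() yields "white queen" and "queen".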
2241 nigel 41
2242     As a convenient shorthand, if any option settings are
2243     required at the start of a non-capturing subpattern, the
2244     option letters may appear between the "?" and the ":". Thus
2245     the two patterns
2246    
2247     (?i:saturday|sunday)
2248     (?:(?i)saturday|sunday)
2249    
2250     match exactly the same set of strings. Because alternative
2251     branches are tried from left to right, and options are not
2252     reset until the end of the subpattern is reached, an option
2253     setting in one branch does affect subsequent branches, so
2254     the above patterns match "SUNDAY" as well as "Saturday".
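
A brief sketch of the shorthand form, using Python's re module as
a stand-in for PCRE. Only the first of the two spellings is shown,
because Python treats a bare (?i) as a whole-pattern flag rather
than a local setting; that difference is specific to Python. The
pattern with the "on " prefix is a hypothetical addition to show
that the option stays inside the group.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")

# The option applies only inside the group: "on " stays caseful.
scoped = re.compile(r"on (?i:saturday|sunday)")
```

Both branches are caseless, so days matches "SUNDAY" as well as
"Saturday", while scoped rejects "ON sunday".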
2255    
2256    
2257 nigel 63 NAMED SUBPATTERNS
2258 nigel 41
2259 nigel 63 Identifying capturing parentheses by number is simple, but
2260     it can be very hard to keep track of the numbers in compli-
2261     cated regular expressions. Furthermore, if an expression is
2262     modified, the numbers may change. To help with the diffi-
2263     culty, PCRE supports the naming of subpatterns, something
2264     that Perl does not provide. The Python syntax (?P<name>...)
2265     is used. Names consist of alphanumeric characters and under-
2266     scores, and must be unique within a pattern.
2267    
2268     Named capturing parentheses are still allocated numbers as
2269     well as names. The PCRE API provides function calls for
2270     extracting the name-to-number translation table from a com-
2271     piled pattern. For further details see the pcreapi documen-
2272     tation.
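
Since the syntax is borrowed from Python, it can be illustrated
directly with Python's re module. This is a sketch of the syntax
only, not of the PCRE API calls; the date pattern and the group
names "year" and "month" are invented for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2007-02")

# Named groups are still numbered; groupindex is Python's
# counterpart of a name-to-number table.
name_to_number = dict(date.groupindex)
```

Each group can be fetched either by name or by number, for
example m.group("year") and m.group(1).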
2273    
2274    
2275 nigel 41 REPETITION
2276 nigel 63
2277 nigel 41 Repetition is specified by quantifiers, which can follow any
2278     of the following items:
2279    
2280 nigel 63 a literal data character
2281 nigel 41 the . metacharacter
2282 nigel 63 the \C escape sequence
2283     escapes such as \d that match single characters
2284 nigel 41 a character class
2285     a back reference (see next section)
2286 nigel 63 a parenthesized subpattern (unless it is an assertion)
2287 nigel 41
2288     The general repetition quantifier specifies a minimum and
2289     maximum number of permitted matches, by giving the two
2290     numbers in curly brackets (braces), separated by a comma.
2291     The numbers must be less than 65536, and the first must be
2292     less than or equal to the second. For example:
2293    
2294     z{2,4}
2295    
2296     matches "zz", "zzz", or "zzzz". A closing brace on its own
2297     is not a special character. If the second number is omitted,
2298     but the comma is present, there is no upper limit; if the
2299     second number and the comma are both omitted, the quantifier
2300     specifies an exact number of required matches. Thus
2301    
2302     [aeiou]{3,}
2303    
2304     matches at least 3 successive vowels, but may match many
2305     more, while
2306    
2307     \d{8}
2308    
2309     matches exactly 8 digits. An opening curly bracket that
2310     appears in a position where a quantifier is not allowed, or
2311     one that does not match the syntax of a quantifier, is taken
2312     as a literal character. For example, {,6} is not a quantif-
2313     ier, but a literal string of four characters.
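
The three quantified patterns above behave as follows in a small
sketch that uses Python's re module as a stand-in for PCRE (an
assumption). The {,6} case is deliberately left out, because
Python reads it as {0,6} and so differs from PCRE there.

```python
import re

two_to_four = re.compile(r"z{2,4}")
three_or_more = re.compile(r"[aeiou]{3,}")
# re.ASCII keeps \d to 0-9 (a Python-specific flag).
exactly_eight = re.compile(r"\d{8}", re.ASCII)
```

For instance, two_to_four accepts "zz" and "zzzz" as whole
subjects but rejects "z" and "zzzzz".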
2314 nigel 63
2315     In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2316     than to individual bytes. Thus, for example, \x{100}{2}
2317     matches two UTF-8 characters, each of which is represented
2318     by a two-byte sequence.
2319    
2320 nigel 41 The quantifier {0} is permitted, causing the expression to
2321     behave as if the previous item and the quantifier were not
2322     present.
2323    
2324     For convenience (and historical compatibility) the three
2325     most common quantifiers have single-character abbreviations:
2326    
2327     * is equivalent to {0,}
2328     + is equivalent to {1,}
2329     ? is equivalent to {0,1}
2330    
2331     It is possible to construct infinite loops by following a
2332     subpattern that can match no characters with a quantifier
2333     that has no upper limit, for example:
2334    
2335     (a?)*
2336    
2337     Earlier versions of Perl and PCRE used to give an error at
2338     compile time for such patterns. However, because there are
2339     cases where this can be useful, such patterns are now
2340     accepted, but if any repetition of the subpattern does in
2341     fact match no characters, the loop is forcibly broken.
2342    
2343     By default, the quantifiers are "greedy", that is, they
2344     match as much as possible (up to the maximum number of per-
2345     mitted times), without causing the rest of the pattern to
2346     fail. The classic example of where this gives problems is in
2347     trying to match comments in C programs. These appear between
2348     the sequences /* and */ and within the sequence, individual
2349     * and / characters may appear. An attempt to match C com-
2350     ments by applying the pattern
2351    
2352     /\*.*\*/
2353    
2354     to the string
2355    
2356     /* first comment */ not comment /* second comment */
2357    
2358 nigel 51 fails, because it matches the entire string owing to the
2359 nigel 41 greediness of the .* item.
2360    
2361 nigel 47 However, if a quantifier is followed by a question mark, it
2362     ceases to be greedy, and instead matches the minimum number
2363     of times possible, so the pattern
2364 nigel 41
2365     /\*.*?\*/
2366    
2367     does the right thing with the C comments. The meaning of the
2368     various quantifiers is not otherwise changed, just the pre-
2369     ferred number of matches. Do not confuse this use of ques-
2370     tion mark with its use as a quantifier in its own right.
2371     Because it has two uses, it can sometimes appear doubled, as
2372     in
2373    
2374     \d??\d
2375    
2376     which matches one digit by preference, but can match two if
2377     that is the only way the rest of the pattern matches.
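
The difference between greedy and minimizing quantifiers can be
seen in a short sketch. Python's re module is used as a stand-in
for PCRE (an assumption; the ? suffix has the same meaning in
both). The variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when it has to.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```

greedy is the entire subject, lazy is only the first comment,
and the doubled question mark yields "1" unless the whole subject
has to be matched.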
2378    
2379     If the PCRE_UNGREEDY option is set (an option which is not
2380 nigel 47 available in Perl), the quantifiers are not greedy by
2381 nigel 41 default, but individual ones can be made greedy by following
2382     them with a question mark. In other words, it inverts the
2383     default behaviour.
2384    
2385     When a parenthesized subpattern is quantified with a minimum
2386     repeat count that is greater than 1 or with a limited max-
2387     imum, more store is required for the compiled pattern, in
2388     proportion to the size of the minimum or maximum.

2389     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2390     option (equivalent to Perl's /s) is set, thus allowing the .
2391 nigel 47 to match newlines, the pattern is implicitly anchored,
2392 nigel 41 because whatever follows will be tried against every charac-
2393     ter position in the subject string, so there is no point in
2394     retrying the overall match at any position after the first.
2395 nigel 63 PCRE normally treats such a pattern as though it were pre-
2396     ceded by \A.
2397 nigel 41
2398 nigel 63 In cases where it is known that the subject string contains
2399     no newlines, it is worth setting PCRE_DOTALL in order to
2400     obtain this optimization, or alternatively using ^ to indi-
2401     cate anchoring explicitly.
2402    
2403     However, there is one situation where the optimization can-
2404     not be used. When .* is inside capturing parentheses that
2405     are the subject of a backreference elsewhere in the pattern,
2406     a match at the start may fail, and a later one succeed. Con-
2407     sider, for example:
2408    
2409     (.*)abc\1
2410    
2411     If the subject is "xyz123abc123" the match point is the
2412     fourth character. For this reason, such a pattern is not
2413     implicitly anchored.
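
The match point can be confirmed with a sketch in Python's re
module, used as a stand-in for PCRE (an assumption). Offsets in
Python count from zero, so the fourth character is offset 3.

```python
import re

# The attempt at offset 0 fails, because \1 would have to repeat
# "xyz123"; a later starting point succeeds.
m = re.search(r"(.*)abc\1", "xyz123abc123")
```

Here m.start() is 3 and the captured substring is "123".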
2414    
2415 nigel 41 When a capturing subpattern is repeated, the value captured
2416     is the substring that matched the final iteration. For exam-
2417     ple, after
2418    
2419     (tweedle[dume]{3}\s*)+
2420    
2421     has matched "tweedledum tweedledee" the value of the cap-
2422     tured substring is "tweedledee". However, if there are
2423     nested capturing subpatterns, the corresponding captured
2424     values may have been set in previous iterations. For exam-
2425     ple, after
2426    
2427     /(a|(b))+/
2428    
2429     matches "aba" the value of the second captured substring is
2430     "b".
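
Both cases can be reproduced in a sketch with Python's re module,
a stand-in for PCRE (an assumption; Python follows the same
convention for repeated groups in these two examples).

```python
import re

# Group 1 holds the text of the final iteration only.
last_iteration = re.match(r"(tweedle[dume]{3}\s*)+",
                          "tweedledum tweedledee")

# Group 2 was set in the second iteration and is not cleared when
# the third iteration takes the "a" branch.
nested = re.match(r"(a|(b))+", "aba")
```

last_iteration.group(1) is "tweedledee", and nested.group(2) is
"b" even though the last iteration matched "a".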
2431    
2432    
2433 nigel 63 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2434 nigel 41
2435 nigel 63 With both maximizing and minimizing repetition, failure of
2436     what follows normally causes the repeated item to be re-
2437     evaluated to see if a different number of repeats allows the
2438     rest of the pattern to match. Sometimes it is useful to
2439     prevent this, either to change the nature of the match, or
2440     to cause it to fail earlier than it otherwise might, when the
2441     author of the pattern knows there is no point in carrying
2442     on.
2443 nigel 53
2444 nigel 63 Consider, for example, the pattern \d+foo when applied to
2445     the subject line
2446 nigel 53
2447 nigel 63 123456bar
2448 nigel 53
2449 nigel 63 After matching all 6 digits and then failing to match "foo",
2450     the normal action of the matcher is to try again with only 5
2451     digits matching the \d+ item, and then with 4, and so on,
2452     before ultimately failing. "Atomic grouping" (a term taken
2453     from Jeffrey Friedl's book) provides the means for specify-
2454     ing that once a subpattern has matched, it is not to be re-
2455     evaluated in this way.
2456 nigel 53
2457 nigel 63 If we use atomic grouping for the previous example, the
2458     matcher would give up immediately on failing to match "foo"
2459     the first time. The notation is a kind of special
2460     parenthesis, starting with (?> as in this example:
2461 nigel 53
2462 nigel 63 (?>\d+)foo
2463 nigel 53
2464 nigel 63 This kind of parenthesis "locks up" the part of the pattern
2465     it contains once it has matched, and a failure further into
2466     the pattern is prevented from backtracking into it. Back-
2467     tracking past it to previous items, however, works as nor-
2468     mal.
2469 nigel 53
2470 nigel 63 An alternative description is that a subpattern of this type
2471     matches the string of characters that an identical stan-
2472     dalone pattern would match, if anchored at the current point
2473     in the subject string.
2474    
2475     Atomic grouping subpatterns are not capturing subpatterns.
2476     Simple cases such as the above example can be thought of as
2477     a maximizing repeat that must swallow everything it can. So,
2478     while both \d+ and \d+? are prepared to adjust the number of
2479     digits they match in order to make the rest of the pattern
2480     match, (?>\d+) can only match an entire sequence of digits.
2481    
2482     Atomic groups in general can of course contain arbitrarily
2483     complicated subpatterns, and can be nested. However, when
2484     the subpattern for an atomic group is just a single repeated
2485     item, as in the example above, a simpler notation, called a
2486     "possessive quantifier" can be used. This consists of an
2487     additional + character following a quantifier. Using this
2488     notation, the previous example can be rewritten as
2489    
2490     \d++foo
2491    
2492     Possessive quantifiers are always greedy; the setting of the
2493     PCRE_UNGREEDY option is ignored. They are a convenient nota-
2494     tion for the simpler forms of atomic group. However, there
2495     is no difference in the meaning or processing of a posses-
2496     sive quantifier and the equivalent atomic group.
2497    
2498     The possessive quantifier syntax is an extension to the Perl
2499     syntax. It originates in Sun's Java package.
2500    
2501     When a pattern contains an unlimited repeat inside a subpat-
2502     tern that can itself be repeated an unlimited number of
2503     times, the use of an atomic group is the only way to avoid
2504     some failing matches taking a very long time indeed. The
2505     pattern
2506    
2507     (\D+|<\d+>)*[!?]
2508    
2509     matches an unlimited number of substrings that either con-
2510     sist of non-digits, or digits enclosed in <>, followed by
2511     either ! or ?. When it matches, it runs quickly. However, if
2512     it is applied to
2513    
2514     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2515    
2516     it takes a long time before reporting failure. This is
2517     because the string can be divided between the two repeats in
2518     a large number of ways, and all have to be tried. (The exam-
2519     ple used [!?] rather than a single character at the end,
2520     because both PCRE and Perl have an optimization that allows
2521     for fast failure when a single character is used. They
2522     remember the last single character that is required for a
2523     match, and fail early if it is not present in the string.)
2524     If the pattern is changed to
2525    
2526     ((?>\D+)|<\d+>)*[!?]
2527    
2528     sequences of non-digits cannot be broken, and failure hap-
2529     pens quickly.
2530    
2531    
2532     BACK REFERENCES
2533    
2534     Outside a character class, a backslash followed by a digit
2535     greater than 0 (and possibly further digits) is a back
2536     reference to a capturing subpattern earlier (that is, to its
2537 nigel 41 left) in the pattern, provided there have been that many
2538     previous capturing left parentheses.
2539    
2540     However, if the decimal number following the backslash is
2541     less than 10, it is always taken as a back reference, and
2542     causes an error only if there are not that many capturing
2543     left parentheses in the entire pattern. In other words, the
2544     parentheses that are referenced need not be to the left of
2545     the reference for numbers less than 10. See the section
2546     entitled "Backslash" above for further details of the han-
2547     dling of digits following a backslash.
2548    
2549     A back reference matches whatever actually matched the cap-
2550     turing subpattern in the current subject string, rather than
2551 nigel 63 anything matching the subpattern itself (see "Subpatterns as
2552     subroutines" below for a way of doing that). So the pattern
2553 nigel 41
2554     (sens|respons)e and \1ibility
2555    
2556     matches "sense and sensibility" and "response and responsi-
2557     bility", but not "sense and responsibility". If caseful
2558 nigel 47 matching is in force at the time of the back reference, the
2559     case of letters is relevant. For example,
2560 nigel 41
2561     ((?i)rah)\s+\1
2562    
2563     matches "rah rah" and "RAH RAH", but not "RAH rah", even
2564     though the original capturing subpattern is matched case-
2565     lessly.
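
Both examples can be tried in a sketch with Python's re module, a
stand-in for PCRE (an assumption). Python does not accept a bare
(?i) in the middle of a pattern, so the second pattern uses the
scoped form (?i:rah) instead; that spelling is an adaptation for
Python, not the pattern shown above.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")

# The group matches caselessly, but the back reference is outside
# the caseless scope and compares case exactly.
rah = re.compile(r"((?i:rah))\s+\1")
```

pair rejects "sense and responsibility", and rah rejects
"RAH rah" while accepting "rah rah" and "RAH RAH".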
2566    
2567 nigel 63 Back references to named subpatterns use the Python syntax
2568     (?P=name). We could rewrite the above example as follows:
2569    
2570     (?P<p1>(?i)rah)\s+(?P=p1)
2571    
2572 nigel 41 There may be more than one back reference to the same sub-
2573     pattern. If a subpattern has not actually been used in a
2574 nigel 47 particular match, any back references to it always fail. For
2575     example, the pattern
2576 nigel 41
2577     (a|(bc))\2
2578    
2579     always fails if it starts to match "a" rather than "bc".

2580 nigel 63 Because there may be many capturing parentheses in a pat-
2581     tern, all digits following the backslash are taken as part
2582     of a potential back reference number. If the pattern contin-
2583     ues with a digit character, some delimiter must be used to
2584     terminate the back reference. If the PCRE_EXTENDED option is
2585     set, this can be whitespace. Otherwise an empty comment can
2586     be used.
2587 nigel 41
2588     A back reference that occurs inside the parentheses to which
2589     it refers fails when the subpattern is first used, so, for
2590     example, (a\1) never matches. However, such references can
2591 nigel 49 be useful inside repeated subpatterns. For example, the pat-
2592     tern
2593 nigel 41
2594     (a|b\1)+
2595    
2596 nigel 49 matches any number of "a"s and also "aba", "ababbaa" etc. At
2597 nigel 41 each iteration of the subpattern, the back reference matches
2598 nigel 53 the character string corresponding to the previous itera-
2599     tion. In order for this to work, the pattern must be such
2600     that the first iteration does not need to match the back
2601     reference. This can be done using alternation, as in the
2602     example above, or by a quantifier with a minimum of zero.
2603 nigel 41
2604    
2605 nigel 63 ASSERTIONS
2606 nigel 41
2607     An assertion is a test on the characters following or
2608     preceding the current matching point that does not actually
2609     consume any characters. The simple assertions coded as \b,
2610 nigel 63 \B, \A, \G, \Z, \z, ^ and $ are described above. More com-
2611     plicated assertions are coded as subpatterns. There are two
2612 nigel 41 kinds: those that look ahead of the current position in the
2613     subject string, and those that look behind it.
2614 nigel 43
2615 nigel 41 An assertion subpattern is matched in the normal way, except
2616     that it does not cause the current matching position to be
2617     changed. Lookahead assertions start with (?= for positive
2618     assertions and (?! for negative assertions. For example,
2619    
2620     \w+(?=;)
2621    
2622     matches a word followed by a semicolon, but does not include
2623     the semicolon in the match, and
2624    
2625     foo(?!bar)
2626    
2627     matches any occurrence of "foo" that is not followed by
2628     "bar". Note that the apparently similar pattern
2629    
2630     (?!foo)bar
2631    
2632     does not find an occurrence of "bar" that is preceded by
2633     something other than "foo"; it finds any occurrence of "bar"
2634     whatsoever, because the assertion (?!foo) is always true
2635     when the next three characters are "bar". A lookbehind
2636     assertion is needed to achieve this effect.
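
The three lookahead patterns, plus the lookbehind alternative,
can be compared in a sketch. Python's re module is used as a
stand-in for PCRE (an assumption; these assertion forms behave
the same way in both). The subject strings are invented for the
example.

```python
import re

# The semicolon is required but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "total;").group()

# Only the "foo" at offset 7 is not followed by "bar".
foo_not_bar = [m.start()
               for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" starts, so it is always true.
misleading = re.search(r"(?!foo)bar", "foobar")

# A negative lookbehind is what rejects "bar" after "foo".
lookbehind = re.search(r"(?<!foo)bar", "foobar")
```

misleading still finds "bar" at offset 3, whereas lookbehind
finds nothing in the same subject.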
2637    
2638 nigel 63 If you want to force a matching failure at some point in a
2639     pattern, the most convenient way to do it is with (?!)
2640     because an empty string always matches, so an assertion that
2641     requires there not to be an empty string must always fail.
2642    
2643 nigel 41 Lookbehind assertions start with (?<= for positive asser-
2644     tions and (?<! for negative assertions. For example,
2645    
2646     (?<!foo)bar
2647    
2648     does find an occurrence of "bar" that is not preceded by
2649     "foo". The contents of a lookbehind assertion are restricted
2650     such that all the strings it matches must have a fixed
2651     length. However, if there are several alternatives, they do
2652     not all have to have the same fixed length. Thus
2653    
2654     (?<=bullock|donkey)
2655    
2656     is permitted, but
2657    
2658     (?<!dogs?|cats?)
2659    
2660     causes an error at compile time. Branches that match dif-
2661     ferent length strings are permitted only at the top level of
2662     a lookbehind assertion. This is an extension compared with
2663 nigel 63 Perl (at least for 5.8), which requires all branches to
2664     match the same length of string. An assertion such as
2665 nigel 41
2666     (?<=ab(c|de))
2667    
2668     is not permitted, because its single top-level branch can
2669     match two different lengths, but it is acceptable if rewrit-
2670     ten to use two top-level branches:
2671    
2672     (?<=abc|abde)
2673    
2674     The implementation of lookbehind assertions is, for each
2675     alternative, to temporarily move the current position back
2676     by the fixed width and then try to match. If there are
2677     insufficient characters before the current position, the
2678 nigel 63 match is deemed to fail.
2679 nigel 41
2680 nigel 63 PCRE does not allow the \C escape (which matches a single
2681     byte in UTF-8 mode) to appear in lookbehind assertions,
2682     because it makes it impossible to calculate the length of
2683     the lookbehind.
2684    
2685     Atomic groups can be used in conjunction with lookbehind
2686     assertions to specify efficient matching at the end of the
2687     subject string. Consider a simple pattern such as
2688    
2689     abcd$
2690    
2691     when applied to a long string that does not match. Because
2692     matching proceeds from left to right, PCRE will look for
2693     each "a" in the subject and then see if what follows matches
2694     the rest of the pattern. If the pattern is specified as
2695    
2696     ^.*abcd$
2697    
2698     the initial .* matches the entire string at first, but when
2699     this fails (because there is no following "a"), it back-
2700     tracks to match all but the last character, then all but the
2701     last two characters, and so on. Once again the search for
2702     "a" covers the entire string, from right to left, so we are
2703     no better off. However, if the pattern is written as
2704    
2705     ^(?>.*)(?<=abcd)
2706    
2707     or, equivalently,
2708    
2709     ^.*+(?<=abcd)
2710    
2711     there can be no backtracking for the .* item; it can match
2712     only the entire string. The subsequent lookbehind assertion
2713     does a single test on the last four characters. If it fails,
2714     the match fails immediately. For long strings, this approach
2715     makes a significant difference to the processing time.
2716    
2717 nigel 41 Several assertions (of any sort) may occur in succession.
2718     For example,
2719    
2720     (?<=\d{3})(?<!999)foo
2721    
2722     matches "foo" preceded by three digits that are not "999".
2723     Notice that each of the assertions is applied independently
2724     at the same point in the subject string. First there is a
2725 nigel 47 check that the previous three characters are all digits, and
2726 nigel 41 then there is a check that the same three characters are not
2727     "999". This pattern does not match "foo" preceded by six
2728     characters, the first three of which are digits and the last
2729     three of which are not "999". For example, it doesn't match
2730     "123abcfoo". A pattern to do that is
2731    
2732     (?<=\d{3}...)(?<!999)foo
2733    
2734     This time the first assertion looks at the preceding six
2735     characters, checking that the first three are digits, and
2736     then the second assertion checks that the preceding three
2737     characters are not "999".
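To make the "same point" rule concrete, here is a hand-written C
equivalent of (?<=\d{3})(?<!999)foo. It is a sketch under the
assumption that \d means an ASCII digit; the function name
find_foo_after_digits is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first "foo" that
   satisfies both (?<=\d{3}) and (?<!999), or -1 if there is none. */
long find_foo_after_digits(const char *subject)
{
    const char *p;
    assert(subject != NULL);
    for (p = strstr(subject, "foo"); p != NULL; p = strstr(p + 1, "foo")) {
        size_t pos = (size_t)(p - subject);
        const unsigned char *q;
        if (pos < 3)
            continue;          /* not enough characters to look behind */
        q = (const unsigned char *)p - 3;
        /* First assertion: the three preceding characters are digits. */
        if (!isdigit(q[0]) || !isdigit(q[1]) || !isdigit(q[2]))
            continue;
        /* Second assertion, same point: they are not "999". */
        if (memcmp(q, "999", 3) == 0)
            continue;
        return (long)pos;
    }
    return -1;
}
```

Both tests inspect the same three characters immediately before
"foo", so a subject such as "123abcfoo" is rejected by the first
test and never reaches the second.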
2738    
2739     Assertions can be nested in any combination. For example,
2740    
2741     (?<=(?<!foo)bar)baz
2742    
2743     matches an occurrence of "baz" that is preceded by "bar"
2744     which in turn is not preceded by "foo", while
2745    
2746     (?<=\d{3}(?!999)...)foo
2747    
2748     is another pattern which matches "foo" preceded by three
2749     digits and any three characters that are not "999".
2750    
2751     Assertion subpatterns are not capturing subpatterns, and may
2752     not be repeated, because it makes no sense to assert the
2753     same thing several times. If any kind of assertion contains
2754     capturing subpatterns within it, these are counted for the
2755     purposes of numbering the capturing subpatterns in the whole
2756     pattern. However, substring capturing is carried out only
2757     for positive assertions, because it does not make sense for
2758     negative assertions.
2759    
2760    
2761 nigel 63 CONDITIONAL SUBPATTERNS
2762 nigel 41
2763     It is possible to cause the matching process to obey a sub-
2764     pattern conditionally or to choose between two alternative
2765     subpatterns, depending on the result of an assertion, or
2766     whether a previous capturing subpattern matched or not. The
2767     two possible forms of conditional subpattern are
2768    
2769     (?(condition)yes-pattern)
2770     (?(condition)yes-pattern|no-pattern)
2771    
2772     If the condition is satisfied, the yes-pattern is used; oth-
2773     erwise the no-pattern (if present) is used. If there are
2774     more than two alternatives in the subpattern, a compile-time
2775     error occurs.
2776    
2777 nigel 63 There are three kinds of condition. If the text between the
2778 nigel 47 parentheses consists of a sequence of digits, the condition
2779     is satisfied if the capturing subpattern of that number has
2780 nigel 51 previously matched. The number must be greater than zero.
2781     Consider the following pattern, which contains non-
2782     significant white space to make it more readable (assume the
2783     PCRE_EXTENDED option) and to divide it into three parts for
2784     ease of discussion:
2785 nigel 41
2786     ( \( )? [^()]+ (?(1) \) )
2787    
2788     The first part matches an optional opening parenthesis, and
2789     if that character is present, sets it as the first captured
2790     substring. The second part matches one or more characters
2791     that are not parentheses. The third part is a conditional
2792     subpattern that tests whether the first set of parentheses
2793     matched or not. If they did, that is, if the subject started
2794     with an opening parenthesis, the condition is true, and so
2795     the yes-pattern is executed and a closing parenthesis is
2796     required. Otherwise, since no-pattern is not present, the
2797     subpattern matches nothing. In other words, this pattern
2798     matches a sequence of non-parentheses, optionally enclosed
2799     in parentheses.
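The three parts of this pattern can be followed in a small C
function that mimics a single match attempt at the start of a
string. It is a sketch, not PCRE's algorithm, and the name
match_optional_parens is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: mimics one match attempt of
   ( \( )? [^()]+ (?(1) \) )  at the first character of s.
   Returns the length of the match, or -1 if the attempt fails. */
long match_optional_parens(const char *s)
{
    size_t i = 0, run;
    int group1_matched = 0;
    assert(s != NULL);
    if (s[i] == '(') {            /* part 1: sets capturing group 1 */
        group1_matched = 1;
        i++;
    }
    run = strcspn(s + i, "()");   /* part 2: one or more non-parentheses */
    if (run == 0)
        return -1;
    i += run;
    if (group1_matched) {         /* part 3: yes-pattern only if group 1 set */
        if (s[i] != ')')
            return -1;
        i++;
    }
    return (long)i;
}
```

For example, "(abc)" gives 5 and "abc" gives 3, while "(abc" fails
because the condition is true and no closing parenthesis follows.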
2800    
2801 nigel 63 If the condition is the string (R), it is satisfied if a
2802     recursive call to the pattern or subpattern has been made.
2803     At "top level", the condition is false. This is a PCRE
2804     extension. Recursive patterns are described in the next
2805     section.
2806 nigel 41
2807 nigel 63 If the condition is not a sequence of digits or (R), it must
2808     be an assertion. This may be a positive or negative looka-
2809     head or lookbehind assertion. Consider this pattern, again
2810     containing non-significant white space, and with the two
2811     alternatives on the second line:
2812    
2813 nigel 41 (?(?=[^a-z]*[a-z])
2814     \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2815    
2816     The condition is a positive lookahead assertion that matches
2817     an optional sequence of non-letters followed by a letter. In
2818     other words, it tests for the presence of at least one
2819     letter in the subject. If a letter is found, the subject is
2820     matched against the first alternative; otherwise it is
2821     matched against the second. This pattern matches strings in
2822     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2823     letters and dd are digits.
2824    
2825    
2826 nigel 63 COMMENTS
2827 nigel 41
2828     The sequence (?# marks the start of a comment which contin-
2829     ues up to the next closing parenthesis. Nested parentheses
2830     are not permitted. The characters that make up a comment
2831     play no part in the pattern matching at all.
2832    
2833     If the PCRE_EXTENDED option is set, an unescaped # character
2834     outside a character class introduces a comment that contin-
2835     ues up to the next newline character in the pattern.
2836    
2837    
2838 nigel 63 RECURSIVE PATTERNS
2839 nigel 41
2840 nigel 43 Consider the problem of matching a string in parentheses,
2841     allowing for unlimited nested parentheses. Without the use
2842     of recursion, the best that can be done is to use a pattern
2843     that matches up to some fixed depth of nesting. It is not
2844 nigel 63 possible to handle an arbitrary nesting depth. Perl has pro-
2845     vided an experimental facility that allows regular expres-
2846     sions to recurse (amongst other things). It does this by
2847     interpolating Perl code in the expression at run time, and
2848     the code can refer to the expression itself. A Perl pattern
2849     to solve the parentheses problem can be created like this:
2850 nigel 43
2851     $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2852    
2853     The (?p{...}) item interpolates Perl code at run time, and
2854     in this case refers recursively to the pattern in which it
2855     appears. Obviously, PCRE cannot support the interpolation of
2856 nigel 63 Perl code. Instead, it supports some special syntax for
2857     recursion of the entire pattern, and also for individual
2858     subpattern recursion.
2859 nigel 43
2860 nigel 63 The special item that consists of (? followed by a number
2861     greater than zero and a closing parenthesis is a recursive
2862     call of the subpattern of the given number, provided that it
2863     occurs inside that subpattern. (If not, it is a "subroutine"
2864     call, which is described in the next section.) The special
2865     item (?R) is a recursive call of the entire regular expres-
2866     sion.
2867    
2868     For example, this PCRE pattern solves the nested parentheses
2869     problem (assume the PCRE_EXTENDED option is set so that
2870     white space is ignored):
2871    
2872 nigel 43 \( ( (?>[^()]+) | (?R) )* \)
2873    
2874     First it matches an opening parenthesis. Then it matches any
2875     number of substrings which can either be a sequence of non-
2876     parentheses, or a recursive match of the pattern itself
2877 nigel 63 (that is, a correctly parenthesized substring). Finally
2878     there is a closing parenthesis.
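The shape of this recursion is the same as that of a recursive C
function. The sketch below mimics one match attempt of the pattern
at the start of a string; it is an illustration rather than PCRE's
implementation, and the name match_nested_parens is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: mimics one match attempt of
   \( ( (?>[^()]+) | (?R) )* \)  at the start of s.
   Returns the length of the match, or -1 if the attempt fails. */
long match_nested_parens(const char *s)
{
    size_t i = 1;
    assert(s != NULL);
    if (s[0] != '(')
        return -1;                           /* opening parenthesis */
    for (;;) {
        size_t run = strcspn(s + i, "()");   /* atomic run, taken whole */
        if (run > 0) {
            i += run;
        } else if (s[i] == '(') {
            long inner = match_nested_parens(s + i);   /* like (?R) */
            if (inner < 0)
                return -1;   /* fewer repeats cannot reach a ")" either */
            i += (size_t)inner;
        } else {
            break;                           /* ")" or end of string */
        }
    }
    return s[i] == ')' ? (long)(i + 1) : -1; /* closing parenthesis */
}
```

Because each run of non-parentheses is consumed whole, a subject
with an unclosed parenthesis is rejected after a single pass.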
2879 nigel 43
2880 nigel 63 If this were part of a larger pattern, you would not want to
2881     recurse the entire pattern, so instead you could use this:
2882    
2883     ( \( ( (?>[^()]+) | (?1) )* \) )
2884    
2885     We have put the pattern into parentheses, and caused the
2886     recursion to refer to them instead of the whole pattern. In
2887     a larger pattern, keeping track of parenthesis numbers can
2888     be tricky. It may be more convenient to use named
2889     parentheses instead. For this, PCRE uses (?P>name), which is
2890     an extension to the Python syntax that PCRE uses for named
2891     parentheses (Perl does not provide named parentheses). We
2892     could rewrite the above example as follows:
2893    
2894     (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2895    
2896 nigel 43 This particular example pattern contains nested unlimited
2897 nigel 63 repeats, and so the use of atomic grouping for matching
2898     strings of non-parentheses is important when applying the
2899     pattern to strings that do not match. For example, when this
2900     pattern is applied to
2901 nigel 43
2902     (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2903    
2904 nigel 63 it yields "no match" quickly. However, if atomic grouping is
2905     not used, the match runs for a very long time indeed because
2906     there are so many different ways the + and * repeats can
2907     carve up the subject, and all have to be tested before
2908     failure can be reported.

2909     At the end of a match, the values set for any capturing sub-
2910     patterns are those from the outermost level of the recursion
2911     at which the subpattern value is set. If you want to obtain
2912     intermediate values, a callout function can be used (see
2913     below and the pcrecallout documentation). If the pattern
2914     above is matched against
2915 nigel 43
2916     (ab(cd)ef)
2917    
2918     the value for the capturing parentheses is "ef", which is
2919     the last value taken on at the top level. If additional
2920     parentheses are added, giving
2921    
2922     \( ( ( (?>[^()]+) | (?R) )* ) \)
2923        ^                        ^
2925 nigel 43
2926 nigel 63 the string they capture is "ab(cd)ef", the contents of the
2927     top level parentheses. If there are more than 15 capturing
2928     parentheses in a pattern, PCRE has to obtain extra memory to
2929     store data during a recursion, which it does by using
2930     pcre_malloc, freeing it via pcre_free afterwards. If no
2931     memory can be obtained, the match fails with the
2932     PCRE_ERROR_NOMEMORY error.
2933 nigel 43
2934 nigel 63 Do not confuse the (?R) item with the condition (R), which
2935     tests for recursion. Consider this pattern, which matches
2936     text in angle brackets, allowing for arbitrary nesting. Only
2937     digits are allowed in nested brackets (that is, when recurs-
2938     ing), whereas any characters are permitted at the outer
2939     level.
2940 nigel 43
2941 nigel 63 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2942 nigel 41
2943 nigel 63 In this pattern, (?(R) is the start of a conditional subpat-
2944     tern, with two different alternatives for the recursive and
2945     non-recursive cases. The (?R) item is the actual recursive
2946     call.
2947 nigel 41
2948    
2949 nigel 63 SUBPATTERNS AS SUBROUTINES
2950    
2951     If the syntax for a recursive subpattern reference (either
2952     by number or by name) is used outside the parentheses to
2953     which it refers, it operates like a subroutine in a program-
2954     ming language. An earlier example pointed out that the pat-
2955     tern
2956    
2957     (sens|respons)e and \1ibility
2958    
2959     matches "sense and sensibility" and "response and responsi-
2960     bility", but not "sense and responsibility". If instead the
2961     pattern
2962    
2963     (sens|respons)e and (?1)ibility
2964    
2965     is used, it does match "sense and responsibility" as well as
2966     the other two strings. Such references must, however, follow
2967     the subpattern to which they refer.
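The difference between \1 and (?1) can be shown with a hand-written
matcher for these two patterns. This is a sketch that tests whole
strings only; the names stems, starts_with and matches_whole are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *const stems[] = { "sens", "respons" };

/* Returns the length of p if s starts with p, otherwise 0. */
static size_t starts_with(const char *s, const char *p)
{
    size_t n = strlen(p);
    return strncmp(s, p, n) == 0 ? n : 0;
}

/* Hypothetical helper: tests whether the whole of s matches
   (sens|respons)e and \1ibility    when use_subroutine is 0, or
   (sens|respons)e and (?1)ibility  when use_subroutine is 1. */
int matches_whole(const char *s, int use_subroutine)
{
    size_t first, second;
    assert(s != NULL);
    for (first = 0; first < 2; first++) {
        size_t n = starts_with(s, stems[first]);
        size_t m, k;
        const char *rest;
        if (n == 0)
            continue;
        m = starts_with(s + n, "e and ");
        if (m == 0)
            continue;
        rest = s + n + m;
        for (second = 0; second < 2; second++) {
            /* \1 must repeat the text that group 1 captured; (?1)
               runs the subpattern again, so either stem will do. */
            if (!use_subroutine && second != first)
                continue;
            k = starts_with(rest, stems[second]);
            if (k != 0 && strcmp(rest + k, "ibility") == 0)
                return 1;
        }
    }
    return 0;
}
```

With use_subroutine set to 0, "sense and responsibility" is
rejected; with it set to 1, the same string is accepted.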
2968    
2969    
2970     CALLOUTS
2971    
2972     Perl has a feature whereby using the sequence (?{...})
2973     causes arbitrary Perl code to be obeyed in the middle of
2974     matching a regular expression. This makes it possible,
2975     amongst other things, to extract different substrings that
2976     match the same pair of parentheses when there is a repeti-
2977     tion.
2978    
2979     PCRE provides a similar feature, but of course it cannot
2980     obey arbitrary Perl code. The feature is called "callout".
2981     The caller of PCRE provides an external function by putting
2982     its entry point in the global variable pcre_callout. By
2983     default, this variable contains NULL, which disables all
2984     calling out.
2985    
2986     Within a regular expression, (?C) indicates the points at
2987     which the external function is to be called. If you want to
2988     identify different callout points, you can put a number less
2989     than 256 after the letter C. The default value is zero. For
2990     example, this pattern has two callout points:
2991    
2992     (?C1)9abc(?C2)def
2993    
2994     During matching, when PCRE reaches a callout point (and
2995     pcre_callout is set), the external function is called. It is
2996     provided with the number of the callout, and, optionally,
2997     one item of data originally supplied by the caller of
2998     pcre_exec(). The callout function may cause matching to
2999     backtrack, or to fail altogether. A complete description of
3000     the interface to the callout function is given in the pcre-
3001     callout documentation.
3002    
3003     Last updated: 03 February 2003
3004     Copyright (c) 1997-2003 University of Cambridge.
3005     -----------------------------------------------------------------------------
3006    
3007     NAME
3008     PCRE - Perl-compatible regular expressions
3009    
3010    
3011     PCRE PERFORMANCE
3012    
3013     Certain items that may appear in regular expression patterns
3014     are more efficient than others. It is more efficient to use
3015     a character class like [aeiou] than a set of alternatives
3016     such as (a|e|i|o|u). In general, the simplest construction
3017     that provides the required behaviour is usually the most
3018     efficient. Jeffrey Friedl's book contains a lot of discus-
3019     sion about optimizing regular expressions for efficient per-
3020     formance.
3021    
3022     When a pattern begins with .* not in parentheses, or in
3023     parentheses that are not the subject of a backreference, and
3024     the PCRE_DOTALL option is set, the pattern is implicitly
3025     anchored by PCRE, since it can match only at the start of a
3026     subject string. However, if PCRE_DOTALL is not set, PCRE
3027     cannot make this optimization, because the . metacharacter
3028     does not then match a newline, and if the subject string
3029     contains newlines, the pattern may match from the character
3030     immediately following one of them instead of from the very
3031     start. For example, the pattern
3032    
3033     .*second
3034    
3035 nigel 41 matches the subject "first\nand second" (where \n stands for
3036 nigel 63 a newline character), with the match starting at the seventh
3037     character. In order to do this, PCRE has to retry the match
3038 nigel 41 starting after every newline in the subject.
3039    
3040     If you are using such a pattern with subject strings that do
3041     not contain newlines, the best performance is obtained by
3042     setting PCRE_DOTALL, or starting the pattern with ^.* to
3043     indicate explicit anchoring. That saves PCRE from having to
3044     scan along the subject looking for a newline to restart at.
3045    
3046     Beware of patterns that contain nested indefinite repeats.
3047     These can take a long time to run when applied to a string
3048     that does not match. Consider the pattern fragment
3049    
3050     (a+)*
3051    
3052     This can match "aaaa" in 16 different ways, and this number
3053     increases very rapidly as the string gets longer. (The *
3054     repeat can match 0, 1, 2, 3, or 4 times, and for each of
3055     those cases other than 0, the + repeats can match different
3056     numbers of times.) When the remainder of the pattern is such
3057 nigel 51 that the entire match is going to fail, PCRE has in princi-
3058     ple to try every possible variation, and this can take an
3059     extremely long time.

3060 nigel 41 An optimization catches some of the more simple cases such
3061     as
3062    
3063     (a+)*b
3064    
3065     where a literal character follows. Before embarking on the
3066     standard matching procedure, PCRE checks that there is a "b"
3067     later in the subject string, and if there is not, it fails
3068     the match immediately. However, when there is no following
3069     literal this optimization cannot be used. You can see the
3070     difference by comparing the behaviour of
3071    
3072     (a+)*\d
3073    
3074     with the pattern above. The pattern above, with its literal
3075     "b", gives a failure almost instantly when applied to a whole
3076     line of "a" characters, whereas the \d version takes an
3077     appreciable time with strings longer than about 20 characters.
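The growth in the number of ways that (a+)* can divide up a run of
"a" characters can be estimated with a short C function. The
counting model is an assumption made for illustration: two ways
differ if the * repeat iterates a different number of times or if
any iteration of a+ takes a different number of characters, and the
match may stop after any prefix of the run. The function name
count_nested_repeat_ways is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: counts the ways (a+)* can match at the start
   of a run of n "a" characters. Valid for 0 <= n <= 62. */
unsigned long long count_nested_repeat_ways(int n)
{
    unsigned long long ways[63];  /* ways[k]: splits consuming exactly k */
    unsigned long long total = 0;
    int j, k;
    assert(n >= 0 && n <= 62);
    ways[0] = 1;                  /* zero iterations of the * repeat */
    for (k = 1; k <= n; k++) {
        ways[k] = 0;
        for (j = 0; j < k; j++)   /* last a+ iteration takes k - j chars */
            ways[k] += ways[j];
    }
    for (k = 0; k <= n; k++)
        total += ways[k];         /* the match may stop after any prefix */
    return total;
}
```

Under this model the total doubles with every additional character,
which is why a failing match over even a modest run of "a"
characters can take so long.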
3078    
3079 nigel 63 Last updated: 03 February 2003
3080     Copyright (c) 1997-2003 University of Cambridge.
3081     -----------------------------------------------------------------------------
3082 nigel 41
3083 nigel 63 NAME
3084     PCRE - Perl-compatible regular expressions.
3085 nigel 41
3086 nigel 49
3087 nigel 63 SYNOPSIS OF POSIX API
3088     #include <pcreposix.h>
3089 nigel 49
3090 nigel 63 int regcomp(regex_t *preg, const char *pattern,
3091     int cflags);
3092 nigel 49
3093 nigel 63 int regexec(regex_t *preg, const char *string,
3094     size_t nmatch, regmatch_t pmatch[], int eflags);
3095 nigel 49
3096 nigel 63 size_t regerror(int errcode, const regex_t *preg,
3097     char *errbuf, size_t errbuf_size);
3098 nigel 49
3099 nigel 63 void regfree(regex_t *preg);
3100 nigel 49
3101    
3102 nigel 63 DESCRIPTION
3103 nigel 49
3104 nigel 63 This set of functions provides a POSIX-style API to the PCRE
3105     regular expression package. See the pcreapi documentation
3106     for a description of the native API, which contains addi-
3107     tional functionality.
3108 nigel 49
3109 nigel 63 The functions described here are just wrapper functions that
3110     ultimately call the PCRE native API. Their prototypes are
3111     defined in the pcreposix.h header file, and on Unix systems
3112     the library itself is called libpcreposix.a, so can be accessed
3113     by adding -lpcreposix to the command for linking an applica-
3114     tion which uses them. Because the POSIX functions call the
3115     native ones, it is also necessary to add -lpcre.
3116 nigel 49
3117 nigel 63 I have implemented only those option bits that can be rea-
3118     sonably mapped to PCRE native options. In addition, the
3119     options REG_EXTENDED and REG_NOSUB are defined with the
3120     value zero. They have no effect, but since programs that are
3121     written to the POSIX interface often use them, this makes it
3122     easier to slot in PCRE as a replacement library. Other POSIX
3123     options are not even defined.
3124 nigel 49
3125 nigel 63 When PCRE is called via these functions, it is only the API
3126     that is POSIX-like in style. The syntax and semantics of the
3127     regular expressions themselves are still those of Perl, sub-
3128     ject to the setting of various PCRE options, as described
3129 nigel 69 below. "POSIX-like in style" means that the API approximates
3130     to the POSIX definition; it is not fully POSIX-compatible,
3131     and in multi-byte encoding domains it is probably even less
3132     compatible.
3133 nigel 49
3134 nigel 63 The header for these functions is supplied as pcreposix.h to
3135     avoid any potential clash with other POSIX libraries. It
3136     can, of course, be renamed or aliased as regex.h, which is
3137     the "correct" name. It provides two structure types, regex_t
3138     for compiled internal forms, and regmatch_t for returning
3139     captured substrings. It also defines some constants whose
3140     names start with "REG_"; these are used for setting options
3141     and identifying error codes.
3142 nigel 49
3143    
3144 nigel 63 COMPILING A PATTERN
3145 nigel 49
3146 nigel 63 The function regcomp() is called to compile a pattern into
3147     an internal form. The pattern is a C string terminated by a
3148     binary zero, and is passed in the argument pattern. The preg
3149     argument is a pointer to a regex_t structure which is used
3150     as a base for storing information about the compiled expres-
3151     sion.
3152 nigel 49
3153 nigel 63 The argument cflags is either zero, or contains one or more
3154     of the bits defined by the following macros:
3155 nigel 53
3156 nigel 63 REG_ICASE
3157 nigel 49
3158 nigel 63 The PCRE_CASELESS option is set when the expression is
3159     passed for compilation to the native function.
3160 nigel 49
3161 nigel 63 REG_NEWLINE
3162 nigel 49
3163 nigel 63 The PCRE_MULTILINE option is set when the expression is
3164     passed for compilation to the native function. Note that
3165     this does not mimic the defined POSIX behaviour for
3166     REG_NEWLINE (see the following section).
3167 nigel 49
3168 nigel 63 In the absence of these flags, no options are passed to the
3169     native function. This means the regex is compiled with
3170     PCRE default semantics. In particular, the way it handles
3171     newline characters in the subject string is the Perl way,
3172     not the POSIX way. Note that setting PCRE_MULTILINE has only
3173     some of the effects specified for REG_NEWLINE. It does not
3174     affect the way newlines are matched by . (they aren't) or by
3175     a negative class such as [^a] (they are).
3176 nigel 53
3177 nigel 63 The yield of regcomp() is zero on success, and non-zero oth-
3178     erwise. The preg structure is filled in on success, and one
3179     member of the structure is public: re_nsub contains the
3180     number of capturing subpatterns in the regular expression.
3181     Various error codes are defined in the header file.
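A minimal sketch of calling regcomp() and reading re_nsub follows.
As an assumption made so that it builds where PCRE is not
installed, it includes the system <regex.h>; with PCRE, include
pcreposix.h instead and link with -lpcreposix -lpcre. The function
name count_captures is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: system header; pcreposix.h with PCRE */

/* Hypothetical helper: compiles pattern caselessly and returns the
   number of capturing subpatterns, or -1 if regcomp() fails. */
long count_captures(const char *pattern)
{
    regex_t re;
    long nsub;
    assert(pattern != NULL);
    /* REG_EXTENDED is zero in pcreposix.h, so it has no effect there;
       a system POSIX library needs it for this grouping syntax. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;               /* non-zero yield: compilation failed */
    nsub = (long)re.re_nsub;     /* the one public member of regex_t */
    regfree(&re);                /* free memory associated with re */
    return nsub;
}
```

For the pattern "(cat)|(dog)" the function returns 2, and for a
pattern with an unmatched opening parenthesis it returns -1.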
3182 nigel 53
3183    
3184 nigel 63 MATCHING NEWLINE CHARACTERS
3185 nigel 53
3186 nigel 63 This area is not simple, because POSIX and Perl take dif-
3187     ferent views of things. It is not possible to get PCRE to
3188     obey POSIX semantics, but then PCRE was never intended to be
3189     a POSIX engine. The following table lists the different pos-
3190     sibilities for matching newline characters in PCRE:
3191 nigel 53
3192 nigel 63                          Default   Change with
3193 nigel 53
3194 nigel 63 . matches newline        no        PCRE_DOTALL
3195          newline matches [^a]     yes       not changeable
3196          $ matches \n at end      yes       PCRE_DOLLARENDONLY
3197          $ matches \n in middle   no        PCRE_MULTILINE
3198          ^ matches \n in middle   no        PCRE_MULTILINE
3199 nigel 53
3200 nigel 63 This is the equivalent table for POSIX:
3201 nigel 53
3202 nigel 63                          Default   Change with
3203 nigel 53
3204 nigel 63 . matches newline        yes       REG_NEWLINE
3205          newline matches [^a]     yes       REG_NEWLINE
3206          $ matches \n at end      no        REG_NEWLINE
3207          $ matches \n in middle   no        REG_NEWLINE
3208          ^ matches \n in middle   no        REG_NEWLINE
3209 nigel 53
3210 nigel 63 PCRE's behaviour is the same as Perl's, except that there is
3211     no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
3212     and Perl, there is no way to stop newline from matching
3213     [^a].
3214 nigel 53
3215 nigel 63 The default POSIX newline handling can be obtained by set-
3216     ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
3217     to make PCRE behave exactly as for the REG_NEWLINE action.
3218 nigel 53
3219    
3220 nigel 63 MATCHING A PATTERN
3221 nigel 53
3222 nigel 63 The function regexec() is called to match a pre-compiled
3223     pattern preg against a given string, which is terminated by
3224     a zero byte, subject to the options in eflags. These can be:
3225 nigel 53
3226 nigel 63 REG_NOTBOL
3227 nigel 53
3228 nigel 63 The PCRE_NOTBOL option is set when calling the underlying
3229     PCRE matching function.
3230 nigel 53
3231 nigel 63 REG_NOTEOL
3232 nigel 53
3233 nigel 63 The PCRE_NOTEOL option is set when calling the underlying
3234     PCRE matching function.
3235 nigel 53
3236 nigel 63 The portion of the string that was matched, and also any
3237     captured substrings, are returned via the pmatch argument,
3238     which points to an array of nmatch structures of type
3239     regmatch_t, containing the members rm_so and rm_eo. These
3240     contain the offset to the first character of each substring
3241     and the offset to the first character after the end of each
3242     substring, respectively. The 0th element of the vector
3243     relates to the entire portion of string that was matched;
3244     subsequent elements relate to the capturing subpatterns of
3245     the regular expression. Unused entries in the array have
3246     both structure members set to -1.
3247 nigel 53
3248 nigel 63 A successful match yields a zero return; various error codes
3249     are defined in the header file, of which REG_NOMATCH is the
3250     "expected" failure code.
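The use of rm_so and rm_eo can be sketched as follows. As an
assumption made so that the example builds where PCRE is not
installed, it includes the system <regex.h> rather than
pcreposix.h. The function name capture_text and the limit of ten
pmatch entries are hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <regex.h>   /* assumption: system header; pcreposix.h with PCRE */

/* Hypothetical helper: matches pattern against subject and copies the
   text of capture number group (0 = whole match) into out. Returns the
   length copied, or -1 on a compile error, no match, an unset group,
   or a buffer that is too small. */
long capture_text(const char *pattern, const char *subject,
                  size_t group, char *out, size_t outsize)
{
    regmatch_t pmatch[10];
    regex_t re;
    long len = -1;
    assert(pattern != NULL && subject != NULL && out != NULL);
    if (group >= 10 || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 10, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {
        /* rm_so: offset of the first character of the substring;
           rm_eo: offset of the first character after its end. */
        size_t n = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
        if (n < outsize) {
            memcpy(out, subject + pmatch[group].rm_so, n);
            out[n] = '\0';
            len = (long)n;
        }
    }
    regfree(&re);
    return len;
}
```

Matching "(cat)|(dog)" against "the dog sat on the cat" gives "dog"
for groups 0 and 2, while group 1 is unused and so reports -1.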
3251 nigel 53
3252    
3253 nigel 63 ERROR MESSAGES
3254 nigel 53
3255 nigel 63 The regerror() function maps a non-zero errorcode from
3256     either regcomp() or regexec() to a printable message. If
3257     preg is not NULL, the error should have arisen from the use
3258     of that structure. A message terminated by a binary zero is
3259     placed in errbuf. The length of the message, including the
3260     zero, is limited to errbuf_size. The yield of the function
3261     is the size of buffer needed to hold the whole message.
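Because the yield is the size needed for the whole message, a
caller can ask for the size first and then allocate a buffer. The
sketch below relies on the POSIX rule that a zero errbuf_size
causes errbuf to be ignored, and it includes the system <regex.h>
as an assumption so that it builds without PCRE. The function name
compile_error_message is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <regex.h>   /* assumption: system header; pcreposix.h with PCRE */

/* Hypothetical helper: returns a malloc'd message describing why
   pattern failed to compile, or NULL if it compiled (or if memory
   ran out). The caller frees the result. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    char *msg;
    size_t needed;
    int rc;
    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }
    /* With a zero-length buffer the call only reports the size needed
       for the whole message, including the terminating zero. */
    needed = regerror(rc, &re, NULL, 0);
    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &re, msg, needed);
    return msg;
}
```

The wording of the message depends on the library in use, so the
caller should treat it as text for display only.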
3262 nigel 53
3263    
3264 nigel 63 STORAGE
3265 nigel 53
3266 nigel 63 Compiling a regular expression causes memory to be allocated
3267     and associated with the preg structure. The function reg-
3268     free() frees all such memory, after which preg may no longer
3269     be used as a compiled expression.
3270 nigel 53
3271    
3272 nigel 63 AUTHOR
3273 nigel 53
3274 nigel 63 Philip Hazel <ph10@cam.ac.uk>
3275     University Computing Service,
3276     Cambridge CB2 3QG, England.
3277 nigel 53
3278 nigel 63 Last updated: 03 February 2003
3279     Copyright (c) 1997-2003 University of Cambridge.
3280     -----------------------------------------------------------------------------
3281 nigel 53
3282 nigel 63 NAME
3283     PCRE - Perl-compatible regular expressions
3284 nigel 53
3285    
3286 nigel 63 PCRE SAMPLE PROGRAM
3287 nigel 41
3288 nigel 63 A simple, complete demonstration program, to get you started
3289     with using PCRE, is supplied in the file pcredemo.c in the
3290     PCRE distribution.
3291    
3292     The program compiles the regular expression that is its
3293     first argument, and matches it against the subject string in
3294     its second argument. No PCRE options are set, and default
3295     character tables are used. If matching succeeds, the program
3296     outputs the portion of the subject that matched, together
3297     with the contents of any captured substrings.
3298    
3299     If the -g option is given on the command line, the program
3300     then goes on to check for further matches of the same regu-
3301     lar expression in the same subject string. The logic is a
3302     little bit tricky because of the possibility of matching an
3303     empty string. Comments in the code explain what is going on.
3304    
3305     On a Unix system that has PCRE installed in /usr/local, you
3306     can compile the demonstration program using a command like
3307     this:
3308    
3309     gcc -o pcredemo pcredemo.c -I/usr/local/include \
3310     -L/usr/local/lib -lpcre
3311    
3312     Then you can run simple tests like this:
3313    
3314     ./pcredemo 'cat|dog' 'the cat sat on the mat'
3315     ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3316    
3317     Note that there is a much more comprehensive test program,
3318     called pcretest, which supports many more facilities for
3319     testing regular expressions and the PCRE library. The
3320     pcredemo program is provided as a simple coding example.
3321    
3322     On some operating systems (e.g. Solaris) you may get an
3323     error like this when you try to run pcredemo:
3324    
3325     ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
3326     file or directory
3327    
3328     This is caused by the way shared library support works on
3329     those systems. You need to add
3330    
3331     -R/usr/local/lib
3332    
3333     to the compile command to get round this problem.
3334    
3335     Last updated: 28 January 2003
3336     Copyright (c) 1997-2003 University of Cambridge.
3337     -----------------------------------------------------------------------------
3338    
