Contents of /code/trunk/doc/pcre.txt

Revision 49
Sat Feb 24 21:39:33 2007 UTC (7 years, 7 months ago) by nigel
File MIME type: text/plain
File size: 93217 byte(s)
Load pcre-3.3 into code/trunk.

NAME
pcre - Perl-compatible regular expressions.



SYNOPSIS
#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

char *pcre_version(void);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);
47    
48    
49    
50    
DESCRIPTION
The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics
as Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005, with some additional
features from later versions. This includes some experimental,
incomplete support for UTF-8 encoded strings. Details of exactly
what is and what is not supported are given below.
61 nigel 41
62     PCRE has its own native API, which is described in this
63     document. There is also a set of wrapper functions that
64 nigel 43 correspond to the POSIX regular expression API. These are
65     described in the pcreposix documentation.
66    
67 nigel 41 The native API function prototypes are defined in the header
68     file pcre.h, and on Unix systems the library itself is
69     called libpcre.a, so can be accessed by adding -lpcre to the
70 nigel 43 command for linking an application which calls it. The
71     header file defines the macros PCRE_MAJOR and PCRE_MINOR to
72     contain the major and minor release numbers for the library.
73     Applications can use these to include support for different
74     releases.
75 nigel 41
76     The functions pcre_compile(), pcre_study(), and pcre_exec()
77 nigel 49 are used for compiling and matching regular expressions.
78    
79     The functions pcre_copy_substring(), pcre_get_substring(),
80     and pcre_get_substring_list() are convenience functions for
81 nigel 41 extracting captured substrings from a matched subject
82 nigel 49 string; pcre_free_substring() and pcre_free_substring_list()
83     are also provided, to free the memory used for extracted
84     strings.
85 nigel 41
86 nigel 49 The function pcre_maketables() is used (optionally) to build
87     a set of character tables in the current locale for passing
88     to pcre_compile().
89    
90 nigel 43 The function pcre_fullinfo() is used to find out information
91     about a compiled pattern; pcre_info() is an obsolete version
92     which returns only some of the available information, but is
93     retained for backwards compatibility. The function
94     pcre_version() returns a pointer to a string containing the
95     version of PCRE and its date of release.
96 nigel 41
97     The global variables pcre_malloc and pcre_free initially
98     contain the entry points of the standard malloc() and free()
99     functions respectively. PCRE calls the memory management
100     functions via these variables, so a calling program can
101     replace them if it wishes to intercept the calls. This
102     should be done before calling any PCRE functions.
103    
104    
105    
MULTI-THREADING
The PCRE functions can be used in multi-threading applications,
with the proviso that the memory management functions pointed to
by pcre_malloc and pcre_free are shared by all threads.
120    
121 nigel 41 The compiled form of a regular expression is not altered
122     during matching, so the same compiled pattern can safely be
123     used by several threads at once.
124    
125    
126    
127     COMPILING A PATTERN
128     The function pcre_compile() is called to compile a pattern
129     into an internal form. The pattern is a C string terminated
130     by a binary zero, and is passed in the argument pattern. A
131     pointer to a single block of memory that is obtained via
132     pcre_malloc is returned. This contains the compiled code and
133     related data. The pcre type is defined for this for conveni-
134     ence, but in fact pcre is just a typedef for void, since the
135     contents of the block are not externally defined. It is up
136     to the caller to free the memory when it is no longer
137     required.
138    
139     The size of a compiled pattern is roughly proportional to
140     the length of the pattern string, except that each character
141     class (other than those containing just a single character,
142     negated or not) requires 33 bytes, and repeat quantifiers
143     with a minimum greater than one or a bounded maximum cause
144     the relevant portions of the compiled pattern to be repli-
145     cated.
146    
147     The options argument contains independent bits that affect
148     the compilation. It should be zero if no options are
149     required. Some of the options, in particular, those that are
150     compatible with Perl, can also be set and unset from within
151     the pattern (see the detailed description of regular expres-
152     sions below). For these options, the contents of the options
153     argument specifies their initial settings at the start of
154     compilation and execution. The PCRE_ANCHORED option can be
155     set at the time of matching as well as at compile time.
156    
157     If errptr is NULL, pcre_compile() returns NULL immediately.
158     Otherwise, if compilation of a pattern fails, pcre_compile()
159     returns NULL, and sets the variable pointed to by errptr to
160     point to a textual error message. The offset from the start
161     of the pattern to the character where the error was
162     discovered is placed in the variable pointed to by
163     erroffset, which must not be NULL. If it is, an immediate
164     error is given.
165    
166     If the final argument, tableptr, is NULL, PCRE uses a
167     default set of character tables which are built when it is
168     compiled, using the default C locale. Otherwise, tableptr
169     must be the result of a call to pcre_maketables(). See the
170     section on locale support below.
171    
172     The following option bits are defined in the header file:
173    
174     PCRE_ANCHORED
175    
176     If this bit is set, the pattern is forced to be "anchored",
177     that is, it is constrained to match only at the start of the
178     string which is being searched (the "subject string"). This
179     effect can also be achieved by appropriate constructs in the
180     pattern itself, which is the only way to do it in Perl.
181    
182     PCRE_CASELESS
183    
184     If this bit is set, letters in the pattern match both upper
185     and lower case letters. It is equivalent to Perl's /i
186     option.
187    
188     PCRE_DOLLAR_ENDONLY
189    
190     If this bit is set, a dollar metacharacter in the pattern
191     matches only at the end of the subject string. Without this
192     option, a dollar also matches immediately before the final
193     character if it is a newline (but not before any other new-
194     lines). The PCRE_DOLLAR_ENDONLY option is ignored if
195     PCRE_MULTILINE is set. There is no equivalent to this option
196     in Perl.
197    
198     PCRE_DOTALL
199    
If this bit is set, a dot metacharacter in the pattern matches
all characters, including newlines. Without it, newlines are
excluded. This option is equivalent to Perl's /s option. A
negative class such as [^a] always matches a newline character,
independent of the setting of this option.
205    
206     PCRE_EXTENDED
207    
208     If this bit is set, whitespace data characters in the pat-
209     tern are totally ignored except when escaped or inside a
210     character class, and characters between an unescaped # out-
211     side a character class and the next newline character,
212     inclusive, are also ignored. This is equivalent to Perl's /x
213     option, and makes it possible to include comments inside
214     complicated patterns. Note, however, that this applies only
215     to data characters. Whitespace characters may never appear
216     within special character sequences in a pattern, for example
217     within the sequence (?( which introduces a conditional sub-
218     pattern.
219    
220     PCRE_EXTRA
221    
222 nigel 43 This option was invented in order to turn on additional
223     functionality of PCRE that is incompatible with Perl, but it
224     is currently of very little use. When set, any backslash in
225     a pattern that is followed by a letter that has no special
226     meaning causes an error, thus reserving these combinations
227     for future expansion. By default, as in Perl, a backslash
228     followed by a letter with no special meaning is treated as a
229     literal. There are at present no other features controlled
230     by this option. It can also be set by a (?X) option setting
231     within a pattern.
232 nigel 41
233     PCRE_MULTILINE
234    
235     By default, PCRE treats the subject string as consisting of
236     a single "line" of characters (even if it actually contains
237     several newlines). The "start of line" metacharacter (^)
238     matches only at the start of the string, while the "end of
239     line" metacharacter ($) matches only at the end of the
240     string, or before a terminating newline (unless
241     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
242    
When PCRE_MULTILINE is set, the "start of line" and "end of
line" constructs match immediately following or immediately
before any newline in the subject string, respectively, as well
as at the very start and end. This is equivalent to Perl's /m
option. If there are no "\n" characters in a subject string, or
no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE
has no effect.
250    
251     PCRE_UNGREEDY
252    
253     This option inverts the "greediness" of the quantifiers so
254     that they are not greedy by default, but become greedy if
255     followed by "?". It is not compatible with Perl. It can also
256     be set by a (?U) option setting within the pattern.
257    
258 nigel 49 PCRE_UTF8
259 nigel 41
260 nigel 49 This option causes PCRE to regard both the pattern and the
261     subject as strings of UTF-8 characters instead of just byte
262     strings. However, it is available only if PCRE has been
263     built to include UTF-8 support. If not, the use of this
264     option provokes an error. Support for UTF-8 is new, experi-
265     mental, and incomplete. Details of exactly what it entails
266     are given below.
267 nigel 41
268 nigel 49
269    
270 nigel 41 STUDYING A PATTERN
When a pattern is going to be used several times, it is worth
spending more time analyzing it in order to reduce the time
taken for matching. The function pcre_study() takes a pointer
to a compiled pattern as its first argument, and returns a
pointer to a pcre_extra block (another void typedef) containing
additional information about the pattern; this can be passed to
pcre_exec(). If no additional information is available, NULL is
returned.
280    
281     The second argument contains option bits. At present, no
282     options are defined for pcre_study(), and this argument
283     should always be zero.
284    
285     The third argument for pcre_study() is a pointer to an error
286     message. If studying succeeds (even if no data is returned),
287     the variable it points to is set to NULL. Otherwise it
288     points to a textual error message.
289    
290     At present, studying a pattern is useful only for non-
291     anchored patterns that do not have a single fixed starting
292     character. A bitmap of possible starting characters is
293     created.
294    
295    
296    
297     LOCALE SUPPORT
298     PCRE handles caseless matching, and determines whether char-
299     acters are letters, digits, or whatever, by reference to a
300     set of tables. The library contains a default set of tables
301     which is created in the default C locale when PCRE is com-
302     piled. This is used when the final argument of
303     pcre_compile() is NULL, and is sufficient for many applica-
304     tions.
305    
306     An alternative set of tables can, however, be supplied. Such
307     tables are built by calling the pcre_maketables() function,
308     which has no arguments, in the relevant locale. The result
309     can then be passed to pcre_compile() as often as necessary.
310     For example, to build and use tables that are appropriate
311     for the French locale (where accented characters with codes
312     greater than 128 are treated as letters), the following code
313     could be used:
314    
315     setlocale(LC_CTYPE, "fr");
316     tables = pcre_maketables();
317     re = pcre_compile(..., tables);
318    
319     The tables are built in memory that is obtained via
320     pcre_malloc. The pointer that is passed to pcre_compile is
321     saved with the compiled pattern, and the same tables are
322     used via this pointer by pcre_study() and pcre_exec(). Thus
323     for any single pattern, compilation, studying and matching
324     all happen in the same locale, but different patterns can be
325     compiled in different locales. It is the caller's responsi-
326     bility to ensure that the memory containing the tables
327     remains available for as long as it is needed.
328    
329    
330    
331     INFORMATION ABOUT A PATTERN
The pcre_fullinfo() function returns information about a
compiled pattern. It replaces the obsolete pcre_info()
function, which is nevertheless retained for backwards
compatibility (and is documented below).
336 nigel 41
337 nigel 43 The first argument for pcre_fullinfo() is a pointer to the
338     compiled pattern. The second argument is the result of
339     pcre_study(), or NULL if the pattern was not studied. The
340     third argument specifies which piece of information is
341     required, while the fourth argument is a pointer to a vari-
342     able to receive the data. The yield of the function is zero
343     for success, or one of the following negative numbers:
344    
PCRE_ERROR_NULL        the argument code was NULL
                       the argument where was NULL
PCRE_ERROR_BADMAGIC    the "magic number" was not found
PCRE_ERROR_BADOPTION   the value of what was invalid
349 nigel 41
350 nigel 43 The possible values for the third argument are defined in
351     pcre.h, and are as follows:
352    
353     PCRE_INFO_OPTIONS
354    
Return a copy of the options with which the pattern was
compiled. The fourth argument should point to an unsigned long
int variable. These option bits are those specified in the call
to pcre_compile(), modified by any top-level option settings
within the pattern itself, and with the PCRE_ANCHORED bit
forcibly set if the form of the pattern implies that it can
match only at the start of a subject string.
363 nigel 41
364 nigel 43 PCRE_INFO_SIZE
365    
366     Return the size of the compiled pattern, that is, the value
367     that was passed as the argument to pcre_malloc() when PCRE
368     was getting memory in which to place the compiled data. The
369     fourth argument should point to a size_t variable.
370    
371     PCRE_INFO_CAPTURECOUNT
372    
373     Return the number of capturing subpatterns in the pattern.
374     The fourth argument should point to an int variable.
375    
376     PCRE_INFO_BACKREFMAX
377    
378 nigel 49 Return the number of the highest back reference in the
379     pattern. The fourth argument should point to an int vari-
380     able. Zero is returned if there are no back references.
381 nigel 43
382     PCRE_INFO_FIRSTCHAR
383    
384     Return information about the first character of any matched
385     string, for a non-anchored pattern. If there is a fixed
386     first character, e.g. from a pattern such as
387 nigel 47 (cat|cow|coyote), it is returned in the integer pointed to
388     by where. Otherwise, if either
389 nigel 41
390     (a) the pattern was compiled with the PCRE_MULTILINE option,
391     and every branch starts with "^", or
392    
393     (b) every branch of the pattern starts with ".*" and
394     PCRE_DOTALL is not set (if it were set, the pattern would be
395     anchored),
396 nigel 43
397 nigel 47 -1 is returned, indicating that the pattern matches only at
398     the start of a subject string or after any "\n" within the
399     string. Otherwise -2 is returned. For anchored patterns, -2
400     is returned.
401 nigel 41
402 nigel 43 PCRE_INFO_FIRSTTABLE
403 nigel 41
404 nigel 43 If the pattern was studied, and this resulted in the con-
405     struction of a 256-bit table indicating a fixed set of char-
406     acters for the first character in any matching string, a
407     pointer to the table is returned. Otherwise NULL is
408     returned. The fourth argument should point to an unsigned
409     char * variable.
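
A minimal sketch of testing a character against such a 256-bit
table once a pointer to it has been obtained. The text above says
only that the table holds 256 bits. The layout used here (32
bytes, with character c held in bit c % 8 of byte c / 8) is an
assumption, and both helper names are hypothetical, not part of
the PCRE API.

```c
/* Hypothetical helper: nonzero if character c is flagged in a 256-bit
   (32-byte) table. ASSUMED layout: bit (c % 8) of byte (c / 8). */
static int first_table_has(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] & (1u << (c % 8))) != 0;
}

/* Hypothetical helper that flags character c, used here only to build
   a table for demonstration under the same assumed layout. */
static void first_table_set(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}
```

A caller could use such a test to skip subject positions whose
character cannot begin a match.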
410 nigel 41
411 nigel 43 PCRE_INFO_LASTLITERAL
412    
413     For a non-anchored pattern, return the value of the right-
414     most literal character which must exist in any matched
415     string, other than at its start. The fourth argument should
416     point to an int variable. If there is no such character, or
417     if the pattern is anchored, -1 is returned. For example, for
418     the pattern /a\d+z\d+/ the returned value is 'z'.
419    
420     The pcre_info() function is now obsolete because its inter-
421     face is too restrictive to return all the available data
422     about a compiled pattern. New programs should use
423     pcre_fullinfo() instead. The yield of pcre_info() is the
424     number of capturing subpatterns, or one of the following
425     negative numbers:
426    
427     PCRE_ERROR_NULL the argument code was NULL
428     PCRE_ERROR_BADMAGIC the "magic number" was not found
429    
430     If the optptr argument is not NULL, a copy of the options
431     with which the pattern was compiled is placed in the integer
432     it points to (see PCRE_INFO_OPTIONS above).
433    
434     If the pattern is not anchored and the firstcharptr argument
435     is not NULL, it is used to pass back information about the
436     first character of any matched string (see
437     PCRE_INFO_FIRSTCHAR above).
438    
439    
440    
441 nigel 41 MATCHING A PATTERN
442     The function pcre_exec() is called to match a subject string
443     against a pre-compiled pattern, which is passed in the code
444     argument. If the pattern has been studied, the result of the
445     study should be passed in the extra argument. Otherwise this
446     must be NULL.
447    
The PCRE_ANCHORED option can be passed in the options argument,
whose unused bits must be zero. However, if a pattern was
compiled with PCRE_ANCHORED, or turned out to be anchored by
virtue of its contents, it cannot be made unanchored at
matching time.
453    
454     There are also three further options that can be set only at
455     matching time:
456    
457     PCRE_NOTBOL
458    
459     The first character of the string is not the beginning of a
460     line, so the circumflex metacharacter should not match
461     before it. Setting this without PCRE_MULTILINE (at compile
462     time) causes circumflex never to match.
463    
464     PCRE_NOTEOL
465    
466     The end of the string is not the end of a line, so the dol-
467     lar metacharacter should not match it nor (except in multi-
468     line mode) a newline immediately before it. Setting this
469     without PCRE_MULTILINE (at compile time) causes dollar never
470     to match.
471    
472     PCRE_NOTEMPTY
473    
474     An empty string is not considered to be a valid match if
475     this option is set. If there are alternatives in the pat-
476     tern, they are tried. If all the alternatives match the
477     empty string, the entire match fails. For example, if the
478     pattern
479    
480     a?b?
481    
482     is applied to a string not beginning with "a" or "b", it
483     matches the empty string at the start of the subject. With
484     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
485     further into the string for occurrences of "a" or "b".
486    
487     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
488     make a special case of a pattern match of the empty string
489     within its split() function, and when using the /g modifier.
490     It is possible to emulate Perl's behaviour after matching a
491     null string by first trying the match again at the same
492     offset with PCRE_NOTEMPTY set, and then if that fails by
493     advancing the starting offset (see below) and trying an
494     ordinary match again.
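
The retry procedure just described can be sketched as a loop. To
keep the sketch independent of the library, the matcher is passed
in as a function pointer; in a real program it would be a thin
wrapper around pcre_exec(). The names OPT_NOTEMPTY, match_fn,
find_all and match_a_star are all hypothetical, and match_a_star
is a toy stand-in that behaves like the pattern a* searched from
a starting offset.

```c
/* Hypothetical stand-in for the PCRE_NOTEMPTY option bit. */
#define OPT_NOTEMPTY 1

/* Hypothetical matcher: returns 1 and sets *start and *end on success,
   0 if nothing matches at or after startoffset. */
typedef int (*match_fn)(const char *subject, int length, int startoffset,
                        int options, int *start, int *end);

/* Toy matcher equivalent to the pattern a*, honouring OPT_NOTEMPTY. */
static int match_a_star(const char *subject, int length, int startoffset,
                        int options, int *start, int *end)
{
    int p;
    for (p = startoffset; p <= length; p++) {
        int q = p;
        while (q < length && subject[q] == 'a') q++;
        if (q > p || !(options & OPT_NOTEMPTY)) {
            *start = p;
            *end = q;
            return 1;
        }
    }
    return 0;
}

/* Collect up to max matches, treating empty matches as described above. */
static int find_all(match_fn match, const char *subject, int length,
                    int *starts, int *ends, int max)
{
    int count = 0, offset = 0, options = 0;
    while (offset <= length && count < max) {
        int s, e;
        if (match(subject, length, offset, options, &s, &e)) {
            starts[count] = s;
            ends[count] = e;
            count++;
            offset = e;
            /* After an empty match, retry at the same offset, non-empty. */
            options = (s == e) ? OPT_NOTEMPTY : 0;
        } else if (options & OPT_NOTEMPTY) {
            /* The retry failed: advance and try an ordinary match again. */
            offset++;
            options = 0;
        } else {
            break;              /* an ordinary match failed: finished */
        }
    }
    return count;
}
```

With the toy matcher and the subject "baab", the loop reports an
empty match at offset 0, "aa" at offsets 1 to 3, and empty
matches at offsets 3 and 4.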
495    
496     The subject string is passed as a pointer in subject, a
497     length in length, and a starting offset in startoffset.
498     Unlike the pattern string, it may contain binary zero char-
499     acters. When the starting offset is zero, the search for a
500     match starts at the beginning of the subject, and this is by
501     far the most common case.
502    
503     A non-zero starting offset is useful when searching for
504     another match in the same subject by calling pcre_exec()
505     again after a previous success. Setting startoffset differs
506     from just passing over a shortened string and setting
507     PCRE_NOTBOL in the case of a pattern that begins with any
508     kind of lookbehind. For example, consider the pattern
509    
510     \Biss\B
511    
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a
word boundary.) When applied to the string "Mississippi" the
first call to pcre_exec() finds the first occurrence. If
pcre_exec() is called again with just the remainder of the
subject, namely "issippi", it does not match, because \B is
always false at the start of the subject, which is deemed to
be a word boundary. However, if pcre_exec() is passed the
entire string again, but with startoffset set to 4, it finds
the second occurrence of "iss" because it is able to look
behind the starting point to discover that it is preceded by
a letter.
524    
525     If a non-zero starting offset is passed when the pattern is
526     anchored, one attempt to match at the given offset is tried.
527     This can only succeed if the pattern does not require the
528     match to be at the start of the subject.
529    
530     In general, a pattern matches a certain portion of the sub-
531     ject, and in addition, further substrings from the subject
532     may be picked out by parts of the pattern. Following the
533     usage in Jeffrey Friedl's book, this is called "capturing"
534     in what follows, and the phrase "capturing subpattern" is
535     used for a fragment of a pattern that picks out a substring.
536     PCRE supports several other kinds of parenthesized subpat-
537     tern that do not cause substrings to be captured.
538    
539     Captured substrings are returned to the caller via a vector
540     of integer offsets whose address is passed in ovector. The
541     number of elements in the vector is passed in ovecsize. The
542     first two-thirds of the vector is used to pass back captured
543     substrings, each substring using a pair of integers. The
544     remaining third of the vector is used as workspace by
545     pcre_exec() while matching capturing subpatterns, and is not
546     available for passing back information. The length passed in
547     ovecsize should always be a multiple of three. If it is not,
548     it is rounded down.
549    
550     When a match has been successful, information about captured
551     substrings is returned in pairs of integers, starting at the
552     beginning of ovector, and continuing up to two-thirds of its
553     length at the most. The first element of a pair is set to
554     the offset of the first character in a substring, and the
555     second is set to the offset of the first character after the
556     end of a substring. The first pair, ovector[0] and ovec-
557     tor[1], identify the portion of the subject string matched
558     by the entire pattern. The next pair is used for the first
559     capturing subpattern, and so on. The value returned by
560     pcre_exec() is the number of pairs that have been set. If
561     there are no capturing subpatterns, the return value from a
562     successful match is 1, indicating that just the first pair
563     of offsets has been set.
564    
565     Some convenience functions are provided for extracting the
566     captured substrings as separate strings. These are described
567     in the following section.
568    
It is possible for capturing subpattern number n+1 to match
some part of the subject when subpattern n has not been used at
all. For example, if the string "abc" is matched against the
pattern (a|(z))(bc), subpatterns 1 and 3 are matched, but 2 is
not. When this happens, both offset values corresponding to the
unused subpattern are set to -1.
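
A short sketch of walking the offset pairs after a successful
match, including the check for an unused subpattern. The helper
names capture_length and print_captures are hypothetical; only
the layout of ovector and the meaning of the return value come
from the text above.

```c
#include <stdio.h>

/* Length of capture number n, or -1 if that subpattern was not used.
   rc is the positive value returned by a successful match. */
static int capture_length(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc) return -1;       /* this pair was never set */
    if (ovector[2 * n] < 0) return -1;     /* unused: offsets are -1 */
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Print every pair set by a successful match. Printing with %.*s stops
   at a binary zero, so this suits only subjects without embedded zeros. */
static void print_captures(const char *subject, const int *ovector, int rc)
{
    int i;
    for (i = 0; i < rc; i++) {
        int len = capture_length(ovector, rc, i);
        if (len < 0)
            printf("%d: <unset>\n", i);
        else
            printf("%d: %.*s\n", i, len, subject + ovector[2 * i]);
    }
}
```

For the "abc" example, four pairs are set and the offsets would
be 0,3 for the whole match, 0,1 for subpattern 1, -1,-1 for
subpattern 2 and 1,3 for subpattern 3.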
575    
576     If a capturing subpattern is matched repeatedly, it is the
577     last portion of the string that it matched that gets
578     returned.
579    
580     If the vector is too small to hold all the captured sub-
581     strings, it is used as far as possible (up to two-thirds of
582     its length), and the function returns a value of zero. In
583     particular, if the substring offsets are not of interest,
584     pcre_exec() may be called with ovector passed as NULL and
585     ovecsize as zero. However, if the pattern contains back
586     references and the ovector isn't big enough to remember the
587     related substrings, PCRE has to get additional memory for
588     use during matching. Thus it is usually advisable to supply
589     an ovector.
590    
591     Note that pcre_info() can be used to find out how many cap-
592     turing subpatterns there are in a compiled pattern. The
593     smallest size for ovector that will allow for n captured
594     substrings in addition to the offsets of the substring
595     matched by the whole pattern is (n+1)*3.
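
The sizing arithmetic can be written down directly. This is only
a sketch of the rules stated above; the function names
ovector_size_for and max_pairs are hypothetical.

```c
/* Smallest ovector size that allows for n captured substrings in
   addition to the offsets of the whole match: (n+1)*3. */
static int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs a vector of ovecsize ints can pass back.
   The size is rounded down to a multiple of three, two-thirds of it
   hold the pairs, and the remaining third is workspace. */
static int max_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a pattern with three capturing subpatterns needs a
vector of at least 12 integers, and a vector of 14 integers is
treated as if it had 12.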
596    
597     If pcre_exec() fails, it returns a negative number. The fol-
598     lowing are defined in the header file:
599    
600     PCRE_ERROR_NOMATCH (-1)
601    
602     The subject string did not match the pattern.
603    
604     PCRE_ERROR_NULL (-2)
605    
606     Either code or subject was passed as NULL, or ovector was
607     NULL and ovecsize was not zero.
608    
609     PCRE_ERROR_BADOPTION (-3)
610    
611     An unrecognized bit was set in the options argument.
612    
613     PCRE_ERROR_BADMAGIC (-4)
614    
615     PCRE stores a 4-byte "magic number" at the start of the com-
616     piled code, to catch the case when it is passed a junk
617     pointer. This is the error it gives when the magic number
618     isn't present.
619    
620     PCRE_ERROR_UNKNOWN_NODE (-5)
621    
622     While running the pattern match, an unknown item was encoun-
623     tered in the compiled pattern. This error could be caused by
624     a bug in PCRE or by overwriting of the compiled pattern.
625    
626     PCRE_ERROR_NOMEMORY (-6)
627    
628     If a pattern contains back references, but the ovector that
629     is passed to pcre_exec() is not big enough to remember the
630     referenced substrings, PCRE gets a block of memory at the
631     start of matching to use for this purpose. If the call via
632     pcre_malloc() fails, this error is given. The memory is
633     freed at the end of matching.
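
A sketch of turning these return values into messages for
logging. The numeric values are the ones listed above; the EXEC_
names and exec_error_text are hypothetical local stand-ins, and a
real program would use the PCRE_ERROR_ macros from pcre.h
instead.

```c
/* Local stand-ins for the PCRE_ERROR_ values listed above. */
enum {
    EXEC_NOMATCH      = -1,
    EXEC_NULL         = -2,
    EXEC_BADOPTION    = -3,
    EXEC_BADMAGIC     = -4,
    EXEC_UNKNOWN_NODE = -5,
    EXEC_NOMEMORY     = -6
};

/* Map a matching return value to a short description. */
static const char *exec_error_text(int rc)
{
    switch (rc) {
    case EXEC_NOMATCH:      return "no match";
    case EXEC_NULL:         return "NULL argument";
    case EXEC_BADOPTION:    return "unrecognized option bit";
    case EXEC_BADMAGIC:     return "magic number not found";
    case EXEC_UNKNOWN_NODE: return "unknown item in compiled pattern";
    case EXEC_NOMEMORY:     return "out of memory";
    default:                return rc >= 0 ? "not an error" : "unknown error";
    }
}
```

Only the first of these is an ordinary outcome; the others
indicate a problem in the calling program, the compiled pattern,
or memory allocation.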
634    
635    
636    
637     EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the
offsets returned by pcre_exec() in ovector. For convenience,
the functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are provided for extracting
captured substrings as new, separate, zero-terminated
strings. A substring that contains a binary zero is
correctly extracted and has a further zero added on the end,
but the result does not, of course, function as a C string.
655    
656     The first three arguments are the same for all three func-
657     tions: subject is the subject string which has just been
658     successfully matched, ovector is a pointer to the vector of
659     integer offsets that was passed to pcre_exec(), and
660     stringcount is the number of substrings that were captured
661     by the match, including the substring that matched the
662     entire regular expression. This is the value returned by
663     pcre_exec if it is greater than zero. If pcre_exec()
664     returned zero, indicating that it ran out of space in ovec-
665 nigel 47 tor, the value passed as stringcount should be the size of
666     the vector divided by three.
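
The choice of stringcount can be sketched as a small helper. The
name stringcount_for is hypothetical; rc is assumed to be the
value that pcre_exec() returned and ovecsize the size that was
passed to it.

```c
/* Value to pass as stringcount to the extraction functions. */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was too small: use its capacity */
    return 0;                   /* the match failed: nothing to extract */
}
```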
667 nigel 41
668     The functions pcre_copy_substring() and pcre_get_substring()
669     extract a single substring, whose number is given as string-
670     number. A value of zero extracts the substring that matched
671     the entire pattern, while higher values extract the captured
672     substrings. For pcre_copy_substring(), the string is placed
673     in buffer, whose length is given by buffersize, while for
674 nigel 49 pcre_get_substring() a new block of memory is obtained via
675 nigel 41 pcre_malloc, and its address is returned via stringptr. The
676     yield of the function is the length of the string, not
677     including the terminating zero, or one of
678    
679     PCRE_ERROR_NOMEMORY (-6)
680    
681     The buffer was too small for pcre_copy_substring(), or the
682     attempt to get memory failed for pcre_get_substring().
683    
684     PCRE_ERROR_NOSUBSTRING (-7)
685    
686     There is no substring whose number is stringnumber.
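
As a rough illustration of the behaviour described above, the
following sketch mimics pcre_copy_substring() in plain C. It is
not the library source: the names sketch_stringcount() and
sketch_copy_substring() and the SKETCH_ constants are invented
for this example, and only the return values documented above
are assumed.

```c
#include <string.h>

/* Error values as documented above; the SKETCH_ names are hypothetical. */
#define SKETCH_ERROR_NOMEMORY    (-6)
#define SKETCH_ERROR_NOSUBSTRING (-7)

/* Value to pass as stringcount: the pcre_exec() result if positive,
   ovecsize/3 if pcre_exec() returned 0, and 0 if there was no match. */
int sketch_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    return (rc == 0) ? ovecsize / 3 : 0;
}

/* Copies substring number stringnumber into buffer and returns its
   length (excluding the terminating zero), or a negative error value. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    /* Each substring occupies a pair of offsets: start, then end. */
    start  = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;

    if (buffersize < length + 1)
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

With the subject "the red king" and offsets {0,12, 4,12, 4,7,
8,12}, substring 2 is "red" and the yield is 3.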
687    
688     The pcre_get_substring_list() function extracts all avail-
689     able substrings and builds a list of pointers to them. All
690     this is done in a single block of memory which is obtained
691     via pcre_malloc. The address of the memory block is returned
692     via listptr, which is also the start of the list of string
693     pointers. The end of the list is marked by a NULL pointer.
694     The yield of the function is zero if all went well, or
695    
696     PCRE_ERROR_NOMEMORY (-6)
697    
698     if the attempt to get the memory block failed.
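
The single-block layout described above can be pictured with
the sketch below. It is an illustration only, not the library
source: sketch_substring_list() is a hypothetical name, and
malloc() stands in for the function pointed to by pcre_malloc.

```c
#include <stdlib.h>
#include <string.h>

/* Builds one malloc'ed block holding stringcount pointers, a NULL
   terminator, and then the zero-terminated strings themselves.
   Returns 0 on success or -6 (PCRE_ERROR_NOMEMORY) on failure. */
int sketch_substring_list(const char *subject, const int *ovector,
                          int stringcount, char ***listptr)
{
    size_t size = sizeof(char *);          /* room for the final NULL */
    char **list;
    char *p;
    int i;

    /* First pass: add up the pointer slots and string bytes needed. */
    for (i = 0; i < stringcount; i++) {
        int len = ovector[2 * i + 1] - ovector[2 * i];
        size += sizeof(char *) + (size_t)len + 1;
    }

    list = malloc(size);
    if (list == NULL)
        return -6;

    /* Second pass: strings are stored just after the pointer list. */
    p = (char *)(list + stringcount + 1);
    for (i = 0; i < stringcount; i++) {
        int len = ovector[2 * i + 1] - ovector[2 * i];
        if (len > 0)
            memcpy(p, subject + ovector[2 * i], (size_t)len);
        list[i] = p;
        p += len;
        *p++ = '\0';
    }
    list[stringcount] = NULL;

    *listptr = list;
    return 0;
}
```

Because pointers and strings share one block, a single free()
of the returned address releases everything.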
699    
700     When any of these functions encounter a substring that is
701     unset, which can happen when capturing subpattern number n+1
702     matches some part of the subject, but subpattern n has not
703     been used at all, they return an empty string. This can be
704     distinguished from a genuine zero-length substring by
705     inspecting the appropriate offset in ovector, which is nega-
706     tive for unset substrings.
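
A minimal sketch of the check just described, assuming only
that unset substrings have a negative offset in ovector. The
helper name sketch_is_unset() is invented for this example.

```c
/* Returns 1 if capturing subpattern n is unset, or 0 if it is set
   (possibly to a genuine zero-length string). */
int sketch_is_unset(const int *ovector, int n)
{
    return ovector[2 * n] < 0;
}
```

For example, a pattern such as (a)|(b) matched against "b"
leaves subpattern 1 unset, whereas (x*)b matched against "b"
sets subpattern 1 to a genuine empty string.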
707    
708 nigel 49 The two convenience functions pcre_free_substring() and
709     pcre_free_substring_list() can be used to free the memory
710     returned by a previous call of pcre_get_substring() or
711     pcre_get_substring_list(), respectively. They do nothing
712     more than call the function pointed to by pcre_free, which
713     of course could be called directly from a C program. How-
714     ever, PCRE is used in some situations where it is linked via
715     a special interface to another programming language which
716     cannot use pcre_free directly; it is for these cases that
717     the functions are provided.
718 nigel 41
719    
720    
721     LIMITATIONS
722     There are some size limitations in PCRE but it is hoped that
723     they will never in practice be relevant. The maximum length
724     of a compiled pattern is 65539 (sic) bytes. All values in
725     repeating quantifiers must be less than 65536. The maximum
726     number of capturing subpatterns is 99. The maximum number
727     of all parenthesized subpatterns, including capturing sub-
728     patterns, assertions, and other types of subpattern, is 200.
729    
730     The maximum length of a subject string is the largest posi-
731     tive number that an integer variable can hold. However, PCRE
732     uses recursion to handle subpatterns and indefinite repeti-
733     tion. This means that the available stack space may limit
734     the size of a subject string that can be processed by cer-
735     tain patterns.
736    
737    
738    
739     DIFFERENCES FROM PERL
740     The differences described here are with respect to Perl
741     5.005.
742    
743     1. By default, a whitespace character is any character that
744     the C library function isspace() recognizes, though it is
745     possible to compile PCRE with alternative character type
746     tables. Normally isspace() matches space, formfeed, newline,
747     carriage return, horizontal tab, and vertical tab. Perl 5 no
748     longer includes vertical tab in its set of whitespace char-
749     acters. The \v escape that was in the Perl documentation for
750     a long time was never in fact recognized. However, the char-
751     acter itself was treated as whitespace at least up to 5.002.
752     In 5.004 and 5.005 it does not match \s.
753    
754     2. PCRE does not allow repeat quantifiers on lookahead
755     assertions. Perl permits them, but they do not mean what you
756     might think. For example, (?!a){3} does not assert that the
757     next three characters are not "a". It just asserts that the
758     next character is not "a" three times.
759    
760     3. Capturing subpatterns that occur inside negative looka-
761     head assertions are counted, but their entries in the
762     offsets vector are never set. Perl sets its numerical vari-
763     ables from any such patterns that are matched before the
764     assertion fails to match something (thereby succeeding), but
765     only if the negative lookahead assertion contains just one
766     branch.
767    
768     4. Though binary zero characters are supported in the sub-
769     ject string, they are not allowed in a pattern string
770     because it is passed as a normal C string, terminated by
771     zero. The escape sequence "\0" can be used in the pattern to
772     represent a binary zero.
773    
774     5. The following Perl escape sequences are not supported:
775     \l, \u, \L, \U, \E, \Q. In fact these are implemented by
776     Perl's general string-handling and are not part of its pat-
777     tern matching engine.
778    
779     6. The Perl \G assertion is not supported as it is not
780     relevant to single pattern matches.
781    
782 nigel 43 7. Fairly obviously, PCRE does not support the (?{code}) and
783     (?p{code}) constructions. However, there is some experimen-
784     tal support for recursive patterns using the non-Perl item
785     (?R).
786 nigel 49
787 nigel 41 8. There are at the time of writing some oddities in Perl
788     5.005_02 concerned with the settings of captured strings
789     when part of a pattern is repeated. For example, matching
790     "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
791     "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
792     unset. However, if the pattern is changed to
793 nigel 47 /^(aa(b(b))?)+$/ then $2 (and $3) are set.
794 nigel 41
795     In Perl 5.004 $2 is set in both cases, and that is also true
796     of PCRE. If in the future Perl changes to a consistent state
797     that is different, PCRE may change to follow.
798    
799     9. Another as yet unresolved discrepancy is that in Perl
800     5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
801     "a", whereas in PCRE it does not. However, in both Perl and
802     PCRE /^(a)?a/ matched against "a" leaves $1 unset.
803    
804     10. PCRE provides some extensions to the Perl regular
805     expression facilities:
806    
807     (a) Although lookbehind assertions must match fixed length
808     strings, each alternative branch of a lookbehind assertion
809     can match a different length of string. Perl 5.005 requires
810     them all to have the same length.
811    
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
set, the $ meta-character matches only at the very end of
the string.
815    
816     (c) If PCRE_EXTRA is set, a backslash followed by a letter
817     with no special meaning is faulted.
818    
819 nigel 43 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
820     tion quantifiers is inverted, that is, by default they are
821     not greedy, but if followed by a question mark they are.
822 nigel 41
823     (e) PCRE_ANCHORED can be used to force a pattern to be tried
824     only at the start of the subject.
825    
826     (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
827     for pcre_exec() have no Perl equivalents.
828    
829 nigel 43 (g) The (?R) construct allows for recursive pattern matching
830     (Perl 5.6 can do this using the (?p{code}) construct, which
831     PCRE cannot of course support.)
832 nigel 41
833    
834 nigel 43
835 nigel 41 REGULAR EXPRESSION DETAILS
836     The syntax and semantics of the regular expressions sup-
837     ported by PCRE are described below. Regular expressions are
838     also described in the Perl documentation and in a number of
839     other books, some of which have copious examples. Jeffrey
840     Friedl's "Mastering Regular Expressions", published by
841 nigel 49 O'Reilly (ISBN 1-56592-257), covers them in great detail.
842    
843 nigel 41 The description here is intended as reference documentation.
844 nigel 49 The basic operation of PCRE is on strings of bytes. However,
845     there is the beginnings of some support for UTF-8 character
846     strings. To use this support you must configure PCRE to
847     include it, and then call pcre_compile() with the PCRE_UTF8
848     option. How this affects the pattern matching is described
849     in the final section of this document.
850 nigel 41
851     A regular expression is a pattern that is matched against a
852     subject string from left to right. Most characters stand for
853     themselves in a pattern, and match the corresponding charac-
854     ters in the subject. As a trivial example, the pattern
855    
856     The quick brown fox
857    
858     matches a portion of a subject string that is identical to
859     itself. The power of regular expressions comes from the
860     ability to include alternatives and repetitions in the pat-
861     tern. These are encoded in the pattern by the use of meta-
862     characters, which do not stand for themselves but instead
863     are interpreted in some special way.
864    
865     There are two different sets of meta-characters: those that
866     are recognized anywhere in the pattern except within square
867     brackets, and those that are recognized in square brackets.
868     Outside square brackets, the meta-characters are as follows:
869    
  \      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier
885    
886     Part of a pattern that is in square brackets is called a
887     "character class". In a character class the only meta-
888     characters are:
889    
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class
894    
895     The following sections describe the use of each of the
896     meta-characters.
897    
898    
899    
900     BACKSLASH
901     The backslash character has several uses. Firstly, if it is
902     followed by a non-alphameric character, it takes away any
903     special meaning that character may have. This use of
904     backslash as an escape character applies both inside and
905     outside character classes.
906    
907     For example, if you want to match a "*" character, you write
908     "\*" in the pattern. This applies whether or not the follow-
909     ing character would otherwise be interpreted as a meta-
910     character, so it is always safe to precede a non-alphameric
911     with "\" to specify that it stands for itself. In particu-
912     lar, if you want to match a backslash, you write "\\".
913    
If a pattern is compiled with the PCRE_EXTENDED option,
whitespace in the pattern (other than in a character class)
and characters between a "#" outside a character class and
the next newline character are ignored. An escaping backslash
can be used to include a whitespace or "#" character as part
of the pattern.
920    
921     A second use of backslash provides a way of encoding non-
922     printing characters in patterns in a visible manner. There
923     is no restriction on the appearance of non-printing charac-
924     ters, apart from the binary zero that terminates a pattern,
925     but when a pattern is being prepared by text editing, it is
926     usually easier to use one of the following escape sequences
927     than the binary character it represents:
928    
  \a     alarm, that is, the BEL character (hex 07)
  \cx    "control-x", where x is any character
  \e     escape (hex 1B)
  \f     formfeed (hex 0C)
  \n     newline (hex 0A)
  \r     carriage return (hex 0D)
  \t     tab (hex 09)
  \xhh   character with hex code hh
  \ddd   character with octal code ddd, or backreference
938    
939     The precise effect of "\cx" is as follows: if "x" is a lower
940     case letter, it is converted to upper case. Then bit 6 of
941     the character (hex 40) is inverted. Thus "\cz" becomes hex
942     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
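
The "\cx" rule above amounts to a case conversion followed by
an exclusive-or with hex 40. The sketch below shows this; the
function name sketch_control_escape() is hypothetical and the
code is not taken from PCRE.

```c
#include <ctype.h>

/* Returns the byte denoted by "\cx": a lower case letter is first
   converted to upper case, then bit 6 (hex 40) is inverted. */
unsigned char sketch_control_escape(unsigned char x)
{
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```

Applied to "z", "{" and ";" this gives hex 1A, 3B and 7B, the
values quoted above.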
943    
944     After "\x", up to two hexadecimal digits are read (letters
945     can be in upper or lower case).
946    
947     After "\0" up to two further octal digits are read. In both
948     cases, if there are fewer than two digits, just those that
949     are present are used. Thus the sequence "\0\x\07" specifies
950     two binary zeros followed by a BEL character. Make sure you
951     supply two digits after the initial zero if the character
952     that follows is itself an octal digit.
953    
954     The handling of a backslash followed by a digit other than 0
955     is complicated. Outside a character class, PCRE reads it
956     and any following digits as a decimal number. If the number
957     is less than 10, or if there have been at least that many
958     previous capturing left parentheses in the expression, the
959     entire sequence is taken as a back reference. A description
960     of how this works is given later, following the discussion
961     of parenthesized subpatterns.
962    
963     Inside a character class, or if the decimal number is
964     greater than 9 and there have not been that many capturing
965     subpatterns, PCRE re-reads up to three octal digits follow-
966     ing the backslash, and generates a single byte from the
967     least significant 8 bits of the value. Any subsequent digits
968     stand for themselves. For example:
969    
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
           previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   is the character with octal code 113 (since there
           can be no more than 99 back references)
  \377   is a byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
           followed by the two characters "8" and "1"
983    
984     Note that octal values of 100 or greater must not be intro-
985     duced by a leading zero, because no more than three octal
986     digits are ever read.
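
The decision between a back reference and an octal byte can be
sketched as follows. This is a simplified reading of the rules
above, not PCRE's parser. The type sketch_digit_escape and the
function sketch_classify_digits() are invented names, and the
caller is assumed to supply the number of capturing left
parentheses seen so far.

```c
#include <ctype.h>

typedef struct {
    int is_backref;  /* 1: back reference, 0: single byte value */
    int value;       /* back reference number, or byte value 0-255 */
    int consumed;    /* digits after the backslash that were used */
} sketch_digit_escape;

/* digits points at the first digit after the backslash; in_class is
   non-zero inside a character class; captures is the number of
   capturing left parentheses seen so far in the pattern. */
sketch_digit_escape sketch_classify_digits(const char *digits, int in_class,
                                           int captures)
{
    sketch_digit_escape r = {0, 0, 0};
    int n = 0, value = 0;

    if (!in_class && digits[0] != '0') {
        /* Read all following digits as a decimal number. */
        int number = 0;
        while (isdigit((unsigned char)digits[n])) {
            if (number < 1000)             /* avoid overflow on long runs */
                number = number * 10 + (digits[n] - '0');
            n++;
        }
        if (number < 10 || number <= captures) {
            r.is_backref = 1;
            r.value = number;
            r.consumed = n;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits, keeping the low 8 bits.
       Any digits left over stand for themselves. */
    n = 0;
    while (n < 3 && digits[n] >= '0' && digits[n] <= '7') {
        value = value * 8 + (digits[n] - '0');
        n++;
    }
    r.value = value & 0xFF;
    r.consumed = n;
    return r;
}
```

For "\81" with fewer than 81 capturing subpatterns, no octal
digits are consumed, so the result is a binary zero and the
characters "8" and "1" remain to be read as literals.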
987 nigel 43
988 nigel 41 All the sequences that define a single byte value can be
989     used both inside and outside character classes. In addition,
990     inside a character class, the sequence "\b" is interpreted
991     as the backspace character (hex 08). Outside a character
992     class it has a different meaning (see below).
993    
994     The third use of backslash is for specifying generic charac-
995     ter types:
996    
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
1003    
1004     Each pair of escape sequences partitions the complete set of
1005     characters into two disjoint sets. Any given character
1006     matches one, and only one, of each pair.
1007    
1008     A "word" character is any letter or digit or the underscore
1009     character, that is, any character which can be part of a
1010     Perl "word". The definition of letters and digits is con-
1011     trolled by PCRE's character tables, and may vary if locale-
1012     specific matching is taking place (see "Locale support"
1013     above). For example, in the "fr" (French) locale, some char-
1014     acter codes greater than 128 are used for accented letters,
1015     and these are matched by \w.
1016    
1017     These character type sequences can appear both inside and
1018     outside character classes. They each match one character of
1019     the appropriate type. If the current matching point is at
1020     the end of the subject string, all of them fail, since there
1021     is no character to match.
1022    
1023     The fourth use of backslash is for certain simple asser-
1024     tions. An assertion specifies a condition that has to be met
1025     at a particular point in a match, without consuming any
1026     characters from the subject string. The use of subpatterns
1027     for more complicated assertions is described below. The
1028     backslashed assertions are
1029    
  \b     word boundary
  \B     not a word boundary
  \A     start of subject (independent of multiline mode)
  \Z     end of subject or newline at end (independent of
           multiline mode)
  \z     end of subject (independent of multiline mode)
1036    
1037     These assertions may not appear in character classes (but
1038     note that "\b" has a different meaning, namely the backspace
1039     character, inside a character class).
1040 nigel 43
1041 nigel 41 A word boundary is a position in the subject string where
1042     the current character and the previous character do not both
1043     match \w or \W (i.e. one matches \w and the other matches
1044     \W), or the start or end of the string if the first or last
1045     character matches \w, respectively.
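
The word boundary test can be expressed compactly, as in the
sketch below. It assumes that isalnum() in the current C locale
is an adequate stand-in for PCRE's character tables. The names
sketch_is_word() and sketch_is_word_boundary() are invented for
this example.

```c
#include <ctype.h>

/* A "word" character: a letter, a digit, or the underscore. */
int sketch_is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..length) in subject is a word boundary.
   Positions before the start and after the end count as non-word. */
int sketch_is_word_boundary(const char *subject, int length, int pos)
{
    int prev = pos > 0 && sketch_is_word((unsigned char)subject[pos - 1]);
    int cur  = pos < length && sketch_is_word((unsigned char)subject[pos]);
    return prev != cur;
}
```

Treating the positions outside the subject as non-word
characters yields the start and end rules given above without
any special cases.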
1046    
1047     The \A, \Z, and \z assertions differ from the traditional
1048     circumflex and dollar (described below) in that they only
1049     ever match at the very start and end of the subject string,
1050     whatever options are set. They are not affected by the
1051     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
1052     ment of pcre_exec() is non-zero, \A can never match. The
1053     difference between \Z and \z is that \Z matches before a
1054     newline that is the last character of the string as well as
1055     at the end of the string, whereas \z matches only at the
1056     end.
1057    
1058    
1059    
1060     CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string. If the startoffset argument of pcre_exec() is
non-zero, circumflex can never match. Inside a character
class, circumflex has an entirely different meaning (see
below).
1068    
1069     Circumflex need not be the first character of the pattern if
1070     a number of alternatives are involved, but it should be the
1071     first thing in each alternative in which it appears if the
1072     pattern is ever to match that branch. If all possible alter-
1073     natives start with a circumflex, that is, if the pattern is
1074     constrained to match only at the start of the subject, it is
1075     said to be an "anchored" pattern. (There are also other con-
1076     structs that can cause a pattern to be anchored.)
1077    
1078     A dollar character is an assertion which is true only if the
1079     current matching point is at the end of the subject string,
1080     or immediately before a newline character that is the last
1081     character in the string (by default). Dollar need not be the
1082     last character of the pattern if a number of alternatives
1083     are involved, but it should be the last item in any branch
1084     in which it appears. Dollar has no special meaning in a
1085     character class.
1086    
1087     The meaning of dollar can be changed so that it matches only
1088     at the very end of the string, by setting the
1089     PCRE_DOLLAR_ENDONLY option at compile or matching time. This
1090     does not affect the \Z assertion.
1091    
1092     The meanings of the circumflex and dollar characters are
1093     changed if the PCRE_MULTILINE option is set. When this is
1094     the case, they match immediately after and immediately
1095     before an internal "\n" character, respectively, in addition
1096     to matching at the start and end of the subject string. For
1097     example, the pattern /^abc$/ matches the subject string
1098     "def\nabc" in multiline mode, but not otherwise. Conse-
1099     quently, patterns that are anchored in single line mode
1100     because all branches start with "^" are not anchored in mul-
1101     tiline mode, and a match for circumflex is possible when the
1102     startoffset argument of pcre_exec() is non-zero. The
1103     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1104     set.
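
The rules for circumflex and dollar can be summarized by the
sketch below. It is an illustration under simplifying
assumptions: the CD_ flag values are invented and are not
PCRE's option bits, and PCRE_NOTBOL, PCRE_NOTEOL and the
startoffset argument are ignored. The text above does not spell
out circumflex behaviour after a final newline; the sketch
treats that newline like any other.

```c
/* Option flags for this sketch only; NOT PCRE's real values. */
#define CD_MULTILINE      0x01
#define CD_DOLLAR_ENDONLY 0x02

/* Returns 1 if "^" is true at position pos of subject. */
int sketch_circumflex(const char *subject, int pos, int options)
{
    if (pos == 0)
        return 1;                          /* start of subject */
    return (options & CD_MULTILINE) && subject[pos - 1] == '\n';
}

/* Returns 1 if "$" is true at position pos of subject (0..length). */
int sketch_dollar(const char *subject, int length, int pos, int options)
{
    if (pos == length)
        return 1;                          /* very end of subject */
    if (subject[pos] != '\n')
        return 0;
    if (options & CD_MULTILINE)
        return 1;                          /* before any newline */
    if (options & CD_DOLLAR_ENDONLY)
        return 0;                          /* only the very end counts */
    return pos == length - 1;              /* before a final newline */
}
```

Note that the multiline test comes before the
CD_DOLLAR_ENDONLY test, which is how the sketch models that
option being ignored in multiline mode.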
1105    
Note that the sequences \A, \Z, and \z can be used to match
the start and end of the subject in both modes, and if all
branches of a pattern start with \A it is always anchored,
whether PCRE_MULTILINE is set or not.
1110    
1111    
1112    
1113     FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing
character, but not (by default) newline. If the PCRE_DOTALL
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a
character class.
1123 nigel 41
1124    
1125    
1126     SQUARE BRACKETS
1127     An opening square bracket introduces a character class, ter-
1128     minated by a closing square bracket. A closing square
1129     bracket on its own is not special. If a closing square
1130     bracket is required as a member of the class, it should be
1131     the first data character in the class (after an initial cir-
1132     cumflex, if present) or escaped with a backslash.
1133    
1134     A character class matches a single character in the subject;
1135     the character must be in the set of characters defined by
1136     the class, unless the first character in the class is a cir-
1137     cumflex, in which case the subject character must not be in
1138     the set defined by the class. If a circumflex is actually
1139     required as a member of the class, ensure it is not the
1140     first character, or escape it with a backslash.
1141    
1142     For example, the character class [aeiou] matches any lower
1143     case vowel, while [^aeiou] matches any character that is not
1144     a lower case vowel. Note that a circumflex is just a con-
1145     venient notation for specifying the characters which are in
1146     the class by enumerating those that are not. It is not an
1147     assertion: it still consumes a character from the subject
1148     string, and fails if the current pointer is at the end of
1149     the string.
1150    
1151     When caseless matching is set, any letters in a class
1152     represent both their upper case and lower case versions, so
1153     for example, a caseless [aeiou] matches "A" as well as "a",
1154     and a caseless [^aeiou] does not match "A", whereas a case-
1155     ful version would.
1156    
1157     The newline character is never treated in any special way in
1158     character classes, whatever the setting of the PCRE_DOTALL
1159     or PCRE_MULTILINE options is. A class such as [^a] will
1160     always match a newline.
1161    
1162     The minus (hyphen) character can be used to specify a range
1163     of characters in a character class. For example, [d-m]
1164     matches any letter between d and m, inclusive. If a minus
1165     character is required in a class, it must be escaped with a
1166     backslash or appear in a position where it cannot be inter-
1167     preted as indicating a range, typically as the first or last
1168     character in the class.
1169    
1170     It is not possible to have the literal character "]" as the
1171     end character of a range. A pattern such as [W-]46] is
1172     interpreted as a class of two characters ("W" and "-") fol-
1173     lowed by a literal string "46]", so it would match "W46]" or
1174     "-46]". However, if the "]" is escaped with a backslash it
1175     is interpreted as the end of range, so [W-\]46] is inter-
1176     preted as a single class containing a range followed by two
1177     separate characters. The octal or hexadecimal representation
1178     of "]" can also be used to end a range.
1179    
Ranges operate in ASCII collating sequence. They can also be
used for characters specified numerically, for example
[\000-\037]. If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case. For example, [W-c] is equivalent to [][\\^_`wxyzabc],
matched caselessly, and if character tables for the "fr"
locale are in use, [\xc8-\xcb] matches accented E characters
in both cases.
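
One way to picture caseless range matching is the sketch below.
It assumes the default "C" locale and uses the hypothetical
name sketch_in_range(); PCRE itself works from its character
tables rather than calling the ctype functions during a match.

```c
#include <ctype.h>

/* Returns 1 if byte c falls in the class range lo-hi. With caseless
   set, c also matches if its other-case form falls in the range. */
int sketch_in_range(unsigned char c, unsigned char lo, unsigned char hi,
                    int caseless)
{
    if (c >= lo && c <= hi)
        return 1;
    if (caseless) {
        unsigned char other = isupper(c) ? (unsigned char)tolower(c)
                                         : (unsigned char)toupper(c);
        return other >= lo && other <= hi;
    }
    return 0;
}
```

With the range W-c, a caseless test accepts "w" and "A" (via
"W" and "a") but still rejects "d", whose upper case form lies
before "W".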
1188    
1189     The character types \d, \D, \s, \S, \w, and \W may also
1190     appear in a character class, and add the characters that
1191     they match to the class. For example, [\dABCDEF] matches any
1192     hexadecimal digit. A circumflex can conveniently be used
1193     with the upper case character types to specify a more res-
1194     tricted set of characters than the matching lower case type.
1195     For example, the class [^\W_] matches any letter or digit,
1196     but not underscore.
1197    
1198     All non-alphameric characters other than \, -, ^ (at the
1199     start) and the terminating ] are non-special in character
1200     classes, but it does no harm if they are escaped.
1201    
1202    
1203    
1204 nigel 43 POSIX CHARACTER CLASSES
1205     Perl 5.6 (not yet released at the time of writing) is going
1206     to support the POSIX notation for character classes, which
1207     uses names enclosed by [: and :] within the enclosing
1208     square brackets. PCRE supports this notation. For example,
1209    
1210     [01[:alpha:]%]
1211    
1212     matches "0", "1", any alphabetic character, or "%". The sup-
1213     ported class names are
1214    
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
1228    
1229     The names "ascii" and "word" are Perl extensions. Another
1230     Perl extension is negation, which is indicated by a ^ char-
1231     acter after the colon. For example,
1232    
1233     [12[:^digit:]]
1234    
matches "1", "2", or any non-digit. PCRE (and Perl) also
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
"collating element", but these are not supported, and an
error is given if they are encountered.
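
Most of the class names correspond directly to the standard C
ctype functions, as the sketch below suggests. The mapping is
an assumption made for illustration, since PCRE consults its
own character tables. The function name sketch_posix_class() is
invented, and "ascii" and "word" are handled separately because
they have no ctype equivalent.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 or 0 according to whether byte c belongs to the named
   class, or -1 if the name is not one of those listed above. A
   leading '^' in the name negates the result, as in [:^digit:]. */
int sketch_posix_class(const char *name, unsigned char c)
{
    int negate = (name[0] == '^');
    int in;

    if (negate)
        name++;

    if      (strcmp(name, "alnum")  == 0) in = isalnum(c);
    else if (strcmp(name, "alpha")  == 0) in = isalpha(c);
    else if (strcmp(name, "ascii")  == 0) in = (c <= 127);
    else if (strcmp(name, "cntrl")  == 0) in = iscntrl(c);
    else if (strcmp(name, "digit")  == 0) in = isdigit(c);
    else if (strcmp(name, "graph")  == 0) in = isgraph(c);
    else if (strcmp(name, "lower")  == 0) in = islower(c);
    else if (strcmp(name, "print")  == 0) in = isprint(c);
    else if (strcmp(name, "punct")  == 0) in = ispunct(c);
    else if (strcmp(name, "space")  == 0) in = isspace(c);
    else if (strcmp(name, "upper")  == 0) in = isupper(c);
    else if (strcmp(name, "word")   == 0) in = isalnum(c) || c == '_';
    else if (strcmp(name, "xdigit") == 0) in = isxdigit(c);
    else return -1;

    in = (in != 0);                        /* ctype results are non-zero */
    return negate ? !in : in;
}
```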
1239    
1240    
1241    
1242 nigel 41 VERTICAL BAR
1243     Vertical bar characters are used to separate alternative
1244     patterns. For example, the pattern
1245    
1246     gilbert|sullivan
1247    
1248     matches either "gilbert" or "sullivan". Any number of alter-
1249     natives may appear, and an empty alternative is permitted
1250     (matching the empty string). The matching process tries
1251     each alternative in turn, from left to right, and the first
1252     one that succeeds is used. If the alternatives are within a
1253     subpattern (defined below), "succeeds" means matching the
1254     rest of the main pattern as well as the alternative in the
1255     subpattern.
1256    
1257    
1258    
1259     INTERNAL OPTION SETTING
1260     The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1261     and PCRE_EXTENDED can be changed from within the pattern by
1262     a sequence of Perl option letters enclosed between "(?" and
1263     ")". The option letters are
1264    
  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED
1269    
1270     For example, (?im) sets caseless, multiline matching. It is
1271     also possible to unset these options by preceding the letter
1272     with a hyphen, and a combined setting and unsetting such as
1273     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1274     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1275     If a letter appears both before and after the hyphen, the
1276     option is unset.
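
The set-and-unset rule can be sketched as below. The OPT_ flag
values and the function sketch_apply_option_letters() are
invented for this example and do not correspond to PCRE's real
option bits. Only the four Perl-compatible letters are handled.

```c
/* Flag values for this sketch only; NOT PCRE's real option bits. */
#define OPT_I 0x01  /* i: caseless  */
#define OPT_M 0x02  /* m: multiline */
#define OPT_S 0x04  /* s: dotall    */
#define OPT_X 0x08  /* x: extended  */

/* Applies the letters found between "(?" and ")" to the option set
   current and returns the new set, or -1 on an unknown letter. */
int sketch_apply_option_letters(const char *letters, int current)
{
    int set = 0, unset = 0;
    int *target = &set;
    const char *p;

    for (p = letters; *p != '\0'; p++) {
        switch (*p) {
        case '-': target = &unset; break;  /* later letters are unset */
        case 'i': *target |= OPT_I; break;
        case 'm': *target |= OPT_M; break;
        case 's': *target |= OPT_S; break;
        case 'x': *target |= OPT_X; break;
        default:  return -1;
        }
    }
    /* Unsetting is applied last, so a letter on both sides ends up unset. */
    return (current | set) & ~unset;
}
```

Given "im-sx" and a starting set with dotall and extended on,
the result has only caseless and multiline set.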
1277    
1278     The scope of these option changes depends on where in the
1279     pattern the setting occurs. For settings that are outside
1280     any subpattern (defined below), the effect is the same as if
1281     the options were set or unset at the start of matching. The
1282     following patterns all behave in exactly the same way:
1283    
1284     (?i)abc
1285     a(?i)bc
1286     ab(?i)c
1287     abc(?i)
1288    
1289     which in turn is the same as compiling the pattern abc with
1290     PCRE_CASELESS set. In other words, such "top level" set-
1291     tings apply to the whole pattern (unless there are other
1292     changes inside subpatterns). If there is more than one set-
1293     ting of the same option at top level, the rightmost setting
1294     is used.
1295    
1296     If an option change occurs inside a subpattern, the effect
1297     is different. This is a change of behaviour in Perl 5.005.
1298     An option change inside a subpattern affects only that part
1299     of the subpattern that follows it, so
1300    
1301     (a(?i)b)c
1302    
1303     matches abc and aBc and no other strings (assuming
1304     PCRE_CASELESS is not used). By this means, options can be
1305     made to have different settings in different parts of the
1306     pattern. Any changes made in one alternative do carry on
1307     into subsequent branches within the same subpattern. For
1308     example,
1309    
1310     (a(?i)b|c)
1311    
1312     matches "ab", "aB", "c", and "C", even though when matching
1313     "C" the first branch is abandoned before the option setting.
1314     This is because the effects of option settings happen at
1315     compile time. There would be some very weird behaviour oth-
1316     erwise.
1317    
1318     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1319     be changed in the same way as the Perl-compatible options by
1320     using the characters U and X respectively. The (?X) flag
1321     setting is special in that it must always occur earlier in
1322     the pattern than any of the additional features it turns on,
1323     even when it is at top level. It is best put at the start.
1324    
1325    
1326    
1327     SUBPATTERNS
1328     Subpatterns are delimited by parentheses (round brackets),
1329     which can be nested. Marking part of a pattern as a subpat-
1330     tern does two things:
1331    
1332     1. It localizes a set of alternatives. For example, the pat-
1333     tern
1334    
1335     cat(aract|erpillar|)
1336    
1337     matches one of the words "cat", "cataract", or "caterpil-
1338     lar". Without the parentheses, it would match "cataract",
1339     "erpillar" or the empty string.
1340    
1341     2. It sets up the subpattern as a capturing subpattern (as
1342     defined above). When the whole pattern matches, that por-
1343     tion of the subject string that matched the subpattern is
1344     passed back to the caller via the ovector argument of
1345     pcre_exec(). Opening parentheses are counted from left to
1346     right (starting from 1) to obtain the numbers of the captur-
1347     ing subpatterns.
1348    
1349     For example, if the string "the red king" is matched against
1350     the pattern
1351    
1352     the ((red|white) (king|queen))
1353    
1354     the captured substrings are "red king", "red", and "king",
1355     and are numbered 1, 2, and 3.
1356    
1357     The fact that plain parentheses fulfil two functions is not
1358     always helpful. There are often times when a grouping sub-
1359     pattern is required without a capturing requirement. If an
1360     opening parenthesis is followed by "?:", the subpattern does
1361     not do any capturing, and is not counted when computing the
1362     number of any subsequent capturing subpatterns. For example,
1363     if the string "the white queen" is matched against the pat-
1364     tern
1365    
1366     the ((?:red|white) (king|queen))
1367    
1368     the captured substrings are "white queen" and "queen", and
1369     are numbered 1 and 2. The maximum number of captured sub-
1370     strings is 99, and the maximum number of all subpatterns,
1371     both capturing and non-capturing, is 200.
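
The numbering rules above can be checked with a short script. This is a minimal sketch that uses Python's re module instead of PCRE itself (an assumption made only so that the example is self-contained); for these two patterns re counts opening parentheses and skips (?: groups in the same way.

```python
import re

# Plain parentheses both group and capture; captures are numbered by
# counting opening parentheses from left to right, starting at 1.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
captured_nc = non_capturing.groups()

print(captured)     # ('red king', 'red', 'king')
print(captured_nc)  # ('white queen', 'queen')
```

With pcre_exec() the same substrings would be reported as offset pairs in ovector instead of as strings.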
1372    
1373     As a convenient shorthand, if any option settings are
1374     required at the start of a non-capturing subpattern, the
1375     option letters may appear between the "?" and the ":". Thus
1376     the two patterns
1377    
1378     (?i:saturday|sunday)
1379     (?:(?i)saturday|sunday)
1380    
1381     match exactly the same set of strings. Because alternative
1382     branches are tried from left to right, and options are not
1383     reset until the end of the subpattern is reached, an option
1384     setting in one branch does affect subsequent branches, so
1385     the above patterns match "SUNDAY" as well as "Saturday".
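
The effect of an option carried into a later branch can be sketched as follows. The example assumes Python 3.6 or later, whose re module accepts the (?i: shorthand; it is a stand-in for PCRE, and only the first of the two patterns above is tried because recent Python versions reject an option setting in the middle of a pattern.

```python
import re

# (?i: ... ) switches on caseless matching for the whole group, so the
# setting still applies when the second alternative is tried.
days = re.compile(r"(?i:saturday|sunday)")

results = {word: days.fullmatch(word) is not None
           for word in ("Saturday", "SUNDAY", "sunday", "monday")}
print(results)
```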
1386    
1387    
1388    
1389     REPETITION
1390     Repetition is specified by quantifiers, which can follow any
1391     of the following items:
1392    
1393     a single character, possibly escaped
1394     the . metacharacter
1395     a character class
1396     a back reference (see next section)
1397     a parenthesized subpattern (unless it is an assertion -
1398     see below)
1399    
1400     The general repetition quantifier specifies a minimum and
1401     maximum number of permitted matches, by giving the two
1402     numbers in curly brackets (braces), separated by a comma.
1403     The numbers must be less than 65536, and the first must be
1404     less than or equal to the second. For example:
1405    
1406     z{2,4}
1407    
1408     matches "zz", "zzz", or "zzzz". A closing brace on its own
1409     is not a special character. If the second number is omitted,
1410     but the comma is present, there is no upper limit; if the
1411     second number and the comma are both omitted, the quantifier
1412     specifies an exact number of required matches. Thus
1413    
1414     [aeiou]{3,}
1415    
1416     matches at least 3 successive vowels, but may match many
1417     more, while
1418    
1419     \d{8}
1420    
1421     matches exactly 8 digits. An opening curly bracket that
1422     appears in a position where a quantifier is not allowed, or
1423     one that does not match the syntax of a quantifier, is taken
1424     as a literal character. For example, {,6} is not a quantif-
1425     ier, but a literal string of four characters.
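
The three forms of the general quantifier can be exercised as below. This is a sketch using Python's re module as a stand-in for PCRE; the helper name full is made up for the example. The {,6} case is left out on purpose, because Python's re treats it as a quantifier and so differs from the behaviour described above.

```python
import re

def full(pattern, text):
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# {2,4}: between two and four repetitions.
z_results = [full(r"z{2,4}", "z" * n) for n in range(1, 6)]

# {3,}: three or more repetitions, with no upper limit.
vowel_results = [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# {8}: exactly eight repetitions.
digit_results = [full(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

print(z_results)      # [False, True, True, True, False]
print(vowel_results)  # [False, True, True]
print(digit_results)  # [False, True, False]
```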
1426    
1427     The quantifier {0} is permitted, causing the expression to
1428     behave as if the previous item and the quantifier were not
1429     present.
1430    
1431     For convenience (and historical compatibility) the three
1432     most common quantifiers have single-character abbreviations:
1433    
1434     * is equivalent to {0,}
1435     + is equivalent to {1,}
1436     ? is equivalent to {0,1}
1437    
1438     It is possible to construct infinite loops by following a
1439     subpattern that can match no characters with a quantifier
1440     that has no upper limit, for example:
1441    
1442     (a?)*
1443    
1444     Earlier versions of Perl and PCRE used to give an error at
1445     compile time for such patterns. However, because there are
1446     cases where this can be useful, such patterns are now
1447     accepted, but if any repetition of the subpattern does in
1448     fact match no characters, the loop is forcibly broken.
1449    
1450     By default, the quantifiers are "greedy", that is, they
1451     match as much as possible (up to the maximum number of per-
1452     mitted times), without causing the rest of the pattern to
1453     fail. The classic example of where this gives problems is in
1454     trying to match comments in C programs. These appear between
1455     the sequences /* and */ and within the sequence, individual
1456     * and / characters may appear. An attempt to match C com-
1457     ments by applying the pattern
1458    
1459     /\*.*\*/
1460    
1461     to the string
1462    
1463     /* first comment */ not comment /* second comment */
1464    
1465     fails, because it matches the entire string due to the
1466     greediness of the .* item.
1467    
1468     However, if a quantifier is followed by a question mark, it
1469     ceases to be greedy, and instead matches the minimum number
1470     of times possible, so the pattern
1471    
1472     /\*.*?\*/
1473    
1474     does the right thing with the C comments. The meaning of the
1475     various quantifiers is not otherwise changed, just the pre-
1476     ferred number of matches. Do not confuse this use of ques-
1477     tion mark with its use as a quantifier in its own right.
1478     Because it has two uses, it can sometimes appear doubled, as
1479     in
1480    
1481     \d??\d
1482    
1483     which matches one digit by preference, but can match two if
1484     that is the only way the rest of the pattern matches.
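
The difference between the greedy and minimizing forms can be seen in a few lines. This sketch uses Python's re module as a stand-in for PCRE (an assumption of the example); both engines follow the Perl rules for ? after a quantifier.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end of the line and backs off only as far as
# the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first */ it can reach.
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d: the first digit is optional and minimizing, so one digit is
# preferred, but two are taken when the whole subject has to match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject)      # True
print(lazy)                   # /* first comment */
print(one_digit, two_digits)  # 1 12
```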
1485    
1486     If the PCRE_UNGREEDY option is set (an option which is not
1487     available in Perl), the quantifiers are not greedy by
1488     default, but individual ones can be made greedy by following
1489     them with a question mark. In other words, it inverts the
1490     default behaviour.
1491    
1492     When a parenthesized subpattern is quantified with a minimum
1493     repeat count that is greater than 1 or with a limited max-
1494     imum, more store is required for the compiled pattern, in
1495     proportion to the size of the minimum or maximum.
1496    
1497     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1498     option (equivalent to Perl's /s) is set, thus allowing the .
1499     to match newlines, the pattern is implicitly anchored,
1500     because whatever follows will be tried against every charac-
1501     ter position in the subject string, so there is no point in
1502     retrying the overall match at any position after the first.
1503     PCRE treats such a pattern as though it were preceded by \A.
1504     In cases where it is known that the subject string contains
1505     no newlines, it is worth setting PCRE_DOTALL when the pat-
1506     tern begins with .* in order to obtain this optimization, or
1507     alternatively using ^ to indicate anchoring explicitly.
1508    
1509     When a capturing subpattern is repeated, the value captured
1510     is the substring that matched the final iteration. For exam-
1511     ple, after
1512    
1513     (tweedle[dume]{3}\s*)+
1514    
1515     has matched "tweedledum tweedledee" the value of the cap-
1516     tured substring is "tweedledee". However, if there are
1517     nested capturing subpatterns, the corresponding captured
1518     values may have been set in previous iterations. For exam-
1519     ple, after
1520    
1521     /(a|(b))+/
1522    
1523     matches "aba" the value of the second captured substring is
1524     "b".
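
Both cases can be reproduced with a short script. This is a sketch using Python's re module as a stand-in for PCRE (an assumption of the example); like PCRE, it keeps the value of a nested group from an earlier iteration when a later iteration does not set it.

```python
import re

# A repeated capturing group reports the substring from its final iteration.
last = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = last.group(1)

# The inner group is set in the second iteration ("b") and is left alone
# by the third iteration, which takes the "a" branch.
nested = re.match(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration)  # tweedledee
print(outer, inner)    # a b
```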
1525    
1526    
1527    
1528     BACK REFERENCES
1529     Outside a character class, a backslash followed by a digit
1530     greater than 0 (and possibly further digits) is a back
1531     reference to a capturing subpattern earlier (i.e. to its
1532     left) in the pattern, provided there have been that many
1533     previous capturing left parentheses.
1534    
1535     However, if the decimal number following the backslash is
1536     less than 10, it is always taken as a back reference, and
1537     causes an error only if there are not that many capturing
1538     left parentheses in the entire pattern. In other words, the
1539     parentheses that are referenced need not be to the left of
1540     the reference for numbers less than 10. See the section
1541     entitled "Backslash" above for further details of the han-
1542     dling of digits following a backslash.
1543    
1544     A back reference matches whatever actually matched the cap-
1545     turing subpattern in the current subject string, rather than
1546     anything matching the subpattern itself. So the pattern
1547    
1548     (sens|respons)e and \1ibility
1549    
1550     matches "sense and sensibility" and "response and responsi-
1551     bility", but not "sense and responsibility". If caseful
1552     matching is in force at the time of the back reference, the
1553     case of letters is relevant. For example,
1554    
1555     ((?i)rah)\s+\1
1556    
1557     matches "rah rah" and "RAH RAH", but not "RAH rah", even
1558     though the original capturing subpattern is matched case-
1559     lessly.
1560    
1561     There may be more than one back reference to the same sub-
1562     pattern. If a subpattern has not actually been used in a
1563     particular match, any back references to it always fail. For
1564     example, the pattern
1565    
1566     (a|(bc))\2
1567    
1568     always fails if it starts to match "a" rather than "bc".
1569     Because there may be up to 99 back references, all digits
1570     following the backslash are taken as part of a potential
1571     back reference number. If the pattern continues with a digit
1572     character, some delimiter must be used to terminate the back
1573     reference. If the PCRE_EXTENDED option is set, this can be
1574     whitespace. Otherwise an empty comment can be used.
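
The two rules above, that a back reference matches the text actually captured and that a reference to an unused subpattern fails, can be sketched like this. Python's re module is used as a stand-in for PCRE (an assumption of the example); it behaves the same way for these patterns.

```python
import re

# \1 must repeat whatever the first group actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
sentences = ["sense and sensibility",
             "response and responsibility",
             "sense and responsibility"]
pair_results = [pair.fullmatch(s) is not None for s in sentences]

# Group 2 is set only when the "bc" alternative is taken, so \2 cannot
# match after the "a" alternative.
unset = re.compile(r"(a|(bc))\2")
unset_results = [unset.fullmatch(s) is not None for s in ("bcbc", "aa", "abc")]

print(pair_results)   # [True, True, False]
print(unset_results)  # [True, False, False]
```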
1575    
1576     A back reference that occurs inside the parentheses to which
1577     it refers fails when the subpattern is first used, so, for
1578     example, (a\1) never matches. However, such references can
1579     be useful inside repeated subpatterns. For example, the pat-
1580     tern
1581    
1582     (a|b\1)+
1583    
1584     matches any number of "a"s and also "aba", "ababbaa" etc. At
1585     each iteration of the subpattern, the back reference matches
1586     the character string corresponding to the previous
1587     iteration. In order for this to work, the pattern must be
1588     such that the first iteration does not need to match the
1589     back reference. This can be done using alternation, as in
1590     the example above, or by a quantifier with a minimum of
1591     zero.
1592    
1593    
1594    
1595     ASSERTIONS
1596     An assertion is a test on the characters following or
1597     preceding the current matching point that does not actually
1598     consume any characters. The simple assertions coded as \b,
1599     \B, \A, \Z, \z, ^ and $ are described above. More compli-
1600     cated assertions are coded as subpatterns. There are two
1601     kinds: those that look ahead of the current position in the
1602     subject string, and those that look behind it.
1603    
1604     An assertion subpattern is matched in the normal way, except
1605     that it does not cause the current matching position to be
1606     changed. Lookahead assertions start with (?= for positive
1607     assertions and (?! for negative assertions. For example,
1608    
1609     \w+(?=;)
1610    
1611     matches a word followed by a semicolon, but does not include
1612     the semicolon in the match, and
1613    
1614     foo(?!bar)
1615    
1616     matches any occurrence of "foo" that is not followed by
1617     "bar". Note that the apparently similar pattern
1618    
1619     (?!foo)bar
1620    
1621     does not find an occurrence of "bar" that is preceded by
1622     something other than "foo"; it finds any occurrence of "bar"
1623     whatsoever, because the assertion (?!foo) is always true
1624     when the next three characters are "bar". A lookbehind
1625     assertion is needed to achieve this effect.
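
The three lookahead patterns above can be tried out directly. This sketch uses Python's re module as a stand-in for PCRE (an assumption of the example); the subject strings are made up for illustration.

```python
import re

# \w+(?=;) : a word followed by a semicolon; the semicolon is tested
# but is not part of the match.
word = re.search(r"\w+(?=;)", "return value;").group()

# foo(?!bar) : only the "foo" that is not followed by "bar" is found.
foo_hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar : the assertion inspects the same three characters that must
# then match "bar", so it can never reject them.
bar_after_foo = re.search(r"(?!foo)bar", "foobar") is not None

print(word)           # value
print(foo_hits)       # [7]
print(bar_after_foo)  # True
```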
1626    
1627     Lookbehind assertions start with (?<= for positive asser-
1628     tions and (?<! for negative assertions. For example,
1629    
1630     (?<!foo)bar
1631    
1632     does find an occurrence of "bar" that is not preceded by
1633     "foo". The contents of a lookbehind assertion are restricted
1634     such that all the strings it matches must have a fixed
1635     length. However, if there are several alternatives, they do
1636     not all have to have the same fixed length. Thus
1637    
1638     (?<=bullock|donkey)
1639    
1640     is permitted, but
1641    
1642     (?<!dogs?|cats?)
1643    
1644     causes an error at compile time. Branches that match dif-
1645     ferent length strings are permitted only at the top level of
1646     a lookbehind assertion. This is an extension compared with
1647     Perl 5.005, which requires all branches to match the same
1648     length of string. An assertion such as
1649    
1650     (?<=ab(c|de))
1651    
1652     is not permitted, because its single top-level branch can
1653     match two different lengths, but it is acceptable if rewrit-
1654     ten to use two top-level branches:
1655    
1656     (?<=abc|abde)
1657    
1658     The implementation of lookbehind assertions is, for each
1659     alternative, to temporarily move the current position back
1660     by the fixed width and then try to match. If there are
1661     insufficient characters before the current position, the
1662     match is deemed to fail. Lookbehinds in conjunction with
1663     once-only subpatterns can be particularly useful for match-
1664     ing at the ends of strings; an example is given at the end
1665     of the section on once-only subpatterns.
1666    
1667     Several assertions (of any sort) may occur in succession.
1668     For example,
1669    
1670     (?<=\d{3})(?<!999)foo
1671    
1672     matches "foo" preceded by three digits that are not "999".
1673     Notice that each of the assertions is applied independently
1674     at the same point in the subject string. First there is a
1675     check that the previous three characters are all digits, and
1676     then there is a check that the same three characters are not
1677     "999". This pattern does not match "foo" preceded by six
1678     characters, the first three of which are digits and the last
1679     three of which are not "999". For example, it doesn't match
1680     "123abcfoo". A pattern to do that is
1681    
1682     (?<=\d{3}...)(?<!999)foo
1683    
1684     This time the first assertion looks at the preceding six
1685     characters, checking that the first three are digits, and
1686     then the second assertion checks that the preceding three
1687     characters are not "999".
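
The way successive lookbehinds apply at the same point can be checked with the two patterns above. This is a sketch using Python's re module as a stand-in for PCRE (an assumption of the example); only fixed-length lookbehinds are used, which both engines accept.

```python
import re

# Both assertions look at the three characters just before "foo".
single = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion looks back six characters and the second
# assertion looks back three.
shifted = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
single_results = [single.search(s) is not None for s in subjects]
shifted_results = [shifted.search(s) is not None for s in subjects]

# (?<!foo)bar : "bar" is found only when "foo" does not precede it.
plain_bar = [re.search(r"(?<!foo)bar", s) is not None
             for s in ("foobar", "bazbar")]

print(single_results)   # [True, False, False, False]
print(shifted_results)  # [False, False, True, False]
print(plain_bar)        # [False, True]
```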
1688    
1689     Assertions can be nested in any combination. For example,
1690    
1691     (?<=(?<!foo)bar)baz
1692    
1693     matches an occurrence of "baz" that is preceded by "bar"
1694     which in turn is not preceded by "foo", while
1695    
1696     (?<=\d{3}(?!999)...)foo
1697    
1698     is another pattern which matches "foo" preceded by three
1699     digits and any three characters that are not "999".
1700    
1701     Assertion subpatterns are not capturing subpatterns, and may
1702     not be repeated, because it makes no sense to assert the
1703     same thing several times. If any kind of assertion contains
1704     capturing subpatterns within it, these are counted for the
1705     purposes of numbering the capturing subpatterns in the whole
1706     pattern. However, substring capturing is carried out only
1707     for positive assertions, because it does not make sense for
1708     negative assertions.
1709    
1710     Assertions count towards the maximum of 200 parenthesized
1711     subpatterns.
1712    
1713    
1714    
1715     ONCE-ONLY SUBPATTERNS
1716     With both maximizing and minimizing repetition, failure of
1717     what follows normally causes the repeated item to be re-
1718     evaluated to see if a different number of repeats allows the
1719     rest of the pattern to match. Sometimes it is useful to
1720     prevent this, either to change the nature of the match, or
1721     to cause it to fail earlier than it otherwise might, when the
1722     author of the pattern knows there is no point in carrying
1723     on.
1724    
1725     Consider, for example, the pattern \d+foo when applied to
1726     the subject line
1727    
1728     123456bar
1729    
1730     After matching all 6 digits and then failing to match "foo",
1731     the normal action of the matcher is to try again with only 5
1732     digits matching the \d+ item, and then with 4, and so on,
1733     before ultimately failing. Once-only subpatterns provide the
1734     means for specifying that once a portion of the pattern has
1735     matched, it is not to be re-evaluated in this way, so the
1736     matcher would give up immediately on failing to match "foo"
1737     the first time. The notation is another kind of special
1738     parenthesis, starting with (?> as in this example:
1739    
1740     (?>\d+)foo
1741    
1742     This kind of parenthesis "locks up" the part of the pattern
1743     it contains once it has matched, and a failure further into
1744     the pattern is prevented from backtracking into it.
1745     Backtracking past it to previous items, however, works as
1746     normal.
1747    
1748     An alternative description is that a subpattern of this type
1749     matches the string of characters that an identical stan-
1750     dalone pattern would match, if anchored at the current point
1751     in the subject string.
1752    
1753     Once-only subpatterns are not capturing subpatterns. Simple
1754     cases such as the above example can be thought of as a max-
1755     imizing repeat that must swallow everything it can. So,
1756     while both \d+ and \d+? are prepared to adjust the number of
1757     digits they match in order to make the rest of the pattern
1758     match, (?>\d+) can only match an entire sequence of digits.
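
The change in behaviour can be seen without PCRE. The sketch below uses Python's re module and emulates a once-only subpattern with a lookahead followed by a back reference, which is a common substitute in engines that lack (?>...); this emulation is an assumption of the example, not a description of how PCRE implements the feature. A trailing literal 5 is used so that giving back a digit changes the result.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives one
# back so that the literal 5 can match.
backtracking = re.search(r"\d+5", subject)

# Emulated once-only group: the lookahead captures the longest run of
# digits, the back reference consumes exactly that run, and nothing can
# be given back afterwards. A named group is used so that the back
# reference is not followed directly by the digit 5.
once_only = re.search(r"(?=(?P<run>\d+))(?P=run)5", subject)

print(backtracking.group())  # 12345
print(once_only)             # None
```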
1759    
1760     This construction can of course contain arbitrarily compli-
1761     cated subpatterns, and it can be nested.
1762    
1763     Once-only subpatterns can be used in conjunction with look-
1764     behind assertions to specify efficient matching at the end
1765     of the subject string. Consider a simple pattern such as
1766    
1767     abcd$
1768    
1769     when applied to a long string which does not match. Because
1770     matching proceeds from left to right, PCRE will look for
1771     each "a" in the subject and then see if what follows matches
1772     the rest of the pattern. If the pattern is specified as
1773    
1774     ^.*abcd$
1775    
1776     the initial .* matches the entire string at first, but when
1777     this fails (because there is no following "a"), it back-
1778     tracks to match all but the last character, then all but the
1779     last two characters, and so on. Once again the search for
1780     "a" covers the entire string, from right to left, so we are
1781     no better off. However, if the pattern is written as
1782    
1783     ^(?>.*)(?<=abcd)
1784    
1785     there can be no backtracking for the .* item; it can match
1786     only the entire string. The subsequent lookbehind assertion
1787     does a single test on the last four characters. If it fails,
1788     the match fails immediately. For long strings, this approach
1789     makes a significant difference to the processing time.
1790    
1791     When a pattern contains an unlimited repeat inside a subpat-
1792     tern that can itself be repeated an unlimited number of
1793     times, the use of a once-only subpattern is the only way to
1794     avoid some failing matches taking a very long time indeed.
1795     The pattern
1796    
1797     (\D+|<\d+>)*[!?]
1798    
1799     matches an unlimited number of substrings that either con-
1800     sist of non-digits, or digits enclosed in <>, followed by
1801     either ! or ?. When it matches, it runs quickly. However, if
1802     it is applied to
1803    
1804     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1805    
1806     it takes a long time before reporting failure. This is
1807     because the string can be divided between the two repeats in
1808     a large number of ways, and all have to be tried. (The exam-
1809     ple used [!?] rather than a single character at the end,
1810     because both PCRE and Perl have an optimization that allows
1811     for fast failure when a single character is used. They
1812     remember the last single character that is required for a
1813     match, and fail early if it is not present in the string.)
1814     If the pattern is changed to
1815    
1816     ((?>\D+)|<\d+>)*[!?]
1817    
1818     sequences of non-digits cannot be broken, and failure hap-
1819     pens quickly.
1820    
1821    
1822    
1823     CONDITIONAL SUBPATTERNS
1824     It is possible to cause the matching process to obey a sub-
1825     pattern conditionally or to choose between two alternative
1826     subpatterns, depending on the result of an assertion, or
1827     whether a previous capturing subpattern matched or not. The
1828     two possible forms of conditional subpattern are
1829    
1830     (?(condition)yes-pattern)
1831     (?(condition)yes-pattern|no-pattern)
1832    
1833     If the condition is satisfied, the yes-pattern is used; oth-
1834     erwise the no-pattern (if present) is used. If there are
1835     more than two alternatives in the subpattern, a compile-time
1836     error occurs.
1837    
1838     There are two kinds of condition. If the text between the
1839     parentheses consists of a sequence of digits, the condition
1840     is satisfied if the capturing subpattern of that number has
1841     previously matched. Consider the following pattern, which
1842     contains non-significant white space to make it more read-
1843     able (assume the PCRE_EXTENDED option) and to divide it into
1844     three parts for ease of discussion:
1845    
1846     ( \( )? [^()]+ (?(1) \) )
1847    
1848     The first part matches an optional opening parenthesis, and
1849     if that character is present, sets it as the first captured
1850     substring. The second part matches one or more characters
1851     that are not parentheses. The third part is a conditional
1852     subpattern that tests whether the first set of parentheses
1853     matched or not. If they did, that is, if the subject started
1854     with an opening parenthesis, the condition is true, and so
1855     the yes-pattern is executed and a closing parenthesis is
1856     required. Otherwise, since no-pattern is not present, the
1857     subpattern matches nothing. In other words, this pattern
1858     matches a sequence of non-parentheses, optionally enclosed
1859     in parentheses.
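
The behaviour of this conditional pattern can be sketched as follows. Python's re module is used as a stand-in for PCRE, with re.VERBOSE playing the part of PCRE_EXTENDED (both are assumptions of the example); re supports the numbered form of the condition, which is the form used here.

```python
import re

# White space in the pattern is ignored because of re.VERBOSE.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# The closing parenthesis is required exactly when group 1 matched.
cond_results = {s: optional_parens.fullmatch(s) is not None
                for s in ("abc", "(abc)", "(abc", "abc)")}
print(cond_results)
```

Only the first two subjects match in full: an unparenthesized sequence, and one enclosed in a balanced pair.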
1860    
1861     If the condition is not a sequence of digits, it must be an
1862     assertion. This may be a positive or negative lookahead or
1863     lookbehind assertion. Consider this pattern, again contain-
1864     ing non-significant white space, and with the two alterna-
1865     tives on the second line:
1866    
1867     (?(?=[^a-z]*[a-z])
1868     \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1869    
1870     The condition is a positive lookahead assertion that matches
1871     an optional sequence of non-letters followed by a letter. In
1872     other words, it tests for the presence of at least one
1873     letter in the subject. If a letter is found, the subject is
1874     matched against the first alternative; otherwise it is
1875     matched against the second. This pattern matches strings in
1876     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1877     letters and dd are digits.
1878    
1879    
1880    
1881     COMMENTS
1882     The sequence (?# marks the start of a comment which contin-
1883     ues up to the next closing parenthesis. Nested parentheses
1884     are not permitted. The characters that make up a comment
1885     play no part in the pattern matching at all.
1886    
1887     If the PCRE_EXTENDED option is set, an unescaped # character
1888     outside a character class introduces a comment that contin-
1889     ues up to the next newline character in the pattern.
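
Both kinds of comment can be sketched in a few lines. Python's re module is used as a stand-in for PCRE, with re.VERBOSE in place of PCRE_EXTENDED (assumptions of the example); the patterns themselves are made up for illustration.

```python
import re

# (?# ... ) is skipped entirely, up to the next closing parenthesis.
inline = re.compile(r"ab(?# the letter c follows)c")

# In extended mode an unescaped # outside a character class starts a
# comment that runs to the end of the pattern line.
extended = re.compile(r"""
    \d{4}   # year
    [#-]    # inside a class, # is an ordinary character
    \d{2}   # month
""", re.VERBOSE)

inline_ok = inline.fullmatch("abc") is not None
extended_results = [extended.fullmatch(s) is not None
                    for s in ("2007-02", "2007#02", "2007/02")]

print(inline_ok)         # True
print(extended_results)  # [True, True, False]
```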
1890    
1891    
1892    
1893     RECURSIVE PATTERNS
1894     Consider the problem of matching a string in parentheses,
1895     allowing for unlimited nested parentheses. Without the use
1896     of recursion, the best that can be done is to use a pattern
1897     that matches up to some fixed depth of nesting. It is not
1898     possible to handle an arbitrary nesting depth. Perl 5.6 has
1899     provided an experimental facility that allows regular
1900     expressions to recurse (amongst other things). It does this
1901     by interpolating Perl code in the expression at run time,
1902     and the code can refer to the expression itself. A Perl pat-
1903     tern to solve the parentheses problem can be created like
1904     this:
1905    
1906     $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1907    
1908     The (?p{...}) item interpolates Perl code at run time, and
1909     in this case refers recursively to the pattern in which it
1910     appears. Obviously, PCRE cannot support the interpolation of
1911     Perl code. Instead, the special item (?R) is provided for
1912     the specific case of recursion. This PCRE pattern solves the
1913     parentheses problem (assume the PCRE_EXTENDED option is set
1914     so that white space is ignored):
1915    
1916     \( ( (?>[^()]+) | (?R) )* \)
1917    
1918     First it matches an opening parenthesis. Then it matches any
1919     number of substrings which can either be a sequence of non-
1920     parentheses, or a recursive match of the pattern itself
1921     (i.e. a correctly parenthesized substring). Finally there is
1922     a closing parenthesis.
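
The three steps just described can be mirrored by a hand-written recursive function. This is only a sketch of what the pattern does, written in Python without any regex engine; the function name match_balanced is hypothetical, and the match is anchored at the given position instead of being searched for.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    # \( : the group must start with an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \) : closing parenthesis of this level.
            return pos + 1
        if ch == "(":
            # (?R) : a nested group is matched by recursion.
            pos = match_balanced(subject, pos)
            if pos == -1:
                return -1
        else:
            # [^()]+ : non-parentheses are consumed and never re-split.
            pos += 1
    return -1  # ran out of subject before the closing parenthesis

print(match_balanced("(ab(cd)ef)"))  # 10
print(match_balanced("(aaa()"))      # -1
```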
1923    
1924     This particular example pattern contains nested unlimited
1925     repeats, and so the use of a once-only subpattern for match-
1926     ing strings of non-parentheses is important when applying
1927     the pattern to strings that do not match. For example, when
1928     it is applied to
1929    
1930     (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1931    
1932     it yields "no match" quickly. However, if a once-only sub-
1933     pattern is not used, the match runs for a very long time
1934     indeed because there are so many different ways the + and *
1935     repeats can carve up the subject, and all have to be tested
1936     before failure can be reported.
1937    
1938     The values set for any capturing subpatterns are those from
1939     the outermost level of the recursion at which the subpattern
1940     value is set. If the pattern above is matched against
1941    
1942     (ab(cd)ef)
1943    
1944     the value for the capturing parentheses is "ef", which is
1945     the last value taken on at the top level. If additional
1946     parentheses are added, giving
1947    
1948     \( ( ( (?>[^()]+) | (?R) )* ) \)
1949        ^                        ^
1950     the string they capture is
1951     "ab(cd)ef", the contents of the top level parentheses. If
1952     there are more than 15 capturing parentheses in a pattern,
1953     PCRE has to obtain extra memory to store data during a
1954     recursion, which it does by using pcre_malloc, freeing it
1955     via pcre_free afterwards. If no memory can be obtained, it
1956     saves data for the first 15 capturing parentheses only, as
1957     there is no way to give an out-of-memory error from within a
1958     recursion.
1959    
1960    
1961    
1962     PERFORMANCE
1963     Certain items that may appear in patterns are more efficient
1964     than others. It is more efficient to use a character class
1965     like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1966     In general, the simplest construction that provides the
1967     required behaviour is usually the most efficient. Jeffrey
1968     Friedl's book contains a lot of discussion about optimizing
1969     regular expressions for efficient performance.
1970    
1971     When a pattern begins with .* and the PCRE_DOTALL option is
1972     set, the pattern is implicitly anchored by PCRE, since it
1973     can match only at the start of a subject string. However, if
1974     PCRE_DOTALL is not set, PCRE cannot make this optimization,
1975     because the . metacharacter does not then match a newline,
1976     and if the subject string contains newlines, the pattern may
1977     match from the character immediately following one of them
1978     instead of from the very start. For example, the pattern
1979    
1980     (.*) second
1981    
1982     matches the subject "first\nand second" (where \n stands for
1983     a newline character) with the first captured substring being
1984     "and". In order to do this, PCRE has to retry the match
1985     starting after every newline in the subject.
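
The effect of the newline on where the match starts can be sketched as below. Python's re module is used as a stand-in for PCRE, with re.DOTALL in place of PCRE_DOTALL (assumptions of the example).

```python
import re

subject = "first\nand second"

# Without DOTALL, . stops at the newline, so the match can only start
# after it.
no_dotall = re.search(r"(.*) second", subject)

# With DOTALL, . crosses the newline and the match starts at offset 0.
dotall = re.search(r"(.*) second", subject, re.DOTALL)

print(no_dotall.start(), repr(no_dotall.group(1)))  # 6 'and'
print(dotall.start(), repr(dotall.group(1)))        # 0 'first\nand'
```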
1986    
1987     If you are using such a pattern with subject strings that do
1988     not contain newlines, the best performance is obtained by
1989     setting PCRE_DOTALL, or starting the pattern with ^.* to
1990     indicate explicit anchoring. That saves PCRE from having to
1991     scan along the subject looking for a newline to restart at.
1992    
1993     Beware of patterns that contain nested indefinite repeats.
1994     These can take a long time to run when applied to a string
1995     that does not match. Consider the pattern fragment
1996    
1997     (a+)*
1998    
1999     This can match "aaaa" in 16 different ways, and this number
2000     increases very rapidly as the string gets longer. (The *
2001     repeat can match 0, 1, 2, 3, or 4 times, and for each of
2002     those cases other than 0, the + repeats can match different
2003     numbers of times.) When the remainder of the pattern is such
2004     that the entire match is going to fail, PCRE has in
2005     principle to try every possible variation, and this can take
2006     an extremely long time.
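
How quickly the number of variations grows can be estimated by counting them directly. The sketch below, in Python, does not run a regex engine; it counts the ways in which the * and + repeats of (a+)* can share out a run of "a" characters, under the assumption that every choice of iteration count and iteration length is a distinct way. The function names are made up for the example, and a real matcher may prune or reorder these attempts.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def exact_splits(length):
    """Ways for repeated a+ iterations to consume exactly `length` characters."""
    if length == 0:
        return 1  # zero iterations of the group
    # The first iteration takes 1..length characters; the remainder is
    # split up in the same way.
    return sum(exact_splits(length - first) for first in range(1, length + 1))

def ways_to_match(run_length):
    """Ways for (a+)* to match some prefix of a run of `run_length` characters."""
    return sum(exact_splits(prefix) for prefix in range(run_length + 1))

counts = [ways_to_match(n) for n in range(1, 9)]
print(counts)
```

Under this counting the total doubles with every character added to the run, which is why a failing match against a long string becomes so slow.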
2007    
2008     An optimization catches some of the simpler cases such
2009     as
2010    
2011     (a+)*b
2012    
2013     where a literal character follows. Before embarking on the
2014     standard matching procedure, PCRE checks that there is a "b"
2015     later in the subject string, and if there is not, it fails
2016     the match immediately. However, when there is no following
2017     literal this optimization cannot be used. You can see the
2018     difference by comparing the behaviour of
2019    
2020     (a+)*\d
2021    
2022     with the pattern above. The pattern above, (a+)*b, gives a
2023     failure almost instantly when applied to a whole line of "a"
2024     characters, whereas (a+)*\d takes an appreciable time with
2025     strings longer than about 20 characters.
2026    
2027    
2028    
2029     UTF-8 SUPPORT
2030     Starting at release 3.3, PCRE has some support for character
2031     strings encoded in the UTF-8 format. This is incomplete, and
2032     is regarded as experimental. In order to use it, you must
2033     configure PCRE to include UTF-8 support in the code, and, in
2034     addition, you must call pcre_compile() with the PCRE_UTF8
2035     option flag. When you do this, both the pattern and any sub-
2036     ject strings that are matched against it are treated as
2037     UTF-8 strings instead of just strings of bytes, but only in
2038     the cases that are mentioned below.
2039    
2040     If you compile PCRE with UTF-8 support, but do not use it at
2041     run time, the library will be a bit bigger, but the addi-
2042     tional run time overhead is limited to testing the PCRE_UTF8
2043     flag in several places, so should not be very large.
2044    
2045     PCRE assumes that the strings it is given contain valid
2046     UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
2047     you pass invalid UTF-8 strings to PCRE, the results are
2048     undefined.
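
Because PCRE does not diagnose invalid UTF-8, a caller that handles untrusted input may want to check strings before passing them in. The following is a minimal sketch of such a check, not something PCRE provides. The function names are hypothetical, and the test is structural only: it confirms that each lead byte is followed by the right number of continuation bytes, but it does not reject overlong encodings or out-of-range code points.

```c
#include <stddef.h>

/*
 * Return the number of bytes a UTF-8 character should occupy, judged
 * from its first byte, or 0 if the byte cannot start a character.
 * Covers the one-byte to six-byte forms.
 */
static int utf8_char_length(unsigned char lead)
{
    if (lead < 0x80) return 1;      /* 0xxxxxxx */
    if (lead < 0xc0) return 0;      /* 10xxxxxx: continuation byte */
    if (lead < 0xe0) return 2;      /* 110xxxxx */
    if (lead < 0xf0) return 3;      /* 1110xxxx */
    if (lead < 0xf8) return 4;      /* 11110xxx */
    if (lead < 0xfc) return 5;      /* 111110xx */
    if (lead < 0xfe) return 6;      /* 1111110x */
    return 0;                       /* 0xfe and 0xff never appear */
}

/*
 * Return 1 if subject[0..length) is structurally well-formed UTF-8:
 * every lead byte is followed by the right number of 10xxxxxx bytes.
 * Overlong encodings and out-of-range code points are NOT rejected.
 */
static int utf8_well_formed(const char *subject, size_t length)
{
    const unsigned char *p = (const unsigned char *)subject;
    size_t i = 0;

    while (i < length) {
        int n = utf8_char_length(p[i]);
        int k;

        if (n == 0 || (size_t)n > length - i)
            return 0;               /* bad lead byte or truncated */
        for (k = 1; k < n; k++)
            if ((p[i + k] & 0xc0) != 0x80)
                return 0;           /* missing continuation byte */
        i += (size_t)n;
    }
    return 1;
}
```

A caller would run utf8_well_formed() on both the pattern and each subject string, and set PCRE_UTF8 only when the check succeeds.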
2049    
2050     Running with PCRE_UTF8 set causes these changes in the way
2051     PCRE works:
2052    
2053     1. In a pattern, the escape sequence \x{...}, where the con-
2054     tents of the braces is a string of hexadecimal digits, is
2055     interpreted as a UTF-8 character whose code number is the
2056     given hexadecimal number, for example: \x{1234}. This
2057     inserts from one to six literal bytes into the pattern,
2058     using the UTF-8 encoding. If a non-hexadecimal digit appears
2059     between the braces, the item is not recognized.
2060    
2061     2. The original hexadecimal escape sequence, \xhh, generates
2062     a two-byte UTF-8 character if its value is greater than 127.
2063    
2064     3. Repeat quantifiers are NOT correctly handled if they fol-
2065     low a multibyte character. For example, \x{100}* and \xc3+
2066     do not work. For now, to repeat such characters you must
2067     enclose them in non-capturing parentheses and quantify the
2068     group, for example (?:\x{100})*.
2069    
2070     4. The dot metacharacter matches one UTF-8 character instead
2071     of a single byte.
2072    
2073     5. Unlike literal UTF-8 characters, the dot metacharacter
2074     followed by a repeat quantifier does operate correctly on
2075     UTF-8 characters instead of single bytes.
2076    
2077     6. Although the \x{...} escape is permitted in a character
2078     class, characters whose values are greater than 255 cannot
2079     be included in a class.
2080    
2081     7. A class is matched against a UTF-8 character instead of
2082     just a single byte, but it can match only characters whose
2083     values are less than 256. Characters with greater values
2084     always fail to match a class.
2085    
2086     8. Repeated classes work correctly on multiple characters.
2087    
2088     9. Classes containing just a single character whose value is
2089     greater than 127 (but less than 256), for example, [\x80] or
2090     [^\x{93}], do not work because these are optimized into sin-
2091     gle byte matches. In the first case, of course, the class
2092     brackets are just redundant.
2093    
2094     10. Lookbehind assertions move backwards in the subject by a
2095     fixed number of characters instead of a fixed number of
2096     bytes. Simple cases have been tested to work correctly, but
2097     there may be hidden gotchas herein.
2098    
2099     11. The character types such as \d and \w do not work
2100     correctly with UTF-8 characters. They continue to test a
2101     single byte.
2102    
2103     12. Anything not explicitly mentioned here continues to work
2104     in bytes rather than in characters.
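
To make the byte counts in the list above concrete, here is a minimal sketch of how a code number such as the one in \x{1234} expands into one to six UTF-8 bytes. The function name and its buffer convention are hypothetical; this illustrates the encoding itself and is not taken from the PCRE source.

```c
/*
 * Encode code point cp (0 .. 0x7fffffff) as UTF-8 into buf, which
 * must have room for 6 bytes. Returns the number of bytes written,
 * or 0 if cp is too large to encode.
 * Hypothetical helper, not part of the PCRE API.
 */
static int utf8_encode(unsigned long cp, unsigned char *buf)
{
    /* Largest code point that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL
    };
    /* Marker bits for the first byte of an (n + 1)-byte character. */
    static const unsigned char lead[6] = {
        0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
    };
    int n, i;

    for (n = 0; n < 6; n++)
        if (cp <= limit[n])
            break;
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = n; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    buf[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this encoding, \x{100} becomes the two bytes 0xc4 0x80 and \x{1234} becomes the three bytes 0xe1 0x88 0xb4. A value above 127 written as \xhh, such as \xc3, likewise becomes two bytes (0xc3 0x83), which is why a quantifier placed directly after it runs into the limitation on repeated multibyte characters.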
2105    
2106     The following UTF-8 features of Perl 5.6 are not imple-
2107     mented:
2108    
2109     1. The escape sequence \C to match a single byte.
2110    
2111     2. The use of Unicode tables and properties and escapes \p,
2112     \P, and \X.
2113    
2114    
2115    
2116 nigel 41 AUTHOR
2117     Philip Hazel <ph10@cam.ac.uk>
2118     University Computing Service,
2119     New Museums Site,
2120     Cambridge CB2 3QG, England.
2121     Phone: +44 1223 334714
2122    
2123 nigel 49 Last updated: 28 August 2000,
2124     the 250th anniversary of the death of J.S. Bach.
2125 nigel 43 Copyright (c) 1997-2000 University of Cambridge.
