Contents of /code/tags/pcre-2.07/pcre.3.txt

Revision 38
Sat Feb 24 21:39:11 2007 UTC, by nigel
Tag code/trunk as code/tags/pcre-2.07.

1     NAME
2     pcre - Perl-compatible regular expressions.
3    
4    
5    
6     SYNOPSIS
7     #include <pcre.h>
8    
9     pcre *pcre_compile(const char *pattern, int options,
10     const char **errptr, int *erroffset,
11     const unsigned char *tableptr);
12    
13     pcre_extra *pcre_study(const pcre *code, int options,
14     const char **errptr);
15    
16     int pcre_exec(const pcre *code, const pcre_extra *extra,
17     const char *subject, int length, int startoffset,
18     int options, int *ovector, int ovecsize);
19    
20     int pcre_copy_substring(const char *subject, int *ovector,
21     int stringcount, int stringnumber, char *buffer,
22     int buffersize);
23    
24     int pcre_get_substring(const char *subject, int *ovector,
25     int stringcount, int stringnumber,
26     const char **stringptr);
27    
28     int pcre_get_substring_list(const char *subject,
29     int *ovector, int stringcount, const char ***listptr);
30    
31     const unsigned char *pcre_maketables(void);
32    
33     int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
34    
35     char *pcre_version(void);
36    
37     void *(*pcre_malloc)(size_t);
38    
39     void (*pcre_free)(void *);
40    
41    
42    
43    
44     DESCRIPTION
45     The PCRE library is a set of functions that implement regu-
46     lar expression pattern matching using the same syntax and
47     semantics as Perl 5, with just a few differences (see
48     below). The current implementation corresponds to Perl
49     5.005.
50    
51     PCRE has its own native API, which is described in this
52     document. There is also a set of wrapper functions that
53     correspond to the POSIX API. These are described in the
54     pcreposix documentation.

55     The native API function prototypes are defined in the header
56     file pcre.h, and on Unix systems the library itself is
57     called libpcre.a, so can be accessed by adding -lpcre to the
58     command for linking an application which calls it.
59    
60     The functions pcre_compile(), pcre_study(), and pcre_exec()
61     are used for compiling and matching regular expressions,
62     while pcre_copy_substring(), pcre_get_substring(), and
63     pcre_get_substring_list() are convenience functions for
64     extracting captured substrings from a matched subject
65     string. The function pcre_maketables() is used (optionally)
66     to build a set of character tables in the current locale for
67     passing to pcre_compile().
68    
69     The function pcre_info() is used to find out information
70     about a compiled pattern, while the function pcre_version()
71     returns a pointer to a string containing the version of PCRE
72     and its date of release.
73    
74     The global variables pcre_malloc and pcre_free initially
75     contain the entry points of the standard malloc() and free()
76     functions respectively. PCRE calls the memory management
77     functions via these variables, so a calling program can
78     replace them if it wishes to intercept the calls. This
79     should be done before calling any PCRE functions.
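The interception described above can be sketched in C as follows. This is an illustration under stated assumptions, not library code: counting_malloc, counting_free, install_counting_hooks, demo_malloc and demo_free are hypothetical names. The two demo_ variables are stand-ins with the same types the synopsis gives for pcre_malloc and pcre_free, so that the sketch compiles without pcre.h; a real program would assign to the library's own variables instead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Running totals kept by the wrappers below. */
size_t alloc_calls = 0;
size_t free_calls = 0;

/* Wrapper with the signature the synopsis gives for pcre_malloc. */
void *counting_malloc(size_t size)
{
    alloc_calls++;
    return malloc(size);
}

/* Wrapper with the signature the synopsis gives for pcre_free. */
void counting_free(void *block)
{
    free_calls++;
    free(block);
}

/* Hypothetical stand-ins for the library's global variables. They
   start out pointing at malloc() and free(), as the text says the
   real variables do. */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

/* Install the wrappers. With the real variables this would be done
   before calling any PCRE function. */
void install_counting_hooks(void)
{
    demo_malloc = counting_malloc;
    demo_free = counting_free;
}
```

After install_counting_hooks() has run, every allocation and release made through the two variables passes through the wrappers, so the counters show how many blocks were obtained and returned.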
80    
81    
82    
83     MULTI-THREADING
84     The PCRE functions can be used in multi-threading applica-
85     tions, with the proviso that the memory management functions
86     pointed to by pcre_malloc and pcre_free are shared by all
87     threads.
88    
89     The compiled form of a regular expression is not altered
90     during matching, so the same compiled pattern can safely be
91     used by several threads at once.
92    
93    
94    
95     COMPILING A PATTERN
96     The function pcre_compile() is called to compile a pattern
97     into an internal form. The pattern is a C string terminated
98     by a binary zero, and is passed in the argument pattern. A
99     pointer to a single block of memory that is obtained via
100     pcre_malloc is returned. This contains the compiled code and
101     related data. The pcre type is defined for this for conveni-
102     ence, but in fact pcre is just a typedef for void, since the
103     contents of the block are not externally defined. It is up
104     to the caller to free the memory when it is no longer
105     required.
106    
107     The size of a compiled pattern is roughly proportional to
108     the length of the pattern string, except that each character
109     class (other than those containing just a single character,
110     negated or not) requires 33 bytes, and repeat quantifiers
111     with a minimum greater than one or a bounded maximum cause
112     the relevant portions of the compiled pattern to be repli-
113     cated.
114    
115     The options argument contains independent bits that affect
116     the compilation. It should be zero if no options are
117     required. Some of the options, in particular, those that are
118     compatible with Perl, can also be set and unset from within
119     the pattern (see the detailed description of regular expres-
120     sions below). For these options, the contents of the options
121     argument specify their initial settings at the start of
122     compilation and execution. The PCRE_ANCHORED option can be
123     set at the time of matching as well as at compile time.
124    
125     If errptr is NULL, pcre_compile() returns NULL immediately.
126     Otherwise, if compilation of a pattern fails, pcre_compile()
127     returns NULL, and sets the variable pointed to by errptr to
128     point to a textual error message. The offset from the start
129     of the pattern to the character where the error was
130     discovered is placed in the variable pointed to by
131     erroffset, which must not be NULL. If it is, an immediate
132     error is given.
133    
134     If the final argument, tableptr, is NULL, PCRE uses a
135     default set of character tables which are built when it is
136     compiled, using the default C locale. Otherwise, tableptr
137     must be the result of a call to pcre_maketables(). See the
138     section on locale support below.
139    
140     The following option bits are defined in the header file:
141    
142     PCRE_ANCHORED
143    
144     If this bit is set, the pattern is forced to be "anchored",
145     that is, it is constrained to match only at the start of the
146     string which is being searched (the "subject string"). This
147     effect can also be achieved by appropriate constructs in the
148     pattern itself, which is the only way to do it in Perl.
149    
150     PCRE_CASELESS
151    
152     If this bit is set, letters in the pattern match both upper
153     and lower case letters. It is equivalent to Perl's /i
154     option.
155    
156     PCRE_DOLLAR_ENDONLY
157    
158     If this bit is set, a dollar metacharacter in the pattern
159     matches only at the end of the subject string. Without this
160     option, a dollar also matches immediately before the final
161     character if it is a newline (but not before any other new-
162     lines). The PCRE_DOLLAR_ENDONLY option is ignored if
163     PCRE_MULTILINE is set. There is no equivalent to this option
164     in Perl.
165    
166     PCRE_DOTALL
167    
168     If this bit is set, a dot metacharacter in the pattern
169     matches all characters, including newlines. Without it, new-
170     lines are excluded. This option is equivalent to Perl's /s
171     option. A negative class such as [^a] always matches a new-
172     line character, independent of the setting of this option.
173    
174     PCRE_EXTENDED
175    
176     If this bit is set, whitespace data characters in the pat-
177     tern are totally ignored except when escaped or inside a
178     character class, and characters between an unescaped # out-
179     side a character class and the next newline character,
180     inclusive, are also ignored. This is equivalent to Perl's /x
181     option, and makes it possible to include comments inside
182     complicated patterns. Note, however, that this applies only
183     to data characters. Whitespace characters may never appear
184     within special character sequences in a pattern, for example
185     within the sequence (?( which introduces a conditional sub-
186     pattern.
187    
188     PCRE_EXTRA
189    
190     This option turns on additional functionality of PCRE that
191     is incompatible with Perl. Any backslash in a pattern that
192     is followed by a letter that has no special meaning causes
193     an error, thus reserving these combinations for future
194     expansion. By default, as in Perl, a backslash followed by a
195     letter with no special meaning is treated as a literal.
196     There are at present no other features controlled by this
197     option.
198    
199     PCRE_MULTILINE
200    
201     By default, PCRE treats the subject string as consisting of
202     a single "line" of characters (even if it actually contains
203     several newlines). The "start of line" metacharacter (^)
204     matches only at the start of the string, while the "end of
205     line" metacharacter ($) matches only at the end of the
206     string, or before a terminating newline (unless
207     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
208    
209     When PCRE_MULTILINE is set, the "start of line" and "end
210     of line" constructs match immediately following or
211     immediately before any newline in the subject string,
212     respectively, as well as at the very start and end. This is
213     equivalent to Perl's /m option. If there are no "\n" charac-
214     ters in a subject string, or no occurrences of ^ or $ in a
215     pattern, setting PCRE_MULTILINE has no effect.
216    
217     PCRE_UNGREEDY
218    
219     This option inverts the "greediness" of the quantifiers so
220     that they are not greedy by default, but become greedy if
221     followed by "?". It is not compatible with Perl. It can also
222     be set by a (?U) option setting within the pattern.
223    
224    
225    
226     STUDYING A PATTERN
227     When a pattern is going to be used several times, it is
228     worth spending more time analyzing it in order to speed up
229     the time taken for matching. The function pcre_study() takes
230     a pointer to a compiled pattern as its first argument, and
231     returns a pointer to a pcre_extra block (another void
232     typedef) containing additional information about the pat-
233     tern; this can be passed to pcre_exec(). If no additional
234     information is available, NULL is returned.
235    
236     The second argument contains option bits. At present, no
237     options are defined for pcre_study(), and this argument
238     should always be zero.
239    
240     The third argument for pcre_study() is a pointer to an error
241     message. If studying succeeds (even if no data is returned),
242     the variable it points to is set to NULL. Otherwise it
243     points to a textual error message.
244    
245     At present, studying a pattern is useful only for non-
246     anchored patterns that do not have a single fixed starting
247     character. A bitmap of possible starting characters is
248     created.
249    
250    
251    
252     LOCALE SUPPORT
253     PCRE handles caseless matching, and determines whether char-
254     acters are letters, digits, or whatever, by reference to a
255     set of tables. The library contains a default set of tables
256     which is created in the default C locale when PCRE is com-
257     piled. This is used when the final argument of
258     pcre_compile() is NULL, and is sufficient for many applica-
259     tions.
260    
261     An alternative set of tables can, however, be supplied. Such
262     tables are built by calling the pcre_maketables() function,
263     which has no arguments, in the relevant locale. The result
264     can then be passed to pcre_compile() as often as necessary.
265     For example, to build and use tables that are appropriate
266     for the French locale (where accented characters with codes
267     greater than 128 are treated as letters), the following code
268     could be used:
269    
270     setlocale(LC_CTYPE, "fr");
271     tables = pcre_maketables();
272     re = pcre_compile(..., tables);
273    
274     The tables are built in memory that is obtained via
275     pcre_malloc. The pointer that is passed to pcre_compile is
276     saved with the compiled pattern, and the same tables are
277     used via this pointer by pcre_study() and pcre_exec(). Thus
278     for any single pattern, compilation, studying and matching
279     all happen in the same locale, but different patterns can be
280     compiled in different locales. It is the caller's responsi-
281     bility to ensure that the memory containing the tables
282     remains available for as long as it is needed.
283    
284    
285    
286     INFORMATION ABOUT A PATTERN
287     The pcre_info() function returns information about a com-
288     piled pattern. Its yield is the number of capturing subpat-
289     terns, or one of the following negative numbers:
290    
291     PCRE_ERROR_NULL the argument code was NULL
292     PCRE_ERROR_BADMAGIC the "magic number" was not found
293    
294     If the optptr argument is not NULL, a copy of the options
295     with which the pattern was compiled is placed in the integer
296     it points to. These option bits are those specified in the
297     call to pcre_compile(), modified by any top-level option
298     settings within the pattern itself, and with the
299     PCRE_ANCHORED bit set if the form of the pattern implies
300     that it can match only at the start of a subject string.
301    
302     If the pattern is not anchored and the firstcharptr argument
303     is not NULL, it is used to pass back information about the
304     first character of any matched string. If there is a fixed
305     first character, e.g. from a pattern such as
306     (cat|cow|coyote), then it is returned in the integer pointed
307     to by firstcharptr. Otherwise, if either
308    
309     (a) the pattern was compiled with the PCRE_MULTILINE option,
310     and every branch starts with "^", or
311    
312     (b) every branch of the pattern starts with ".*" and
313     PCRE_DOTALL is not set (if it were set, the pattern would be
314     anchored),
315     then -1 is returned, indicating that the pattern matches
316     only at the start of a subject string or after any "\n"
317     within the string. Otherwise -2 is returned.
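The three kinds of value that can come back through firstcharptr can be told apart with a small helper. The C sketch below assumes only what the text above states (a non-negative character code, -1, or -2); the enum and the function name classify_firstchar are hypothetical and not part of PCRE.

```c
/* What the integer stored through firstcharptr says about the
   start of any match. */
enum first_char_kind {
    FIRST_FIXED,        /* every match starts with one known character */
    FIRST_LINE_START,   /* matches start at the subject start or after "\n" */
    FIRST_UNKNOWN       /* nothing useful is known */
};

/* Classify the value reported for a pattern that is not anchored. */
enum first_char_kind classify_firstchar(int firstchar)
{
    if (firstchar >= 0)
        return FIRST_FIXED;         /* the value is the character itself */
    if (firstchar == -1)
        return FIRST_LINE_START;
    return FIRST_UNKNOWN;           /* -2 */
}
```

For a pattern such as (cat|cow|coyote) the reported value would be the code for "c", which this helper classifies as FIRST_FIXED.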
318    
319    
320    
321     MATCHING A PATTERN
322     The function pcre_exec() is called to match a subject string
323     against a pre-compiled pattern, which is passed in the code
324     argument. If the pattern has been studied, the result of the
325     study should be passed in the extra argument. Otherwise this
326     must be NULL.
327    
328     The PCRE_ANCHORED option can be passed in the options argu-
329     ment, whose unused bits must be zero. However, if a pattern
330     was compiled with PCRE_ANCHORED, or turned out to be
331     anchored by virtue of its contents, it cannot be made
332     unanchored at matching time.
333    
334     There are also three further options that can be set only at
335     matching time:
336    
337     PCRE_NOTBOL
338    
339     The first character of the string is not the beginning of a
340     line, so the circumflex metacharacter should not match
341     before it. Setting this without PCRE_MULTILINE (at compile
342     time) causes circumflex never to match.
343    
344     PCRE_NOTEOL
345    
346     The end of the string is not the end of a line, so the dol-
347     lar metacharacter should not match it nor (except in multi-
348     line mode) a newline immediately before it. Setting this
349     without PCRE_MULTILINE (at compile time) causes dollar never
350     to match.
351    
352     PCRE_NOTEMPTY
353    
354     An empty string is not considered to be a valid match if
355     this option is set. If there are alternatives in the pat-
356     tern, they are tried. If all the alternatives match the
357     empty string, the entire match fails. For example, if the
358     pattern
359    
360     a?b?
361    
362     is applied to a string not beginning with "a" or "b", it
363     matches the empty string at the start of the subject. With
364     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
365     further into the string for occurrences of "a" or "b". Perl
366     has no direct equivalent of this option, but it makes a
367     special case of a pattern match of the empty string within
368     its split() function. Using PCRE_NOTEMPTY it is possible to
369     emulate this behaviour.
370    
371     The subject string is passed as a pointer in subject, a
372     length in length, and a starting offset in startoffset.
373     Unlike the pattern string, it may contain binary zero char-
374     acters. When the starting offset is zero, the search for a
375     match starts at the beginning of the subject, and this is by
376     far the most common case.
377    
378     A non-zero starting offset is useful when searching for
379     another match in the same subject by calling pcre_exec()
380     again after a previous success. Setting startoffset differs
381     from just passing over a shortened string and setting
382     PCRE_NOTBOL in the case of a pattern that begins with any
383     kind of lookbehind. For example, consider the pattern
384    
385     \Biss\B
386    
387     which finds occurrences of "iss" in the middle of words. (\B
388     matches only if the current position in the subject is not a
389     word boundary.) When applied to the string "Mississippi" the
390     first call to pcre_exec() finds the first occurrence. If
391     pcre_exec() is called again with just the remainder of the
392     subject, namely "issippi", it does not match, because \B is
393     always false at the start of the subject, which is deemed to
394     be a word boundary. However, if pcre_exec() is passed the
395     entire string again, but with startoffset set to 4, it finds
396     the second occurrence of "iss" because it is able to look
397     behind the starting point to discover that it is preceded by
398     a letter.
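The reasoning in this example can be modelled without PCRE at all. The C sketch below is a simplified stand-in for the word-boundary test, not PCRE's implementation: the names is_word_char and is_word_boundary are hypothetical, and a word character is assumed to be anything isalnum() accepts in the current locale, plus underscore, whereas PCRE consults its own character tables. Positions outside the subject count as non-word characters, which is the information a shortened subject loses and startoffset preserves.

```c
#include <ctype.h>

/* Non-zero if subject[pos] is a word character. Positions before
   the start or past the end of the subject count as non-word. */
int is_word_char(const char *subject, int length, int pos)
{
    unsigned char c;

    if (pos < 0 || pos >= length)
        return 0;
    c = (unsigned char)subject[pos];
    return isalnum(c) || c == '_';
}

/* Non-zero if there is a word boundary immediately before
   subject[pos], that is, if exactly one of the two neighbouring
   characters is a word character. \B succeeds where this is zero. */
int is_word_boundary(const char *subject, int length, int pos)
{
    return is_word_char(subject, length, pos - 1)
        != is_word_char(subject, length, pos);
}
```

With the full subject "Mississippi" and position 4, the character before the position is a letter, so is_word_boundary() returns zero and \B can succeed. With the shortened subject "issippi" and position 0 it returns non-zero, so \B fails.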
399    
400     If a non-zero starting offset is passed when the pattern is
401     anchored, one attempt to match at the given offset is tried.
402     This can only succeed if the pattern does not require the
403     match to be at the start of the subject.
404    
405     In general, a pattern matches a certain portion of the sub-
406     ject, and in addition, further substrings from the subject
407     may be picked out by parts of the pattern. Following the
408     usage in Jeffrey Friedl's book, this is called "capturing"
409     in what follows, and the phrase "capturing subpattern" is
410     used for a fragment of a pattern that picks out a substring.
411     PCRE supports several other kinds of parenthesized subpat-
412     tern that do not cause substrings to be captured.
413    
414     Captured substrings are returned to the caller via a vector
415     of integer offsets whose address is passed in ovector. The
416     number of elements in the vector is passed in ovecsize. The
417     first two-thirds of the vector is used to pass back captured
418     substrings, each substring using a pair of integers. The
419     remaining third of the vector is used as workspace by
420     pcre_exec() while matching capturing subpatterns, and is not
421     available for passing back information. The length passed in
422     ovecsize should always be a multiple of three. If it is not,
423     it is rounded down.
424    
425     When a match has been successful, information about captured
426     substrings is returned in pairs of integers, starting at the
427     beginning of ovector, and continuing up to two-thirds of its
428     length at the most. The first element of a pair is set to
429     the offset of the first character in a substring, and the
430     second is set to the offset of the first character after the
431     end of a substring. The first pair, ovector[0] and ovec-
432     tor[1], identify the portion of the subject string matched
433     by the entire pattern. The next pair is used for the first
434     capturing subpattern, and so on. The value returned by
435     pcre_exec() is the number of pairs that have been set. If
436     there are no capturing subpatterns, the return value from a
437     successful match is 1, indicating that just the first pair
438     of offsets has been set.
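Reading a pair of offsets by hand looks roughly like the following C sketch. It assumes only the layout described above (pair k occupies ovector[2*k] and ovector[2*k+1]); the function name copy_pair and its -1 failure value are hypothetical conventions of this example, not PCRE's.

```c
#include <string.h>

/* Copy the substring described by pair number `pair` of ovector
   into buf and zero-terminate it. Returns the substring length, or
   -1 if the pair is unset or buf is too small. */
int copy_pair(const char *subject, const int *ovector, int pair,
              char *buf, int bufsize)
{
    int start = ovector[2 * pair];      /* offset of first character */
    int end = ovector[2 * pair + 1];    /* offset just past the end */
    int len = end - start;

    if (start < 0 || len < 0 || len >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the subject "hello world" and a hand-written vector whose second pair is 6 and 11, copy_pair(subject, ovector, 1, buf, sizeof buf) places "world" in buf. Because the length comes from the offsets rather than from a terminator, a substring containing binary zeros is copied whole.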
439    
440     Some convenience functions are provided for extracting the
441     captured substrings as separate strings. These are described
442     in the following section.
443    
444     It is possible for a capturing subpattern number n+1 to
445     match some part of the subject when subpattern n has not
446     been used at all. For example, if the string "abc" is
447     matched against the pattern (a|(z))(bc) subpatterns 1 and 3
448     are matched, but 2 is not. When this happens, both offset
449     values corresponding to the unused subpattern are set to -1.
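A caller can test for an unused subpattern before touching its offsets. The C sketch below assumes the -1 convention just described; group_is_set and group_length are hypothetical helper names.

```c
/* Non-zero if capturing subpattern n took part in the match. */
int group_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Length of the text captured by subpattern n, or -1 if the
   subpattern is unset. A set but empty capture gives 0. */
int group_length(const int *ovector, int n)
{
    if (!group_is_set(ovector, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For "abc" matched against (a|(z))(bc), the offsets described above would be 0,3 for the whole match, 0,1 for subpattern 1, -1,-1 for subpattern 2 and 1,3 for subpattern 3. With those values group_is_set() is zero for subpattern 2 and group_length() gives 2 for subpattern 3.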
450    
451     If a capturing subpattern is matched repeatedly, it is the
452     last portion of the string that it matched that gets
453     returned.
454    
455     If the vector is too small to hold all the captured sub-
456     strings, it is used as far as possible (up to two-thirds of
457     its length), and the function returns a value of zero. In
458     particular, if the substring offsets are not of interest,
459     pcre_exec() may be called with ovector passed as NULL and
460     ovecsize as zero. However, if the pattern contains back
461     references and the ovector isn't big enough to remember the
462     related substrings, PCRE has to get additional memory for
463     use during matching. Thus it is usually advisable to supply
464     an ovector.
465    
466     Note that pcre_info() can be used to find out how many cap-
467     turing subpatterns there are in a compiled pattern. The
468     smallest size for ovector that will allow for n captured
469     substrings in addition to the offsets of the substring
470     matched by the whole pattern is (n+1)*3.
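The sizing rule can be written down as two one-line helpers. This C sketch assumes only the arithmetic given above; ovecsize_for and pairs_reported are hypothetical names.

```c
/* Smallest ovecsize that can report n captured substrings plus the
   overall match: (n + 1) pairs take two-thirds of the vector and
   the remaining third is workspace. */
int ovecsize_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that a vector of the given size can
   report. A size that is not a multiple of three is rounded down. */
int pairs_reported(int ovecsize)
{
    return ovecsize / 3;
}
```

A pattern with three capturing subpatterns therefore needs a vector of at least 12 integers, and a vector of 10 integers behaves like one of 9, reporting at most three pairs.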

471     If pcre_exec() fails, it returns a negative number. The
472     following are defined in the header file:
473    
474     PCRE_ERROR_NOMATCH (-1)
475    
476     The subject string did not match the pattern.
477    
478     PCRE_ERROR_NULL (-2)
479    
480     Either code or subject was passed as NULL, or ovector was
481     NULL and ovecsize was not zero.
482    
483     PCRE_ERROR_BADOPTION (-3)
484    
485     An unrecognized bit was set in the options argument.
486    
487     PCRE_ERROR_BADMAGIC (-4)
488    
489     PCRE stores a 4-byte "magic number" at the start of the com-
490     piled code, to catch the case when it is passed a junk
491     pointer. This is the error it gives when the magic number
492     isn't present.
493    
494     PCRE_ERROR_UNKNOWN_NODE (-5)
495    
496     While running the pattern match, an unknown item was encoun-
497     tered in the compiled pattern. This error could be caused by
498     a bug in PCRE or by overwriting of the compiled pattern.
499    
500     PCRE_ERROR_NOMEMORY (-6)
501    
502     If a pattern contains back references, but the ovector that
503     is passed to pcre_exec() is not big enough to remember the
504     referenced substrings, PCRE gets a block of memory at the
505     start of matching to use for this purpose. If the call via
506     pcre_malloc() fails, this error is given. The memory is
507     freed at the end of matching.
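A caller will usually want to turn these results into a message. The C sketch below uses the numeric values listed above as literals so that it compiles without pcre.h; a real program would use the PCRE_ERROR_ names from the header instead. The function name exec_error_text and the message wording are hypothetical.

```c
/* Short description of a negative pcre_exec() result, keyed on the
   numeric values given in the list above. */
const char *exec_error_text(int rc)
{
    switch (rc) {
    case -1: return "no match";                          /* NOMATCH */
    case -2: return "NULL argument";                     /* NULL */
    case -3: return "bad option bit";                    /* BADOPTION */
    case -4: return "bad magic number";                  /* BADMAGIC */
    case -5: return "unknown node in compiled pattern";  /* UNKNOWN_NODE */
    case -6: return "out of memory";                     /* NOMEMORY */
    default:
        return rc >= 0 ? "not an error" : "unrecognized error code";
    }
}
```

Only the first of these is an ordinary outcome; the others point to a programming error, a damaged compiled pattern, or memory exhaustion, so a caller may prefer to treat them differently from a plain failure to match.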
508    
509    
510    
511     EXTRACTING CAPTURED SUBSTRINGS
512     Captured substrings can be accessed directly by using the
513     offsets returned by pcre_exec() in ovector. For convenience,
514     the functions pcre_copy_substring(), pcre_get_substring(),
515     and pcre_get_substring_list() are provided for extracting
516     captured substrings as new, separate, zero-terminated
517     strings. A substring that contains a binary zero is
518     correctly extracted and has a further zero added on the end,
519     but the result does not, of course, function as a C string.
520    
521     The first three arguments are the same for all three func-
522     tions: subject is the subject string which has just been
523     successfully matched, ovector is a pointer to the vector of
524     integer offsets that was passed to pcre_exec(), and
525     stringcount is the number of substrings that were captured
526     by the match, including the substring that matched the
527     entire regular expression. This is the value returned by
528     pcre_exec if it is greater than zero. If pcre_exec()
529     returned zero, indicating that it ran out of space in ovec-
530     tor, then the value passed as stringcount should be the size
531     of the vector divided by three.
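The rule for choosing stringcount can be captured in a small helper. This C sketch follows the paragraph above for positive and zero results; returning 0 for a negative result is this example's own convention, on the assumption that nothing can be extracted after a failed match. The name stringcount_for is hypothetical.

```c
/* Value to pass as stringcount, given the result of pcre_exec()
   and the ovecsize that was passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* the vector filled up */
    return 0;                   /* failed match: nothing to extract */
}
```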
532    
533     The functions pcre_copy_substring() and pcre_get_substring()
534     extract a single substring, whose number is given as string-
535     number. A value of zero extracts the substring that matched
536     the entire pattern, while higher values extract the captured
537     substrings. For pcre_copy_substring(), the string is placed
538     in buffer, whose length is given by buffersize, while for
539     pcre_get_substring() a new block of store is obtained via
540     pcre_malloc, and its address is returned via stringptr. The
541     yield of the function is the length of the string, not
542     including the terminating zero, or one of
543    
544     PCRE_ERROR_NOMEMORY (-6)
545    
546     The buffer was too small for pcre_copy_substring(), or the
547     attempt to get memory failed for pcre_get_substring().
548    
549     PCRE_ERROR_NOSUBSTRING (-7)
550    
551     There is no substring whose number is stringnumber.
552    
553     The pcre_get_substring_list() function extracts all avail-
554     able substrings and builds a list of pointers to them. All
555     this is done in a single block of memory which is obtained
556     via pcre_malloc. The address of the memory block is returned
557     via listptr, which is also the start of the list of string
558     pointers. The end of the list is marked by a NULL pointer.
559     The yield of the function is zero if all went well, or
560    
561     PCRE_ERROR_NOMEMORY (-6)
562    
563     if the attempt to get the memory block failed.
564    
565     When any of these functions encounter a substring that is
566     unset, which can happen when capturing subpattern number n+1
567     matches some part of the subject, but subpattern n has not
568     been used at all, they return an empty string. This can be
569     distinguished from a genuine zero-length substring by
570     inspecting the appropriate offset in ovector, which is nega-
571     tive for unset substrings.
572    
573    
574    
575     LIMITATIONS
576     There are some size limitations in PCRE but it is hoped that
577     they will never in practice be relevant. The maximum length
578     of a compiled pattern is 65539 (sic) bytes. All values in
579     repeating quantifiers must be less than 65536. The maximum
580     number of capturing subpatterns is 99. The maximum number
581     of all parenthesized subpatterns, including capturing sub-
582     patterns, assertions, and other types of subpattern, is 200.
583    
584     The maximum length of a subject string is the largest posi-
585     tive number that an integer variable can hold. However, PCRE
586     uses recursion to handle subpatterns and indefinite repeti-
587     tion. This means that the available stack space may limit
588     the size of a subject string that can be processed by cer-
589     tain patterns.
590    
591    
592    
593     DIFFERENCES FROM PERL
594     The differences described here are with respect to Perl
595     5.005.
596    
597     1. By default, a whitespace character is any character that
598     the C library function isspace() recognizes, though it is
599     possible to compile PCRE with alternative character type
600     tables. Normally isspace() matches space, formfeed, newline,
601     carriage return, horizontal tab, and vertical tab. Perl 5 no
602     longer includes vertical tab in its set of whitespace char-
603     acters. The \v escape that was in the Perl documentation for
604     a long time was never in fact recognized. However, the char-
605     acter itself was treated as whitespace at least up to 5.002.
606     In 5.004 and 5.005 it does not match \s.
607    
608     2. PCRE does not allow repeat quantifiers on lookahead
609     assertions. Perl permits them, but they do not mean what you
610     might think. For example, (?!a){3} does not assert that the
611     next three characters are not "a". It just asserts that the
612     next character is not "a" three times.
613    
614     3. Capturing subpatterns that occur inside negative looka-
615     head assertions are counted, but their entries in the
616     offsets vector are never set. Perl sets its numerical vari-
617     ables from any such patterns that are matched before the
618     assertion fails to match something (thereby succeeding), but
619     only if the negative lookahead assertion contains just one
620     branch.
621    
622     4. Though binary zero characters are supported in the sub-
623     ject string, they are not allowed in a pattern string
624     because it is passed as a normal C string, terminated by
625     zero. The escape sequence "\0" can be used in the pattern to
626     represent a binary zero.

627     5. The following Perl escape sequences are not supported:
628     \l, \u, \L, \U, \E, \Q. In fact these are implemented by
629     Perl's general string-handling and are not part of its pat-
630     tern matching engine.
631    
632     6. The Perl \G assertion is not supported as it is not
633     relevant to single pattern matches.
634    
635     7. Fairly obviously, PCRE does not support the (?{code})
636     construction.
637    
638     8. There are at the time of writing some oddities in Perl
639     5.005_02 concerned with the settings of captured strings
640     when part of a pattern is repeated. For example, matching
641     "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
642     "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
643     unset. However, if the pattern is changed to
644     /^(aa(b(b))?)+$/ then $2 (and $3) get set.
645    
646     In Perl 5.004 $2 is set in both cases, and that is also true
647     of PCRE. If in the future Perl changes to a consistent state
648     that is different, PCRE may change to follow.
649    
650     9. Another as yet unresolved discrepancy is that in Perl
651     5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
652     "a", whereas in PCRE it does not. However, in both Perl and
653     PCRE /^(a)?a/ matched against "a" leaves $1 unset.
654    
655     10. PCRE provides some extensions to the Perl regular
656     expression facilities:
657    
658     (a) Although lookbehind assertions must match fixed length
659     strings, each alternative branch of a lookbehind assertion
660     can match a different length of string. Perl 5.005 requires
661     them all to have the same length.
662    
663     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
664     set, the $ meta-character matches only at the very end of
665     the string.
666    
667     (c) If PCRE_EXTRA is set, a backslash followed by a letter
668     with no special meaning is faulted.
669    
670     (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
671     tion quantifiers is inverted, that is, by default they are
672     not greedy, but if followed by a question mark they are.
673    
674     (e) PCRE_ANCHORED can be used to force a pattern to be tried
675     only at the start of the subject.
676    
677     (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
678     for pcre_exec() have no Perl equivalents.
679    
680    
681    
682     REGULAR EXPRESSION DETAILS
683     The syntax and semantics of the regular expressions sup-
684     ported by PCRE are described below. Regular expressions are
685     also described in the Perl documentation and in a number of
686     other books, some of which have copious examples. Jeffrey
687     Friedl's "Mastering Regular Expressions", published by
688     O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
689     The description here is intended as reference documentation.
690    
691     A regular expression is a pattern that is matched against a
692     subject string from left to right. Most characters stand for
693     themselves in a pattern, and match the corresponding charac-
694     ters in the subject. As a trivial example, the pattern
695    
696     The quick brown fox
697    
698     matches a portion of a subject string that is identical to
699     itself. The power of regular expressions comes from the
700     ability to include alternatives and repetitions in the pat-
701     tern. These are encoded in the pattern by the use of meta-
702     characters, which do not stand for themselves but instead
703     are interpreted in some special way.
704    
705     There are two different sets of meta-characters: those that
706     are recognized anywhere in the pattern except within square
707     brackets, and those that are recognized in square brackets.
708     Outside square brackets, the meta-characters are as follows:
709    
710     \ general escape character with several uses
711     ^ assert start of subject (or line, in multiline
712     mode)
713     $ assert end of subject (or line, in multiline mode)
714     . match any character except newline (by default)
715     [ start character class definition
716     | start of alternative branch
717     ( start subpattern
718     ) end subpattern
719     ? extends the meaning of (
720     also 0 or 1 quantifier
721     also quantifier minimizer
722     * 0 or more quantifier
723     + 1 or more quantifier
724     { start min/max quantifier
725    
726     Part of a pattern that is in square brackets is called a
727     "character class". In a character class the only meta-
728     characters are:
729    
730     \ general escape character
731     ^ negate the class, but only if the first character
732     - indicates character range
733     ] terminates the character class
734    
735     The following sections describe the use of each of the
736     meta-characters.
737    
738    
739    
740     BACKSLASH
741     The backslash character has several uses. Firstly, if it is
742     followed by a non-alphameric character, it takes away any
743     special meaning that character may have. This use of
744     backslash as an escape character applies both inside and
745     outside character classes.
746    
747     For example, if you want to match a "*" character, you write
748     "\*" in the pattern. This applies whether or not the follow-
749     ing character would otherwise be interpreted as a meta-
750     character, so it is always safe to precede a non-alphameric
751     with "\" to specify that it stands for itself. In particu-
752     lar, if you want to match a backslash, you write "\\".
753    
754     If a pattern is compiled with the PCRE_EXTENDED option, whi-
755     tespace in the pattern (other than in a character class) and
756     characters between a "#" outside a character class and the
757     next newline character are ignored. An escaping backslash
758     can be used to include a whitespace or "#" character as part
759     of the pattern.
760    
761     A second use of backslash provides a way of encoding non-
762     printing characters in patterns in a visible manner. There
763     is no restriction on the appearance of non-printing charac-
764     ters, apart from the binary zero that terminates a pattern,
765     but when a pattern is being prepared by text editing, it is
766     usually easier to use one of the following escape sequences
767     than the binary character it represents:
768    
769     \a alarm, that is, the BEL character (hex 07)
770     \cx "control-x", where x is any character
771     \e escape (hex 1B)
772     \f formfeed (hex 0C)
773     \n newline (hex 0A)
774     \r carriage return (hex 0D)
775     \t tab (hex 09)
776     \xhh character with hex code hh
777     \ddd character with octal code ddd, or backreference
778    
779     The precise effect of "\cx" is as follows: if "x" is a lower
780     case letter, it is converted to upper case. Then bit 6 of
781     the character (hex 40) is inverted. Thus "\cz" becomes hex
782     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
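
The "\cx" rule above can be sketched in C as follows. This is an illustration of the rule as stated, not PCRE's own source, and the function name control_escape is hypothetical.

```c
#include <assert.h>
#include <ctype.h>

/* Sketch of the "\cx" rule: fold a lower case letter to upper case,
   then invert bit 6 (hex 40) of the character. */
unsigned char control_escape(unsigned char x)
{
    assert(x != 0);            /* a binary zero would terminate the pattern */
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```

With this sketch, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, in line with the values quoted above.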
783    
784     After "\x", up to two hexadecimal digits are read (letters
785     can be in upper or lower case).
786    
787     After "\0" up to two further octal digits are read. In both
788     cases, if there are fewer than two digits, just those that
789     are present are used. Thus the sequence "\0\x\07" specifies
790     two binary zeros followed by a BEL character. Make sure you
791     supply two digits after the initial zero if the character
792     that follows is itself an octal digit.
793    
794     The handling of a backslash followed by a digit other than 0
795     is complicated. Outside a character class, PCRE reads it
796     and any following digits as a decimal number. If the number
797     is less than 10, or if there have been at least that many
798     previous capturing left parentheses in the expression, the
799     entire sequence is taken as a back reference. A description
800     of how this works is given later, following the discussion
801     of parenthesized subpatterns.
802    
803     Inside a character class, or if the decimal number is
804     greater than 9 and there have not been that many capturing
805     subpatterns, PCRE re-reads up to three octal digits follow-
806     ing the backslash, and generates a single byte from the
807     least significant 8 bits of the value. Any subsequent digits
808     stand for themselves. For example:
809    
810     \040 is another way of writing a space
811     \40 is the same, provided there are fewer than 40
812     previous capturing subpatterns
813     \7 is always a back reference
814     \11 might be a back reference, or another way of
815     writing a tab
816     \011 is always a tab
817     \0113 is a tab followed by the character "3"
818     \113 is the character with octal code 113 (since there
819     can be no more than 99 back references)
820     \377 is a byte consisting entirely of 1 bits
821     \81 is either a back reference, or a binary zero
822     followed by the two characters "8" and "1"
823    
824     Note that octal values of 100 or greater must not be intro-
825     duced by a leading zero, because no more than three octal
826     digits are ever read.
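
The decision between a back reference and an octal byte can be sketched as below. This is only a model of the rules as described in this section, not PCRE's parser; the struct and function names are hypothetical, and a leading "0" is assumed to go straight to the octal reading.

```c
#include <assert.h>
#include <stddef.h>

/* Result of interpreting the digits that follow a backslash. */
struct digit_escape {
    int is_backref;   /* 1: back reference, 0: single byte value     */
    int value;        /* back reference number, or the byte (0..255) */
    size_t consumed;  /* how many digits belong to the escape        */
};

/* digits:   text immediately after the backslash
   captures: capturing left parentheses seen so far
   in_class: non-zero when inside a character class */
struct digit_escape read_digit_escape(const char *digits, int captures,
                                      int in_class)
{
    struct digit_escape r = { 0, 0, 0 };
    size_t i = 0;
    int n = 0;

    assert(digits != NULL && captures >= 0);

    if (!in_class && digits[0] >= '1' && digits[0] <= '9') {
        /* Read every following digit as one decimal number (capped). */
        while (digits[i] >= '0' && digits[i] <= '9') {
            if (n < 1000) n = n * 10 + (digits[i] - '0');
            i++;
        }
        if (n < 10 || n <= captures) {
            r.is_backref = 1;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits; keep the low 8 bits. */
    n = 0;
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        n = n * 8 + (digits[i] - '0');
    r.value = n & 0xFF;
    r.consumed = i;
    return r;
}
```

For the digits "81" with no capturing parentheses, the sketch consumes nothing and yields a zero byte, leaving "8" and "1" to stand for themselves, as in the last example of the table above.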
827    
828     All the sequences that define a single byte value can be
829     used both inside and outside character classes. In addition,
830     inside a character class, the sequence "\b" is interpreted
831     as the backspace character (hex 08). Outside a character
832     class it has a different meaning (see below).
833    
834     The third use of backslash is for specifying generic charac-
835     ter types:
836    
837     \d any decimal digit
838     \D any character that is not a decimal digit
839     \s any whitespace character
840     \S any character that is not a whitespace character
841     \w any "word" character
842     \W any "non-word" character
843    
844     Each pair of escape sequences partitions the complete set of
845     characters into two disjoint sets. Any given character
846     matches one, and only one, of each pair.
847    
848     A "word" character is any letter or digit or the underscore
849     character, that is, any character which can be part of a
850     Perl "word". The definition of letters and digits is con-
851     trolled by PCRE's character tables, and may vary if locale-
852     specific matching is taking place (see "Locale support"
853     above). For example, in the "fr" (French) locale, some char-
854     acter codes greater than 128 are used for accented letters,
855     and these are matched by \w.
856    
857     These character type sequences can appear both inside and
858     outside character classes. They each match one character of
859     the appropriate type. If the current matching point is at
860     the end of the subject string, all of them fail, since there
861     is no character to match.
862    
863     The fourth use of backslash is for certain simple asser-
864     tions. An assertion specifies a condition that has to be met
865     at a particular point in a match, without consuming any
866     characters from the subject string. The use of subpatterns
867     for more complicated assertions is described below. The
868     backslashed assertions are
869    
870     \b word boundary
871     \B not a word boundary
872     \A start of subject (independent of multiline mode)
873     \Z end of subject or newline at end (independent of
874     multiline mode)
875     \z end of subject (independent of multiline mode)
876    
877     These assertions may not appear in character classes (but
878     note that "\b" has a different meaning, namely the backspace
879     character, inside a character class).
880    
881     A word boundary is a position in the subject string where
882     the current character and the previous character do not both
883     match \w or \W (i.e. one matches \w and the other matches
884     \W), or the start or end of the string if the first or last
885     character matches \w, respectively.
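
The word boundary definition can be expressed as a short test. The sketch below assumes the default C locale, where a "word" character is a letter, digit, or underscore; the function names are hypothetical and this is not PCRE's implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A "word" character in the default C locale. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Sketch of the \b test at offset pos (0..len); \B is its negation. */
int at_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev, cur;

    assert(subject != NULL && pos <= len);
    /* Positions outside the string count as non-word. */
    prev = pos > 0   && is_word_char((unsigned char)subject[pos - 1]);
    cur  = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev != cur;
}
```

For the subject "ab cd" the sketch reports boundaries at offsets 0, 2, 3, and 5, and none at offsets 1 and 4.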
886    
887     The \A, \Z, and \z assertions differ from the traditional
888     circumflex and dollar (described below) in that they only
889     ever match at the very start and end of the subject string,
890     whatever options are set. They are not affected by the
891     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
892     ment of pcre_exec() is non-zero, \A can never match. The
893     difference between \Z and \z is that \Z matches before a
894     newline that is the last character of the string as well as
895     at the end of the string, whereas \z matches only at the
896     end.
897    
898    
899    
900     CIRCUMFLEX AND DOLLAR
901     Outside a character class, in the default matching mode, the
902     circumflex character is an assertion which is true only if
903     the current matching point is at the start of the subject
904     string. If the startoffset argument of pcre_exec() is non-
905     zero, circumflex can never match. Inside a character class,
906     circumflex has an entirely different meaning (see below).
907    
908     Circumflex need not be the first character of the pattern if
909     a number of alternatives are involved, but it should be the
910     first thing in each alternative in which it appears if the
911     pattern is ever to match that branch. If all possible alter-
912     natives start with a circumflex, that is, if the pattern is
913     constrained to match only at the start of the subject, it is
914     said to be an "anchored" pattern. (There are also other con-
915     structs that can cause a pattern to be anchored.)
916    
917     A dollar character is an assertion which is true only if the
918     current matching point is at the end of the subject string,
919     or immediately before a newline character that is the last
920     character in the string (by default). Dollar need not be the
921     last character of the pattern if a number of alternatives
922     are involved, but it should be the last item in any branch
923     in which it appears. Dollar has no special meaning in a
924     character class.
925    
926     The meaning of dollar can be changed so that it matches only
927     at the very end of the string, by setting the
928     PCRE_DOLLAR_ENDONLY option at compile or matching time. This
929     does not affect the \Z assertion.
930    
931     The meanings of the circumflex and dollar characters are
932     changed if the PCRE_MULTILINE option is set. When this is
933     the case, they match immediately after and immediately
934     before an internal "\n" character, respectively, in addition
935     to matching at the start and end of the subject string. For
936     example, the pattern /^abc$/ matches the subject string
937     "def\nabc" in multiline mode, but not otherwise. Conse-
938     quently, patterns that are anchored in single line mode
939     because all branches start with "^" are not anchored in mul-
940     tiline mode, and a match for circumflex is possible when the
941     startoffset argument of pcre_exec() is non-zero. The
942     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
943     set.
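
The way the options change the meaning of dollar can be summarized in a small decision function. This is a sketch of the behaviour described above; the flag names SK_MULTILINE and SK_DOLLAR_ENDONLY and the function name are hypothetical stand-ins, not PCRE's constants or code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags for this sketch only; they are not PCRE's constants. */
enum { SK_MULTILINE = 1, SK_DOLLAR_ENDONLY = 2 };

/* Does "$" succeed at offset pos (0..len) of subject under the given flags? */
int dollar_matches(const char *subject, size_t len, size_t pos, int flags)
{
    assert(subject != NULL && pos <= len);

    if (pos == len)
        return 1;                 /* very end of the subject */
    if (subject[pos] != '\n')
        return 0;                 /* otherwise only before a newline */
    if (flags & SK_MULTILINE)
        return 1;                 /* any newline; ENDONLY is ignored */
    if (flags & SK_DOLLAR_ENDONLY)
        return 0;                 /* only the very end qualifies */
    return pos + 1 == len;        /* newline that ends the subject */
}
```

For the subject "abc\n", the sketch accepts offset 3 by default but rejects it when only SK_DOLLAR_ENDONLY is set; offset 4 is accepted in every mode.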
944    
945     Note that the sequences \A, \Z, and \z can be used to match
946     the start and end of the subject in both modes, and if all
947     branches of a pattern start with \A it is always anchored,
948     whether PCRE_MULTILINE is set or not.
949    
950    
951    
952     FULL STOP (PERIOD, DOT)
953     Outside a character class, a dot in the pattern matches any
954     one character in the subject, including a non-printing char-
955     acter, but not (by default) newline. If the PCRE_DOTALL
956     option is set, then dots match newlines as well. The han-
957     dling of dot is entirely independent of the handling of cir-
958     cumflex and dollar, the only relationship being that they
959     both involve newline characters. Dot has no special meaning
960     in a character class.
961    
962    
963    
964     SQUARE BRACKETS
965     An opening square bracket introduces a character class, ter-
966     minated by a closing square bracket. A closing square
967     bracket on its own is not special. If a closing square
968     bracket is required as a member of the class, it should be
969     the first data character in the class (after an initial cir-
970     cumflex, if present) or escaped with a backslash.
971    
972     A character class matches a single character in the subject;
973     the character must be in the set of characters defined by
974     the class, unless the first character in the class is a cir-
975     cumflex, in which case the subject character must not be in
976     the set defined by the class. If a circumflex is actually
977     required as a member of the class, ensure it is not the
978     first character, or escape it with a backslash.
979    
980     For example, the character class [aeiou] matches any lower
981     case vowel, while [^aeiou] matches any character that is not
982     a lower case vowel. Note that a circumflex is just a con-
983     venient notation for specifying the characters which are in
984     the class by enumerating those that are not. It is not an
985     assertion: it still consumes a character from the subject
986     string, and fails if the current pointer is at the end of
987     the string.
988    
989     When caseless matching is set, any letters in a class
990     represent both their upper case and lower case versions, so
991     for example, a caseless [aeiou] matches "A" as well as "a",
992     and a caseless [^aeiou] does not match "A", whereas a case-
993     ful version would.
994    
995     The newline character is never treated in any special way in
996     character classes, whatever the setting of the PCRE_DOTALL
997     or PCRE_MULTILINE options is. A class such as [^a] will
998     always match a newline.
999    
1000     The minus (hyphen) character can be used to specify a range
1001     of characters in a character class. For example, [d-m]
1002     matches any letter between d and m, inclusive. If a minus
1003     character is required in a class, it must be escaped with a
1004     backslash or appear in a position where it cannot be inter-
1005     preted as indicating a range, typically as the first or last
1006     character in the class.
1007    
1008     It is not possible to have the literal character "]" as the
1009     end character of a range. A pattern such as [W-]46] is
1010     interpreted as a class of two characters ("W" and "-") fol-
1011     lowed by a literal string "46]", so it would match "W46]" or
1012     "-46]". However, if the "]" is escaped with a backslash it
1013     is interpreted as the end of range, so [W-\]46] is inter-
1014     preted as a single class containing a range followed by two
1015     separate characters. The octal or hexadecimal representation
1016     of "]" can also be used to end a range.
1017    
1018     Ranges operate in ASCII collating sequence. They can also be
1019     used for characters specified numerically, for example
1020     [\000-\037]. If a range that includes letters is used when
1021     caseless matching is set, it matches the letters in either
1022     case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
1023     matched caselessly, and if character tables for the "fr"
1024     locale are in use, [\xc8-\xcb] matches accented E characters
1025     in both cases.
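
The effect of caseless matching on a range can be sketched as below, assuming ASCII and the default C locale. The function name in_class_range is hypothetical; PCRE itself is not claimed to work this way internally.

```c
#include <assert.h>
#include <ctype.h>

/* Sketch: does c fall in the class range lo-hi (ASCII order)?
   With caseless set, a letter matches if either of its cases is in range. */
int in_class_range(unsigned char lo, unsigned char hi, unsigned char c,
                   int caseless)
{
    assert(lo <= hi);
    if (c >= lo && c <= hi)
        return 1;
    if (caseless && isalpha(c)) {
        unsigned char other =
            (unsigned char)(isupper(c) ? tolower(c) : toupper(c));
        return other >= lo && other <= hi;
    }
    return 0;
}
```

For the range W-c, a caseless test accepts "x" (because "X" is in the range) and "A" (because "a" is), but not "d" or "v", which agrees with the equivalent class given above.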
1026    
1027     The character types \d, \D, \s, \S, \w, and \W may also
1028     appear in a character class, and add the characters that
1029     they match to the class. For example, [\dABCDEF] matches any
1030     hexadecimal digit. A circumflex can conveniently be used
1031     with the upper case character types to specify a more res-
1032     tricted set of characters than the matching lower case type.
1033     For example, the class [^\W_] matches any letter or digit,
1034     but not underscore.
1035    
1036     All non-alphameric characters other than \, -, ^ (at the
1037     start) and the terminating ] are non-special in character
1038     classes, but it does no harm if they are escaped.
1039    
1040    
1041    
1042     VERTICAL BAR
1043     Vertical bar characters are used to separate alternative
1044     patterns. For example, the pattern
1045    
1046     gilbert|sullivan
1047    
1048     matches either "gilbert" or "sullivan". Any number of alter-
1049     natives may appear, and an empty alternative is permitted
1050     (matching the empty string). The matching process tries
1051     each alternative in turn, from left to right, and the first
1052     one that succeeds is used. If the alternatives are within a
1053     subpattern (defined below), "succeeds" means matching the
1054     rest of the main pattern as well as the alternative in the
1055     subpattern.
1056    
1057    
1058    
1059     INTERNAL OPTION SETTING
1060     The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1061     and PCRE_EXTENDED can be changed from within the pattern by
1062     a sequence of Perl option letters enclosed between "(?" and
1063     ")". The option letters are
1064    
1065     i for PCRE_CASELESS
1066     m for PCRE_MULTILINE
1067     s for PCRE_DOTALL
1068     x for PCRE_EXTENDED
1069    
1070     For example, (?im) sets caseless, multiline matching. It is
1071     also possible to unset these options by preceding the letter
1072     with a hyphen, and a combined setting and unsetting such as
1073     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1074     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1075     If a letter appears both before and after the hyphen, the
1076     option is unset.
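
The setting and unsetting rules can be sketched as a function over the letters that appear between "(?" and ")". The OPT_ bit values and the function name are hypothetical and chosen for this example only; the real option constants are defined in pcre.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not the values from pcre.h. */
enum { OPT_I = 1, OPT_M = 2, OPT_S = 4, OPT_X = 8 };

/* Apply the option letters to a set of flags.
   Returns the new flags, or -1 if an unknown letter is seen. */
int apply_option_letters(const char *letters, int flags)
{
    int set = 0, unset = 0, negate = 0;

    assert(letters != NULL);
    for (; *letters != '\0'; letters++) {
        int bit;
        switch (*letters) {
        case '-': negate = 1; continue;   /* later letters are unset */
        case 'i': bit = OPT_I; break;
        case 'm': bit = OPT_M; break;
        case 's': bit = OPT_S; break;
        case 'x': bit = OPT_X; break;
        default:  return -1;
        }
        if (negate) unset |= bit; else set |= bit;
    }
    /* Unsetting is applied last, so a letter on both sides ends up unset. */
    return (flags | set) & ~unset;
}
```

Applying "im-sx" to flags that have OPT_S and OPT_X set yields OPT_I and OPT_M only, and "i-i" leaves OPT_I unset whatever its previous state.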
1077    
1078     The scope of these option changes depends on where in the
1079     pattern the setting occurs. For settings that are outside
1080     any subpattern (defined below), the effect is the same as if
1081     the options were set or unset at the start of matching. The
1082     following patterns all behave in exactly the same way:
1083    
1084     (?i)abc
1085     a(?i)bc
1086     ab(?i)c
1087     abc(?i)
1088    
1089     which in turn is the same as compiling the pattern abc with
1090     PCRE_CASELESS set. In other words, such "top level" set-
1091     tings apply to the whole pattern (unless there are other
1092     changes inside subpatterns). If there is more than one set-
1093     ting of the same option at top level, the rightmost setting
1094     is used.
1095    
1096     If an option change occurs inside a subpattern, the effect
1097     is different. This is a change of behaviour in Perl 5.005.
1098     An option change inside a subpattern affects only that part
1099     of the subpattern that follows it, so
1100    
1101     (a(?i)b)c
1102    
1103     matches abc and aBc and no other strings (assuming
1104     PCRE_CASELESS is not used). By this means, options can be
1105     made to have different settings in different parts of the
1106     pattern. Any changes made in one alternative do carry on
1107     into subsequent branches within the same subpattern. For
1108     example,
1109    
1110     (a(?i)b|c)
1111    
1112     matches "ab", "aB", "c", and "C", even though when matching
1113     "C" the first branch is abandoned before the option setting.
1114     This is because the effects of option settings happen at
1115     compile time. There would be some very weird behaviour oth-
1116     erwise.
1117    
1118     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1119     be changed in the same way as the Perl-compatible options by
1120     using the characters U and X respectively. The (?X) flag
1121     setting is special in that it must always occur earlier in
1122     the pattern than any of the additional features it turns on,
1123     even when it is at top level. It is best put at the start.
1124    
1125    
1126    
1127     SUBPATTERNS
1128     Subpatterns are delimited by parentheses (round brackets),
1129     which can be nested. Marking part of a pattern as a subpat-
1130     tern does two things:
1131    
1132     1. It localizes a set of alternatives. For example, the pat-
1133     tern
1134    
1135     cat(aract|erpillar|)
1136    
1137     matches one of the words "cat", "cataract", or "caterpil-
1138     lar". Without the parentheses, it would match "cataract",
1139     "erpillar" or the empty string.
1140    
1141     2. It sets up the subpattern as a capturing subpattern (as
1142     defined above). When the whole pattern matches, that por-
1143     tion of the subject string that matched the subpattern is
1144     passed back to the caller via the ovector argument of
1145     pcre_exec(). Opening parentheses are counted from left to
1146     right (starting from 1) to obtain the numbers of the captur-
1147     ing subpatterns.
1148    
1149     For example, if the string "the red king" is matched against
1150     the pattern
1151    
1152     the ((red|white) (king|queen))
1153    
1154     the captured substrings are "red king", "red", and "king",
1155     and are numbered 1, 2, and 3.
1156    
1157     The fact that plain parentheses fulfil two functions is not
1158     always helpful. There are often times when a grouping sub-
1159     pattern is required without a capturing requirement. If an
1160     opening parenthesis is followed by "?:", the subpattern does
1161     not do any capturing, and is not counted when computing the
1162     number of any subsequent capturing subpatterns. For example,
1163     if the string "the white queen" is matched against the pat-
1164     tern
1165    
1166     the ((?:red|white) (king|queen))
1167    
1168     the captured substrings are "white queen" and "queen", and
1169     are numbered 1 and 2. The maximum number of captured sub-
1170     strings is 99, and the maximum number of all subpatterns,
1171     both capturing and non-capturing, is 200.
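
The numbering rule (count opening parentheses from left to right, skipping those followed by "?") can be sketched as below. The function name is hypothetical, and the scan is deliberately simplified: it skips escaped characters and character classes but does not handle comments in PCRE_EXTENDED patterns.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: count capturing subpatterns, i.e. "(" not followed by "?". */
int count_capturing(const char *pattern)
{
    int count = 0;
    size_t i = 0;

    assert(pattern != NULL);
    while (pattern[i] != '\0') {
        if (pattern[i] == '\\' && pattern[i + 1] != '\0') {
            i += 2;                       /* escaped character */
        } else if (pattern[i] == '[') {
            i++;
            if (pattern[i] == '^') i++;
            if (pattern[i] == ']') i++;   /* leading "]" is a literal member */
            while (pattern[i] != '\0' && pattern[i] != ']') {
                if (pattern[i] == '\\' && pattern[i + 1] != '\0') i++;
                i++;
            }
            if (pattern[i] == ']') i++;
        } else {
            if (pattern[i] == '(' && pattern[i + 1] != '?')
                count++;
            i++;
        }
    }
    return count;
}
```

The sketch returns 3 for "the ((red|white) (king|queen))" and 2 for "the ((?:red|white) (king|queen))", matching the numbering in the two examples above.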
1172    
1173     As a convenient shorthand, if any option settings are
1174     required at the start of a non-capturing subpattern, the
1175     option letters may appear between the "?" and the ":". Thus
1176     the two patterns
1177    
1178     (?i:saturday|sunday)
1179     (?:(?i)saturday|sunday)
1180    
1181     match exactly the same set of strings. Because alternative
1182     branches are tried from left to right, and options are not
1183     reset until the end of the subpattern is reached, an option
1184     setting in one branch does affect subsequent branches, so
1185     the above patterns match "SUNDAY" as well as "Saturday".
1186    
1187    
1188    
1189     REPETITION
1190     Repetition is specified by quantifiers, which can follow any
1191     of the following items:
1192    
1193     a single character, possibly escaped
1194     the . metacharacter
1195     a character class
1196     a back reference (see next section)
1197     a parenthesized subpattern (unless it is an assertion -
1198     see below)
1199    
1200     The general repetition quantifier specifies a minimum and
1201     maximum number of permitted matches, by giving the two
1202     numbers in curly brackets (braces), separated by a comma.
1203     The numbers must be less than 65536, and the first must be
1204     less than or equal to the second. For example:
1205    
1206     z{2,4}
1207    
1208     matches "zz", "zzz", or "zzzz". A closing brace on its own
1209     is not a special character. If the second number is omitted,
1210     but the comma is present, there is no upper limit; if the
1211     second number and the comma are both omitted, the quantifier
1212     specifies an exact number of required matches. Thus
1213    
1214     [aeiou]{3,}
1215    
1216     matches at least 3 successive vowels, but may match many
1217     more, while
1218    
1219     \d{8}
1220    
1221     matches exactly 8 digits. An opening curly bracket that
1222     appears in a position where a quantifier is not allowed, or
1223     one that does not match the syntax of a quantifier, is taken
1224     as a literal character. For example, {,6} is not a quantif-
1225     ier, but a literal string of four characters.
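
The brace syntax can be sketched as a small parser. The struct and function names are hypothetical. The sketch reports text that is not a quantifier with a length of 0, and numbers that break the stated limits with a length of -1; how PCRE itself reports the latter is not shown here.

```c
#include <assert.h>
#include <stddef.h>

struct quantifier {
    int length;   /* characters used; 0: not a quantifier; -1: bad numbers */
    int min;      /* minimum repeat count */
    int max;      /* maximum repeat count, or -1 for no upper limit */
};

struct quantifier parse_quantifier(const char *p)
{
    struct quantifier q = { 0, 0, 0 };
    size_t i = 1;
    long lo = 0, hi = -1;

    assert(p != NULL);
    if (p[0] != '{' || p[1] < '0' || p[1] > '9')
        return q;                          /* needs "{" followed by a digit */

    for (; p[i] >= '0' && p[i] <= '9'; i++)
        if (lo < 65536) lo = lo * 10 + (p[i] - '0');

    if (p[i] == '}') {
        hi = lo;                           /* {n}: exact count */
    } else if (p[i] == ',') {
        i++;
        if (p[i] != '}') {                 /* {n,m}; a bare "}" means {n,} */
            if (p[i] < '0' || p[i] > '9')
                return q;
            for (hi = 0; p[i] >= '0' && p[i] <= '9'; i++)
                if (hi < 65536) hi = hi * 10 + (p[i] - '0');
            if (p[i] != '}')
                return q;
        }
    } else {
        return q;
    }

    if (lo >= 65536 || hi >= 65536 || (hi >= 0 && lo > hi)) {
        q.length = -1;                     /* violates the stated limits */
        return q;
    }
    q.length = (int)(i + 1);
    q.min = (int)lo;
    q.max = (int)hi;
    return q;
}
```

With this sketch "{2,4}" parses as min 2 and max 4, "{3,}" has no upper limit, "{8}" is an exact count, and "{,6}" is reported as not being a quantifier.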
1226    
1227     The quantifier {0} is permitted, causing the expression to
1228     behave as if the previous item and the quantifier were not
1229     present.
1230    
1231     For convenience (and historical compatibility) the three
1232     most common quantifiers have single-character abbreviations:
1233    
1234     * is equivalent to {0,}
1235     + is equivalent to {1,}
1236     ? is equivalent to {0,1}
1237    
1238     It is possible to construct infinite loops by following a
1239     subpattern that can match no characters with a quantifier
1240     that has no upper limit, for example:
1241    
1242     (a?)*
1243    
1244     Earlier versions of Perl and PCRE used to give an error at
1245     compile time for such patterns. However, because there are
1246     cases where this can be useful, such patterns are now
1247     accepted, but if any repetition of the subpattern does in
1248     fact match no characters, the loop is forcibly broken.
1249    
1250     By default, the quantifiers are "greedy", that is, they
1251     match as much as possible (up to the maximum number of per-
1252     mitted times), without causing the rest of the pattern to
1253     fail. The classic example of where this gives problems is in
1254     trying to match comments in C programs. These appear between
1255     the sequences /* and */ and within the sequence, individual
1256     * and / characters may appear. An attempt to match C com-
1257     ments by applying the pattern
1258    
1259     /\*.*\*/
1260    
1261     to the string
1262    
1263     /* first comment */ not comment /* second comment */
1264    
1265     fails, because it matches the entire string due to the
1266     greediness of the .* item.
1267    
1268     However, if a quantifier is followed by a question mark,
1269     then it ceases to be greedy, and instead matches the minimum
1270     number of times possible, so the pattern
1271    
1272     /\*.*?\*/
1273    
1274     does the right thing with the C comments. The meaning of the
1275     various quantifiers is not otherwise changed, just the pre-
1276     ferred number of matches. Do not confuse this use of ques-
1277     tion mark with its use as a quantifier in its own right.
1278     Because it has two uses, it can sometimes appear doubled, as
1279     in
1280    
1281     \d??\d
1282    
1283     which matches one digit by preference, but can match two if
1284     that is the only way the rest of the pattern matches.
1285    
1286     If the PCRE_UNGREEDY option is set (an option which is not
1287     available in Perl) then the quantifiers are not greedy by
1288     default, but individual ones can be made greedy by following
1289     them with a question mark. In other words, it inverts the
1290     default behaviour.
1291    
1292     When a parenthesized subpattern is quantified with a minimum
1293     repeat count that is greater than 1 or with a limited max-
1294     imum, more store is required for the compiled pattern, in
1295     proportion to the size of the minimum or maximum.
1296    
1297     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1298     option (equivalent to Perl's /s) is set, thus allowing the .
1299     to match newlines, then the pattern is implicitly anchored,
1300     because whatever follows will be tried against every charac-
1301     ter position in the subject string, so there is no point in
1302     retrying the overall match at any position after the first.
1303     PCRE treats such a pattern as though it were preceded by \A.
1304     In cases where it is known that the subject string contains
1305     no newlines, it is worth setting PCRE_DOTALL when the pat-
1306     tern begins with .* in order to obtain this optimization, or
1307     alternatively using ^ to indicate anchoring explicitly.
1308    
1309     When a capturing subpattern is repeated, the value captured
1310     is the substring that matched the final iteration. For
1311     example, after
1312    
1313     (tweedle[dume]{3}\s*)+
1314    
1315     has matched "tweedledum tweedledee" the value of the cap-
1316     tured substring is "tweedledee". However, if there are
1317     nested capturing subpatterns, the corresponding captured
1318     values may have been set in previous iterations. For exam-
1319     ple, after
1320    
1321     /(a|(b))+/
1322    
1323     matches "aba" the value of the second captured substring is
1324     "b".
1325    
1326    
1327    
1328     BACK REFERENCES
1329     Outside a character class, a backslash followed by a digit
1330     greater than 0 (and possibly further digits) is a back
1331     reference to a capturing subpattern earlier (i.e. to its
1332     left) in the pattern, provided there have been that many
1333     previous capturing left parentheses.
1334    
1335     However, if the decimal number following the backslash is
1336     less than 10, it is always taken as a back reference, and
1337     causes an error only if there are not that many capturing
1338     left parentheses in the entire pattern. In other words, the
1339     parentheses that are referenced need not be to the left of
1340     the reference for numbers less than 10. See the section
1341     entitled "Backslash" above for further details of the han-
1342     dling of digits following a backslash.
1343    
1344     A back reference matches whatever actually matched the cap-
1345     turing subpattern in the current subject string, rather than
1346     anything matching the subpattern itself. So the pattern
1347    
1348     (sens|respons)e and \1ibility
1349    
1350     matches "sense and sensibility" and "response and responsi-
1351     bility", but not "sense and responsibility". If caseful
1352     matching is in force at the time of the back reference, then
1353     the case of letters is relevant. For example,
1354    
1355     ((?i)rah)\s+\1
1356    
1357     matches "rah rah" and "RAH RAH", but not "RAH rah", even
1358     though the original capturing subpattern is matched case-
1359     lessly.
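
A brief illustration of both examples, again assuming Python's re module in place of PCRE. Python does not accept (?i) in the middle of a pattern, so the sketch spells the same idea with the scoped form (?i:rah); the variable names are illustrative only.

```python
import re

# \1 must repeat the text that group 1 actually matched in this subject.
austen = re.compile(r"(sens|respons)e and \1ibility")
for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, "->", bool(austen.fullmatch(subject)))

# The group is matched caselessly, but the back reference sits outside
# the scoped option, so it compares letters casefully.
cheer = re.compile(r"((?i:rah))\s+\1")
for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", bool(cheer.fullmatch(subject)))
```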
1360    
1361     There may be more than one back reference to the same sub-
1362     pattern. If a subpattern has not actually been used in a
1363     particular match, then any back references to it always
1364     fail. For example, the pattern
1365    
1366     (a|(bc))\2
1367    
1368     always fails if it starts to match "a" rather than "bc".
1369     Because there may be up to 99 back references, all digits
1370     following the backslash are taken as part of a potential
1371     back reference number. If the pattern continues with a digit
1372     character, then some delimiter must be used to terminate the
1373     back reference. If the PCRE_EXTENDED option is set, this can
1374     be whitespace. Otherwise an empty comment can be used.
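
The unset-group rule and the two delimiter techniques can be tried out as follows. This is a sketch under the assumption that Python's re module behaves like PCRE here; re.VERBOSE plays the part of PCRE_EXTENDED, and the variable names are made up for the example.

```python
import re

# Group 2 is unset when the "a" branch is taken, so \2 cannot match.
unset = re.compile(r"(a|(bc))\2")
print(unset.match("aa"))                # None
print(unset.match("bcbc") is not None)  # True

# An empty comment ends the back reference before a literal digit 1.
with_comment = re.compile(r"(a)\1(?#)1")
print(with_comment.fullmatch("aa1") is not None)

# With extended syntax, insignificant whitespace does the same job.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)
print(with_space.fullmatch("aa1") is not None)
```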
1375    
1376     A back reference that occurs inside the parentheses to which
1377     it refers fails when the subpattern is first used, so, for
1378     example, (a\1) never matches. However, such references can
1379     be useful inside repeated subpatterns. For example, the pat-
1380     tern
1381    
1382     (a|b\1)+
1383    
1384     matches any number of "a"s and also "aba", "ababbaa" etc. At
1385     each iteration of the subpattern, the back reference matches
1386     the character string corresponding to the previous itera-
1387     tion. In order for this to work, the pattern must be such
1388     that the first iteration does not need to match the back
1389     reference. This can be done using alternation, as in the
1390     example above, or by a quantifier with a minimum of zero.
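
Python's re module rejects a back reference to a group that is still open, so the iteration rule is modelled by hand below. This is a minimal sketch, not PCRE's algorithm: the function name matches_self_reference is hypothetical, and it tests whether (a|b\1)+ can account for the whole subject.

```python
def matches_self_reference(subject: str) -> bool:
    r"""Model (a|b\1)+ applied to the whole of subject."""
    def step(pos, prev):
        # prev is what the previous iteration captured, or None if
        # no iteration has completed yet.
        if pos == len(subject) and prev is not None:
            return True
        # Branch 1: a literal "a".
        if subject.startswith("a", pos) and step(pos + 1, "a"):
            return True
        # Branch 2: "b" followed by the previous capture. It cannot
        # be taken on the first iteration, when the group is unset.
        if prev is not None and subject.startswith("b" + prev, pos):
            return step(pos + 1 + len(prev), "b" + prev)
        return False
    return step(0, None)

for s in ("aaa", "aba", "ababbaa", "ba"):
    print(s, "->", matches_self_reference(s))
```

In "ababbaa" the successive iterations capture "a", "ba", "bba" and "a".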
1391    
1392    
1393    
1394     ASSERTIONS
1395     An assertion is a test on the characters following or
1396     preceding the current matching point that does not actually
1397     consume any characters. The simple assertions coded as \b,
1398     \B, \A, \Z, \z, ^ and $ are described above. More compli-
1399     cated assertions are coded as subpatterns. There are two
1400     kinds: those that look ahead of the current position in the
1401     subject string, and those that look behind it.
1402    
1403     An assertion subpattern is matched in the normal way, except
1404     that it does not cause the current matching position to be
1405     changed. Lookahead assertions start with (?= for positive
1406     assertions and (?! for negative assertions. For example,
1407    
1408     \w+(?=;)
1409    
1410     matches a word followed by a semicolon, but does not include
1411     the semicolon in the match, and
1412    
1413     foo(?!bar)
1414    
1415     matches any occurrence of "foo" that is not followed by
1416     "bar". Note that the apparently similar pattern
1417    
1418     (?!foo)bar
1419    
1420     does not find an occurrence of "bar" that is preceded by
1421     something other than "foo"; it finds any occurrence of "bar"
1422     whatsoever, because the assertion (?!foo) is always true
1423     when the next three characters are "bar". A lookbehind
1424     assertion is needed to achieve this effect.
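
The three lookahead patterns can be exercised like this. The sketch assumes Python's re module, whose lookahead syntax matches the one described here; the subject strings are invented for the example.

```python
import re

# The semicolon is tested but not included in the match.
word = re.search(r"\w+(?=;)", "int x;")
print(word.group())            # x

# Only the "foo" that is not followed by "bar" is found.
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(starts)                  # [7]

# (?!foo) is trivially true where "bar" begins, so "bar" is found
# even though it is preceded by "foo".
bar = re.search(r"(?!foo)bar", "foobar")
print(bar.span())              # (3, 6)
```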
1425    
1426     Lookbehind assertions start with (?<= for positive asser-
1427     tions and (?<! for negative assertions. For example,
1428    
1429     (?<!foo)bar
1430    
1431     does find an occurrence of "bar" that is not preceded by
1432     "foo". The contents of a lookbehind assertion are restricted
1433     such that all the strings it matches must have a fixed
1434     length. However, if there are several alternatives, they do
1435     not all have to have the same fixed length. Thus
1436    
1437     (?<=bullock|donkey)
1438    
1439     is permitted, but
1440    
1441     (?<!dogs?|cats?)
1442    
1443     causes an error at compile time. Branches that match dif-
1444     ferent length strings are permitted only at the top level of
1445     a lookbehind assertion. This is an extension compared with
1446     Perl 5.005, which requires all branches to match the same
1447     length of string. An assertion such as
1448    
1449     (?<=ab(c|de))
1450    
1451     is not permitted, because its single top-level branch can
1452     match two different lengths, but it is acceptable if rewrit-
1453     ten to use two top-level branches:
1454    
1455     (?<=abc|abde)
1456    
1457     The implementation of lookbehind assertions is, for each
1458     alternative, to temporarily move the current position back
1459     by the fixed width and then try to match. If there are
1460     insufficient characters before the current position, the
1461     match is deemed to fail. Lookbehinds in conjunction with
1462     once-only subpatterns can be particularly useful for match-
1463     ing at the ends of strings; an example is given at the end
1464     of the section on once-only subpatterns.
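
The implementation strategy just described can be modelled in a few lines. This is a sketch only, restricted to alternatives that are literal strings; the function name lookbehind_matches is hypothetical and the code is not taken from PCRE. (Python's own re module is not used, because it requires every alternative in a lookbehind to have the same width.)

```python
def lookbehind_matches(subject, pos, alternatives):
    """Model a positive lookbehind whose alternatives are literal strings."""
    for alt in alternatives:
        start = pos - len(alt)      # move back by this branch's fixed width
        if start < 0:
            continue                # too few characters: this branch fails
        if subject.startswith(alt, start):
            return True             # the branch matched, ending exactly at pos
    return False

# Models (?<=bullock|donkey) tested at the end of each subject.
print(lookbehind_matches("a donkey", 8, ["bullock", "donkey"]))   # True
print(lookbehind_matches("key", 3, ["bullock", "donkey"]))        # False
```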
1465    
1466     Several assertions (of any sort) may occur in succession.
1467     For example,
1468    
1469     (?<=\d{3})(?<!999)foo
1470    
1471     matches "foo" preceded by three digits that are not "999".
1472     Notice that each of the assertions is applied independently
1473     at the same point in the subject string. First there is a
1474     check that the previous three characters are all digits,
1475     then there is a check that the same three characters are not
1476     "999". This pattern does not match "foo" preceded by six
1477     characters, the first three of which are digits and the last
1478     three of which are not "999". For example, it doesn't match
1479     "123abcfoo". A pattern to do that is
1480    
1481     (?<=\d{3}...)(?<!999)foo
1482    
1483     This time the first assertion looks at the preceding six
1484     characters, checking that the first three are digits, and
1485     then the second assertion checks that the preceding three
1486     characters are not "999".
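
Both patterns can be compared side by side. The sketch assumes Python's re module, which accepts these fixed-width lookbehinds unchanged; the variable names are illustrative.

```python
import re

# Both assertions inspect the three characters just before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion now inspects six characters, the second still three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(three_back.search(subject)),
          bool(six_back.search(subject)))
```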
1487    
1488     Assertions can be nested in any combination. For example,
1489    
1490     (?<=(?<!foo)bar)baz
1491    
1492     matches an occurrence of "baz" that is preceded by "bar"
1493     which in turn is not preceded by "foo", while
1494    
1495     (?<=\d{3}(?!999)...)foo
1496    
1497     is another pattern which matches "foo" preceded by three
1498     digits and any three characters that are not "999".
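
A quick check of the two nested forms, assuming Python's re module (which allows assertions inside a lookbehind as long as the lookbehind keeps a fixed width); the subjects are made up for the example.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")
print(bool(nested_behind.search("xxxbarbaz")))   # True
print(bool(nested_behind.search("foobarbaz")))   # False

# A lookahead inside the lookbehind vets the three characters
# that follow the digits.
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")
print(bool(nested_ahead.search("123abcfoo")))    # True
print(bool(nested_ahead.search("123999foo")))    # False
```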
1499    
1500     Assertion subpatterns are not capturing subpatterns, and may
1501     not be repeated, because it makes no sense to assert the
1502     same thing several times. If any kind of assertion contains
1503     capturing subpatterns within it, these are counted for the
1504     purposes of numbering the capturing subpatterns in the whole
1505     pattern. However, substring capturing is carried out only
1506     for positive assertions, because it does not make sense for
1507     negative assertions.
1508    
1509     Assertions count towards the maximum of 200 parenthesized
1510     subpatterns.
1511    
1512    
1513    
1514     ONCE-ONLY SUBPATTERNS
1515     With both maximizing and minimizing repetition, failure of
1516     what follows normally causes the repeated item to be re-
1517     evaluated to see if a different number of repeats allows the
1518     rest of the pattern to match. Sometimes it is useful to
1519     prevent this, either to change the nature of the match, or
1520     to cause it to fail earlier than it otherwise might, when the
1521     author of the pattern knows there is no point in carrying
1522     on.
1523    
1524     Consider, for example, the pattern \d+foo when applied to
1525     the subject line
1526    
1527     123456bar
1528    
1529     After matching all 6 digits and then failing to match "foo",
1530     the normal action of the matcher is to try again with only 5
1531     digits matching the \d+ item, and then with 4, and so on,
1532     before ultimately failing. Once-only subpatterns provide the
1533     means for specifying that once a portion of the pattern has
1534     matched, it is not to be re-evaluated in this way, so the
1535     matcher would give up immediately on failing to match "foo"
1536     the first time. The notation is another kind of special
1537     parenthesis, starting with (?> as in this example:
1538    
1539     (?>\d+)foo
1540    
1541     This kind of parenthesis "locks up" the part of the pattern
1542     it contains once it has matched, and a failure further into
1543     the pattern is prevented from backtracking into it. Back-
1544     tracking past it to previous items, however, works as nor-
1545     mal.
1546    
1547     An alternative description is that a subpattern of this type
1548     matches the string of characters that an identical stan-
1549     dalone pattern would match, if anchored at the current point
1550     in the subject string.
1551    
1552     Once-only subpatterns are not capturing subpatterns. Simple
1553     cases such as the above example can be thought of as a max-
1554     imizing repeat that must swallow everything it can. So,
1555     while both \d+ and \d+? are prepared to adjust the number of
1556     digits they match in order to make the rest of the pattern
1557     match, (?>\d+) can only match an entire sequence of digits.
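
The saving can be made concrete with a toy model. The sketch below is not PCRE: it is a naive backtracking matcher for \d+foo written for this illustration (count_attempts is a hypothetical name), and it counts how often "foo" is tested. A real matcher applies further optimizations, so actual figures differ.

```python
def count_attempts(subject, atomic):
    r"""Count tests for "foo" made by a naive matcher for \d+foo.

    With atomic=True the digits behave like (?>\d+): once matched,
    they are never given back.
    """
    attempts = 0
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end].isdigit():
            end += 1
        if end == start:
            continue                    # \d+ needs at least one digit
        longest = end - start
        # Greedy first; unless atomic, give back one digit at a time.
        lengths = [longest] if atomic else range(longest, 0, -1)
        for n in lengths:
            attempts += 1
            if subject.startswith("foo", start + n):
                return attempts, True
    return attempts, False

print(count_attempts("123456bar", atomic=False))   # (21, False)
print(count_attempts("123456bar", atomic=True))    # (6, False)
```

In this model the once-only form still restarts at each later position in the subject, but at each one it tests "foo" only once.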
1558    
1559     This construction can of course contain arbitrarily compli-
1560     cated subpatterns, and it can be nested.
1561    
1562     Once-only subpatterns can be used in conjunction with look-
1563     behind assertions to specify efficient matching at the end
1564     of the subject string. Consider a simple pattern such as
1565    
1566     abcd$
1567    
1568     when applied to a long string which does not match it.
1569     Because matching proceeds from left to right, PCRE will look
1570     for each "a" in the subject and then see if what follows
1571     matches the rest of the pattern. If the pattern is specified
1572     as
1573    
1574     ^.*abcd$
1575    
1576     then the initial .* matches the entire string at first, but
1577     when this fails, it backtracks to match all but the last
1578     character, then all but the last two characters, and so on.
1579     Once again the search for "a" covers the entire string, from
1580     right to left, so we are no better off. However, if the pat-
1581     tern is written as
1582    
1583     ^(?>.*)(?<=abcd)
1584    
1585     then there can be no backtracking for the .* item; it can
1586     match only the entire string. The subsequent lookbehind
1587     assertion does a single test on the last four characters. If
1588     it fails, the match fails immediately. For long strings,
1589     this approach makes a significant difference to the process-
1590     ing time.
1591    
1592    
1593    
1594     CONDITIONAL SUBPATTERNS
1595     It is possible to cause the matching process to obey a sub-
1596     pattern conditionally or to choose between two alternative
1597     subpatterns, depending on the result of an assertion, or
1598     whether a previous capturing subpattern matched or not. The
1599     two possible forms of conditional subpattern are
1600    
1601     (?(condition)yes-pattern)
1602     (?(condition)yes-pattern|no-pattern)
1603    
1604     If the condition is satisfied, the yes-pattern is used; oth-
1605     erwise the no-pattern (if present) is used. If there are
1606     more than two alternatives in the subpattern, a compile-time
1607     error occurs.
1608    
1609     There are two kinds of condition. If the text between the
1610     parentheses consists of a sequence of digits, then the con-
1611     dition is satisfied if the capturing subpattern of that
1612     number has previously matched. Consider the following pat-
1613     tern, which contains non-significant white space to make it
1614     more readable (assume the PCRE_EXTENDED option) and to
1615     divide it into three parts for ease of discussion:
1616    
1617     ( \( )? [^()]+ (?(1) \) )
1618    
1619     The first part matches an optional opening parenthesis, and
1620     if that character is present, sets it as the first captured
1621     substring. The second part matches one or more characters
1622     that are not parentheses. The third part is a conditional
1623     subpattern that tests whether the first set of parentheses
1624     matched or not. If they did, that is, if the subject started
1625     with an opening parenthesis, the condition is true, and so
1626     the yes-pattern is executed and a closing parenthesis is
1627     required. Otherwise, since no-pattern is not present, the
1628     subpattern matches nothing. In other words, this pattern
1629     matches a sequence of non-parentheses, optionally enclosed
1630     in parentheses.
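
The pattern can be tried as written. The sketch assumes Python's re module, which supports the same (?(1)...) condition, with re.VERBOSE standing in for PCRE_EXTENDED; fullmatch is used so that unbalanced input is rejected rather than partly matched.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    # The closing parenthesis is required only when group 1 matched.
    print(subject, "->", bool(optional_parens.fullmatch(subject)))
```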
1631    
1632     If the condition is not a sequence of digits, it must be an
1633     assertion. This may be a positive or negative lookahead or
1634     lookbehind assertion. Consider this pattern, again contain-
1635     ing non-significant white space, and with the two alterna-
1636     tives on the second line:
1637    
1638     (?(?=[^a-z]*[a-z])
1639     \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1640    
1641     The condition is a positive lookahead assertion that matches
1642     an optional sequence of non-letters followed by a letter. In
1643     other words, it tests for the presence of at least one
1644     letter in the subject. If a letter is found, the subject is
1645     matched against the first alternative; otherwise it is
1646     matched against the second. This pattern matches strings in
1647     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1648     letters and dd are digits.
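
Python's re module has no assertion conditions, so the sketch below emulates the behaviour in ordinary code: test the lookahead first, then commit to one alternative. It assumes the dd-aaa-dd and dd-dd-dd forms named in the text; the names CONDITION, YES, NO and match_date are made up for the example.

```python
import re

CONDITION = re.compile(r"[^a-z]*[a-z]")       # the lookahead's contents
YES = re.compile(r"\d{2}-[a-z]{3}-\d{2}")     # used when a letter lies ahead
NO = re.compile(r"\d{2}-\d{2}-\d{2}")         # used otherwise

def match_date(subject, pos=0):
    """Pick one alternative from the condition, then match only that one."""
    branch = YES if CONDITION.match(subject, pos) else NO
    return branch.match(subject, pos)

for subject in ("12-abc-34", "12-34-56", "12-34-5x"):
    print(subject, "->", bool(match_date(subject)))
```

The last subject fails: the letter makes the condition true, and once the yes-pattern has been chosen the no-pattern is not tried.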
1649    
1650    
1651    
1652     COMMENTS
1653     The sequence (?# marks the start of a comment which contin-
1654     ues up to the next closing parenthesis. Nested parentheses
1655     are not permitted. The characters that make up a comment
1656     play no part in the pattern matching at all.
1657    
1658     If the PCRE_EXTENDED option is set, an unescaped # character
1659     outside a character class introduces a comment that contin-
1660     ues up to the next newline character in the pattern.
1661    
1662    
1663    
1664     PERFORMANCE
1665     Certain items that may appear in patterns are more efficient
1666     than others. It is more efficient to use a character class
1667     like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1668     In general, the simplest construction that provides the
1669     required behaviour is usually the most efficient. Jeffrey
1670     Friedl's book contains a lot of discussion about optimizing
1671     regular expressions for efficient performance.
1672    
1673     When a pattern begins with .* and the PCRE_DOTALL option is
1674     set, the pattern is implicitly anchored by PCRE, since it
1675     can match only at the start of a subject string. However, if
1676     PCRE_DOTALL is not set, PCRE cannot make this optimization,
1677     because the . metacharacter does not then match a newline,
1678     and if the subject string contains newlines, the pattern may
1679     match from the character immediately following one of them
1680     instead of from the very start. For example, the pattern
1681    
1682     (.*) second
1683    
1684     matches the subject "first\nand second" (where \n stands for
1685     a newline character) with the first captured substring being
1686     "and". In order to do this, PCRE has to retry the match
1687     starting after every newline in the subject.
1688    
1689     If you are using such a pattern with subject strings that do
1690     not contain newlines, the best performance is obtained by
1691     setting PCRE_DOTALL, or starting the pattern with ^.* to
1692     indicate explicit anchoring. That saves PCRE from having to
1693     scan along the subject looking for a newline to restart at.
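
The effect of the option on what is matched can be seen directly. The sketch assumes Python's re module, where re.DOTALL corresponds to PCRE_DOTALL; it shows the change in the result, not the internal anchoring optimization.

```python
import re

subject = "first\nand second"

# Without DOTALL the match has to start after the newline.
plain = re.search(r"(.*) second", subject)
print(repr(plain.group(1)), plain.start())        # 'and' 6

# With DOTALL the dot crosses the newline, so the match starts at 0.
dotall = re.search(r"(.*) second", subject, re.DOTALL)
print(repr(dotall.group(1)), dotall.start())      # 'first\nand' 0
```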
1694    
1695     Beware of patterns that contain nested indefinite repeats.
1696     These can take a long time to run when applied to a string
1697     that does not match. Consider the pattern fragment
1698    
1699     (a+)*
1700    
1701     This can match "aaaa" in 16 different ways, and this number
1702     increases very rapidly as the string gets longer. (The *
1703     repeat can match 0, 1, 2, 3, or 4 times, and for each of
1704     those cases other than 0, the + repeats can match different
1705     numbers of times.) When the remainder of the pattern is such
1706     that the entire match is going to fail, PCRE has in princi-
1707     ple to try every possible variation, and this can take an
1708     extremely long time.
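
The growth can be reproduced by enumeration. This is a minimal sketch, assuming that a "way" means a choice of prefix of the subject together with a split of that prefix into non-empty runs of "a", one run per iteration of the * repeat; the function names are made up for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def splits(k):
    """Ways to cut k leading "a"s into non-empty runs (one per + match)."""
    if k == 0:
        return 1                    # the * repeat matched zero times
    # Choose the length of the first run, then split whatever is left.
    return sum(splits(k - first) for first in range(1, k + 1))

def count_ways(n):
    """Ways (a+)* can match some prefix of a string of n "a"s."""
    return sum(splits(k) for k in range(n + 1))

for n in (4, 10, 20):
    print(n, count_ways(n))
```

Under this counting the total doubles with each extra character: 16 for four characters and over a million for twenty.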
1709    
1710     An optimization catches some of the more simple cases such
1711     as
1712    
1713     (a+)*b
1714    
1715     where a literal character follows. Before embarking on the
1716     standard matching procedure, PCRE checks that there is a "b"
1717     later in the subject string, and if there is not, it fails
1718     the match immediately. However, when there is no following
1719     literal this optimization cannot be used. You can see the
1720     difference by comparing the behaviour of
1721    
1722     (a+)*\d
1723    
1724     with the pattern above. When applied to a whole line of "a"
1725     characters, (a+)*b fails almost instantly, whereas (a+)*\d
1726     takes an appreciable time with strings longer than about 20
1727     characters.
1728    
1729    
1730    
1731     AUTHOR
1732     Philip Hazel <ph10@cam.ac.uk>
1733     University Computing Service,
1734     New Museums Site,
1735     Cambridge CB2 3QG, England.
1736     Phone: +44 1223 334714
1737    
1738     Last updated: 29 July 1999
1739     Copyright (c) 1997-1999 University of Cambridge.
