Contents of /code/tags/pcre-2.08a/doc/pcre.txt

Revision 42
Sat Feb 24 21:39:19 2007 UTC by nigel
File MIME type: text/plain
File size: 77499 byte(s)
Tag code/trunk as code/tags/pcre-2.08a.

NAME
     pcre - Perl-compatible regular expressions.

SYNOPSIS
     #include <pcre.h>

     pcre *pcre_compile(const char *pattern, int options,
          const char **errptr, int *erroffset,
          const unsigned char *tableptr);

     pcre_extra *pcre_study(const pcre *code, int options,
          const char **errptr);

     int pcre_exec(const pcre *code, const pcre_extra *extra,
          const char *subject, int length, int startoffset,
          int options, int *ovector, int ovecsize);

     int pcre_copy_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber, char *buffer,
          int buffersize);

     int pcre_get_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber,
          const char **stringptr);

     int pcre_get_substring_list(const char *subject,
          int *ovector, int stringcount, const char ***listptr);

     const unsigned char *pcre_maketables(void);

     int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

     char *pcre_version(void);

     void *(*pcre_malloc)(size_t);

     void (*pcre_free)(void *);

DESCRIPTION
45     The PCRE library is a set of functions that implement regu-
46     lar expression pattern matching using the same syntax and
47     semantics as Perl 5, with just a few differences (see
48     below). The current implementation corresponds to Perl
49     5.005.
51     PCRE has its own native API, which is described in this
52     document. There is also a set of wrapper functions that
53     correspond to the POSIX API. These are described in the
54     pcreposix documentation.
55     The native API function prototypes are defined in the header
56     file pcre.h, and on Unix systems the library itself is
57     called libpcre.a, so can be accessed by adding -lpcre to the
58     command for linking an application which calls it.
60     The functions pcre_compile(), pcre_study(), and pcre_exec()
61     are used for compiling and matching regular expressions,
62     while pcre_copy_substring(), pcre_get_substring(), and
63     pcre_get_substring_list() are convenience functions for
64     extracting captured substrings from a matched subject
65     string. The function pcre_maketables() is used (optionally)
66     to build a set of character tables in the current locale for
67     passing to pcre_compile().
69     The function pcre_info() is used to find out information
70     about a compiled pattern, while the function pcre_version()
71     returns a pointer to a string containing the version of PCRE
72     and its date of release.
74     The global variables pcre_malloc and pcre_free initially
75     contain the entry points of the standard malloc() and free()
76     functions respectively. PCRE calls the memory management
77     functions via these variables, so a calling program can
78     replace them if it wishes to intercept the calls. This
79     should be done before calling any PCRE functions.
MULTI-THREADING
     The PCRE functions can be used in multi-threading applica-
     tions, with the proviso that the memory management functions
86     pointed to by pcre_malloc and pcre_free are shared by all
87     threads.
89     The compiled form of a regular expression is not altered
90     during matching, so the same compiled pattern can safely be
91     used by several threads at once.
COMPILING A PATTERN
96     The function pcre_compile() is called to compile a pattern
97     into an internal form. The pattern is a C string terminated
98     by a binary zero, and is passed in the argument pattern. A
99     pointer to a single block of memory that is obtained via
100     pcre_malloc is returned. This contains the compiled code and
101     related data. The pcre type is defined for this for conveni-
102     ence, but in fact pcre is just a typedef for void, since the
103     contents of the block are not externally defined. It is up
104     to the caller to free the memory when it is no longer
105     required.
107     The size of a compiled pattern is roughly proportional to
108     the length of the pattern string, except that each character
109     class (other than those containing just a single character,
110     negated or not) requires 33 bytes, and repeat quantifiers
111     with a minimum greater than one or a bounded maximum cause
112     the relevant portions of the compiled pattern to be repli-
113     cated.
115     The options argument contains independent bits that affect
116     the compilation. It should be zero if no options are
117     required. Some of the options, in particular, those that are
118     compatible with Perl, can also be set and unset from within
     the pattern (see the detailed description of regular expres-
     sions below). For these options, the contents of the options
     argument specify their initial settings at the start of
122     compilation and execution. The PCRE_ANCHORED option can be
123     set at the time of matching as well as at compile time.
125     If errptr is NULL, pcre_compile() returns NULL immediately.
126     Otherwise, if compilation of a pattern fails, pcre_compile()
127     returns NULL, and sets the variable pointed to by errptr to
128     point to a textual error message. The offset from the start
129     of the pattern to the character where the error was
130     discovered is placed in the variable pointed to by
131     erroffset, which must not be NULL. If it is, an immediate
132     error is given.
134     If the final argument, tableptr, is NULL, PCRE uses a
135     default set of character tables which are built when it is
136     compiled, using the default C locale. Otherwise, tableptr
137     must be the result of a call to pcre_maketables(). See the
138     section on locale support below.
140     The following option bits are defined in the header file:
     PCRE_ANCHORED
144     If this bit is set, the pattern is forced to be "anchored",
145     that is, it is constrained to match only at the start of the
146     string which is being searched (the "subject string"). This
147     effect can also be achieved by appropriate constructs in the
148     pattern itself, which is the only way to do it in Perl.
     PCRE_CASELESS
152     If this bit is set, letters in the pattern match both upper
153     and lower case letters. It is equivalent to Perl's /i
154     option.
     PCRE_DOLLAR_ENDONLY
158     If this bit is set, a dollar metacharacter in the pattern
159     matches only at the end of the subject string. Without this
160     option, a dollar also matches immediately before the final
161     character if it is a newline (but not before any other new-
162     lines). The PCRE_DOLLAR_ENDONLY option is ignored if
163     PCRE_MULTILINE is set. There is no equivalent to this option
164     in Perl.
     PCRE_DOTALL
     If this bit is set, a dot metacharacter in the pattern
169     matches all characters, including newlines. Without it, new-
170     lines are excluded. This option is equivalent to Perl's /s
171     option. A negative class such as [^a] always matches a new-
172     line character, independent of the setting of this option.
     PCRE_EXTENDED
176     If this bit is set, whitespace data characters in the pat-
177     tern are totally ignored except when escaped or inside a
178     character class, and characters between an unescaped # out-
179     side a character class and the next newline character,
180     inclusive, are also ignored. This is equivalent to Perl's /x
181     option, and makes it possible to include comments inside
182     complicated patterns. Note, however, that this applies only
183     to data characters. Whitespace characters may never appear
184     within special character sequences in a pattern, for example
185     within the sequence (?( which introduces a conditional sub-
186     pattern.
188     PCRE_EXTRA
190     This option turns on additional functionality of PCRE that
191     is incompatible with Perl. Any backslash in a pattern that
192     is followed by a letter that has no special meaning causes
193     an error, thus reserving these combinations for future
194     expansion. By default, as in Perl, a backslash followed by a
195     letter with no special meaning is treated as a literal.
196     There are at present no other features controlled by this
197     option.
     PCRE_MULTILINE
201     By default, PCRE treats the subject string as consisting of
202     a single "line" of characters (even if it actually contains
203     several newlines). The "start of line" metacharacter (^)
204     matches only at the start of the string, while the "end of
205     line" metacharacter ($) matches only at the end of the
206     string, or before a terminating newline (unless
207     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
     When PCRE_MULTILINE is set, the "start of line" and "end
210     of line" constructs match immediately following or
211     immediately before any newline in the subject string,
212     respectively, as well as at the very start and end. This is
213     equivalent to Perl's /m option. If there are no "\n" charac-
214     ters in a subject string, or no occurrences of ^ or $ in a
215     pattern, setting PCRE_MULTILINE has no effect.
     PCRE_UNGREEDY
     This option inverts the "greediness" of the quantifiers so
220     that they are not greedy by default, but become greedy if
221     followed by "?". It is not compatible with Perl. It can also
222     be set by a (?U) option setting within the pattern.
STUDYING A PATTERN
227     When a pattern is going to be used several times, it is
228     worth spending more time analyzing it in order to speed up
229     the time taken for matching. The function pcre_study() takes
230     a pointer to a compiled pattern as its first argument, and
231     returns a pointer to a pcre_extra block (another void
232     typedef) containing additional information about the pat-
233     tern; this can be passed to pcre_exec(). If no additional
234     information is available, NULL is returned.
236     The second argument contains option bits. At present, no
237     options are defined for pcre_study(), and this argument
238     should always be zero.
240     The third argument for pcre_study() is a pointer to an error
241     message. If studying succeeds (even if no data is returned),
242     the variable it points to is set to NULL. Otherwise it
243     points to a textual error message.
245     At present, studying a pattern is useful only for non-
246     anchored patterns that do not have a single fixed starting
247     character. A bitmap of possible starting characters is
248     created.
LOCALE SUPPORT
253     PCRE handles caseless matching, and determines whether char-
254     acters are letters, digits, or whatever, by reference to a
255     set of tables. The library contains a default set of tables
256     which is created in the default C locale when PCRE is com-
257     piled. This is used when the final argument of
258     pcre_compile() is NULL, and is sufficient for many applica-
259     tions.
261     An alternative set of tables can, however, be supplied. Such
262     tables are built by calling the pcre_maketables() function,
263     which has no arguments, in the relevant locale. The result
264     can then be passed to pcre_compile() as often as necessary.
265     For example, to build and use tables that are appropriate
266     for the French locale (where accented characters with codes
267     greater than 128 are treated as letters), the following code
268     could be used:
       setlocale(LC_CTYPE, "fr");
       tables = pcre_maketables();
       re = pcre_compile(..., tables);
274     The tables are built in memory that is obtained via
275     pcre_malloc. The pointer that is passed to pcre_compile is
276     saved with the compiled pattern, and the same tables are
277     used via this pointer by pcre_study() and pcre_exec(). Thus
278     for any single pattern, compilation, studying and matching
279     all happen in the same locale, but different patterns can be
280     compiled in different locales. It is the caller's responsi-
281     bility to ensure that the memory containing the tables
282     remains available for as long as it is needed.
INFORMATION ABOUT A PATTERN
287     The pcre_info() function returns information about a com-
288     piled pattern. Its yield is the number of capturing subpat-
289     terns, or one of the following negative numbers:
       PCRE_ERROR_NULL       the argument code was NULL
       PCRE_ERROR_BADMAGIC   the "magic number" was not found
294     If the optptr argument is not NULL, a copy of the options
295     with which the pattern was compiled is placed in the integer
296     it points to. These option bits are those specified in the
297     call to pcre_compile(), modified by any top-level option
298     settings within the pattern itself, and with the
299     PCRE_ANCHORED bit set if the form of the pattern implies
300     that it can match only at the start of a subject string.
302     If the pattern is not anchored and the firstcharptr argument
303     is not NULL, it is used to pass back information about the
304     first character of any matched string. If there is a fixed
305     first character, e.g. from a pattern such as
306     (cat|cow|coyote), then it is returned in the integer pointed
307     to by firstcharptr. Otherwise, if either
309     (a) the pattern was compiled with the PCRE_MULTILINE option,
310     and every branch starts with "^", or
312     (b) every branch of the pattern starts with ".*" and
313     PCRE_DOTALL is not set (if it were set, the pattern would be
314     anchored),
315     then -1 is returned, indicating that the pattern matches
316     only at the start of a subject string or after any "\n"
317     within the string. Otherwise -2 is returned.
MATCHING A PATTERN
322     The function pcre_exec() is called to match a subject string
323     against a pre-compiled pattern, which is passed in the code
324     argument. If the pattern has been studied, the result of the
325     study should be passed in the extra argument. Otherwise this
326     must be NULL.
328     The PCRE_ANCHORED option can be passed in the options argu-
329     ment, whose unused bits must be zero. However, if a pattern
330     was compiled with PCRE_ANCHORED, or turned out to be
331     anchored by virtue of its contents, it cannot be made
     unanchored at matching time.
334     There are also three further options that can be set only at
335     matching time:
     PCRE_NOTBOL
339     The first character of the string is not the beginning of a
340     line, so the circumflex metacharacter should not match
341     before it. Setting this without PCRE_MULTILINE (at compile
342     time) causes circumflex never to match.
     PCRE_NOTEOL
346     The end of the string is not the end of a line, so the dol-
347     lar metacharacter should not match it nor (except in multi-
348     line mode) a newline immediately before it. Setting this
349     without PCRE_MULTILINE (at compile time) causes dollar never
350     to match.
     PCRE_NOTEMPTY
354     An empty string is not considered to be a valid match if
355     this option is set. If there are alternatives in the pat-
356     tern, they are tried. If all the alternatives match the
357     empty string, the entire match fails. For example, if the
358     pattern
360     a?b?
362     is applied to a string not beginning with "a" or "b", it
363     matches the empty string at the start of the subject. With
364     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
365     further into the string for occurrences of "a" or "b".
367     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
368     make a special case of a pattern match of the empty string
369     within its split() function, and when using the /g modifier.
370     It is possible to emulate Perl's behaviour after matching a
371     null string by first trying the match again at the same
372     offset with PCRE_NOTEMPTY set, and then if that fails by
373     advancing the starting offset (see below) and trying an
374     ordinary match again.
376     The subject string is passed as a pointer in subject, a
377     length in length, and a starting offset in startoffset.
378     Unlike the pattern string, it may contain binary zero char-
379     acters. When the starting offset is zero, the search for a
380     match starts at the beginning of the subject, and this is by
381     far the most common case.
383     A non-zero starting offset is useful when searching for
384     another match in the same subject by calling pcre_exec()
385     again after a previous success. Setting startoffset differs
386     from just passing over a shortened string and setting
387     PCRE_NOTBOL in the case of a pattern that begins with any
388     kind of lookbehind. For example, consider the pattern
390     \Biss\B
392     which finds occurrences of "iss" in the middle of words. (\B
393     matches only if the current position in the subject is not a
     word boundary.) When applied to the string "Mississippi" the
     first call to pcre_exec() finds the first occurrence. If
     pcre_exec() is called again with just the remainder of the
     subject, namely "issippi", it does not match, because \B is
398     always false at the start of the subject, which is deemed to
399     be a word boundary. However, if pcre_exec() is passed the
400     entire string again, but with startoffset set to 4, it finds
401     the second occurrence of "iss" because it is able to look
402     behind the starting point to discover that it is preceded by
403     a letter.
405     If a non-zero starting offset is passed when the pattern is
406     anchored, one attempt to match at the given offset is tried.
407     This can only succeed if the pattern does not require the
408     match to be at the start of the subject.
410     In general, a pattern matches a certain portion of the sub-
411     ject, and in addition, further substrings from the subject
412     may be picked out by parts of the pattern. Following the
413     usage in Jeffrey Friedl's book, this is called "capturing"
414     in what follows, and the phrase "capturing subpattern" is
415     used for a fragment of a pattern that picks out a substring.
416     PCRE supports several other kinds of parenthesized subpat-
417     tern that do not cause substrings to be captured.
419     Captured substrings are returned to the caller via a vector
420     of integer offsets whose address is passed in ovector. The
421     number of elements in the vector is passed in ovecsize. The
422     first two-thirds of the vector is used to pass back captured
423     substrings, each substring using a pair of integers. The
424     remaining third of the vector is used as workspace by
425     pcre_exec() while matching capturing subpatterns, and is not
426     available for passing back information. The length passed in
427     ovecsize should always be a multiple of three. If it is not,
428     it is rounded down.
430     When a match has been successful, information about captured
431     substrings is returned in pairs of integers, starting at the
432     beginning of ovector, and continuing up to two-thirds of its
433     length at the most. The first element of a pair is set to
434     the offset of the first character in a substring, and the
435     second is set to the offset of the first character after the
436     end of a substring. The first pair, ovector[0] and ovec-
437     tor[1], identify the portion of the subject string matched
438     by the entire pattern. The next pair is used for the first
439     capturing subpattern, and so on. The value returned by
440     pcre_exec() is the number of pairs that have been set. If
441     there are no capturing subpatterns, the return value from a
442     successful match is 1, indicating that just the first pair
443     of offsets has been set.
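     To illustrate the layout of the offset pairs, the sketch
     below copies one substring out of the subject by hand.
     substring_dup is a hypothetical helper, not one of the PCRE
     convenience functions. It uses malloc() directly, and it
     assumes that n is smaller than the number of pairs that
     pcre_exec() set.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed, zero-terminated copy
   of substring n (0 is the whole match), or NULL if that
   substring is unset (negative offsets) or memory runs out. */
static char *substring_dup(const char *subject, const int *ovector,
                           int n)
{
    int start = ovector[2 * n];     /* offset of first character */
    int end = ovector[2 * n + 1];   /* offset just past the last one */
    size_t len;
    char *copy;

    if (start < 0 || end < start)
        return NULL;
    len = (size_t)(end - start);
    copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, subject + start, len);
        copy[len] = '\0';
    }
    return copy;
}
```

     The caller releases the copy with free().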
445     Some convenience functions are provided for extracting the
446     captured substrings as separate strings. These are described
447     in the following section.
     It is possible for a capturing subpattern number n+1 to
     match some part of the subject when subpattern n has not
     been used at all. For example, if the string "abc" is
     matched against the pattern (a|(z))(bc), subpatterns 1 and 3
     are matched, but 2 is not. When this happens, both offset
     values corresponding to the unused subpattern are set to -1.
456     If a capturing subpattern is matched repeatedly, it is the
457     last portion of the string that it matched that gets
458     returned.
460     If the vector is too small to hold all the captured sub-
461     strings, it is used as far as possible (up to two-thirds of
462     its length), and the function returns a value of zero. In
463     particular, if the substring offsets are not of interest,
464     pcre_exec() may be called with ovector passed as NULL and
465     ovecsize as zero. However, if the pattern contains back
466     references and the ovector isn't big enough to remember the
467     related substrings, PCRE has to get additional memory for
468     use during matching. Thus it is usually advisable to supply
469     an ovector.
471     Note that pcre_info() can be used to find out how many cap-
472     turing subpatterns there are in a compiled pattern. The
473     smallest size for ovector that will allow for n captured
474     substrings in addition to the offsets of the substring
475     matched by the whole pattern is (n+1)*3.
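     The sizing rule can be written out as simple arithmetic.
     Both function names below are hypothetical and exist only
     for illustration.

```c
/* Smallest ovecsize that can return n captured substrings in
   addition to the offsets of the whole match. */
static int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that fit in a vector of ovecsize
   integers. Only the first two-thirds holds pairs, and integer
   division also covers the rounding down of a size that is not
   a multiple of three. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

     For a pattern with two capturing subpatterns, for example,
     ovector_size_for(2) gives 9, and a vector of 10 integers
     still holds only 3 pairs.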
477     If pcre_exec() fails, it returns a negative number. The fol-
478     lowing are defined in the header file:
     PCRE_ERROR_NOMATCH (-1)
482     The subject string did not match the pattern.
484     PCRE_ERROR_NULL (-2)
486     Either code or subject was passed as NULL, or ovector was
487     NULL and ovecsize was not zero.
     PCRE_ERROR_BADOPTION (-3)
491     An unrecognized bit was set in the options argument.
     PCRE_ERROR_BADMAGIC (-4)
495     PCRE stores a 4-byte "magic number" at the start of the com-
496     piled code, to catch the case when it is passed a junk
497     pointer. This is the error it gives when the magic number
498     isn't present.
     PCRE_ERROR_UNKNOWN_NODE (-5)
502     While running the pattern match, an unknown item was encoun-
503     tered in the compiled pattern. This error could be caused by
504     a bug in PCRE or by overwriting of the compiled pattern.
     PCRE_ERROR_NOMEMORY (-6)
508     If a pattern contains back references, but the ovector that
509     is passed to pcre_exec() is not big enough to remember the
510     referenced substrings, PCRE gets a block of memory at the
511     start of matching to use for this purpose. If the call via
512     pcre_malloc() fails, this error is given. The memory is
513     freed at the end of matching.
EXTRACTING CAPTURED SUBSTRINGS
518     Captured substrings can be accessed directly by using the
519     offsets returned by pcre_exec() in ovector. For convenience,
520     the functions pcre_copy_substring(), pcre_get_substring(),
521     and pcre_get_substring_list() are provided for extracting
522     captured substrings as new, separate, zero-terminated
523     strings. A substring that contains a binary zero is
524     correctly extracted and has a further zero added on the end,
525     but the result does not, of course, function as a C string.
527     The first three arguments are the same for all three func-
528     tions: subject is the subject string which has just been
529     successfully matched, ovector is a pointer to the vector of
530     integer offsets that was passed to pcre_exec(), and
531     stringcount is the number of substrings that were captured
532     by the match, including the substring that matched the
533     entire regular expression. This is the value returned by
534     pcre_exec if it is greater than zero. If pcre_exec()
535     returned zero, indicating that it ran out of space in ovec-
536     tor, then the value passed as stringcount should be the size
537     of the vector divided by three.
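     The rule for choosing stringcount might be expressed as
     follows. stringcount_for is a hypothetical helper, and its
     -1 result for a failed match is a convention of this sketch
     only.

```c
/* Hypothetical helper: rc is the value returned by pcre_exec()
   and ovecsize is the vector size that was passed to it.
   Returns the stringcount to hand to the substring functions,
   or -1 if the match failed (rc is negative). */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* all substrings fitted */
    if (rc == 0)
        return ovecsize / 3;    /* vector was filled completely */
    return -1;
}
```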
539     The functions pcre_copy_substring() and pcre_get_substring()
540     extract a single substring, whose number is given as string-
541     number. A value of zero extracts the substring that matched
542     the entire pattern, while higher values extract the captured
543     substrings. For pcre_copy_substring(), the string is placed
544     in buffer, whose length is given by buffersize, while for
545     pcre_get_substring() a new block of store is obtained via
546     pcre_malloc, and its address is returned via stringptr. The
547     yield of the function is the length of the string, not
548     including the terminating zero, or one of
     PCRE_ERROR_NOMEMORY (-6)
552     The buffer was too small for pcre_copy_substring(), or the
553     attempt to get memory failed for pcre_get_substring().
     PCRE_ERROR_NOSUBSTRING (-7)
557     There is no substring whose number is stringnumber.
559     The pcre_get_substring_list() function extracts all avail-
560     able substrings and builds a list of pointers to them. All
561     this is done in a single block of memory which is obtained
562     via pcre_malloc. The address of the memory block is returned
563     via listptr, which is also the start of the list of string
564     pointers. The end of the list is marked by a NULL pointer.
565     The yield of the function is zero if all went well, or
     PCRE_ERROR_NOMEMORY (-6)
569     if the attempt to get the memory block failed.
571     When any of these functions encounter a substring that is
572     unset, which can happen when capturing subpattern number n+1
573     matches some part of the subject, but subpattern n has not
574     been used at all, they return an empty string. This can be
575     distinguished from a genuine zero-length substring by
576     inspecting the appropriate offset in ovector, which is nega-
577     tive for unset substrings.
LIMITATIONS
583     There are some size limitations in PCRE but it is hoped that
584     they will never in practice be relevant. The maximum length
585     of a compiled pattern is 65539 (sic) bytes. All values in
586     repeating quantifiers must be less than 65536. The maximum
587     number of capturing subpatterns is 99. The maximum number
588     of all parenthesized subpatterns, including capturing sub-
589     patterns, assertions, and other types of subpattern, is 200.
591     The maximum length of a subject string is the largest posi-
592     tive number that an integer variable can hold. However, PCRE
593     uses recursion to handle subpatterns and indefinite repeti-
594     tion. This means that the available stack space may limit
595     the size of a subject string that can be processed by cer-
596     tain patterns.
DIFFERENCES FROM PERL
601     The differences described here are with respect to Perl
602     5.005.
604     1. By default, a whitespace character is any character that
605     the C library function isspace() recognizes, though it is
606     possible to compile PCRE with alternative character type
607     tables. Normally isspace() matches space, formfeed, newline,
608     carriage return, horizontal tab, and vertical tab. Perl 5 no
609     longer includes vertical tab in its set of whitespace char-
610     acters. The \v escape that was in the Perl documentation for
611     a long time was never in fact recognized. However, the char-
612     acter itself was treated as whitespace at least up to 5.002.
613     In 5.004 and 5.005 it does not match \s.
615     2. PCRE does not allow repeat quantifiers on lookahead
616     assertions. Perl permits them, but they do not mean what you
617     might think. For example, (?!a){3} does not assert that the
618     next three characters are not "a". It just asserts that the
619     next character is not "a" three times.
621     3. Capturing subpatterns that occur inside negative looka-
622     head assertions are counted, but their entries in the
623     offsets vector are never set. Perl sets its numerical vari-
624     ables from any such patterns that are matched before the
625     assertion fails to match something (thereby succeeding), but
626     only if the negative lookahead assertion contains just one
627     branch.
629     4. Though binary zero characters are supported in the sub-
630     ject string, they are not allowed in a pattern string
631     because it is passed as a normal C string, terminated by
632     zero. The escape sequence "\0" can be used in the pattern to
633     represent a binary zero.
635     5. The following Perl escape sequences are not supported:
636     \l, \u, \L, \U, \E, \Q. In fact these are implemented by
637     Perl's general string-handling and are not part of its pat-
638     tern matching engine.
640     6. The Perl \G assertion is not supported as it is not
641     relevant to single pattern matches.
643     7. Fairly obviously, PCRE does not support the (?{code})
644     construction.
646     8. There are at the time of writing some oddities in Perl
647     5.005_02 concerned with the settings of captured strings
648     when part of a pattern is repeated. For example, matching
649     "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
650     "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
651     unset. However, if the pattern is changed to
652     /^(aa(b(b))?)+$/ then $2 (and $3) get set.
654     In Perl 5.004 $2 is set in both cases, and that is also true
655     of PCRE. If in the future Perl changes to a consistent state
656     that is different, PCRE may change to follow.
658     9. Another as yet unresolved discrepancy is that in Perl
659     5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
660     "a", whereas in PCRE it does not. However, in both Perl and
661     PCRE /^(a)?a/ matched against "a" leaves $1 unset.
663     10. PCRE provides some extensions to the Perl regular
664     expression facilities:
666     (a) Although lookbehind assertions must match fixed length
667     strings, each alternative branch of a lookbehind assertion
668     can match a different length of string. Perl 5.005 requires
669     them all to have the same length.
671     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
     set, the $ meta-character matches only at the very end of
673     the string.
675     (c) If PCRE_EXTRA is set, a backslash followed by a letter
676     with no special meaning is faulted.
678     (d) If PCRE_UNGREEDY is set, the greediness of the
679     repetition quantifiers is inverted, that is, by default they
680     are not greedy, but if followed by a question mark they are.
682     (e) PCRE_ANCHORED can be used to force a pattern to be tried
683     only at the start of the subject.
685     (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
686     for pcre_exec() have no Perl equivalents.
REGULAR EXPRESSION DETAILS
691     The syntax and semantics of the regular expressions sup-
692     ported by PCRE are described below. Regular expressions are
693     also described in the Perl documentation and in a number of
694     other books, some of which have copious examples. Jeffrey
695     Friedl's "Mastering Regular Expressions", published by
696     O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
697     The description here is intended as reference documentation.
699     A regular expression is a pattern that is matched against a
700     subject string from left to right. Most characters stand for
701     themselves in a pattern, and match the corresponding charac-
702     ters in the subject. As a trivial example, the pattern
704     The quick brown fox
706     matches a portion of a subject string that is identical to
707     itself. The power of regular expressions comes from the
708     ability to include alternatives and repetitions in the pat-
709     tern. These are encoded in the pattern by the use of meta-
710     characters, which do not stand for themselves but instead
711     are interpreted in some special way.
713     There are two different sets of meta-characters: those that
714     are recognized anywhere in the pattern except within square
715     brackets, and those that are recognized in square brackets.
716     Outside square brackets, the meta-characters are as follows:
718       \      general escape character with several uses
719       ^      assert start of subject (or line, in multiline
720                mode)
721       $      assert end of subject (or line, in multiline mode)
722       .      match any character except newline (by default)
723       [      start character class definition
724       |      start of alternative branch
725       (      start subpattern
726       )      end subpattern
727       ?      extends the meaning of (
728              also 0 or 1 quantifier
729              also quantifier minimizer
730       *      0 or more quantifier
731       +      1 or more quantifier
732       {      start min/max quantifier
734     Part of a pattern that is in square brackets is called a
735     "character class". In a character class the only meta-
736     characters are:
738     \ general escape character
739     ^ negate the class, but only if the first character
740     - indicates character range
741     ] terminates the character class
743     The following sections describe the use of each of the
744     meta-characters.
749     The backslash character has several uses. Firstly, if it is
750     followed by a non-alphameric character, it takes away any
751     special meaning that character may have. This use of
752     backslash as an escape character applies both inside and
753     outside character classes.
755     For example, if you want to match a "*" character, you write
756     "\*" in the pattern. This applies whether or not the follow-
757     ing character would otherwise be interpreted as a meta-
758     character, so it is always safe to precede a non-alphameric
759     with "\" to specify that it stands for itself. In particu-
760     lar, if you want to match a backslash, you write "\\".
762     If a pattern is compiled with the PCRE_EXTENDED option, whi-
763     tespace in the pattern (other than in a character class) and
764     characters between a "#" outside a character class and the
765     next newline character are ignored. An escaping backslash
766     can be used to include a whitespace or "#" character as part
767     of the pattern.
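The effect of extended mode can be tried out with a short sketch. It uses Python's re module purely as a stand-in engine, on the assumption that its re.VERBOSE flag treats whitespace and "#" comments in the way described here for PCRE_EXTENDED; it is not a PCRE call, and the names spaced, literal and extended_demo are invented for the illustration.

```python
import re

# Assumption: re.VERBOSE is Python's counterpart of PCRE_EXTENDED.
spaced = re.compile(r"""
    a b     # whitespace and this comment are ignored
    c
""", re.VERBOSE)

# A backslash keeps a space or "#" as part of the pattern.
literal = re.compile(r"a\ b\#c", re.VERBOSE)

def extended_demo():
    """Report which subjects the two patterns match in full."""
    return (
        spaced.fullmatch("abc") is not None,     # spaces in the pattern vanish
        spaced.fullmatch("a b c") is not None,   # so a spaced subject fails
        literal.fullmatch("a b#c") is not None,  # escaped space and "#" are literal
    )
```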
769     A second use of backslash provides a way of encoding non-
770     printing characters in patterns in a visible manner. There
771     is no restriction on the appearance of non-printing charac-
772     ters, apart from the binary zero that terminates a pattern,
773     but when a pattern is being prepared by text editing, it is
774     usually easier to use one of the following escape sequences
775     than the binary character it represents:
777       \a     alarm, that is, the BEL character (hex 07)
778       \cx    "control-x", where x is any character
779       \e     escape (hex 1B)
780       \f     formfeed (hex 0C)
781       \n     newline (hex 0A)
782       \r     carriage return (hex 0D)
784       \t     tab (hex 09)
785       \xhh   character with hex code hh
786       \ddd   character with octal code ddd, or backreference
788     The precise effect of "\cx" is as follows: if "x" is a lower
789     case letter, it is converted to upper case. Then bit 6 of
790     the character (hex 40) is inverted. Thus "\cz" becomes hex
791     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
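The "\cx" rule is small enough to write out directly. The following Python sketch models the two steps just described (fold lower case to upper case, then invert bit 6); control_escape is a hypothetical helper name, not part of PCRE.

```python
def control_escape(x):
    """Return the byte value that "\\cx" stands for, per the rule in the text."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())    # lower case letters are converted to upper case
    return code ^ 0x40           # then bit 6 (hex 40) is inverted

# The three cases quoted in the text: z, { and ;
examples = {x: hex(control_escape(x)) for x in "z{;"}
```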
793     After "\x", up to two hexadecimal digits are read (letters
794     can be in upper or lower case).
796     After "\0" up to two further octal digits are read. In both
797     cases, if there are fewer than two digits, just those that
798     are present are used. Thus the sequence "\0\x\07" specifies
799     two binary zeros followed by a BEL character. Make sure you
800     supply two digits after the initial zero if the character
801     that follows is itself an octal digit.
803     The handling of a backslash followed by a digit other than 0
804     is complicated. Outside a character class, PCRE reads it
805     and any following digits as a decimal number. If the number
806     is less than 10, or if there have been at least that many
807     previous capturing left parentheses in the expression, the
808     entire sequence is taken as a back reference. A description
809     of how this works is given later, following the discussion
810     of parenthesized subpatterns.
812     Inside a character class, or if the decimal number is
813     greater than 9 and there have not been that many capturing
814     subpatterns, PCRE re-reads up to three octal digits follow-
815     ing the backslash, and generates a single byte from the
816     least significant 8 bits of the value. Any subsequent digits
817     stand for themselves. For example:
819       \040   is another way of writing a space
820       \40    is the same, provided there are fewer than 40
821                 previous capturing subpatterns
822       \7     is always a back reference
823       \11    might be a back reference, or another way of
824                 writing a tab
825       \011   is always a tab
826       \0113  is a tab followed by the character "3"
827       \113   is the character with octal code 113 (since there
828                 can be no more than 99 back references)
829       \377   is a byte consisting entirely of 1 bits
830       \81    is either a back reference, or a binary zero
831                 followed by the two characters "8" and "1"
833     Note that octal values of 100 or greater must not be intro-
834     duced by a leading zero, because no more than three octal
835     digits are ever read.
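The decision procedure for digits after a backslash can be modelled as a small function. This is a sketch of the rules as stated in the text, not PCRE's own code; read_backslash_digits and its arguments are hypothetical, and digits is assumed to be the whole run of digits that follows the backslash.

```python
def read_backslash_digits(digits, groups_so_far, in_class=False):
    """Interpret the digit string that follows a backslash.

    Returns ("backref", n, "") or ("byte", value, literal_rest).
    groups_so_far is the number of capturing left parentheses already seen.
    """
    def octal(text, limit):
        # Read at most `limit` octal digits; keep the low 8 bits.
        n = 0
        while n < limit and n < len(text) and text[n] in "01234567":
            n += 1
        value = int(text[:n], 8) if n else 0
        return ("byte", value & 0xFF, text[n:])

    if digits[0] == "0":
        # "\0" followed by up to two further octal digits
        return octal(digits, 3)
    if not in_class:
        number = int(digits)     # outside a class: read as decimal first
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits; the rest are literals.
    return octal(digits, 3)
```

For instance, read_backslash_digits("81", 0) yields a zero byte with "81" left over as literal characters, which is the last row of the table above.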
836     All the sequences that define a single byte value can be
837     used both inside and outside character classes. In addition,
838     inside a character class, the sequence "\b" is interpreted
839     as the backspace character (hex 08). Outside a character
840     class it has a different meaning (see below).
842     The third use of backslash is for specifying generic charac-
843     ter types:
845     \d any decimal digit
846     \D any character that is not a decimal digit
847     \s any whitespace character
848     \S any character that is not a whitespace character
849     \w any "word" character
850     \W any "non-word" character
852     Each pair of escape sequences partitions the complete set of
853     characters into two disjoint sets. Any given character
854     matches one, and only one, of each pair.
856     A "word" character is any letter or digit or the underscore
857     character, that is, any character which can be part of a
858     Perl "word". The definition of letters and digits is con-
859     trolled by PCRE's character tables, and may vary if locale-
860     specific matching is taking place (see "Locale support"
861     above). For example, in the "fr" (French) locale, some char-
862     acter codes greater than 128 are used for accented letters,
863     and these are matched by \w.
865     These character type sequences can appear both inside and
866     outside character classes. They each match one character of
867     the appropriate type. If the current matching point is at
868     the end of the subject string, all of them fail, since there
869     is no character to match.
871     The fourth use of backslash is for certain simple asser-
872     tions. An assertion specifies a condition that has to be met
873     at a particular point in a match, without consuming any
874     characters from the subject string. The use of subpatterns
875     for more complicated assertions is described below. The
876     backslashed assertions are
878       \b     word boundary
879       \B     not a word boundary
880       \A     start of subject (independent of multiline mode)
881       \Z     end of subject or newline at end (independent of
882                 multiline mode)
883       \z     end of subject (independent of multiline mode)
885     These assertions may not appear in character classes (but
886     note that "\b" has a different meaning, namely the backspace
887     character, inside a character class).
888     A word boundary is a position in the subject string where
889     the current character and the previous character do not both
890     match \w or \W (i.e. one matches \w and the other matches
891     \W), or the start or end of the string if the first or last
892     character matches \w, respectively.
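The word boundary definition can be restated as a few lines of Python. The sketch below is a model only: WORD_CHARS assumes the default tables (ASCII letters, digits and underscore), and is_word_boundary is a hypothetical name. Positions before the start and after the end of the subject are treated as non-word, which gives the start and end cases described above.

```python
import string

# Assumption: default character tables, so \w is ASCII letters, digits, "_".
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject, pos):
    """True if position pos (0..len(subject)) is a word boundary."""
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after       # exactly one side is a word character

boundaries = [p for p in range(len("ab, cd") + 1)
              if is_word_boundary("ab, cd", p)]
```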
894     The \A, \Z, and \z assertions differ from the traditional
895     circumflex and dollar (described below) in that they only
896     ever match at the very start and end of the subject string,
897     whatever options are set. They are not affected by the
898     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
899     ment of pcre_exec() is non-zero, \A can never match. The
900     difference between \Z and \z is that \Z matches before a
901     newline that is the last character of the string as well as
902     at the end of the string, whereas \z matches only at the
903     end.
908     Outside a character class, in the default matching mode, the
909     circumflex character is an assertion which is true only if
910     the current matching point is at the start of the subject
911     string. If the startoffset argument of pcre_exec() is non-
912     zero, circumflex can never match. Inside a character class,
913     circumflex has an entirely different meaning (see below).
915     Circumflex need not be the first character of the pattern if
916     a number of alternatives are involved, but it should be the
917     first thing in each alternative in which it appears if the
918     pattern is ever to match that branch. If all possible alter-
919     natives start with a circumflex, that is, if the pattern is
920     constrained to match only at the start of the subject, it is
921     said to be an "anchored" pattern. (There are also other con-
922     structs that can cause a pattern to be anchored.)
924     A dollar character is an assertion which is true only if the
925     current matching point is at the end of the subject string,
926     or immediately before a newline character that is the last
927     character in the string (by default). Dollar need not be the
928     last character of the pattern if a number of alternatives
929     are involved, but it should be the last item in any branch
930     in which it appears. Dollar has no special meaning in a
931     character class.
933     The meaning of dollar can be changed so that it matches only
934     at the very end of the string, by setting the
935     PCRE_DOLLAR_ENDONLY option at compile or matching time. This
936     does not affect the \Z assertion.
938     The meanings of the circumflex and dollar characters are
939     changed if the PCRE_MULTILINE option is set. When this is
940     the case, they match immediately after and immediately
941     before an internal "\n" character, respectively, in addition
942     to matching at the start and end of the subject string. For
943     example, the pattern /^abc$/ matches the subject string
944     "def\nabc" in multiline mode, but not otherwise. Conse-
945     quently, patterns that are anchored in single line mode
946     because all branches start with "^" are not anchored in mul-
947     tiline mode, and a match for circumflex is possible when the
948     startoffset argument of pcre_exec() is non-zero. The
949     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
950     set.
952     Note that the sequences \A, \Z, and \z can be used to match
953     the start and end of the subject in both modes, and if all
954     branches of a pattern start with \A it is always anchored,
955     whether PCRE_MULTILINE is set or not.
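The circumflex and dollar behaviour can be checked with a short experiment. Python's re module is used here only as a stand-in engine, on the assumption that re.MULTILINE corresponds to PCRE_MULTILINE and that its default handling of dollar matches the description above; the variable names are illustrative.

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ only apply at the ends of the subject.
single = re.search(r"^abc$", subject)

# Assumed analogue of PCRE_MULTILINE: ^ also matches after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a final newline.
before_final_newline = re.search(r"abc$", "abc\n")
```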
960     Outside a character class, a dot in the pattern matches any
961     one character in the subject, including a non-printing char-
962     acter, but not (by default) newline. If the PCRE_DOTALL
963     option is set, then dots match newlines as well. The han-
964     dling of dot is entirely independent of the handling of cir-
965     cumflex and dollar, the only relationship being that they
966     both involve newline characters. Dot has no special meaning
967     in a character class.
972     An opening square bracket introduces a character class, ter-
973     minated by a closing square bracket. A closing square
974     bracket on its own is not special. If a closing square
975     bracket is required as a member of the class, it should be
976     the first data character in the class (after an initial cir-
977     cumflex, if present) or escaped with a backslash.
979     A character class matches a single character in the subject;
980     the character must be in the set of characters defined by
981     the class, unless the first character in the class is a cir-
982     cumflex, in which case the subject character must not be in
983     the set defined by the class. If a circumflex is actually
984     required as a member of the class, ensure it is not the
985     first character, or escape it with a backslash.
987     For example, the character class [aeiou] matches any lower
988     case vowel, while [^aeiou] matches any character that is not
989     a lower case vowel. Note that a circumflex is just a con-
990     venient notation for specifying the characters which are in
991     the class by enumerating those that are not. It is not an
992     assertion: it still consumes a character from the subject
993     string, and fails if the current pointer is at the end of
994     the string.
996     When caseless matching is set, any letters in a class
997     represent both their upper case and lower case versions, so
998     for example, a caseless [aeiou] matches "A" as well as "a",
999     and a caseless [^aeiou] does not match "A", whereas a case-
1000     ful version would.
1002     The newline character is never treated in any special way in
1003     character classes, whatever the setting of the PCRE_DOTALL
1004     or PCRE_MULTILINE options is. A class such as [^a] will
1005     always match a newline.
1007     The minus (hyphen) character can be used to specify a range
1008     of characters in a character class. For example, [d-m]
1009     matches any letter between d and m, inclusive. If a minus
1010     character is required in a class, it must be escaped with a
1011     backslash or appear in a position where it cannot be inter-
1012     preted as indicating a range, typically as the first or last
1013     character in the class.
1015     It is not possible to have the literal character "]" as the
1016     end character of a range. A pattern such as [W-]46] is
1017     interpreted as a class of two characters ("W" and "-") fol-
1018     lowed by a literal string "46]", so it would match "W46]" or
1019     "-46]". However, if the "]" is escaped with a backslash it
1020     is interpreted as the end of range, so [W-\]46] is inter-
1021     preted as a single class containing a range followed by two
1022     separate characters. The octal or hexadecimal representation
1023     of "]" can also be used to end a range.
1025     Ranges operate in ASCII collating sequence. They can also be
1026     used for characters specified numerically, for example
1027     [\000-\037]. If a range that includes letters is used when
1028     caseless matching is set, it matches the letters in either
1029     case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
1030     matched caselessly, and if character tables for the "fr"
1031     locale are in use, [\xc8-\xcb] matches accented E characters
1032     in both cases.
1034     The character types \d, \D, \s, \S, \w, and \W may also
1035     appear in a character class, and add the characters that
1036     they match to the class. For example, [\dABCDEF] matches any
1037     hexadecimal digit. A circumflex can conveniently be used
1038     with the upper case character types to specify a more res-
1039     tricted set of characters than the matching lower case type.
1040     For example, the class [^\W_] matches any letter or digit,
1041     but not underscore.
1043     All non-alphameric characters other than \, -, ^ (at the
1044     start) and the terminating ] are non-special in character
1045     classes, but it does no harm if they are escaped.
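Several of the character class rules in this section can be checked together. The sketch below uses Python's re module as a stand-in engine, assuming it agrees with PCRE on these particular classes; in_class and checks are names made up for the example.

```python
import re

def in_class(cls, ch, flags=0):
    """True if the one-character string ch is matched by the class cls."""
    return re.fullmatch(cls, ch, flags) is not None

checks = {
    "negated class matches newline": in_class(r"[^a]", "\n"),
    "caseless negated class rejects A":
        not in_class(r"[^aeiou]", "A", re.IGNORECASE),
    "caseful negated class accepts A": in_class(r"[^aeiou]", "A"),
    "[^\\W_] rejects underscore": not in_class(r"[^\W_]", "_"),
    "[^\\W_] accepts a digit": in_class(r"[^\W_]", "7"),
    # A class of "W" and "-" followed by the literal string "46]".
    "[W-]46] is a class plus literal 46]":
        re.fullmatch(r"[W-]46]", "-46]") is not None,
    # With the "]" escaped, W-] is a range, so "Z" falls inside it.
    "[W-\\]46] is one class with a range": in_class(r"[W-\]46]", "Z"),
}
```

Every entry in checks is expected to be true if the stand-in engine follows the rules given above.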
1050     Vertical bar characters are used to separate alternative
1051     patterns. For example, the pattern
1053     gilbert|sullivan
1055     matches either "gilbert" or "sullivan". Any number of alter-
1056     natives may appear, and an empty alternative is permitted
1057     (matching the empty string). The matching process tries
1058     each alternative in turn, from left to right, and the first
1059     one that succeeds is used. If the alternatives are within a
1060     subpattern (defined below), "succeeds" means matching the
1061     rest of the main pattern as well as the alternative in the
1062     subpattern.
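The left-to-right rule has a visible consequence: an earlier alternative wins even when a later one would match more. The following sketch uses Python's re module as a stand-in, assuming it orders alternatives the same way; the variable names are illustrative.

```python
import re

# The leftmost alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so the engine moves on to the second alternative.
needs_rest = re.fullmatch(r"(gil|gilbert) and sullivan",
                          "gilbert and sullivan").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|", "") is not None
```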
1067     The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1068     and PCRE_EXTENDED can be changed from within the pattern by
1069     a sequence of Perl option letters enclosed between "(?" and
1070     ")". The option letters are
1072     i for PCRE_CASELESS
1073     m for PCRE_MULTILINE
1074     s for PCRE_DOTALL
1075     x for PCRE_EXTENDED
1077     For example, (?im) sets caseless, multiline matching. It is
1078     also possible to unset these options by preceding the letter
1079     with a hyphen, and a combined setting and unsetting such as
1080     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1081     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1082     If a letter appears both before and after the hyphen, the
1083     option is unset.
1085     The scope of these option changes depends on where in the
1086     pattern the setting occurs. For settings that are outside
1087     any subpattern (defined below), the effect is the same as if
1088     the options were set or unset at the start of matching. The
1089     following patterns all behave in exactly the same way:
1091     (?i)abc
1092     a(?i)bc
1093     ab(?i)c
1094     abc(?i)
1096     which in turn is the same as compiling the pattern abc with
1097     PCRE_CASELESS set. In other words, such "top level" set-
1098     tings apply to the whole pattern (unless there are other
1099     changes inside subpatterns). If there is more than one set-
1100     ting of the same option at top level, the rightmost setting
1101     is used.
1103     If an option change occurs inside a subpattern, the effect
1104     is different. This is a change of behaviour in Perl 5.005.
1105     An option change inside a subpattern affects only that part
1106     of the subpattern that follows it, so
1108     (a(?i)b)c
1110     matches abc and aBc and no other strings (assuming
1111     PCRE_CASELESS is not used). By this means, options can be
1112     made to have different settings in different parts of the
1113     pattern. Any changes made in one alternative do carry on
1114     into subsequent branches within the same subpattern. For
1115     example,
1117     (a(?i)b|c)
1119     matches "ab", "aB", "c", and "C", even though when matching
1120     "C" the first branch is abandoned before the option setting.
1121     This is because the effects of option settings happen at
1122     compile time. There would be some very weird behaviour oth-
1123     erwise.
1125     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1126     be changed in the same way as the Perl-compatible options by
1127     using the characters U and X respectively. The (?X) flag
1128     setting is special in that it must always occur earlier in
1129     the pattern than any of the additional features it turns on,
1130     even when it is at top level. It is best put at the start.
1135     Subpatterns are delimited by parentheses (round brackets),
1136     which can be nested. Marking part of a pattern as a subpat-
1137     tern does two things:
1139     1. It localizes a set of alternatives. For example, the pat-
1140     tern
1142     cat(aract|erpillar|)
1144     matches one of the words "cat", "cataract", or "caterpil-
1145     lar". Without the parentheses, it would match "cataract",
1146     "erpillar" or the empty string.
1148     2. It sets up the subpattern as a capturing subpattern (as
1149     defined above). When the whole pattern matches, that por-
1150     tion of the subject string that matched the subpattern is
1151     passed back to the caller via the ovector argument of
1152     pcre_exec(). Opening parentheses are counted from left to
1153     right (starting from 1) to obtain the numbers of the captur-
1154     ing subpatterns.
1156     For example, if the string "the red king" is matched against
1157     the pattern
1159     the ((red|white) (king|queen))
1161     the captured substrings are "red king", "red", and "king",
1162     and are numbered 1, 2, and 3.
1164     The fact that plain parentheses fulfil two functions is not
1165     always helpful. There are often times when a grouping sub-
1166     pattern is required without a capturing requirement. If an
1167     opening parenthesis is followed by "?:", the subpattern does
1168     not do any capturing, and is not counted when computing the
1169     number of any subsequent capturing subpatterns. For example,
1170     if the string "the white queen" is matched against the pat-
1171     tern
1173     the ((?:red|white) (king|queen))
1175     the captured substrings are "white queen" and "queen", and
1176     are numbered 1 and 2. The maximum number of captured sub-
1177     strings is 99, and the maximum number of all subpatterns,
1178     both capturing and non-capturing, is 200.
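The numbering of captured substrings can be seen by running the two patterns above. This sketch uses Python's re module as a stand-in (its groups() method plays the role that the ovector argument plays for pcre_exec()); it assumes group numbering works the same way, and the variable names are invented for the example.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")

numbered = capturing.groups()          # substrings numbered 1, 2, 3
renumbered = non_capturing.groups()    # substrings numbered 1, 2

# Parentheses also localize alternatives, including an empty one.
words = [w for w in ("cat", "cataract", "caterpillar", "caterpil")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]
```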
1180     As a convenient shorthand, if any option settings are
1181     required at the start of a non-capturing subpattern, the
1182     option letters may appear between the "?" and the ":". Thus
1183     the two patterns
1185     (?i:saturday|sunday)
1186     (?:(?i)saturday|sunday)
1188     match exactly the same set of strings. Because alternative
1189     branches are tried from left to right, and options are not
1190     reset until the end of the subpattern is reached, an option
1191     setting in one branch does affect subsequent branches, so
1192     the above patterns match "SUNDAY" as well as "Saturday".
1197     Repetition is specified by quantifiers, which can follow any
1198     of the following items:
1201     a single character, possibly escaped
1202     the . metacharacter
1203     a character class
1204     a back reference (see next section)
1205     a parenthesized subpattern (unless it is an assertion -
1206     see below)
1208     The general repetition quantifier specifies a minimum and
1209     maximum number of permitted matches, by giving the two
1210     numbers in curly brackets (braces), separated by a comma.
1211     The numbers must be less than 65536, and the first must be
1212     less than or equal to the second. For example:
1214     z{2,4}
1216     matches "zz", "zzz", or "zzzz". A closing brace on its own
1217     is not a special character. If the second number is omitted,
1218     but the comma is present, there is no upper limit; if the
1219     second number and the comma are both omitted, the quantifier
1220     specifies an exact number of required matches. Thus
1222     [aeiou]{3,}
1224     matches at least 3 successive vowels, but may match many
1225     more, while
1227     \d{8}
1229     matches exactly 8 digits. An opening curly bracket that
1230     appears in a position where a quantifier is not allowed, or
1231     one that does not match the syntax of a quantifier, is taken
1232     as a literal character. For example, {,6} is not a quantif-
1233     ier, but a literal string of four characters.
1235     The quantifier {0} is permitted, causing the expression to
1236     behave as if the previous item and the quantifier were not
1237     present.
1239     For convenience (and historical compatibility) the three
1240     most common quantifiers have single-character abbreviations:
1242     * is equivalent to {0,}
1243     + is equivalent to {1,}
1244     ? is equivalent to {0,1}
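The brace quantifiers and their abbreviations can be exercised as follows. Python's re module stands in for PCRE here, on the assumption that the two agree for these simple cases; accepts is a hypothetical helper.

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

zs = accepts(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
vowels = accepts(r"[aeiou]{3,}", ["ae", "aei", "aeiouaeiou"])
digits = accepts(r"\d{8}", ["1234567", "12345678", "123456789"])

# The one-character quantifiers behave like their brace forms.
same = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(brace, s))
    for short, brace in ((r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}"))
    for s in ("", "a", "aa")
)
```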
1246     It is possible to construct infinite loops by following a
1247     subpattern that can match no characters with a quantifier
1248     that has no upper limit, for example:
1250     (a?)*
1252     Earlier versions of Perl and PCRE used to give an error at
1253     compile time for such patterns. However, because there are
1254     cases where this can be useful, such patterns are now
1255     accepted, but if any repetition of the subpattern does in
1256     fact match no characters, the loop is forcibly broken.
1258     By default, the quantifiers are "greedy", that is, they
1259     match as much as possible (up to the maximum number of per-
1260     mitted times), without causing the rest of the pattern to
1261     fail. The classic example of where this gives problems is in
1262     trying to match comments in C programs. These appear between
1263     the sequences /* and */ and within the sequence, individual
1264     * and / characters may appear. An attempt to match C com-
1265     ments by applying the pattern
1267     /\*.*\*/
1269     to the string
1271     /* first comment */ not comment /* second comment */
1273     fails, because it matches the entire string due to the
1274     greediness of the .* item.
1276     However, if a quantifier is followed by a question mark,
1277     then it ceases to be greedy, and instead matches the minimum
1278     number of times possible, so the pattern
1280     /\*.*?\*/
1282     does the right thing with the C comments. The meaning of the
1283     various quantifiers is not otherwise changed, just the pre-
1284     ferred number of matches. Do not confuse this use of ques-
1285     tion mark with its use as a quantifier in its own right.
1286     Because it has two uses, it can sometimes appear doubled, as
1287     in
1289     \d??\d
1291     which matches one digit by preference, but can match two if
1292     that is the only way the rest of the pattern matches.
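The difference between greedy and minimizing quantifiers is easy to observe. The sketch below uses Python's re module as a stand-in engine, assuming the same greedy and non-greedy semantics; the subject string and variable names are chosen for the illustration.

```python
import re

text = "/* first comment */ not comment /* second comment */"

greedy = re.search(r"/\*.*\*/", text).group()    # runs on to the last */
lazy = re.search(r"/\*.*?\*/", text).group()     # stops at the first */

# A doubled question mark: an optional digit that prefers to be absent.
prefers_one = re.match(r"\d??\d", "12").group()

# When the whole subject has to be consumed, the optional digit is used.
takes_two = re.fullmatch(r"\d??\d", "12").group()
```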
1294     If the PCRE_UNGREEDY option is set (an option which is not
1295     available in Perl) then the quantifiers are not greedy by
1296     default, but individual ones can be made greedy by following
1297     them with a question mark. In other words, it inverts the
1298     default behaviour.
1300     When a parenthesized subpattern is quantified with a minimum
1301     repeat count that is greater than 1 or with a limited max-
1302     imum, more memory is required for the compiled pattern, in
1303     proportion to the size of the minimum or maximum.
1305     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1306     option (equivalent to Perl's /s) is set, thus allowing the .
1307     to match newlines, then the pattern is implicitly anchored,
1308     because whatever follows will be tried against every charac-
1309     ter position in the subject string, so there is no point in
1310     retrying the overall match at any position after the first.
1311     PCRE treats such a pattern as though it were preceded by \A.
1312     In cases where it is known that the subject string contains
1313     no newlines, it is worth setting PCRE_DOTALL when the pat-
1314     tern begins with .* in order to obtain this optimization, or
1315     alternatively using ^ to indicate anchoring explicitly.
1317     When a capturing subpattern is repeated, the value captured
1318     is the substring that matched the final iteration. For exam-
1319     ple, after
1321     (tweedle[dume]{3}\s*)+
1323     has matched "tweedledum tweedledee" the value of the cap-
1324     tured substring is "tweedledee". However, if there are
1325     nested capturing subpatterns, the corresponding captured
1326     values may have been set in previous iterations. For exam-
1327     ple, after
1329     /(a|(b))+/
1331     matches "aba" the value of the second captured substring is
1332     "b".
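Both cases can be reproduced in a short experiment. Python's re module is used as a stand-in on the assumption that it retains captured values across iterations in the same way; the variable names are illustrative.

```python
import re

# Only the final iteration's text is kept for a repeated group.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# Group 2 was set on the second iteration and is not cleared by the third.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```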
1337     Outside a character class, a backslash followed by a digit
1338     greater than 0 (and possibly further digits) is a back
1339     reference to a capturing subpattern earlier (i.e. to its
1340     left) in the pattern, provided there have been that many
1341     previous capturing left parentheses.
1343     However, if the decimal number following the backslash is
1344     less than 10, it is always taken as a back reference, and
1345     causes an error only if there are not that many capturing
1346     left parentheses in the entire pattern. In other words, the
1347     parentheses that are referenced need not be to the left of
1348     the reference for numbers less than 10. See the section
1349     entitled "Backslash" above for further details of the han-
1350     dling of digits following a backslash.
1352     A back reference matches whatever actually matched the cap-
1353     turing subpattern in the current subject string, rather than
1354     anything matching the subpattern itself. So the pattern
1356     (sens|respons)e and \1ibility
1358     matches "sense and sensibility" and "response and responsi-
1359     bility", but not "sense and responsibility". If caseful
1360     matching is in force at the time of the back reference, then
1361     the case of letters is relevant. For example,
1363     ((?i)rah)\s+\1
1365     matches "rah rah" and "RAH RAH", but not "RAH rah", even
1366     though the original capturing subpattern is matched case-
1367     lessly.
1369     There may be more than one back reference to the same sub-
1370     pattern. If a subpattern has not actually been used in a
1371     particular match, then any back references to it always
1372     fail. For example, the pattern
1374     (a|(bc))\2
1376     always fails if it starts to match "a" rather than "bc".
1377     Because there may be up to 99 back references, all digits
1378     following the backslash are taken as part of a potential
1379     back reference number. If the pattern continues with a digit
1380     character, then some delimiter must be used to terminate the
1381     back reference. If the PCRE_EXTENDED option is set, this can
1382     be whitespace. Otherwise an empty comment can be used.
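The back reference examples above can be run as written. The sketch uses Python's re module as a stand-in engine, assuming it treats back references to set and unset groups as described; the names are made up for the example.

```python
import re

pattern = r"(sens|respons)e and \1ibility"
phrases = ["sense and sensibility",
           "response and responsibility",
           "sense and responsibility"]

# \1 must repeat what group 1 actually matched, not just fit its pattern.
matched = [p for p in phrases if re.fullmatch(pattern, p)]

# \2 refers to a group that takes no part in the match when "a" is chosen.
unset_group_fails = re.fullmatch(r"(a|(bc))\2", "a") is None
set_group_works = re.fullmatch(r"(a|(bc))\2", "bcbc") is not None
```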
1384     A back reference that occurs inside the parentheses to which
1385     it refers fails when the subpattern is first used, so, for
1386     example, (a\1) never matches. However, such references can
1387     be useful inside repeated subpatterns. For example, the pat-
1388     tern
1390     (a|b\1)+
1392     matches any number of "a"s and also "aba", "ababbaa" etc. At
1393     each iteration of the subpattern, the back reference matches
1394     the character string corresponding to the previous itera-
1395     tion. In order for this to work, the pattern must be such
1396     that the first iteration does not need to match the back
1397     reference. This can be done using alternation, as in the
1398     example above, or by a quantifier with a minimum of zero.
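The iteration rule for (a|b\1)+ can be modelled without a regex engine. The following Python sketch is a hand-written model of the behaviour described above, applied to a whole subject string; matches_self_reference is a hypothetical name, and this is not how PCRE implements the match.

```python
def matches_self_reference(subject):
    """Model of (a|b\\1)+ applied to the whole of subject.

    Each iteration is either "a" or "b" followed by whatever the
    previous iteration matched; the first iteration can only be "a".
    """
    def step(pos, previous):
        if pos == len(subject) and previous is not None:
            return True          # at least one iteration, nothing left over
        candidates = ["a"]
        if previous is not None:
            candidates.append("b" + previous)   # "b" followed by \1
        return any(
            subject.startswith(c, pos) and step(pos + len(c), c)
            for c in candidates
        )
    return step(0, None)

accepted = [s for s in ("a", "aaa", "aba", "ababbaa", "ba")
            if matches_self_reference(s)]
```

In the subject "ababbaa" the successive iterations match "a", "ba", "bba" and "a".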
1403     An assertion is a test on the characters following or
1404     preceding the current matching point that does not actually
1405     consume any characters. The simple assertions coded as \b,
1406     \B, \A, \Z, \z, ^ and $ are described above. More compli-
1407     cated assertions are coded as subpatterns. There are two
1408     kinds: those that look ahead of the current position in the
1409     subject string, and those that look behind it.
1410     An assertion subpattern is matched in the normal way, except
1411     that it does not cause the current matching position to be
1412     changed. Lookahead assertions start with (?= for positive
1413     assertions and (?! for negative assertions. For example,
1415     \w+(?=;)
1417     matches a word followed by a semicolon, but does not include
1418     the semicolon in the match, and
1420     foo(?!bar)
1422     matches any occurrence of "foo" that is not followed by
1423     "bar". Note that the apparently similar pattern
1425     (?!foo)bar
1427     does not find an occurrence of "bar" that is preceded by
1428     something other than "foo"; it finds any occurrence of "bar"
1429     whatsoever, because the assertion (?!foo) is always true
1430     when the next three characters are "bar". A lookbehind
1431     assertion is needed to achieve this effect.
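
The three lookahead patterns above can be tried in any engine that supports Perl-style assertions. The following is a minimal sketch that uses Python's re module as a stand-in for PCRE (an assumption made for illustration; PCRE itself is called from C via pcre_compile() and pcre_exec()). The subject strings are invented for the example.

```python
import re

# \w+(?=;) matches a word that is followed by a semicolon, without
# including the semicolon in the match.
word = re.search(r"\w+(?=;)", "alpha beta; gamma")

# foo(?!bar) matches "foo" only where "bar" does not follow.
foo = re.search(r"foo(?!bar)", "foobar foobaz")

# (?!foo)bar matches any "bar": when the next three characters are
# "bar" they cannot also be "foo", so the assertion is always true.
bar = re.search(r"(?!foo)bar", "foobar")

print(word.group(), foo.start(), bar.start())
```

Here the first search returns "beta", the second skips the "foo" at offset 0 and matches the one at offset 7, and the third matches the "bar" at offset 3 even though it is preceded by "foo".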
1433     Lookbehind assertions start with (?<= for positive asser-
1434     tions and (?<! for negative assertions. For example,
1436     (?<!foo)bar
1438     does find an occurrence of "bar" that is not preceded by
1439     "foo". The contents of a lookbehind assertion are restricted
1440     such that all the strings it matches must have a fixed
1441     length. However, if there are several alternatives, they do
1442     not all have to have the same fixed length. Thus
1444     (?<=bullock|donkey)
1446     is permitted, but
1448     (?<!dogs?|cats?)
1450     causes an error at compile time. Branches that match dif-
1451     ferent length strings are permitted only at the top level of
1452     a lookbehind assertion. This is an extension compared with
1453     Perl 5.005, which requires all branches to match the same
1454     length of string. An assertion such as
1456     (?<=ab(c|de))
1458     is not permitted, because its single top-level branch can
1459     match two different lengths, but it is acceptable if rewrit-
1460     ten to use two top-level branches:
1462     (?<=abc|abde)
1464     The implementation of lookbehind assertions is, for each
1465     alternative, to temporarily move the current position back
1466     by the fixed width and then try to match. If there are
1467     insufficient characters before the current position, the
1468     match is deemed to fail. Lookbehinds in conjunction with
1469     once-only subpatterns can be particularly useful for match-
1470     ing at the ends of strings; an example is given at the end
1471     of the section on once-only subpatterns.
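
The implementation described above can be modelled in a few lines. This is only a sketch of the idea, not PCRE's code: the helper name lookbehind() is hypothetical, and each top-level alternative is restricted to a literal string so that its fixed width is simply its length.

```python
def lookbehind(subject, pos, alternatives, positive=True):
    """Model a lookbehind assertion tested at offset pos.

    alternatives holds one literal string per top-level branch; the
    branches may have different widths, but each width is fixed.
    """
    found = False
    for alt in alternatives:
        width = len(alt)
        if width > pos:
            # Insufficient characters before pos: this branch fails.
            continue
        # Move back by the fixed width and try to match from there.
        if subject.startswith(alt, pos - width):
            found = True
            break
    return found if positive else not found


# (?<=bullock|donkey) tested at the end of "a donkey"
print(lookbehind("a donkey", 8, ["bullock", "donkey"]))
# (?<!foo) tested at offset 3 of "foobar"
print(lookbehind("foobar", 3, ["foo"], positive=False))
```

The first call succeeds through its second branch; the second call fails because "foo" does precede offset 3. Note that a negative lookbehind succeeds when there are too few preceding characters, because the enclosed match itself fails.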
1473     Several assertions (of any sort) may occur in succession.
1474     For example,
1476     (?<=\d{3})(?<!999)foo
1478     matches "foo" preceded by three digits that are not "999".
1479     Notice that each of the assertions is applied independently
1480     at the same point in the subject string. First there is a
1481     check that the previous three characters are all digits,
1482     then there is a check that the same three characters are not
1483     "999". This pattern does not match "foo" preceded by six
1484     characters, the first three of which are digits and the last
1485     three of which are not "999". For example, it doesn't match
1486     "123abcfoo". A pattern to do that is
1488     (?<=\d{3}...)(?<!999)foo
1490     This time the first assertion looks at the preceding six
1491     characters, checking that the first three are digits, and
1492     then the second assertion checks that the preceding three
1493     characters are not "999".
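
The difference between the two patterns can be checked directly. The sketch below again assumes Python's re module as a stand-in for PCRE; both lookbehinds have a single fixed width, so the module accepts them. The variable names are illustrative only.

```python
import re

# Both assertions are applied at the same point, just before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters, the second three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(three_back.search(subject)),
          bool(six_back.search(subject)))
```

The first pattern accepts only "123foo" from this list, and the second accepts only "123abcfoo".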
1495     Assertions can be nested in any combination. For example,
1497     (?<=(?<!foo)bar)baz
1499     matches an occurrence of "baz" that is preceded by "bar"
1500     which in turn is not preceded by "foo", while
1502     (?<=\d{3}(?!999)...)foo
1504     is another pattern which matches "foo" preceded by three
1505     digits and any three characters that are not "999".
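
A short check of the two nested patterns, again as a sketch that assumes Python's re module in place of PCRE (the subject strings are invented):

```python
import re

# "baz" preceded by "bar" which is itself not preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are
# not "999"; the lookahead is tested inside the lookbehind.
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bool(nested_behind.search("xbarbaz")))      # "bar" follows "x"
print(bool(nested_behind.search("foobarbaz")))    # "bar" follows "foo"
print(bool(ahead_in_behind.search("123abcfoo")))
print(bool(ahead_in_behind.search("123999foo")))
```

The first and third searches succeed; the second and fourth fail.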
1507     Assertion subpatterns are not capturing subpatterns, and may
1508     not be repeated, because it makes no sense to assert the
1509     same thing several times. If any kind of assertion contains
1510     capturing subpatterns within it, these are counted for the
1511     purposes of numbering the capturing subpatterns in the whole
1512     pattern. However, substring capturing is carried out only
1513     for positive assertions, because it does not make sense for
1514     negative assertions.
1516     Assertions count towards the maximum of 200 parenthesized
1517     subpatterns.
1521     ONCE-ONLY SUBPATTERNS
1522     With both maximizing and minimizing repetition, failure of
1523     what follows normally causes the repeated item to be re-
1524     evaluated to see if a different number of repeats allows the
1525     rest of the pattern to match. Sometimes it is useful to
1526     prevent this, either to change the nature of the match, or
1527     to cause it to fail earlier than it otherwise might, when the
1528     author of the pattern knows there is no point in carrying
1529     on.
1531     Consider, for example, the pattern \d+foo when applied to
1532     the subject line
1534     123456bar
1536     After matching all 6 digits and then failing to match "foo",
1537     the normal action of the matcher is to try again with only 5
1538     digits matching the \d+ item, and then with 4, and so on,
1539     before ultimately failing. Once-only subpatterns provide the
1540     means for specifying that once a portion of the pattern has
1541     matched, it is not to be re-evaluated in this way, so the
1542     matcher would give up immediately on failing to match "foo"
1543     the first time. The notation is another kind of special
1544     parenthesis, starting with (?> as in this example:
1546     (?>\d+)foo
1548     This kind of parenthesis "locks up" the part of the pattern
1549     it contains once it has matched, and a failure further into
1550     the pattern is prevented from backtracking into it. Back-
1551     tracking past it to previous items, however, works as nor-
1552     mal.
1554     An alternative description is that a subpattern of this type
1555     matches the string of characters that an identical stan-
1556     dalone pattern would match, if anchored at the current point
1557     in the subject string.
1559     Once-only subpatterns are not capturing subpatterns. Simple
1560     cases such as the above example can be thought of as a max-
1561     imizing repeat that must swallow everything it can. So,
1562     while both \d+ and \d+? are prepared to adjust the number of
1563     digits they match in order to make the rest of the pattern
1564     match, (?>\d+) can only match an entire sequence of digits.
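
The effect on backtracking can be illustrated with a toy model. The function below is a hypothetical helper written for this explanation, not part of PCRE; it handles only the shape \d+literal, tried at the start of the subject, and counts how many digit-run lengths are attempted.

```python
def match_digits_then(subject, literal, once_only):
    """Model \\d+literal, or (?>\\d+)literal when once_only is true.

    Returns (matched, attempts) for a match tried at offset 0 only.
    """
    # A maximizing \d+ first swallows every leading digit.
    longest = 0
    while longest < len(subject) and subject[longest].isdigit():
        longest += 1
    if longest == 0:
        return False, 0              # \d+ needs at least one digit

    attempts = 0
    for count in range(longest, 0, -1):
        attempts += 1
        if subject.startswith(literal, count):
            return True, attempts
        if once_only:
            break                    # no backtracking into (?>\d+)
    return False, attempts


print(match_digits_then("123456bar", "foo", once_only=False))
print(match_digits_then("123456bar", "foo", once_only=True))
print(match_digits_then("123", "3", once_only=False))
print(match_digits_then("123", "3", once_only=True))
```

On "123456bar" the ordinary repeat makes six attempts before failing, while the once-only form gives up after one. The last two calls show the change in the nature of the match: \d+3 matches "123" by giving back one digit, whereas (?>\d+)3 cannot match it at all.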
1566     This construction can of course contain arbitrarily compli-
1567     cated subpatterns, and it can be nested.
1569     Once-only subpatterns can be used in conjunction with look-
1570     behind assertions to specify efficient matching at the end
1571     of the subject string. Consider a simple pattern such as
1573     abcd$
1575     when applied to a long string which does not match it.
1576     Because matching proceeds from left to right, PCRE will look
1577     for each "a" in the subject and then see if what follows
1578     matches the rest of the pattern. If the pattern is specified
1579     as
1581     ^.*abcd$
1583     then the initial .* matches the entire string at first, but
1584     when this fails, it backtracks to match all but the last
1585     character, then all but the last two characters, and so on.
1586     Once again the search for "a" covers the entire string, from
1587     right to left, so we are no better off. However, if the pat-
1588     tern is written as
1590     ^(?>.*)(?<=abcd)
1592     then there can be no backtracking for the .* item; it can
1593     match only the entire string. The subsequent lookbehind
1594     assertion does a single test on the last four characters. If
1595     it fails, the match fails immediately. For long strings,
1596     this approach makes a significant difference to the process-
1597     ing time.
1601     CONDITIONAL SUBPATTERNS
1602     It is possible to cause the matching process to obey a sub-
1603     pattern conditionally or to choose between two alternative
1604     subpatterns, depending on the result of an assertion, or
1605     whether a previous capturing subpattern matched or not. The
1606     two possible forms of conditional subpattern are
1608     (?(condition)yes-pattern)
1609     (?(condition)yes-pattern|no-pattern)
1611     If the condition is satisfied, the yes-pattern is used; oth-
1612     erwise the no-pattern (if present) is used. If there are
1613     more than two alternatives in the subpattern, a compile-time
1614     error occurs.
1616     There are two kinds of condition. If the text between the
1617     parentheses consists of a sequence of digits, then the
1618     condition is satisfied if the capturing subpattern of that
1619     number has previously matched. Consider the following pat-
1620     tern, which contains non-significant white space to make it
1621     more readable (assume the PCRE_EXTENDED option) and to
1622     divide it into three parts for ease of discussion:
1624     ( \( )? [^()]+ (?(1) \) )
1626     The first part matches an optional opening parenthesis, and
1627     if that character is present, sets it as the first captured
1628     substring. The second part matches one or more characters
1629     that are not parentheses. The third part is a conditional
1630     subpattern that tests whether the first set of parentheses
1631     matched or not. If they did, that is, if the subject started
1632     with an opening parenthesis, the condition is true, and so
1633     the yes-pattern is executed and a closing parenthesis is
1634     required. Otherwise, since no-pattern is not present, the
1635     subpattern matches nothing. In other words, this pattern
1636     matches a sequence of non-parentheses, optionally enclosed
1637     in parentheses.
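
As a sketch, the same pattern can be tried with Python's re module, which also accepts a capture-group number as a condition (an assumption made for illustration; re.VERBOSE plays the role of PCRE_EXTENDED here).

```python
import re

# Optional "(", one or more non-parentheses, and a ")" that is
# required only if group 1 (the opening parenthesis) took part.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, bool(optional_parens.fullmatch(subject)))
```

Matching against the whole subject accepts "(abc)" and "abc" and rejects the two unbalanced strings.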
1639     If the condition is not a sequence of digits, it must be an
1640     assertion. This may be a positive or negative lookahead or
1641     lookbehind assertion. Consider this pattern, again contain-
1642     ing non-significant white space, and with the two alterna-
1643     tives on the second line:
1645     (?(?=[^a-z]*[a-z])
1646     \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1648     The condition is a positive lookahead assertion that matches
1649     an optional sequence of non-letters followed by a letter. In
1650     other words, it tests for the presence of at least one
1651     letter in the subject. If a letter is found, the subject is
1652     matched against the first alternative; otherwise it is
1653     matched against the second. This pattern matches strings in
1654     one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1655     letters and dd are digits.
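
Python's re module, used here as a stand-in, accepts only a group reference as a condition, not an assertion. The sketch below therefore swaps in a different technique with the same effect: the assertion guards the first alternative and its negation guards the second. This rewrite is an illustration, not the form PCRE uses.

```python
import re

date_like = re.compile(
    r"""(?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter ahead
        )""",
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(repr(subject), bool(date_like.search(subject)))
```

The last subject is rejected: the letter "x" later in the subject makes the condition true, so only the dd-aaa-dd form is tried.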
1659     COMMENTS
1660     The sequence (?# marks the start of a comment which contin-
1661     ues up to the next closing parenthesis. Nested parentheses
1662     are not permitted. The characters that make up a comment
1663     play no part in the pattern matching at all.
1665     If the PCRE_EXTENDED option is set, an unescaped # character
1666     outside a character class introduces a comment that contin-
1667     ues up to the next newline character in the pattern.
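
Both comment forms can be seen in a short sketch. It assumes Python's re module, whose (?#...) syntax and re.VERBOSE flag behave like the PCRE features described above for these simple cases.

```python
import re

# An inline comment plays no part in the match.
inline = re.compile(r"ab(?#this text is ignored)c")

# With the extended option, "#" starts a comment that runs to the
# end of the line, except inside a character class.
extended = re.compile(r"""
    \d{3}   # three digits
    [#]     # a literal "#", because it is inside a class
    \d+     # one or more digits
""", re.VERBOSE)

print(bool(inline.fullmatch("abc")))
print(bool(extended.fullmatch("123#45")))
```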
1671     PERFORMANCE
1672     Certain items that may appear in patterns are more efficient
1673     than others. It is more efficient to use a character class
1674     like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1675     In general, the simplest construction that provides the
1676     required behaviour is usually the most efficient. Jeffrey
1677     Friedl's book contains a lot of discussion about optimizing
1678     regular expressions for efficient performance.
1680     When a pattern begins with .* and the PCRE_DOTALL option is
1681     set, the pattern is implicitly anchored by PCRE, since it
1682     can match only at the start of a subject string. However, if
1683     PCRE_DOTALL is not set, PCRE cannot make this optimization,
1684     because the . metacharacter does not then match a newline,
1685     and if the subject string contains newlines, the pattern may
1686     match from the character immediately following one of them
1687     instead of from the very start. For example, the pattern
1689     (.*) second
1691     matches the subject "first\nand second" (where \n stands for
1692     a newline character) with the first captured substring being
1693     "and". In order to do this, PCRE has to retry the match
1694     starting after every newline in the subject.
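
The example can be reproduced as a sketch with Python's re module standing in for PCRE; its re.DOTALL flag corresponds to PCRE_DOTALL.

```python
import re

subject = "first\nand second"

# Without DOTALL, "." stops at the newline, so the match has to
# start just after it.
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL, ".*" can cross the newline and the match starts at
# the very beginning of the subject.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

print(repr(without_dotall.group(1)), without_dotall.start())
print(repr(with_dotall.group(1)), with_dotall.start())
```

The first search captures "and" starting at offset 6; the second captures "first\nand" starting at offset 0.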
1696     If you are using such a pattern with subject strings that do
1697     not contain newlines, the best performance is obtained by
1698     setting PCRE_DOTALL, or starting the pattern with ^.* to
1699     indicate explicit anchoring. That saves PCRE from having to
1700     scan along the subject looking for a newline to restart at.
1702     Beware of patterns that contain nested indefinite repeats.
1703     These can take a long time to run when applied to a string
1704     that does not match. Consider the pattern fragment
1706     (a+)*
1708     This can match "aaaa" in 33 different ways, and this number
1709     increases very rapidly as the string gets longer. (The *
1710     repeat can match 0, 1, 2, 3, or 4 times, and for each of
1711     those cases other than 0, the + repeats can match different
1712     numbers of times.) When the remainder of the pattern is such
1713     that the entire match is going to fail, PCRE has in princi-
1714     ple to try every possible variation, and this can take an
1715     extremely long time.
1717     An optimization catches some of the simpler cases, such
1718     as
1720     (a+)*b
1722     where a literal character follows. Before embarking on the
1723     standard matching procedure, PCRE checks that there is a "b"
1724     later in the subject string, and if there is not, it fails
1725     the match immediately. However, when there is no following
1726     literal this optimization cannot be used. You can see the
1727     difference by comparing the behaviour of
1729     (a+)*\d
1731     with the pattern above. The pattern (a+)*b gives a failure
1732     almost instantly when applied to a whole line of "a"
1733     characters, whereas (a+)*\d takes an appreciable time with
1734     strings longer than about 20 characters.
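
The idea behind the literal-character optimization can be sketched as a cheap test made before the expensive one. The wrapper below is hypothetical and is not how PCRE implements it; the caller has to supply the literal that every match must contain, and Python's re module does the actual matching.

```python
import re


def search_with_precheck(pattern, required, subject):
    """Search for pattern, but fail at once if required is absent."""
    # If the literal that every match must contain is missing, no
    # amount of backtracking can produce a match.
    if required not in subject:
        return None
    return re.search(pattern, subject)


print(search_with_precheck(r"(a+)*b", "b", "a" * 40))
print(search_with_precheck(r"(a+)*b", "b", "aaab").group())
```

The first call returns None without running the matcher. No such shortcut is available for (a+)*\d, because \d is not a single literal character.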
1738     AUTHOR
1739     Philip Hazel <ph10@cam.ac.uk>
1740     University Computing Service,
1741     New Museums Site,
1742     Cambridge CB2 3QG, England.
1743     Phone: +44 1223 334714
1745     Last updated: 29 July 1999
1746     Copyright (c) 1997-1999 University of Cambridge.
