
Contents of /code/trunk/doc/pcre.txt


Revision 47
Sat Feb 24 21:39:29 2007 UTC, by nigel
File MIME type: text/plain
File size: 87371 byte(s)
Load pcre-3.2 into code/trunk.

NAME
     pcre - Perl-compatible regular expressions.

SYNOPSIS
     #include <pcre.h>

     pcre *pcre_compile(const char *pattern, int options,
          const char **errptr, int *erroffset,
          const unsigned char *tableptr);

     pcre_extra *pcre_study(const pcre *code, int options,
          const char **errptr);

     int pcre_exec(const pcre *code, const pcre_extra *extra,
          const char *subject, int length, int startoffset,
          int options, int *ovector, int ovecsize);

     int pcre_copy_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber, char *buffer,
          int buffersize);

     int pcre_get_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber,
          const char **stringptr);

     int pcre_get_substring_list(const char *subject,
          int *ovector, int stringcount, const char ***listptr);

     const unsigned char *pcre_maketables(void);

     int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
          int what, void *where);

     int pcre_info(const pcre *code, int *optptr,
          int *firstcharptr);

     char *pcre_version(void);

     void *(*pcre_malloc)(size_t);

     void (*pcre_free)(void *);

DESCRIPTION
     The PCRE library is a set of functions that implement regular
     expression pattern matching using the same syntax and semantics
     as Perl 5, with just a few differences (see below). The current
     implementation corresponds to Perl 5.005, with some additional
     features from the Perl development release.

     PCRE has its own native API, which is described in this
     document. There is also a set of wrapper functions that
     correspond to the POSIX regular expression API. These are
     described in the pcreposix documentation.

     The native API function prototypes are defined in the header
     file pcre.h, and on Unix systems the library itself is called
     libpcre.a, so it can be accessed by adding -lpcre to the
     command for linking an application which calls it. The header
     file defines the macros PCRE_MAJOR and PCRE_MINOR to contain
     the major and minor release numbers for the library.
     Applications can use these to include support for different
     releases.

     The functions pcre_compile(), pcre_study(), and pcre_exec() are
     used for compiling and matching regular expressions, while
     pcre_copy_substring(), pcre_get_substring(), and
     pcre_get_substring_list() are convenience functions for
     extracting captured substrings from a matched subject string.
     The function pcre_maketables() is used (optionally) to build a
     set of character tables in the current locale for passing to
     pcre_compile().

     The function pcre_fullinfo() is used to find out information
     about a compiled pattern; pcre_info() is an obsolete version
     which returns only some of the available information, but is
     retained for backwards compatibility. The function
     pcre_version() returns a pointer to a string containing the
     version of PCRE and its date of release.

     The global variables pcre_malloc and pcre_free initially
     contain the entry points of the standard malloc() and free()
     functions respectively. PCRE calls the memory management
     functions via these variables, so a calling program can replace
     them if it wishes to intercept the calls. This should be done
     before calling any PCRE functions.
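
     The interception idea can be sketched in C roughly as follows.
     To keep the fragment compilable without the library, hook_malloc
     and hook_free are local stand-ins declared with the same types
     that the synopsis gives for pcre_malloc and pcre_free; they, the
     counting wrappers, and install_counting_hooks() are illustrative
     names, not part of PCRE. In a real program the two assignments
     would presumably target the PCRE variables instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins with the same types as pcre_malloc and pcre_free.
   They are NOT the library's variables; they only let this
   sketch compile on its own. */
static void *(*hook_malloc)(size_t) = malloc;
static void (*hook_free)(void *) = free;

static int live_blocks = 0;   /* blocks obtained and not yet released */

/* Hypothetical wrapper: counts successful allocations. */
static void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        live_blocks++;
    return block;
}

/* Hypothetical wrapper: counts releases of non-NULL blocks. */
static void counting_free(void *block)
{
    if (block != NULL) {
        live_blocks--;
        assert(live_blocks >= 0);   /* more frees than allocations */
    }
    free(block);
}

/* Install the wrappers. As the text advises, this belongs before
   the first call that could allocate through the hooks. */
static void install_counting_hooks(void)
{
    hook_malloc = counting_malloc;
    hook_free = counting_free;
}
```

     Installing the hooks first means every block is obtained and
     released through the same pair of functions, so a counter like
     live_blocks stays balanced.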

MULTI-THREADING
     The PCRE functions can be used in multi-threading applications,
     with the proviso that the memory management functions pointed
     to by pcre_malloc and pcre_free are shared by all threads.

     The compiled form of a regular expression is not altered during
     matching, so the same compiled pattern can safely be used by
     several threads at once.

COMPILING A PATTERN
     The function pcre_compile() is called to compile a pattern into
     an internal form. The pattern is a C string terminated by a
     binary zero, and is passed in the argument pattern. A pointer
     to a single block of memory that is obtained via pcre_malloc is
     returned. This contains the compiled code and related data. The
     pcre type is defined for this for convenience, but in fact pcre
     is just a typedef for void, since the contents of the block are
     not externally defined. It is up to the caller to free the
     memory when it is no longer required.

     The size of a compiled pattern is roughly proportional to the
     length of the pattern string, except that each character class
     (other than those containing just a single character, negated
     or not) requires 33 bytes, and repeat quantifiers with a
     minimum greater than one or a bounded maximum cause the
     relevant portions of the compiled pattern to be replicated.

     The options argument contains independent bits that affect the
     compilation. It should be zero if no options are required. Some
     of the options, in particular those that are compatible with
     Perl, can also be set and unset from within the pattern (see
     the detailed description of regular expressions below). For
     these options, the contents of the options argument specify
     their initial settings at the start of compilation and
     execution. The PCRE_ANCHORED option can be set at the time of
     matching as well as at compile time.

     If errptr is NULL, pcre_compile() returns NULL immediately.
     Otherwise, if compilation of a pattern fails, pcre_compile()
     returns NULL, and sets the variable pointed to by errptr to
     point to a textual error message. The offset from the start of
     the pattern to the character where the error was discovered is
     placed in the variable pointed to by erroffset, which must not
     be NULL. If it is, an immediate error is given.

     If the final argument, tableptr, is NULL, PCRE uses a default
     set of character tables which are built when it is compiled,
     using the default C locale. Otherwise, tableptr must be the
     result of a call to pcre_maketables(). See the section on
     locale support below.

     The following option bits are defined in the header file:

       PCRE_ANCHORED

     If this bit is set, the pattern is forced to be "anchored",
     that is, it is constrained to match only at the start of the
     string which is being searched (the "subject string"). This
     effect can also be achieved by appropriate constructs in the
     pattern itself, which is the only way to do it in Perl.

       PCRE_CASELESS

     If this bit is set, letters in the pattern match both upper and
     lower case letters. It is equivalent to Perl's /i option.

       PCRE_DOLLAR_ENDONLY

     If this bit is set, a dollar metacharacter in the pattern
     matches only at the end of the subject string. Without this
     option, a dollar also matches immediately before the final
     character if it is a newline (but not before any other
     newlines). The PCRE_DOLLAR_ENDONLY option is ignored if
     PCRE_MULTILINE is set. There is no equivalent to this option in
     Perl.

       PCRE_DOTALL

     If this bit is set, a dot metacharacter in the pattern matches
     all characters, including newlines. Without it, newlines are
     excluded. This option is equivalent to Perl's /s option. A
     negative class such as [^a] always matches a newline character,
     independent of the setting of this option.

       PCRE_EXTENDED

     If this bit is set, whitespace data characters in the pattern
     are totally ignored except when escaped or inside a character
     class, and characters between an unescaped # outside a
     character class and the next newline character, inclusive, are
     also ignored. This is equivalent to Perl's /x option, and makes
     it possible to include comments inside complicated patterns.
     Note, however, that this applies only to data characters.
     Whitespace characters may never appear within special character
     sequences in a pattern, for example within the sequence (?(
     which introduces a conditional subpattern.

       PCRE_EXTRA

     This option was invented in order to turn on additional
     functionality of PCRE that is incompatible with Perl, but it is
     currently of very little use. When set, any backslash in a
     pattern that is followed by a letter that has no special
     meaning causes an error, thus reserving these combinations for
     future expansion. By default, as in Perl, a backslash followed
     by a letter with no special meaning is treated as a literal.
     There are at present no other features controlled by this
     option. It can also be set by a (?X) option setting within a
     pattern.

       PCRE_MULTILINE

     By default, PCRE treats the subject string as consisting of a
     single "line" of characters (even if it actually contains
     several newlines). The "start of line" metacharacter (^)
     matches only at the start of the string, while the "end of
     line" metacharacter ($) matches only at the end of the string,
     or before a terminating newline (unless PCRE_DOLLAR_ENDONLY is
     set). This is the same as Perl.

     When PCRE_MULTILINE is set, the "start of line" and "end of
     line" constructs match immediately following or immediately
     before any newline in the subject string, respectively, as well
     as at the very start and end. This is equivalent to Perl's /m
     option. If there are no "\n" characters in a subject string, or
     no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE
     has no effect.

       PCRE_UNGREEDY

     This option inverts the "greediness" of the quantifiers so that
     they are not greedy by default, but become greedy if followed
     by "?". It is not compatible with Perl. It can also be set by a
     (?U) option setting within the pattern.

STUDYING A PATTERN
     When a pattern is going to be used several times, it is worth
     spending more time analyzing it in order to speed up the time
     taken for matching. The function pcre_study() takes a pointer
     to a compiled pattern as its first argument, and returns a
     pointer to a pcre_extra block (another void typedef) containing
     additional information about the pattern; this can be passed to
     pcre_exec(). If no additional information is available, NULL is
     returned.

     The second argument contains option bits. At present, no
     options are defined for pcre_study(), and this argument should
     always be zero.

     The third argument for pcre_study() is a pointer to an error
     message. If studying succeeds (even if no data is returned),
     the variable it points to is set to NULL. Otherwise it points
     to a textual error message.

     At present, studying a pattern is useful only for non-anchored
     patterns that do not have a single fixed starting character. A
     bitmap of possible starting characters is created.

LOCALE SUPPORT
     PCRE handles caseless matching, and determines whether
     characters are letters, digits, or whatever, by reference to a
     set of tables. The library contains a default set of tables
     which is created in the default C locale when PCRE is compiled.
     This is used when the final argument of pcre_compile() is NULL,
     and is sufficient for many applications.

     An alternative set of tables can, however, be supplied. Such
     tables are built by calling the pcre_maketables() function,
     which has no arguments, in the relevant locale. The result can
     then be passed to pcre_compile() as often as necessary. For
     example, to build and use tables that are appropriate for the
     French locale (where accented characters with codes greater
     than 128 are treated as letters), the following code could be
     used:

       setlocale(LC_CTYPE, "fr");
       tables = pcre_maketables();
       re = pcre_compile(..., tables);

     The tables are built in memory that is obtained via
     pcre_malloc. The pointer that is passed to pcre_compile() is
     saved with the compiled pattern, and the same tables are used
     via this pointer by pcre_study() and pcre_exec(). Thus for any
     single pattern, compilation, studying and matching all happen
     in the same locale, but different patterns can be compiled in
     different locales. It is the caller's responsibility to ensure
     that the memory containing the tables remains available for as
     long as it is needed.

INFORMATION ABOUT A PATTERN
     The pcre_fullinfo() function returns information about a
     compiled pattern. It replaces the obsolete pcre_info()
     function, which is nevertheless retained for backwards
     compatibility (and is documented below).

     The first argument for pcre_fullinfo() is a pointer to the
     compiled pattern. The second argument is the result of
     pcre_study(), or NULL if the pattern was not studied. The third
     argument specifies which piece of information is required,
     while the fourth argument is a pointer to a variable to receive
     the data. The yield of the function is zero for success, or one
     of the following negative numbers:

       PCRE_ERROR_NULL       the argument code was NULL
                             the argument where was NULL
       PCRE_ERROR_BADMAGIC   the "magic number" was not found
       PCRE_ERROR_BADOPTION  the value of what was invalid

     The possible values for the third argument are defined in
     pcre.h, and are as follows:

       PCRE_INFO_OPTIONS

     Return a copy of the options with which the pattern was
     compiled. The fourth argument should point to an unsigned long
     int variable. These option bits are those specified in the call
     to pcre_compile(), modified by any top-level option settings
     within the pattern itself, and with the PCRE_ANCHORED bit
     forcibly set if the form of the pattern implies that it can
     match only at the start of a subject string.

       PCRE_INFO_SIZE

     Return the size of the compiled pattern, that is, the value
     that was passed as the argument to pcre_malloc() when PCRE was
     getting memory in which to place the compiled data. The fourth
     argument should point to a size_t variable.

       PCRE_INFO_CAPTURECOUNT

     Return the number of capturing subpatterns in the pattern. The
     fourth argument should point to an int variable.

       PCRE_INFO_BACKREFMAX

     Return the number of the highest back reference in the pattern.
     The fourth argument should point to an int variable. Zero is
     returned if there are no back references.

       PCRE_INFO_FIRSTCHAR

     Return information about the first character of any matched
     string, for a non-anchored pattern. If there is a fixed first
     character, e.g. from a pattern such as (cat|cow|coyote), it is
     returned in the integer pointed to by where. Otherwise, if
     either

     (a) the pattern was compiled with the PCRE_MULTILINE option,
     and every branch starts with "^", or

     (b) every branch of the pattern starts with ".*" and
     PCRE_DOTALL is not set (if it were set, the pattern would be
     anchored),

     -1 is returned, indicating that the pattern matches only at the
     start of a subject string or after any "\n" within the string.
     Otherwise -2 is returned. For anchored patterns, -2 is
     returned.

       PCRE_INFO_FIRSTTABLE

     If the pattern was studied, and this resulted in the
     construction of a 256-bit table indicating a fixed set of
     characters for the first character in any matching string, a
     pointer to the table is returned. Otherwise NULL is returned.
     The fourth argument should point to an unsigned char *
     variable.

       PCRE_INFO_LASTLITERAL

     For a non-anchored pattern, return the value of the rightmost
     literal character which must exist in any matched string, other
     than at its start. The fourth argument should point to an int
     variable. If there is no such character, or if the pattern is
     anchored, -1 is returned. For example, for the pattern
     /a\d+z\d+/ the returned value is 'z'.

     The pcre_info() function is now obsolete because its interface
     is too restrictive to return all the available data about a
     compiled pattern. New programs should use pcre_fullinfo()
     instead. The yield of pcre_info() is the number of capturing
     subpatterns, or one of the following negative numbers:

       PCRE_ERROR_NULL       the argument code was NULL
       PCRE_ERROR_BADMAGIC   the "magic number" was not found

     If the optptr argument is not NULL, a copy of the options with
     which the pattern was compiled is placed in the integer it
     points to (see PCRE_INFO_OPTIONS above).

     If the pattern is not anchored and the firstcharptr argument is
     not NULL, it is used to pass back information about the first
     character of any matched string (see PCRE_INFO_FIRSTCHAR
     above).

MATCHING A PATTERN
     The function pcre_exec() is called to match a subject string
     against a pre-compiled pattern, which is passed in the code
     argument. If the pattern has been studied, the result of the
     study should be passed in the extra argument. Otherwise this
     must be NULL.

     The PCRE_ANCHORED option can be passed in the options argument,
     whose unused bits must be zero. However, if a pattern was
     compiled with PCRE_ANCHORED, or turned out to be anchored by
     virtue of its contents, it cannot be made unanchored at
     matching time.

     There are also three further options that can be set only at
     matching time:

       PCRE_NOTBOL

     The first character of the string is not the beginning of a
     line, so the circumflex metacharacter should not match before
     it. Setting this without PCRE_MULTILINE (at compile time)
     causes circumflex never to match.

       PCRE_NOTEOL

     The end of the string is not the end of a line, so the dollar
     metacharacter should not match it nor (except in multiline
     mode) a newline immediately before it. Setting this without
     PCRE_MULTILINE (at compile time) causes dollar never to match.

       PCRE_NOTEMPTY

     An empty string is not considered to be a valid match if this
     option is set. If there are alternatives in the pattern, they
     are tried. If all the alternatives match the empty string, the
     entire match fails. For example, if the pattern

       a?b?

     is applied to a string not beginning with "a" or "b", it
     matches the empty string at the start of the subject. With
     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
     further into the string for occurrences of "a" or "b".

     Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
     make a special case of a pattern match of the empty string
     within its split() function, and when using the /g modifier. It
     is possible to emulate Perl's behaviour after matching a null
     string by first trying the match again at the same offset with
     PCRE_NOTEMPTY set, and then if that fails by advancing the
     starting offset (see below) and trying an ordinary match again.
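
     The retry-then-advance procedure can be sketched in C roughly
     as follows. The sketch does not call the library: exec_fn is a
     hypothetical stand-in for a pcre_exec() call with the pattern
     already bound, toy_exec() is a toy matcher for the pattern a*,
     and DEMO_NOTEMPTY is an illustrative bit, not the value from
     pcre.h. It follows the description literally, so the retry is
     not anchored and may move past positions where Perl would
     report a further empty match.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative option bit, NOT the value from pcre.h. */
#define DEMO_NOTEMPTY 0x01

/* Stand-in for a pcre_exec() call with the pattern already bound:
   returns 1 and fills ovector[0..1] on a match, -1 on no match. */
typedef int (*exec_fn)(const char *subject, int length,
                       int startoffset, int options, int *ovector);

/* Toy unanchored matcher for a*: at each position from
   startoffset on, take the longest run of 'a'; an empty run is
   rejected when DEMO_NOTEMPTY is set. */
static int toy_exec(const char *subject, int length,
                    int startoffset, int options, int *ovector)
{
    int start, end;
    for (start = startoffset; start <= length; start++) {
        end = start;
        while (end < length && subject[end] == 'a')
            end++;
        if (end > start || !(options & DEMO_NOTEMPTY)) {
            ovector[0] = start;
            ovector[1] = end;
            return 1;
        }
    }
    return -1;
}

/* Collect up to max_pairs matches: after an empty match, retry at
   the same offset with NOTEMPTY, and only if that fails advance
   by one character and match normally again. */
static int find_all(exec_fn exec, const char *subject, int length,
                    int *pairs, int max_pairs)
{
    int count = 0, offset = 0, options = 0;
    int ov[2];

    assert(exec != NULL && pairs != NULL);
    while (offset <= length && count < max_pairs) {
        if (exec(subject, length, offset, options, ov) < 0) {
            if (options == 0)
                break;          /* ordinary match failed: finished */
            options = 0;        /* the NOTEMPTY retry failed, so */
            offset++;           /* advance and match normally */
            continue;
        }
        pairs[2 * count] = ov[0];
        pairs[2 * count + 1] = ov[1];
        count++;
        offset = ov[1];
        /* An empty match forces NOTEMPTY on the next attempt. */
        options = (ov[0] == ov[1]) ? DEMO_NOTEMPTY : 0;
    }
    return count;
}
```

     For the subject "baac" this yields four matches: an empty one
     at offset 0, "aa" at offsets 1 to 3, and empty ones at offsets
     3 and 4. Without the NOTEMPTY retry the loop would report the
     empty match at offset 0 forever.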

     The subject string is passed as a pointer in subject, a length
     in length, and a starting offset in startoffset. Unlike the
     pattern string, it may contain binary zero characters. When the
     starting offset is zero, the search for a match starts at the
     beginning of the subject, and this is by far the most common
     case.

     A non-zero starting offset is useful when searching for another
     match in the same subject by calling pcre_exec() again after a
     previous success. Setting startoffset differs from just passing
     over a shortened string and setting PCRE_NOTBOL in the case of
     a pattern that begins with any kind of lookbehind. For example,
     consider the pattern

       \Biss\B

     which finds occurrences of "iss" in the middle of words. (\B
     matches only if the current position in the subject is not a
     word boundary.) When applied to the string "Mississippi" the
     first call to pcre_exec() finds the first occurrence. If
     pcre_exec() is called again with just the remainder of the
     subject, namely "issippi", it does not match, because \B is
     always false at the start of the subject, which is deemed to be
     a word boundary. However, if pcre_exec() is passed the entire
     string again, but with startoffset set to 4, it finds the
     second occurrence of "iss" because it is able to look behind
     the starting point to discover that it is preceded by a letter.
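
     Why the two calls differ can be seen from a small word-boundary
     test written in plain C. This is only an illustration of the
     idea, not PCRE's implementation: at_word_boundary() and
     is_word_char() are hypothetical helpers, and the definition of
     a "word" character as a letter, digit, or underscore is an
     assumption made for the sketch.

```c
#include <assert.h>
#include <ctype.h>

/* Assumed definition of a word character for this sketch. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos in subject[0..length) lies on a word
   boundary. Anything outside the subject counts as a non-word
   character, so the very start of a subject that begins with a
   letter is always a boundary. */
static int at_word_boundary(const char *subject, int length, int pos)
{
    int before, after;

    assert(pos >= 0 && pos <= length);
    before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

     With the whole string, at_word_boundary("Mississippi", 11, 4)
     is 0 because the character before offset 4 is visible, so \B
     can succeed there. With the shortened string,
     at_word_boundary("issippi", 7, 0) is 1, so \B fails.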

     If a non-zero starting offset is passed when the pattern is
     anchored, one attempt to match at the given offset is tried.
     This can only succeed if the pattern does not require the match
     to be at the start of the subject.

     In general, a pattern matches a certain portion of the subject,
     and in addition, further substrings from the subject may be
     picked out by parts of the pattern. Following the usage in
     Jeffrey Friedl's book, this is called "capturing" in what
     follows, and the phrase "capturing subpattern" is used for a
     fragment of a pattern that picks out a substring. PCRE supports
     several other kinds of parenthesized subpattern that do not
     cause substrings to be captured.

     Captured substrings are returned to the caller via a vector of
     integer offsets whose address is passed in ovector. The number
     of elements in the vector is passed in ovecsize. The first
     two-thirds of the vector is used to pass back captured
     substrings, each substring using a pair of integers. The
     remaining third of the vector is used as workspace by
     pcre_exec() while matching capturing subpatterns, and is not
     available for passing back information. The length passed in
     ovecsize should always be a multiple of three. If it is not, it
     is rounded down.

     When a match has been successful, information about captured
     substrings is returned in pairs of integers, starting at the
     beginning of ovector, and continuing up to two-thirds of its
     length at the most. The first element of a pair is set to the
     offset of the first character in a substring, and the second is
     set to the offset of the first character after the end of a
     substring. The first pair, ovector[0] and ovector[1],
     identifies the portion of the subject string matched by the
     entire pattern. The next pair is used for the first capturing
     subpattern, and so on. The value returned by pcre_exec() is the
     number of pairs that have been set. If there are no capturing
     subpatterns, the return value from a successful match is 1,
     indicating that just the first pair of offsets has been set.

     Some convenience functions are provided for extracting the
     captured substrings as separate strings. These are described in
     the following section.

     It is possible for a capturing subpattern number n+1 to match
     some part of the subject when subpattern n has not been used at
     all. For example, if the string "abc" is matched against the
     pattern (a|(z))(bc) subpatterns 1 and 3 are matched, but 2 is
     not. When this happens, both offset values corresponding to the
     unused subpattern are set to -1.
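
     Reading the offset pairs, including an unset one, can be
     sketched in C roughly as follows. The helper names
     group_is_set(), group_length(), and print_groups() are
     illustrative and not part of PCRE, and the sketch assumes the
     subject contains no binary zeros, since it prints with printf().

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if capturing subpattern n was set, 0 if it was unset
   (both offsets are -1 in that case). */
static int group_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Length of the text captured by subpattern n, or -1 if unset.
   The second offset of a pair points one past the last character. */
static int group_length(const int *ovector, int n)
{
    if (!group_is_set(ovector, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Print every pair reported by a successful match; rc is the
   positive value returned by pcre_exec(). */
static void print_groups(const char *subject, const int *ovector,
                         int rc)
{
    int n;

    assert(rc > 0);
    for (n = 0; n < rc; n++) {
        if (group_is_set(ovector, n))
            printf("%d: %.*s\n", n, group_length(ovector, n),
                   subject + ovector[2 * n]);
        else
            printf("%d: <unset>\n", n);
    }
}
```

     For "abc" matched against (a|(z))(bc), the description above
     implies the first eight elements of ovector would be
     0, 3, 0, 1, -1, -1, 1, 3 with a return value of 4. Filling a
     vector with those values by hand and calling print_groups()
     prints "abc", "a", "<unset>", and "bc" for pairs 0 to 3.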

     If a capturing subpattern is matched repeatedly, it is the last
     portion of the string that it matched that gets returned.

     If the vector is too small to hold all the captured substrings,
     it is used as far as possible (up to two-thirds of its length),
     and the function returns a value of zero. In particular, if the
     substring offsets are not of interest, pcre_exec() may be
     called with ovector passed as NULL and ovecsize as zero.
     However, if the pattern contains back references and the
     ovector isn't big enough to remember the related substrings,
     PCRE has to get additional memory for use during matching. Thus
     it is usually advisable to supply an ovector.

     Note that pcre_info() can be used to find out how many
     capturing subpatterns there are in a compiled pattern. The
     smallest size for ovector that will allow for n captured
     substrings in addition to the offsets of the substring matched
     by the whole pattern is (n+1)*3.
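
     The sizing arithmetic can be written down as two small C
     helpers. Both names are illustrative and not part of PCRE; the
     second one restates the two-thirds rule given earlier for how
     much of the vector can carry results.

```c
#include <assert.h>

/* Smallest ovecsize that can report n capturing subpatterns plus
   the match of the whole pattern: (n + 1) * 3. */
static int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that fit in a vector of ovecsize ints.
   The size is rounded down to a multiple of three, two-thirds of
   that carries results, and each pair takes two ints, which works
   out to one pair per three ints. */
static int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

     For a pattern with two capturing subpatterns,
     ovector_size_for(2) is 9. A vector of 10 ints holds no more
     pairs than one of 9, since usable_pairs() gives 3 for both.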

     If pcre_exec() fails, it returns a negative number. The
     following are defined in the header file:

       PCRE_ERROR_NOMATCH (-1)

     The subject string did not match the pattern.

       PCRE_ERROR_NULL (-2)

     Either code or subject was passed as NULL, or ovector was NULL
     and ovecsize was not zero.

       PCRE_ERROR_BADOPTION (-3)

     An unrecognized bit was set in the options argument.

       PCRE_ERROR_BADMAGIC (-4)

     PCRE stores a 4-byte "magic number" at the start of the
     compiled code, to catch the case when it is passed a junk
     pointer. This is the error it gives when the magic number isn't
     present.

       PCRE_ERROR_UNKNOWN_NODE (-5)

     While running the pattern match, an unknown item was
     encountered in the compiled pattern. This error could be caused
     by a bug in PCRE or by overwriting of the compiled pattern.

       PCRE_ERROR_NOMEMORY (-6)

     If a pattern contains back references, but the ovector that is
     passed to pcre_exec() is not big enough to remember the
     referenced substrings, PCRE gets a block of memory at the start
     of matching to use for this purpose. If the call via
     pcre_malloc() fails, this error is given. The memory is freed
     at the end of matching.

EXTRACTING CAPTURED SUBSTRINGS
     Captured substrings can be accessed directly by using the
     offsets returned by pcre_exec() in ovector. For convenience,
     the functions pcre_copy_substring(), pcre_get_substring(), and
     pcre_get_substring_list() are provided for extracting captured
     substrings as new, separate, zero-terminated strings. A
     substring that contains a binary zero is correctly extracted
     and has a further zero added on the end, but the result does
     not, of course, function as a C string.

     The first three arguments are the same for all three functions:
     subject is the subject string which has just been successfully
     matched, ovector is a pointer to the vector of integer offsets
     that was passed to pcre_exec(), and stringcount is the number
     of substrings that were captured by the match, including the
     substring that matched the entire regular expression. This is
     the value returned by pcre_exec() if it is greater than zero.
     If pcre_exec() returned zero, indicating that it ran out of
     space in ovector, the value passed as stringcount should be the
     size of the vector divided by three.

     The functions pcre_copy_substring() and pcre_get_substring()
     extract a single substring, whose number is given as
     stringnumber. A value of zero extracts the substring that
     matched the entire pattern, while higher values extract the
     captured substrings. For pcre_copy_substring(), the string is
     placed in buffer, whose length is given by buffersize, while
     for pcre_get_substring() a new block of store is obtained via
     pcre_malloc, and its address is returned via stringptr. The
     yield of the function is the length of the string, not
     including the terminating zero, or one of

       PCRE_ERROR_NOMEMORY (-6)

     The buffer was too small for pcre_copy_substring(), or the
     attempt to get memory failed for pcre_get_substring().

       PCRE_ERROR_NOSUBSTRING (-7)

     There is no substring whose number is stringnumber.
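
     The behaviour described for pcre_copy_substring() can be
     approximated by the following C sketch. It is an illustrative
     re-implementation written from the description above, not the
     library's source: demo_copy_substring() and the DEMO_ constants
     are hypothetical names, and the numeric error values are
     assumptions standing in for the constants in pcre.h.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for PCRE_ERROR_NOMEMORY and PCRE_ERROR_NOSUBSTRING;
   the numeric values are assumed for this sketch. */
#define DEMO_ERROR_NOMEMORY    (-6)
#define DEMO_ERROR_NOSUBSTRING (-7)

/* Copy substring number stringnumber into buffer and terminate it
   with a zero. Returns the length copied or a negative error. */
static int demo_copy_substring(const char *subject,
                               const int *ovector,
                               int stringcount, int stringnumber,
                               char *buffer, int buffersize)
{
    int start, end, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);

    if (stringnumber < 0 || stringnumber >= stringcount)
        return DEMO_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    end = ovector[2 * stringnumber + 1];
    /* An unset substring (offsets of -1) yields an empty string. */
    length = (start < 0) ? 0 : end - start;

    if (buffersize < length + 1)    /* need room for the final zero */
        return DEMO_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

     Using memcpy() with an explicit length rather than a string
     copy keeps the sketch consistent with the statement that a
     substring containing a binary zero is extracted in full.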

     The pcre_get_substring_list() function extracts all available
     substrings and builds a list of pointers to them. All this is
     done in a single block of memory which is obtained via
     pcre_malloc. The address of the memory block is returned via
     listptr, which is also the start of the list of string
     pointers. The end of the list is marked by a NULL pointer. The
     yield of the function is zero if all went well, or

       PCRE_ERROR_NOMEMORY (-6)

     if the attempt to get the memory block failed.

     When any of these functions encounter a substring that is
     unset, which can happen when capturing subpattern number n+1
     matches some part of the subject, but subpattern n has not been
     used at all, they return an empty string. This can be
     distinguished from a genuine zero-length substring by
     inspecting the appropriate offset in ovector, which is negative
     for unset substrings.

LIMITATIONS
     There are some size limitations in PCRE but it is hoped that
     they will never in practice be relevant. The maximum length of
     a compiled pattern is 65539 (sic) bytes. All values in
     repeating quantifiers must be less than 65536. The maximum
     number of capturing subpatterns is 99. The maximum number of
     all parenthesized subpatterns, including capturing subpatterns,
     assertions, and other types of subpattern, is 200.

     The maximum length of a subject string is the largest positive
     number that an integer variable can hold. However, PCRE uses
     recursion to handle subpatterns and indefinite repetition. This
     means that the available stack space may limit the size of a
     subject string that can be processed by certain patterns.
690     The differences described here are with respect to Perl
691     5.005.
693     1. By default, a whitespace character is any character that
694     the C library function isspace() recognizes, though it is
695     possible to compile PCRE with alternative character type
696     tables. Normally isspace() matches space, formfeed, newline,
697     carriage return, horizontal tab, and vertical tab. Perl 5 no
698     longer includes vertical tab in its set of whitespace char-
699     acters. The \v escape that was in the Perl documentation for
700     a long time was never in fact recognized. However, the char-
701     acter itself was treated as whitespace at least up to 5.002.
702     In 5.004 and 5.005 it does not match \s.
704     2. PCRE does not allow repeat quantifiers on lookahead
705     assertions. Perl permits them, but they do not mean what you
706     might think. For example, (?!a){3} does not assert that the
707     next three characters are not "a". It just asserts that the
708     next character is not "a" three times.
710     3. Capturing subpatterns that occur inside negative looka-
711     head assertions are counted, but their entries in the
712     offsets vector are never set. Perl sets its numerical vari-
713     ables from any such patterns that are matched before the
714     assertion fails to match something (thereby succeeding), but
715     only if the negative lookahead assertion contains just one
716     branch.
718     4. Though binary zero characters are supported in the sub-
719     ject string, they are not allowed in a pattern string
720     because it is passed as a normal C string, terminated by
721     zero. The escape sequence "\0" can be used in the pattern to
722     represent a binary zero.
724     5. The following Perl escape sequences are not supported:
725     \l, \u, \L, \U, \E, \Q. In fact these are implemented by
726     Perl's general string-handling and are not part of its pat-
727     tern matching engine.
729     6. The Perl \G assertion is not supported as it is not
730     relevant to single pattern matches.
732     7. Fairly obviously, PCRE does not support the (?{code}) and
733     (?p{code}) constructions. However, there is some experimen-
734     tal support for recursive patterns using the non-Perl item
735     (?R).
736     8. There are at the time of writing some oddities in Perl
737     5.005_02 concerned with the settings of captured strings
738     when part of a pattern is repeated. For example, matching
739     "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
740     "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
741     unset. However, if the pattern is changed to
742     /^(aa(b(b))?)+$/ then $2 (and $3) are set.
744     In Perl 5.004 $2 is set in both cases, and that is also true
745     of PCRE. If in the future Perl changes to a consistent state
746     that is different, PCRE may change to follow.
748     9. Another as yet unresolved discrepancy is that in Perl
749     5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
750     "a", whereas in PCRE it does not. However, in both Perl and
751     PCRE /^(a)?a/ matched against "a" leaves $1 unset.
753     10. PCRE provides some extensions to the Perl regular
754     expression facilities:
756     (a) Although lookbehind assertions must match fixed length
757     strings, each alternative branch of a lookbehind assertion
758     can match a different length of string. Perl 5.005 requires
759     them all to have the same length.
761     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
762     set, the $ meta-character matches only at the very end of
763     the string.
765     (c) If PCRE_EXTRA is set, a backslash followed by a letter
766     with no special meaning is faulted.
768     (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
769     tion quantifiers is inverted, that is, by default they are
770     not greedy, but if followed by a question mark they are.
772     (e) PCRE_ANCHORED can be used to force a pattern to be tried
773     only at the start of the subject.
775     (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
776     for pcre_exec() have no Perl equivalents.
778     (g) The (?R) construct allows for recursive pattern matching
779     (Perl 5.6 can do this using the (?p{code}) construct, which
780     PCRE cannot of course support.)
784     REGULAR EXPRESSION DETAILS
785     The syntax and semantics of the regular expressions sup-
786     ported by PCRE are described below. Regular expressions are
787     also described in the Perl documentation and in a number of
789     other books, some of which have copious examples. Jeffrey
790     Friedl's "Mastering Regular Expressions", published by
791     O'Reilly (ISBN 1-56592-257), covers them in great detail.
792     The description here is intended as reference documentation.
794     A regular expression is a pattern that is matched against a
795     subject string from left to right. Most characters stand for
796     themselves in a pattern, and match the corresponding charac-
797     ters in the subject. As a trivial example, the pattern
799     The quick brown fox
801     matches a portion of a subject string that is identical to
802     itself. The power of regular expressions comes from the
803     ability to include alternatives and repetitions in the pat-
804     tern. These are encoded in the pattern by the use of meta-
805     characters, which do not stand for themselves but instead
806     are interpreted in some special way.
808     There are two different sets of meta-characters: those that
809     are recognized anywhere in the pattern except within square
810     brackets, and those that are recognized in square brackets.
811     Outside square brackets, the meta-characters are as follows:
813       \      general escape character with several uses
814       ^      assert start of subject (or line, in multiline
815              mode)
816       $      assert end of subject (or line, in multiline mode)
817       .      match any character except newline (by default)
818       [      start character class definition
819       |      start of alternative branch
820       (      start subpattern
821       )      end subpattern
822       ?      extends the meaning of (
823              also 0 or 1 quantifier
824              also quantifier minimizer
825       *      0 or more quantifier
826       +      1 or more quantifier
827       {      start min/max quantifier
829     Part of a pattern that is in square brackets is called a
830     "character class". In a character class the only meta-
831     characters are:
833       \      general escape character
834       ^      negate the class, but only if the first character
835       -      indicates character range
836       ]      terminates the character class
838     The following sections describe the use of each of the
839     meta-characters.
843     BACKSLASH
844     The backslash character has several uses. Firstly, if it is
845     followed by a non-alphameric character, it takes away any
846     special meaning that character may have. This use of
847     backslash as an escape character applies both inside and
848     outside character classes.
850     For example, if you want to match a "*" character, you write
851     "\*" in the pattern. This applies whether or not the follow-
852     ing character would otherwise be interpreted as a meta-
853     character, so it is always safe to precede a non-alphameric
854     with "\" to specify that it stands for itself. In particu-
855     lar, if you want to match a backslash, you write "\\".
857     If a pattern is compiled with the PCRE_EXTENDED option, whi-
858     tespace in the pattern (other than in a character class) and
859     characters between a "#" outside a character class and the
860     next newline character are ignored. An escaping backslash
861     can be used to include a whitespace or "#" character as part
862     of the pattern.
864     A second use of backslash provides a way of encoding non-
865     printing characters in patterns in a visible manner. There
866     is no restriction on the appearance of non-printing charac-
867     ters, apart from the binary zero that terminates a pattern,
868     but when a pattern is being prepared by text editing, it is
869     usually easier to use one of the following escape sequences
870     than the binary character it represents:
872       \a     alarm, that is, the BEL character (hex 07)
873       \cx    "control-x", where x is any character
874       \e     escape (hex 1B)
875       \f     formfeed (hex 0C)
876       \n     newline (hex 0A)
877       \r     carriage return (hex 0D)
878       \t     tab (hex 09)
879       \xhh   character with hex code hh
880       \ddd   character with octal code ddd, or backreference
882     The precise effect of "\cx" is as follows: if "x" is a lower
883     case letter, it is converted to upper case. Then bit 6 of
884     the character (hex 40) is inverted. Thus "\cz" becomes hex
885     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
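The "\cx" rule can be written out in a few lines of C. This is only a sketch of the arithmetic described above; ctrl_escape is a hypothetical helper name and is not part of the PCRE API.

```c
#include <ctype.h>

/* Sketch of the "\cx" rule: fold a lower case letter to upper case,
 * then invert bit 6 (hex 40) of the character.
 * ctrl_escape is a hypothetical name used for illustration only. */
unsigned char ctrl_escape(unsigned char x)
{
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```

Calling ctrl_escape('z') gives hex 1A, and the two punctuation cases in the text swap with each other because only bit 6 changes.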
887     After "\x", up to two hexadecimal digits are read (letters
888     can be in upper or lower case).
890     After "\0" up to two further octal digits are read. In both
891     cases, if there are fewer than two digits, just those that
892     are present are used. Thus the sequence "\0\x\07" specifies
893     two binary zeros followed by a BEL character. Make sure you
894     supply two digits after the initial zero if the character
895     that follows is itself an octal digit.
897     The handling of a backslash followed by a digit other than 0
898     is complicated. Outside a character class, PCRE reads it
899     and any following digits as a decimal number. If the number
900     is less than 10, or if there have been at least that many
901     previous capturing left parentheses in the expression, the
902     entire sequence is taken as a back reference. A description
903     of how this works is given later, following the discussion
904     of parenthesized subpatterns.
906     Inside a character class, or if the decimal number is
907     greater than 9 and there have not been that many capturing
908     subpatterns, PCRE re-reads up to three octal digits follow-
909     ing the backslash, and generates a single byte from the
910     least significant 8 bits of the value. Any subsequent digits
911     stand for themselves. For example:
913       \040   is another way of writing a space
914       \40    is the same, provided there are fewer than 40
915              previous capturing subpatterns
916       \7     is always a back reference
917       \11    might be a back reference, or another way of
918              writing a tab
919       \011   is always a tab
920       \0113  is a tab followed by the character "3"
921       \113   is the character with octal code 113 (since there
922              can be no more than 99 back references)
923       \377   is a byte consisting entirely of 1 bits
924       \81    is either a back reference, or a binary zero
925              followed by the two characters "8" and "1"
927     Note that octal values of 100 or greater must not be intro-
928     duced by a leading zero, because no more than three octal
929     digits are ever read.
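The decision procedure for a backslash followed by digits can be sketched in C as follows. This is an illustration of the rules as stated in the text, not PCRE's actual source; the digit_escape type and the read_digit_escape function are hypothetical names introduced here.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not a PCRE structure. */
typedef struct {
    int is_backref;   /* 1: back reference, 0: single byte value */
    int value;        /* group number, or byte value */
    size_t consumed;  /* digits consumed after the backslash */
} digit_escape;

/* Interpret the digits that follow a backslash.
 * digits:   text immediately after the backslash
 * in_class: non-zero when inside a character class
 * groups:   number of capturing left parentheses seen so far */
digit_escape read_digit_escape(const char *digits, int in_class, int groups)
{
    digit_escape r = {0, 0, 0};
    size_t i = 0;

    if (digits[0] != '0' && !in_class) {
        /* Read all following digits as a decimal number. */
        long n = 0;
        while (digits[i] >= '0' && digits[i] <= '9') {
            if (n < 100000)
                n = n * 10 + (digits[i] - '0');
            i++;
        }
        if (n < 10 || n <= groups) {
            r.is_backref = 1;
            r.value = (int)n;
            r.consumed = i;
            return r;
        }
        i = 0;  /* not a back reference: re-read as octal */
    }

    /* Up to three octal digits; keep the least significant 8 bits. */
    while (i < 3 && digits[i] >= '0' && digits[i] <= '7') {
        r.value = r.value * 8 + (digits[i] - '0');
        i++;
    }
    r.value &= 0xFF;
    r.consumed = i;
    return r;
}
```

With no previous capturing subpatterns, read_digit_escape("81", 0, 0) reports a byte value of zero with no digits consumed, so the "8" and "1" are left to stand for themselves.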
931     All the sequences that define a single byte value can be
932     used both inside and outside character classes. In addition,
933     inside a character class, the sequence "\b" is interpreted
934     as the backspace character (hex 08). Outside a character
935     class it has a different meaning (see below).
937     The third use of backslash is for specifying generic charac-
938     ter types:
940       \d     any decimal digit
941       \D     any character that is not a decimal digit
942       \s     any whitespace character
943       \S     any character that is not a whitespace character
944       \w     any "word" character
945       \W     any "non-word" character
947     Each pair of escape sequences partitions the complete set of
948     characters into two disjoint sets. Any given character
949     matches one, and only one, of each pair.
951     A "word" character is any letter or digit or the underscore
952     character, that is, any character which can be part of a
953     Perl "word". The definition of letters and digits is con-
954     trolled by PCRE's character tables, and may vary if locale-
955     specific matching is taking place (see "Locale support"
956     above). For example, in the "fr" (French) locale, some char-
957     acter codes greater than 128 are used for accented letters,
958     and these are matched by \w.
960     These character type sequences can appear both inside and
961     outside character classes. They each match one character of
962     the appropriate type. If the current matching point is at
963     the end of the subject string, all of them fail, since there
964     is no character to match.
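As a rough illustration of the generic character types, the following C sketch classifies a byte with the C library's <ctype.h> functions in the current locale. PCRE itself consults its own character tables, so this is an approximation; type_matches is a hypothetical helper name.

```c
#include <ctype.h>

/* Sketch of the generic character types.
 * Returns 1 if c matches the escape written as backslash + type,
 * 0 if it does not, and -1 if the type letter is not one of dDsSwW. */
int type_matches(char type, unsigned char c)
{
    int word = (isalnum(c) != 0) || c == '_';

    switch (type) {
    case 'd': return isdigit(c) != 0;
    case 'D': return isdigit(c) == 0;
    case 's': return isspace(c) != 0;
    case 'S': return isspace(c) == 0;
    case 'w': return word;
    case 'W': return !word;
    default:  return -1;
    }
}
```

Because each upper case type is the exact negation of its lower case partner, every byte value matches one and only one member of each pair.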
966     The fourth use of backslash is for certain simple asser-
967     tions. An assertion specifies a condition that has to be met
968     at a particular point in a match, without consuming any
969     characters from the subject string. The use of subpatterns
970     for more complicated assertions is described below. The
971     backslashed assertions are
973       \b     word boundary
974       \B     not a word boundary
975       \A     start of subject (independent of multiline mode)
976       \Z     end of subject or newline at end (independent of
977              multiline mode)
978       \z     end of subject (independent of multiline mode)
980     These assertions may not appear in character classes (but
981     note that "\b" has a different meaning, namely the backspace
982     character, inside a character class).
984     A word boundary is a position in the subject string where
985     the current character and the previous character do not both
986     match \w or \W (i.e. one matches \w and the other matches
987     \W), or the start or end of the string if the first or last
988     character matches \w, respectively.
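The word boundary definition amounts to comparing the "wordness" of the characters on either side of a position. A minimal sketch in C, assuming the C library's notion of letters and digits rather than PCRE's character tables (is_word_char and at_word_boundary are hypothetical names):

```c
#include <ctype.h>
#include <stddef.h>

/* Non-zero if c is a "word" character (letter, digit or underscore). */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Sketch of the \b test at offset pos (0 <= pos <= len) in subject.
 * Positions before the start and after the end are treated as
 * non-word, which reproduces the start and end rule in the text. */
int at_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev = pos > 0   && is_word_char((unsigned char)subject[pos - 1]);
    int cur  = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev != cur;
}
```

The \B assertion is simply the negation of this test.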
990     The \A, \Z, and \z assertions differ from the traditional
991     circumflex and dollar (described below) in that they only
992     ever match at the very start and end of the subject string,
993     whatever options are set. They are not affected by the
994     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
995     ment of pcre_exec() is non-zero, \A can never match. The
996     difference between \Z and \z is that \Z matches before a
997     newline that is the last character of the string as well as
998     at the end of the string, whereas \z matches only at the
999     end.
1003     CIRCUMFLEX AND DOLLAR
1004     Outside a character class, in the default matching mode, the
1005     circumflex character is an assertion which is true only if
1006     the current matching point is at the start of the subject
1007     string. If the startoffset argument of pcre_exec() is non-
1008     zero, circumflex can never match. Inside a character class,
1009     circumflex has an entirely different meaning (see below).
1011     Circumflex need not be the first character of the pattern if
1012     a number of alternatives are involved, but it should be the
1013     first thing in each alternative in which it appears if the
1014     pattern is ever to match that branch. If all possible alter-
1015     natives start with a circumflex, that is, if the pattern is
1016     constrained to match only at the start of the subject, it is
1017     said to be an "anchored" pattern. (There are also other con-
1018     structs that can cause a pattern to be anchored.)
1020     A dollar character is an assertion which is true only if the
1021     current matching point is at the end of the subject string,
1022     or immediately before a newline character that is the last
1023     character in the string (by default). Dollar need not be the
1024     last character of the pattern if a number of alternatives
1025     are involved, but it should be the last item in any branch
1026     in which it appears. Dollar has no special meaning in a
1027     character class.
1029     The meaning of dollar can be changed so that it matches only
1030     at the very end of the string, by setting the
1031     PCRE_DOLLAR_ENDONLY option at compile or matching time. This
1032     does not affect the \Z assertion.
1034     The meanings of the circumflex and dollar characters are
1035     changed if the PCRE_MULTILINE option is set. When this is
1036     the case, they match immediately after and immediately
1037     before an internal "\n" character, respectively, in addition
1038     to matching at the start and end of the subject string. For
1039     example, the pattern /^abc$/ matches the subject string
1040     "def\nabc" in multiline mode, but not otherwise. Conse-
1041     quently, patterns that are anchored in single line mode
1042     because all branches start with "^" are not anchored in mul-
1043     tiline mode, and a match for circumflex is possible when the
1044     startoffset argument of pcre_exec() is non-zero. The
1045     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1046     set.
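The position tests for circumflex and dollar can be summarized in C. This is a sketch under simplifying assumptions: the flag parameters are plain ints standing in for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY, the startoffset argument and the PCRE_NOTBOL and PCRE_NOTEOL options are ignored, and both function names are hypothetical.

```c
#include <stddef.h>

/* Does ^ succeed at offset pos? In this sketch a newline that ends
 * the subject also counts as "internal" in multiline mode; the text
 * does not say how PCRE treats that corner case. */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
    (void)len;
    if (pos == 0)
        return 1;
    return multiline && s[pos - 1] == '\n';
}

/* Does $ succeed at offset pos? */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;
    if (multiline)
        return s[pos] == '\n';      /* before any newline; endonly ignored */
    if (dollar_endonly)
        return 0;                   /* very end of the subject only */
    return pos + 1 == len && s[pos] == '\n';   /* before a final newline */
}
```

For the subject "def\nabc", circumflex_matches succeeds at offset 4 only when the multiline flag is set, which is why /^abc$/ matches there in multiline mode but not otherwise.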
1048     Note that the sequences \A, \Z, and \z can be used to match
1049     the start and end of the subject in both modes, and if all
1050     branches of a pattern start with \A it is always anchored,
1051     whether PCRE_MULTILINE is set or not.
1055     FULL STOP (PERIOD, DOT)
1056     Outside a character class, a dot in the pattern matches any
1057     one character in the subject, including a non-printing char-
1058     acter, but not (by default) newline. If the PCRE_DOTALL
1059     option is set, dots match newlines as well. The handling of
1060     dot is entirely independent of the handling of circumflex
1061     and dollar, the only relationship being that they both
1062     involve newline characters. Dot has no special meaning in a
1063     character class.
1067     SQUARE BRACKETS
1068     An opening square bracket introduces a character class, ter-
1069     minated by a closing square bracket. A closing square
1070     bracket on its own is not special. If a closing square
1071     bracket is required as a member of the class, it should be
1072     the first data character in the class (after an initial cir-
1073     cumflex, if present) or escaped with a backslash.
1075     A character class matches a single character in the subject;
1076     the character must be in the set of characters defined by
1077     the class, unless the first character in the class is a cir-
1078     cumflex, in which case the subject character must not be in
1079     the set defined by the class. If a circumflex is actually
1080     required as a member of the class, ensure it is not the
1081     first character, or escape it with a backslash.
1083     For example, the character class [aeiou] matches any lower
1084     case vowel, while [^aeiou] matches any character that is not
1085     a lower case vowel. Note that a circumflex is just a con-
1086     venient notation for specifying the characters which are in
1087     the class by enumerating those that are not. It is not an
1088     assertion: it still consumes a character from the subject
1089     string, and fails if the current pointer is at the end of
1090     the string.
1092     When caseless matching is set, any letters in a class
1093     represent both their upper case and lower case versions, so
1094     for example, a caseless [aeiou] matches "A" as well as "a",
1095     and a caseless [^aeiou] does not match "A", whereas a case-
1096     ful version would.
1098     The newline character is never treated in any special way in
1099     character classes, whatever the setting of the PCRE_DOTALL
1100     or PCRE_MULTILINE options is. A class such as [^a] will
1101     always match a newline.
1103     The minus (hyphen) character can be used to specify a range
1104     of characters in a character class. For example, [d-m]
1105     matches any letter between d and m, inclusive. If a minus
1106     character is required in a class, it must be escaped with a
1107     backslash or appear in a position where it cannot be inter-
1108     preted as indicating a range, typically as the first or last
1109     character in the class.
1111     It is not possible to have the literal character "]" as the
1112     end character of a range. A pattern such as [W-]46] is
1113     interpreted as a class of two characters ("W" and "-") fol-
1114     lowed by a literal string "46]", so it would match "W46]" or
1115     "-46]". However, if the "]" is escaped with a backslash it
1116     is interpreted as the end of range, so [W-\]46] is inter-
1117     preted as a single class containing a range followed by two
1118     separate characters. The octal or hexadecimal representation
1119     of "]" can also be used to end a range.
1121     Ranges operate in ASCII collating sequence. They can also be
1122     used for characters specified numerically, for example
1123     [\000-\037]. If a range that includes letters is used when
1124     caseless matching is set, it matches the letters in either
1125     case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
1126     matched caselessly, and if character tables for the "fr"
1127     locale are in use, [\xc8-\xcb] matches accented E characters
1128     in both cases.
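The effect of caseless matching on a range can be sketched as a membership test in C. This assumes the C library's case mapping in the current locale, whereas PCRE uses its own tables; in_range is a hypothetical helper name.

```c
#include <ctype.h>

/* Sketch: does c fall in the class range lo-hi, optionally caselessly?
 * In caseless mode the other-case form of c is tried as well. */
int in_range(unsigned char c, unsigned char lo, unsigned char hi,
             int caseless)
{
    if (c >= lo && c <= hi)
        return 1;
    if (caseless) {
        unsigned char other = (unsigned char)
            (islower(c) ? toupper(c) : tolower(c));
        return other >= lo && other <= hi;
    }
    return 0;
}
```

For the range W-c, the letter "w" is accepted only in caseless mode (through "W"), and "A" only in caseless mode (through "a"), which agrees with the equivalent class given in the text.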
1130     The character types \d, \D, \s, \S, \w, and \W may also
1131     appear in a character class, and add the characters that
1132     they match to the class. For example, [\dABCDEF] matches any
1133     hexadecimal digit. A circumflex can conveniently be used
1134     with the upper case character types to specify a more res-
1135     tricted set of characters than the matching lower case type.
1136     For example, the class [^\W_] matches any letter or digit,
1137     but not underscore.
1139     All non-alphameric characters other than \, -, ^ (at the
1140     start) and the terminating ] are non-special in character
1141     classes, but it does no harm if they are escaped.
1145     POSIX CHARACTER CLASSES
1146     Perl 5.6 (not yet released at the time of writing) is going
1147     to support the POSIX notation for character classes, which
1148     uses names enclosed by [: and :] within the enclosing
1149     square brackets. PCRE supports this notation. For example,
1151     [01[:alpha:]%]
1153     matches "0", "1", any alphabetic character, or "%". The sup-
1154     ported class names are
1156       alnum    letters and digits
1157       alpha    letters
1158       ascii    character codes 0 - 127
1159       cntrl    control characters
1160       digit    decimal digits (same as \d)
1161       graph    printing characters, excluding space
1162       lower    lower case letters
1163       print    printing characters, including space
1164       punct    printing characters, excluding letters and digits
1165       space    white space (same as \s)
1166       upper    upper case letters
1167       word     "word" characters (same as \w)
1168       xdigit   hexadecimal digits
1170     The names "ascii" and "word" are Perl extensions. Another
1171     Perl extension is negation, which is indicated by a ^ char-
1172     acter after the colon. For example,
1174     [12[:^digit:]]
1176     matches "1", "2", or any non-digit. PCRE (and Perl) also
1177     recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
1178     "collating element", but these are not supported, and an
1179     error is given if they are encountered.
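The class names map naturally onto the C library's <ctype.h> functions. The sketch below tests a byte against a name as it would be written between [: and :], including the negated form. It relies on the current C locale rather than PCRE's character tables, and posix_class_matches is a hypothetical helper, not a PCRE function.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 or 0 for a recognized class name (with an optional
 * leading ^ for negation), or -1 for an unrecognized name. */
int posix_class_matches(const char *name, unsigned char c)
{
    int negate = (name[0] == '^');
    int r;

    if (negate)
        name++;

    if      (strcmp(name, "alnum")  == 0) r = isalnum(c)  != 0;
    else if (strcmp(name, "alpha")  == 0) r = isalpha(c)  != 0;
    else if (strcmp(name, "ascii")  == 0) r = c < 128;
    else if (strcmp(name, "cntrl")  == 0) r = iscntrl(c)  != 0;
    else if (strcmp(name, "digit")  == 0) r = isdigit(c)  != 0;
    else if (strcmp(name, "graph")  == 0) r = isgraph(c)  != 0;
    else if (strcmp(name, "lower")  == 0) r = islower(c)  != 0;
    else if (strcmp(name, "print")  == 0) r = isprint(c)  != 0;
    else if (strcmp(name, "punct")  == 0) r = ispunct(c)  != 0;
    else if (strcmp(name, "space")  == 0) r = isspace(c)  != 0;
    else if (strcmp(name, "upper")  == 0) r = isupper(c)  != 0;
    else if (strcmp(name, "word")   == 0) r = isalnum(c) != 0 || c == '_';
    else if (strcmp(name, "xdigit") == 0) r = isxdigit(c) != 0;
    else return -1;

    return negate ? !r : r;
}
```

For example, posix_class_matches("^digit", '5') is 0, in line with the [12[:^digit:]] example.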
1183     VERTICAL BAR
1184     Vertical bar characters are used to separate alternative
1185     patterns. For example, the pattern
1187     gilbert|sullivan
1189     matches either "gilbert" or "sullivan". Any number of alter-
1190     natives may appear, and an empty alternative is permitted
1191     (matching the empty string). The matching process tries
1192     each alternative in turn, from left to right, and the first
1193     one that succeeds is used. If the alternatives are within a
1194     subpattern (defined below), "succeeds" means matching the
1195     rest of the main pattern as well as the alternative in the
1196     subpattern.
1200     INTERNAL OPTION SETTING
1201     The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1202     and PCRE_EXTENDED can be changed from within the pattern by
1203     a sequence of Perl option letters enclosed between "(?" and
1204     ")". The option letters are
1206     i for PCRE_CASELESS
1207     m for PCRE_MULTILINE
1208     s for PCRE_DOTALL
1209     x for PCRE_EXTENDED
1211     For example, (?im) sets caseless, multiline matching. It is
1212     also possible to unset these options by preceding the letter
1213     with a hyphen, and a combined setting and unsetting such as
1214     (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1215     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1216     If a letter appears both before and after the hyphen, the
1217     option is unset.
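The set and unset rule for option letters can be sketched as a small parser in C. The OPT_ bit values below are hypothetical and exist only for this illustration; the real option constants are defined in pcre.h. Rejecting a second hyphen is a choice made by this sketch, not a statement about PCRE's behaviour, and the PCRE-specific letters U and X are left out.

```c
/* Hypothetical bit values for this sketch only. */
enum { OPT_I = 1, OPT_M = 2, OPT_S = 4, OPT_X = 8 };

/* Parse the letters written between "(?" and ")", e.g. "im-sx".
 * On success returns 0 and stores the options to set and to unset.
 * Returns -1 on an unknown letter or a second hyphen. */
int parse_option_letters(const char *letters, unsigned *set,
                         unsigned *unset)
{
    unsigned *target = set;

    *set = 0;
    *unset = 0;
    for (; *letters != '\0'; letters++) {
        switch (*letters) {
        case 'i': *target |= OPT_I; break;
        case 'm': *target |= OPT_M; break;
        case 's': *target |= OPT_S; break;
        case 'x': *target |= OPT_X; break;
        case '-':
            if (target == unset)
                return -1;
            target = unset;
            break;
        default:
            return -1;
        }
    }
    *set &= ~*unset;   /* a letter on both sides ends up unset */
    return 0;
}
```

Parsing "im-sx" yields OPT_I and OPT_M to set and OPT_S and OPT_X to unset, while "i-i" leaves nothing to set.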
1219     The scope of these option changes depends on where in the
1220     pattern the setting occurs. For settings that are outside
1221     any subpattern (defined below), the effect is the same as if
1222     the options were set or unset at the start of matching. The
1223     following patterns all behave in exactly the same way:
1225     (?i)abc
1226     a(?i)bc
1227     ab(?i)c
1228     abc(?i)
1230     which in turn is the same as compiling the pattern abc with
1231     PCRE_CASELESS set. In other words, such "top level" set-
1232     tings apply to the whole pattern (unless there are other
1233     changes inside subpatterns). If there is more than one set-
1234     ting of the same option at top level, the rightmost setting
1235     is used.
1237     If an option change occurs inside a subpattern, the effect
1238     is different. This is a change of behaviour in Perl 5.005.
1239     An option change inside a subpattern affects only that part
1240     of the subpattern that follows it, so
1242     (a(?i)b)c
1244     matches abc and aBc and no other strings (assuming
1245     PCRE_CASELESS is not used). By this means, options can be
1246     made to have different settings in different parts of the
1247     pattern. Any changes made in one alternative do carry on
1248     into subsequent branches within the same subpattern. For
1249     example,
1251     (a(?i)b|c)
1253     matches "ab", "aB", "c", and "C", even though when matching
1254     "C" the first branch is abandoned before the option setting.
1255     This is because the effects of option settings happen at
1256     compile time. There would be some very weird behaviour oth-
1257     erwise.
1259     The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1260     be changed in the same way as the Perl-compatible options by
1261     using the characters U and X respectively. The (?X) flag
1262     setting is special in that it must always occur earlier in
1263     the pattern than any of the additional features it turns on,
1264     even when it is at top level. It is best put at the start.
1268     SUBPATTERNS
1269     Subpatterns are delimited by parentheses (round brackets),
1270     which can be nested. Marking part of a pattern as a subpat-
1271     tern does two things:
1273     1. It localizes a set of alternatives. For example, the pat-
1274     tern
1276     cat(aract|erpillar|)
1278     matches one of the words "cat", "cataract", or "caterpil-
1279     lar". Without the parentheses, it would match "cataract",
1280     "erpillar" or the empty string.
1282     2. It sets up the subpattern as a capturing subpattern (as
1283     defined above). When the whole pattern matches, that por-
1284     tion of the subject string that matched the subpattern is
1285     passed back to the caller via the ovector argument of
1286     pcre_exec(). Opening parentheses are counted from left to
1287     right (starting from 1) to obtain the numbers of the captur-
1288     ing subpatterns.
1290     For example, if the string "the red king" is matched against
1291     the pattern
1293     the ((red|white) (king|queen))
1295     the captured substrings are "red king", "red", and "king",
1296     and are numbered 1, 2, and 3.
1298     The fact that plain parentheses fulfil two functions is not
1299     always helpful. There are often times when a grouping sub-
1300     pattern is required without a capturing requirement. If an
1301     opening parenthesis is followed by "?:", the subpattern does
1302     not do any capturing, and is not counted when computing the
1303     number of any subsequent capturing subpatterns. For example,
1304     if the string "the white queen" is matched against the pat-
1305     tern
1307     the ((?:red|white) (king|queen))
1309     the captured substrings are "white queen" and "queen", and
1310     are numbered 1 and 2. The maximum number of captured sub-
1311     strings is 99, and the maximum number of all subpatterns,
1312     both capturing and non-capturing, is 200.
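The numbering rule (count opening parentheses from left to right, skipping those followed by "?") can be sketched in C. This is a simplified scan rather than PCRE's compiler: backslash escapes and character classes are skipped, but a "]" placed first in a class is not handled. count_capturing is a hypothetical helper name.

```c
/* Sketch: count the capturing subpatterns in a pattern string. */
int count_capturing(const char *p)
{
    int count = 0;
    int in_class = 0;

    for (; *p != '\0'; p++) {
        if (*p == '\\') {
            if (p[1] != '\0')
                p++;                /* skip the escaped character */
        } else if (in_class) {
            if (*p == ']')
                in_class = 0;
        } else if (*p == '[') {
            in_class = 1;
        } else if (*p == '(' && p[1] != '?') {
            count++;                /* plain parenthesis: capturing */
        }
    }
    return count;
}
```

Applied to the two patterns in the examples, the count is 3 for the version with plain parentheses and 2 when "?:" is used for the inner alternation.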
1314     As a convenient shorthand, if any option settings are
1315     required at the start of a non-capturing subpattern, the
1316     option letters may appear between the "?" and the ":". Thus
1317     the two patterns
1319     (?i:saturday|sunday)
1320     (?:(?i)saturday|sunday)
1322     match exactly the same set of strings. Because alternative
1323     branches are tried from left to right, and options are not
1324     reset until the end of the subpattern is reached, an option
1325     setting in one branch does affect subsequent branches, so
1326     the above patterns match "SUNDAY" as well as "Saturday".
1330     REPETITION
1331     Repetition is specified by quantifiers, which can follow any
1332     of the following items:
1334     a single character, possibly escaped
1335     the . metacharacter
1336     a character class
1337     a back reference (see next section)
1338     a parenthesized subpattern (unless it is an assertion -
1339     see below)
1341     The general repetition quantifier specifies a minimum and
1342     maximum number of permitted matches, by giving the two
1343     numbers in curly brackets (braces), separated by a comma.
1344     The numbers must be less than 65536, and the first must be
1345     less than or equal to the second. For example:
1347     z{2,4}
1349     matches "zz", "zzz", or "zzzz". A closing brace on its own
1350     is not a special character. If the second number is omitted,
1351     but the comma is present, there is no upper limit; if the
1352     second number and the comma are both omitted, the quantifier
1353     specifies an exact number of required matches. Thus
1355     [aeiou]{3,}
1357     matches at least 3 successive vowels, but may match many
1358     more, while
1360     \d{8}
1362     matches exactly 8 digits. An opening curly bracket that
1363     appears in a position where a quantifier is not allowed, or
1364     one that does not match the syntax of a quantifier, is taken
1365     as a literal character. For example, {,6} is not a quantif-
1366     ier, but a literal string of four characters.
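The syntax rules for the general quantifier can be sketched as a small recognizer in C. It distinguishes three outcomes: a valid quantifier, text that is not a quantifier (so the brace is literal), and quantifier syntax that breaks one of the stated limits. parse_quantifier is a hypothetical helper, not a PCRE function, and the -1 convention for "no upper limit" is a choice made here.

```c
#include <ctype.h>

/* Try to read a {min,max} quantifier at p.
 * Returns the number of characters consumed, 0 if the text is not a
 * quantifier, or -1 if a number is 65536 or more or min exceeds max.
 * *max is set to -1 when there is no upper limit. */
int parse_quantifier(const char *p, int *min, int *max)
{
    const char *q = p + 1;
    long lo = 0, hi;

    if (p[0] != '{' || !isdigit((unsigned char)*q))
        return 0;                       /* e.g. "{,6}" is literal */
    while (isdigit((unsigned char)*q)) {
        if (lo < 65536)
            lo = lo * 10 + (*q - '0');
        q++;
    }
    hi = lo;                            /* "{n}" means exactly n */
    if (*q == ',') {
        q++;
        if (*q == '}') {
            hi = -1;                    /* "{n,}" has no upper limit */
        } else {
            if (!isdigit((unsigned char)*q))
                return 0;
            hi = 0;
            while (isdigit((unsigned char)*q)) {
                if (hi < 65536)
                    hi = hi * 10 + (*q - '0');
                q++;
            }
        }
    }
    if (*q != '}')
        return 0;
    if (lo >= 65536 || hi >= 65536 || (hi != -1 && lo > hi))
        return -1;
    *min = (int)lo;
    *max = (int)hi;
    return (int)(q - p) + 1;
}
```

Under this sketch the single-character abbreviations would correspond to min and max values of (0, -1), (1, -1) and (0, 1).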
1368     The quantifier {0} is permitted, causing the expression to
1369     behave as if the previous item and the quantifier were not
1370     present.
1372     For convenience (and historical compatibility) the three
1373     most common quantifiers have single-character abbreviations:
1375     * is equivalent to {0,}
1376     + is equivalent to {1,}
1377     ? is equivalent to {0,1}
1379     It is possible to construct infinite loops by following a
1380     subpattern that can match no characters with a quantifier
1381     that has no upper limit, for example:
1383     (a?)*
1385     Earlier versions of Perl and PCRE used to give an error at
1386     compile time for such patterns. However, because there are
1387     cases where this can be useful, such patterns are now
1388     accepted, but if any repetition of the subpattern does in
1389     fact match no characters, the loop is forcibly broken.
1391     By default, the quantifiers are "greedy", that is, they
1392     match as much as possible (up to the maximum number of per-
1393     mitted times), without causing the rest of the pattern to
1394     fail. The classic example of where this gives problems is in
1395     trying to match comments in C programs. These appear between
1396     the sequences /* and */ and within the sequence, individual
1397     * and / characters may appear. An attempt to match C com-
1398     ments by applying the pattern
1400     /\*.*\*/
1402     to the string
1404     /* first comment */ not comment /* second comment */
1406     fails, because it matches the entire string due to the
1407     greediness of the .* item.
1409     However, if a quantifier is followed by a question mark, it
1410     ceases to be greedy, and instead matches the minimum number
1411     of times possible, so the pattern
1413     /\*.*?\*/
1415     does the right thing with the C comments. The meaning of the
1416     various quantifiers is not otherwise changed, just the pre-
1417     ferred number of matches. Do not confuse this use of ques-
1418     tion mark with its use as a quantifier in its own right.
1419     Because it has two uses, it can sometimes appear doubled, as
1420     in
1422     \d??\d
1424     which matches one digit by preference, but can match two if
1425     that is the only way the rest of the pattern matches.
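The difference between the greedy and the minimizing form of the C comment pattern can be shown without a regular expression engine. The following C sketch is written by hand and only mimics what the two patterns select; it tries only the first comment opener in the subject, and comment_match_length is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the text selected, starting at the first
 * comment opener in s, or 0 if there is no opener or no closer.
 * greedy != 0 extends to the last closer on the line; greedy == 0
 * stops at the first one. Like the dot in default mode, the search
 * does not cross a newline. */
size_t comment_match_length(const char *s, int greedy)
{
    const char *start = strstr(s, "/*");
    const char *end = NULL;
    const char *p;

    if (start == NULL)
        return 0;
    for (p = start + 2; *p != '\0' && *p != '\n'; p++) {
        if (p[0] == '*' && p[1] == '/') {
            end = p + 2;
            if (!greedy)
                break;          /* minimal: stop at the first closer */
        }
    }
    return end != NULL ? (size_t)(end - start) : 0;
}
```

On a subject holding two comments separated by other text, the greedy form selects everything from the first opener to the last closer, while the minimizing form selects just the first comment.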
1427     If the PCRE_UNGREEDY option is set (an option which is not
1428     available in Perl), the quantifiers are not greedy by
1429     default, but individual ones can be made greedy by following
1430     them with a question mark. In other words, it inverts the
1431     default behaviour.
1433     When a parenthesized subpattern is quantified with a minimum
1434     repeat count that is greater than 1 or with a limited max-
1435     imum, more store is required for the compiled pattern, in
1436     proportion to the size of the minimum or maximum.
1438     If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1439     option (equivalent to Perl's /s) is set, thus allowing the .
1440 nigel 47 to match newlines, the pattern is implicitly anchored,
1441 nigel 41 because whatever follows will be tried against every charac-
1442     ter position in the subject string, so there is no point in
1443     retrying the overall match at any position after the first.
1444     PCRE treats such a pattern as though it were preceded by \A.
1445     In cases where it is known that the subject string contains
1446     no newlines, it is worth setting PCRE_DOTALL when the pat-
1447     tern begins with .* in order to obtain this optimization, or
1448     alternatively using ^ to indicate anchoring explicitly.
When a capturing subpattern is repeated, the value captured is the
substring that matched the final iteration. For example, after

    (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured substring
is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous
iterations. For example, after

    /(a|(b))+/

matches "aba" the value of the second captured substring is "b".

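Both cases can be reproduced in a short sketch, again assuming Python's
re module as a stand-in for PCRE (Python happens to keep the value of
the nested group from the earlier iteration in the same way).

```python
import re

# A repeated capturing group reports only its final iteration.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(m.group(1))

# The nested group was last set on an earlier iteration and keeps
# that value after the outer group has moved on.
n = re.match(r"(a|(b))+", "aba")
print(n.group(1), n.group(2))
```
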
BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing
subpattern earlier (i.e. to its left) in the pattern, provided there
have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not
be to the left of the reference for numbers less than 10. See the
section entitled "Backslash" above for further details of the handling
of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything matching
the subpattern itself. So the pattern

    (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

    ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.

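A sketch of these two examples follows, using Python's re module as a
stand-in for PCRE. One adaptation is an assumption of the sketch:
Python does not accept (?i) in the middle of a pattern, so the locally
caseless group is written with Python's scoped form (?i:...).

```python
import re

pat = re.compile(r"(sens|respons)e and \1ibility")
same_a = pat.fullmatch("sense and sensibility")
same_b = pat.fullmatch("response and responsibility")
mixed = pat.fullmatch("sense and responsibility")
print(bool(same_a), bool(same_b), bool(mixed))

# Only the group is caseless; the back reference itself is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
lower = rah.fullmatch("rah rah")
upper = rah.fullmatch("RAH RAH")
differs = rah.fullmatch("RAH rah")
print(bool(lower), bool(upper), bool(differs))
```
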
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

    (a|(bc))\2

always fails if it starts to match "a" rather than "bc". Because there
may be up to 99 back references, all digits following the backslash are
taken as part of a potential back reference number. If the pattern
continues with a digit character, some delimiter must be used to
terminate the back reference. If the PCRE_EXTENDED option is set, this
can be whitespace. Otherwise an empty comment can be used.

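The unset-group case and the two ways of terminating a back reference
can be sketched as follows. Python's re module stands in for PCRE here
(an assumption), with re.VERBOSE playing the part of PCRE_EXTENDED; the
subjects are made up for illustration.

```python
import re

unset = re.compile(r"(a|(bc))\2")
via_bc = unset.fullmatch("bcbc")   # group 2 is set, so \2 can match
via_a = unset.search("aa")         # group 2 never set, so \2 fails

# A literal "1" directly after \1 needs a delimiter: an empty comment,
# or whitespace when extended mode is on.
with_comment = re.compile(r"(a)\1(?#)1")
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

print(bool(via_bc), via_a)
print(bool(with_comment.fullmatch("aa1")), bool(with_space.fullmatch("aa1")))
```
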
A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

    (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

ASSERTIONS
An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \Z, \z, ^ and $ are
described above. More complicated assertions are coded as subpatterns.
There are two kinds: those that look ahead of the current position in
the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it
does not cause the current matching position to be changed. Lookahead
assertions start with (?= for positive assertions and (?! for negative
assertions. For example,

    \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

    foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

    (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve this effect.

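The three lookahead patterns can be tried out in a short sketch. It
assumes Python's re module as a stand-in for PCRE; the subject strings
are invented for illustration.

```python
import re

# The semicolon is tested but not included in the match.
word = re.search(r"\w+(?=;)", "total;")
print(word.group())

# Only the "foo" that is not followed by "bar" is reported.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(foo_starts)

# (?!foo)bar still finds "bar" directly after "foo": the assertion
# looks at the characters "bar" themselves, which are not "foo".
misleading = re.search(r"(?!foo)bar", "foobar")
print(misleading.start())
```
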
Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

    (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several alternatives, they do not all have to have the same fixed
length. Thus

    (?<=bullock|donkey)

is permitted, but

    (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl 5.005, which requires all
branches to match the same length of string. An assertion such as

    (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

    (?<=abc|abde)

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed width and
then try to match. If there are insufficient characters before the
current position, the match is deemed to fail. Lookbehinds in
conjunction with once-only subpatterns can be particularly useful for
matching at the ends of strings; an example is given at the end of the
section on once-only subpatterns.

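The step-back procedure just described can be sketched by hand. This is
an illustration only, not PCRE's code: the function name lookbehind is
hypothetical, and each alternative is restricted to a literal string so
that its fixed width is simply its length.

```python
def lookbehind(subject, pos, alternatives):
    """Return True if some alternative ends exactly at pos."""
    for alt in alternatives:
        start = pos - len(alt)      # step back by this branch's width
        if start < 0:
            continue                # too few characters: branch fails
        if subject.startswith(alt, start):
            return True
    return False

# Emulates (?<=bullock|donkey) tested at the end of each subject.
print(lookbehind("the donkey", 10, ("bullock", "donkey")))
print(lookbehind("key", 3, ("bullock", "donkey")))
```

Each top-level branch carries its own width, which is why branches of
different lengths cause no difficulty at the top level.
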
Several assertions (of any sort) may occur in succession. For example,

    (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

    (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".

Assertions can be nested in any combination. For example,

    (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

    (?<=\d{3}(?!999)...)foo

is another pattern which matches "foo" preceded by three digits and any
three characters that are not "999".

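The successive and nested forms can be compared in a sketch, assuming
Python's re module as a stand-in for PCRE (these particular lookbehinds
all have a single fixed width, which Python requires). The subjects
other than "123abcfoo" are invented for illustration.

```python
import re

stacked = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
nested = re.compile(r"(?<=(?<!foo)bar)baz")

# Both assertions in `stacked` inspect the same three characters.
print(bool(stacked.search("123foo")), bool(stacked.search("999foo")))
print(bool(stacked.search("123abcfoo")), bool(six_back.search("123abcfoo")))

# The inner (?<!foo) is applied at the point where "bar" begins.
print(bool(nested.search("xyzbarbaz")), bool(nested.search("foobarbaz")))
```
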
Assertion subpatterns are not capturing subpatterns, and may not be
repeated, because it makes no sense to assert the same thing several
times. If any kind of assertion contains capturing subpatterns within
it, these are counted for the purposes of numbering the capturing
subpatterns in the whole pattern. However, substring capturing is
carried out only for positive assertions, because it does not make
sense for negative assertions.

Assertions count towards the maximum of 200 parenthesized subpatterns.

ONCE-ONLY SUBPATTERNS
With both maximizing and minimizing repetition, failure of what follows
normally causes the repeated item to be re-evaluated to see if a
different number of repeats allows the rest of the pattern to match.
Sometimes it is useful to prevent this, either to change the nature of
the match, or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject
line

    123456bar

After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing.
Once-only subpatterns provide the means for specifying that once a
portion of the pattern has matched, it is not to be re-evaluated in
this way, so the matcher would give up immediately on failing to match
"foo" the first time. The notation is another kind of special
parenthesis, starting with (?> as in this example:

    (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it contains
once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Once-only subpatterns are not capturing subpatterns. Simple cases such
as the above example can be thought of as a maximizing repeat that must
swallow everything it can. So, while both \d+ and \d+? are prepared to
adjust the number of digits they match in order to make the rest of the
pattern match, (?>\d+) can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.

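The "locking up" effect can be sketched in Python, with one caveat:
Python's re module accepts (?>...) only from version 3.11, so this
sketch swaps in a well-known emulation instead of PCRE's own syntax. A
lookahead captures the longest run of digits and is never re-entered
after it succeeds, and a back reference then consumes exactly that run.
The emulation adds a capturing group that a true once-only subpattern
would not have.

```python
import re

# Ordinary repeat: \d+ gives back one digit so the final \d can match.
plain = re.search(r"\d+\d", "123456")

# Emulated (?>\d+)\d: the run of digits is locked, nothing is given
# back, and so the final \d can never match.
locked = re.search(r"(?=(\d+))\1\d", "123456")

# Emulated (?>\d+)foo against the subject from the text above.
no_foo = re.search(r"(?=(\d+))\1foo", "123456bar")

print(plain.group(), locked, no_foo)
```
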
Once-only subpatterns can be used in conjunction with lookbehind
assertions to specify efficient matching at the end of the subject
string. Consider a simple pattern such as

    abcd$

when applied to a long string which does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

    ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

    ^(?>.*)(?<=abcd)

there can be no backtracking for the .* item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of a
once-only subpattern is the only way to avoid some failing matches
taking a very long time indeed. The pattern

    (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When
it matches, it runs quickly. However, if it is applied to

    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the two repeats in a large number of
ways, and all have to be tried. (The example used [!?] rather than a
single character at the end, because both PCRE and Perl have an
optimization that allows for fast failure when a single character is
used. They remember the last single character that is required for a
match, and fail early if it is not present in the string.) If the
pattern is changed to

    ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.

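The quick failure can be sketched in Python. As an assumption of this
sketch, the once-only group is emulated with a capturing lookahead
followed by a back reference, because Python's re module accepts
(?>...) only from version 3.11. Only the guarded pattern is run; the
unguarded one is left out on purpose, since it may take a very long
time on this subject.

```python
import re

subject = "a" * 52

# Stand-in for ((?>\D+)|<\d+>)*[!?]: once the lookahead has captured a
# run of non-digits, the run cannot be split up again on failure.
guarded = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

result = guarded.search(subject)
print(result)
```
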
CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a previous
capturing subpattern matched or not. The two possible forms of
conditional subpattern are

    (?(condition)yes-pattern)
    (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition. If the text between the parentheses
consists of a sequence of digits, the condition is satisfied if the
capturing subpattern of that number has previously matched. Consider
the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide
it into three parts for ease of discussion:

    ( \( )? [^()]+ (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether the first
set of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required.
Otherwise, since no-pattern is not present, the subpattern matches
nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.

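This pattern can be exercised in a short sketch. It assumes Python's re
module as a stand-in for PCRE, with re.VERBOSE playing the part of
PCRE_EXTENDED; fullmatch is used so that the whole subject has to fit
the pattern.

```python
import re

pat = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

enclosed = pat.fullmatch("(abc)")     # "(" captured, so ")" is required
bare = pat.fullmatch("abc")           # group 1 unset, nothing more needed
unclosed = pat.fullmatch("(abc")      # ")" required but missing
stray_close = pat.fullmatch("abc)")   # ")" present but not permitted

print(bool(enclosed), bool(bare), unclosed, stray_close)
```
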
If the condition is not a sequence of digits, it must be an assertion.
This may be a positive or negative lookahead or lookbehind assertion.
Consider this pattern, again containing non-significant white space,
and with the two alternatives on the second line:

    (?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.

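Python's re module does not accept an assertion as a condition, so the
following sketch swaps in an equivalent formulation rather than the
PCRE syntax: the first alternative is guarded by the positive
lookahead and the second by its negation. The variable names and test
subjects are made up for illustration.

```python
import re

has_letter = r"[^a-z]*[a-z]"

# (?=cond)yes | (?!cond)no stands in for (?(?=cond)yes|no).
pat = re.compile(
    r"(?:(?=%s)\d{2}-[a-z]{3}-\d{2}|(?!%s)\d{2}-\d{2}-\d{2})"
    % (has_letter, has_letter)
)

with_letters = pat.fullmatch("12-abc-34")
digits_only = pat.fullmatch("12-34-56")
too_short = pat.fullmatch("12-ab-34")

print(bool(with_letters), bool(digits_only), too_short)
```
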
COMMENTS
The sequence (?# marks the start of a comment which continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues up to the next
newline character in the pattern.

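Both comment styles can be sketched briefly, assuming Python's re
module as a stand-in for PCRE and re.VERBOSE as the counterpart of
PCRE_EXTENDED. The pattern and subject are invented for illustration.

```python
import re

# Inline comment: everything from (?# to the next ) is ignored.
inline = re.compile(r"\d{3}(?# prefix)-\d{4}")

# Extended mode: an unescaped # starts a comment that runs to the end
# of the line, and white space in the pattern is ignored.
extended = re.compile(
    r"""
    \d{3}   # prefix
    -
    \d{4}   # number
    """,
    re.VERBOSE,
)

print(bool(inline.fullmatch("555-1234")), bool(extended.fullmatch("555-1234")))
```
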
RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth. Perl 5.6 has provided an experimental facility that allows
regular expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern to solve the parentheses
problem can be created like this:

    $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears. Obviously, PCRE
cannot support the interpolation of Perl code. Instead, the special
item (?R) is provided for the specific case of recursion. This PCRE
pattern solves the parentheses problem (assume the PCRE_EXTENDED option
is set so that white space is ignored):

    \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (i.e. a correctly parenthesized
substring). Finally there is a closing parenthesis.

This particular example pattern contains nested unlimited repeats, and
so the use of a once-only subpattern for matching strings of
non-parentheses is important when applying the pattern to strings that
do not match. For example, when it is applied to

    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if a once-only subpattern is not
used, the match runs for a very long time indeed because there are so
many different ways the + and * repeats can carve up the subject, and
all have to be tested before failure can be reported.

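Python's re module has no (?R) item, so the structure of the recursive
pattern is sketched here as a hand-written matcher instead. This is an
illustration of the idea only, not PCRE's implementation; the function
name match_balanced is hypothetical, and the matcher is anchored at the
position it is given rather than searching along the subject.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced group at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                            # the opening \(
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            return i + 1                     # the closing \)
        if ch == "(":
            i = match_balanced(subject, i)   # plays the role of (?R)
            if i == -1:
                return -1
        else:
            # Like (?>[^()]+): swallow the whole run of
            # non-parentheses and never hand any of it back.
            while i < len(subject) and subject[i] not in "()":
                i += 1
    return -1

print(match_balanced("(ab(cd)ef)"))
print(match_balanced("(" + "a" * 53 + "()"))
```

Because runs of non-parentheses are consumed in one step, the failing
subject is rejected after a single pass over it.
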
The values set for any capturing subpatterns are those from the
outermost level of the recursion at which the subpattern value is set.
If the pattern above is matched against

    (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

    \( ( ( (?>[^()]+) | (?R) )* ) \)
       ^                        ^
       ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, it saves data for the first
15 capturing parentheses only, as there is no way to give an
out-of-memory error from within a recursion.

PERFORMANCE
Certain items that may appear in patterns are more efficient than
others. It is more efficient to use a character class like [aeiou] than
a set of alternatives such as (a|e|i|o|u). In general, the simplest
construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book contains a lot of discussion about
optimizing regular expressions for efficient performance.

When a pattern begins with .* and the PCRE_DOTALL option is set, the
pattern is implicitly anchored by PCRE, since it can match only at the
start of a subject string. However, if PCRE_DOTALL is not set, PCRE
cannot make this optimization, because the . metacharacter does not
then match a newline, and if the subject string contains newlines, the
pattern may match from the character immediately following one of them
instead of from the very start. For example, the pattern

    (.*) second

matches the subject "first\nand second" (where \n stands for a newline
character) with the first captured substring being "and". In order to
do this, PCRE has to retry the match starting after every newline in
the subject.

If you are using such a pattern with subject strings that do not
contain newlines, the best performance is obtained by setting
PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit
anchoring. That saves PCRE from having to scan along the subject
looking for a newline to restart at.

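The effect of the newline on where the match starts can be seen in a
short sketch. It assumes Python's re module as a stand-in for PCRE,
with re.DOTALL as the counterpart of PCRE_DOTALL; it shows the matching
behaviour only, not PCRE's internal anchoring optimization.

```python
import re

subject = "first\nand second"

# Without DOTALL the . stops at the newline, so the match has to
# start after it.
plain = re.search(r"(.*) second", subject)

# With DOTALL the .* can run across the newline from the very start.
dotall = re.search(r"(.*) second", subject, re.DOTALL)

print(repr(plain.group(1)), plain.start())
print(repr(dotall.group(1)), dotall.start())
```
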
Beware of patterns that contain nested indefinite repeats. These can
take a long time to run when applied to a string that does not match.
Consider the pattern fragment

    (a+)*

This can match "aaaa" in 16 different ways, and this number increases
very rapidly as the string gets longer. (The * repeat can match 0, 1,
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE has in
principle to try every possible variation, and this can take an
extremely long time.

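The growth can be made concrete with a small counting sketch. It uses
one particular way of counting, which is an assumption of the sketch:
every prefix of the subject that (a+)* could match (including the
empty one) is combined with every way of dividing that prefix among
iterations of the * repeat. The function name count_ways is
hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def splits(length):
    """Ordered ways to cut `length` characters into non-empty pieces."""
    if length == 0:
        return 1
    # Choose the size of the first piece, then split what is left.
    return sum(splits(length - first) for first in range(1, length + 1))

def count_ways(n):
    """Ways (a+)* can match some prefix of a string of n 'a's."""
    return sum(splits(k) for k in range(n + 1))

for n in (4, 10, 20):
    print(n, count_ways(n))
```

Under this way of counting the total doubles with every extra
character, which is why failing matches become so slow.
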
An optimization catches some of the simpler cases such as

    (a+)*b

where a literal character follows. Before embarking on the standard
matching procedure, PCRE checks that there is a "b" later in the
subject string, and if there is not, it fails the match immediately.
However, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of

    (a+)*\d

with the pattern above. The pattern above gives a failure almost
instantly when applied to a whole line of "a" characters, whereas
(a+)*\d takes an appreciable time with strings longer than about 20
characters.

AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 27 January 2000
Copyright (c) 1997-2000 University of Cambridge.
