
Contents of /code/tags/pcre-4.1/doc/pcre.txt

Revision 41
Sat Feb 24 21:39:17 2007 UTC, by nigel
Original Path: code/trunk/doc/pcre.txt
File MIME type: text/plain
File size: 77499 byte(s)
Load pcre-2.08a into code/trunk.

1 NAME
2 pcre - Perl-compatible regular expressions.
3
4
5
6 SYNOPSIS
7 #include <pcre.h>
8
9 pcre *pcre_compile(const char *pattern, int options,
10 const char **errptr, int *erroffset,
11 const unsigned char *tableptr);
12
13 pcre_extra *pcre_study(const pcre *code, int options,
14 const char **errptr);
15
16 int pcre_exec(const pcre *code, const pcre_extra *extra,
17 const char *subject, int length, int startoffset,
18 int options, int *ovector, int ovecsize);
19
20 int pcre_copy_substring(const char *subject, int *ovector,
21 int stringcount, int stringnumber, char *buffer,
22 int buffersize);
23
24 int pcre_get_substring(const char *subject, int *ovector,
25 int stringcount, int stringnumber,
26 const char **stringptr);
27
28 int pcre_get_substring_list(const char *subject,
29 int *ovector, int stringcount, const char ***listptr);
30
31 const unsigned char *pcre_maketables(void);
32
33 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
34
35 char *pcre_version(void);
36
37 void *(*pcre_malloc)(size_t);
38
39 void (*pcre_free)(void *);
40
41
42
43
44 DESCRIPTION
45 The PCRE library is a set of functions that implement regu-
46 lar expression pattern matching using the same syntax and
47 semantics as Perl 5, with just a few differences (see
48 below). The current implementation corresponds to Perl
49 5.005.
50
51 PCRE has its own native API, which is described in this
52 document. There is also a set of wrapper functions that
53 correspond to the POSIX API. These are described in the
54 pcreposix documentation.
55 The native API function prototypes are defined in the header
56 file pcre.h, and on Unix systems the library itself is
57 called libpcre.a, so it can be accessed by adding -lpcre to
58 the command for linking an application which calls it.
59
60 The functions pcre_compile(), pcre_study(), and pcre_exec()
61 are used for compiling and matching regular expressions,
62 while pcre_copy_substring(), pcre_get_substring(), and
63 pcre_get_substring_list() are convenience functions for
64 extracting captured substrings from a matched subject
65 string. The function pcre_maketables() is used (optionally)
66 to build a set of character tables in the current locale for
67 passing to pcre_compile().
68
69 The function pcre_info() is used to find out information
70 about a compiled pattern, while the function pcre_version()
71 returns a pointer to a string containing the version of PCRE
72 and its date of release.
73
74 The global variables pcre_malloc and pcre_free initially
75 contain the entry points of the standard malloc() and free()
76 functions respectively. PCRE calls the memory management
77 functions via these variables, so a calling program can
78 replace them if it wishes to intercept the calls. This
79 should be done before calling any PCRE functions.
80
81
82
83 MULTI-THREADING
84 The PCRE functions can be used in multi-threading applica-
85 tions, with the proviso that the memory management functions
86 pointed to by pcre_malloc and pcre_free are shared by all
87 threads.
88
89 The compiled form of a regular expression is not altered
90 during matching, so the same compiled pattern can safely be
91 used by several threads at once.
92
93
94
95 COMPILING A PATTERN
96 The function pcre_compile() is called to compile a pattern
97 into an internal form. The pattern is a C string terminated
98 by a binary zero, and is passed in the argument pattern. A
99 pointer to a single block of memory that is obtained via
100 pcre_malloc is returned. This contains the compiled code and
101 related data. The pcre type is defined for this for conveni-
102 ence, but in fact pcre is just a typedef for void, since the
103 contents of the block are not externally defined. It is up
104 to the caller to free the memory when it is no longer
105 required.
106
107 The size of a compiled pattern is roughly proportional to
108 the length of the pattern string, except that each character
109 class (other than those containing just a single character,
110 negated or not) requires 33 bytes, and repeat quantifiers
111 with a minimum greater than one or a bounded maximum cause
112 the relevant portions of the compiled pattern to be repli-
113 cated.
114
115 The options argument contains independent bits that affect
116 the compilation. It should be zero if no options are
117 required. Some of the options, in particular, those that are
118 compatible with Perl, can also be set and unset from within
119 the pattern (see the detailed description of regular expres-
120 sions below). For these options, the contents of the options
121 argument specify their initial settings at the start of
122 compilation and execution. The PCRE_ANCHORED option can be
123 set at the time of matching as well as at compile time.
124
125 If errptr is NULL, pcre_compile() returns NULL immediately.
126 Otherwise, if compilation of a pattern fails, pcre_compile()
127 returns NULL, and sets the variable pointed to by errptr to
128 point to a textual error message. The offset from the start
129 of the pattern to the character where the error was
130 discovered is placed in the variable pointed to by
131 erroffset, which must not be NULL. If it is, an immediate
132 error is given.
133
134 If the final argument, tableptr, is NULL, PCRE uses a
135 default set of character tables which are built when it is
136 compiled, using the default C locale. Otherwise, tableptr
137 must be the result of a call to pcre_maketables(). See the
138 section on locale support below.
139
140 The following option bits are defined in the header file:
141
142 PCRE_ANCHORED
143
144 If this bit is set, the pattern is forced to be "anchored",
145 that is, it is constrained to match only at the start of the
146 string which is being searched (the "subject string"). This
147 effect can also be achieved by appropriate constructs in the
148 pattern itself, which is the only way to do it in Perl.
149
150 PCRE_CASELESS
151
152 If this bit is set, letters in the pattern match both upper
153 and lower case letters. It is equivalent to Perl's /i
154 option.
155
156 PCRE_DOLLAR_ENDONLY
157
158 If this bit is set, a dollar metacharacter in the pattern
159 matches only at the end of the subject string. Without this
160 option, a dollar also matches immediately before the final
161 character if it is a newline (but not before any other new-
162 lines). The PCRE_DOLLAR_ENDONLY option is ignored if
163 PCRE_MULTILINE is set. There is no equivalent to this option
164 in Perl.
165
166 PCRE_DOTALL
167
168 If this bit is set, a dot metacharacter in the pattern
169 matches all characters, including newlines. Without it, new-
170 lines are excluded. This option is equivalent to Perl's /s
171 option. A negative class such as [^a] always matches a new-
172 line character, independent of the setting of this option.
173
174 PCRE_EXTENDED
175
176 If this bit is set, whitespace data characters in the pat-
177 tern are totally ignored except when escaped or inside a
178 character class, and characters between an unescaped # out-
179 side a character class and the next newline character,
180 inclusive, are also ignored. This is equivalent to Perl's /x
181 option, and makes it possible to include comments inside
182 complicated patterns. Note, however, that this applies only
183 to data characters. Whitespace characters may never appear
184 within special character sequences in a pattern, for example
185 within the sequence (?( which introduces a conditional sub-
186 pattern.
187
188 PCRE_EXTRA
189
190 This option turns on additional functionality of PCRE that
191 is incompatible with Perl. Any backslash in a pattern that
192 is followed by a letter that has no special meaning causes
193 an error, thus reserving these combinations for future
194 expansion. By default, as in Perl, a backslash followed by a
195 letter with no special meaning is treated as a literal.
196 There are at present no other features controlled by this
197 option.
198
199 PCRE_MULTILINE
200
201 By default, PCRE treats the subject string as consisting of
202 a single "line" of characters (even if it actually contains
203 several newlines). The "start of line" metacharacter (^)
204 matches only at the start of the string, while the "end of
205 line" metacharacter ($) matches only at the end of the
206 string, or before a terminating newline (unless
207 PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
208
209 When PCRE_MULTILINE is set, the "start of line" and "end
210 of line" constructs match immediately following or
211 immediately before any newline in the subject string,
212 respectively, as well as at the very start and end. This is
213 equivalent to Perl's /m option. If there are no "\n" charac-
214 ters in a subject string, or no occurrences of ^ or $ in a
215 pattern, setting PCRE_MULTILINE has no effect.
216
217 PCRE_UNGREEDY
218
219 This option inverts the "greediness" of the quantifiers so
220 that they are not greedy by default, but become greedy if
221 followed by "?". It is not compatible with Perl. It can also
222 be set by a (?U) option setting within the pattern.
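
The rules given above for where a dollar metacharacter may
match, under PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE, can be
summarized in a small predicate. This is only a sketch of the
documented behaviour, not code taken from PCRE: the function
name dollar_matches_at and its flag arguments are invented for
illustration, and the PCRE_NOTEOL matching option is not
modelled.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if "$" may match at offset pos of
   subject (0 <= pos <= length), following the rules in the text. */
int dollar_matches_at(const char *subject, int length, int pos,
                      int multiline, int dollar_endonly)
{
    assert(pos >= 0 && pos <= length);
    if (pos == length)
        return 1;                      /* very end of the subject */
    if (multiline)
        return subject[pos] == '\n';   /* before any newline; ENDONLY ignored */
    if (dollar_endonly)
        return 0;                      /* only the very end is allowed */
    /* default: also immediately before a final newline */
    return pos == length - 1 && subject[pos] == '\n';
}
```

For the subject "ab\n", the predicate accepts offset 2 by
default but rejects it when dollar_endonly is non-zero.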
223
224
225
226 STUDYING A PATTERN
227 When a pattern is going to be used several times, it is
228 worth spending more time analyzing it in order to reduce
229 the time taken for matching. The function pcre_study() takes
230 a pointer to a compiled pattern as its first argument, and
231 returns a pointer to a pcre_extra block (another void
232 typedef) containing additional information about the pat-
233 tern; this can be passed to pcre_exec(). If no additional
234 information is available, NULL is returned.
235
236 The second argument contains option bits. At present, no
237 options are defined for pcre_study(), and this argument
238 should always be zero.
239
240 The third argument for pcre_study() is a pointer to an error
241 message. If studying succeeds (even if no data is returned),
242 the variable it points to is set to NULL. Otherwise it
243 points to a textual error message.
244
245 At present, studying a pattern is useful only for non-
246 anchored patterns that do not have a single fixed starting
247 character. A bitmap of possible starting characters is
248 created.
249
250
251
252 LOCALE SUPPORT
253 PCRE handles caseless matching, and determines whether char-
254 acters are letters, digits, and so on, by reference to a
255 set of tables. The library contains a default set of tables
256 which is created in the default C locale when PCRE is com-
257 piled. This is used when the final argument of
258 pcre_compile() is NULL, and is sufficient for many applica-
259 tions.
260
261 An alternative set of tables can, however, be supplied. Such
262 tables are built by calling the pcre_maketables() function,
263 which has no arguments, in the relevant locale. The result
264 can then be passed to pcre_compile() as often as necessary.
265 For example, to build and use tables that are appropriate
266 for the French locale (where accented characters with codes
267 greater than 128 are treated as letters), the following code
268 could be used:
269
270 setlocale(LC_CTYPE, "fr");
271 tables = pcre_maketables();
272 re = pcre_compile(..., tables);
273
274 The tables are built in memory that is obtained via
275 pcre_malloc. The pointer that is passed to pcre_compile is
276 saved with the compiled pattern, and the same tables are
277 used via this pointer by pcre_study() and pcre_exec(). Thus
278 for any single pattern, compilation, studying and matching
279 all happen in the same locale, but different patterns can be
280 compiled in different locales. It is the caller's responsi-
281 bility to ensure that the memory containing the tables
282 remains available for as long as it is needed.
283
284
285
286 INFORMATION ABOUT A PATTERN
287 The pcre_info() function returns information about a com-
288 piled pattern. Its yield is the number of capturing subpat-
289 terns, or one of the following negative numbers:
290
291 PCRE_ERROR_NULL       the argument code was NULL
292 PCRE_ERROR_BADMAGIC   the "magic number" was not found
293
294 If the optptr argument is not NULL, a copy of the options
295 with which the pattern was compiled is placed in the integer
296 it points to. These option bits are those specified in the
297 call to pcre_compile(), modified by any top-level option
298 settings within the pattern itself, and with the
299 PCRE_ANCHORED bit set if the form of the pattern implies
300 that it can match only at the start of a subject string.
301
302 If the pattern is not anchored and the firstcharptr argument
303 is not NULL, it is used to pass back information about the
304 first character of any matched string. If there is a fixed
305 first character, e.g. from a pattern such as
306 (cat|cow|coyote), then it is returned in the integer pointed
307 to by firstcharptr. Otherwise, if either
308
309 (a) the pattern was compiled with the PCRE_MULTILINE option,
310 and every branch starts with "^", or
311
312 (b) every branch of the pattern starts with ".*" and
313 PCRE_DOTALL is not set (if it were set, the pattern would be
314 anchored),
315 then -1 is returned, indicating that the pattern matches
316 only at the start of a subject string or after any "\n"
317 within the string. Otherwise -2 is returned.
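
A caller might decode the value stored through firstcharptr as
follows. This is a minimal sketch based only on the three cases
described above; the function name describe_firstchar is
hypothetical and is not part of the PCRE API.

```c
#include <assert.h>

/* Hypothetical helper: turns the integer passed back via firstcharptr
   into a short description of what it means. */
const char *describe_firstchar(int firstchar)
{
    if (firstchar >= 0)
        return "fixed first character";
    if (firstchar == -1)
        return "matches only at the start of the subject or after a newline";
    if (firstchar == -2)
        return "no first-character information";
    return "unexpected value";
}
```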
318
319
320
321 MATCHING A PATTERN
322 The function pcre_exec() is called to match a subject string
323 against a pre-compiled pattern, which is passed in the code
324 argument. If the pattern has been studied, the result of the
325 study should be passed in the extra argument. Otherwise this
326 must be NULL.
327
328 The PCRE_ANCHORED option can be passed in the options argu-
329 ment, whose unused bits must be zero. However, if a pattern
330 was compiled with PCRE_ANCHORED, or turned out to be
331 anchored by virtue of its contents, it cannot be made
332 unanchored at matching time.
333
334 There are also three further options that can be set only at
335 matching time:
336
337 PCRE_NOTBOL
338
339 The first character of the string is not the beginning of a
340 line, so the circumflex metacharacter should not match
341 before it. Setting this without PCRE_MULTILINE (at compile
342 time) causes circumflex never to match.
343
344 PCRE_NOTEOL
345
346 The end of the string is not the end of a line, so the dol-
347 lar metacharacter should not match it nor (except in multi-
348 line mode) a newline immediately before it. Setting this
349 without PCRE_MULTILINE (at compile time) causes dollar never
350 to match.
351
352 PCRE_NOTEMPTY
353
354 An empty string is not considered to be a valid match if
355 this option is set. If there are alternatives in the pat-
356 tern, they are tried. If all the alternatives match the
357 empty string, the entire match fails. For example, if the
358 pattern
359
360 a?b?
361
362 is applied to a string not beginning with "a" or "b", it
363 matches the empty string at the start of the subject. With
364 PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
365 further into the string for occurrences of "a" or "b".
366
367 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
368 make a special case of a pattern match of the empty string
369 within its split() function, and when using the /g modifier.
370 It is possible to emulate Perl's behaviour after matching a
371 null string by first trying the match again at the same
372 offset with PCRE_NOTEMPTY set, and then if that fails by
373 advancing the starting offset (see below) and trying an
374 ordinary match again.
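
The retry scheme just described can be sketched as a loop. To
keep the example self-contained it does not call pcre_exec();
instead a stand-in matcher for the pattern x* imitates an
unanchored search. Everything here is an assumption made for
illustration: match_x_star, find_all, SKETCH_NOTEMPTY and
SKETCH_NOMATCH are invented names, with the two macros playing
the roles of PCRE_NOTEMPTY and PCRE_ERROR_NOMATCH.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NOTEMPTY 0x01   /* stand-in for PCRE_NOTEMPTY */
#define SKETCH_NOMATCH  (-1)   /* stand-in for PCRE_ERROR_NOMATCH */

/* Stand-in for pcre_exec() with the pattern x*: searches forward from
   startoffset, stores the match in ov[0], ov[1] and returns 1. */
int match_x_star(const char *subject, int length, int startoffset,
                 int options, int *ov)
{
    int pos;
    for (pos = startoffset; pos <= length; pos++) {
        int end = pos;
        while (end < length && subject[end] == 'x')
            end++;
        if (end == pos && (options & SKETCH_NOTEMPTY))
            continue;                    /* empty match not allowed here */
        ov[0] = pos;
        ov[1] = end;
        return 1;
    }
    return SKETCH_NOMATCH;
}

/* Stores up to max matches as (start, end) pairs in out[] and returns
   how many were found. */
int find_all(const char *subject, int *out, int max)
{
    int length = (int)strlen(subject);
    int offset = 0, options = 0, count = 0;
    int ov[2];

    assert(out != NULL && max >= 0);
    while (offset <= length && count < max) {
        int rc = match_x_star(subject, length, offset, options, ov);
        if (rc < 0) {
            if (options == 0)
                break;                   /* ordinary match failed: finished */
            options = 0;                 /* retry failed: advance one place */
            offset++;                    /* and try an ordinary match again */
            continue;
        }
        out[2 * count] = ov[0];
        out[2 * count + 1] = ov[1];
        count++;
        offset = ov[1];
        /* after an empty match, retry at the same offset with the
           not-empty option set */
        options = (ov[0] == ov[1]) ? SKETCH_NOTEMPTY : 0;
    }
    return count;
}
```

With real PCRE, the call to match_x_star() would be replaced by
a call to pcre_exec() that passes the same offset and options.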
375
376 The subject string is passed as a pointer in subject, a
377 length in length, and a starting offset in startoffset.
378 Unlike the pattern string, it may contain binary zero char-
379 acters. When the starting offset is zero, the search for a
380 match starts at the beginning of the subject, and this is by
381 far the most common case.
382
383 A non-zero starting offset is useful when searching for
384 another match in the same subject by calling pcre_exec()
385 again after a previous success. Setting startoffset differs
386 from just passing over a shortened string and setting
387 PCRE_NOTBOL in the case of a pattern that begins with any
388 kind of lookbehind. For example, consider the pattern
389
390 \Biss\B
391
392 which finds occurrences of "iss" in the middle of words. (\B
393 matches only if the current position in the subject is not a
394 word boundary.) When applied to the string "Mississippi" the
395 first call to pcre_exec() finds the first occurrence. If
396 pcre_exec() is called again with just the remainder of the
397 subject, namely "issippi", it does not match, because \B is
398 always false at the start of the subject, which is deemed to
399 be a word boundary. However, if pcre_exec() is passed the
400 entire string again, but with startoffset set to 4, it finds
401 the second occurrence of "iss" because it is able to look
402 behind the starting point to discover that it is preceded by
403 a letter.
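
The difference comes down to what can be seen behind the
starting point. The following sketch tests for a word boundary
at a given offset; it is not PCRE code, the names is_word_char
and at_word_boundary are hypothetical, and it assumes that word
characters are letters, digits and underscore.

```c
#include <assert.h>
#include <ctype.h>

/* Assumption: a "word" character is a letter, digit or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if offset pos of subject is a word boundary. Looking at
   subject[pos - 1] is possible only when pos > 0, which is why a
   shortened subject loses the information. */
int at_word_boundary(const char *subject, int length, int pos)
{
    int before, after;

    assert(pos >= 0 && pos <= length);
    before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

Offset 4 of the whole example string is not a boundary, so \B
can succeed there, whereas offset 0 of the shortened string is
a boundary, so \B fails.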
404
405 If a non-zero starting offset is passed when the pattern is
406 anchored, one attempt to match at the given offset is tried.
407 This can only succeed if the pattern does not require the
408 match to be at the start of the subject.
409
410 In general, a pattern matches a certain portion of the sub-
411 ject, and in addition, further substrings from the subject
412 may be picked out by parts of the pattern. Following the
413 usage in Jeffrey Friedl's book, this is called "capturing"
414 in what follows, and the phrase "capturing subpattern" is
415 used for a fragment of a pattern that picks out a substring.
416 PCRE supports several other kinds of parenthesized subpat-
417 tern that do not cause substrings to be captured.
418
419 Captured substrings are returned to the caller via a vector
420 of integer offsets whose address is passed in ovector. The
421 number of elements in the vector is passed in ovecsize. The
422 first two-thirds of the vector is used to pass back captured
423 substrings, each substring using a pair of integers. The
424 remaining third of the vector is used as workspace by
425 pcre_exec() while matching capturing subpatterns, and is not
426 available for passing back information. The length passed in
427 ovecsize should always be a multiple of three. If it is not,
428 it is rounded down.
429
430 When a match has been successful, information about captured
431 substrings is returned in pairs of integers, starting at the
432 beginning of ovector, and continuing up to two-thirds of its
433 length at the most. The first element of a pair is set to
434 the offset of the first character in a substring, and the
435 second is set to the offset of the first character after the
436 end of a substring. The first pair, ovector[0] and ovec-
437 tor[1], identify the portion of the subject string matched
438 by the entire pattern. The next pair is used for the first
439 capturing subpattern, and so on. The value returned by
440 pcre_exec() is the number of pairs that have been set. If
441 there are no capturing subpatterns, the return value from a
442 successful match is 1, indicating that just the first pair
443 of offsets has been set.
444
445 Some convenience functions are provided for extracting the
446 captured substrings as separate strings. These are described
447 in the following section.
448
449 It is possible for capturing subpattern number n+1 to
450 match some part of the subject when subpattern n has not
451 been used at all. For example, if the string "abc" is
452 matched against the pattern (a|(z))(bc) subpatterns 1 and 3
453 are matched, but 2 is not. When this happens, both offset
454 values corresponding to the unused subpattern are set to -1.
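
Code that reads the offsets directly therefore has to check for
the -1 values. The sketch below does so; the function name
captured_length is hypothetical, and rc stands for a positive
value returned by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of captured substring n, or -1 if no
   pair was set for n or the subpattern took no part in the match. */
int captured_length(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc)
        return -1;                    /* no pair was set for this number */
    if (ovector[2 * n] < 0)
        return -1;                    /* unset: both offsets are -1 */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For the example above, a hand-written vector holding the pairs
(0,3), (0,1), (-1,-1), (1,3) with rc equal to 4 gives lengths
3, 1, -1 and 2 for substrings 0 to 3.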
455
456 If a capturing subpattern is matched repeatedly, it is the
457 last portion of the string that it matched that gets
458 returned.
459
460 If the vector is too small to hold all the captured sub-
461 strings, it is used as far as possible (up to two-thirds of
462 its length), and the function returns a value of zero. In
463 particular, if the substring offsets are not of interest,
464 pcre_exec() may be called with ovector passed as NULL and
465 ovecsize as zero. However, if the pattern contains back
466 references and the ovector isn't big enough to remember the
467 related substrings, PCRE has to get additional memory for
468 use during matching. Thus it is usually advisable to supply
469 an ovector.
470
471 Note that pcre_info() can be used to find out how many cap-
472 turing subpatterns there are in a compiled pattern. The
473 smallest size for ovector that will allow for n captured
474 substrings in addition to the offsets of the substring
475 matched by the whole pattern is (n+1)*3.
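
The arithmetic can be written down as two small helpers. These
are illustrative only; the names ovector_size_for and
reportable_pairs do not exist in PCRE.

```c
#include <assert.h>

/* Smallest ovecsize that leaves room for ncaptures captured substrings
   plus the pair for the whole match. */
int ovector_size_for(int ncaptures)
{
    assert(ncaptures >= 0);
    return (ncaptures + 1) * 3;
}

/* Number of offset pairs that can be passed back in a vector of
   ovecsize integers: ovecsize is rounded down to a multiple of three
   and the final third is workspace. */
int reportable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A pattern with two capturing subpatterns needs at least 9
integers, and a vector of 10 integers still reports only 3
pairs.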
476
477 If pcre_exec() fails, it returns a negative number. The fol-
478 lowing are defined in the header file:
479
480 PCRE_ERROR_NOMATCH (-1)
481
482 The subject string did not match the pattern.
483
484 PCRE_ERROR_NULL (-2)
485
486 Either code or subject was passed as NULL, or ovector was
487 NULL and ovecsize was not zero.
488
489 PCRE_ERROR_BADOPTION (-3)
490
491 An unrecognized bit was set in the options argument.
492
493 PCRE_ERROR_BADMAGIC (-4)
494
495 PCRE stores a 4-byte "magic number" at the start of the com-
496 piled code, to catch the case when it is passed a junk
497 pointer. This is the error it gives when the magic number
498 isn't present.
499
500 PCRE_ERROR_UNKNOWN_NODE (-5)
501
502 While running the pattern match, an unknown item was encoun-
503 tered in the compiled pattern. This error could be caused by
504 a bug in PCRE or by overwriting of the compiled pattern.
505
506 PCRE_ERROR_NOMEMORY (-6)
507
508 If a pattern contains back references, but the ovector that
509 is passed to pcre_exec() is not big enough to remember the
510 referenced substrings, PCRE gets a block of memory at the
511 start of matching to use for this purpose. If the call via
512 pcre_malloc() fails, this error is given. The memory is
513 freed at the end of matching.
514
515
516
517 EXTRACTING CAPTURED SUBSTRINGS
518 Captured substrings can be accessed directly by using the
519 offsets returned by pcre_exec() in ovector. For convenience,
520 the functions pcre_copy_substring(), pcre_get_substring(),
521 and pcre_get_substring_list() are provided for extracting
522 captured substrings as new, separate, zero-terminated
523 strings. A substring that contains a binary zero is
524 correctly extracted and has a further zero added on the end,
525 but the result does not, of course, function as a C string.
526
527 The first three arguments are the same for all three func-
528 tions: subject is the subject string which has just been
529 successfully matched, ovector is a pointer to the vector of
530 integer offsets that was passed to pcre_exec(), and
531 stringcount is the number of substrings that were captured
532 by the match, including the substring that matched the
533 entire regular expression. This is the value returned by
534 pcre_exec if it is greater than zero. If pcre_exec()
535 returned zero, indicating that it ran out of space in ovec-
536 tor, then the value passed as stringcount should be the size
537 of the vector divided by three.
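
The choice of stringcount can be sketched as below. The helper
name stringcount_for is hypothetical, and returning zero for a
failed match is a choice made for this sketch rather than
something the library specifies.

```c
#include <assert.h>

/* Hypothetical helper: the stringcount to pass to the extraction
   functions, given the value rc returned by pcre_exec() and the
   ovecsize that was passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* this many pairs were set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was filled as far as possible */
    return 0;                   /* match failed: nothing to extract */
}
```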
538
539 The functions pcre_copy_substring() and pcre_get_substring()
540 extract a single substring, whose number is given as string-
541 number. A value of zero extracts the substring that matched
542 the entire pattern, while higher values extract the captured
543 substrings. For pcre_copy_substring(), the string is placed
544 in buffer, whose length is given by buffersize, while for
545 pcre_get_substring() a new block of store is obtained via
546 pcre_malloc, and its address is returned via stringptr. The
547 yield of the function is the length of the string, not
548 including the terminating zero, or one of
549
550 PCRE_ERROR_NOMEMORY (-6)
551
552 The buffer was too small for pcre_copy_substring(), or the
553 attempt to get memory failed for pcre_get_substring().
554
555 PCRE_ERROR_NOSUBSTRING (-7)
556
557 There is no substring whose number is stringnumber.
558
559 The pcre_get_substring_list() function extracts all avail-
560 able substrings and builds a list of pointers to them. All
561 this is done in a single block of memory which is obtained
562 via pcre_malloc. The address of the memory block is returned
563 via listptr, which is also the start of the list of string
564 pointers. The end of the list is marked by a NULL pointer.
565 The yield of the function is zero if all went well, or
566
567 PCRE_ERROR_NOMEMORY (-6)
568
569 if the attempt to get the memory block failed.
570
571 When any of these functions encounters a substring that is
572 unset, which can happen when capturing subpattern number n+1
573 matches some part of the subject, but subpattern n has not
574 been used at all, they return an empty string. This can be
575 distinguished from a genuine zero-length substring by
576 inspecting the appropriate offset in ovector, which is nega-
577 tive for unset substrings.
578
579
580
581
582 LIMITATIONS
583 There are some size limitations in PCRE but it is hoped that
584 they will never in practice be relevant. The maximum length
585 of a compiled pattern is 65539 (sic) bytes. All values in
586 repeating quantifiers must be less than 65536. The maximum
587 number of capturing subpatterns is 99. The maximum number
588 of all parenthesized subpatterns, including capturing sub-
589 patterns, assertions, and other types of subpattern, is 200.
590
591 The maximum length of a subject string is the largest posi-
592 tive number that an integer variable can hold. However, PCRE
593 uses recursion to handle subpatterns and indefinite repeti-
594 tion. This means that the available stack space may limit
595 the size of a subject string that can be processed by cer-
596 tain patterns.
597
598
599
600 DIFFERENCES FROM PERL
601 The differences described here are with respect to Perl
602 5.005.
603
604 1. By default, a whitespace character is any character that
605 the C library function isspace() recognizes, though it is
606 possible to compile PCRE with alternative character type
607 tables. Normally isspace() matches space, formfeed, newline,
608 carriage return, horizontal tab, and vertical tab. Perl 5 no
609 longer includes vertical tab in its set of whitespace char-
610 acters. The \v escape that was in the Perl documentation for
611 a long time was never in fact recognized. However, the char-
612 acter itself was treated as whitespace at least up to 5.002.
613 In 5.004 and 5.005 it does not match \s.
614
615 2. PCRE does not allow repeat quantifiers on lookahead
616 assertions. Perl permits them, but they do not mean what you
617 might think. For example, (?!a){3} does not assert that the
618 next three characters are not "a". It just asserts that the
619 next character is not "a" three times.
620
621 3. Capturing subpatterns that occur inside negative looka-
622 head assertions are counted, but their entries in the
623 offsets vector are never set. Perl sets its numerical vari-
624 ables from any such patterns that are matched before the
625 assertion fails to match something (thereby succeeding), but
626 only if the negative lookahead assertion contains just one
627 branch.
628
629 4. Though binary zero characters are supported in the sub-
630 ject string, they are not allowed in a pattern string
631 because it is passed as a normal C string, terminated by
632 zero. The escape sequence "\0" can be used in the pattern to
633 represent a binary zero.
634
635 5. The following Perl escape sequences are not supported:
636 \l, \u, \L, \U, \E, \Q. In fact these are implemented by
637 Perl's general string-handling and are not part of its pat-
638 tern matching engine.
639
640 6. The Perl \G assertion is not supported as it is not
641 relevant to single pattern matches.
642
643 7. Fairly obviously, PCRE does not support the (?{code})
644 construction.
645
646 8. There are at the time of writing some oddities in Perl
647 5.005_02 concerned with the settings of captured strings
648 when part of a pattern is repeated. For example, matching
649 "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
650 "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
651 unset. However, if the pattern is changed to
652 /^(aa(b(b))?)+$/ then $2 (and $3) get set.
653
654 In Perl 5.004 $2 is set in both cases, and that is also true
655 of PCRE. If in the future Perl changes to a consistent state
656 that is different, PCRE may change to follow.
657
658 9. Another as yet unresolved discrepancy is that in Perl
659 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
660 "a", whereas in PCRE it does not. However, in both Perl and
661 PCRE /^(a)?a/ matched against "a" leaves $1 unset.
662
663 10. PCRE provides some extensions to the Perl regular
664 expression facilities:
665
666 (a) Although lookbehind assertions must match fixed length
667 strings, each alternative branch of a lookbehind assertion
668 can match a different length of string. Perl 5.005 requires
669 them all to have the same length.
670
671 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
672 set, the $ meta- character matches only at the very end of
673 the string.
674
675 (c) If PCRE_EXTRA is set, a backslash followed by a letter
676 with no special meaning is faulted.
677
678 (d) If PCRE_UNGREEDY is set, the greediness of the
679 repetition quantifiers is inverted, that is, by default they
680 are not greedy, but if followed by a question mark they are.
681
682 (e) PCRE_ANCHORED can be used to force a pattern to be tried
683 only at the start of the subject.
684
685 (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
686 for pcre_exec() have no Perl equivalents.
687
688
689
690 REGULAR EXPRESSION DETAILS
691 The syntax and semantics of the regular expressions sup-
692 ported by PCRE are described below. Regular expressions are
693 also described in the Perl documentation and in a number of
694 other books, some of which have copious examples. Jeffrey
695 Friedl's "Mastering Regular Expressions", published by
696 O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
697 The description here is intended as reference documentation.
698
699 A regular expression is a pattern that is matched against a
700 subject string from left to right. Most characters stand for
701 themselves in a pattern, and match the corresponding charac-
702 ters in the subject. As a trivial example, the pattern
703
704 The quick brown fox
705
706 matches a portion of a subject string that is identical to
707 itself. The power of regular expressions comes from the
708 ability to include alternatives and repetitions in the pat-
709 tern. These are encoded in the pattern by the use of meta-
710 characters, which do not stand for themselves but instead
711 are interpreted in some special way.
712
713 There are two different sets of meta-characters: those that
714 are recognized anywhere in the pattern except within square
715 brackets, and those that are recognized in square brackets.
716 Outside square brackets, the meta-characters are as follows:
717
718 \      general escape character with several uses
719 ^      assert start of subject (or line, in multiline
720        mode)
721 $      assert end of subject (or line, in multiline mode)
722 .      match any character except newline (by default)
723 [      start character class definition
724 |      start of alternative branch
725 (      start subpattern
726 )      end subpattern
727 ?      extends the meaning of (
728        also 0 or 1 quantifier
729        also quantifier minimizer
730 *      0 or more quantifier
731 +      1 or more quantifier
732 {      start min/max quantifier
733
734 Part of a pattern that is in square brackets is called a
735 "character class". In a character class the only meta-
736 characters are:
737
738 \ general escape character
739 ^ negate the class, but only if the first character
740 - indicates character range
741 ] terminates the character class
742
743 The following sections describe the use of each of the
744 meta-characters.
745
746
747
748 BACKSLASH
749 The backslash character has several uses. Firstly, if it is
750 followed by a non-alphameric character, it takes away any
751 special meaning that character may have. This use of
752 backslash as an escape character applies both inside and
753 outside character classes.
754
755 For example, if you want to match a "*" character, you write
756 "\*" in the pattern. This applies whether or not the follow-
757 ing character would otherwise be interpreted as a meta-
758 character, so it is always safe to precede a non-alphameric
759 with "\" to specify that it stands for itself. In particu-
760 lar, if you want to match a backslash, you write "\\".
761
762 If a pattern is compiled with the PCRE_EXTENDED option, whi-
763 tespace in the pattern (other than in a character class) and
764 characters between a "#" outside a character class and the
765 next newline character are ignored. An escaping backslash
766 can be used to include a whitespace or "#" character as part
767 of the pattern.
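
The effect of extended mode can be sketched in Python, whose
re.VERBOSE flag is assumed here to be a close analogue of
PCRE_EXTENDED. This is Python's re module, not PCRE, and the
pattern and the helper name matches() are invented for the
illustration.

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED (assumption: Python's re
# module, not PCRE itself).
PATTERN = re.compile(r"""
    \d+      # one or more digits; this comment is ignored
    \ \#\    # escaped space, escaped "#", escaped space
    \d+      # one or more digits
""", re.VERBOSE)


def matches(subject):
    """Return True if the whole subject matches PATTERN."""
    return PATTERN.fullmatch(subject) is not None
```

Only the escaped space and "#" characters remain part of the
pattern, so the subject must contain them literally.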
768
769 A second use of backslash provides a way of encoding non-
770 printing characters in patterns in a visible manner. There
771 is no restriction on the appearance of non-printing charac-
772 ters, apart from the binary zero that terminates a pattern,
773 but when a pattern is being prepared by text editing, it is
774 usually easier to use one of the following escape sequences
775 than the binary character it represents:
776
777 \a alarm, that is, the BEL character (hex 07)
778 \cx "control-x", where x is any character
779 \e escape (hex 1B)
780 \f formfeed (hex 0C)
781 \n newline (hex 0A)
782 \r carriage return (hex 0D)
784 \t tab (hex 09)
785 \xhh character with hex code hh
786 \ddd character with octal code ddd, or backreference
787
788 The precise effect of "\cx" is as follows: if "x" is a lower
789 case letter, it is converted to upper case. Then bit 6 of
790 the character (hex 40) is inverted. Thus "\cz" becomes hex
791 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
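
The rule for "\cx" is plain arithmetic and can be written out
as a short Python sketch. The function name control_char() is
hypothetical, and the sketch assumes a single ASCII character.

```python
def control_char(x):
    """Return the character code that a "\\cx" escape stands for."""
    code = ord(x.upper())   # lower case letters become upper case
    return code ^ 0x40      # invert bit 6 (hex 40)
```

For instance, control_char("z") gives hex 1A, in line with the
description above.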
792
793 After "\x", up to two hexadecimal digits are read (letters
794 can be in upper or lower case).
795
796 After "\0" up to two further octal digits are read. In both
797 cases, if there are fewer than two digits, just those that
798 are present are used. Thus the sequence "\0\x\07" specifies
799 two binary zeros followed by a BEL character. Make sure you
800 supply two digits after the initial zero if the character
801 that follows is itself an octal digit.
802
803 The handling of a backslash followed by a digit other than 0
804 is complicated. Outside a character class, PCRE reads it
805 and any following digits as a decimal number. If the number
806 is less than 10, or if there have been at least that many
807 previous capturing left parentheses in the expression, the
808 entire sequence is taken as a back reference. A description
809 of how this works is given later, following the discussion
810 of parenthesized subpatterns.
811
812 Inside a character class, or if the decimal number is
813 greater than 9 and there have not been that many capturing
814 subpatterns, PCRE re-reads up to three octal digits follow-
815 ing the backslash, and generates a single byte from the
816 least significant 8 bits of the value. Any subsequent digits
817 stand for themselves. For example:
818
819 \040 is another way of writing a space
820 \40 is the same, provided there are fewer than 40
821 previous capturing subpatterns
822 \7 is always a back reference
823 \11 might be a back reference, or another way of
824 writing a tab
825 \011 is always a tab
826 \0113 is a tab followed by the character "3"
827 \113 is the character with octal code 113 (since there
828 can be no more than 99 back references)
829 \377 is a byte consisting entirely of 1 bits
830 \81 is either a back reference, or a binary zero
831 followed by the two characters "8" and "1"
832
833 Note that octal values of 100 or greater must not be intro-
834 duced by a leading zero, because no more than three octal
835 digits are ever read.

836 All the sequences that define a single byte value can be
837 used both inside and outside character classes. In addition,
838 inside a character class, the sequence "\b" is interpreted
839 as the backspace character (hex 08). Outside a character
840 class it has a different meaning (see below).
841
842 The third use of backslash is for specifying generic charac-
843 ter types:
844
845 \d any decimal digit
846 \D any character that is not a decimal digit
847 \s any whitespace character
848 \S any character that is not a whitespace character
849 \w any "word" character
850 \W any "non-word" character
851
852 Each pair of escape sequences partitions the complete set of
853 characters into two disjoint sets. Any given character
854 matches one, and only one, of each pair.
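
The partitioning property can be checked mechanically. The
sketch below uses Python's re module with the re.ASCII flag as
a stand-in for PCRE with its default tables (an assumption),
and tests every 7-bit character against each pair. The names
PAIRS, matching_side() and partitions_cleanly() are invented
for the example.

```python
import re

# Each tuple holds a character type and its complement.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]


def matching_side(ch, pair):
    """Count how many members of the pair match the character ch."""
    return sum(1 for p in pair if re.fullmatch(p, ch, re.ASCII))


def partitions_cleanly():
    """True if every 7-bit character matches exactly one of each pair."""
    return all(
        matching_side(chr(code), pair) == 1
        for code in range(128)
        for pair in PAIRS
    )
```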
855
856 A "word" character is any letter or digit or the underscore
857 character, that is, any character which can be part of a
858 Perl "word". The definition of letters and digits is con-
859 trolled by PCRE's character tables, and may vary if locale-
860 specific matching is taking place (see "Locale support"
861 above). For example, in the "fr" (French) locale, some char-
862 acter codes greater than 128 are used for accented letters,
863 and these are matched by \w.
864
865 These character type sequences can appear both inside and
866 outside character classes. They each match one character of
867 the appropriate type. If the current matching point is at
868 the end of the subject string, all of them fail, since there
869 is no character to match.
870
871 The fourth use of backslash is for certain simple asser-
872 tions. An assertion specifies a condition that has to be met
873 at a particular point in a match, without consuming any
874 characters from the subject string. The use of subpatterns
875 for more complicated assertions is described below. The
876 backslashed assertions are
877
878 \b word boundary
879 \B not a word boundary
880 \A start of subject (independent of multiline mode)
881 \Z end of subject or newline at end (independent of
882 multiline mode)
883 \z end of subject (independent of multiline mode)
884
885 These assertions may not appear in character classes (but
886 note that "\b" has a different meaning, namely the backspace
887 character, inside a character class).

888 A word boundary is a position in the subject string where
889 the current character and the previous character do not both
890 match \w or \W (i.e. one matches \w and the other matches
891 \W), or the start or end of the string if the first or last
892 character matches \w, respectively.
893
894 The \A, \Z, and \z assertions differ from the traditional
895 circumflex and dollar (described below) in that they only
896 ever match at the very start and end of the subject string,
897 whatever options are set. They are not affected by the
898 PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
899 ment of pcre_exec() is non-zero, \A can never match. The
900 difference between \Z and \z is that \Z matches before a
901 newline that is the last character of the string as well as
902 at the end of the string, whereas \z matches only at the
903 end.
904
905
906
907 CIRCUMFLEX AND DOLLAR
908 Outside a character class, in the default matching mode, the
909 circumflex character is an assertion which is true only if
910 the current matching point is at the start of the subject
911 string. If the startoffset argument of pcre_exec() is non-
912 zero, circumflex can never match. Inside a character class,
913 circumflex has an entirely different meaning (see below).
914
915 Circumflex need not be the first character of the pattern if
916 a number of alternatives are involved, but it should be the
917 first thing in each alternative in which it appears if the
918 pattern is ever to match that branch. If all possible alter-
919 natives start with a circumflex, that is, if the pattern is
920 constrained to match only at the start of the subject, it is
921 said to be an "anchored" pattern. (There are also other con-
922 structs that can cause a pattern to be anchored.)
923
924 A dollar character is an assertion which is true only if the
925 current matching point is at the end of the subject string,
926 or immediately before a newline character that is the last
927 character in the string (by default). Dollar need not be the
928 last character of the pattern if a number of alternatives
929 are involved, but it should be the last item in any branch
930 in which it appears. Dollar has no special meaning in a
931 character class.
932
933 The meaning of dollar can be changed so that it matches only
934 at the very end of the string, by setting the
935 PCRE_DOLLAR_ENDONLY option at compile or matching time. This
936 does not affect the \Z assertion.
937
938 The meanings of the circumflex and dollar characters are
939 changed if the PCRE_MULTILINE option is set. When this is
940 the case, they match immediately after and immediately
941 before an internal "\n" character, respectively, in addition
942 to matching at the start and end of the subject string. For
943 example, the pattern /^abc$/ matches the subject string
944 "def\nabc" in multiline mode, but not otherwise. Conse-
945 quently, patterns that are anchored in single line mode
946 because all branches start with "^" are not anchored in mul-
947 tiline mode, and a match for circumflex is possible when the
948 startoffset argument of pcre_exec() is non-zero. The
949 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
950 set.
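
The /^abc$/ example can be reproduced with Python's re module,
whose re.MULTILINE flag is assumed here to behave like
PCRE_MULTILINE for this case. The variable names are invented
for the illustration, and this is not PCRE itself.

```python
import re

SUBJECT = "def\nabc"

# Default mode: "^" matches only at the very start of the subject.
single_line = re.search(r"^abc$", SUBJECT)

# Multiline mode: "^" also matches just after the internal newline.
multi_line = re.search(r"^abc$", SUBJECT, re.MULTILINE)

# By default "$" also matches just before a final newline.
before_final_newline = re.search(r"abc$", "abc\n")
```

In this sketch single_line is None, while multi_line is a match
that starts after the newline.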
951
952 Note that the sequences \A, \Z, and \z can be used to match
953 the start and end of the subject in both modes, and if all
954 branches of a pattern start with \A it is always anchored,
955 whether PCRE_MULTILINE is set or not.
956
957
958
959 FULL STOP (PERIOD, DOT)
960 Outside a character class, a dot in the pattern matches any
961 one character in the subject, including a non-printing char-
962 acter, but not (by default) newline. If the PCRE_DOTALL
963 option is set, then dots match newlines as well. The han-
964 dling of dot is entirely independent of the handling of cir-
965 cumflex and dollar, the only relationship being that they
966 both involve newline characters. Dot has no special meaning
967 in a character class.
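
A small Python sketch of the dot's behaviour follows. It
assumes that the re.DOTALL flag of Python's re module mirrors
PCRE_DOTALL; the helper dot_matches() is hypothetical.

```python
import re


def dot_matches(subject, dotall=False):
    """Return True if "a.c" matches the whole subject."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(r"a.c", subject, flags) is not None
```

A non-printing character such as a binary zero is accepted in
either mode; only newline depends on the option.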
968
969
970
971 SQUARE BRACKETS
972 An opening square bracket introduces a character class, ter-
973 minated by a closing square bracket. A closing square
974 bracket on its own is not special. If a closing square
975 bracket is required as a member of the class, it should be
976 the first data character in the class (after an initial cir-
977 cumflex, if present) or escaped with a backslash.
978
979 A character class matches a single character in the subject;
980 the character must be in the set of characters defined by
981 the class, unless the first character in the class is a cir-
982 cumflex, in which case the subject character must not be in
983 the set defined by the class. If a circumflex is actually
984 required as a member of the class, ensure it is not the
985 first character, or escape it with a backslash.
986
987 For example, the character class [aeiou] matches any lower
988 case vowel, while [^aeiou] matches any character that is not
989 a lower case vowel. Note that a circumflex is just a con-
990 venient notation for specifying the characters which are in
991 the class by enumerating those that are not. It is not an
992 assertion: it still consumes a character from the subject
993 string, and fails if the current pointer is at the end of
994 the string.
995
996 When caseless matching is set, any letters in a class
997 represent both their upper case and lower case versions, so
998 for example, a caseless [aeiou] matches "A" as well as "a",
999 and a caseless [^aeiou] does not match "A", whereas a case-
1000 ful version would.
1001
1002 The newline character is never treated in any special way in
1003 character classes, whatever the setting of the PCRE_DOTALL
1004 or PCRE_MULTILINE options is. A class such as [^a] will
1005 always match a newline.
1006
1007 The minus (hyphen) character can be used to specify a range
1008 of characters in a character class. For example, [d-m]
1009 matches any letter between d and m, inclusive. If a minus
1010 character is required in a class, it must be escaped with a
1011 backslash or appear in a position where it cannot be inter-
1012 preted as indicating a range, typically as the first or last
1013 character in the class.
1014
1015 It is not possible to have the literal character "]" as the
1016 end character of a range. A pattern such as [W-]46] is
1017 interpreted as a class of two characters ("W" and "-") fol-
1018 lowed by a literal string "46]", so it would match "W46]" or
1019 "-46]". However, if the "]" is escaped with a backslash it
1020 is interpreted as the end of range, so [W-\]46] is inter-
1021 preted as a single class containing a range followed by two
1022 separate characters. The octal or hexadecimal representation
1023 of "]" can also be used to end a range.
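
The two readings of the "]" character can be compared in a
short Python sketch. Python's re module is used as a stand-in
on the assumption that it parses these two classes in the same
way as PCRE; the variable names are invented.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")

# Single class: the range W to "]" plus the characters "4" and "6".
range_class = re.compile(r"[W-\]46]")

# The same range with "]" given by its hexadecimal code.
hex_range_class = re.compile(r"[W-\x5d46]")
```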
1024
1025 Ranges operate in ASCII collating sequence. They can also be
1026 used for characters specified numerically, for example
1027 [\000-\037]. If a range that includes letters is used when
1028 caseless matching is set, it matches the letters in either
1029 case. For example, [W-c] is equivalent to [][\\^_`wxyzabc],
1030 matched caselessly, and if character tables for the "fr"
1031 locale are in use, [\xc8-\xcb] matches accented E characters
1032 in both cases.
1033
1034 The character types \d, \D, \s, \S, \w, and \W may also
1035 appear in a character class, and add the characters that
1036 they match to the class. For example, [\dABCDEF] matches any
1037 hexadecimal digit. A circumflex can conveniently be used
1038 with the upper case character types to specify a more res-
1039 tricted set of characters than the matching lower case type.
1040 For example, the class [^\W_] matches any letter or digit,
1041 but not underscore.
1042
1043 All non-alphameric characters other than \, -, ^ (at the
1044 start) and the terminating ] are non-special in character
1045 classes, but it does no harm if they are escaped.
1046
1047
1048
1049 VERTICAL BAR
1050 Vertical bar characters are used to separate alternative
1051 patterns. For example, the pattern
1052
1053 gilbert|sullivan
1054
1055 matches either "gilbert" or "sullivan". Any number of alter-
1056 natives may appear, and an empty alternative is permitted
1057 (matching the empty string). The matching process tries
1058 each alternative in turn, from left to right, and the first
1059 one that succeeds is used. If the alternatives are within a
1060 subpattern (defined below), "succeeds" means matching the
1061 rest of the main pattern as well as the alternative in the
1062 subpattern.
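
The left-to-right rule can be illustrated with Python's re
module, which is assumed to try alternatives in the same order
as PCRE. The helper first_alternative() is hypothetical.

```python
import re


def first_alternative(pattern, subject):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group() if m else None
```

With the pattern a|ab the first branch wins even though the
second would match more. Inside a subpattern, as in (a|ab)c,
the first branch is given up because the rest of the pattern
cannot match after it.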
1063
1064
1065
1066 INTERNAL OPTION SETTING
1067 The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1068 and PCRE_EXTENDED can be changed from within the pattern by
1069 a sequence of Perl option letters enclosed between "(?" and
1070 ")". The option letters are
1071
1072 i for PCRE_CASELESS
1073 m for PCRE_MULTILINE
1074 s for PCRE_DOTALL
1075 x for PCRE_EXTENDED
1076
1077 For example, (?im) sets caseless, multiline matching. It is
1078 also possible to unset these options by preceding the letter
1079 with a hyphen, and a combined setting and unsetting such as
1080 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1081 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1082 If a letter appears both before and after the hyphen, the
1083 option is unset.
1084
1085 The scope of these option changes depends on where in the
1086 pattern the setting occurs. For settings that are outside
1087 any subpattern (defined below), the effect is the same as if
1088 the options were set or unset at the start of matching. The
1089 following patterns all behave in exactly the same way:
1090
1091 (?i)abc
1092 a(?i)bc
1093 ab(?i)c
1094 abc(?i)
1095
1096 which in turn is the same as compiling the pattern abc with
1097 PCRE_CASELESS set. In other words, such "top level" set-
1098 tings apply to the whole pattern (unless there are other
1099 changes inside subpatterns). If there is more than one set-
1100 ting of the same option at top level, the rightmost setting
1101 is used.
1102
1103 If an option change occurs inside a subpattern, the effect
1104 is different. This is a change of behaviour in Perl 5.005.
1105 An option change inside a subpattern affects only that part
1106 of the subpattern that follows it, so
1107
1108 (a(?i)b)c
1109
1110 matches abc and aBc and no other strings (assuming
1111 PCRE_CASELESS is not used). By this means, options can be
1112 made to have different settings in different parts of the
1113 pattern. Any changes made in one alternative do carry on
1114 into subsequent branches within the same subpattern. For
1115 example,
1116
1117 (a(?i)b|c)
1118
1119 matches "ab", "aB", "c", and "C", even though when matching
1120 "C" the first branch is abandoned before the option setting.
1121 This is because the effects of option settings happen at
1122 compile time. There would be some very weird behaviour oth-
1123 erwise.
1124
1125 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1126 be changed in the same way as the Perl-compatible options by
1127 using the characters U and X respectively. The (?X) flag
1128 setting is special in that it must always occur earlier in
1129 the pattern than any of the additional features it turns on,
1130 even when it is at top level. It is best put at the start.
1131
1132
1133
1134 SUBPATTERNS
1135 Subpatterns are delimited by parentheses (round brackets),
1136 which can be nested. Marking part of a pattern as a subpat-
1137 tern does two things:
1138
1139 1. It localizes a set of alternatives. For example, the pat-
1140 tern
1141
1142 cat(aract|erpillar|)
1143
1144 matches one of the words "cat", "cataract", or "caterpil-
1145 lar". Without the parentheses, it would match "cataract",
1146 "erpillar" or the empty string.
1147
1148 2. It sets up the subpattern as a capturing subpattern (as
1149 defined above). When the whole pattern matches, that por-
1150 tion of the subject string that matched the subpattern is
1151 passed back to the caller via the ovector argument of
1152 pcre_exec(). Opening parentheses are counted from left to
1153 right (starting from 1) to obtain the numbers of the captur-
1154 ing subpatterns.
1155
1156 For example, if the string "the red king" is matched against
1157 the pattern
1158
1159 the ((red|white) (king|queen))
1160
1161 the captured substrings are "red king", "red", and "king",
1162 and are numbered 1, 2, and 3.
1163
1164 The fact that plain parentheses fulfil two functions is not
1165 always helpful. There are often times when a grouping sub-
1166 pattern is required without a capturing requirement. If an
1167 opening parenthesis is followed by "?:", the subpattern does
1168 not do any capturing, and is not counted when computing the
1169 number of any subsequent capturing subpatterns. For example,
1170 if the string "the white queen" is matched against the pat-
1171 tern
1172
1173 the ((?:red|white) (king|queen))
1174
1175 the captured substrings are "white queen" and "queen", and
1176 are numbered 1 and 2. The maximum number of captured sub-
1177 strings is 99, and the maximum number of all subpatterns,
1178 both capturing and non-capturing, is 200.
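
The numbering of captured substrings can be seen in the
following Python sketch. Python's re module is a stand-in here
(an assumption); with PCRE itself the same substrings would be
read from the ovector argument of pcre_exec().

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# The "?:" group is not counted, so only two substrings are captured.
non_capturing = re.match(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
```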
1179
1180 As a convenient shorthand, if any option settings are
1181 required at the start of a non-capturing subpattern, the
1182 option letters may appear between the "?" and the ":". Thus
1183 the two patterns
1184
1185 (?i:saturday|sunday)
1186 (?:(?i)saturday|sunday)
1187
1188 match exactly the same set of strings. Because alternative
1189 branches are tried from left to right, and options are not
1190 reset until the end of the subpattern is reached, an option
1191 setting in one branch does affect subsequent branches, so
1192 the above patterns match "SUNDAY" as well as "Saturday".
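
A Python sketch of the shorthand form follows. Only the
(?i:...) spelling is shown, because recent versions of Python's
re module (used here as a stand-in, which is an assumption) do
not accept an option setting in the middle of a pattern. The
names DAY and SCOPED are invented for the example.

```python
import re

# The caseless option applies to both branches of the group.
DAY = re.compile(r"(?i:saturday|sunday)")

# The option is reset at the end of the group, so "b" stays caseful.
SCOPED = re.compile(r"(?i:a)b")
```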
1193
1194
1195
1196 REPETITION
1197 Repetition is specified by quantifiers, which can follow any
1198 of the following items:
1199
1200
1201 a single character, possibly escaped
1202 the . metacharacter
1203 a character class
1204 a back reference (see next section)
1205 a parenthesized subpattern (unless it is an assertion -
1206 see below)
1207
1208 The general repetition quantifier specifies a minimum and
1209 maximum number of permitted matches, by giving the two
1210 numbers in curly brackets (braces), separated by a comma.
1211 The numbers must be less than 65536, and the first must be
1212 less than or equal to the second. For example:
1213
1214 z{2,4}
1215
1216 matches "zz", "zzz", or "zzzz". A closing brace on its own
1217 is not a special character. If the second number is omitted,
1218 but the comma is present, there is no upper limit; if the
1219 second number and the comma are both omitted, the quantifier
1220 specifies an exact number of required matches. Thus
1221
1222 [aeiou]{3,}
1223
1224 matches at least 3 successive vowels, but may match many
1225 more, while
1226
1227 \d{8}
1228
1229 matches exactly 8 digits. An opening curly bracket that
1230 appears in a position where a quantifier is not allowed, or
1231 one that does not match the syntax of a quantifier, is taken
1232 as a literal character. For example, {,6} is not a quantif-
1233 ier, but a literal string of four characters.
1234
1235 The quantifier {0} is permitted, causing the expression to
1236 behave as if the previous item and the quantifier were not
1237 present.
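
The quantifier forms described so far can be tried out with the
Python sketch below. Python's re module is assumed to agree
with PCRE for these particular patterns; the helper full() is
hypothetical.

```python
import re


def full(pattern, subject):
    """Return True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None
```

For example, full(r"z{2,4}", "zzz") is true, whereas a single
"z" or five of them are rejected.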
1238
1239 For convenience (and historical compatibility) the three
1240 most common quantifiers have single-character abbreviations:
1241
1242 * is equivalent to {0,}
1243 + is equivalent to {1,}
1244 ? is equivalent to {0,1}
1245
1246 It is possible to construct infinite loops by following a
1247 subpattern that can match no characters with a quantifier
1248 that has no upper limit, for example:
1249
1250 (a?)*
1251
1252 Earlier versions of Perl and PCRE used to give an error at
1253 compile time for such patterns. However, because there are
1254 cases where this can be useful, such patterns are now
1255 accepted, but if any repetition of the subpattern does in
1256 fact match no characters, the loop is forcibly broken.
1257
1258 By default, the quantifiers are "greedy", that is, they
1259 match as much as possible (up to the maximum number of per-
1260 mitted times), without causing the rest of the pattern to
1261 fail. The classic example of where this gives problems is in
1262 trying to match comments in C programs. These appear between
1263 the sequences /* and */ and within the sequence, individual
1264 * and / characters may appear. An attempt to match C com-
1265 ments by applying the pattern
1266
1267 /\*.*\*/
1268
1269 to the string
1270
1271 /* first comment */ not comment /* second comment */
1272
1273 fails, because it matches the entire string due to the
1274 greediness of the .* item.
1275
1276 However, if a quantifier is followed by a question mark,
1277 then it ceases to be greedy, and instead matches the minimum
1278 number of times possible, so the pattern
1279
1280 /\*.*?\*/
1281
1282 does the right thing with the C comments. The meaning of the
1283 various quantifiers is not otherwise changed, just the pre-
1284 ferred number of matches. Do not confuse this use of ques-
1285 tion mark with its use as a quantifier in its own right.
1286 Because it has two uses, it can sometimes appear doubled, as
1287 in
1288
1289 \d??\d
1290
1291 which matches one digit by preference, but can match two if
1292 that is the only way the rest of the pattern matches.
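
The difference between greedy and minimizing quantifiers can be
sketched in Python, whose re module is assumed to treat these
patterns as PCRE does. The subject string and variable names
are invented for the illustration.

```python
import re

SUBJECT = "/* a */ x /* b */"

# Greedy: ".*" runs on to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", SUBJECT).group()

# Minimizing: ".*?" stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", SUBJECT).group()

# Doubled question mark: one digit by preference...
one_digit = re.match(r"\d??\d", "12").group()

# ...but two when the rest of the match requires it.
two_digits = re.fullmatch(r"\d??\d", "12").group()
```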
1293
1294 If the PCRE_UNGREEDY option is set (an option which is not
1295 available in Perl) then the quantifiers are not greedy by
1296 default, but individual ones can be made greedy by following
1297 them with a question mark. In other words, it inverts the
1298 default behaviour.
1299
1300 When a parenthesized subpattern is quantified with a minimum
1301 repeat count that is greater than 1 or with a limited max-
1302 imum, more store is required for the compiled pattern, in
1303 proportion to the size of the minimum or maximum.
1304
1305 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1306 option (equivalent to Perl's /s) is set, thus allowing the .
1307 to match newlines, then the pattern is implicitly anchored,
1308 because whatever follows will be tried against every charac-
1309 ter position in the subject string, so there is no point in
1310 retrying the overall match at any position after the first.
1311 PCRE treats such a pattern as though it were preceded by \A.
1312 In cases where it is known that the subject string contains
1313 no newlines, it is worth setting PCRE_DOTALL when the pat-
1314 tern begins with .* in order to obtain this optimization, or
1315 alternatively using ^ to indicate anchoring explicitly.
1316
1317 When a capturing subpattern is repeated, the value captured
1318 is the substring that matched the final iteration. For exam-
1319 ple, after
1320
1321 (tweedle[dume]{3}\s*)+
1322
1323 has matched "tweedledum tweedledee" the value of the cap-
1324 tured substring is "tweedledee". However, if there are
1325 nested capturing subpatterns, the corresponding captured
1326 values may have been set in previous iterations. For exam-
1327 ple, after
1328
1329 /(a|(b))+/
1330
1331 matches "aba" the value of the second captured substring is
1332 "b".
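
Both examples can be checked with Python's re module, which is
assumed here to keep captured values across iterations in the
same way as PCRE. The variable names are invented.

```python
import re

# The captured value comes from the final iteration of the group.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

# The nested group keeps the value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba")
```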
1333
1334
1335
1336 BACK REFERENCES
1337 Outside a character class, a backslash followed by a digit
1338 greater than 0 (and possibly further digits) is a back
1339 reference to a capturing subpattern earlier (i.e. to its
1340 left) in the pattern, provided there have been that many
1341 previous capturing left parentheses.
1342
1343 However, if the decimal number following the backslash is
1344 less than 10, it is always taken as a back reference, and
1345 causes an error only if there are not that many capturing
1346 left parentheses in the entire pattern. In other words, the
1347 parentheses that are referenced need not be to the left of
1348 the reference for numbers less than 10. See the section
1349 entitled "Backslash" above for further details of the han-
1350 dling of digits following a backslash.
1351
1352 A back reference matches whatever actually matched the cap-
1353 turing subpattern in the current subject string, rather than
1354 anything matching the subpattern itself. So the pattern
1355
1356 (sens|respons)e and \1ibility
1357
1358 matches "sense and sensibility" and "response and responsi-
1359 bility", but not "sense and responsibility". If caseful
1360 matching is in force at the time of the back reference, then
1361 the case of letters is relevant. For example,
1362
1363 ((?i)rah)\s+\1
1364
1365 matches "rah rah" and "RAH RAH", but not "RAH rah", even
1366 though the original capturing subpattern is matched case-
1367 lessly.
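
A Python sketch of both back reference examples follows.
Python's re module is a stand-in (an assumption), and because
it does not accept an option setting inside a pattern, the
second pattern is written with the scoped form (?i:rah) rather
than (?i)rah. The names TITLE and CHEER are invented.

```python
import re

# \1 must repeat the text actually matched by the first group.
TITLE = re.compile(r"(sens|respons)e and \1ibility")

# The group is matched caselessly, but \1 is compared casefully.
CHEER = re.compile(r"((?i:rah))\s+\1")
```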
1368
1369 There may be more than one back reference to the same sub-
1370 pattern. If a subpattern has not actually been used in a
1371 particular match, then any back references to it always
1372 fail. For example, the pattern
1373
1374 (a|(bc))\2
1375
1376 always fails if it starts to match "a" rather than "bc".
1377 Because there may be up to 99 back references, all digits
1378 following the backslash are taken as part of a potential
1379 back reference number. If the pattern continues with a digit
1380 character, then some delimiter must be used to terminate the
1381 back reference. If the PCRE_EXTENDED option is set, this can
1382 be whitespace. Otherwise an empty comment can be used.
1383
1384 A back reference that occurs inside the parentheses to which
1385 it refers fails when the subpattern is first used, so, for
1386 example, (a\1) never matches. However, such references can
1387 be useful inside repeated subpatterns. For example, the pat-
1388 tern
1389
1390 (a|b\1)+
1391
1392 matches any number of "a"s and also "aba", "ababbaa" etc. At
1393 each iteration of the subpattern, the back reference matches
1394 the character string corresponding to the previous itera-
1395 tion. In order for this to work, the pattern must be such
1396 that the first iteration does not need to match the back
1397 reference. This can be done using alternation, as in the
1398 example above, or by a quantifier with a minimum of zero.
1399
1400
1401
1402 ASSERTIONS
1403 An assertion is a test on the characters following or
1404 preceding the current matching point that does not actually
1405 consume any characters. The simple assertions coded as \b,
1406 \B, \A, \Z, \z, ^ and $ are described above. More compli-
1407 cated assertions are coded as subpatterns. There are two
1408 kinds: those that look ahead of the current position in the
1409 subject string, and those that look behind it.

1410 An assertion subpattern is matched in the normal way, except
1411 that it does not cause the current matching position to be
1412 changed. Lookahead assertions start with (?= for positive
1413 assertions and (?! for negative assertions. For example,
1414
1415 \w+(?=;)
1416
1417 matches a word followed by a semicolon, but does not include
1418 the semicolon in the match, and
1419
1420 foo(?!bar)
1421
1422 matches any occurrence of "foo" that is not followed by
1423 "bar". Note that the apparently similar pattern
1424
1425 (?!foo)bar
1426
1427 does not find an occurrence of "bar" that is preceded by
1428 something other than "foo"; it finds any occurrence of "bar"
1429 whatsoever, because the assertion (?!foo) is always true
1430 when the next three characters are "bar". A lookbehind
1431 assertion is needed to achieve this effect.
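
The three lookahead patterns can be exercised with the Python
sketch below. Python's re module is assumed to handle lookahead
assertions in the same way as PCRE; the subject strings and
variable names are invented.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "count;")

# The first "foo" is followed by "bar", so the second one matches.
foo = re.search(r"foo(?!bar)", "foobar foobaz")

# (?!foo) looks ahead at "bar", so the preceding "foo" is no obstacle.
bar = re.search(r"(?!foo)bar", "foobar")
```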
1432
1433 Lookbehind assertions start with (?<= for positive asser-
1434 tions and (?<! for negative assertions. For example,
1435
1436 (?<!foo)bar
1437
1438 does find an occurrence of "bar" that is not preceded by
1439 "foo". The contents of a lookbehind assertion are restricted
1440 such that all the strings it matches must have a fixed
1441 length. However, if there are several alternatives, they do
1442 not all have to have the same fixed length. Thus
1443
1444 (?<=bullock|donkey)
1445
1446 is permitted, but
1447
1448 (?<!dogs?|cats?)
1449
1450 causes an error at compile time. Branches that match dif-
1451 ferent length strings are permitted only at the top level of
1452 a lookbehind assertion. This is an extension compared with
1453 Perl 5.005, which requires all branches to match the same
1454 length of string. An assertion such as
1455
1456 (?<=ab(c|de))
1457
1458 is not permitted, because its single top-level branch can
1459 match two different lengths, but it is acceptable if rewrit-
1460 ten to use two top-level branches:
1461
1462 (?<=abc|abde)
1463
1464 The implementation of lookbehind assertions is, for each
1465 alternative, to temporarily move the current position back
1466 by the fixed width and then try to match. If there are
1467 insufficient characters before the current position, the
1468 match is deemed to fail. Lookbehinds in conjunction with
1469 once-only subpatterns can be particularly useful for match-
1470 ing at the ends of strings; an example is given at the end
1471 of the section on once-only subpatterns.
1472
1473 Several assertions (of any sort) may occur in succession.
1474 For example,
1475
1476 (?<=\d{3})(?<!999)foo
1477
1478 matches "foo" preceded by three digits that are not "999".
1479 Notice that each of the assertions is applied independently
1480 at the same point in the subject string. First there is a
1481 check that the previous three characters are all digits,
1482 then there is a check that the same three characters are not
1483 "999". This pattern does not match "foo" preceded by six
1484 characters, the first three of which are digits and the last three
1485 of which are not "999". For example, it doesn't match
1486 "123abcfoo". A pattern to do that is
1487
1488 (?<=\d{3}...)(?<!999)foo
1489
1490 This time the first assertion looks at the preceding six
1491 characters, checking that the first three are digits, and
1492 then the second assertion checks that the preceding three
1493 characters are not "999".
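
The two patterns above can be compared side by side. The sketch below again assumes Python's re module as a stand-in for PCRE; the variable names and subject strings are illustrative only.

```python
import re

# Both assertions look at the same three characters before "foo".
successive = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters, the second only three.
with_gap = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          successive.search(subject) is not None,
          with_gap.search(subject) is not None)
```

"123foo" satisfies only the first pattern and "123abcfoo" only the second, while "999foo" and "123999foo" are rejected by both because the three characters before "foo" are "999".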
1494
1495 Assertions can be nested in any combination. For example,
1496
1497 (?<=(?<!foo)bar)baz
1498
1499 matches an occurrence of "baz" that is preceded by "bar"
1500 which in turn is not preceded by "foo", while
1501
1502 (?<=\d{3}(?!999)...)foo
1503
1504 is another pattern which matches "foo" preceded by three
1505 digits and any three characters that are not "999".
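
A short check of the two nested forms, sketched with Python's re module on the assumption that it treats these nested assertions the same way PCRE does. The names and subjects are hypothetical.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999";
# the negative lookahead runs inside the lookbehind, just after the digits.
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

for subject in ("barbaz", "foobarbaz"):
    print(subject, nested_behind.search(subject) is not None)

for subject in ("123abcfoo", "123999foo"):
    print(subject, ahead_in_behind.search(subject) is not None)
```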
1506
1507 Assertion subpatterns are not capturing subpatterns, and may
1508 not be repeated, because it makes no sense to assert the
1509 same thing several times. If any kind of assertion contains
1510 capturing subpatterns within it, these are counted for the
1511 purposes of numbering the capturing subpatterns in the whole
1512 pattern. However, substring capturing is carried out only
1513 for positive assertions, because it does not make sense for
1514 negative assertions.
1515
1516 Assertions count towards the maximum of 200 parenthesized
1517 subpatterns.
1518
1519
1520
1521 ONCE-ONLY SUBPATTERNS
1522 With both maximizing and minimizing repetition, failure of
1523 what follows normally causes the repeated item to be re-
1524 evaluated to see if a different number of repeats allows the
1525 rest of the pattern to match. Sometimes it is useful to
1526 prevent this, either to change the nature of the match, or
1527 to cause it to fail earlier than it otherwise might, when the
1528 author of the pattern knows there is no point in carrying
1529 on.
1530
1531 Consider, for example, the pattern \d+foo when applied to
1532 the subject line
1533
1534 123456bar
1535
1536 After matching all 6 digits and then failing to match "foo",
1537 the normal action of the matcher is to try again with only 5
1538 digits matching the \d+ item, and then with 4, and so on,
1539 before ultimately failing. Once-only subpatterns provide the
1540 means for specifying that once a portion of the pattern has
1541 matched, it is not to be re-evaluated in this way, so the
1542 matcher would give up immediately on failing to match "foo"
1543 the first time. The notation is another kind of special
1544 parenthesis, starting with (?> as in this example:
1545
1546 (?>\d+)foo
1547
1548 This kind of parenthesis "locks up" the part of the pattern
1549 it contains once it has matched, and a failure further into
1550 the pattern is prevented from backtracking into it. Back-
1551 tracking past it to previous items, however, works as nor-
1552 mal.
1553
1554 An alternative description is that a subpattern of this type
1555 matches the string of characters that an identical stan-
1556 dalone pattern would match, if anchored at the current point
1557 in the subject string.
1558
1559 Once-only subpatterns are not capturing subpatterns. Simple
1560 cases such as the above example can be thought of as a max-
1561 imizing repeat that must swallow everything it can. So,
1562 while both \d+ and \d+? are prepared to adjust the number of
1563 digits they match in order to make the rest of the pattern
1564 match, (?>\d+) can only match an entire sequence of digits.
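
The effect of refusing to give characters back can be seen in a case where it changes the result. The sketch below uses Python's re module, not PCRE. Python accepts (?>...) directly only from version 3.11, so the once-only group is emulated here with a lookahead that captures the digits followed by a backreference that consumes them. That emulation is a substitute technique chosen for this example, and the group name "run" is hypothetical.

```python
import re

# Ordinary greedy repeat: \d+ gives back its last digit so that "6" can match.
greedy = re.compile(r"\d+6")

# Emulated (?>\d+)6: the lookahead captures the whole run of digits and is
# not re-entered on failure; the backreference then consumes exactly that
# run, so nothing is left over for the final "6".
once_only = re.compile(r"(?=(?P<run>\d+))(?P=run)6")

# Emulated (?>\d+)foo: locking the digits does no harm when what follows
# is not a digit.
once_only_foo = re.compile(r"(?=(?P<run>\d+))(?P=run)foo")

print(greedy.search("123456"))            # a match object for "123456"
print(once_only.search("123456"))         # None
print(once_only_foo.search("123456foo"))  # a match object
```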
1565
1566 This construction can of course contain arbitrarily compli-
1567 cated subpatterns, and it can be nested.
1568
1569 Once-only subpatterns can be used in conjunction with look-
1570 behind assertions to specify efficient matching at the end
1571 of the subject string. Consider a simple pattern such as
1572
1573 abcd$
1574
1575 when applied to a long string which does not match it.
1576 Because matching proceeds from left to right, PCRE will look
1577 for each "a" in the subject and then see if what follows
1578 matches the rest of the pattern. If the pattern is specified
1579 as
1580
1581 ^.*abcd$
1582
1583 then the initial .* matches the entire string at first, but
1584 when this fails, it backtracks to match all but the last
1585 character, then all but the last two characters, and so on.
1586 Once again the search for "a" covers the entire string, from
1587 right to left, so we are no better off. However, if the pat-
1588 tern is written as
1589
1590 ^(?>.*)(?<=abcd)
1591
1592 then there can be no backtracking for the .* item; it can
1593 match only the entire string. The subsequent lookbehind
1594 assertion does a single test on the last four characters. If
1595 it fails, the match fails immediately. For long strings,
1596 this approach makes a significant difference to the process-
1597 ing time.
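
The three ways of writing the end-of-string test should agree on which subjects match, differing only in how much work the matcher does. The sketch below checks the agreement using Python's re module as a stand-in for PCRE. The once-only group (?>.*) is emulated with a capturing lookahead plus a backreference (a substitute technique, with the hypothetical group name "all"), and no timing behaviour of either engine is claimed.

```python
import re

plain = re.compile(r"abcd$")
anchored_greedy = re.compile(r"^.*abcd$")

# Emulated ^(?>.*)(?<=abcd): the lookahead grabs the whole line once, the
# backreference consumes it, and the lookbehind then makes a single test
# on the last four characters.
locked = re.compile(r"^(?=(?P<all>.*))(?P=all)(?<=abcd)")

for subject in ("xxabcd", "abcdxx", "abcd", ""):
    print(repr(subject),
          plain.search(subject) is not None,
          anchored_greedy.search(subject) is not None,
          locked.search(subject) is not None)
```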
1598
1599
1600
1601 CONDITIONAL SUBPATTERNS
1602 It is possible to cause the matching process to obey a sub-
1603 pattern conditionally or to choose between two alternative
1604 subpatterns, depending on the result of an assertion, or
1605 whether a previous capturing subpattern matched or not. The
1606 two possible forms of conditional subpattern are
1607
1608 (?(condition)yes-pattern)
1609 (?(condition)yes-pattern|no-pattern)
1610
1611 If the condition is satisfied, the yes-pattern is used; oth-
1612 erwise the no-pattern (if present) is used. If there are
1613 more than two alternatives in the subpattern, a compile-time
1614 error occurs.
1615
1616 There are two kinds of condition. If the text between the
1617 parentheses consists of a sequence of digits, then the
1618 condition is satisfied if the capturing subpattern of that
1619 number has previously matched. Consider the following pat-
1620 tern, which contains non-significant white space to make it
1621 more readable (assume the PCRE_EXTENDED option) and to
1622 divide it into three parts for ease of discussion:
1623
1624 ( \( )? [^()]+ (?(1) \) )
1625
1626 The first part matches an optional opening parenthesis, and
1627 if that character is present, sets it as the first captured
1628 substring. The second part matches one or more characters
1629 that are not parentheses. The third part is a conditional
1630 subpattern that tests whether the first set of parentheses
1631 matched or not. If they did, that is, if the subject started
1632 with an opening parenthesis, the condition is true, and so
1633 the yes-pattern is executed and a closing parenthesis is
1634 required. Otherwise, since no-pattern is not present, the
1635 subpattern matches nothing. In other words, this pattern
1636 matches a sequence of non-parentheses, optionally enclosed
1637 in parentheses.
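
The parenthesis example can be tried out as follows. This is a sketch using Python's re module, which accepts the same (?(1)...) syntax; re.VERBOSE plays the role of PCRE_EXTENDED here. It uses fullmatch so that the whole subject has to match, which is an addition for the sake of the demonstration and not part of the pattern itself.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, optional_parens.fullmatch(subject) is not None)
```

The first two subjects are accepted and the two with an unbalanced parenthesis are rejected.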
1638
1639 If the condition is not a sequence of digits, it must be an
1640 assertion. This may be a positive or negative lookahead or
1641 lookbehind assertion. Consider this pattern, again contain-
1642 ing non-significant white space, and with the two alterna-
1643 tives on the second line:
1644
1645 (?(?=[^a-z]*[a-z])
1646 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1647
1648 The condition is a positive lookahead assertion that matches
1649 an optional sequence of non-letters followed by a letter. In
1650 other words, it tests for the presence of at least one
1651 letter in the subject. If a letter is found, the subject is
1652 matched against the first alternative; otherwise it is
1653 matched against the second. This pattern matches strings in
1654 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1655 letters and dd are digits.
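
Python's re module accepts only a group number or name as a condition, so the assertion form cannot be written there directly. The sketch below shows the same logic with a substitute technique: an alternation in which the first branch is guarded by the positive lookahead and the second by its negation. It illustrates what the conditional does and is not PCRE syntax; the name date_like is hypothetical.

```python
import re

date_like = re.compile(
    r"""
      (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
    | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter ahead: dd-dd-dd
    """,
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-56x"):
    print(subject, date_like.search(subject) is not None)
```

The last subject fails because the trailing "x" makes the condition true, which selects the dd-aaa-dd alternative, and that alternative does not match.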
1656
1657
1658
1659 COMMENTS
1660 The sequence (?# marks the start of a comment which contin-
1661 ues up to the next closing parenthesis. Nested parentheses
1662 are not permitted. The characters that make up a comment
1663 play no part in the pattern matching at all.
1664
1665 If the PCRE_EXTENDED option is set, an unescaped # character
1666 outside a character class introduces a comment that contin-
1667 ues up to the next newline character in the pattern.
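
Both comment styles are shown in the sketch below, which uses Python's re module on the assumption that it treats them as described here; re.VERBOSE stands in for PCRE_EXTENDED.

```python
import re

# (?#...) comments are dropped wherever they appear in the pattern.
inline = re.compile(r"\d{2}(?#day)-[a-z]{3}(?#month)")

# In extended mode an unescaped # starts a comment that runs to the newline,
# except inside a character class, where it is an ordinary character.
extended = re.compile(
    r"""
    \d{2}      # two digits
    -
    [a-z]{3}   # three lower-case letters
    [#]        # a literal hash, because it is inside a class
    """,
    re.VERBOSE,
)

print(inline.fullmatch("12-abc") is not None)
print(extended.fullmatch("12-abc#") is not None)
```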
1668
1669
1670
1671 PERFORMANCE
1672 Certain items that may appear in patterns are more efficient
1673 than others. It is more efficient to use a character class
1674 like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1675 In general, the simplest construction that provides the
1676 required behaviour is usually the most efficient. Jeffrey
1677 Friedl's book contains a lot of discussion about optimizing
1678 regular expressions for efficient performance.
1679
1680 When a pattern begins with .* and the PCRE_DOTALL option is
1681 set, the pattern is implicitly anchored by PCRE, since it
1682 can match only at the start of a subject string. However, if
1683 PCRE_DOTALL is not set, PCRE cannot make this optimization,
1684 because the . metacharacter does not then match a newline,
1685 and if the subject string contains newlines, the pattern may
1686 match from the character immediately following one of them
1687 instead of from the very start. For example, the pattern
1688
1689 (.*) second
1690
1691 matches the subject "first\nand second" (where \n stands for
1692 a newline character) with the first captured substring being
1693 "and". In order to do this, PCRE has to retry the match
1694 starting after every newline in the subject.
1695
1696 If you are using such a pattern with subject strings that do
1697 not contain newlines, the best performance is obtained by
1698 setting PCRE_DOTALL, or starting the pattern with ^.* to
1699 indicate explicit anchoring. That saves PCRE from having to
1700 scan along the subject looking for a newline to restart at.
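
The "first\nand second" example can be reproduced with the sketch below. It uses Python's re module, where re.DOTALL corresponds to PCRE_DOTALL. It shows only which substrings are matched and makes no claim about the internal optimizations of either engine.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match has to start
# on the second line.
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL the dot crosses the newline and the match starts at offset 0.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

# An explicit ^ anchors at the very start; with a newline in the way and
# no DOTALL, the pattern cannot match this subject at all.
explicit_anchor = re.search(r"^.* second", subject)

print(repr(without_dotall.group(1)))
print(repr(with_dotall.group(1)))
print(explicit_anchor)
```

The last result shows why explicit anchoring is suggested only for subjects that do not contain newlines.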
1701
1702 Beware of patterns that contain nested indefinite repeats.
1703 These can take a long time to run when applied to a string
1704 that does not match. Consider the pattern fragment
1705
1706 (a+)*
1707
1708 This can match "aaaa" in 16 different ways, and this number
1709 increases very rapidly as the string gets longer. (The *
1710 repeat can match 0, 1, 2, 3, or 4 times, and for each of
1711 those cases other than 0 or 4, the + repeats can match different
1712 numbers of times.) When the remainder of the pattern is such
1713 that the entire match is going to fail, PCRE has in princi-
1714 ple to try every possible variation, and this can take an
1715 extremely long time.
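
The growth can be made concrete by counting the possibilities. The sketch below assumes that a "way" means one distinct sequence of + repeats, each taking at least one character and together covering some leading part of the string (possibly none of it). The function name ways is hypothetical, and the code models the counting only, not PCRE's matcher.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways(n):
    """Count how (a+)* can match at the start of a run of n "a" characters.

    Stopping here, with no further iteration of the * repeat, is one way.
    Otherwise the next a+ takes k characters (1 <= k <= n) and the count
    continues on the remaining n - k characters.
    """
    return 1 + sum(ways(n - k) for k in range(1, n + 1))


for n in (4, 10, 20):
    print(n, ways(n))
```

Under that assumption the count doubles with each additional character, which is why a failing match can become so slow on long strings.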
1716
1717 An optimization catches some of the simpler cases such
1718 as
1719
1720 (a+)*b
1721
1722 where a literal character follows. Before embarking on the
1723 standard matching procedure, PCRE checks that there is a "b"
1724 later in the subject string, and if there is not, it fails
1725 the match immediately. However, when there is no following
1726 literal this optimization cannot be used. You can see the
1727 difference by comparing the behaviour of
1728
1729 (a+)*\d
1730
1731 with the pattern above. The pattern ending in "b" fails almost
1732 instantly when applied to a whole line of "a" characters,
1733 whereas the one ending in \d takes an appreciable time with
1734 strings longer than about 20 characters.
1735
1736
1737
1738 AUTHOR
1739 Philip Hazel <ph10@cam.ac.uk>
1740 University Computing Service,
1741 New Museums Site,
1742 Cambridge CB2 3QG, England.
1743 Phone: +44 1223 334714
1744
1745 Last updated: 29 July 1999
1746 Copyright (c) 1997-1999 University of Cambridge.
