
Contents of /code/tags/pcre-4.1/doc/pcre.txt



Revision 51
Sat Feb 24 21:39:37 2007 UTC (7 years, 4 months ago) by nigel
Original Path: code/trunk/doc/pcre.txt
File MIME type: text/plain
File size: 93263 byte(s)
Load pcre-3.4 into code/trunk.

1 NAME
2 pcre - Perl-compatible regular expressions.
3
4
5
6 SYNOPSIS
7 #include <pcre.h>
8
9 pcre *pcre_compile(const char *pattern, int options,
10 const char **errptr, int *erroffset,
11 const unsigned char *tableptr);
12
13 pcre_extra *pcre_study(const pcre *code, int options,
14 const char **errptr);
15
16 int pcre_exec(const pcre *code, const pcre_extra *extra,
17 const char *subject, int length, int startoffset,
18 int options, int *ovector, int ovecsize);
19
20 int pcre_copy_substring(const char *subject, int *ovector,
21 int stringcount, int stringnumber, char *buffer,
22 int buffersize);
23
24 int pcre_get_substring(const char *subject, int *ovector,
25 int stringcount, int stringnumber,
26 const char **stringptr);
27
28 int pcre_get_substring_list(const char *subject,
29 int *ovector, int stringcount, const char ***listptr);
30
31 void pcre_free_substring(const char *stringptr);
32
33 void pcre_free_substring_list(const char **stringptr);
34
35 const unsigned char *pcre_maketables(void);
36
37 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
38 int what, void *where);
39
40 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
41
42 char *pcre_version(void);
43
44 void *(*pcre_malloc)(size_t);
45
46 void (*pcre_free)(void *);
47
48
49
50
51 DESCRIPTION
52 The PCRE library is a set of functions that implement regu-
53 lar expression pattern matching using the same syntax and
54 semantics as Perl 5, with just a few differences (see
56 below). The current implementation corresponds to Perl
57 5.005, with some additional features from later versions.
58 This includes some experimental, incomplete support for
59 UTF-8 encoded strings. Details of exactly what is and what
60 is not supported are given below.
61
62 PCRE has its own native API, which is described in this
63 document. There is also a set of wrapper functions that
64 correspond to the POSIX regular expression API. These are
65 described in the pcreposix documentation.
66
67 The native API function prototypes are defined in the header
68 file pcre.h, and on Unix systems the library itself is
69 called libpcre.a, so can be accessed by adding -lpcre to the
70 command for linking an application which calls it. The
71 header file defines the macros PCRE_MAJOR and PCRE_MINOR to
72 contain the major and minor release numbers for the library.
73 Applications can use these to include support for different
74 releases.
75
76 The functions pcre_compile(), pcre_study(), and pcre_exec()
77 are used for compiling and matching regular expressions.
78
79 The functions pcre_copy_substring(), pcre_get_substring(),
80 and pcre_get_substring_list() are convenience functions for
81 extracting captured substrings from a matched subject
82 string; pcre_free_substring() and pcre_free_substring_list()
83 are also provided, to free the memory used for extracted
84 strings.
85
86 The function pcre_maketables() is used (optionally) to build
87 a set of character tables in the current locale for passing
88 to pcre_compile().
89
90 The function pcre_fullinfo() is used to find out information
91 about a compiled pattern; pcre_info() is an obsolete version
92 which returns only some of the available information, but is
93 retained for backwards compatibility. The function
94 pcre_version() returns a pointer to a string containing the
95 version of PCRE and its date of release.
96
97 The global variables pcre_malloc and pcre_free initially
98 contain the entry points of the standard malloc() and free()
99 functions respectively. PCRE calls the memory management
100 functions via these variables, so a calling program can
101 replace them if it wishes to intercept the calls. This
102 should be done before calling any PCRE functions.
103
104
105
106 MULTI-THREADING
107 The PCRE functions can be used in multi-threading
108
109
110
111
112
113 SunOS 5.8 Last change: 2
114
115
116
117 applications, with the proviso that the memory management
118 functions pointed to by pcre_malloc and pcre_free are shared
119 by all threads.
120
121 The compiled form of a regular expression is not altered
122 during matching, so the same compiled pattern can safely be
123 used by several threads at once.
124
125
126
127 COMPILING A PATTERN
128 The function pcre_compile() is called to compile a pattern
129 into an internal form. The pattern is a C string terminated
130 by a binary zero, and is passed in the argument pattern. A
131 pointer to a single block of memory that is obtained via
132 pcre_malloc is returned. This contains the compiled code and
133 related data. The pcre type is defined for this for conveni-
134 ence, but in fact pcre is just a typedef for void, since the
135 contents of the block are not externally defined. It is up
136 to the caller to free the memory when it is no longer
137 required.
138
139 The size of a compiled pattern is roughly proportional to
140 the length of the pattern string, except that each character
141 class (other than those containing just a single character,
142 negated or not) requires 33 bytes, and repeat quantifiers
143 with a minimum greater than one or a bounded maximum cause
144 the relevant portions of the compiled pattern to be repli-
145 cated.
146
147 The options argument contains independent bits that affect
148 the compilation. It should be zero if no options are
149 required. Some of the options, in particular, those that are
150 compatible with Perl, can also be set and unset from within
151 the pattern (see the detailed description of regular expres-
152 sions below). For these options, the contents of the options
153 argument specifies their initial settings at the start of
154 compilation and execution. The PCRE_ANCHORED option can be
155 set at the time of matching as well as at compile time.
156
157 If errptr is NULL, pcre_compile() returns NULL immediately.
158 Otherwise, if compilation of a pattern fails, pcre_compile()
159 returns NULL, and sets the variable pointed to by errptr to
160 point to a textual error message. The offset from the start
161 of the pattern to the character where the error was
162 discovered is placed in the variable pointed to by
163 erroffset, which must not be NULL. If it is, an immediate
164 error is given.
165
166 If the final argument, tableptr, is NULL, PCRE uses a
167 default set of character tables which are built when it is
168 compiled, using the default C locale. Otherwise, tableptr
169 must be the result of a call to pcre_maketables(). See the
170 section on locale support below.
171
172 The following option bits are defined in the header file:
173
174 PCRE_ANCHORED
175
176 If this bit is set, the pattern is forced to be "anchored",
177 that is, it is constrained to match only at the start of the
178 string which is being searched (the "subject string"). This
179 effect can also be achieved by appropriate constructs in the
180 pattern itself, which is the only way to do it in Perl.
181
182 PCRE_CASELESS
183
184 If this bit is set, letters in the pattern match both upper
185 and lower case letters. It is equivalent to Perl's /i
186 option.
187
188 PCRE_DOLLAR_ENDONLY
189
190 If this bit is set, a dollar metacharacter in the pattern
191 matches only at the end of the subject string. Without this
192 option, a dollar also matches immediately before the final
193 character if it is a newline (but not before any other new-
194 lines). The PCRE_DOLLAR_ENDONLY option is ignored if
195 PCRE_MULTILINE is set. There is no equivalent to this option
196 in Perl.
197
198 PCRE_DOTALL
199
200 If this bit is set, a dot metacharacter in the pattern
201 matches all characters, including newlines. Without it, new-
202 lines are excluded. This option is equivalent to Perl's /s
203 option. A negative class such as [^a] always matches a new-
204 line character, independent of the setting of this option.
205
206 PCRE_EXTENDED
207
208 If this bit is set, whitespace data characters in the pat-
209 tern are totally ignored except when escaped or inside a
210 character class, and characters between an unescaped # out-
211 side a character class and the next newline character,
212 inclusive, are also ignored. This is equivalent to Perl's /x
213 option, and makes it possible to include comments inside
214 complicated patterns. Note, however, that this applies only
215 to data characters. Whitespace characters may never appear
216 within special character sequences in a pattern, for example
217 within the sequence (?( which introduces a conditional sub-
218 pattern.
219
220 PCRE_EXTRA
221
222 This option was invented in order to turn on additional
223 functionality of PCRE that is incompatible with Perl, but it
224 is currently of very little use. When set, any backslash in
225 a pattern that is followed by a letter that has no special
226 meaning causes an error, thus reserving these combinations
227 for future expansion. By default, as in Perl, a backslash
228 followed by a letter with no special meaning is treated as a
229 literal. There are at present no other features controlled
230 by this option. It can also be set by a (?X) option setting
231 within a pattern.
232
233 PCRE_MULTILINE
234
235 By default, PCRE treats the subject string as consisting of
236 a single "line" of characters (even if it actually contains
237 several newlines). The "start of line" metacharacter (^)
238 matches only at the start of the string, while the "end of
239 line" metacharacter ($) matches only at the end of the
240 string, or before a terminating newline (unless
241 PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
242
243 When PCRE_MULTILINE is set, the "start of line" and "end
244 of line" constructs match immediately following or immedi-
245 ately before any newline in the subject string, respec-
246 tively, as well as at the very start and end. This is
247 equivalent to Perl's /m option. If there are no "\n" charac-
248 ters in a subject string, or no occurrences of ^ or $ in a
249 pattern, setting PCRE_MULTILINE has no effect.
250
251 PCRE_UNGREEDY
252
253 This option inverts the "greediness" of the quantifiers so
254 that they are not greedy by default, but become greedy if
255 followed by "?". It is not compatible with Perl. It can also
256 be set by a (?U) option setting within the pattern.
257
258 PCRE_UTF8
259
260 This option causes PCRE to regard both the pattern and the
261 subject as strings of UTF-8 characters instead of just byte
262 strings. However, it is available only if PCRE has been
263 built to include UTF-8 support. If not, the use of this
264 option provokes an error. Support for UTF-8 is new, experi-
265 mental, and incomplete. Details of exactly what it entails
266 are given below.
267
268
269
270 STUDYING A PATTERN
271 When a pattern is going to be used several times, it is
272 worth spending more time analyzing it in order to speed up
273 the time taken for matching. The function pcre_study() takes
275 a pointer to a compiled pattern as its first argument, and
276 returns a pointer to a pcre_extra block (another void
277 typedef) containing additional information about the pat-
278 tern; this can be passed to pcre_exec(). If no additional
279 information is available, NULL is returned.
280
281 The second argument contains option bits. At present, no
282 options are defined for pcre_study(), and this argument
283 should always be zero.
284
285 The third argument for pcre_study() is a pointer to an error
286 message. If studying succeeds (even if no data is returned),
287 the variable it points to is set to NULL. Otherwise it
288 points to a textual error message.
289
290 At present, studying a pattern is useful only for non-
291 anchored patterns that do not have a single fixed starting
292 character. A bitmap of possible starting characters is
293 created.
294
295
296
297 LOCALE SUPPORT
298 PCRE handles caseless matching, and determines whether char-
299 acters are letters, digits, or whatever, by reference to a
300 set of tables. The library contains a default set of tables
301 which is created in the default C locale when PCRE is com-
302 piled. This is used when the final argument of
303 pcre_compile() is NULL, and is sufficient for many applica-
304 tions.
305
306 An alternative set of tables can, however, be supplied. Such
307 tables are built by calling the pcre_maketables() function,
308 which has no arguments, in the relevant locale. The result
309 can then be passed to pcre_compile() as often as necessary.
310 For example, to build and use tables that are appropriate
311 for the French locale (where accented characters with codes
312 greater than 128 are treated as letters), the following code
313 could be used:
314
315 setlocale(LC_CTYPE, "fr");
316 tables = pcre_maketables();
317 re = pcre_compile(..., tables);
318
319 The tables are built in memory that is obtained via
320 pcre_malloc. The pointer that is passed to pcre_compile is
321 saved with the compiled pattern, and the same tables are
322 used via this pointer by pcre_study() and pcre_exec(). Thus
323 for any single pattern, compilation, studying and matching
324 all happen in the same locale, but different patterns can be
325 compiled in different locales. It is the caller's responsi-
326 bility to ensure that the memory containing the tables
327 remains available for as long as it is needed.
328
329
330
331 INFORMATION ABOUT A PATTERN
332 The pcre_fullinfo() function returns information about a
333 compiled pattern. It replaces the obsolete pcre_info()
334 function, which is nevertheless retained for backwards
335 compatibility (and is documented below).
336
337 The first argument for pcre_fullinfo() is a pointer to the
338 compiled pattern. The second argument is the result of
339 pcre_study(), or NULL if the pattern was not studied. The
340 third argument specifies which piece of information is
341 required, while the fourth argument is a pointer to a vari-
342 able to receive the data. The yield of the function is zero
343 for success, or one of the following negative numbers:
344
345 PCRE_ERROR_NULL       the argument code was NULL
346                       the argument where was NULL
347 PCRE_ERROR_BADMAGIC   the "magic number" was not found
348 PCRE_ERROR_BADOPTION  the value of what was invalid
349
350 The possible values for the third argument are defined in
351 pcre.h, and are as follows:
352
353 PCRE_INFO_OPTIONS
354
355 Return a copy of the options with which the pattern was
356 compiled. The fourth argument should point to an unsigned long
357 int variable. These option bits are those specified in the
358 call to pcre_compile(), modified by any top-level option
359 settings within the pattern itself, and with the
360 PCRE_ANCHORED bit forcibly set if the form of the pattern
361 implies that it can match only at the start of a subject
362 string.
363
364 PCRE_INFO_SIZE
365
366 Return the size of the compiled pattern, that is, the value
367 that was passed as the argument to pcre_malloc() when PCRE
368 was getting memory in which to place the compiled data. The
369 fourth argument should point to a size_t variable.
370
371 PCRE_INFO_CAPTURECOUNT
372
373 Return the number of capturing subpatterns in the pattern.
374 The fourth argument should point to an int variable.
375
376 PCRE_INFO_BACKREFMAX
377
378 Return the number of the highest back reference in the
379 pattern. The fourth argument should point to an int vari-
380 able. Zero is returned if there are no back references.
381
382 PCRE_INFO_FIRSTCHAR
383
384 Return information about the first character of any matched
385 string, for a non-anchored pattern. If there is a fixed
386 first character, e.g. from a pattern such as
387 (cat|cow|coyote), it is returned in the integer pointed to
388 by where. Otherwise, if either
389
390 (a) the pattern was compiled with the PCRE_MULTILINE option,
391 and every branch starts with "^", or
392
393 (b) every branch of the pattern starts with ".*" and
394 PCRE_DOTALL is not set (if it were set, the pattern would be
395 anchored),
396
397 -1 is returned, indicating that the pattern matches only at
398 the start of a subject string or after any "\n" within the
399 string. Otherwise -2 is returned. For anchored patterns, -2
400 is returned.
401
402 PCRE_INFO_FIRSTTABLE
403
404 If the pattern was studied, and this resulted in the con-
405 struction of a 256-bit table indicating a fixed set of char-
406 acters for the first character in any matching string, a
407 pointer to the table is returned. Otherwise NULL is
408 returned. The fourth argument should point to an unsigned
409 char * variable.
410
411 PCRE_INFO_LASTLITERAL
412
413 For a non-anchored pattern, return the value of the right-
414 most literal character which must exist in any matched
415 string, other than at its start. The fourth argument should
416 point to an int variable. If there is no such character, or
417 if the pattern is anchored, -1 is returned. For example, for
418 the pattern /a\d+z\d+/ the returned value is 'z'.
419
420 The pcre_info() function is now obsolete because its inter-
421 face is too restrictive to return all the available data
422 about a compiled pattern. New programs should use
423 pcre_fullinfo() instead. The yield of pcre_info() is the
424 number of capturing subpatterns, or one of the following
425 negative numbers:
426
427 PCRE_ERROR_NULL      the argument code was NULL
428 PCRE_ERROR_BADMAGIC  the "magic number" was not found
429
430 If the optptr argument is not NULL, a copy of the options
431 with which the pattern was compiled is placed in the integer
432 it points to (see PCRE_INFO_OPTIONS above).
433
434 If the pattern is not anchored and the firstcharptr argument
435 is not NULL, it is used to pass back information about the
436 first character of any matched string (see
437 PCRE_INFO_FIRSTCHAR above).
438
439
440
441 MATCHING A PATTERN
442 The function pcre_exec() is called to match a subject string
443 against a pre-compiled pattern, which is passed in the code
444 argument. If the pattern has been studied, the result of the
445 study should be passed in the extra argument. Otherwise this
446 must be NULL.
447
448 The PCRE_ANCHORED option can be passed in the options argu-
449 ment, whose unused bits must be zero. However, if a pattern
450 was compiled with PCRE_ANCHORED, or turned out to be
451 anchored by virtue of its contents, it cannot be made
452 unanchored at matching time.
453
454 There are also three further options that can be set only at
455 matching time:
456
457 PCRE_NOTBOL
458
459 The first character of the string is not the beginning of a
460 line, so the circumflex metacharacter should not match
461 before it. Setting this without PCRE_MULTILINE (at compile
462 time) causes circumflex never to match.
463
464 PCRE_NOTEOL
465
466 The end of the string is not the end of a line, so the dol-
467 lar metacharacter should not match it nor (except in multi-
468 line mode) a newline immediately before it. Setting this
469 without PCRE_MULTILINE (at compile time) causes dollar never
470 to match.
471
472 PCRE_NOTEMPTY
473
474 An empty string is not considered to be a valid match if
475 this option is set. If there are alternatives in the pat-
476 tern, they are tried. If all the alternatives match the
477 empty string, the entire match fails. For example, if the
478 pattern
479
480 a?b?
481
482 is applied to a string not beginning with "a" or "b", it
483 matches the empty string at the start of the subject. With
484 PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
485 further into the string for occurrences of "a" or "b".
486
487 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
488 make a special case of a pattern match of the empty string
489 within its split() function, and when using the /g modifier.
490 It is possible to emulate Perl's behaviour after matching a
491 null string by first trying the match again at the same
492 offset with PCRE_NOTEMPTY set, and then if that fails by
493 advancing the starting offset (see below) and trying an
494 ordinary match again.
495
496 The subject string is passed as a pointer in subject, a
497 length in length, and a starting offset in startoffset.
498 Unlike the pattern string, it may contain binary zero char-
499 acters. When the starting offset is zero, the search for a
500 match starts at the beginning of the subject, and this is by
501 far the most common case.
502
503 A non-zero starting offset is useful when searching for
504 another match in the same subject by calling pcre_exec()
505 again after a previous success. Setting startoffset differs
506 from just passing over a shortened string and setting
507 PCRE_NOTBOL in the case of a pattern that begins with any
508 kind of lookbehind. For example, consider the pattern
509
510 \Biss\B
511
512 which finds occurrences of "iss" in the middle of words. (\B
513 matches only if the current position in the subject is not a
514 word boundary.) When applied to the string "Mississippi" the
515 first call to pcre_exec() finds the first occurrence. If
516 pcre_exec() is called again with just the remainder of the
517 subject, namely "issippi", it does not match, because \B is
518 always false at the start of the subject, which is deemed to
519 be a word boundary. However, if pcre_exec() is passed the
520 entire string again, but with startoffset set to 4, it finds
521 the second occurrence of "iss" because it is able to look
522 behind the starting point to discover that it is preceded by
523 a letter.
524
525 If a non-zero starting offset is passed when the pattern is
526 anchored, one attempt to match at the given offset is tried.
527 This can only succeed if the pattern does not require the
528 match to be at the start of the subject.
529
530 In general, a pattern matches a certain portion of the sub-
531 ject, and in addition, further substrings from the subject
532 may be picked out by parts of the pattern. Following the
533 usage in Jeffrey Friedl's book, this is called "capturing"
534 in what follows, and the phrase "capturing subpattern" is
535 used for a fragment of a pattern that picks out a substring.
536 PCRE supports several other kinds of parenthesized subpat-
537 tern that do not cause substrings to be captured.
538
539 Captured substrings are returned to the caller via a vector
540 of integer offsets whose address is passed in ovector. The
541 number of elements in the vector is passed in ovecsize. The
542 first two-thirds of the vector is used to pass back captured
543 substrings, each substring using a pair of integers. The
544 remaining third of the vector is used as workspace by
545 pcre_exec() while matching capturing subpatterns, and is not
546 available for passing back information. The length passed in
547 ovecsize should always be a multiple of three. If it is not,
548 it is rounded down.
549
550 When a match has been successful, information about captured
551 substrings is returned in pairs of integers, starting at the
552 beginning of ovector, and continuing up to two-thirds of its
553 length at the most. The first element of a pair is set to
554 the offset of the first character in a substring, and the
555 second is set to the offset of the first character after the
556 end of a substring. The first pair, ovector[0] and ovec-
557 tor[1], identify the portion of the subject string matched
558 by the entire pattern. The next pair is used for the first
559 capturing subpattern, and so on. The value returned by
560 pcre_exec() is the number of pairs that have been set. If
561 there are no capturing subpatterns, the return value from a
562 successful match is 1, indicating that just the first pair
563 of offsets has been set.
564
565 Some convenience functions are provided for extracting the
566 captured substrings as separate strings. These are described
567 in the following section.
568
569 It is possible for capturing subpattern number n+1 to
570 match some part of the subject when subpattern n has not
571 been used at all. For example, if the string "abc" is
572 matched against the pattern (a|(z))(bc) subpatterns 1 and 3
573 are matched, but 2 is not. When this happens, both offset
574 values corresponding to the unused subpattern are set to -1.
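
The layout of the offset pairs, including the -1 markers for an unused subpattern, can be walked with plain C. This sketch needs no PCRE calls; the function name print_captures is hypothetical. Because it prints with a length-limited %s, a substring containing a binary zero would be cut short on output.

```c
#include <stdio.h>

/* Print each captured substring described by the offset pairs that
   pcre_exec() filled in. `pairs` is its positive return value.
   Returns how many of the subpatterns were unset. */
static int print_captures(const char *subject, const int *ovector, int pairs) {
    int i, unset = 0;
    for (i = 0; i < pairs; i++) {
        int start = ovector[2 * i];      /* offset of first character */
        int end = ovector[2 * i + 1];    /* offset just past the last one */
        if (start < 0) {                 /* both values are -1 when unused */
            printf("%d: (unset)\n", i);
            unset++;
            continue;
        }
        printf("%d: %.*s\n", i, end - start, subject + start);
    }
    return unset;
}
```

For "abc" matched against (a|(z))(bc), the vector starts 0, 3, 0, 1, -1, -1, 1, 3 and the function prints "abc", "a", "(unset)" and "bc" for pairs 0 to 3.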
575
576 If a capturing subpattern is matched repeatedly, it is the
577 last portion of the string that it matched that gets
578 returned.
579
580 If the vector is too small to hold all the captured sub-
581 strings, it is used as far as possible (up to two-thirds of
582 its length), and the function returns a value of zero. In
583 particular, if the substring offsets are not of interest,
584 pcre_exec() may be called with ovector passed as NULL and
585 ovecsize as zero. However, if the pattern contains back
586 references and the ovector isn't big enough to remember the
587 related substrings, PCRE has to get additional memory for
588 use during matching. Thus it is usually advisable to supply
589 an ovector.
590
591 Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT, or the obsolete pcre_info(), reports how many
592 capturing subpatterns there are in a compiled pattern. The
593 smallest size for ovector that will allow for n captured
594 substrings in addition to the offsets of the substring
595 matched by the whole pattern is (n+1)*3.
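
The sizing arithmetic is small enough to write down directly. Both function names below are hypothetical helpers, not part of the PCRE API.

```c
#include <assert.h>

/* Smallest ovecsize that can report n capturing subpatterns in addition to
   the offsets of the whole match: (n+1)*3. */
static int ovector_size_for(int n_captures) {
    assert(n_captures >= 0);
    return (n_captures + 1) * 3;
}

/* Number of offset pairs an ovector of `ovecsize` ints can report. The size
   is rounded down to a multiple of three, two-thirds of which hold pairs
   (two ints each); the remaining third is workspace. */
static int reportable_pairs(int ovecsize) {
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For instance, a pattern with three capturing subpatterns needs an ovector of at least 12 ints, and a 32-element vector behaves like a 30-element one, reporting at most 10 pairs.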
596
597 If pcre_exec() fails, it returns a negative number. The fol-
598 lowing are defined in the header file:
599
600 PCRE_ERROR_NOMATCH (-1)
601
602 The subject string did not match the pattern.
603
604 PCRE_ERROR_NULL (-2)
605
606 Either code or subject was passed as NULL, or ovector was
607 NULL and ovecsize was not zero.
608
609 PCRE_ERROR_BADOPTION (-3)
610
611 An unrecognized bit was set in the options argument.
612
613 PCRE_ERROR_BADMAGIC (-4)
614
615 PCRE stores a 4-byte "magic number" at the start of the com-
616 piled code, to catch the case when it is passed a junk
617 pointer. This is the error it gives when the magic number
618 isn't present.
619
620 PCRE_ERROR_UNKNOWN_NODE (-5)
621
622 While running the pattern match, an unknown item was encoun-
623 tered in the compiled pattern. This error could be caused by
624 a bug in PCRE or by overwriting of the compiled pattern.
625
626 PCRE_ERROR_NOMEMORY (-6)
627
628 If a pattern contains back references, but the ovector that
629 is passed to pcre_exec() is not big enough to remember the
630 referenced substrings, PCRE gets a block of memory at the
631 start of matching to use for this purpose. If the call via
632 pcre_malloc() fails, this error is given. The memory is
633 freed at the end of matching.
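
One way a caller might turn these codes into diagnostics is sketched below. It uses the numeric values listed in this section with the macro names in comments, so that it stands alone; a real program would include pcre.h and use the PCRE_ERROR_* macros. Both function names are hypothetical.

```c
/* Describe a pcre_exec() return value using the codes listed above. */
static const char *exec_result_text(int rc) {
    if (rc > 0) return "matched";
    switch (rc) {
    case 0:  return "matched, but ovector was too small for all substrings";
    case -1: return "no match";                           /* PCRE_ERROR_NOMATCH */
    case -2: return "NULL argument";                      /* PCRE_ERROR_NULL */
    case -3: return "unrecognized option bit";            /* PCRE_ERROR_BADOPTION */
    case -4: return "bad magic number in compiled code";  /* PCRE_ERROR_BADMAGIC */
    case -5: return "unknown item in compiled pattern";   /* PCRE_ERROR_UNKNOWN_NODE */
    case -6: return "out of memory";                      /* PCRE_ERROR_NOMEMORY */
    default: return "unrecognized error code";
    }
}

/* A failed search is routine; any other negative code signals a real problem. */
static int is_hard_error(int rc) {
    return rc < -1;
}
```

Treating PCRE_ERROR_NOMATCH separately from the other negative values keeps an ordinary failed search from being reported as a fault.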
634
635
636
637 EXTRACTING CAPTURED SUBSTRINGS
638 Captured substrings can be accessed directly by using the
648 offsets returned by pcre_exec() in ovector. For convenience,
649 the functions pcre_copy_substring(), pcre_get_substring(),
650 and pcre_get_substring_list() are provided for extracting
651 captured substrings as new, separate, zero-terminated
652 strings. A substring that contains a binary zero is
653 correctly extracted and has a further zero added on the end,
654 but the result does not, of course, function as a C string.
655
656 The first three arguments are the same for all three func-
657 tions: subject is the subject string which has just been
658 successfully matched, ovector is a pointer to the vector of
659 integer offsets that was passed to pcre_exec(), and
660 stringcount is the number of substrings that were captured
661 by the match, including the substring that matched the
662 entire regular expression. This is the value returned by
663 pcre_exec if it is greater than zero. If pcre_exec()
664 returned zero, indicating that it ran out of space in ovec-
665 tor, the value passed as stringcount should be the size of
666 the vector divided by three.
667
668 The functions pcre_copy_substring() and pcre_get_substring()
669 extract a single substring, whose number is given as string-
670 number. A value of zero extracts the substring that matched
671 the entire pattern, while higher values extract the captured
672 substrings. For pcre_copy_substring(), the string is placed
673 in buffer, whose length is given by buffersize, while for
674 pcre_get_substring() a new block of memory is obtained via
675 pcre_malloc, and its address is returned via stringptr. The
676 yield of the function is the length of the string, not
677 including the terminating zero, or one of
678
679 PCRE_ERROR_NOMEMORY (-6)
680
681 The buffer was too small for pcre_copy_substring(), or the
682 attempt to get memory failed for pcre_get_substring().
683
684 PCRE_ERROR_NOSUBSTRING (-7)
685
686 There is no substring whose number is stringnumber.
687
688 The pcre_get_substring_list() function extracts all avail-
689 able substrings and builds a list of pointers to them. All
690 this is done in a single block of memory which is obtained
691 via pcre_malloc. The address of the memory block is returned
692 via listptr, which is also the start of the list of string
693 pointers. The end of the list is marked by a NULL pointer.
694 The yield of the function is zero if all went well, or
695
696 PCRE_ERROR_NOMEMORY (-6)
697
698 if the attempt to get the memory block failed.
699
700 When any of these functions encounters a substring that is
701 unset, which can happen when capturing subpattern number n+1
702 matches some part of the subject, but subpattern n has not
703 been used at all, they return an empty string. This can be
704 distinguished from a genuine zero-length substring by
705 inspecting the appropriate offset in ovector, which is nega-
706 tive for unset substrings.
707
708 The two convenience functions pcre_free_substring() and
709 pcre_free_substring_list() can be used to free the memory
710 returned by a previous call of pcre_get_substring() or
711 pcre_get_substring_list(), respectively. They do nothing
712 more than call the function pointed to by pcre_free, which
713 of course could be called directly from a C program. How-
714 ever, PCRE is used in some situations where it is linked via
715 a special interface to another programming language which
716 cannot use pcre_free directly; it is for these cases that
717 the functions are provided.
718
719
720
721 LIMITATIONS
722 There are some size limitations in PCRE but it is hoped that
723 they will never in practice be relevant. The maximum length
724 of a compiled pattern is 65539 (sic) bytes. All values in
725 repeating quantifiers must be less than 65536. The maximum
726 number of capturing subpatterns is 99. The maximum number
727 of all parenthesized subpatterns, including capturing sub-
728 patterns, assertions, and other types of subpattern, is 200.
729
730 The maximum length of a subject string is the largest posi-
731 tive number that an integer variable can hold. However, PCRE
732 uses recursion to handle subpatterns and indefinite repeti-
733 tion. This means that the available stack space may limit
734 the size of a subject string that can be processed by cer-
735 tain patterns.
736
737
738
739 DIFFERENCES FROM PERL
740 The differences described here are with respect to Perl
741 5.005.
742
743 1. By default, a whitespace character is any character that
744 the C library function isspace() recognizes, though it is
745 possible to compile PCRE with alternative character type
746 tables. Normally isspace() matches space, formfeed, newline,
747 carriage return, horizontal tab, and vertical tab. Perl 5 no
748 longer includes vertical tab in its set of whitespace char-
749 acters. The \v escape that was in the Perl documentation for
750 a long time was never in fact recognized. However, the char-
751 acter itself was treated as whitespace at least up to 5.002.
752 In 5.004 and 5.005 it does not match \s.
753
754 2. PCRE does not allow repeat quantifiers on lookahead
755 assertions. Perl permits them, but they do not mean what you
756 might think. For example, (?!a){3} does not assert that the
757 next three characters are not "a". It just asserts that the
758 next character is not "a" three times.
759
760 3. Capturing subpatterns that occur inside negative looka-
761 head assertions are counted, but their entries in the
762 offsets vector are never set. Perl sets its numerical vari-
763 ables from any such patterns that are matched before the
764 assertion fails to match something (thereby succeeding), but
765 only if the negative lookahead assertion contains just one
766 branch.
767
768 4. Though binary zero characters are supported in the sub-
769 ject string, they are not allowed in a pattern string
770 because it is passed as a normal C string, terminated by
771 zero. The escape sequence "\0" can be used in the pattern to
772 represent a binary zero.
773
774 5. The following Perl escape sequences are not supported:
775 \l, \u, \L, \U, \E, \Q. In fact these are implemented by
776 Perl's general string-handling and are not part of its pat-
777 tern matching engine.
778
779 6. The Perl \G assertion is not supported as it is not
780 relevant to single pattern matches.
781
782 7. Fairly obviously, PCRE does not support the (?{code}) and
783 (?p{code}) constructions. However, there is some experimen-
784 tal support for recursive patterns using the non-Perl item
785 (?R).
786
787 8. There are at the time of writing some oddities in Perl
788 5.005_02 concerned with the settings of captured strings
789 when part of a pattern is repeated. For example, matching
790 "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
791 "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
792 unset. However, if the pattern is changed to
793 /^(aa(b(b))?)+$/ then $2 (and $3) are set.
794
795 In Perl 5.004 $2 is set in both cases, and that is also true
796 of PCRE. If in the future Perl changes to a consistent state
797 that is different, PCRE may change to follow.
798
799 9. Another as yet unresolved discrepancy is that in Perl
800 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
801 "a", whereas in PCRE it does not. However, in both Perl and
802 PCRE /^(a)?a/ matched against "a" leaves $1 unset.
803
804 10. PCRE provides some extensions to the Perl regular
805 expression facilities:
806
807 (a) Although lookbehind assertions must match fixed length
808 strings, each alternative branch of a lookbehind assertion
809 can match a different length of string. Perl 5.005 requires
810 them all to have the same length.
811
812 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
813 set, the $ meta-character matches only at the very end of
814 the string.
815
816 (c) If PCRE_EXTRA is set, a backslash followed by a letter
817 with no special meaning is faulted.
818
819 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
820 tion quantifiers is inverted, that is, by default they are
821 not greedy, but if followed by a question mark they are.
822
823 (e) PCRE_ANCHORED can be used to force a pattern to be tried
824 only at the start of the subject.
825
826 (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
827 for pcre_exec() have no Perl equivalents.
828
829 (g) The (?R) construct allows for recursive pattern matching
830 (Perl 5.6 can do this using the (?p{code}) construct, which
831 PCRE cannot of course support.)
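
Item 4 above can be illustrated with a short sketch in standard C. It does not call PCRE; the names raw_zero, escaped_zero, and visible_pattern_length are hypothetical and exist only for this example. It shows why a literal zero byte cannot survive in a pattern passed as a C string, and how the escaped form is written in C source.

```c
#include <string.h>

/* A raw zero byte in the literal ends the C string early, so a function
   that receives the pattern as a zero-terminated string sees only "a". */
const char raw_zero[] = "a\0b";

/* Doubling the backslash in C source yields the four characters
   a \ 0 b, which is the escape sequence the pattern should contain. */
const char escaped_zero[] = "a\\0b";

/* Length of the pattern as a zero-terminated string consumer sees it. */
size_t visible_pattern_length(const char *pattern)
{
    return strlen(pattern);
}
```

Here visible_pattern_length(raw_zero) sees one character, while visible_pattern_length(escaped_zero) sees all four.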
832
833
834
835 REGULAR EXPRESSION DETAILS
836 The syntax and semantics of the regular expressions sup-
837 ported by PCRE are described below. Regular expressions are
838 also described in the Perl documentation and in a number of
839 other books, some of which have copious examples. Jeffrey
840 Friedl's "Mastering Regular Expressions", published by
841 O'Reilly (ISBN 1-56592-257), covers them in great detail.
842
843 The description here is intended as reference documentation.
844 The basic operation of PCRE is on strings of bytes. However,
845 there are the beginnings of some support for UTF-8 character
846 strings. To use this support you must configure PCRE to
847 include it, and then call pcre_compile() with the PCRE_UTF8
848 option. How this affects the pattern matching is described
849 in the final section of this document.
850
851 A regular expression is a pattern that is matched against a
852 subject string from left to right. Most characters stand for
853 themselves in a pattern, and match the corresponding charac-
854 ters in the subject. As a trivial example, the pattern
855
856 The quick brown fox
857
858 matches a portion of a subject string that is identical to
859 itself. The power of regular expressions comes from the
860 ability to include alternatives and repetitions in the pat-
861 tern. These are encoded in the pattern by the use of meta-
862 characters, which do not stand for themselves but instead
863 are interpreted in some special way.
864
865 There are two different sets of meta-characters: those that
866 are recognized anywhere in the pattern except within square
867 brackets, and those that are recognized in square brackets.
868 Outside square brackets, the meta-characters are as follows:
869
870 \      general escape character with several uses
871 ^      assert start of subject (or line, in multiline
872        mode)
873 $      assert end of subject (or line, in multiline mode)
874 .      match any character except newline (by default)
875 [      start character class definition
876 |      start of alternative branch
877 (      start subpattern
878 )      end subpattern
879 ?      extends the meaning of (
880        also 0 or 1 quantifier
881        also quantifier minimizer
882 *      0 or more quantifier
883 +      1 or more quantifier
884 {      start min/max quantifier
885
886 Part of a pattern that is in square brackets is called a
887 "character class". In a character class the only meta-
888 characters are:
889
890 \      general escape character
891 ^      negate the class, but only if the first character
892 -      indicates character range
893 ]      terminates the character class
894
895 The following sections describe the use of each of the
896 meta-characters.
897
898
899
900 BACKSLASH
901 The backslash character has several uses. Firstly, if it is
902 followed by a non-alphameric character, it takes away any
903 special meaning that character may have. This use of
904 backslash as an escape character applies both inside and
905 outside character classes.
906
907 For example, if you want to match a "*" character, you write
908 "\*" in the pattern. This applies whether or not the follow-
909 ing character would otherwise be interpreted as a meta-
910 character, so it is always safe to precede a non-alphameric
911 with "\" to specify that it stands for itself. In particu-
912 lar, if you want to match a backslash, you write "\\".
913
914 If a pattern is compiled with the PCRE_EXTENDED option, whi-
915 tespace in the pattern (other than in a character class) and
916 characters between a "#" outside a character class and the
917 next newline character are ignored. An escaping backslash
918 can be used to include a whitespace or "#" character as part
919 of the pattern.
920
921 A second use of backslash provides a way of encoding non-
922 printing characters in patterns in a visible manner. There
923 is no restriction on the appearance of non-printing charac-
924 ters, apart from the binary zero that terminates a pattern,
925 but when a pattern is being prepared by text editing, it is
926 usually easier to use one of the following escape sequences
927 than the binary character it represents:
928
929 \a     alarm, that is, the BEL character (hex 07)
930 \cx    "control-x", where x is any character
931 \e     escape (hex 1B)
932 \f     formfeed (hex 0C)
933 \n     newline (hex 0A)
934 \r     carriage return (hex 0D)
935 \t     tab (hex 09)
936 \xhh   character with hex code hh
937 \ddd   character with octal code ddd, or backreference
938
939 The precise effect of "\cx" is as follows: if "x" is a lower
940 case letter, it is converted to upper case. Then bit 6 of
941 the character (hex 40) is inverted. Thus "\cz" becomes hex
942 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
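
The "\cx" rule can be written out as a small sketch in standard C. This is not PCRE's own code; the function name control_escape_value is hypothetical, and the sketch assumes ASCII character codes and the default C locale.

```c
#include <ctype.h>

/* Value produced by "\cx": fold a lower case letter to upper case,
   then invert bit 6 (hex 40) of the character. */
unsigned char control_escape_value(unsigned char x)
{
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```

Under those assumptions, control_escape_value('z') gives hex 1A, and the "{" and ";" cases swap as described above.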
943
944 After "\x", up to two hexadecimal digits are read (letters
945 can be in upper or lower case).
946
947 After "\0" up to two further octal digits are read. In both
948 cases, if there are fewer than two digits, just those that
949 are present are used. Thus the sequence "\0\x\07" specifies
950 two binary zeros followed by a BEL character. Make sure you
951 supply two digits after the initial zero if the character
952 that follows is itself an octal digit.
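
The digit-reading rules for "\x" and "\0" might be sketched as follows in standard C. The function names are hypothetical, the sketch assumes ASCII, and it deals only with these two escapes; it is not how PCRE itself is organized.

```c
#include <ctype.h>
#include <stddef.h>

/* Numeric value of one hexadecimal digit; the caller has already
   checked the character with isxdigit(). Assumes ASCII. */
static int hex_digit_value(int c)
{
    if (isdigit(c))
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Decode the "\xhh" and "\0dd" escapes in src into out and return the
   number of bytes produced (at most outsize). Any other character,
   including a backslash that starts some other escape, is copied
   unchanged. */
size_t decode_numeric_escapes(const char *src, unsigned char *out,
                              size_t outsize)
{
    size_t n = 0;

    while (*src != '\0' && n < outsize) {
        if (src[0] == '\\' && (src[1] == 'x' || src[1] == '0')) {
            int is_hex = (src[1] == 'x');
            unsigned value = 0;
            int digits = 0;

            src += 2;
            /* Read at most two digits; fewer are used as they stand. */
            while (digits < 2) {
                unsigned char d = (unsigned char)*src;
                if (is_hex && isxdigit(d))
                    value = value * 16 + (unsigned)hex_digit_value(d);
                else if (!is_hex && d >= '0' && d <= '7')
                    value = value * 8 + (unsigned)(d - '0');
                else
                    break;
                src++;
                digits++;
            }
            out[n++] = (unsigned char)value;
        } else {
            out[n++] = (unsigned char)*src++;
        }
    }
    return n;
}
```

Decoding the seven characters of "\0\x\07" yields the bytes 0, 0, 7. Decoding "\0101" yields hex 08 followed by a literal "1", which is why two digits should be supplied after the initial zero when an octal digit follows.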
953
954 The handling of a backslash followed by a digit other than 0
955 is complicated. Outside a character class, PCRE reads it
956 and any following digits as a decimal number. If the number
957 is less than 10, or if there have been at least that many
958 previous capturing left parentheses in the expression, the
959 entire sequence is taken as a back reference. A description
960 of how this works is given later, following the discussion
961 of parenthesized subpatterns.
962
963 Inside a character class, or if the decimal number is
964 greater than 9 and there have not been that many capturing
965 subpatterns, PCRE re-reads up to three octal digits follow-
966 ing the backslash, and generates a single byte from the
967 least significant 8 bits of the value. Any subsequent digits
968 stand for themselves. For example:
969
970 \040   is another way of writing a space
971 \40    is the same, provided there are fewer than 40
972        previous capturing subpatterns
973 \7     is always a back reference
974 \11    might be a back reference, or another way of
975        writing a tab
976 \011   is always a tab
977 \0113  is a tab followed by the character "3"
978 \113   is the character with octal code 113 (since there
979        can be no more than 99 back references)
980 \377   is a byte consisting entirely of 1 bits
981 \81    is either a back reference, or a binary zero
982        followed by the two characters "8" and "1"
983
984 Note that octal values of 100 or greater must not be intro-
985 duced by a leading zero, because no more than three octal
986 digits are ever read.
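
The decision between a back reference and an octal byte, for a backslash followed by a non-zero digit outside a character class, can be sketched in standard C as below. The type digit_escape and the function interpret_digit_escape are hypothetical names for this illustration, and the sketch follows only the rules stated in this section.

```c
#include <ctype.h>
#include <stddef.h>

/* Result of interpreting "\" followed by a non-zero digit. */
typedef struct {
    int is_backref;   /* 1: back reference, 0: single byte value */
    int value;        /* group number, or byte value 0..255 */
    size_t consumed;  /* digits used; any others stand for themselves */
} digit_escape;

/* digits points just past the backslash; groups_so_far is the number of
   capturing left parentheses seen earlier in the pattern. */
digit_escape interpret_digit_escape(const char *digits, int groups_so_far)
{
    digit_escape r = {0, 0, 0};
    long number = 0;
    size_t n = 0;

    /* First reading: all following digits as one decimal number. The
       value stops growing once it is far above any possible group
       count, which avoids overflow on long digit strings. */
    while (isdigit((unsigned char)digits[n])) {
        if (number < 100000)
            number = number * 10 + (digits[n] - '0');
        n++;
    }
    if (n > 0 && (number < 10 || number <= groups_so_far)) {
        r.is_backref = 1;
        r.value = (int)number;
        r.consumed = n;
        return r;
    }

    /* Second reading: up to three octal digits, keeping the least
       significant 8 bits of the value. */
    n = 0;
    while (n < 3 && digits[n] >= '0' && digits[n] <= '7') {
        r.value = r.value * 8 + (digits[n] - '0');
        n++;
    }
    r.value &= 0xFF;
    r.consumed = n;
    return r;
}
```

With no earlier capturing subpatterns, "40" comes back as byte value 32 (a space) and "81" as byte value 0 with no digits consumed, so "8" and "1" remain as literal characters.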
987
988 All the sequences that define a single byte value can be
989 used both inside and outside character classes. In addition,
990 inside a character class, the sequence "\b" is interpreted
991 as the backspace character (hex 08). Outside a character
992 class it has a different meaning (see below).
993
994 The third use of backslash is for specifying generic charac-
995 ter types:
996
997 \d any decimal digit
998 \D any character that is not a decimal digit
999 \s any whitespace character
1000 \S any character that is not a whitespace character
1001 \w any "word" character
1002 \W any "non-word" character
1003
1004 Each pair of escape sequences partitions the complete set of
1005 characters into two disjoint sets. Any given character
1006 matches one, and only one, of each pair.
1007
1008 A "word" character is any letter or digit or the underscore
1009 character, that is, any character which can be part of a
1010 Perl "word". The definition of letters and digits is con-
1011 trolled by PCRE's character tables, and may vary if locale-
1012 specific matching is taking place (see "Locale support"
1013 above). For example, in the "fr" (French) locale, some char-
1014 acter codes greater than 128 are used for accented letters,
1015 and these are matched by \w.
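
As a rough sketch, the six character type escapes can be modelled with the C library classification functions. PCRE consults its own character tables rather than calling these functions at match time, so this is an approximation; the name matches_char_type is hypothetical.

```c
#include <ctype.h>

/* Whether byte c matches the character type escape named by type, one
   of 'd', 'D', 's', 'S', 'w', 'W'. Returns -1 for any other letter.
   Classification follows the current C locale. */
int matches_char_type(char type, unsigned char c)
{
    int positive;

    switch (tolower((unsigned char)type)) {
    case 'd': positive = (isdigit(c) != 0); break;
    case 's': positive = (isspace(c) != 0); break;
    case 'w': positive = (isalnum(c) || c == '_'); break;
    default:  return -1;
    }
    /* The upper case escape is the exact complement of the lower case
       one, so each pair partitions the set of characters. */
    return isupper((unsigned char)type) ? !positive : positive;
}
```

For every byte value, exactly one of matches_char_type('w', c) and matches_char_type('W', c) is 1, and likewise for the other two pairs.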
1016
1017 These character type sequences can appear both inside and
1018 outside character classes. They each match one character of
1019 the appropriate type. If the current matching point is at
1020 the end of the subject string, all of them fail, since there
1021 is no character to match.
1022
1023 The fourth use of backslash is for certain simple asser-
1024 tions. An assertion specifies a condition that has to be met
1025 at a particular point in a match, without consuming any
1026 characters from the subject string. The use of subpatterns
1027 for more complicated assertions is described below. The
1028 backslashed assertions are
1029
1030 \b word boundary
1031 \B not a word boundary
1032 \A start of subject (independent of multiline mode)
1033 \Z end of subject or newline at end (independent of
1034 multiline mode)
1035 \z end of subject (independent of multiline mode)
1036
1037 These assertions may not appear in character classes (but
1038 note that "\b" has a different meaning, namely the backspace
1039 character, inside a character class).
1040
1041 A word boundary is a position in the subject string where
1042 the current character and the previous character do not both
1043 match \w or \W (i.e. one matches \w and the other matches
1044 \W), or the start or end of the string if the first or last
1045 character matches \w, respectively.
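
The word boundary definition can be expressed as a short test in standard C. This is a sketch with hypothetical names (is_word_char, is_word_boundary), using isalnum() in place of PCRE's character tables.

```c
#include <ctype.h>
#include <stddef.h>

/* A "word" character in the sense of \w: letter, digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Whether \b would succeed at offset pos (0 <= pos <= length) in
   subject. The positions before the start and after the end count as
   non-word, which gives the start-of-string and end-of-string rule. */
int is_word_boundary(const char *subject, size_t length, size_t pos)
{
    int prev_is_word = pos > 0 &&
                       is_word_char((unsigned char)subject[pos - 1]);
    int curr_is_word = pos < length &&
                       is_word_char((unsigned char)subject[pos]);

    return prev_is_word != curr_is_word;
}
```

\B is simply the negation of this result at the same position.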
1046
1047 The \A, \Z, and \z assertions differ from the traditional
1048 circumflex and dollar (described below) in that they only
1049 ever match at the very start and end of the subject string,
1050 whatever options are set. They are not affected by the
1051 PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
1052 ment of pcre_exec() is non-zero, \A can never match. The
1053 difference between \Z and \z is that \Z matches before a
1054 newline that is the last character of the string as well as
1055 at the end of the string, whereas \z matches only at the
1056 end.
1057
1058
1059
1060 CIRCUMFLEX AND DOLLAR
1061 Outside a character class, in the default matching mode, the
1062 circumflex character is an assertion which is true only if
1063 the current matching point is at the start of the subject
1065 string. If the startoffset argument of pcre_exec() is non-
1066 zero, circumflex can never match. Inside a character class,
1067 circumflex has an entirely different meaning (see below).
1068
1069 Circumflex need not be the first character of the pattern if
1070 a number of alternatives are involved, but it should be the
1071 first thing in each alternative in which it appears if the
1072 pattern is ever to match that branch. If all possible alter-
1073 natives start with a circumflex, that is, if the pattern is
1074 constrained to match only at the start of the subject, it is
1075 said to be an "anchored" pattern. (There are also other con-
1076 structs that can cause a pattern to be anchored.)
1077
1078 A dollar character is an assertion which is true only if the
1079 current matching point is at the end of the subject string,
1080 or immediately before a newline character that is the last
1081 character in the string (by default). Dollar need not be the
1082 last character of the pattern if a number of alternatives
1083 are involved, but it should be the last item in any branch
1084 in which it appears. Dollar has no special meaning in a
1085 character class.
1086
1087 The meaning of dollar can be changed so that it matches only
1088 at the very end of the string, by setting the
1089 PCRE_DOLLAR_ENDONLY option at compile or matching time. This
1090 does not affect the \Z assertion.
1091
1092 The meanings of the circumflex and dollar characters are
1093 changed if the PCRE_MULTILINE option is set. When this is
1094 the case, they match immediately after and immediately
1095 before an internal "\n" character, respectively, in addition
1096 to matching at the start and end of the subject string. For
1097 example, the pattern /^abc$/ matches the subject string
1098 "def\nabc" in multiline mode, but not otherwise. Conse-
1099 quently, patterns that are anchored in single line mode
1100 because all branches start with "^" are not anchored in mul-
1101 tiline mode, and a match for circumflex is possible when the
1102 startoffset argument of pcre_exec() is non-zero. The
1103 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1104 set.
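
The positions at which circumflex and dollar are true can be summarized in a sketch in standard C. The function names and the plain int flags (standing in for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY) are assumptions made for illustration; the sketch ignores PCRE_NOTBOL, PCRE_NOTEOL, and the startoffset argument.

```c
#include <stddef.h>

/* Whether "^" is true at offset pos of subject. */
int circumflex_matches(const char *subject, size_t pos, int multiline)
{
    if (pos == 0)
        return 1;
    /* In multiline mode, also immediately after an internal newline. */
    return multiline && subject[pos - 1] == '\n';
}

/* Whether "$" is true at offset pos (0 <= pos <= length). The
   dollar_endonly flag is ignored when multiline is set. */
int dollar_matches(const char *subject, size_t length, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == length)
        return 1;                 /* very end of the subject */
    if (subject[pos] != '\n')
        return 0;
    if (multiline)
        return 1;                 /* before any newline */
    /* Default mode: only before a newline that is the last character,
       and not even there when dollar_endonly is set. */
    return !dollar_endonly && pos + 1 == length;
}
```

For the subject "def\nabc", circumflex_matches() at offset 4 is true only in multiline mode, which is why /^abc$/ matches that subject only when PCRE_MULTILINE is set.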
1105
1106 Note that the sequences \A, \Z, and \z can be used to match
1107 the start and end of the subject in both modes, and if all
1108 branches of a pattern start with \A it is always anchored,
1109 whether PCRE_MULTILINE is set or not.
1110
1111
1112
1113 FULL STOP (PERIOD, DOT)
1114 Outside a character class, a dot in the pattern matches any
1115 one character in the subject, including a non-printing char-
1116 acter, but not (by default) newline. If the PCRE_DOTALL
1118 option is set, dots match newlines as well. The handling of
1119 dot is entirely independent of the handling of circumflex
1120 and dollar, the only relationship being that they both
1121 involve newline characters. Dot has no special meaning in a
1122 character class.
1123
1124
1125
1126 SQUARE BRACKETS
1127 An opening square bracket introduces a character class, ter-
1128 minated by a closing square bracket. A closing square
1129 bracket on its own is not special. If a closing square
1130 bracket is required as a member of the class, it should be
1131 the first data character in the class (after an initial cir-
1132 cumflex, if present) or escaped with a backslash.
1133
1134 A character class matches a single character in the subject;
1135 the character must be in the set of characters defined by
1136 the class, unless the first character in the class is a cir-
1137 cumflex, in which case the subject character must not be in
1138 the set defined by the class. If a circumflex is actually
1139 required as a member of the class, ensure it is not the
1140 first character, or escape it with a backslash.
1141
1142 For example, the character class [aeiou] matches any lower
1143 case vowel, while [^aeiou] matches any character that is not
1144 a lower case vowel. Note that a circumflex is just a con-
1145 venient notation for specifying the characters which are in
1146 the class by enumerating those that are not. It is not an
1147 assertion: it still consumes a character from the subject
1148 string, and fails if the current pointer is at the end of
1149 the string.
1150
1151 When caseless matching is set, any letters in a class
1152 represent both their upper case and lower case versions, so
1153 for example, a caseless [aeiou] matches "A" as well as "a",
1154 and a caseless [^aeiou] does not match "A", whereas a case-
1155 ful version would.
1156
1157 The newline character is never treated in any special way in
1158 character classes, whatever the setting of the PCRE_DOTALL
1159 or PCRE_MULTILINE options is. A class such as [^a] will
1160 always match a newline.
1161
1162 The minus (hyphen) character can be used to specify a range
1163 of characters in a character class. For example, [d-m]
1164 matches any letter between d and m, inclusive. If a minus
1165 character is required in a class, it must be escaped with a
1166 backslash or appear in a position where it cannot be inter-
1167 preted as indicating a range, typically as the first or last
1168 character in the class.
1169
1170 It is not possible to have the literal character "]" as the
1171 end character of a range. A pattern such as [W-]46] is
1172 interpreted as a class of two characters ("W" and "-") fol-
1173 lowed by a literal string "46]", so it would match "W46]" or
1174 "-46]". However, if the "]" is escaped with a backslash it
1175 is interpreted as the end of range, so [W-\]46] is inter-
1176 preted as a single class containing a range followed by two
1177 separate characters. The octal or hexadecimal representation
1178 of "]" can also be used to end a range.
1179
1180 Ranges operate in ASCII collating sequence. They can also be
1181 used for characters specified numerically, for example
1182 [\000-\037]. If a range that includes letters is used when
1183 caseless matching is set, it matches the letters in either
1184 case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
1185 matched caselessly, and if character tables for the "fr"
1186 locale are in use, [\xc8-\xcb] matches accented E characters
1187 in both cases.
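
Range membership, with and without caseless matching, can be sketched in standard C as follows. The function name in_class_range is hypothetical, and toupper() and tolower() stand in for PCRE's case-flipping table, so results for characters above 127 depend on the C locale.

```c
#include <ctype.h>

/* Whether byte c falls in the class range lo-hi (inclusive, compared by
   character code). With caseless set, c also matches if its other-case
   form falls in the range. */
int in_class_range(unsigned char c, unsigned char lo, unsigned char hi,
                   int caseless)
{
    if (c >= lo && c <= hi)
        return 1;
    if (caseless) {
        unsigned char other =
            (unsigned char)(isupper(c) ? tolower(c) : toupper(c));
        return other >= lo && other <= hi;
    }
    return 0;
}
```

With lo = 'W' and hi = 'c', the caseless test accepts "w", "x", "y", "z", "a", "b", and "c" in either case plus the six punctuation characters between "Z" and "a", which is the set given above for [W-c].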
1188
1189 The character types \d, \D, \s, \S, \w, and \W may also
1190 appear in a character class, and add the characters that
1191 they match to the class. For example, [\dABCDEF] matches any
1192 hexadecimal digit. A circumflex can conveniently be used
1193 with the upper case character types to specify a more res-
1194 tricted set of characters than the matching lower case type.
1195 For example, the class [^\W_] matches any letter or digit,
1196 but not underscore.
1197
1198 All non-alphameric characters other than \, -, ^ (at the
1199 start) and the terminating ] are non-special in character
1200 classes, but it does no harm if they are escaped.
1201
1202
1203
1204 POSIX CHARACTER CLASSES
1205 Perl 5.6 (not yet released at the time of writing) is going
1206 to support the POSIX notation for character classes, which
1207 uses names enclosed by [: and :] within the enclosing
1208 square brackets. PCRE supports this notation. For example,
1209
1210 [01[:alpha:]%]
1211
1212 matches "0", "1", any alphabetic character, or "%". The sup-
1213 ported class names are
1214
1215 alnum    letters and digits
1216 alpha    letters
1217 ascii    character codes 0 - 127
1218 cntrl    control characters
1219 digit    decimal digits (same as \d)
1220 graph    printing characters, excluding space
1221 lower    lower case letters
1222 print    printing characters, including space
1223 punct    printing characters, excluding letters and digits
1224 space    white space (same as \s)
1225 upper    upper case letters
1226 word     "word" characters (same as \w)
1227 xdigit   hexadecimal digits
1228
1229 The names "ascii" and "word" are Perl extensions. Another
1230 Perl extension is negation, which is indicated by a ^ char-
1231 acter after the colon. For example,
1232
1233 [12[:^digit:]]
1234
1235 matches "1", "2", or any non-digit. PCRE (and Perl) also
1236 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
1237 "collating element", but these are not supported, and an
1238 error is given if they are encountered.
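
One way to picture the class names is as a mapping onto the C library classification functions. The sketch below assumes that mapping (PCRE uses its own tables, and the C library's ispunct() also excludes the space character); the function name posix_class_matches is hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Whether byte c belongs to the POSIX class called name, for example
   "alpha". A leading '^' in the name negates the class, as in
   [:^digit:]. Returns -1 for an unknown name. */
int posix_class_matches(const char *name, unsigned char c)
{
    int negate = (name[0] == '^');
    int result;

    if (negate)
        name++;

    if      (strcmp(name, "alnum")  == 0) result = isalnum(c);
    else if (strcmp(name, "alpha")  == 0) result = isalpha(c);
    else if (strcmp(name, "ascii")  == 0) result = (c <= 127);
    else if (strcmp(name, "cntrl")  == 0) result = iscntrl(c);
    else if (strcmp(name, "digit")  == 0) result = isdigit(c);
    else if (strcmp(name, "graph")  == 0) result = isgraph(c);
    else if (strcmp(name, "lower")  == 0) result = islower(c);
    else if (strcmp(name, "print")  == 0) result = isprint(c);
    else if (strcmp(name, "punct")  == 0) result = ispunct(c);
    else if (strcmp(name, "space")  == 0) result = isspace(c);
    else if (strcmp(name, "upper")  == 0) result = isupper(c);
    else if (strcmp(name, "word")   == 0) result = (isalnum(c) || c == '_');
    else if (strcmp(name, "xdigit") == 0) result = isxdigit(c);
    else
        return -1;

    /* The is*() functions return any non-zero value for "true". */
    result = (result != 0);
    return negate ? !result : result;
}
```

A class such as [12[:^digit:]] then accepts a byte c when c is "1" or "2", or when posix_class_matches("^digit", c) is 1.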
1239
1240
1241
1242 VERTICAL BAR
1243 Vertical bar characters are used to separate alternative
1244 patterns. For example, the pattern
1245
1246 gilbert|sullivan
1247
1248 matches either "gilbert" or "sullivan". Any number of alter-
1249 natives may appear, and an empty alternative is permitted
1250 (matching the empty string). The matching process tries
1251 each alternative in turn, from left to right, and the first
1252 one that succeeds is used. If the alternatives are within a
1253 subpattern (defined below), "succeeds" means matching the
1254 rest of the main pattern as well as the alternative in the
1255 subpattern.
1256
1257
1258
1259 INTERNAL OPTION SETTING
1260 The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
1261 and PCRE_EXTENDED can be changed from within the pattern by
1262 a sequence of Perl option letters enclosed between "(?" and
1263 ")". The option letters are
1264
1265 i for PCRE_CASELESS
1266 m for PCRE_MULTILINE
1267 s for PCRE_DOTALL
1268 x for PCRE_EXTENDED
1269
1270 For example, (?im) sets caseless, multiline matching. It is
1271 also possible to unset these options by preceding the letter
1272 with a hyphen, and a combined setting and unsetting such as
1273 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1274 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1275 If a letter appears both before and after the hyphen, the
1276 option is unset.
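
The way a sequence of option letters is applied might be sketched as follows. The OPT_ constants are illustrative bit values invented for this example, not the values defined in pcre.h, and apply_option_letters is a hypothetical name.

```c
/* Illustrative flag bits; these are NOT the values defined in pcre.h. */
enum {
    OPT_CASELESS  = 1 << 0,   /* i */
    OPT_MULTILINE = 1 << 1,   /* m */
    OPT_DOTALL    = 1 << 2,   /* s */
    OPT_EXTENDED  = 1 << 3    /* x */
};

/* Apply the letters found between "(?" and ")" (for example "im-sx")
   to *options. Returns 0 on success, or -1 on an unknown letter, in
   which case *options is left unchanged. */
int apply_option_letters(const char *letters, unsigned *options)
{
    unsigned set = 0, unset = 0;
    unsigned *target = &set;   /* letters before the hyphen are set */

    for (; *letters != '\0'; letters++) {
        switch (*letters) {
        case '-': target = &unset;           break;
        case 'i': *target |= OPT_CASELESS;   break;
        case 'm': *target |= OPT_MULTILINE;  break;
        case 's': *target |= OPT_DOTALL;     break;
        case 'x': *target |= OPT_EXTENDED;   break;
        default:  return -1;
        }
    }
    /* Unsetting is applied last, so a letter that appears on both
       sides of the hyphen ends up unset. */
    *options = (*options | set) & ~unset;
    return 0;
}
```

Starting from OPT_DOTALL | OPT_EXTENDED, the letters "im-sx" leave exactly OPT_CASELESS | OPT_MULTILINE set.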
1277
1278 The scope of these option changes depends on where in the
1279 pattern the setting occurs. For settings that are outside
1280 any subpattern (defined below), the effect is the same as if
1281 the options were set or unset at the start of matching. The
1282 following patterns all behave in exactly the same way:
1283
1284 (?i)abc
1285 a(?i)bc
1286 ab(?i)c
1287 abc(?i)
1288
1289 which in turn is the same as compiling the pattern abc with
1290 PCRE_CASELESS set. In other words, such "top level" set-
1291 tings apply to the whole pattern (unless there are other
1292 changes inside subpatterns). If there is more than one set-
1293 ting of the same option at top level, the rightmost setting
1294 is used.
1295
1296 If an option change occurs inside a subpattern, the effect
1297 is different. This is a change of behaviour in Perl 5.005.
1298 An option change inside a subpattern affects only that part
1299 of the subpattern that follows it, so
1300
1301 (a(?i)b)c
1302
1303 matches abc and aBc and no other strings (assuming
1304 PCRE_CASELESS is not used). By this means, options can be
1305 made to have different settings in different parts of the
1306 pattern. Any changes made in one alternative do carry on
1307 into subsequent branches within the same subpattern. For
1308 example,
1309
1310 (a(?i)b|c)
1311
1312 matches "ab", "aB", "c", and "C", even though when matching
1313 "C" the first branch is abandoned before the option setting.
1314 This is because the effects of option settings happen at
1315 compile time. There would be some very weird behaviour oth-
1316 erwise.
1317
1318 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1319 be changed in the same way as the Perl-compatible options by
1320 using the characters U and X respectively. The (?X) flag
1321 setting is special in that it must always occur earlier in
1322 the pattern than any of the additional features it turns on,
1323 even when it is at top level. It is best put at the start.
1324
1325
1326
1327 SUBPATTERNS
1328 Subpatterns are delimited by parentheses (round brackets),
1329 which can be nested. Marking part of a pattern as a subpat-
1330 tern does two things:
1331
1332 1. It localizes a set of alternatives. For example, the pat-
1333 tern
1334
1335 cat(aract|erpillar|)
1336
1337 matches one of the words "cat", "cataract", or "caterpil-
1338 lar". Without the parentheses, it would match "cataract",
1339 "erpillar" or the empty string.
1340
1341 2. It sets up the subpattern as a capturing subpattern (as
1342 defined above). When the whole pattern matches, that por-
1343 tion of the subject string that matched the subpattern is
1344 passed back to the caller via the ovector argument of
1345 pcre_exec(). Opening parentheses are counted from left to
1346 right (starting from 1) to obtain the numbers of the captur-
1347 ing subpatterns.
1348
1349 For example, if the string "the red king" is matched against
1350 the pattern
1351
1352 the ((red|white) (king|queen))
1353
1354 the captured substrings are "red king", "red", and "king",
1355 and are numbered 1, 2, and 3.
1356
1357 The fact that plain parentheses fulfil two functions is not
1358 always helpful. There are often times when a grouping sub-
1359 pattern is required without a capturing requirement. If an
1360 opening parenthesis is followed by "?:", the subpattern does
1361 not do any capturing, and is not counted when computing the
1362 number of any subsequent capturing subpatterns. For example,
1363 if the string "the white queen" is matched against the pat-
1364 tern
1365
1366 the ((?:red|white) (king|queen))
1367
1368 the captured substrings are "white queen" and "queen", and
1369 are numbered 1 and 2. The maximum number of captured sub-
1370 strings is 99, and the maximum number of all subpatterns,
1371 both capturing and non-capturing, is 200.
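
The numbering rule (count opening parentheses from left to right, skipping those followed by "?") can be sketched in standard C. The function count_capturing_groups is a hypothetical helper, and its handling of character classes is deliberately simplified: it does not treat "]" as a literal when it is the first character of a class.

```c
#include <stddef.h>

/* Count the capturing subpatterns in pattern. An opening parenthesis
   followed by "?" starts a non-capturing group, an option setting, or
   an assertion, none of which is counted. Escaped characters and the
   contents of character classes are skipped. */
int count_capturing_groups(const char *pattern)
{
    int count = 0;
    int in_class = 0;
    size_t i;

    for (i = 0; pattern[i] != '\0'; i++) {
        char c = pattern[i];

        if (c == '\\') {
            if (pattern[i + 1] != '\0')
                i++;              /* skip the escaped character */
        } else if (in_class) {
            if (c == ']')
                in_class = 0;
        } else if (c == '[') {
            in_class = 1;
        } else if (c == '(' && pattern[i + 1] != '?') {
            count++;
        }
    }
    return count;
}
```

The number returned for a pattern is also the number of its last capturing subpattern, so the two "the ..." patterns above give 3 and 2 respectively.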
1372
1373 As a convenient shorthand, if any option settings are
1374 required at the start of a non-capturing subpattern, the
1375 option letters may appear between the "?" and the ":". Thus
1376 the two patterns
1377
1378 (?i:saturday|sunday)
1379 (?:(?i)saturday|sunday)
1380
1381 match exactly the same set of strings. Because alternative
1382 branches are tried from left to right, and options are not
1383 reset until the end of the subpattern is reached, an option
1384 setting in one branch does affect subsequent branches, so
1385 the above patterns match "SUNDAY" as well as "Saturday".
1386
1387
1388
1389 REPETITION
1390 Repetition is specified by quantifiers, which can follow any
1391 of the following items:
1392
1393 a single character, possibly escaped
1394 the . metacharacter
1395 a character class
1396 a back reference (see next section)
1397 a parenthesized subpattern (unless it is an assertion -
1398 see below)
1399
1400 The general repetition quantifier specifies a minimum and
1401 maximum number of permitted matches, by giving the two
1402 numbers in curly brackets (braces), separated by a comma.
1403 The numbers must be less than 65536, and the first must be
1404 less than or equal to the second. For example:
1405
1406 z{2,4}
1407
1408 matches "zz", "zzz", or "zzzz". A closing brace on its own
1409 is not a special character. If the second number is omitted,
1410 but the comma is present, there is no upper limit; if the
1411 second number and the comma are both omitted, the quantifier
1412 specifies an exact number of required matches. Thus
1413
1414 [aeiou]{3,}
1415
1416 matches at least 3 successive vowels, but may match many
1417 more, while
1418
1419 \d{8}
1420
1421 matches exactly 8 digits. An opening curly bracket that
1422 appears in a position where a quantifier is not allowed, or
1423 one that does not match the syntax of a quantifier, is taken
1424 as a literal character. For example, {,6} is not a quantif-
1425 ier, but a literal string of four characters.
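
The syntax rules for the general quantifier can be collected into a small parser. This is a sketch in standard C, not PCRE's own code; parse_brace_quantifier and REPEAT_UNLIMITED are hypothetical names.

```c
#include <ctype.h>

#define REPEAT_UNLIMITED (-1L)

/* Try to read a quantifier of the form {n}, {n,} or {n,m} at the start
   of text. On success, store the bounds (REPEAT_UNLIMITED for a missing
   maximum) and return the number of characters consumed. Return 0 if
   the text is not a quantifier, so that "{" is a literal. Return -1 if
   it has quantifier syntax but breaks a limit: a number of 65536 or
   more, or a minimum greater than the maximum. */
int parse_brace_quantifier(const char *text, long *min, long *max)
{
    const char *p = text;
    long lo = 0, hi;
    int too_big = 0;

    if (*p != '{')
        return 0;
    p++;
    if (!isdigit((unsigned char)*p))
        return 0;                      /* e.g. "{,6}" is literal text */
    while (isdigit((unsigned char)*p)) {
        lo = lo * 10 + (*p++ - '0');
        if (lo >= 65536) { too_big = 1; lo = 65536; }
    }
    hi = lo;                           /* {n} means exactly n */
    if (*p == ',') {
        p++;
        if (*p == '}') {
            hi = REPEAT_UNLIMITED;     /* {n,} has no upper limit */
        } else {
            if (!isdigit((unsigned char)*p))
                return 0;
            hi = 0;
            while (isdigit((unsigned char)*p)) {
                hi = hi * 10 + (*p++ - '0');
                if (hi >= 65536) { too_big = 1; hi = 65536; }
            }
        }
    }
    if (*p != '}')
        return 0;
    if (too_big || (hi != REPEAT_UNLIMITED && lo > hi))
        return -1;
    *min = lo;
    *max = hi;
    return (int)(p - text) + 1;
}
```

The single-character quantifiers need no parsing: *, + and ? correspond to the bounds (0, unlimited), (1, unlimited), and (0, 1).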
1426
1427 The quantifier {0} is permitted, causing the expression to
1428 behave as if the previous item and the quantifier were not
1429 present.
1430
1431 For convenience (and historical compatibility) the three
1432 most common quantifiers have single-character abbreviations:
1433
1434 * is equivalent to {0,}
1435 + is equivalent to {1,}
1436 ? is equivalent to {0,1}
1437
1438 It is possible to construct infinite loops by following a
1439 subpattern that can match no characters with a quantifier
1440 that has no upper limit, for example:
1441
1442 (a?)*
1443
1444 Earlier versions of Perl and PCRE used to give an error at
1445 compile time for such patterns. However, because there are
1446 cases where this can be useful, such patterns are now
1447 accepted, but if any repetition of the subpattern does in
1448 fact match no characters, the loop is forcibly broken.
1449
1450 By default, the quantifiers are "greedy", that is, they
1451 match as much as possible (up to the maximum number of per-
1452 mitted times), without causing the rest of the pattern to
1453 fail. The classic example of where this gives problems is in
1454 trying to match comments in C programs. These appear between
1455 the sequences /* and */ and within the sequence, individual
1456 * and / characters may appear. An attempt to match C com-
1457 ments by applying the pattern
1458
1459 /\*.*\*/
1460
1461 to the string
1462
1463 /* first comment */ not comment /* second comment */
1464
1465 fails, because it matches the entire string owing to the
1466 greediness of the .* item.
1467
1468 However, if a quantifier is followed by a question mark, it
1469 ceases to be greedy, and instead matches the minimum number
1470 of times possible, so the pattern
1471
1472 /\*.*?\*/
1473
1474 does the right thing with the C comments. The meaning of the
1475 various quantifiers is not otherwise changed, just the pre-
1476 ferred number of matches. Do not confuse this use of ques-
1477 tion mark with its use as a quantifier in its own right.
1478 Because it has two uses, it can sometimes appear doubled, as
1479 in
1480
1481 \d??\d
1482
1483 which matches one digit by preference, but can match two if
1484 that is the only way the rest of the pattern matches.
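
The difference between the greedy and the minimizing form of the C comment pattern can be imitated without a regular expression engine. The sketch below is a hand-written stand-in in standard C, valid only for this one pattern and for subjects without newlines; comment_match_length is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Length of the match for the C comment pattern, starting at the first
   comment opener in subject, or 0 if there is no match. With lazy == 0
   the dot-star item is greedy and runs to the LAST comment closer; with
   lazy != 0 it is minimized and stops at the FIRST one. */
size_t comment_match_length(const char *subject, int lazy)
{
    const char *start = strstr(subject, "/*");
    const char *end = NULL;
    const char *p;

    if (start == NULL)
        return 0;
    /* The closer must begin after the two characters of the opener. */
    for (p = strstr(start + 2, "*/"); p != NULL; p = strstr(p + 1, "*/")) {
        end = p;
        if (lazy)
            break;
    }
    return end == NULL ? 0 : (size_t)(end + 2 - start);
}
```

Applied to a subject holding two comments with text between them, the greedy form returns the length of the whole span from the first opener to the last closer, while the minimizing form returns the length of the first comment only.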
1485
1486 If the PCRE_UNGREEDY option is set (an option which is not
1487 available in Perl), the quantifiers are not greedy by
1488 default, but individual ones can be made greedy by following
1489 them with a question mark. In other words, it inverts the
1490 default behaviour.
1491
1492 When a parenthesized subpattern is quantified with a minimum
1493 repeat count that is greater than 1 or with a limited
1494 maximum, more memory is required for the compiled pattern, in
1495 proportion to the size of the minimum or maximum.
1496
1497 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1498 option (equivalent to Perl's /s) is set, thus allowing the .
1499 to match newlines, the pattern is implicitly anchored,
1500 because whatever follows will be tried against every charac-
1501 ter position in the subject string, so there is no point in
1502 retrying the overall match at any position after the first.
1503 PCRE treats such a pattern as though it were preceded by \A.
1504 In cases where it is known that the subject string contains
1505 no newlines, it is worth setting PCRE_DOTALL when the pat-
1506 tern begins with .* in order to obtain this optimization, or
1507 alternatively using ^ to indicate anchoring explicitly.
1508
1509 When a capturing subpattern is repeated, the value captured
1510 is the substring that matched the final iteration. For exam-
1511 ple, after
1512
1513 (tweedle[dume]{3}\s*)+
1514
1515 has matched "tweedledum tweedledee" the value of the cap-
1516 tured substring is "tweedledee". However, if there are
1517 nested capturing subpatterns, the corresponding captured
1518 values may have been set in previous iterations. For exam-
1519 ple, after
1520
1521 /(a|(b))+/
1522
1523 matches "aba" the value of the second captured substring is
1524 "b".
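
The two capture rules above can be observed directly. The sketch
below assumes that Python's re module, used here in place of PCRE,
follows the same rules for repeated and nested groups; the variable
names are illustrative.

```python
import re

# A repeated group keeps only what its final iteration matched.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group (b) took part only in the second of three
# iterations, yet its value is still set after the match.
n = re.match(r"(a|(b))+", "aba")
outer, inner = n.group(1), n.group(2)

print(repr(last_iteration), outer, inner)
```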
1525
1526
1527
1528 BACK REFERENCES
1529 Outside a character class, a backslash followed by a digit
1530 greater than 0 (and possibly further digits) is a back
1531 reference to a capturing subpattern earlier (i.e. to its
1532 left) in the pattern, provided there have been that many
1533 previous capturing left parentheses.
1534
1535 However, if the decimal number following the backslash is
1536 less than 10, it is always taken as a back reference, and
1537 causes an error only if there are not that many capturing
1538 left parentheses in the entire pattern. In other words, the
1539 parentheses that are referenced need not be to the left of
1540 the reference for numbers less than 10. See the section
1541 entitled "Backslash" above for further details of the han-
1542 dling of digits following a backslash.
1543
1544 A back reference matches whatever actually matched the cap-
1545 turing subpattern in the current subject string, rather than
1546 anything matching the subpattern itself. So the pattern
1547
1548 (sens|respons)e and \1ibility
1549
1550 matches "sense and sensibility" and "response and responsi-
1551 bility", but not "sense and responsibility". If caseful
1552 matching is in force at the time of the back reference, the
1553 case of letters is relevant. For example,
1554
1555 ((?i)rah)\s+\1
1556
1557 matches "rah rah" and "RAH RAH", but not "RAH rah", even
1558 though the original capturing subpattern is matched case-
1559 lessly.
1560
1561 There may be more than one back reference to the same sub-
1562 pattern. If a subpattern has not actually been used in a
1563 particular match, any back references to it always fail. For
1564 example, the pattern
1565
1566 (a|(bc))\2
1567
1568 always fails if it starts to match "a" rather than "bc".
1569 Because there may be up to 99 back references, all digits
1570 following the backslash are taken as part of a potential
1571 back reference number. If the pattern continues with a digit
1572 character, some delimiter must be used to terminate the back
1573 reference. If the PCRE_EXTENDED option is set, this can be
1574 whitespace. Otherwise an empty comment can be used.
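
A minimal sketch of the first and last back reference examples
above, using Python's re module on the assumption that it handles
these back references as PCRE does. The variable names are
illustrative.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pattern = re.compile(r"(sens|respons)e and \1ibility")
same_word = [
    bool(pattern.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# Group 2 is unset when the "a" branch is taken, so \2 cannot match.
unset = re.compile(r"(a|(bc))\2")
took_a = unset.match("abc")
took_bc = unset.match("bcbc")

print(same_word, took_a, took_bc.group())
```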
1575
1576 A back reference that occurs inside the parentheses to which
1577 it refers fails when the subpattern is first used, so, for
1578 example, (a\1) never matches. However, such references can
1579 be useful inside repeated subpatterns. For example, the pat-
1580 tern
1581
1582 (a|b\1)+
1583
1584 matches any number of "a"s and also "aba", "ababbaa" etc. At
1585 each iteration of the subpattern, the back reference matches
1586 the character string corresponding to the previous
1587 iteration. In order for this to work, the pattern must be
1588 such that the first iteration does not need to match the
1589 back reference. This can be done using alternation, as in
1590 the example above, or by a quantifier with a minimum of
1591 zero.
1592
1593
1594
1595 ASSERTIONS
1596 An assertion is a test on the characters following or
1597 preceding the current matching point that does not actually
1598 consume any characters. The simple assertions coded as \b,
1599 \B, \A, \Z, \z, ^ and $ are described above. More compli-
1600 cated assertions are coded as subpatterns. There are two
1601 kinds: those that look ahead of the current position in the
1602 subject string, and those that look behind it.
1603
1604 An assertion subpattern is matched in the normal way, except
1605 that it does not cause the current matching position to be
1606 changed. Lookahead assertions start with (?= for positive
1607 assertions and (?! for negative assertions. For example,
1608
1609 \w+(?=;)
1610
1611 matches a word followed by a semicolon, but does not include
1612 the semicolon in the match, and
1613
1614 foo(?!bar)
1615
1616 matches any occurrence of "foo" that is not followed by
1617 "bar". Note that the apparently similar pattern
1618
1619 (?!foo)bar
1620
1621 does not find an occurrence of "bar" that is preceded by
1622 something other than "foo"; it finds any occurrence of "bar"
1623 whatsoever, because the assertion (?!foo) is always true
1624 when the next three characters are "bar". A lookbehind
1625 assertion is needed to achieve this effect.
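
The three lookahead patterns above can be tried as follows. This
is a sketch that assumes Python's re module treats lookahead
assertions as PCRE does; the subject strings are made up for the
illustration.

```python
import re

# The semicolon is tested but is not part of the match.
word = re.search(r"\w+(?=;)", "count = total;").group()

# Offsets of each "foo" that is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" starts, so it can never fail there:
# every "bar" is found, including the one preceded by "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

print(word, foo_not_bar, any_bar)
```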
1626
1627 Lookbehind assertions start with (?<= for positive asser-
1628 tions and (?<! for negative assertions. For example,
1629
1630 (?<!foo)bar
1631
1632 does find an occurrence of "bar" that is not preceded by
1633 "foo". The contents of a lookbehind assertion are restricted
1634 such that all the strings it matches must have a fixed
1635 length. However, if there are several alternatives, they do
1636 not all have to have the same fixed length. Thus
1637
1638 (?<=bullock|donkey)
1639
1640 is permitted, but
1641
1642 (?<!dogs?|cats?)
1643
1644 causes an error at compile time. Branches that match dif-
1645 ferent length strings are permitted only at the top level of
1646 a lookbehind assertion. This is an extension compared with
1647 Perl 5.005, which requires all branches to match the same
1648 length of string. An assertion such as
1649
1650 (?<=ab(c|de))
1651
1652 is not permitted, because its single top-level branch can
1653 match two different lengths, but it is acceptable if rewrit-
1654 ten to use two top-level branches:
1655
1656 (?<=abc|abde)
1657
1658 The implementation of lookbehind assertions is, for each
1659 alternative, to temporarily move the current position back
1660 by the fixed width and then try to match. If there are
1661 insufficient characters before the current position, the
1662 match is deemed to fail. Lookbehinds in conjunction with
1663 once-only subpatterns can be particularly useful for match-
1664 ing at the ends of strings; an example is given at the end
1665 of the section on once-only subpatterns.
1666
1667 Several assertions (of any sort) may occur in succession.
1668 For example,
1669
1670 (?<=\d{3})(?<!999)foo
1671
1672 matches "foo" preceded by three digits that are not "999".
1673 Notice that each of the assertions is applied independently
1674 at the same point in the subject string. First there is a
1675 check that the previous three characters are all digits, and
1676 then there is a check that the same three characters are not
1677 "999". This pattern does not match "foo" preceded by six
1678 characters, the first three of which are digits and the last three
1679 of which are not "999". For example, it doesn't match
1680 "123abcfoo". A pattern to do that is
1681
1682 (?<=\d{3}...)(?<!999)foo
1683
1684 This time the first assertion looks at the preceding six
1685 characters, checking that the first three are digits, and
1686 then the second assertion checks that the preceding three
1687 characters are not "999".
1688
1689 Assertions can be nested in any combination. For example,
1690
1691 (?<=(?<!foo)bar)baz
1692
1693 matches an occurrence of "baz" that is preceded by "bar"
1694 which in turn is not preceded by "foo", while
1695
1696 (?<=\d{3}(?!999)...)foo
1697
1698 is another pattern which matches "foo" preceded by three
1699 digits and any three characters that are not "999".
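
The successive and nested assertions above can be exercised with a
few subject strings. This sketch uses Python's re module, which is
stricter than PCRE about lookbehind (every alternative must have
the same length), so only the fixed-length patterns are tried. The
names are illustrative.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")

# Offsets of each "bar" that is not preceded by "foo".
bar_offsets = [m.start() for m in not_after_foo.finditer("foobar xbar")]

# For each subject: (three_digits_not_999 matches, six_back matches).
results = {
    s: (bool(three_digits_not_999.search(s)), bool(six_back.search(s)))
    for s in ("123foo", "999foo", "123abcfoo", "123999foo")
}

nested = (
    bool(nested_behind.search("xbarbaz")),
    bool(nested_behind.search("foobarbaz")),
    bool(nested_ahead.search("123abcfoo")),
    bool(nested_ahead.search("123999foo")),
)

print(bar_offsets, results, nested)
```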
1700
1701 Assertion subpatterns are not capturing subpatterns, and may
1702 not be repeated, because it makes no sense to assert the
1703 same thing several times. If any kind of assertion contains
1704 capturing subpatterns within it, these are counted for the
1705 purposes of numbering the capturing subpatterns in the whole
1706 pattern. However, substring capturing is carried out only
1707 for positive assertions, because it does not make sense for
1708 negative assertions.
1709
1710 Assertions count towards the maximum of 200 parenthesized
1711 subpatterns.
1712
1713
1714
1715 ONCE-ONLY SUBPATTERNS
1716 With both maximizing and minimizing repetition, failure of
1717 what follows normally causes the repeated item to be re-
1718 evaluated to see if a different number of repeats allows the
1719 rest of the pattern to match. Sometimes it is useful to
1720 prevent this, either to change the nature of the match, or
1721 to cause it to fail earlier than it otherwise might, when the
1722 author of the pattern knows there is no point in carrying
1723 on.
1724
1725 Consider, for example, the pattern \d+foo when applied to
1726 the subject line
1727
1728 123456bar
1729
1730 After matching all 6 digits and then failing to match "foo",
1731 the normal action of the matcher is to try again with only 5
1732 digits matching the \d+ item, and then with 4, and so on,
1733 before ultimately failing. Once-only subpatterns provide the
1734 means for specifying that once a portion of the pattern has
1735 matched, it is not to be re-evaluated in this way, so the
1736 matcher would give up immediately on failing to match "foo"
1737 the first time. The notation is another kind of special
1738 parenthesis, starting with (?> as in this example:
1739
1740 (?>\d+)foo
1741
1742 This kind of parenthesis "locks up" the part of the pattern
1743 it contains once it has matched, and a failure further into
1744 the pattern is prevented from backtracking into it.
1745 Backtracking past it to previous items, however, works as
1746 normal.
1747
1748 An alternative description is that a subpattern of this type
1749 matches the string of characters that an identical stan-
1750 dalone pattern would match, if anchored at the current point
1751 in the subject string.
1752
1753 Once-only subpatterns are not capturing subpatterns. Simple
1754 cases such as the above example can be thought of as a max-
1755 imizing repeat that must swallow everything it can. So,
1756 while both \d+ and \d+? are prepared to adjust the number of
1757 digits they match in order to make the rest of the pattern
1758 match, (?>\d+) can only match an entire sequence of digits.
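
The "locking" effect can be seen in a small experiment. Python's re
module accepts (?>...) only from version 3.11, so this sketch
emulates the construct with a lookahead and a back reference,
(?=(X))\1. That idiom relies on the assumption that the engine never
backtracks into an assertion once it has succeeded, and it adds a
capturing group that the real construct would not.

```python
import re

# Ordinary greedy repeat: \d+ hands back one digit so that the
# final \d can still match.
backtracking = re.search(r"\d+\d", "123")

# Emulated once-only group: the lookahead captures the longest run
# of digits, \1 consumes exactly that run, and nothing is handed back.
once_only = re.search(r"(?=(\d+))\1\d", "123")

# When the rest of the pattern fits, the locked group matches normally.
locked_ok = re.search(r"(?=(\d+))\1foo", "123456foo")
locked_fail = re.search(r"(?=(\d+))\1foo", "123456bar")

print(backtracking.group(), once_only, locked_ok.group(), locked_fail)
```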
1759
1760 This construction can of course contain arbitrarily compli-
1761 cated subpatterns, and it can be nested.
1762
1763 Once-only subpatterns can be used in conjunction with look-
1764 behind assertions to specify efficient matching at the end
1765 of the subject string. Consider a simple pattern such as
1766
1767 abcd$
1768
1769 when applied to a long string which does not match. Because
1770 matching proceeds from left to right, PCRE will look for
1771 each "a" in the subject and then see if what follows matches
1772 the rest of the pattern. If the pattern is specified as
1773
1774 ^.*abcd$
1775
1776 the initial .* matches the entire string at first, but when
1777 this fails (because there is no following "a"), it back-
1778 tracks to match all but the last character, then all but the
1779 last two characters, and so on. Once again the search for
1780 "a" covers the entire string, from right to left, so we are
1781 no better off. However, if the pattern is written as
1782
1783 ^(?>.*)(?<=abcd)
1784
1785 there can be no backtracking for the .* item; it can match
1786 only the entire string. The subsequent lookbehind assertion
1787 does a single test on the last four characters. If it fails,
1788 the match fails immediately. For long strings, this approach
1789 makes a significant difference to the processing time.
1790
1791 When a pattern contains an unlimited repeat inside a subpat-
1792 tern that can itself be repeated an unlimited number of
1793 times, the use of a once-only subpattern is the only way to
1794 avoid some failing matches taking a very long time indeed.
1795 The pattern
1796
1797 (\D+|<\d+>)*[!?]
1798
1799 matches an unlimited number of substrings that either con-
1800 sist of non-digits, or digits enclosed in <>, followed by
1801 either ! or ?. When it matches, it runs quickly. However, if
1802 it is applied to
1803
1804 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1805
1806 it takes a long time before reporting failure. This is
1807 because the string can be divided between the two repeats in
1808 a large number of ways, and all have to be tried. (The exam-
1809 ple used [!?] rather than a single character at the end,
1810 because both PCRE and Perl have an optimization that allows
1811 for fast failure when a single character is used. They
1812 remember the last single character that is required for a
1813 match, and fail early if it is not present in the string.)
1814 If the pattern is changed to
1815
1816 ((?>\D+)|<\d+>)*[!?]
1817
1818 sequences of non-digits cannot be broken, and failure hap-
1819 pens quickly.
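
A rough way to try this without waiting for a very long failure is
sketched below. It uses Python's re module and emulates the
once-only group with the (?=(X))\1 idiom, which is an assumption of
this example rather than PCRE syntax. The unprotected pattern is
run only on a short subject, because its running time roughly
doubles with every extra character.

```python
import re

nested = re.compile(r"(\D+|<\d+>)*[!?]")
# Stand-in for ((?>\D+)|<\d+>)*[!?]; group 2 exists only for the emulation.
locked = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

short_subject = "a" * 16   # kept short so the slow pattern finishes quickly
long_subject = "a" * 52    # the length used in the text above

slow_result = nested.match(short_subject)
fast_result = locked.match(long_subject)

print(slow_result, fast_result)
```

Both calls report failure, but only the locked pattern stays quick
as the subject grows.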
1820
1821
1822
1823 CONDITIONAL SUBPATTERNS
1824 It is possible to cause the matching process to obey a sub-
1825 pattern conditionally or to choose between two alternative
1826 subpatterns, depending on the result of an assertion, or
1827 whether a previous capturing subpattern matched or not. The
1828 two possible forms of conditional subpattern are
1829
1830 (?(condition)yes-pattern)
1831 (?(condition)yes-pattern|no-pattern)
1832
1833 If the condition is satisfied, the yes-pattern is used; oth-
1834 erwise the no-pattern (if present) is used. If there are
1835 more than two alternatives in the subpattern, a compile-time
1836 error occurs.
1837
1838 There are two kinds of condition. If the text between the
1839 parentheses consists of a sequence of digits, the condition
1840 is satisfied if the capturing subpattern of that number has
1841 previously matched. The number must be greater than zero.
1842 Consider the following pattern, which contains non-
1843 significant white space to make it more readable (assume the
1844 PCRE_EXTENDED option) and to divide it into three parts for
1845 ease of discussion:
1846
1847 ( \( )? [^()]+ (?(1) \) )
1848
1849 The first part matches an optional opening parenthesis, and
1850 if that character is present, sets it as the first captured
1851 substring. The second part matches one or more characters
1852 that are not parentheses. The third part is a conditional
1853 subpattern that tests whether the first set of parentheses
1854 matched or not. If they did, that is, if the subject started
1855 with an opening parenthesis, the condition is true, and so
1856 the yes-pattern is executed and a closing parenthesis is
1857 required. Otherwise, since no-pattern is not present, the
1858 subpattern matches nothing. In other words, this pattern
1859 matches a sequence of non-parentheses, optionally enclosed
1860 in parentheses.
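
The pattern above can be tried as it stands. This sketch assumes
that Python's re module, with re.VERBOSE standing in for
PCRE_EXTENDED, handles a numeric condition as PCRE does; the
subject strings are illustrative.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# fullmatch is used so that a stray parenthesis cannot be skipped over.
verdicts = {
    s: bool(optional_parens.fullmatch(s))
    for s in ("abc", "(abc)", "(abc", "abc)")
}

print(verdicts)
```

Only the first two subjects are accepted: a closing parenthesis is
required exactly when an opening one was captured.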
1861
1862 If the condition is not a sequence of digits, it must be an
1863 assertion. This may be a positive or negative lookahead or
1864 lookbehind assertion. Consider this pattern, again contain-
1865 ing non-significant white space, and with the two alterna-
1866 tives on the second line:
1867
1868 (?(?=[^a-z]*[a-z])
1869 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1870
1871 The condition is a positive lookahead assertion that matches
1872 an optional sequence of non-letters followed by a letter. In
1873 other words, it tests for the presence of at least one
1874 letter in the subject. If a letter is found, the subject is
1875 matched against the first alternative; otherwise it is
1876 matched against the second. This pattern matches strings in
1877 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1878 letters and dd are digits.
1879
1880
1881
1882 COMMENTS
1883 The sequence (?# marks the start of a comment which contin-
1884 ues up to the next closing parenthesis. Nested parentheses
1885 are not permitted. The characters that make up a comment
1886 play no part in the pattern matching at all.
1887
1888 If the PCRE_EXTENDED option is set, an unescaped # character
1889 outside a character class introduces a comment that contin-
1890 ues up to the next newline character in the pattern.
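
Both comment forms can be illustrated briefly. The sketch uses
Python's re module, where re.VERBOSE plays the part of
PCRE_EXTENDED, on the assumption that comments are handled alike.

```python
import re

# (?# ... ) is skipped wherever it appears.
inline = re.fullmatch(r"abc(?# this text is ignored)def", "abcdef")

# With extended syntax, an unescaped # outside a class starts a
# comment that runs to the end of the pattern line.
extended = re.compile(
    r"""
    \d{2}      # two digits
    -          # a literal hyphen
    [a-z]{3}   # three lower-case letters
    """,
    re.VERBOSE,
)
dated = extended.fullmatch("12-abc")

print(inline.group(), dated.group())
```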
1891
1892
1893
1894 RECURSIVE PATTERNS
1895 Consider the problem of matching a string in parentheses,
1896 allowing for unlimited nested parentheses. Without the use
1897 of recursion, the best that can be done is to use a pattern
1898 that matches up to some fixed depth of nesting. It is not
1899 possible to handle an arbitrary nesting depth. Perl 5.6 has
1900 provided an experimental facility that allows regular
1901 expressions to recurse (amongst other things). It does this
1902 by interpolating Perl code in the expression at run time,
1903 and the code can refer to the expression itself. A Perl pat-
1904 tern to solve the parentheses problem can be created like
1905 this:
1906
1907 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1908
1909 The (?p{...}) item interpolates Perl code at run time, and
1910 in this case refers recursively to the pattern in which it
1911 appears. Obviously, PCRE cannot support the interpolation of
1912 Perl code. Instead, the special item (?R) is provided for
1913 the specific case of recursion. This PCRE pattern solves the
1914 parentheses problem (assume the PCRE_EXTENDED option is set
1915 so that white space is ignored):
1916
1917 \( ( (?>[^()]+) | (?R) )* \)
1918
1919 First it matches an opening parenthesis. Then it matches any
1920 number of substrings which can either be a sequence of non-
1921 parentheses, or a recursive match of the pattern itself
1922 (i.e. a correctly parenthesized substring). Finally there is
1923 a closing parenthesis.
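
Python's re module has no (?R) item, so it cannot run the PCRE
pattern directly. As a rough illustration of what the recursion
does, the hand-written function below follows the same three
steps. match_parens is a hypothetical helper, not part of PCRE,
and it scans deterministically instead of backtracking.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced group starting at pos, or -1."""
    # Step 1: an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of non-parentheses or nested balanced groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            pos = match_parens(subject, pos)  # the recursive step, like (?R)
            if pos < 0:
                return -1
        else:
            pos += 1
    return -1  # ran out of subject before the group was closed


print(match_parens("(ab(cd)ef)"), match_parens("(aaa()"))
```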
1924
1925 This particular example pattern contains nested unlimited
1926 repeats, and so the use of a once-only subpattern for match-
1927 ing strings of non-parentheses is important when applying
1928 the pattern to strings that do not match. For example, when
1929 it is applied to
1930
1931 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1932
1933 it yields "no match" quickly. However, if a once-only sub-
1934 pattern is not used, the match runs for a very long time
1935 indeed because there are so many different ways the + and *
1936 repeats can carve up the subject, and all have to be tested
1937 before failure can be reported.
1938
1939 The values set for any capturing subpatterns are those from
1940 the outermost level of the recursion at which the subpattern
1941 value is set. If the pattern above is matched against
1942
1943 (ab(cd)ef)
1944
1945 the value for the capturing parentheses is "ef", which is
1946 the last value taken on at the top level. If additional
1947 parentheses are added, giving
1948
1949 \( ( ( (?>[^()]+) | (?R) )* ) \)
1950    ^                        ^
1951    ^                        ^
1952 the string they capture is "ab(cd)ef", the contents of the top
1953 level parentheses. If there are more than 15 capturing
1954 parentheses in a pattern, PCRE has to obtain extra memory to
1955 store data during a recursion, which it does by using
1956 pcre_malloc, freeing it via pcre_free afterwards. If no memory
1957 can be obtained, it saves data for the first 15 capturing
1958 parentheses only, as there is no way to give an out-of-memory
1959 error from within a recursion.
1960
1961
1962
1963 PERFORMANCE
1964 Certain items that may appear in patterns are more efficient
1965 than others. It is more efficient to use a character class
1966 like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1967 In general, the simplest construction that provides the
1968 required behaviour is usually the most efficient. Jeffrey
1969 Friedl's book contains a lot of discussion about optimizing
1970 regular expressions for efficient performance.
1971
1972 When a pattern begins with .* and the PCRE_DOTALL option is
1973 set, the pattern is implicitly anchored by PCRE, since it
1974 can match only at the start of a subject string. However, if
1975 PCRE_DOTALL is not set, PCRE cannot make this optimization,
1976 because the . metacharacter does not then match a newline,
1977 and if the subject string contains newlines, the pattern may
1978 match from the character immediately following one of them
1979 instead of from the very start. For example, the pattern
1980
1981 (.*) second
1982
1983 matches the subject "first\nand second" (where \n stands for
1984 a newline character) with the first captured substring being
1985 "and". In order to do this, PCRE has to retry the match
1986 starting after every newline in the subject.
1987
1988 If you are using such a pattern with subject strings that do
1989 not contain newlines, the best performance is obtained by
1990 setting PCRE_DOTALL, or starting the pattern with ^.* to
1991 indicate explicit anchoring. That saves PCRE from having to
1992 scan along the subject looking for a newline to restart at.
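
The effect of the dot-matches-newline setting on where such a
pattern can match is easy to observe. This sketch uses Python's re
module, with re.DOTALL standing in for PCRE_DOTALL. It shows only
the matching behaviour and makes no claim that Python applies the
same anchoring optimization.

```python
import re

subject = "first\nand second"

# Without DOTALL, . stops at the newline, so the match has to be
# retried from the character after it.
without_dotall = re.search(r"(.*) second", subject).group(1)

# With DOTALL, .* can cross the newline, so the match starts at offset 0.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL).group(1)

print(repr(without_dotall), repr(with_dotall))
```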
1993
1994 Beware of patterns that contain nested indefinite repeats.
1995 These can take a long time to run when applied to a string
1996 that does not match. Consider the pattern fragment
1997
1998 (a+)*
1999
2000 This can match "aaaa" in 16 different ways, and this number
2001 increases very rapidly as the string gets longer. (The *
2002 repeat can match 0, 1, 2, 3, or 4 times, and for each of
2003 those cases other than 0 or 4, the + repeats can match different
2004 numbers of times.) When the remainder of the pattern is such
2005 that the entire match is going to fail, PCRE has in princi-
2006 ple to try every possible variation, and this can take an
2007 extremely long time.
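
One way to make "different ways" concrete is to count every
sequence of (a+) iterations that covers some leading part of the
subject, including the empty sequence for zero iterations. That
counting convention is an assumption of this sketch, and
count_ways is a hypothetical helper. Under it the total doubles
with each extra character.

```python
def count_ways(n):
    """Count sequences of non-empty runs that cover at most n characters."""
    # exact[k]: number of sequences that consume exactly k characters.
    # exact[0] = 1 accounts for zero iterations of the group.
    exact = [1] + [0] * n
    for k in range(1, n + 1):
        # The first iteration takes `first` characters and the
        # remaining iterations consume the other k - first.
        exact[k] = sum(exact[k - first] for first in range(1, k + 1))
    return sum(exact)


for length in (4, 5, 20):
    print(length, count_ways(length))
```

For a subject of 20 characters the count already exceeds a million,
which is why a failing match can take so long.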
2008
2009 An optimization catches some of the more simple cases such
2010 as
2011
2012 (a+)*b
2013
2014 where a literal character follows. Before embarking on the
2015 standard matching procedure, PCRE checks that there is a "b"
2016 later in the subject string, and if there is not, it fails
2017 the match immediately. However, when there is no following
2018 literal this optimization cannot be used. You can see the
2019 difference by comparing the behaviour of
2020
2021 (a+)*\d
2022
2023 with the pattern above. (a+)*b fails almost instantly when
2024 applied to a whole line of "a" characters, whereas (a+)*\d
2025 takes an appreciable time with strings longer than about 20
2026 characters.
2027
2028
2029
2030 UTF-8 SUPPORT
2031 Starting at release 3.3, PCRE has some support for character
2032 strings encoded in the UTF-8 format. This is incomplete, and
2033 is regarded as experimental. In order to use it, you must
2034 configure PCRE to include UTF-8 support in the code, and, in
2035 addition, you must call pcre_compile() with the PCRE_UTF8
2036 option flag. When you do this, both the pattern and any sub-
2037 ject strings that are matched against it are treated as
2038 UTF-8 strings instead of just strings of bytes, but only in
2039 the cases that are mentioned below.
2040
2041 If you compile PCRE with UTF-8 support, but do not use it at
2042 run time, the library will be a bit bigger, but the addi-
2043 tional run time overhead is limited to testing the PCRE_UTF8
2044 flag in several places, so should not be very large.
2045
2046 PCRE assumes that the strings it is given contain valid
2047 UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
2048 you pass invalid UTF-8 strings to PCRE, the results are
2049 undefined.
2050
2051 Running with PCRE_UTF8 set causes these changes in the way
2052 PCRE works:
2053
2054 1. In a pattern, the escape sequence \x{...}, where the con-
2055 tents of the braces is a string of hexadecimal digits, is
2056 interpreted as a UTF-8 character whose code number is the
2057 given hexadecimal number, for example: \x{1234}. This
2058 inserts from one to six literal bytes into the pattern,
2059 using the UTF-8 encoding. If a non-hexadecimal digit appears
2060 between the braces, the item is not recognized.
2061
2062 2. The original hexadecimal escape sequence, \xhh, generates
2063 a two-byte UTF-8 character if its value is greater than 127.
2064
2065 3. Repeat quantifiers are NOT correctly handled if they fol-
2066 low a multibyte character. For example, \x{100}* and \xc3+
2067 do not work. If you want to repeat such characters, you must
2068 enclose them in non-capturing parentheses, for example
2069 (?:\x{100}), at present.
2070
2071 4. The dot metacharacter matches one UTF-8 character instead
2072 of a single byte.
2073
2074 5. Unlike literal UTF-8 characters, the dot metacharacter
2075 followed by a repeat quantifier does operate correctly on
2076 UTF-8 characters instead of single bytes.
2077
2078 6. Although the \x{...} escape is permitted in a character
2079 class, characters whose values are greater than 255 cannot
2080 be included in a class.
2081
2082 7. A class is matched against a UTF-8 character instead of
2083 just a single byte, but it can match only characters whose
2084 values are less than 256. Characters with greater values
2085 always fail to match a class.
2086
2087 8. Repeated classes work correctly on multiple characters.
2088
2089 9. Classes containing just a single character whose value is
2090 greater than 127 (but less than 256), for example, [\x80] or
2091 [^\x{93}], do not work because these are optimized into
2092 single byte matches. In the first case, of course, the class
2093 brackets are just redundant.
2094
2095 10. Lookbehind assertions move backwards in the subject by a
2096 fixed number of characters instead of a fixed number of
2097 bytes. Simple cases have been tested to work correctly, but
2098 there may be hidden gotchas herein.
2099
2100 11. The character types such as \d and \w do not work
2101 correctly with UTF-8 characters. They continue to test a
2102 single byte.
2103
2104 12. Anything not explicitly mentioned here continues to work
2105 in bytes rather than in characters.
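
The byte counts behind the first two items in the list above can
be checked with any UTF-8 encoder. The sketch below uses Python's
built-in encoder purely as an illustration and does not call PCRE.
Python follows the current UTF-8 definition, which stops at four
bytes per character, whereas the text above allows up to six.

```python
# Code points chosen to straddle the one-, two- and three-byte ranges.
lengths = {}
for code_point in (0x7F, 0xC3, 0x100, 0x1234):
    encoded = chr(code_point).encode("utf-8")
    lengths[code_point] = len(encoded)
    print(hex(code_point), encoded.hex(), len(encoded))
```

Values up to 127 stay as one byte, while a value such as 0xC3,
which fits in a single byte outside UTF-8 mode, becomes two.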
2106
2107 The following UTF-8 features of Perl 5.6 are not imple-
2108 mented:
2109 1. The escape sequence \C to match a single byte.
2110
2111 2. The use of Unicode tables and properties and escapes \p,
2112 \P, and \X.
2113
2114
2115
2116 AUTHOR
2117 Philip Hazel <ph10@cam.ac.uk>
2118 University Computing Service,
2119 New Museums Site,
2120 Cambridge CB2 3QG, England.
2121 Phone: +44 1223 334714
2122
2123 Last updated: 28 August 2000,
2124 the 250th anniversary of the death of J.S. Bach.
2125 Copyright (c) 1997-2000 University of Cambridge.
