Contents of /code/trunk/doc/html/pcrepattern.html

Revision 73
Sat Feb 24 21:40:30 2007 UTC (7 years, 5 months ago) by nigel
File MIME type: text/html
File size: 60320 byte(s)
Load pcre-4.5 into code/trunk.

1 <html>
2 <head>
3 <title>pcrepattern specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 This HTML document has been generated automatically from the original man page.
7 If there is any nonsense in it, please consult the man page, in case the
8 conversion went wrong.<br>
9 <ul>
10 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
11 <li><a name="TOC2" href="#SEC2">BACKSLASH</a>
12 <li><a name="TOC3" href="#SEC3">CIRCUMFLEX AND DOLLAR</a>
13 <li><a name="TOC4" href="#SEC4">FULL STOP (PERIOD, DOT)</a>
14 <li><a name="TOC5" href="#SEC5">MATCHING A SINGLE BYTE</a>
15 <li><a name="TOC6" href="#SEC6">SQUARE BRACKETS</a>
16 <li><a name="TOC7" href="#SEC7">POSIX CHARACTER CLASSES</a>
17 <li><a name="TOC8" href="#SEC8">VERTICAL BAR</a>
18 <li><a name="TOC9" href="#SEC9">INTERNAL OPTION SETTING</a>
19 <li><a name="TOC10" href="#SEC10">SUBPATTERNS</a>
20 <li><a name="TOC11" href="#SEC11">NAMED SUBPATTERNS</a>
21 <li><a name="TOC12" href="#SEC12">REPETITION</a>
22 <li><a name="TOC13" href="#SEC13">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
23 <li><a name="TOC14" href="#SEC14">BACK REFERENCES</a>
24 <li><a name="TOC15" href="#SEC15">ASSERTIONS</a>
25 <li><a name="TOC16" href="#SEC16">CONDITIONAL SUBPATTERNS</a>
26 <li><a name="TOC17" href="#SEC17">COMMENTS</a>
27 <li><a name="TOC18" href="#SEC18">RECURSIVE PATTERNS</a>
28 <li><a name="TOC19" href="#SEC19">SUBPATTERNS AS SUBROUTINES</a>
29 <li><a name="TOC20" href="#SEC20">CALLOUTS</a>
30 </ul>
31 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
32 <P>
33 The syntax and semantics of the regular expressions supported by PCRE are
34 described below. Regular expressions are also described in the Perl
35 documentation and in a number of other books, some of which have copious
36 examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
37 O'Reilly, covers them in great detail. The description here is intended as
38 reference documentation.
39 </P>
40 <P>
41 The basic operation of PCRE is on strings of bytes. However, there is also
42 support for UTF-8 character strings. To use this support you must build PCRE to
43 include UTF-8 support, and then call <b>pcre_compile()</b> with the PCRE_UTF8
44 option. How this affects the pattern matching is mentioned in several places
45 below. There is also a summary of UTF-8 features in the
46 <a href="pcre.html#utf8support">section on UTF-8 support</a>
47 in the main
48 <a href="pcre.html"><b>pcre</b></a>
49 page.
50 </P>
51 <P>
52 A regular expression is a pattern that is matched against a subject string from
53 left to right. Most characters stand for themselves in a pattern, and match the
54 corresponding characters in the subject. As a trivial example, the pattern
55 </P>
56 <P>
57 <pre>
58 The quick brown fox
59 </PRE>
60 </P>
61 <P>
62 matches a portion of a subject string that is identical to itself. The power of
63 regular expressions comes from the ability to include alternatives and
64 repetitions in the pattern. These are encoded in the pattern by the use of
65 <i>meta-characters</i>, which do not stand for themselves but instead are
66 interpreted in some special way.
67 </P>
68 <P>
69 There are two different sets of meta-characters: those that are recognized
70 anywhere in the pattern except within square brackets, and those that are
71 recognized in square brackets. Outside square brackets, the meta-characters are
72 as follows:
73 </P>
74 <P>
75 <pre>
76   \      general escape character with several uses
77   ^      assert start of string (or line, in multiline mode)
78   $      assert end of string (or line, in multiline mode)
79   .      match any character except newline (by default)
80   [      start character class definition
81   |      start of alternative branch
82   (      start subpattern
83   )      end subpattern
84   ?      extends the meaning of (
85          also 0 or 1 quantifier
86          also quantifier minimizer
87   *      0 or more quantifier
88   +      1 or more quantifier
89          also "possessive quantifier"
90   {      start min/max quantifier
91 </PRE>
92 </P>
93 <P>
94 Part of a pattern that is in square brackets is called a "character class". In
95 a character class the only meta-characters are:
96 </P>
97 <P>
98 <pre>
99   \      general escape character
100   ^      negate the class, but only if the first character
101   -      indicates character range
102   [      POSIX character class (only if followed by POSIX
103            syntax)
104   ]      terminates the character class
105 </PRE>
106 </P>
107 <P>
108 The following sections describe the use of each of the meta-characters.
109 </P>
110 <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>
111 <P>
112 The backslash character has several uses. Firstly, if it is followed by a
113 non-alphanumeric character, it takes away any special meaning that character may
114 have. This use of backslash as an escape character applies both inside and
115 outside character classes.
116 </P>
117 <P>
118 For example, if you want to match a * character, you write \* in the pattern.
119 This escaping action applies whether or not the following character would
120 otherwise be interpreted as a meta-character, so it is always safe to precede a
121 non-alphanumeric character with a backslash to specify that it stands for itself. In
122 particular, if you want to match a backslash, you write \\.
123 </P>
124 <P>
125 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
126 pattern (other than in a character class) and characters between a # outside
127 a character class and the next newline character are ignored. An escaping
128 backslash can be used to include a whitespace or # character as part of the
129 pattern.
130 </P>
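<P>
The effect of ignoring whitespace and # comments can be sketched as follows.
This is only an illustration: it assumes Python's <b>re</b> module and its
re.VERBOSE flag as a stand-in for PCRE_EXTENDED, and the "version" pattern is
a made-up example, not something from PCRE itself.
</P>

```python
import re

# Assumption: Python's re.VERBOSE plays the role of PCRE_EXTENDED here.
version = re.compile(r"""
    (\d+)      # major number
    \.         # an escaped dot stands for itself
    (\d+)      # minor number
    \ beta     # the escaped space is part of the pattern
""", re.VERBOSE)

groups = version.fullmatch("4.5 beta").groups()
print(groups)
```

<P>
Without the backslash before the space, the pattern would require "4.5beta",
because the unescaped space would be discarded.
</P>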
131 <P>
132 If you want to remove the special meaning from a sequence of characters, you
133 can do so by putting them between \Q and \E. This is different from Perl in
134 that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
135 Perl, $ and @ cause variable interpolation. Note the following examples:
136 </P>
137 <P>
138 <pre>
139   Pattern            PCRE matches   Perl matches
140 </PRE>
141 </P>
142 <P>
143 <pre>
144   \Qabc$xyz\E        abc$xyz        abc followed by the
145                                       contents of $xyz
146   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
147   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
148 </PRE>
149 </P>
150 <P>
151 The \Q...\E sequence is recognized both inside and outside character classes.
152 </P>
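<P>
Python's <b>re</b> module does not recognize \Q...\E, so the closest analogue
there is to quote the text before compiling it. The sketch below assumes
Python and uses re.escape() purely to illustrate why literal quoting matters;
it is not how PCRE implements \Q...\E.
</P>

```python
import re

literal = "abc$xyz"

# Quoted: every potentially special character is backslash-escaped first.
quoted = re.compile(re.escape(literal))
quoted_matches = quoted.fullmatch("abc$xyz") is not None

# Unquoted: "$" is an end-of-subject assertion, so "xyz" can never follow it.
unquoted_matches = re.fullmatch(literal, "abc$xyz") is not None

print(quoted_matches, unquoted_matches)
```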
153 <P>
154 A second use of backslash provides a way of encoding non-printing characters
155 in patterns in a visible manner. There is no restriction on the appearance of
156 non-printing characters, apart from the binary zero that terminates a pattern,
157 but when a pattern is being prepared by text editing, it is usually easier to
158 use one of the following escape sequences than the binary character it
159 represents:
160 </P>
161 <P>
162 <pre>
163   \a         alarm, that is, the BEL character (hex 07)
164   \cx        "control-x", where x is any character
165   \e         escape (hex 1B)
166   \f         formfeed (hex 0C)
167   \n         newline (hex 0A)
168   \r         carriage return (hex 0D)
169   \t         tab (hex 09)
170   \ddd       character with octal code ddd, or backreference
171   \xhh       character with hex code hh
172   \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
173 </PRE>
174 </P>
175 <P>
176 The precise effect of \cx is as follows: if x is a lower case letter, it
177 is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
178 Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
179 7B.
180 </P>
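<P>
The \cx rule is plain arithmetic, so it can be checked with a few lines of
code. The following Python sketch follows the description above; the function
name control_code is hypothetical and is not part of PCRE.
</P>

```python
def control_code(x: str) -> int:
    """Return the character code that \\cx stands for, per the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40


for ch in "z{;":
    print(ch, hex(control_code(ch)))
```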
181 <P>
182 After \x, from zero to two hexadecimal digits are read (letters can be in
183 upper or lower case). In UTF-8 mode, any number of hexadecimal digits may
184 appear between \x{ and }, but the value of the character code must be less
185 than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters
186 other than hexadecimal digits appear between \x{ and }, or if there is no
187 terminating }, this form of escape is not recognized. Instead, the initial
188 \x will be interpreted as a basic hexadecimal escape, with no following
189 digits, giving a byte whose value is zero.
190 </P>
191 <P>
192 Characters whose value is less than 256 can be defined by either of the two
193 syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the
194 way they are handled. For example, \xdc is exactly the same as \x{dc}.
195 </P>
196 <P>
197 After \0 up to two further octal digits are read. For both \x and \0, if there
198 are fewer than two digits, just those that are present are used. Thus the
199 sequence \0\x\07 specifies two binary zeros followed by a BEL character
200 (code value 7). Make sure you supply two digits after the initial zero if the
201 character that follows is itself an octal digit.
202 </P>
203 <P>
204 The handling of a backslash followed by a digit other than 0 is complicated.
205 Outside a character class, PCRE reads it and any following digits as a decimal
206 number. If the number is less than 10, or if there have been at least that many
207 previous capturing left parentheses in the expression, the entire sequence is
208 taken as a <i>back reference</i>. A description of how this works is given
209 later, following the discussion of parenthesized subpatterns.
210 </P>
211 <P>
212 Inside a character class, or if the decimal number is greater than 9 and there
213 have not been that many capturing subpatterns, PCRE re-reads up to three octal
214 digits following the backslash, and generates a single byte from the least
215 significant 8 bits of the value. Any subsequent digits stand for themselves.
216 For example:
217 </P>
218 <P>
219 <pre>
220   \040   is another way of writing a space
221   \40    is the same, provided there are fewer than 40
222             previous capturing subpatterns
223   \7     is always a back reference
224   \11    might be a back reference, or another way of
225             writing a tab
226   \011   is always a tab
227   \0113  is a tab followed by the character "3"
228   \113   might be a back reference, otherwise the
229             character with octal code 113
230   \377   might be a back reference, otherwise
231             the byte consisting entirely of 1 bits
232   \81    is either a back reference, or a binary zero
233             followed by the two characters "8" and "1"
234 </PRE>
235 </P>
236 <P>
237 Note that octal values of 100 or greater must not be introduced by a leading
238 zero, because no more than three octal digits are ever read.
239 </P>
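<P>
The decision between a back reference and an octal byte can be sketched as a
small function. This Python version only restates the rules described above;
the name interpret_digit_escape and its return format are assumptions made
for illustration and do not reflect PCRE's internal code.
</P>

```python
OCTAL_DIGITS = "01234567"


def interpret_digit_escape(digits, groups_so_far, in_class=False):
    """Classify the decimal digits that follow a backslash.

    Returns ("backref", number, "") or ("byte", value, literal_rest).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        # Below 10, or no larger than the number of groups opened so far.
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    taken = 0
    while taken < min(3, len(digits)) and digits[taken] in OCTAL_DIGITS:
        taken += 1
    value = int(digits[:taken], 8) & 0xFF if taken else 0
    # Any digits not consumed stand for themselves.
    return ("byte", value, digits[taken:])


print(interpret_digit_escape("0113", 0))
print(interpret_digit_escape("81", 0))
```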
240 <P>
241 All the sequences that define a single byte value or a single UTF-8 character
242 (in UTF-8 mode) can be used both inside and outside character classes. In
243 addition, inside a character class, the sequence \b is interpreted as the
244 backspace character (hex 08). Outside a character class it has a different
245 meaning (see below).
246 </P>
247 <P>
248 The third use of backslash is for specifying generic character types:
249 </P>
250 <P>
251 <pre>
252   \d     any decimal digit
253   \D     any character that is not a decimal digit
254   \s     any whitespace character
255   \S     any character that is not a whitespace character
256   \w     any "word" character
257   \W     any "non-word" character
258 </PRE>
259 </P>
260 <P>
261 Each pair of escape sequences partitions the complete set of characters into
262 two disjoint sets. Any given character matches one, and only one, of each pair.
263 </P>
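<P>
The partition property can be checked mechanically. The sketch below assumes
Python's <b>re</b> module with the re.ASCII flag as a stand-in for PCRE's
default tables, and tests only \d/\D and \w/\W, because Python's \s differs
from the definition given here.
</P>

```python
import re

PAIRS = [(r"\d", r"\D"), (r"\w", r"\W")]


def exactly_one(lower, upper, ch):
    """True if ch matches one, and only one, of the two escapes."""
    hit_lower = re.fullmatch(lower, ch, re.ASCII) is not None
    hit_upper = re.fullmatch(upper, ch, re.ASCII) is not None
    return hit_lower != hit_upper


# Check every character code from 0 to 255 against each pair.
all_partitioned = all(
    exactly_one(lower, upper, chr(code))
    for lower, upper in PAIRS
    for code in range(256)
)
print(all_partitioned)
```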
264 <P>
265 In UTF-8 mode, characters with values greater than 255 never match \d, \s, or
266 \w, and always match \D, \S, and \W.
267 </P>
268 <P>
269 For compatibility with Perl, \s does not match the VT character (code 11).
270 This makes it different from the POSIX "space" class. The \s characters
271 are HT (9), LF (10), FF (12), CR (13), and space (32).
272 </P>
273 <P>
274 A "word" character is any letter or digit or the underscore character, that is,
275 any character which can be part of a Perl "word". The definition of letters and
276 digits is controlled by PCRE's character tables, and may vary if locale-
277 specific matching is taking place (see
278 <a href="pcreapi.html#localesupport">"Locale support"</a>
279 in the
280 <a href="pcreapi.html"><b>pcreapi</b></a>
281 page). For example, in the "fr" (French) locale, some character codes greater
282 than 128 are used for accented letters, and these are matched by \w.
283 </P>
284 <P>
285 These character type sequences can appear both inside and outside character
286 classes. They each match one character of the appropriate type. If the current
287 matching point is at the end of the subject string, all of them fail, since
288 there is no character to match.
289 </P>
290 <P>
291 The fourth use of backslash is for certain simple assertions. An assertion
292 specifies a condition that has to be met at a particular point in a match,
293 without consuming any characters from the subject string. The use of
294 subpatterns for more complicated assertions is described below. The backslashed
295 assertions are
296 </P>
297 <P>
298 <pre>
299   \b     matches at a word boundary
300   \B     matches when not at a word boundary
301   \A     matches at start of subject
302   \Z     matches at end of subject or before newline at end
303   \z     matches at end of subject
304   \G     matches at first matching position in subject
305 </PRE>
306 </P>
307 <P>
308 These assertions may not appear in character classes (but note that \b has a
309 different meaning, namely the backspace character, inside a character class).
310 </P>
311 <P>
312 A word boundary is a position in the subject string where the current character
313 and the previous character do not both match \w or \W (i.e. one matches
314 \w and the other matches \W), or the start or end of the string if the
315 first or last character matches \w, respectively.
316 </P>
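<P>
As an illustration of the word boundary rule, the following sketch uses
Python's <b>re</b> module (an assumption; it is a different engine, but \b
follows the same definition when re.ASCII is set) to find "cat" only where it
is delimited by boundaries.
</P>

```python
import re

subject = "cat concat cat's"

# "cat" inside "concat" is preceded by the word character "n", so the
# position before it is not a boundary and that occurrence is skipped.
starts = [m.start() for m in re.finditer(r"\bcat\b", subject, re.ASCII)]
print(starts)
```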
317 <P>
318 The \A, \Z, and \z assertions differ from the traditional circumflex and
319 dollar (described below) in that they only ever match at the very start and end
320 of the subject string, whatever options are set. Thus, they are independent of
321 multiline mode.
322 </P>
323 <P>
324 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
325 <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero, indicating that
326 matching is to start at a point other than the beginning of the subject, \A
327 can never match. The difference between \Z and \z is that \Z matches before
328 a newline that is the last character of the string as well as at the end of the
329 string, whereas \z matches only at the end.
330 </P>
331 <P>
332 The \G assertion is true only when the current matching position is at the
333 start point of the match, as specified by the <i>startoffset</i> argument of
334 <b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
335 non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
336 arguments, you can mimic Perl's /g option, and it is in this kind of
337 implementation where \G can be useful.
338 </P>
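<P>
The repeated-call technique can be sketched in Python. Python's <b>re</b>
module has no \G, so this example assumes Pattern.match(subject, pos), which
succeeds only at the given position, as a rough equivalent of starting every
alternative with \G and passing <i>startoffset</i>. The pattern and the
function name scan_contiguous are made up for the illustration.
</P>

```python
import re

item = re.compile(r"\d+,?")


def scan_contiguous(subject, start=0):
    """Collect matches that each begin exactly where the previous one ended."""
    found = []
    pos = start
    while pos < len(subject):
        m = item.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break                      # a gap ends the scan
        found.append(m.group())
        pos = m.end()
    return found


items = scan_contiguous("12,34,56 78,9")
print(items)
```

<P>
The scan stops at the space, because no match can begin at that position; an
unanchored search would have skipped ahead to "78,".
</P>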
339 <P>
340 Note, however, that PCRE's interpretation of \G, as the start of the current
341 match, is subtly different from Perl's, which defines it as the end of the
342 previous match. In Perl, these can be different when the previously matched
343 string was empty. Because PCRE does just one match at a time, it cannot
344 reproduce this behaviour.
345 </P>
346 <P>
347 If all the alternatives of a pattern begin with \G, the expression is anchored
348 to the starting match position, and the "anchored" flag is set in the compiled
349 regular expression.
350 </P>
351 <br><a name="SEC3" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
352 <P>
353 Outside a character class, in the default matching mode, the circumflex
354 character is an assertion which is true only if the current matching point is
355 at the start of the subject string. If the <i>startoffset</i> argument of
356 <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
357 option is unset. Inside a character class, circumflex has an entirely different
358 meaning (see below).
359 </P>
360 <P>
361 Circumflex need not be the first character of the pattern if a number of
362 alternatives are involved, but it should be the first thing in each alternative
363 in which it appears if the pattern is ever to match that branch. If all
364 possible alternatives start with a circumflex, that is, if the pattern is
365 constrained to match only at the start of the subject, it is said to be an
366 "anchored" pattern. (There are also other constructs that can cause a pattern
367 to be anchored.)
368 </P>
369 <P>
370 A dollar character is an assertion which is true only if the current matching
371 point is at the end of the subject string, or immediately before a newline
372 character that is the last character in the string (by default). Dollar need
373 not be the last character of the pattern if a number of alternatives are
374 involved, but it should be the last item in any branch in which it appears.
375 Dollar has no special meaning in a character class.
376 </P>
377 <P>
378 The meaning of dollar can be changed so that it matches only at the very end of
379 the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
380 does not affect the \Z assertion.
381 </P>
382 <P>
383 The meanings of the circumflex and dollar characters are changed if the
384 PCRE_MULTILINE option is set. When this is the case, they match immediately
385 after and immediately before an internal newline character, respectively, in
386 addition to matching at the start and end of the subject string. For example,
387 the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode,
388 but not otherwise. Consequently, patterns that are anchored in single line mode
389 because all branches start with ^ are not anchored in multiline mode, and a
390 match for circumflex is possible when the <i>startoffset</i> argument of
391 <b>pcre_exec()</b> is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
392 PCRE_MULTILINE is set.
393 </P>
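<P>
The /^abc$/ example can be reproduced with a short sketch. It assumes Python's
<b>re</b> module, whose re.MULTILINE flag behaves like PCRE_MULTILINE for this
pattern; it is an illustration rather than a call into PCRE.
</P>

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"^abc$", "abc\n")

print(single_line, multi_line.span(), before_final_newline.span())
```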
394 <P>
395 Note that the sequences \A, \Z, and \z can be used to match the start and
396 end of the subject in both modes, and if all branches of a pattern start with
397 \A it is always anchored, whether PCRE_MULTILINE is set or not.
398 </P>
399 <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
400 <P>
401 Outside a character class, a dot in the pattern matches any one character in
402 the subject, including a non-printing character, but not (by default) newline.
403 In UTF-8 mode, a dot matches any UTF-8 character, which might be more than one
404 byte long, except (by default) for newline. If the PCRE_DOTALL option is set,
405 dots match newlines as well. The handling of dot is entirely independent of the
406 handling of circumflex and dollar, the only relationship being that they both
407 involve newline characters. Dot has no special meaning in a character class.
408 </P>
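<P>
A minimal sketch of the dot's treatment of newline, assuming Python's
<b>re</b> module, where re.DOTALL corresponds to PCRE_DOTALL:
</P>

```python
import re

default_dot = re.fullmatch(r"a.c", "a\nc")
dotall_dot = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# A non-printing character other than newline is matched in either mode.
nonprinting = re.fullmatch(r"a.c", "a\x07c")

print(default_dot, dotall_dot is not None, nonprinting is not None)
```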
409 <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
410 <P>
411 Outside a character class, the escape sequence \C matches any one byte, both
412 in and out of UTF-8 mode. Unlike a dot, it always matches a newline. The
413 feature is provided in Perl in order to match individual bytes in UTF-8 mode.
414 Because it breaks up UTF-8 characters into individual bytes, what remains in
415 the string may be a malformed UTF-8 string. For this reason it is best avoided.
416 </P>
417 <P>
418 PCRE does not allow \C to appear in lookbehind assertions (see below), because
419 in UTF-8 mode it makes it impossible to calculate the length of the lookbehind.
420 </P>
421 <br><a name="SEC6" href="#TOC1">SQUARE BRACKETS</a><br>
422 <P>
423 An opening square bracket introduces a character class, terminated by a closing
424 square bracket. A closing square bracket on its own is not special. If a
425 closing square bracket is required as a member of the class, it should be the
426 first data character in the class (after an initial circumflex, if present) or
427 escaped with a backslash.
428 </P>
429 <P>
430 A character class matches a single character in the subject. In UTF-8 mode, the
431 character may occupy more than one byte. A matched character must be in the set
432 of characters defined by the class, unless the first character in the class
433 definition is a circumflex, in which case the subject character must not be in
434 the set defined by the class. If a circumflex is actually required as a member
435 of the class, ensure it is not the first character, or escape it with a
436 backslash.
437 </P>
438 <P>
439 For example, the character class [aeiou] matches any lower case vowel, while
440 [^aeiou] matches any character that is not a lower case vowel. Note that a
441 circumflex is just a convenient notation for specifying the characters which
442 are in the class by enumerating those that are not. It is not an assertion: it
443 still consumes a character from the subject string, and fails if the current
444 pointer is at the end of the string.
445 </P>
446 <P>
447 In UTF-8 mode, characters with values greater than 255 can be included in a
448 class as a literal string of bytes, or by using the \x{ escaping mechanism.
449 </P>
450 <P>
451 When caseless matching is set, any letters in a class represent both their
452 upper case and lower case versions, so for example, a caseless [aeiou] matches
453 "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
454 caseful version would. PCRE does not support the concept of case for characters
455 with values greater than 255.
456 </P>
457 <P>
458 The newline character is never treated in any special way in character classes,
459 whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class
460 such as [^a] will always match a newline.
461 </P>
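<P>
Two properties of negated classes described above, namely that they consume a
character and that newline is not special, can be illustrated with a short
sketch. It assumes Python's <b>re</b> module, which behaves the same way for
these patterns.
</P>

```python
import re

# At the end of the subject there is no character left to consume.
at_end = re.search(r"a[^b]", "a")

# A newline is just another character that is not "b".
with_newline = re.search(r"a[^b]", "a\n")

print(at_end, repr(with_newline.group()))
```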
462 <P>
463 The minus (hyphen) character can be used to specify a range of characters in a
464 character class. For example, [d-m] matches any letter between d and m,
465 inclusive. If a minus character is required in a class, it must be escaped with
466 a backslash or appear in a position where it cannot be interpreted as
467 indicating a range, typically as the first or last character in the class.
468 </P>
469 <P>
470 It is not possible to have the literal character "]" as the end character of a
471 range. A pattern such as [W-]46] is interpreted as a class of two characters
472 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
473 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
474 the end of range, so [W-\]46] is interpreted as a single class containing a
475 range followed by two separate characters. The octal or hexadecimal
476 representation of "]" can also be used to end a range.
477 </P>
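<P>
The two readings of "]" after a hyphen can be compared directly. The sketch
below assumes Python's <b>re</b> module, which parses these two classes in the
way described above.
</P>

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")

# Single class: the range W to ], plus the characters "4" and "6".
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("-46]")), bool(range_class.fullmatch("X")))
```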
478 <P>
479 Ranges operate in the collating sequence of character values. They can also be
480 used for characters specified numerically, for example [\000-\037]. In UTF-8
481 mode, ranges can include characters whose values are greater than 255, for
482 example [\x{100}-\x{2ff}].
483 </P>
484 <P>
485 If a range that includes letters is used when caseless matching is set, it
486 matches the letters in either case. For example, [W-c] is equivalent to
487 [][\^_`wxyzabc], matched caselessly, and if character tables for the "fr"
488 locale are in use, [\xc8-\xcb] matches accented E characters in both cases.
489 </P>
490 <P>
491 The character types \d, \D, \s, \S, \w, and \W may also appear in a
492 character class, and add the characters that they match to the class. For
493 example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
494 conveniently be used with the upper case character types to specify a more
495 restricted set of characters than the matching lower case type. For example,
496 the class [^\W_] matches any letter or digit, but not underscore.
497 </P>
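<P>
Both class examples can be tried out with a short sketch, assuming Python's
<b>re</b> module with re.ASCII so that \d and \W use ASCII definitions. Note
that [\dABCDEF] as written covers upper case hexadecimal digits only.
</P>

```python
import re

hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)
alnum_only = re.compile(r"[^\W_]", re.ASCII)

print(bool(hex_upper.fullmatch("B")), bool(alnum_only.fullmatch("_")))
```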
498 <P>
499 All non-alphanumeric characters other than \, -, ^ (at the start) and the
500 terminating ] are non-special in character classes, but it does no harm if they
501 are escaped.
502 </P>
503 <br><a name="SEC7" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
504 <P>
505 Perl supports the POSIX notation for character classes, which uses names
506 enclosed by [: and :] within the enclosing square brackets. PCRE also supports
507 this notation. For example,
508 </P>
509 <P>
510 <pre>
511 [01[:alpha:]%]
512 </PRE>
513 </P>
514 <P>
515 matches "0", "1", any alphabetic character, or "%". The supported class names
516 are
517 </P>
518 <P>
519 <pre>
520   alnum    letters and digits
521   alpha    letters
522   ascii    character codes 0 - 127
523   blank    space or tab only
524   cntrl    control characters
525   digit    decimal digits (same as \d)
526   graph    printing characters, excluding space
527   lower    lower case letters
528   print    printing characters, including space
529   punct    printing characters, excluding letters and digits
530   space    white space (not quite the same as \s)
531   upper    upper case letters
532   word     "word" characters (same as \w)
533   xdigit   hexadecimal digits
534 </PRE>
535 </P>
536 <P>
537 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
538 space (32). Notice that this list includes the VT character (code 11). This
539 makes "space" different to \s, which does not include VT (for Perl
540 compatibility).
541 </P>
542 <P>
543 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
544 5.8. Another Perl extension is negation, which is indicated by a ^ character
545 after the colon. For example,
546 </P>
547 <P>
548 <pre>
549 [12[:^digit:]]
550 </PRE>
551 </P>
552 <P>
553 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
554 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
555 supported, and an error is given if they are encountered.
556 </P>
557 <P>
558 In UTF-8 mode, characters with values greater than 255 do not match any of
559 the POSIX character classes.
560 </P>
561 <br><a name="SEC8" href="#TOC1">VERTICAL BAR</a><br>
562 <P>
563 Vertical bar characters are used to separate alternative patterns. For example,
564 the pattern
565 </P>
566 <P>
567 <pre>
568 gilbert|sullivan
569 </PRE>
570 </P>
571 <P>
572 matches either "gilbert" or "sullivan". Any number of alternatives may appear,
573 and an empty alternative is permitted (matching the empty string).
574 The matching process tries each alternative in turn, from left to right,
575 and the first one that succeeds is used. If the alternatives are within a
576 subpattern (defined below), "succeeds" means matching the rest of the main
577 pattern as well as the alternative in the subpattern.
578 </P>
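<P>
The left-to-right rule has a visible consequence when one alternative is a
prefix of another. The sketch below assumes Python's <b>re</b> module, which
tries alternatives in the same order; the pattern "gil|gilbert" is a made-up
variation on the example above.
</P>

```python
import re

# The first alternative that succeeds is used, even though a longer one exists.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so "gil" is rejected here because "!" does not follow it.
in_group = re.fullmatch(r"(gil|gilbert)!", "gilbert!").group(1)

print(first_wins, in_group)
```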
579 <br><a name="SEC9" href="#TOC1">INTERNAL OPTION SETTING</a><br>
580 <P>
581 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
582 PCRE_EXTENDED options can be changed from within the pattern by a sequence of
583 Perl option letters enclosed between "(?" and ")". The option letters are
584 </P>
585 <P>
586 <pre>
587   i  for PCRE_CASELESS
588   m  for PCRE_MULTILINE
589   s  for PCRE_DOTALL
590   x  for PCRE_EXTENDED
591 </PRE>
592 </P>
593 <P>
594 For example, (?im) sets caseless, multiline matching. It is also possible to
595 unset these options by preceding the letter with a hyphen, and a combined
596 setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
597 PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
598 permitted. If a letter appears both before and after the hyphen, the option is
599 unset.
600 </P>
601 <P>
602 When an option change occurs at top level (that is, not inside subpattern
603 parentheses), the change applies to the remainder of the pattern that follows.
604 If the change is placed right at the start of a pattern, PCRE extracts it into
605 the global options (and it will therefore show up in data extracted by the
606 <b>pcre_fullinfo()</b> function).
607 </P>
608 <P>
609 An option change within a subpattern affects only that part of the current
610 pattern that follows it, so
611 </P>
612 <P>
613 <pre>
614 (a(?i)b)c
615 </PRE>
616 </P>
617 <P>
618 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
619 By this means, options can be made to have different settings in different
620 parts of the pattern. Any changes made in one alternative do carry on
621 into subsequent branches within the same subpattern. For example,
622 </P>
623 <P>
624 <pre>
625 (a(?i)b|c)
626 </PRE>
627 </P>
628 <P>
629 matches "ab", "aB", "c", and "C", even though when matching "C" the first
630 branch is abandoned before the option setting. This is because the effects of
631 option settings happen at compile time. There would be some very weird
632 behaviour otherwise.
633 </P>
634 <P>
635 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
636 same way as the Perl-compatible options by using the characters U and X
637 respectively. The (?X) flag setting is special in that it must always occur
638 earlier in the pattern than any of the additional features it turns on, even
639 when it is at top level. It is best put at the start.
640 </P>
641 <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>
642 <P>
643 Subpatterns are delimited by parentheses (round brackets), which can be nested.
644 Marking part of a pattern as a subpattern does two things:
645 </P>
646 <P>
647 1. It localizes a set of alternatives. For example, the pattern
648 </P>
649 <P>
650 <pre>
651 cat(aract|erpillar|)
652 </PRE>
653 </P>
654 <P>
655 matches one of the words "cat", "cataract", or "caterpillar". Without the
656 parentheses, it would match "cataract", "erpillar" or the empty string.
657 </P>
658 <P>
659 2. It sets up the subpattern as a capturing subpattern.
660 When the whole pattern matches, that portion of the subject string that matched
661 the subpattern is passed back to the caller via the <i>ovector</i> argument of
662 <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
663 from 1) to obtain the numbers of the capturing subpatterns.
664 </P>
665 <P>
666 For example, if the string "the red king" is matched against the pattern
667 </P>
668 <P>
669 <pre>
670 the ((red|white) (king|queen))
671 </PRE>
672 </P>
673 <P>
674 the captured substrings are "red king", "red", and "king", and are numbered 1,
675 2, and 3, respectively.
676 </P>
677 <P>
678 The fact that plain parentheses fulfil two functions is not always helpful.
679 There are often times when a grouping subpattern is required without a
680 capturing requirement. If an opening parenthesis is followed by a question mark
681 and a colon, the subpattern does not do any capturing, and is not counted when
682 computing the number of any subsequent capturing subpatterns. For example, if
683 the string "the white queen" is matched against the pattern
684 </P>
685 <P>
686 <pre>
687 the ((?:red|white) (king|queen))
688 </PRE>
689 </P>
690 <P>
691 the captured substrings are "white queen" and "queen", and are numbered 1 and
692 2. The maximum number of capturing subpatterns is 65535, and the maximum depth
693 of nesting of all subpatterns, both capturing and non-capturing, is 200.
694 </P>
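<P>
The numbering of capturing and non-capturing subpatterns can be seen in a
short sketch. It assumes Python's <b>re</b> module, which numbers opening
parentheses in the same way; with PCRE itself the substrings would be read
from the <i>ovector</i> filled in by <b>pcre_exec()</b>.
</P>

```python
import re

capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
mixed = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captured substrings in order, starting from number 1.
print(capturing.groups())
print(mixed.groups())
```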
695 <P>
696 As a convenient shorthand, if any option settings are required at the start of
697 a non-capturing subpattern, the option letters may appear between the "?" and
698 the ":". Thus the two patterns
699 </P>
700 <P>
701 <pre>
702 (?i:saturday|sunday)
703 (?:(?i)saturday|sunday)
704 </PRE>
705 </P>
706 <P>
707 match exactly the same set of strings. Because alternative branches are tried
708 from left to right, and options are not reset until the end of the subpattern
709 is reached, an option setting in one branch does affect subsequent branches, so
710 the above patterns match "SUNDAY" as well as "Saturday".
711 </P>
712 <br><a name="SEC11" href="#TOC1">NAMED SUBPATTERNS</a><br>
713 <P>
714 Identifying capturing parentheses by number is simple, but it can be very hard
715 to keep track of the numbers in complicated regular expressions. Furthermore,
716 if an expression is modified, the numbers may change. To help with the
717 difficulty, PCRE supports the naming of subpatterns, something that Perl does
718 not provide. The Python syntax (?P&#60;name&#62;...) is used. Names consist of
719 alphanumeric characters and underscores, and must be unique within a pattern.
720 </P>
721 <P>
722 Named capturing parentheses are still allocated numbers as well as names. The
723 PCRE API provides function calls for extracting the name-to-number translation
724 table from a compiled pattern. For further details see the
725 <a href="pcreapi.html"><b>pcreapi</b></a>
726 documentation.
727 </P>
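<P>
As an illustration only, the sketch below uses Python's <b>re</b> module, whose
named-group syntax PCRE borrows. Its <i>groupindex</i> attribute plays a role
loosely comparable to the name-to-number table mentioned above; it is not the
PCRE API, and the pattern and names are invented for the example.
</P>

```python
import re

# Each named group still receives an ordinary group number.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Python exposes the name-to-number mapping on the compiled pattern.
name_to_number = dict(date.groupindex)

m = date.match("2007-02-24")
by_name = m.group("month")   # look the group up by name
by_number = m.group(2)       # the same group, looked up by number

print(name_to_number)        # {'year': 1, 'month': 2, 'day': 3}
print(by_name, by_number)    # 02 02
```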
728 <br><a name="SEC12" href="#TOC1">REPETITION</a><br>
729 <P>
730 Repetition is specified by quantifiers, which can follow any of the following
731 items:
732 </P>
733 <P>
734 <pre>
735 a literal data character
736 the . metacharacter
737 the \C escape sequence
738 escapes such as \d that match single characters
739 a character class
740 a back reference (see next section)
741 a parenthesized subpattern (unless it is an assertion)
742 </PRE>
743 </P>
744 <P>
745 The general repetition quantifier specifies a minimum and maximum number of
746 permitted matches, by giving the two numbers in curly brackets (braces),
747 separated by a comma. The numbers must be less than 65536, and the first must
748 be less than or equal to the second. For example:
749 </P>
750 <P>
751 <pre>
752 z{2,4}
753 </PRE>
754 </P>
755 <P>
756 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
757 character. If the second number is omitted, but the comma is present, there is
758 no upper limit; if the second number and the comma are both omitted, the
759 quantifier specifies an exact number of required matches. Thus
760 </P>
761 <P>
762 <pre>
763 [aeiou]{3,}
764 </PRE>
765 </P>
766 <P>
767 matches at least 3 successive vowels, but may match many more, while
768 </P>
769 <P>
770 <pre>
771 \d{8}
772 </PRE>
773 </P>
774 <P>
775 matches exactly 8 digits. An opening curly bracket that appears in a position
776 where a quantifier is not allowed, or one that does not match the syntax of a
777 quantifier, is taken as a literal character. For example, {,6} is not a
778 quantifier, but a literal string of four characters.
779 </P>
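<P>
The three forms of the general quantifier can be tried out as follows. This is
a sketch using Python's <b>re</b> module, not PCRE. The {,6} case is
deliberately left out, because Python treats it as a quantifier with a minimum
of zero, unlike the behaviour described above.
</P>

```python
import re

# {2,4}: between two and four repetitions.
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
bounded = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit; the repeat is greedy.
open_ended = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
exactly_eight = re.fullmatch(r"\d{8}", "20070224") is not None
only_seven = re.fullmatch(r"\d{8}", "2007022") is not None

print(bounded)                     # ['zz', 'zzz', 'zzzz']
print(open_ended)                  # ueuei
print(exactly_eight, only_seven)   # True False
```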
780 <P>
781 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
782 bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
783 which is represented by a two-byte sequence.
784 </P>
785 <P>
786 The quantifier {0} is permitted, causing the expression to behave as if the
787 previous item and the quantifier were not present.
788 </P>
789 <P>
790 For convenience (and historical compatibility) the three most common
791 quantifiers have single-character abbreviations:
792 </P>
793 <P>
794 <pre>
795 * is equivalent to {0,}
796 + is equivalent to {1,}
797 ? is equivalent to {0,1}
798 </PRE>
799 </P>
800 <P>
801 It is possible to construct infinite loops by following a subpattern that can
802 match no characters with a quantifier that has no upper limit, for example:
803 </P>
804 <P>
805 <pre>
806 (a?)*
807 </PRE>
808 </P>
809 <P>
810 Earlier versions of Perl and PCRE used to give an error at compile time for
811 such patterns. However, because there are cases where this can be useful, such
812 patterns are now accepted, but if any repetition of the subpattern does in fact
813 match no characters, the loop is forcibly broken.
814 </P>
815 <P>
816 By default, the quantifiers are "greedy", that is, they match as much as
817 possible (up to the maximum number of permitted times), without causing the
818 rest of the pattern to fail. The classic example of where this gives problems
819 is in trying to match comments in C programs. These appear between the
820 sequences /* and */ and within the sequence, individual * and / characters may
821 appear. An attempt to match C comments by applying the pattern
822 </P>
823 <P>
824 <pre>
825 /\*.*\*/
826 </PRE>
827 </P>
828 <P>
829 to the string
830 </P>
831 <P>
832 <pre>
833 /* first comment */ not comment /* second comment */
834 </PRE>
835 </P>
836 <P>
837 fails, because it matches the entire string owing to the greediness of the .*
838 item.
839 </P>
840 <P>
841 However, if a quantifier is followed by a question mark, it ceases to be
842 greedy, and instead matches the minimum number of times possible, so the
843 pattern
844 </P>
845 <P>
846 <pre>
847 /\*.*?\*/
848 </PRE>
849 </P>
850 <P>
851 does the right thing with the C comments. The meaning of the various
852 quantifiers is not otherwise changed, just the preferred number of matches.
853 Do not confuse this use of question mark with its use as a quantifier in its
854 own right. Because it has two uses, it can sometimes appear doubled, as in
855 </P>
856 <P>
857 <pre>
858 \d??\d
859 </PRE>
860 </P>
861 <P>
862 which matches one digit by preference, but can match two if that is the only
863 way the rest of the pattern matches.
864 </P>
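<P>
The difference between greedy and minimizing repeats is easy to observe. The
sketch below uses Python's <b>re</b> module, which is assumed to treat the ?
suffix in the same way as PCRE for these simple cases.
</P>

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject, swallowing everything.
greedy = re.search(r"/\*.*\*/", text).group(0)

# Minimizing .*? stops at the first */ it can, so each comment is separate.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group(0)
forced_two = re.fullmatch(r"\d??\d", "12").group(0)

print(greedy == text)            # True
print(lazy)                      # ['/* first comment */', '/* second comment */']
print(prefers_one, forced_two)   # 1 12
```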
865 <P>
866 If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
867 the quantifiers are not greedy by default, but individual ones can be made
868 greedy by following them with a question mark. In other words, it inverts the
869 default behaviour.
870 </P>
871 <P>
872 When a parenthesized subpattern is quantified with a minimum repeat count that
873 is greater than 1 or with a limited maximum, more store is required for the
874 compiled pattern, in proportion to the size of the minimum or maximum.
875 </P>
876 <P>
877 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
878 to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
879 implicitly anchored, because whatever follows will be tried against every
880 character position in the subject string, so there is no point in retrying the
881 overall match at any position after the first. PCRE normally treats such a
882 pattern as though it were preceded by \A.
883 </P>
884 <P>
885 In cases where it is known that the subject string contains no newlines, it is
886 worth setting PCRE_DOTALL in order to obtain this optimization, or
887 alternatively using ^ to indicate anchoring explicitly.
888 </P>
889 <P>
890 However, there is one situation where the optimization cannot be used. When .*
891 is inside capturing parentheses that are the subject of a backreference
892 elsewhere in the pattern, a match at the start may fail, and a later one
893 succeed. Consider, for example:
894 </P>
895 <P>
896 <pre>
897 (.*)abc\1
898 </PRE>
899 </P>
900 <P>
901 If the subject is "xyz123abc123" the match point is the fourth character. For
902 this reason, such a pattern is not implicitly anchored.
903 </P>
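<P>
The back reference case can be confirmed with a small experiment. It is run
here with Python's <b>re</b> module as a stand-in for PCRE, so it shows only
where the match is found, not the anchoring optimization itself.
</P>

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = m.start()   # zero-based offset of the match
captured = m.group(1)     # what (.*) ended up holding
whole = m.group(0)

print(match_start)        # 3, that is, the fourth character
print(captured)           # 123
print(whole)              # 123abc123
```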
904 <P>
905 When a capturing subpattern is repeated, the value captured is the substring
906 that matched the final iteration. For example, after
907 </P>
908 <P>
909 <pre>
910 (tweedle[dume]{3}\s*)+
911 </PRE>
912 </P>
913 <P>
914 has matched "tweedledum tweedledee" the value of the captured substring is
915 "tweedledee". However, if there are nested capturing subpatterns, the
916 corresponding captured values may have been set in previous iterations. For
917 example, after
918 </P>
919 <P>
920 <pre>
921 /(a|(b))+/
922 </PRE>
923 </P>
924 <P>
925 matches "aba" the value of the second captured substring is "b".
926 </P>
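<P>
Both examples can be reproduced in a few lines. The sketch uses Python's
<b>re</b> module, which is assumed here to keep captured values across
iterations in the same way as PCRE.
</P>

```python
import re

# A repeated group reports the text matched by its final iteration.
last_iteration = re.match(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# The inner group (b) does not take part in the final iteration,
# so it keeps the value set during the second one.
nested = re.match(r"(a|(b))+", "aba").groups()

print(last_iteration)   # tweedledee
print(nested)           # ('a', 'b')
```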
927 <br><a name="SEC13" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
928 <P>
929 With both maximizing and minimizing repetition, failure of what follows
930 normally causes the repeated item to be re-evaluated to see if a different
931 number of repeats allows the rest of the pattern to match. Sometimes it is
932 useful to prevent this, either to change the nature of the match, or to cause
933 it to fail earlier than it otherwise might, when the author of the pattern knows
934 there is no point in carrying on.
935 </P>
936 <P>
937 Consider, for example, the pattern \d+foo when applied to the subject line
938 </P>
939 <P>
940 <pre>
941 123456bar
942 </PRE>
943 </P>
944 <P>
945 After matching all 6 digits and then failing to match "foo", the normal
946 action of the matcher is to try again with only 5 digits matching the \d+
947 item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
948 (a term taken from Jeffrey Friedl's book) provides the means for specifying
949 that once a subpattern has matched, it is not to be re-evaluated in this way.
950 </P>
951 <P>
952 If we use atomic grouping for the previous example, the matcher would give up
953 immediately on failing to match "foo" the first time. The notation is a kind of
954 special parenthesis, starting with (?&#62; as in this example:
955 </P>
956 <P>
957 <pre>
958 (?&#62;\d+)foo
959 </PRE>
960 </P>
961 <P>
962 This kind of parenthesis "locks up" the part of the pattern it contains once
963 it has matched, and a failure further into the pattern is prevented from
964 backtracking into it. Backtracking past it to previous items, however, works as
965 normal.
966 </P>
967 <P>
968 An alternative description is that a subpattern of this type matches the string
969 of characters that an identical standalone pattern would match, if anchored at
970 the current point in the subject string.
971 </P>
972 <P>
973 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
974 the above example can be thought of as a maximizing repeat that must swallow
975 everything it can. So, while both \d+ and \d+? are prepared to adjust the
976 number of digits they match in order to make the rest of the pattern match,
977 (?&#62;\d+) can only match an entire sequence of digits.
978 </P>
979 <P>
980 Atomic groups in general can of course contain arbitrarily complicated
981 subpatterns, and can be nested. However, when the subpattern for an atomic
982 group is just a single repeated item, as in the example above, a simpler
983 notation, called a "possessive quantifier" can be used. This consists of an
984 additional + character following a quantifier. Using this notation, the
985 previous example can be rewritten as
986 </P>
987 <P>
988 <pre>
989 \d++foo
990 </PRE>
991 </P>
992 <P>
993 Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
994 option is ignored. They are a convenient notation for the simpler forms of
995 atomic group. However, there is no difference in the meaning or processing of a
996 possessive quantifier and the equivalent atomic group.
997 </P>
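<P>
The effect of refusing to backtrack shows up most clearly when giving back a
character would have allowed a match. The sketch below is written for Python's
<b>re</b> module, which accepts atomic groups and possessive quantifiers only
from version 3.11, hence the version check. The subject and the trailing "5"
are chosen for illustration and are not taken from the text above.
</P>

```python
import re
import sys

subject = "12345"
results = {}

# An ordinary greedy repeat gives back its last digit so that "5" can match.
results["greedy"] = re.search(r"\d+5", subject) is not None

# The atomic and possessive forms keep every digit, so "5" is never available.
if sys.version_info >= (3, 11):
    results["atomic"] = re.search(r"(?>\d+)5", subject) is not None
    results["possessive"] = re.search(r"\d++5", subject) is not None

# On 3.11 or later: {'greedy': True, 'atomic': False, 'possessive': False}
print(results)
```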
998 <P>
999 The possessive quantifier syntax is an extension to the Perl syntax. It
1000 originates in Sun's Java package.
1001 </P>
1002 <P>
1003 When a pattern contains an unlimited repeat inside a subpattern that can itself
1004 be repeated an unlimited number of times, the use of an atomic group is the
1005 only way to avoid some failing matches taking a very long time indeed. The
1006 pattern
1007 </P>
1008 <P>
1009 <pre>
1010 (\D+|&#60;\d+&#62;)*[!?]
1011 </PRE>
1012 </P>
1013 <P>
1014 matches an unlimited number of substrings that either consist of non-digits, or
1015 digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
1016 quickly. However, if it is applied to
1017 </P>
1018 <P>
1019 <pre>
1020 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1021 </PRE>
1022 </P>
1023 <P>
1024 it takes a long time before reporting failure. This is because the string can
1025 be divided between the two repeats in a large number of ways, and all have to
1026 be tried. (The example used [!?] rather than a single character at the end,
1027 because both PCRE and Perl have an optimization that allows for fast failure
1028 when a single character is used. They remember the last single character that
1029 is required for a match, and fail early if it is not present in the string.)
1030 If the pattern is changed to
1031 </P>
1032 <P>
1033 <pre>
1034 ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
1035 </PRE>
1036 </P>
1037 <P>
1038 sequences of non-digits cannot be broken, and failure happens quickly.
1039 </P>
1040 <br><a name="SEC14" href="#TOC1">BACK REFERENCES</a><br>
1041 <P>
1042 Outside a character class, a backslash followed by a digit greater than 0 (and
1043 possibly further digits) is a back reference to a capturing subpattern earlier
1044 (that is, to its left) in the pattern, provided there have been that many
1045 previous capturing left parentheses.
1046 </P>
1047 <P>
1048 However, if the decimal number following the backslash is less than 10, it is
1049 always taken as a back reference, and causes an error only if there are not
1050 that many capturing left parentheses in the entire pattern. In other words, the
1051 parentheses that are referenced need not be to the left of the reference for
1052 numbers less than 10. See the section entitled "Backslash" above for further
1053 details of the handling of digits following a backslash.
1054 </P>
1055 <P>
1056 A back reference matches whatever actually matched the capturing subpattern in
1057 the current subject string, rather than anything matching the subpattern
1058 itself (see
1059 <a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
1060 below for a way of doing that). So the pattern
1061 </P>
1062 <P>
1063 <pre>
1064 (sens|respons)e and \1ibility
1065 </PRE>
1066 </P>
1067 <P>
1068 matches "sense and sensibility" and "response and responsibility", but not
1069 "sense and responsibility". If caseful matching is in force at the time of the
1070 back reference, the case of letters is relevant. For example,
1071 </P>
1072 <P>
1073 <pre>
1074 ((?i)rah)\s+\1
1075 </PRE>
1076 </P>
1077 <P>
1078 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1079 capturing subpattern is matched caselessly.
1080 </P>
1081 <P>
1082 Back references to named subpatterns use the Python syntax (?P=name). We could
1083 rewrite the above example as follows:
1084 </P>
1085 <P>
1086 <pre>
1087 (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
1088 </PRE>
1089 </P>
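<P>
Numbered and named back references can be compared side by side. The sketch
uses Python's <b>re</b> module, from which the named syntax is taken, and
returns to the "sensibility" pattern. The group name "stem" is invented for the
example.
</P>

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",   # the two halves differ, so no match
]

# A back reference repeats the text that was captured, not the subpattern.
numbered_hits = [t for t in titles if numbered.fullmatch(t)]
named_hits = [t for t in titles if named.fullmatch(t)]

print(numbered_hits)   # ['sense and sensibility', 'response and responsibility']
print(named_hits == numbered_hits)   # True
```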
1090 <P>
1091 There may be more than one back reference to the same subpattern. If a
1092 subpattern has not actually been used in a particular match, any back
1093 references to it always fail. For example, the pattern
1094 </P>
1095 <P>
1096 <pre>
1097 (a|(bc))\2
1098 </PRE>
1099 </P>
1100 <P>
1101 always fails if it starts to match "a" rather than "bc". Because there may be
1102 many capturing parentheses in a pattern, all digits following the backslash are
1103 taken as part of a potential back reference number. If the pattern continues
1104 with a digit character, some delimiter must be used to terminate the back
1105 reference. If the PCRE_EXTENDED option is set, this can be whitespace.
1106 Otherwise an empty comment can be used.
1107 </P>
1108 <P>
1109 A back reference that occurs inside the parentheses to which it refers fails
1110 when the subpattern is first used, so, for example, (a\1) never matches.
1111 However, such references can be useful inside repeated subpatterns. For
1112 example, the pattern
1113 </P>
1114 <P>
1115 <pre>
1116 (a|b\1)+
1117 </PRE>
1118 </P>
1119 <P>
1120 matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1121 the subpattern, the back reference matches the character string corresponding
1122 to the previous iteration. In order for this to work, the pattern must be such
1123 that the first iteration does not need to match the back reference. This can be
1124 done using alternation, as in the example above, or by a quantifier with a
1125 minimum of zero.
1126 </P>
1127 <br><a name="SEC15" href="#TOC1">ASSERTIONS</a><br>
1128 <P>
1129 An assertion is a test on the characters following or preceding the current
1130 matching point that does not actually consume any characters. The simple
1131 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above.
1132 More complicated assertions are coded as subpatterns. There are two kinds:
1133 those that look ahead of the current position in the subject string, and those
1134 that look behind it.
1135 </P>
1136 <P>
1137 An assertion subpattern is matched in the normal way, except that it does not
1138 cause the current matching position to be changed. Lookahead assertions start
1139 with (?= for positive assertions and (?! for negative assertions. For example,
1140 </P>
1141 <P>
1142 <pre>
1143 \w+(?=;)
1144 </PRE>
1145 </P>
1146 <P>
1147 matches a word followed by a semicolon, but does not include the semicolon in
1148 the match, and
1149 </P>
1150 <P>
1151 <pre>
1152 foo(?!bar)
1153 </PRE>
1154 </P>
1155 <P>
1156 matches any occurrence of "foo" that is not followed by "bar". Note that the
1157 apparently similar pattern
1158 </P>
1159 <P>
1160 <pre>
1161 (?!foo)bar
1162 </PRE>
1163 </P>
1164 <P>
1165 does not find an occurrence of "bar" that is preceded by something other than
1166 "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
1167 (?!foo) is always true when the next three characters are "bar". A
1168 lookbehind assertion is needed to achieve this effect.
1169 </P>
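<P>
The three lookahead patterns above behave as follows when tried with Python's
<b>re</b> module, used here as a stand-in for PCRE. The subject strings are
made up for the example.
</P>

```python
import re

# Positive lookahead: the semicolon must follow but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "total = count;").group(0)

# Negative lookahead: only the "foo" that is not followed by "bar" is found.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar does not look backwards: "bar" is found even straight after "foo".
bar_anywhere = re.search(r"(?!foo)bar", "foobar").start()

print(word_before_semicolon)   # count
print(foo_not_bar)             # [7]
print(bar_anywhere)            # 3
```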
1170 <P>
1171 If you want to force a matching failure at some point in a pattern, the most
1172 convenient way to do it is with (?!) because an empty string always matches, so
1173 an assertion that requires there not to be an empty string must always fail.
1174 </P>
1175 <P>
1176 Lookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
1177 negative assertions. For example,
1178 </P>
1179 <P>
1180 <pre>
1181 (?&#60;!foo)bar
1182 </PRE>
1183 </P>
1184 <P>
1185 does find an occurrence of "bar" that is not preceded by "foo". The contents of
1186 a lookbehind assertion are restricted such that all the strings it matches must
1187 have a fixed length. However, if there are several alternatives, they do not
1188 all have to have the same fixed length. Thus
1189 </P>
1190 <P>
1191 <pre>
1192 (?&#60;=bullock|donkey)
1193 </PRE>
1194 </P>
1195 <P>
1196 is permitted, but
1197 </P>
1198 <P>
1199 <pre>
1200 (?&#60;!dogs?|cats?)
1201 </PRE>
1202 </P>
1203 <P>
1204 causes an error at compile time. Branches that match different length strings
1205 are permitted only at the top level of a lookbehind assertion. This is an
1206 extension compared with Perl (at least for 5.8), which requires all branches to
1207 match the same length of string. An assertion such as
1208 </P>
1209 <P>
1210 <pre>
1211 (?&#60;=ab(c|de))
1212 </PRE>
1213 </P>
1214 <P>
1215 is not permitted, because its single top-level branch can match two different
1216 lengths, but it is acceptable if rewritten to use two top-level branches:
1217 </P>
1218 <P>
1219 <pre>
1220 (?&#60;=abc|abde)
1221 </PRE>
1222 </P>
1223 <P>
1224 The implementation of lookbehind assertions is, for each alternative, to
1225 temporarily move the current position back by the fixed width and then try to
1226 match. If there are insufficient characters before the current position, the
1227 match is deemed to fail.
1228 </P>
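<P>
A minimal sketch of lookbehind, including the case where too few characters
precede the current position, is given below. It uses Python's <b>re</b>
module, which is stricter than the rule described above: it rejects a
lookbehind whose alternatives have different lengths, so only single-length
assertions appear here.
</P>

```python
import re

negative = re.compile(r"(?<!foo)bar")
positive = re.compile(r"(?<=foo)bar")

# "bar" straight after "foo" is rejected by the negative lookbehind.
after_foo = negative.search("foobar")

# Any other three preceding characters are acceptable.
after_baz = negative.search("bazbar").start()

# With nothing before "bar", the inner match cannot succeed: the negative
# assertion therefore holds, and the positive one fails.
negative_at_start = negative.search("bar") is not None
positive_at_start = positive.search("bar") is not None

print(after_foo)                              # None
print(after_baz)                              # 3
print(negative_at_start, positive_at_start)   # True False
```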
1229 <P>
1230 PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
1231 to appear in lookbehind assertions, because it makes it impossible to calculate
1232 the length of the lookbehind.
1233 </P>
1234 <P>
1235 Atomic groups can be used in conjunction with lookbehind assertions to specify
1236 efficient matching at the end of the subject string. Consider a simple pattern
1237 such as
1238 </P>
1239 <P>
1240 <pre>
1241 abcd$
1242 </PRE>
1243 </P>
1244 <P>
1245 when applied to a long string that does not match. Because matching proceeds
1246 from left to right, PCRE will look for each "a" in the subject and then see if
1247 what follows matches the rest of the pattern. If the pattern is specified as
1248 </P>
1249 <P>
1250 <pre>
1251 ^.*abcd$
1252 </PRE>
1253 </P>
1254 <P>
1255 the initial .* matches the entire string at first, but when this fails (because
1256 there is no following "a"), it backtracks to match all but the last character,
1257 then all but the last two characters, and so on. Once again the search for "a"
1258 covers the entire string, from right to left, so we are no better off. However,
1259 if the pattern is written as
1260 </P>
1261 <P>
1262 <pre>
1263 ^(?&#62;.*)(?&#60;=abcd)
1264 </PRE>
1265 </P>
1266 <P>
1267 or, equivalently,
1268 </P>
1269 <P>
1270 <pre>
1271 ^.*+(?&#60;=abcd)
1272 </PRE>
1273 </P>
1274 <P>
1275 there can be no backtracking for the .* item; it can match only the entire
1276 string. The subsequent lookbehind assertion does a single test on the last four
1277 characters. If it fails, the match fails immediately. For long strings, this
1278 approach makes a significant difference to the processing time.
1279 </P>
1280 <P>
1281 Several assertions (of any sort) may occur in succession. For example,
1282 </P>
1283 <P>
1284 <pre>
1285 (?&#60;=\d{3})(?&#60;!999)foo
1286 </PRE>
1287 </P>
1288 <P>
1289 matches "foo" preceded by three digits that are not "999". Notice that each of
1290 the assertions is applied independently at the same point in the subject
1291 string. First there is a check that the previous three characters are all
1292 digits, and then there is a check that the same three characters are not "999".
1293 This pattern does <i>not</i> match "foo" preceded by six characters, the first
1294 three of which are digits and the last three of which are not "999". For example, it
1295 doesn't match "123abcfoo". A pattern to do that is
1296 </P>
1297 <P>
1298 <pre>
1299 (?&#60;=\d{3}...)(?&#60;!999)foo
1300 </PRE>
1301 </P>
1302 <P>
1303 This time the first assertion looks at the preceding six characters, checking
1304 that the first three are digits, and then the second assertion checks that the
1305 preceding three characters are not "999".
1306 </P>
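<P>
The two patterns can be compared on a handful of subjects. The sketch below
uses Python's <b>re</b> module in place of PCRE; the list of subjects is
invented for the example.
</P>

```python
import re

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now inspects six characters, the second only the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

same_three_hits = [s for s in subjects if same_three.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]

print(same_three_hits)   # ['123foo']
print(six_back_hits)     # ['123abcfoo']
```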
1307 <P>
1308 Assertions can be nested in any combination. For example,
1309 </P>
1310 <P>
1311 <pre>
1312 (?&#60;=(?&#60;!foo)bar)baz
1313 </PRE>
1314 </P>
1315 <P>
1316 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
1317 preceded by "foo", while
1318 </P>
1319 <P>
1320 <pre>
1321 (?&#60;=\d{3}(?!999)...)foo
1322 </PRE>
1323 </P>
1324 <P>
1325 is another pattern which matches "foo" preceded by three digits and any three
1326 characters that are not "999".
1327 </P>
1328 <P>
1329 Assertion subpatterns are not capturing subpatterns, and may not be repeated,
1330 because it makes no sense to assert the same thing several times. If any kind
1331 of assertion contains capturing subpatterns within it, these are counted for
1332 the purposes of numbering the capturing subpatterns in the whole pattern.
1333 However, substring capturing is carried out only for positive assertions,
1334 because it does not make sense for negative assertions.
1335 </P>
1336 <br><a name="SEC16" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
1337 <P>
1338 It is possible to cause the matching process to obey a subpattern
1339 conditionally or to choose between two alternative subpatterns, depending on
1340 the result of an assertion, or whether a previous capturing subpattern matched
1341 or not. The two possible forms of conditional subpattern are
1342 </P>
1343 <P>
1344 <pre>
1345 (?(condition)yes-pattern)
1346 (?(condition)yes-pattern|no-pattern)
1347 </PRE>
1348 </P>
1349 <P>
1350 If the condition is satisfied, the yes-pattern is used; otherwise the
1351 no-pattern (if present) is used. If there are more than two alternatives in the
1352 subpattern, a compile-time error occurs.
1353 </P>
1354 <P>
1355 There are three kinds of condition. If the text between the parentheses
1356 consists of a sequence of digits, the condition is satisfied if the capturing
1357 subpattern of that number has previously matched. The number must be greater
1358 than zero. Consider the following pattern, which contains non-significant white
1359 space to make it more readable (assume the PCRE_EXTENDED option) and to divide
1360 it into three parts for ease of discussion:
1361 </P>
1362 <P>
1363 <pre>
1364 ( \( )? [^()]+ (?(1) \) )
1365 </PRE>
1366 </P>
1367 <P>
1368 The first part matches an optional opening parenthesis, and if that
1369 character is present, sets it as the first captured substring. The second part
1370 matches one or more characters that are not parentheses. The third part is a
1371 conditional subpattern that tests whether the first set of parentheses matched
1372 or not. If they did, that is, if the subject started with an opening parenthesis,
1373 the condition is true, and so the yes-pattern is executed and a closing
1374 parenthesis is required. Otherwise, since no-pattern is not present, the
1375 subpattern matches nothing. In other words, this pattern matches a sequence of
1376 non-parentheses, optionally enclosed in parentheses.
1377 </P>
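<P>
The optional-parenthesis pattern can be exercised as follows. The sketch uses
Python's <b>re</b> module, which supports the numeric form of condition, with
its VERBOSE flag standing in for PCRE_EXTENDED. Each subject is matched in
full, and the subjects are invented for the example.
</P>

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]

# A closing parenthesis is required only if group 1 matched an opening one.
accepted = [s for s in subjects if optional_parens.fullmatch(s)]

print(accepted)   # ['abc', '(abc)']
```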
1378 <P>
1379 If the condition is the string (R), it is satisfied if a recursive call to the
1380 pattern or subpattern has been made. At "top level", the condition is false.
1381 This is a PCRE extension. Recursive patterns are described in the next section.
1382 </P>
1383 <P>
1384 If the condition is not a sequence of digits or (R), it must be an assertion.
1385 This may be a positive or negative lookahead or lookbehind assertion. Consider
1386 this pattern, again containing non-significant white space, and with the two
1387 alternatives on the second line:
1388 </P>
1389 <P>
1390 <pre>
1391 (?(?=[^a-z]*[a-z])
1392 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1393 </PRE>
1394 </P>
1395 <P>
1396 The condition is a positive lookahead assertion that matches an optional
1397 sequence of non-letters followed by a letter. In other words, it tests for the
1398 presence of at least one letter in the subject. If a letter is found, the
1399 subject is matched against the first alternative; otherwise it is matched
1400 against the second. This pattern matches strings in one of the two forms
1401 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
1402 </P>
1403 <br><a name="SEC17" href="#TOC1">COMMENTS</a><br>
1404 <P>
1405 The sequence (?# marks the start of a comment which continues up to the next
1406 closing parenthesis. Nested parentheses are not permitted. The characters
1407 that make up a comment play no part in the pattern matching at all.
1408 </P>
1409 <P>
1410 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1411 character class introduces a comment that continues up to the next newline
1412 character in the pattern.
1413 </P>
1414 <br><a name="SEC18" href="#TOC1">RECURSIVE PATTERNS</a><br>
1415 <P>
1416 Consider the problem of matching a string in parentheses, allowing for
1417 unlimited nested parentheses. Without the use of recursion, the best that can
1418 be done is to use a pattern that matches up to some fixed depth of nesting. It
1419 is not possible to handle an arbitrary nesting depth. Perl has provided an
1420 experimental facility that allows regular expressions to recurse (amongst other
1421 things). It does this by interpolating Perl code in the expression at run time,
1422 and the code can refer to the expression itself. A Perl pattern to solve the
1423 parentheses problem can be created like this:
1424 </P>
1425 <P>
1426 <pre>
1427 $re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
1428 </PRE>
1429 </P>
1430 <P>
1431 The (?p{...}) item interpolates Perl code at run time, and in this case refers
1432 recursively to the pattern in which it appears. Obviously, PCRE cannot support
1433 the interpolation of Perl code. Instead, it supports some special syntax for
1434 recursion of the entire pattern, and also for individual subpattern recursion.
1435 </P>
1436 <P>
1437 The special item that consists of (? followed by a number greater than zero and
1438 a closing parenthesis is a recursive call of the subpattern of the given
1439 number, provided that it occurs inside that subpattern. (If not, it is a
1440 "subroutine" call, which is described in the next section.) The special item
1441 (?R) is a recursive call of the entire regular expression.
1442 </P>
1443 <P>
1444 For example, this PCRE pattern solves the nested parentheses problem (assume
1445 the PCRE_EXTENDED option is set so that white space is ignored):
1446 </P>
1447 <P>
1448 <pre>
1449 \( ( (?&#62;[^()]+) | (?R) )* \)
1450 </PRE>
1451 </P>
1452 <P>
1453 First it matches an opening parenthesis. Then it matches any number of
1454 substrings which can either be a sequence of non-parentheses, or a recursive
1455 match of the pattern itself (that is a correctly parenthesized substring).
1456 Finally there is a closing parenthesis.
1457 </P>
1458 <P>
1459 If this were part of a larger pattern, you would not want to recurse the entire
1460 pattern, so instead you could use this:
1461 </P>
1462 <P>
1463 <pre>
1464 ( \( ( (?&#62;[^()]+) | (?1) )* \) )
1465 </PRE>
1466 </P>
1467 <P>
1468 We have put the pattern into parentheses, and caused the recursion to refer to
1469 them instead of the whole pattern. In a larger pattern, keeping track of
1470 parenthesis numbers can be tricky. It may be more convenient to use named
1471 parentheses instead. For this, PCRE uses (?P&#62;name), which is an extension to
1472 the Python syntax that PCRE uses for named parentheses (Perl does not provide
1473 named parentheses). We could rewrite the above example as follows:
1474 </P>
1475 <P>
1476 <pre>
1477 (?P&#60;pn&#62; \( ( (?&#62;[^()]+) | (?P&#62;pn) )* \) )
1478 </PRE>
1479 </P>
1480 <P>
1481 This particular example pattern contains nested unlimited repeats, and so the
1482 use of atomic grouping for matching strings of non-parentheses is important
1483 when applying the pattern to strings that do not match. For example, when this
1484 pattern is applied to
1485 </P>
1486 <P>
1487 <pre>
1488 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1489 </PRE>
1490 </P>
1491 <P>
1492 it yields "no match" quickly. However, if atomic grouping is not used,
1493 the match runs for a very long time indeed because there are so many different
1494 ways the + and * repeats can carve up the subject, and all have to be tested
1495 before failure can be reported.
1496 </P>
1497 <P>
1498 At the end of a match, the values set for any capturing subpatterns are those
1499 from the outermost level of the recursion at which the subpattern value is set.
1500 If you want to obtain intermediate values, a callout function can be used (see
1501 below and the
1502 <a href="pcrecallout.html"><b>pcrecallout</b></a>
1503 documentation). If the pattern above is matched against
1504 </P>
1505 <P>
1506 <pre>
1507 (ab(cd)ef)
1508 </PRE>
1509 </P>
1510 <P>
1511 the value for the capturing parentheses is "ef", which is the last value taken
1512 on at the top level. If additional parentheses are added, giving
1513 </P>
1514 <P>
1515 <pre>
1516 \( ( ( (?&#62;[^()]+) | (?R) )* ) \)
1517    ^                        ^
1518    ^                        ^
1519 </PRE>
1520 </P>
1521 <P>
1522 the string they capture is "ab(cd)ef", the contents of the top level
1523 parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1524 has to obtain extra memory to store data during a recursion, which it does by
1525 using <b>pcre_malloc</b>, freeing it via <b>pcre_free</b> afterwards. If no
1526 memory can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
1527 </P>
1528 <P>
1529 Do not confuse the (?R) item with the condition (R), which tests for recursion.
1530 Consider this pattern, which matches text in angle brackets, allowing for
1531 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
1532 recursing), whereas any characters are permitted at the outer level.
1533 </P>
1534 <P>
1535 <pre>
1536 &#60; (?: (?(R) \d++ | [^&#60;&#62;]*+) | (?R)) * &#62;
1537 </PRE>
1538 </P>
1539 <P>
1540 In this pattern, (?(R) is the start of a conditional subpattern, with two
1541 different alternatives for the recursive and non-recursive cases. The (?R) item
1542 is the actual recursive call.
1543 </P>
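<P>
The distinction between the (R) condition and the (?R) call can be made
concrete with a small hand-written matcher. The following is only a sketch in
Python, not PCRE's implementation: the function name <i>match_angle</i> and its
<i>depth</i> parameter are hypothetical, and the function checks for a match
anchored at one position rather than searching the subject.
</P>

```python
def match_angle(text, pos=0, depth=0):
    """Return the index just past a balanced <...> group at pos, or None.

    depth == 0 plays the role of the non-recursive case of (?(R)...);
    depth > 0 plays the role of the recursive case.
    """
    if pos >= len(text) or text[pos] != "<":
        return None
    pos += 1
    while True:
        # (?(R) \d++ | [^<>]*+): digits only when recursing,
        # any non-bracket characters at the outer level.
        while pos < len(text) and text[pos] not in "<>" and (
            depth == 0 or text[pos] in "0123456789"
        ):
            pos += 1
        if pos < len(text) and text[pos] == "<":
            # (?R): the actual recursive call on the nested group.
            end = match_angle(text, pos, depth + 1)
            if end is None:
                return None
            pos = end
        else:
            break
    # The closing > of the group at this level.
    if pos < len(text) and text[pos] == ">":
        return pos + 1
    return None
```

<P>
Under these assumptions, "&#60;abc&#60;123&#62;def&#62;" is accepted, whereas
"&#60;abc&#60;12x&#62;def&#62;" is rejected as a whole because the nested
brackets contain a non-digit.
</P>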
1544 <a name="subpatternsassubroutines"></a><br><a name="SEC19" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
1545 <P>
1546 If the syntax for a recursive subpattern reference (either by number or by
1547 name) is used outside the parentheses to which it refers, it operates like a
1548 subroutine in a programming language. An earlier example pointed out that the
1549 pattern
1550 </P>
1551 <P>
1552 <pre>
1553 (sens|respons)e and \1ibility
1554 </PRE>
1555 </P>
1556 <P>
1557 matches "sense and sensibility" and "response and responsibility", but not
1558 "sense and responsibility". If instead the pattern
1559 </P>
1560 <P>
1561 <pre>
1562 (sens|respons)e and (?1)ibility
1563 </PRE>
1564 </P>
1565 <P>
1566 is used, it does match "sense and responsibility" as well as the other two
1567 strings. Such references must, however, follow the subpattern to which they
1568 refer.
1569 </P>
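<P>
The difference between the two patterns can be tried out with the following
sketch. It uses Python's <b>re</b> module, which does not support the (?1)
syntax, so the subroutine call is imitated by writing the subpattern text out a
second time; that substitution, and the names <i>SUBPATTERN</i>,
<i>backref</i>, <i>subroutine_like</i> and <i>matches</i>, are assumptions made
for illustration only.
</P>

```python
import re

# The text of the first capturing subpattern, shared by both patterns.
SUBPATTERN = r"sens|respons"

# Back reference: \1 must repeat the exact text that group 1 matched.
backref = re.compile(r"(" + SUBPATTERN + r")e and \1ibility")

# Subroutine-like: the subpattern itself is matched again, so either
# alternative may appear the second time. This stands in for (?1).
subroutine_like = re.compile(
    r"(" + SUBPATTERN + r")e and (?:" + SUBPATTERN + r")ibility"
)


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None
```

<P>
With these definitions, matches(backref, "sense and responsibility") is false,
while matches(subroutine_like, "sense and responsibility") is true. Both
patterns accept the other two strings.
</P>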
1570 <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>
1571 <P>
1572 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
1573 code to be obeyed in the middle of matching a regular expression. This makes it
1574 possible, amongst other things, to extract different substrings that match the
1575 same pair of parentheses when there is a repetition.
1576 </P>
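<P>
The problem that this addresses can be seen in the following sketch, which
again uses Python's <b>re</b> module rather than PCRE or Perl. A repeated pair
of capturing parentheses retains only its final value once the match is
complete. The function names are hypothetical, and the second function is a
workaround that scans the subject one item at a time; it is not a callout.
</P>

```python
import re


def last_captured(subject):
    """Match a comma-separated list of numbers; return what group 1 holds."""
    m = re.fullmatch(r"(?:(\d+),?)+", subject)
    return m.group(1) if m else None


def all_captured(subject):
    """Collect every number by matching one item at a time."""
    return [m.group(1) for m in re.finditer(r"(\d+),?", subject)]
```

<P>
For the subject "10,20,30" the first function returns only "30", the value from
the last repetition, whereas the second returns all three numbers.
</P>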
1577 <P>
1578 PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
1579 code. The feature is called "callout". The caller of PCRE provides an external
1580 function by putting its entry point in the global variable <i>pcre_callout</i>.
1581 By default, this variable contains NULL, which disables all calling out.
1582 </P>
1583 <P>
1584 Within a regular expression, (?C) indicates the points at which the external
1585 function is to be called. If you want to identify different callout points, you
1586 can put a number less than 256 after the letter C. The default value is zero.
1587 For example, this pattern has two callout points:
1588 </P>
1589 <P>
1590 <pre>
1591 (?C1)\dabc(?C2)def
1592 </PRE>
1593 </P>
1594 <P>
1595 During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is
1596 set), the external function is called. It is provided with the number of the
1597 callout, and, optionally, one item of data originally supplied by the caller of
1598 <b>pcre_exec()</b>. The callout function may cause matching to backtrack, or to
1599 fail altogether. A complete description of the interface to the callout
1600 function is given in the
1601 <a href="pcrecallout.html"><b>pcrecallout</b></a>
1602 documentation.
1603 </P>
1604 <P>
1605 Last updated: 03 February 2003
1606 <br>
1607 Copyright &copy; 1997-2003 University of Cambridge.
