Contents of /code/trunk/doc/pcrepattern.3

Revision 942
Tue Feb 28 14:50:31 2012 UTC by ph10
File size: 120745 byte(s)
Update for Unicode 6.1.0.

1 .TH PCREPATTERN 3
2 .SH NAME
3 PCRE - Perl-compatible regular expressions
4 .SH "PCRE REGULAR EXPRESSION DETAILS"
5 .rs
6 .sp
7 The syntax and semantics of the regular expressions that are supported by PCRE
8 are described in detail below. There is a quick-reference syntax summary in the
9 .\" HREF
10 \fBpcresyntax\fP
11 .\"
12 page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
13 also supports some alternative regular expression syntax (which does not
14 conflict with the Perl syntax) in order to provide some compatibility with
15 regular expressions in Python, .NET, and Oniguruma.
16 .P
17 Perl's regular expressions are described in its own documentation, and
18 regular expressions in general are covered in a number of books, some of which
19 have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
20 published by O'Reilly, covers regular expressions in great detail. This
21 description of PCRE's regular expressions is intended as reference material.
22 .P
23 The original operation of PCRE was on strings of one-byte characters. However,
24 there is now also support for UTF-8 strings in the original library, and a
25 second library that supports 16-bit and UTF-16 character strings. To use these
26 features, PCRE must be built to include appropriate support. When using UTF
27 strings you must either call the compiling function with the PCRE_UTF8 or
28 PCRE_UTF16 option, or the pattern must start with one of these special
29 sequences:
30 .sp
31 (*UTF8)
32 (*UTF16)
33 .sp
34 Starting a pattern with such a sequence is equivalent to setting the relevant
35 option. This feature is not Perl-compatible. How setting a UTF mode affects
36 pattern matching is mentioned in several places below. There is also a summary
37 of features in the
38 .\" HREF
39 \fBpcreunicode\fP
40 .\"
41 page.
42 .P
43 Another special sequence that may appear at the start of a pattern or in
44 combination with (*UTF8) or (*UTF16) is:
45 .sp
46 (*UCP)
47 .sp
48 This has the same effect as setting the PCRE_UCP option: it causes sequences
49 such as \ed and \ew to use Unicode properties to determine character types,
50 instead of recognizing only characters with codes less than 128 via a lookup
51 table.
52 .P
53 If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
54 PCRE_NO_START_OPTIMIZE option either at compile or matching time. Further
55 special sequences of this kind control the handling of newlines; they are
56 described below.
57 .P
58 The remainder of this document discusses the patterns that are supported by
59 PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or
60 \fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching
61 functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using
62 a different algorithm that is not Perl-compatible. Some of the features
63 discussed below are not available when DFA matching is used. The advantages and
64 disadvantages of the alternative functions, and how they differ from the normal
65 functions, are discussed in the
66 .\" HREF
67 \fBpcrematching\fP
68 .\"
69 page.
70 .
71 .
72 .\" HTML <a name="newlines"></a>
73 .SH "NEWLINE CONVENTIONS"
74 .rs
75 .sp
76 PCRE supports five different conventions for indicating line breaks in
77 strings: a single CR (carriage return) character, a single LF (linefeed)
78 character, the two-character sequence CRLF, any of the three preceding, or any
79 Unicode newline sequence. The
80 .\" HREF
81 \fBpcreapi\fP
82 .\"
83 page has
84 .\" HTML <a href="pcreapi.html#newlines">
85 .\" </a>
86 further discussion
87 .\"
88 about newlines, and shows how to set the newline convention in the
89 \fIoptions\fP arguments for the compiling and matching functions.
90 .P
91 It is also possible to specify a newline convention by starting a pattern
92 string with one of the following five sequences:
93 .sp
94 (*CR) carriage return
95 (*LF) linefeed
96 (*CRLF) carriage return, followed by linefeed
97 (*ANYCRLF) any of the three above
98 (*ANY) all Unicode newline sequences
99 .sp
100 These override the default and the options given to the compiling function. For
101 example, on a Unix system where LF is the default newline sequence, the pattern
102 .sp
103 (*CR)a.b
104 .sp
105 changes the convention to CR. That pattern matches "a\enb" because LF is no
106 longer a newline. Note that these special settings, which are not
107 Perl-compatible, are recognized only at the very start of a pattern, and that
108 they must be in upper case. If more than one of them is present, the last one
109 is used.
110 .P
111 The newline convention affects the interpretation of the dot metacharacter when
112 PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not
113 affect what the \eR escape sequence matches. By default, this is any Unicode
114 newline sequence, for Perl compatibility. However, this can be changed; see the
115 description of \eR in the section entitled
116 .\" HTML <a href="#newlineseq">
117 .\" </a>
118 "Newline sequences"
119 .\"
120 below. A change of \eR setting can be combined with a change of newline
121 convention.
122 .
123 .
124 .SH "CHARACTERS AND METACHARACTERS"
125 .rs
126 .sp
127 A regular expression is a pattern that is matched against a subject string from
128 left to right. Most characters stand for themselves in a pattern, and match the
129 corresponding characters in the subject. As a trivial example, the pattern
130 .sp
131 The quick brown fox
132 .sp
133 matches a portion of a subject string that is identical to itself. When
134 caseless matching is specified (the PCRE_CASELESS option), letters are matched
135 independently of case. In a UTF mode, PCRE always understands the concept of
136 case for characters whose values are less than 128, so caseless matching is
137 always possible. For characters with higher values, the concept of case is
138 supported if PCRE is compiled with Unicode property support, but not otherwise.
139 If you want to use caseless matching for characters 128 and above, you must
140 ensure that PCRE is compiled with Unicode property support as well as with
141 UTF support.
142 .P
143 The power of regular expressions comes from the ability to include alternatives
144 and repetitions in the pattern. These are encoded in the pattern by the use of
145 \fImetacharacters\fP, which do not stand for themselves but instead are
146 interpreted in some special way.
147 .P
148 There are two different sets of metacharacters: those that are recognized
149 anywhere in the pattern except within square brackets, and those that are
150 recognized within square brackets. Outside square brackets, the metacharacters
151 are as follows:
152 .sp
153   \e      general escape character with several uses
154   ^       assert start of string (or line, in multiline mode)
155   $       assert end of string (or line, in multiline mode)
156   .       match any character except newline (by default)
157   [       start character class definition
158   |       start of alternative branch
159   (       start subpattern
160   )       end subpattern
161   ?       extends the meaning of (
162           also 0 or 1 quantifier
163           also quantifier minimizer
164   *       0 or more quantifier
165   +       1 or more quantifier
166           also "possessive quantifier"
167   {       start min/max quantifier
168 .sp
169 Part of a pattern that is in square brackets is called a "character class". In
170 a character class the only metacharacters are:
171 .sp
172   \e      general escape character
173   ^       negate the class, but only if the first character
174   -       indicates character range
175 .\" JOIN
176   [       POSIX character class (only if followed by POSIX
177             syntax)
178   ]       terminates the character class
179 .sp
180 The following sections describe the use of each of the metacharacters.
181 .
182 .
183 .SH BACKSLASH
184 .rs
185 .sp
186 The backslash character has several uses. Firstly, if it is followed by a
187 character that is not a number or a letter, it takes away any special meaning
188 that character may have. This use of backslash as an escape character applies
189 both inside and outside character classes.
190 .P
191 For example, if you want to match a * character, you write \e* in the pattern.
192 This escaping action applies whether or not the following character would
193 otherwise be interpreted as a metacharacter, so it is always safe to precede a
194 non-alphanumeric with backslash to specify that it stands for itself. In
195 particular, if you want to match a backslash, you write \e\e.
196 .P
197 In a UTF mode, only ASCII numbers and letters have any special meaning after a
198 backslash. All other characters (in particular, those whose codepoints are
199 greater than 127) are treated as literals.
200 .P
201 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
202 pattern (other than in a character class) and characters between a # outside
203 a character class and the next newline are ignored. An escaping backslash can
204 be used to include a whitespace or # character as part of the pattern.
205 .P
206 If you want to remove the special meaning from a sequence of characters, you
207 can do so by putting them between \eQ and \eE. This is different from Perl in
208 that $ and @ are handled as literals in \eQ...\eE sequences in PCRE, whereas in
209 Perl, $ and @ cause variable interpolation. Note the following examples:
210 .sp
211   Pattern            PCRE matches   Perl matches
212 .sp
213 .\" JOIN
214   \eQabc$xyz\eE        abc$xyz        abc followed by the
215                                       contents of $xyz
216   \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
217   \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
218 .sp
219 The \eQ...\eE sequence is recognized both inside and outside character classes.
220 An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
221 by \eE later in the pattern, the literal interpretation continues to the end of
222 the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
223 a character class, this causes an error, because the character class is not
224 terminated.
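
The \eQ...\eE rules above can be sketched as a small pattern rewriter. This is
an illustration only, not PCRE's implementation: the function name
expand_quoting is hypothetical, the sketch ignores character classes (so it
does not reproduce the unterminated-class error), and it assumes that escaping
each ASCII non-alphanumeric character is an acceptable way to make it literal.

```python
def expand_quoting(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as individually escaped characters."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if quoting:
            if pair == "\\E":
                quoting = False
                i += 2
                continue
            ch = pattern[i]
            # Inside the quoted span every character is literal, so escape
            # each ASCII non-alphanumeric to strip any special meaning.
            if ch.isalnum() or ord(ch) > 127:
                out.append(ch)
            else:
                out.append("\\" + ch)
            i += 1
        elif pair == "\\Q":
            quoting = True
            i += 2
        elif pair == "\\E":
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and len(pair) == 2:
            out.append(pair)  # keep any other escape sequence intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

An unterminated \eQ simply runs to the end of the input here, which mirrors the
"\eE is assumed at the end" rule for patterns outside a character class.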
225 .
226 .
227 .\" HTML <a name="digitsafterbackslash"></a>
228 .SS "Non-printing characters"
229 .rs
230 .sp
231 A second use of backslash provides a way of encoding non-printing characters
232 in patterns in a visible manner. There is no restriction on the appearance of
233 non-printing characters, apart from the binary zero that terminates a pattern,
234 but when a pattern is being prepared by text editing, it is often easier to use
235 one of the following escape sequences than the binary character it represents:
236 .sp
237 \ea alarm, that is, the BEL character (hex 07)
238 \ecx "control-x", where x is any ASCII character
239 \ee escape (hex 1B)
240 \ef formfeed (hex 0C)
241 \en linefeed (hex 0A)
242 \er carriage return (hex 0D)
243 \et tab (hex 09)
244 \eddd character with octal code ddd, or back reference
245 \exhh character with hex code hh
246 \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
247 \euhhhh character with hex code hhhh (JavaScript mode only)
248 .sp
249 The precise effect of \ecx is as follows: if x is a lower case letter, it
250 is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
251 Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while
252 \ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater
253 than 127, a compile-time error occurs. This locks out non-ASCII characters in
254 all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A
255 lower case letter is converted to upper case, and then the 0xc0 bits are
256 flipped.)
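
The \ecx arithmetic described above can be checked with a few lines of code.
This is a minimal sketch for the ASCII case only; the function name
control_escape is hypothetical, and the EBCDIC behaviour is not modelled.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx stands for (ASCII rules only)."""
    if len(x) != 1 or ord(x) > 127:
        # Mirrors the compile-time error for bytes greater than 127.
        raise ValueError("\\c must be followed by a single ASCII character")
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40         # then bit 6 (hex 40) is inverted
```

For instance, control_escape("z") yields hex 1A, while control_escape("{")
yields hex 3B, matching the values quoted in the text.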
257 .P
258 By default, after \ex, from zero to two hexadecimal digits are read (letters
259 can be in upper or lower case). Any number of hexadecimal digits may appear
260 between \ex{ and }, but the character code is constrained as follows:
261 .sp
262   8-bit non-UTF mode    less than 0x100
263   8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
264   16-bit non-UTF mode   less than 0x10000
265   16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
266 .sp
267 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
268 "surrogate" codepoints).
269 .P
270 If characters other than hexadecimal digits appear between \ex{ and }, or if
271 there is no terminating }, this form of escape is not recognized. Instead, the
272 initial \ex will be interpreted as a basic hexadecimal escape, with no
273 following digits, giving a character whose value is zero.
274 .P
275 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
276 as just described only when it is followed by two hexadecimal digits.
277 Otherwise, it matches a literal "x" character. In JavaScript mode, support for
278 code points greater than 256 is provided by \eu, which must be followed by
279 four hexadecimal digits; otherwise it matches a literal "u" character.
280 .P
281 Characters whose value is less than 256 can be defined by either of the two
282 syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
283 way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
284 \eu00dc in JavaScript mode).
285 .P
286 After \e0 up to two further octal digits are read. If there are fewer than two
287 digits, just those that are present are used. Thus the sequence \e0\ex\e07
288 specifies two binary zeros followed by a BEL character (code value 7). Make
289 sure you supply two digits after the initial zero if the pattern character that
290 follows is itself an octal digit.
291 .P
292 The handling of a backslash followed by a digit other than 0 is complicated.
293 Outside a character class, PCRE reads it and any following digits as a decimal
294 number. If the number is less than 10, or if there have been at least that many
295 previous capturing left parentheses in the expression, the entire sequence is
296 taken as a \fIback reference\fP. A description of how this works is given
297 .\" HTML <a href="#backreferences">
298 .\" </a>
299 later,
300 .\"
301 following the discussion of
302 .\" HTML <a href="#subpattern">
303 .\" </a>
304 parenthesized subpatterns.
305 .\"
306 .P
307 Inside a character class, or if the decimal number is greater than 9 and there
308 have not been that many capturing subpatterns, PCRE re-reads up to three octal
309 digits following the backslash, and uses them to generate a data character. Any
310 subsequent digits stand for themselves. The value of the character is
311 constrained in the same way as characters specified in hexadecimal.
312 For example:
313 .sp
314   \e040   is another way of writing a space
315 .\" JOIN
316   \e40    is the same, provided there are fewer than 40
317             previous capturing subpatterns
318   \e7     is always a back reference
319 .\" JOIN
320   \e11    might be a back reference, or another way of
321             writing a tab
322   \e011   is always a tab
323   \e0113  is a tab followed by the character "3"
324 .\" JOIN
325   \e113   might be a back reference, otherwise the
326             character with octal code 113
327 .\" JOIN
328   \e377   might be a back reference, otherwise
329             the value 255 (decimal)
330 .\" JOIN
331   \e81    is either a back reference, or a binary zero
332             followed by the two characters "8" and "1"
333 .sp
334 Note that octal values of 100 or greater must not be introduced by a leading
335 zero, because no more than three octal digits are ever read.
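
The decision procedure for a backslash followed by digits can be sketched as
follows. This is an illustration of the rules stated above, not PCRE's code:
the function name and its return shape (kind, value, remaining literal digits)
are hypothetical, and the per-mode limits on character values are not checked.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, group_count, in_class=False):
    """Classify the run of decimal digits that follows a backslash.

    Returns (kind, value, rest): kind is "backref" or "char", and rest holds
    any trailing digits that stand for themselves.
    """
    if digits[0] == "0":
        # \0 is followed by up to two further octal digits.
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
            n += 1
        return ("char", int(digits[:n], 8), digits[n:])
    if not in_class:
        number = int(digits)
        # Below 10, or no larger than the number of previous capturing groups.
        if number < 10 or number <= group_count:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    value = int(digits[:n], 8) if n else 0
    return ("char", value, digits[n:])
```

Here group_count stands for the number of capturing left parentheses that
precede the escape in the pattern; a caller would have to track that itself.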
336 .P
337 All the sequences that define a single character value can be used both inside
338 and outside character classes. In addition, inside a character class, \eb is
339 interpreted as the backspace character (hex 08).
340 .P
341 \eN is not allowed in a character class. \eB, \eR, and \eX are not special
342 inside a character class. Like other unrecognized escape sequences, they are
343 treated as the literal characters "B", "R", and "X" by default, but cause an
344 error if the PCRE_EXTRA option is set. Outside a character class, these
345 sequences have different meanings.
346 .
347 .
348 .SS "Unsupported escape sequences"
349 .rs
350 .sp
351 In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
352 handler and used to modify the case of following characters. By default, PCRE
353 does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
354 option is set, \eU matches a "U" character, and \eu can be used to define a
355 character by code point, as described in the previous section.
356 .
357 .
358 .SS "Absolute and relative back references"
359 .rs
360 .sp
361 The sequence \eg followed by an unsigned or a negative number, optionally
362 enclosed in braces, is an absolute or relative back reference. A named back
363 reference can be coded as \eg{name}. Back references are discussed
364 .\" HTML <a href="#backreferences">
365 .\" </a>
366 later,
367 .\"
368 following the discussion of
369 .\" HTML <a href="#subpattern">
370 .\" </a>
371 parenthesized subpatterns.
372 .\"
373 .
374 .
375 .SS "Absolute and relative subroutine calls"
376 .rs
377 .sp
378 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
379 a number enclosed either in angle brackets or single quotes, is an alternative
380 syntax for referencing a subpattern as a "subroutine". Details are discussed
381 .\" HTML <a href="#onigurumasubroutines">
382 .\" </a>
383 later.
384 .\"
385 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
386 synonymous. The former is a back reference; the latter is a
387 .\" HTML <a href="#subpatternsassubroutines">
388 .\" </a>
389 subroutine
390 .\"
391 call.
392 .
393 .
394 .\" HTML <a name="genericchartypes"></a>
395 .SS "Generic character types"
396 .rs
397 .sp
398 Another use of backslash is for specifying generic character types:
399 .sp
400 \ed any decimal digit
401 \eD any character that is not a decimal digit
402 \eh any horizontal whitespace character
403 \eH any character that is not a horizontal whitespace character
404 \es any whitespace character
405 \eS any character that is not a whitespace character
406 \ev any vertical whitespace character
407 \eV any character that is not a vertical whitespace character
408 \ew any "word" character
409 \eW any "non-word" character
410 .sp
411 There is also the single sequence \eN, which matches a non-newline character.
412 This is the same as
413 .\" HTML <a href="#fullstopdot">
414 .\" </a>
415 the "." metacharacter
416 .\"
417 when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
418 PCRE does not support this.
419 .P
420 Each pair of lower and upper case escape sequences partitions the complete set
421 of characters into two disjoint sets. Any given character matches one, and only
422 one, of each pair. The sequences can appear both inside and outside character
423 classes. They each match one character of the appropriate type. If the current
424 matching point is at the end of the subject string, all of them fail, because
425 there is no character to match.
426 .P
427 For compatibility with Perl, \es does not match the VT character (code 11).
428 This makes it different from the POSIX "space" class. The \es characters
429 are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
430 included in a Perl script, \es may match the VT character. In PCRE, it never
431 does.
432 .P
433 A "word" character is an underscore or any character that is a letter or digit.
434 By default, the definition of letters and digits is controlled by PCRE's
435 low-valued character tables, and may vary if locale-specific matching is taking
436 place (see
437 .\" HTML <a href="pcreapi.html#localesupport">
438 .\" </a>
439 "Locale support"
440 .\"
441 in the
442 .\" HREF
443 \fBpcreapi\fP
444 .\"
445 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
446 or "french" in Windows, some character codes greater than 128 are used for
447 accented letters, and these are then matched by \ew. The use of locales with
448 Unicode is discouraged.
449 .P
450 By default, in a UTF mode, characters with values greater than 127 never match
451 \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
452 their original meanings from before UTF support was available, mainly for
453 efficiency reasons. However, if PCRE is compiled with Unicode property support,
454 and the PCRE_UCP option is set, the behaviour is changed so that Unicode
455 properties are used to determine character types, as follows:
456 .sp
457 \ed any character that \ep{Nd} matches (decimal digit)
458 \es any character that \ep{Z} matches, plus HT, LF, FF, CR
459 \ew any character that \ep{L} or \ep{N} matches, plus underscore
460 .sp
461 The upper case escapes match the inverse sets of characters. Note that \ed
462 matches only decimal digits, whereas \ew matches any Unicode digit, as well as
463 any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb and
464 \eB because they are defined in terms of \ew and \eW. Matching these sequences
465 is noticeably slower when PCRE_UCP is set.
466 .P
467 The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at
468 release 5.10. In contrast to the other sequences, which match only ASCII
469 characters by default, these always match certain high-valued codepoints,
470 whether or not PCRE_UCP is set. The horizontal space characters are:
471 .sp
472 U+0009 Horizontal tab
473 U+0020 Space
474 U+00A0 Non-break space
475 U+1680 Ogham space mark
476 U+180E Mongolian vowel separator
477 U+2000 En quad
478 U+2001 Em quad
479 U+2002 En space
480 U+2003 Em space
481 U+2004 Three-per-em space
482 U+2005 Four-per-em space
483 U+2006 Six-per-em space
484 U+2007 Figure space
485 U+2008 Punctuation space
486 U+2009 Thin space
487 U+200A Hair space
488 U+202F Narrow no-break space
489 U+205F Medium mathematical space
490 U+3000 Ideographic space
491 .sp
492 The vertical space characters are:
493 .sp
494 U+000A Linefeed
495 U+000B Vertical tab
496 U+000C Formfeed
497 U+000D Carriage return
498 U+0085 Next line
499 U+2028 Line separator
500 U+2029 Paragraph separator
501 .sp
502 In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are
503 relevant.
504 .
505 .
506 .\" HTML <a name="newlineseq"></a>
507 .SS "Newline sequences"
508 .rs
509 .sp
510 Outside a character class, by default, the escape sequence \eR matches any
511 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
512 following:
513 .sp
514 (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
515 .sp
516 This is an example of an "atomic group", details of which are given
517 .\" HTML <a href="#atomicgroup">
518 .\" </a>
519 below.
520 .\"
521 This particular group matches either the two-character sequence CR followed by
522 LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
523 U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
524 line, U+0085). The two-character sequence is treated as a single unit that
525 cannot be split.
526 .P
527 In other modes, two additional characters whose codepoints are greater than 255
528 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
529 Unicode character property support is not needed for these characters to be
530 recognized.
531 .P
532 It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
533 complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
534 either at compile time or when the pattern is matched. (BSR is an abbreviation
535 for "backslash R".) This can be made the default when PCRE is built; if this is
536 the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
537 It is also possible to specify these settings by starting a pattern string with
538 one of the following sequences:
539 .sp
540 (*BSR_ANYCRLF) CR, LF, or CRLF only
541 (*BSR_UNICODE) any Unicode newline sequence
542 .sp
543 These override the default and the options given to the compiling function, but
544 they can themselves be overridden by options given to a matching function. Note
545 that these special settings, which are not Perl-compatible, are recognized only
546 at the very start of a pattern, and that they must be in upper case. If more
547 than one of them is present, the last one is used. They can be combined with a
548 change of newline convention; for example, a pattern can start with:
549 .sp
550 (*ANY)(*BSR_ANYCRLF)
551 .sp
552 They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
553 sequences. Inside a character class, \eR is treated as an unrecognized escape
554 sequence, and so matches the letter "R" by default, but causes an error if
555 PCRE_EXTRA is set.
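
The two \eR settings can be approximated with ordinary alternations. The
sketch below uses Python's re module rather than PCRE, so it is only an
analogy: the groups are non-atomic, and the names BSR_UNICODE, BSR_ANYCRLF and
newline_sequences are chosen here for illustration, not taken from any API.

```python
import re

# Default behaviour: any Unicode newline sequence. CRLF is listed first so
# that the two-character sequence is taken as a single unit.
BSR_UNICODE = re.compile("(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

# Restricted behaviour: CR, LF, or CRLF only.
BSR_ANYCRLF = re.compile("(?:\r\n|\n|\r)")


def newline_sequences(subject, anycrlf=False):
    """List the newline sequences found in subject under either setting."""
    pattern = BSR_ANYCRLF if anycrlf else BSR_UNICODE
    return pattern.findall(subject)
```

With the restricted setting, characters such as VT or NEL are simply not
treated as line endings, so they are skipped by the scan.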
556 .
557 .
558 .\" HTML <a name="uniextseq"></a>
559 .SS Unicode character properties
560 .rs
561 .sp
562 When PCRE is built with Unicode character property support, three additional
563 escape sequences that match characters with specific properties are available.
564 When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing
565 characters whose codepoints are less than 256, but they do work in this mode.
566 The extra escape sequences are:
567 .sp
568 \ep{\fIxx\fP} a character with the \fIxx\fP property
569 \eP{\fIxx\fP} a character without the \fIxx\fP property
570 \eX an extended Unicode sequence
571 .sp
572 The property names represented by \fIxx\fP above are limited to the Unicode
573 script names, the general category properties, "Any", which matches any
574 character (including newline), and some special PCRE properties (described
575 in the
576 .\" HTML <a href="#extraprops">
577 .\" </a>
578 next section).
579 .\"
580 Other Perl properties such as "InMusicalSymbols" are not currently supported by
581 PCRE. Note that \eP{Any} does not match any characters, so always causes a
582 match failure.
583 .P
584 Sets of Unicode characters are defined as belonging to certain scripts. A
585 character from one of these sets can be matched using a script name. For
586 example:
587 .sp
588 \ep{Greek}
589 \eP{Han}
590 .sp
591 Those that are not part of an identified script are lumped together as
592 "Common". The current list of scripts is:
593 .P
594 Arabic,
595 Armenian,
596 Avestan,
597 Balinese,
598 Bamum,
599 Batak,
600 Bengali,
601 Bopomofo,
602 Brahmi,
603 Braille,
604 Buginese,
605 Buhid,
606 Canadian_Aboriginal,
607 Carian,
608 Chakma,
609 Cham,
610 Cherokee,
611 Common,
612 Coptic,
613 Cuneiform,
614 Cypriot,
615 Cyrillic,
616 Deseret,
617 Devanagari,
618 Egyptian_Hieroglyphs,
619 Ethiopic,
620 Georgian,
621 Glagolitic,
622 Gothic,
623 Greek,
624 Gujarati,
625 Gurmukhi,
626 Han,
627 Hangul,
628 Hanunoo,
629 Hebrew,
630 Hiragana,
631 Imperial_Aramaic,
632 Inherited,
633 Inscriptional_Pahlavi,
634 Inscriptional_Parthian,
635 Javanese,
636 Kaithi,
637 Kannada,
638 Katakana,
639 Kayah_Li,
640 Kharoshthi,
641 Khmer,
642 Lao,
643 Latin,
644 Lepcha,
645 Limbu,
646 Linear_B,
647 Lisu,
648 Lycian,
649 Lydian,
650 Malayalam,
651 Mandaic,
652 Meetei_Mayek,
653 Meroitic_Cursive,
654 Meroitic_Hieroglyphs,
655 Miao,
656 Mongolian,
657 Myanmar,
658 New_Tai_Lue,
659 Nko,
660 Ogham,
661 Old_Italic,
662 Old_Persian,
663 Old_South_Arabian,
664 Old_Turkic,
665 Ol_Chiki,
666 Oriya,
667 Osmanya,
668 Phags_Pa,
669 Phoenician,
670 Rejang,
671 Runic,
672 Samaritan,
673 Saurashtra,
674 Sharada,
675 Shavian,
676 Sinhala,
677 Sora_Sompeng,
678 Sundanese,
679 Syloti_Nagri,
680 Syriac,
681 Tagalog,
682 Tagbanwa,
683 Tai_Le,
684 Tai_Tham,
685 Tai_Viet,
686 Takri,
687 Tamil,
688 Telugu,
689 Thaana,
690 Thai,
691 Tibetan,
692 Tifinagh,
693 Ugaritic,
694 Vai,
695 Yi.
696 .P
697 Each character has exactly one Unicode general category property, specified by
698 a two-letter abbreviation. For compatibility with Perl, negation can be
699 specified by including a circumflex between the opening brace and the property
700 name. For example, \ep{^Lu} is the same as \eP{Lu}.
701 .P
702 If only one letter is specified with \ep or \eP, it includes all the general
703 category properties that start with that letter. In this case, in the absence
704 of negation, the curly brackets in the escape sequence are optional; these two
705 examples have the same effect:
706 .sp
707 \ep{L}
708 \epL
709 .sp
710 The following general category property codes are supported:
711 .sp
712 C Other
713 Cc Control
714 Cf Format
715 Cn Unassigned
716 Co Private use
717 Cs Surrogate
718 .sp
719 L Letter
720 Ll Lower case letter
721 Lm Modifier letter
722 Lo Other letter
723 Lt Title case letter
724 Lu Upper case letter
725 .sp
726 M Mark
727 Mc Spacing mark
728 Me Enclosing mark
729 Mn Non-spacing mark
730 .sp
731 N Number
732 Nd Decimal number
733 Nl Letter number
734 No Other number
735 .sp
736 P Punctuation
737 Pc Connector punctuation
738 Pd Dash punctuation
739 Pe Close punctuation
740 Pf Final punctuation
741 Pi Initial punctuation
742 Po Other punctuation
743 Ps Open punctuation
744 .sp
745 S Symbol
746 Sc Currency symbol
747 Sk Modifier symbol
748 Sm Mathematical symbol
749 So Other symbol
750 .sp
751 Z Separator
752 Zl Line separator
753 Zp Paragraph separator
754 Zs Space separator
755 .sp
756 The special property L& is also supported: it matches a character that has
757 the Lu, Ll, or Lt property, in other words, a letter that is not classified as
758 a modifier or "other".
759 .P
760 The Cs (Surrogate) property applies only to characters in the range U+D800 to
761 U+DFFF. Such characters are not valid in Unicode strings and so
762 cannot be tested by PCRE, unless UTF validity checking has been turned off
763 (see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the
764 .\" HREF
765 \fBpcreapi\fP
766 .\"
767 page). Perl does not support the Cs property.
768 .P
769 The long synonyms for property names that Perl supports (such as \ep{Letter})
770 are not supported by PCRE, nor is it permitted to prefix any of these
771 properties with "Is".
772 .P
773 No character that is in the Unicode table has the Cn (unassigned) property.
774 Instead, this property is assumed for any code point that is not in the
775 Unicode table.
776 .P
777 Specifying caseless matching does not affect these escape sequences. For
778 example, \ep{Lu} always matches only upper case letters.
779 .P
780 The \eX escape matches any number of Unicode characters that form an extended
781 Unicode sequence. \eX is equivalent to
782 .sp
783 (?>\ePM\epM*)
784 .sp
785 That is, it matches a character without the "mark" property, followed by zero
786 or more characters with the "mark" property, and treats the sequence as an
787 atomic group
788 .\" HTML <a href="#atomicgroup">
789 .\" </a>
790 (see below).
791 .\"
792 Characters with the "mark" property are typically accents that affect the
793 preceding character. None of them have codepoints less than 256, so in
794 8-bit non-UTF-8 mode \eX matches any one character.
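
The "non-mark followed by marks" definition of \eX can be sketched with
Python's unicodedata module. This is an illustration under stated assumptions:
the helper names are hypothetical, and Python's Unicode tables may be a
different version from the one PCRE was built with, so edge cases can differ.

```python
import unicodedata


def is_mark(ch):
    """True if ch has a general category starting with M (Mc, Me, Mn)."""
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text):
    """Split text into the pieces that successive \\X matches would cover."""
    sequences = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            i += 1  # a sequence cannot start on a mark, so skip it
            continue
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1  # absorb the marks that follow the base character
        sequences.append(text[i:j])
        i = j
    return sequences
```

A string with no mark characters splits into single characters, which
corresponds to the statement that \eX matches any one character in 8-bit
non-UTF-8 mode.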
795 .P
796 Note that recent versions of Perl have changed \eX to match what Unicode calls
797 an "extended grapheme cluster", which has a more complicated definition.
798 .P
799 Matching characters by Unicode property is not fast, because PCRE has to search
800 a structure that contains data for over fifteen thousand characters. That is
801 why the traditional escape sequences such as \ed and \ew do not use Unicode
802 properties in PCRE by default, though you can make them do so by setting the
803 PCRE_UCP option or by starting the pattern with (*UCP).
804 .
805 .
806 .\" HTML <a name="extraprops"></a>
807 .SS PCRE's additional properties
808 .rs
809 .sp
810 As well as the standard Unicode properties described in the previous
811 section, PCRE supports four more that make it possible to convert traditional
812 escape sequences such as \ew and \es and POSIX character classes to use Unicode
813 properties. PCRE uses these non-standard, non-Perl properties internally when
814 PCRE_UCP is set. They are:
815 .sp
816 Xan Any alphanumeric character
817 Xps Any POSIX space character
818 Xsp Any Perl space character
819 Xwd Any Perl "word" character
820 .sp
821 Xan matches characters that have either the L (letter) or the N (number)
822 property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
823 carriage return, and any other character that has the Z (separator) property.
824 Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
825 same characters as Xan, plus underscore.
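.P
The four definitions above can be written down as predicates. This is a
minimal Python sketch, not PCRE's implementation: it assumes the general
categories from Python's unicodedata module, and the function names are
hypothetical.

```python
import unicodedata

# tab, linefeed, vertical tab, formfeed, carriage return
POSIX_SPACE_CONTROLS = {"\t", "\n", "\x0b", "\x0c", "\r"}

def is_xan(ch):
    # L (letter) or N (number) property
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xps(ch):
    # the five control characters, or anything with the Z (separator) property
    return ch in POSIX_SPACE_CONTROLS or unicodedata.category(ch)[0] == "Z"

def is_xsp(ch):
    # same as Xps, except that vertical tab is excluded
    return ch != "\x0b" and is_xps(ch)

def is_xwd(ch):
    # same as Xan, plus underscore
    return ch == "_" or is_xan(ch)

print(is_xan("_"), is_xwd("_"), is_xps("\x0b"), is_xsp("\x0b"))
```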
826 .
827 .
828 .\" HTML <a name="resetmatchstart"></a>
829 .SS "Resetting the match start"
830 .rs
831 .sp
832 The escape sequence \eK causes any previously matched characters not to be
833 included in the final matched sequence. For example, the pattern:
834 .sp
835 foo\eKbar
836 .sp
837 matches "foobar", but reports that it has matched "bar". This feature is
838 similar to a lookbehind assertion
839 .\" HTML <a href="#lookbehind">
840 .\" </a>
841 (described below).
842 .\"
843 However, in this case, the part of the subject before the real match does not
844 have to be of fixed length, as it must be in a lookbehind assertion. The use of \eK does
845 not interfere with the setting of
846 .\" HTML <a href="#subpattern">
847 .\" </a>
848 captured substrings.
849 .\"
850 For example, when the pattern
851 .sp
852 (foo)\eKbar
853 .sp
854 matches "foobar", the first substring is still set to "foo".
855 .P
856 Perl documents that the use of \eK within assertions is "not well defined". In
857 PCRE, \eK is acted upon when it occurs inside positive assertions, but is
858 ignored in negative assertions.
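.P
The comparison with lookbehind can be tried in Python, whose re module has
no \eK. The sketch below is therefore an emulation under that assumption, not
PCRE behaviour: a fixed-length lookbehind reports only "bar", and for a
prefix of variable length a capturing group is read instead.

```python
import re

# Fixed-length prefix: the lookbehind is not part of the reported match.
fixed = re.search(r"(?<=foo)bar", "foobar")

# Variable-length prefix: Python's re rejects this inside a lookbehind, so
# the wanted part is captured and the span of group 1 is used instead.
variable = re.search(r"fo+(bar)", "fooobar")

print(fixed.span(), variable.span(1))
```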
859 .
860 .
861 .\" HTML <a name="smallassertions"></a>
862 .SS "Simple assertions"
863 .rs
864 .sp
865 The final use of backslash is for certain simple assertions. An assertion
866 specifies a condition that has to be met at a particular point in a match,
867 without consuming any characters from the subject string. The use of
868 subpatterns for more complicated assertions is described
869 .\" HTML <a href="#bigassertions">
870 .\" </a>
871 below.
872 .\"
873 The backslashed assertions are:
874 .sp
875   \eb     matches at a word boundary
876   \eB     matches when not at a word boundary
877   \eA     matches at the start of the subject
878   \eZ     matches at the end of the subject
879           also matches before a newline at the end of the subject
880   \ez     matches only at the end of the subject
881   \eG     matches at the first matching position in the subject
882 .sp
883 Inside a character class, \eb has a different meaning; it matches the backspace
884 character. If any other of these assertions appears in a character class, by
885 default it matches the corresponding literal character (for example, \eB
886 matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
887 escape sequence" error is generated instead.
888 .P
889 A word boundary is a position in the subject string where the current character
890 and the previous character do not both match \ew or \eW (i.e. one matches
891 \ew and the other matches \eW), or the start or end of the string if the
892 first or last character matches \ew, respectively. In a UTF mode, the meanings
893 of \ew and \eW can be changed by setting the PCRE_UCP option. When this is
894 done, it also affects \eb and \eB. Neither PCRE nor Perl has a separate "start
895 of word" or "end of word" metasequence. However, whatever follows \eb normally
896 determines which it is. For example, the fragment \eba matches "a" at the start
897 of a word.
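.P
The definition of a word boundary can be restated as a small function. This
Python sketch assumes ASCII-only word characters (the default outside
PCRE_UCP); the helper names are hypothetical, and Python's own \eb is used
only as a cross-check.

```python
import re

def is_word_char(ch):
    return re.fullmatch(r"\w", ch, re.ASCII) is not None

def is_word_boundary(subject, pos):
    # At the start or end of the subject, the missing neighbour counts as a
    # non-word character.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

subject = "a cat, at last"
boundaries = [i for i in range(len(subject) + 1) if is_word_boundary(subject, i)]
word_initial_a = [m.start() for m in re.finditer(r"\ba", subject, re.ASCII)]
print(boundaries, word_initial_a)
```

Only the "a" characters that begin a word are found by the second search; the
"a" inside "cat" and "last" is skipped.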
898 .P
899 The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
900 dollar (described in the next section) in that they only ever match at the very
901 start and end of the subject string, whatever options are set. Thus, they are
902 independent of multiline mode. These three assertions are not affected by the
903 PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
904 circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
905 argument of \fBpcre_exec()\fP is non-zero, indicating that matching is to start
906 at a point other than the beginning of the subject, \eA can never match. The
907 difference between \eZ and \ez is that \eZ matches before a newline at the end
908 of the string as well as at the very end, whereas \ez matches only at the end.
909 .P
910 The \eG assertion is true only when the current matching position is at the
911 start point of the match, as specified by the \fIstartoffset\fP argument of
912 \fBpcre_exec()\fP. It differs from \eA when the value of \fIstartoffset\fP is
913 non-zero. By calling \fBpcre_exec()\fP multiple times with appropriate
914 arguments, you can mimic Perl's /g option, and it is in this kind of
915 implementation that \eG can be useful.
916 .P
917 Note, however, that PCRE's interpretation of \eG, as the start of the current
918 match, is subtly different from Perl's, which defines it as the end of the
919 previous match. In Perl, these can be different when the previously matched
920 string was empty. Because PCRE does just one match at a time, it cannot
921 reproduce this behaviour.
922 .P
923 If all the alternatives of a pattern begin with \eG, the expression is anchored
924 to the starting match position, and the "anchored" flag is set in the compiled
925 regular expression.
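.P
The repeated-call technique described above can be sketched by analogy in
Python. This is not the PCRE API: Python's Pattern.match(string, pos) requires
a match to begin exactly at pos, which plays the role of a pattern anchored
with \eG together with a non-zero start offset. The token pattern and the
function name are made up for the illustration.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject):
    """Yield consecutive tokens; stop at the first offset where none starts."""
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # must match exactly at pos
        if m is None:
            break
        yield m.group()
        pos = m.end()                   # start offset for the next call

result = list(tokens("abc 123?xyz"))
print(result)
```

Every alternative in the token pattern matches at least one character, so the
offset always advances and the loop cannot stall on an empty match.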
926 .
927 .
928 .SH "CIRCUMFLEX AND DOLLAR"
929 .rs
930 .sp
931 Outside a character class, in the default matching mode, the circumflex
932 character is an assertion that is true only if the current matching point is
933 at the start of the subject string. If the \fIstartoffset\fP argument of
934 \fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE
935 option is unset. Inside a character class, circumflex has an entirely different
936 meaning
937 .\" HTML <a href="#characterclass">
938 .\" </a>
939 (see below).
940 .\"
941 .P
942 Circumflex need not be the first character of the pattern if a number of
943 alternatives are involved, but it should be the first thing in each alternative
944 in which it appears if the pattern is ever to match that branch. If all
945 possible alternatives start with a circumflex, that is, if the pattern is
946 constrained to match only at the start of the subject, it is said to be an
947 "anchored" pattern. (There are also other constructs that can cause a pattern
948 to be anchored.)
949 .P
950 A dollar character is an assertion that is true only if the current matching
951 point is at the end of the subject string, or immediately before a newline
952 at the end of the string (by default). Dollar need not be the last character of
953 the pattern if a number of alternatives are involved, but it should be the last
954 item in any branch in which it appears. Dollar has no special meaning in a
955 character class.
956 .P
957 The meaning of dollar can be changed so that it matches only at the very end of
958 the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
959 does not affect the \eZ assertion.
960 .P
961 The meanings of the circumflex and dollar characters are changed if the
962 PCRE_MULTILINE option is set. When this is the case, a circumflex matches
963 immediately after internal newlines as well as at the start of the subject
964 string. It does not match after a newline that ends the string. A dollar
965 matches before any newlines in the string, as well as at the very end, when
966 PCRE_MULTILINE is set. When newline is specified as the two-character
967 sequence CRLF, isolated CR and LF characters do not indicate newlines.
968 .P
969 For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
970 \en represents a newline) in multiline mode, but not otherwise. Consequently,
971 patterns that are anchored in single line mode because all branches start with
972 ^ are not anchored in multiline mode, and a match for circumflex is possible
973 when the \fIstartoffset\fP argument of \fBpcre_exec()\fP is non-zero. The
974 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
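.P
The example above can be reproduced with Python's re module, which is assumed
here to treat circumflex and dollar in the same way for this case; the code
is an illustration and does not call PCRE.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)               # default mode
multi_line = re.search(r"^abc$", subject, re.MULTILINE)  # multiline mode

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line, multi_line.span(), before_final_newline.span())
```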
975 .P
976 Note that the sequences \eA, \eZ, and \ez can be used to match the start and
977 end of the subject in both modes, and if all branches of a pattern start with
978 \eA it is always anchored, whether or not PCRE_MULTILINE is set.
979 .
980 .
981 .\" HTML <a name="fullstopdot"></a>
982 .SH "FULL STOP (PERIOD, DOT) AND \eN"
983 .rs
984 .sp
985 Outside a character class, a dot in the pattern matches any one character in
986 the subject string except (by default) a character that signifies the end of a
987 line.
988 .P
989 When a line ending is defined as a single character, dot never matches that
990 character; when the two-character sequence CRLF is used, dot does not match CR
991 if it is immediately followed by LF, but otherwise it matches all characters
992 (including isolated CRs and LFs). When any Unicode line endings are being
993 recognized, dot does not match CR or LF or any of the other line ending
994 characters.
995 .P
996 The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
997 option is set, a dot matches any one character, without exception. If the
998 two-character sequence CRLF is present in the subject string, it takes two dots
999 to match it.
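.P
A short Python check of the behaviour described above. It assumes a single
linefeed as the newline convention, which is what Python's re module uses;
re.DOTALL stands in for PCRE_DOTALL.

```python
import re

no_dotall = re.fullmatch(r"a.b", "a\nb")              # dot refuses the newline
with_dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

print(no_dotall, bool(with_dotall), one_dot, bool(two_dots))
```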
1000 .P
1001 The handling of dot is entirely independent of the handling of circumflex and
1002 dollar, the only relationship being that they both involve newlines. Dot has no
1003 special meaning in a character class.
1004 .P
1005 The escape sequence \eN behaves like a dot, except that it is not affected by
1006 the PCRE_DOTALL option. In other words, it matches any character except one
1007 that signifies the end of a line. Perl also uses \eN to match characters by
1008 name; PCRE does not support this.
1009 .
1010 .
1011 .SH "MATCHING A SINGLE DATA UNIT"
1012 .rs
1013 .sp
1014 Outside a character class, the escape sequence \eC matches any one data unit,
1015 whether or not a UTF mode is set. In the 8-bit library, one data unit is one
1016 byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always
1017 matches line-ending characters. The feature is provided in Perl in order to
1018 match individual bytes in UTF-8 mode, but it is unclear how it can usefully be
1019 used. Because \eC breaks up characters into individual data units, matching one
1020 unit with \eC in a UTF mode means that the rest of the string may start with a
1021 malformed UTF character. This has undefined results, because PCRE assumes that
1022 it is dealing with valid UTF strings (and by default it checks this at the
1023 start of processing unless the PCRE_NO_UTF8_CHECK option is used).
1024 .P
1025 PCRE does not allow \eC to appear in lookbehind assertions
1026 .\" HTML <a href="#lookbehind">
1027 .\" </a>
1028 (described below)
1029 .\"
1030 in a UTF mode, because this would make it impossible to calculate the length of
1031 the lookbehind.
1032 .P
1033 In general, the \eC escape sequence is best avoided. However, one
1034 way of using it that avoids the problem of malformed UTF characters is to use a
1035 lookahead to check the length of the next character, as in this pattern, which
1036 could be used with a UTF-8 string (ignore white space and line breaks):
1037 .sp
1038 (?| (?=[\ex00-\ex7f])(\eC) |
1039 (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1040 (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1041 (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1042 .sp
1043 A group that starts with (?| resets the capturing parentheses numbers in each
1044 alternative (see
1045 .\" HTML <a href="#dupsubpatternnumber">
1046 .\" </a>
1047 "Duplicate Subpattern Numbers"
1048 .\"
1049 below). The assertions at the start of each branch check the next UTF-8
1050 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1051 character's individual bytes are then captured by the appropriate number of
1052 groups.
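.P
The four ranges tested by the lookaheads correspond to the lengths of UTF-8
encodings. The Python sketch below restates them as a function (the name is
hypothetical) and compares the result with Python's own UTF-8 encoder for a
few sample code points.

```python
def utf8_length(code_point):
    """Number of bytes in the UTF-8 encoding, using the ranges shown above."""
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("outside the ranges covered by the pattern")

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```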
1053 .
1054 .
1055 .\" HTML <a name="characterclass"></a>
1056 .SH "SQUARE BRACKETS AND CHARACTER CLASSES"
1057 .rs
1058 .sp
1059 An opening square bracket introduces a character class, terminated by a closing
1060 square bracket. A closing square bracket on its own is not special by default.
1061 However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
1062 bracket causes a compile-time error. If a closing square bracket is required as
1063 a member of the class, it should be the first data character in the class
1064 (after an initial circumflex, if present) or escaped with a backslash.
1065 .P
1066 A character class matches a single character in the subject. In a UTF mode, the
1067 character may be more than one data unit long. A matched character must be in
1068 the set of characters defined by the class, unless the first character in the
1069 class definition is a circumflex, in which case the subject character must not
1070 be in the set defined by the class. If a circumflex is actually required as a
1071 member of the class, ensure it is not the first character, or escape it with a
1072 backslash.
1073 .P
1074 For example, the character class [aeiou] matches any lower case vowel, while
1075 [^aeiou] matches any character that is not a lower case vowel. Note that a
1076 circumflex is just a convenient notation for specifying the characters that
1077 are in the class by enumerating those that are not. A class that starts with a
1078 circumflex is not an assertion; it still consumes a character from the subject
1079 string, and therefore it fails if the current pointer is at the end of the
1080 string.
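.P
That a negated class still consumes a character can be seen in a short
Python check; Python's re module is assumed to agree with PCRE on this point.

```python
import re

not_vowel = re.compile(r"[^aeiou]")

consumes_one = not_vowel.match("x")       # matches and consumes "x"
upper_case_vowel = not_vowel.match("E")   # "E" is not a lower case vowel
at_end = not_vowel.match("")              # nothing left to consume
after_last_char = re.search(r"b[^aeiou]", "ab")

print(consumes_one.group(), upper_case_vowel.group(), at_end, after_last_char)
```

The last search fails even though no vowel follows "b", because the class
needs a character to match and the subject has run out.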
1081 .P
1082 In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be
1083 included in a class as a literal string of data units, or by using the \ex{
1084 escaping mechanism.
1085 .P
1086 When caseless matching is set, any letters in a class represent both their
1087 upper case and lower case versions, so for example, a caseless [aeiou] matches
1088 "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
1089 caseful version would. In a UTF mode, PCRE always understands the concept of
1090 case for characters whose values are less than 128, so caseless matching is
1091 always possible. For characters with higher values, the concept of case is
1092 supported if PCRE is compiled with Unicode property support, but not otherwise.
1093 If you want to use caseless matching in a UTF mode for characters 128 and
1094 above, you must ensure that PCRE is compiled with Unicode property support as
1095 well as with UTF support.
1096 .P
1097 Characters that might indicate line breaks are never treated in any special way
1098 when matching character classes, whatever line-ending sequence is in use, and
1099 whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is used. A class
1100 such as [^a] always matches one of these characters.
1101 .P
1102 The minus (hyphen) character can be used to specify a range of characters in a
1103 character class. For example, [d-m] matches any letter between d and m,
1104 inclusive. If a minus character is required in a class, it must be escaped with
1105 a backslash or appear in a position where it cannot be interpreted as
1106 indicating a range, typically as the first or last character in the class.
1107 .P
1108 It is not possible to have the literal character "]" as the end character of a
1109 range. A pattern such as [W-]46] is interpreted as a class of two characters
1110 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1111 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1112 the end of range, so [W-\e]46] is interpreted as a class containing a range
1113 followed by two other characters. The octal or hexadecimal representation of
1114 "]" can also be used to end a range.
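.P
The two readings described above can be compared in Python, whose re module
is assumed to parse these particular classes in the same way; this is an
illustration rather than PCRE output.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")

# With the backslash, the class holds the range W to ] plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("-46]")), bool(range_class.fullmatch("X")))
```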
1115 .P
1116 Ranges operate in the collating sequence of character values. They can also be
1117 used for characters specified numerically, for example [\e000-\e037]. Ranges
1118 can include any characters that are valid for the current mode.
1119 .P
1120 If a range that includes letters is used when caseless matching is set, it
1121 matches the letters in either case. For example, [W-c] is equivalent to
1122 [][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1123 tables for a French locale are in use, [\exc8-\excb] matches accented E
1124 characters in both cases. In UTF modes, PCRE supports the concept of case for
1125 characters with values greater than 127 only when it is compiled with Unicode
1126 property support.
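.P
The equivalence stated for [W-c] can be checked by listing the ASCII
characters that each class accepts. The sketch uses Python's re module with
its IGNORECASE and ASCII flags as a stand-in for caseless matching; the
explicit class is written with every special character escaped, and the
helper name is hypothetical.

```python
import re

def matched_chars(class_pattern):
    """Set of ASCII characters accepted by the class when matched caselessly."""
    rx = re.compile(class_pattern, re.IGNORECASE | re.ASCII)
    return {c for c in map(chr, range(128)) if rx.fullmatch(c)}

by_range = matched_chars(r"[W-c]")
by_list = matched_chars(r"[\]\[\\^_`wxyzabc]")

print(by_range == by_list, "".join(sorted(by_range)))
```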
1127 .P
1128 The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
1129 \eV, \ew, and \eW may appear in a character class, and add the characters that
1130 they match to the class. For example, [\edABCDEF] matches any hexadecimal
1131 digit. In UTF modes, the PCRE_UCP option affects the meanings of \ed, \es, \ew
1132 and their upper case partners, just as it does when they appear outside a
1133 character class, as described in the section entitled
1134 .\" HTML <a href="#genericchartypes">
1135 .\" </a>
1136 "Generic character types"
1137 .\"
1138 above. The escape sequence \eb has a different meaning inside a character
1139 class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
1140 are not special inside a character class. Like any other unrecognized escape
1141 sequences, they are treated as the literal characters "B", "N", "R", and "X" by
1142 default, but cause an error if the PCRE_EXTRA option is set.
1143 .P
1144 A circumflex can conveniently be used with the upper case character types to
1145 specify a more restricted set of characters than the matching lower case type.
1146 For example, the class [^\eW_] matches any letter or digit, but not underscore,
1147 whereas [\ew] includes underscore. A positive character class should be read as
1148 "something OR something OR ..." and a negative class as "NOT something AND NOT
1149 something AND NOT ...".
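.P
A short Python check of the class discussed above. The re.ASCII flag is an
assumption made so that \ew covers only ASCII letters, digits and underscore,
as it does in PCRE when PCRE_UCP is not set.

```python
import re

letters_digits = re.compile(r"[^\W_]", re.ASCII)   # NOT non-word AND NOT "_"
word = re.compile(r"[\w]", re.ASCII)

for ch in "a7_-":
    print(repr(ch), bool(letters_digits.fullmatch(ch)), bool(word.fullmatch(ch)))
```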
1150 .P
1151 The only metacharacters that are recognized in character classes are backslash,
1152 hyphen (only where it can be interpreted as specifying a range), circumflex
1153 (only at the start), opening square bracket (only when it can be interpreted as
1154 introducing a POSIX class name - see the next section), and the terminating
1155 closing square bracket. However, escaping other non-alphanumeric characters
1156 does no harm.
1157 .
1158 .
1159 .SH "POSIX CHARACTER CLASSES"
1160 .rs
1161 .sp
1162 Perl supports the POSIX notation for character classes. This uses names
1163 enclosed by [: and :] within the enclosing square brackets. PCRE also supports
1164 this notation. For example,
1165 .sp
1166 [01[:alpha:]%]
1167 .sp
1168 matches "0", "1", any alphabetic character, or "%". The supported class names
1169 are:
1170 .sp
1171 alnum letters and digits
1172 alpha letters
1173 ascii character codes 0 - 127
1174 blank space or tab only
1175 cntrl control characters
1176 digit decimal digits (same as \ed)
1177 graph printing characters, excluding space
1178 lower lower case letters
1179 print printing characters, including space
1180 punct printing characters, excluding letters and digits and space
1181 space white space (not quite the same as \es)
1182 upper upper case letters
1183 word "word" characters (same as \ew)
1184 xdigit hexadecimal digits
1185 .sp
1186 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
1187 space (32). Notice that this list includes the VT character (code 11). This
1188 makes "space" different to \es, which does not include VT (for Perl
1189 compatibility).
1190 .P
1191 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1192 5.8. Another Perl extension is negation, which is indicated by a ^ character
1193 after the colon. For example,
1194 .sp
1195 [12[:^digit:]]
1196 .sp
1197 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1198 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1199 supported, and an error is given if they are encountered.
1200 .P
1201 By default, in UTF modes, characters with values greater than 127 do not match
1202 any of the POSIX character classes. However, if the PCRE_UCP option is passed
1203 to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
1204 character properties are used. This is achieved by replacing the POSIX classes
1205 by other sequences, as follows:
1206 .sp
1207 [:alnum:] becomes \ep{Xan}
1208 [:alpha:] becomes \ep{L}
1209 [:blank:] becomes \eh
1210 [:digit:] becomes \ep{Nd}
1211 [:lower:] becomes \ep{Ll}
1212 [:space:] becomes \ep{Xps}
1213 [:upper:] becomes \ep{Lu}
1214 [:word:] becomes \ep{Xwd}
1215 .sp
1216 Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX
1217 classes are unchanged, and match only characters with code points less than
1218 128.
1219 .
1220 .
1221 .SH "VERTICAL BAR"
1222 .rs
1223 .sp
1224 Vertical bar characters are used to separate alternative patterns. For example,
1225 the pattern
1226 .sp
1227 gilbert|sullivan
1228 .sp
1229 matches either "gilbert" or "sullivan". Any number of alternatives may appear,
1230 and an empty alternative is permitted (matching the empty string). The matching
1231 process tries each alternative in turn, from left to right, and the first one
1232 that succeeds is used. If the alternatives are within a subpattern
1233 .\" HTML <a href="#subpattern">
1234 .\" </a>
1235 (defined below),
1236 .\"
1237 "succeeds" means matching the rest of the main pattern as well as the
1238 alternative in the subpattern.
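.P
The left-to-right rule and the meaning of "succeeds" inside a subpattern can
be seen in a few lines of Python; the re module is assumed to follow the same
rule, and the short patterns are made up for the illustration.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, the alternative must also let the rest of the pattern
# match, so "a" is abandoned in favour of "ab".
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "")

print(first_wins, needs_rest, empty_alternative is not None)
```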
1239 .
1240 .
1241 .SH "INTERNAL OPTION SETTING"
1242 .rs
1243 .sp
1244 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
1245 PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
1246 the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
1247 The option letters are
1248 .sp
1249 i for PCRE_CASELESS
1250 m for PCRE_MULTILINE
1251 s for PCRE_DOTALL
1252 x for PCRE_EXTENDED
1253 .sp
1254 For example, (?im) sets caseless, multiline matching. It is also possible to
1255 unset these options by preceding the letter with a hyphen, and a combined
1256 setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
1257 PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
1258 permitted. If a letter appears both before and after the hyphen, the option is
1259 unset.
1260 .P
1261 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
1262 changed in the same way as the Perl-compatible options by using the characters
1263 J, U and X respectively.
1264 .P
1265 When one of these option changes occurs at top level (that is, not inside
1266 subpattern parentheses), the change applies to the remainder of the pattern
1267 that follows. If the change is placed right at the start of a pattern, PCRE
1268 extracts it into the global options (and it will therefore show up in data
1269 extracted by the \fBpcre_fullinfo()\fP function).
1270 .P
1271 An option change within a subpattern (see below for a description of
1272 subpatterns) affects only that part of the subpattern that follows it, so
1273 .sp
1274 (a(?i)b)c
1275 .sp
1276 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
1277 By this means, options can be made to have different settings in different
1278 parts of the pattern. Any changes made in one alternative do carry on
1279 into subsequent branches within the same subpattern. For example,
1280 .sp
1281 (a(?i)b|c)
1282 .sp
1283 matches "ab", "aB", "c", and "C", even though when matching "C" the first
1284 branch is abandoned before the option setting. This is because the effects of
1285 option settings happen at compile time. There would be some very weird
1286 behaviour otherwise.
1287 .P
1288 \fBNote:\fP There are other PCRE-specific options that can be set by the
1289 application when the compiling or matching functions are called. In some cases
1290 the pattern can contain special leading sequences such as (*CRLF) to override
1291 what the application has set or what has been defaulted. Details are given in
1292 the section entitled
1293 .\" HTML <a href="#newlineseq">
1294 .\" </a>
1295 "Newline sequences"
1296 .\"
1297 above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that
1298 can be used to set UTF and Unicode property modes; they are equivalent to
1299 setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively.
1300 .
1301 .
1302 .\" HTML <a name="subpattern"></a>
1303 .SH SUBPATTERNS
1304 .rs
1305 .sp
1306 Subpatterns are delimited by parentheses (round brackets), which can be nested.
1307 Turning part of a pattern into a subpattern does two things:
1308 .sp
1309 1. It localizes a set of alternatives. For example, the pattern
1310 .sp
1311 cat(aract|erpillar|)
1312 .sp
1313 matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1314 match "cataract", "erpillar" or an empty string.
1315 .sp
1316 2. It sets up the subpattern as a capturing subpattern. This means that, when
1317 the whole pattern matches, that portion of the subject string that matched the
1318 subpattern is passed back to the caller via the \fIovector\fP argument of the
1319 matching function. (This applies only to the traditional matching functions;
1320 the DFA matching functions do not support capturing.)
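.P
The effect of the parentheses in the first point can be checked in Python,
which is assumed to group alternatives in the same way; whole-subject matching
is used so that each listed string is tested on its own.

```python
import re

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

for s in ("cataract", "caterpillar", "cat", "erpillar", ""):
    print(repr(s), bool(grouped.fullmatch(s)), bool(ungrouped.fullmatch(s)))
```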
1321 .P
1322 Opening parentheses are counted from left to right (starting from 1) to obtain
1323 numbers for the capturing subpatterns. For example, if the string "the red
1324 king" is matched against the pattern
1325 .sp
1326 the ((red|white) (king|queen))
1327 .sp
1328 the captured substrings are "red king", "red", and "king", and are numbered 1,
1329 2, and 3, respectively.
1330 .P
1331 The fact that plain parentheses fulfil two functions is not always helpful.
1332 There are often times when a grouping subpattern is required without a
1333 capturing requirement. If an opening parenthesis is followed by a question mark
1334 and a colon, the subpattern does not do any capturing, and is not counted when
1335 computing the number of any subsequent capturing subpatterns. For example, if
1336 the string "the white queen" is matched against the pattern
1337 .sp
1338 the ((?:red|white) (king|queen))
1339 .sp
1340 the captured substrings are "white queen" and "queen", and are numbered 1 and
1341 2. The maximum number of capturing subpatterns is 65535.
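.P
The numbering in the two examples above can be confirmed with Python's re
module, which is assumed to count opening parentheses in the same way and to
skip (?: groups.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")

print(capturing.groups())       # substrings 1, 2 and 3
print(non_capturing.groups())   # substrings 1 and 2
```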
1342 .P
1343 As a convenient shorthand, if any option settings are required at the start of
1344 a non-capturing subpattern, the option letters may appear between the "?" and
1345 the ":". Thus the two patterns
1346 .sp
1347 (?i:saturday|sunday)
1348 (?:(?i)saturday|sunday)
1349 .sp
1350 match exactly the same set of strings. Because alternative branches are tried
1351 from left to right, and options are not reset until the end of the subpattern
1352 is reached, an option setting in one branch does affect subsequent branches, so
1353 the above patterns match "SUNDAY" as well as "Saturday".
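.P
The first of the two forms can be tried in Python, which accepts option
letters between "?" and ":" as well. Only that form is shown; the contrast
pattern, in which the option covers just one alternative, is added for the
illustration.

```python
import re

both_branches = re.compile(r"(?i:saturday|sunday)")
first_branch_only = re.compile(r"(?i:saturday)|sunday")

print(bool(both_branches.fullmatch("SUNDAY")),
      bool(both_branches.fullmatch("Saturday")),
      bool(first_branch_only.fullmatch("SUNDAY")))
```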
1354 .
1355 .
1356 .\" HTML <a name="dupsubpatternnumber"></a>
1357 .SH "DUPLICATE SUBPATTERN NUMBERS"
1358 .rs
1359 .sp
1360 Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1361 the same numbers for its capturing parentheses. Such a subpattern starts with
1362 (?| and is itself a non-capturing subpattern. For example, consider this
1363 pattern:
1364 .sp
1365 (?|(Sat)ur|(Sun))day
1366 .sp
1367 Because the two alternatives are inside a (?| group, both sets of capturing
1368 parentheses are numbered one. Thus, when the pattern matches, you can look
1369 at captured substring number one, whichever alternative matched. This construct
1370 is useful when you want to capture part, but not all, of one of a number of
1371 alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1372 number is reset at the start of each branch. The numbers of any capturing
1373 parentheses that follow the subpattern start after the highest number used in
1374 any branch. The following example is taken from the Perl documentation. The
1375 numbers underneath show in which buffer the captured content will be stored.
1376 .sp
1377 # before  ---------------branch-reset----------- after
1378 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1379 # 1            2         2  3        2     3     4
1380 .sp
1381 A back reference to a numbered subpattern uses the most recent value that is
1382 set for that number by any subpattern. The following pattern matches "abcabc"
1383 or "defdef":
1384 .sp
1385 /(?|(abc)|(def))\e1/
1386 .sp
1387 In contrast, a subroutine call to a numbered subpattern always refers to the
1388 first one in the pattern with the given number. The following pattern matches
1389 "abcabc" or "defabc":
1390 .sp
1391 /(?|(abc)|(def))(?1)/
1392 .sp
1393 If a
1394 .\" HTML <a href="#conditions">
1395 .\" </a>
1396 condition test
1397 .\"
1398 for a subpattern's having matched refers to a non-unique number, the test is
1399 true if any of the subpatterns of that number have matched.
1400 .P
1401 An alternative approach to using this "branch reset" feature is to use
1402 duplicate named subpatterns, as described in the next section.
1403 .
1404 .
1405 .SH "NAMED SUBPATTERNS"
1406 .rs
1407 .sp
1408 Identifying capturing parentheses by number is simple, but it can be very hard
1409 to keep track of the numbers in complicated regular expressions. Furthermore,
1410 if an expression is modified, the numbers may change. To help with this
1411 difficulty, PCRE supports the naming of subpatterns. This feature was not
1412 added to Perl until release 5.10. Python had the feature earlier, and PCRE
1413 introduced it at release 4.0, using the Python syntax. PCRE now supports both
1414 the Perl and the Python syntax. Perl allows identically numbered subpatterns to
1415 have different names, but PCRE does not.
1416 .P
1417 In PCRE, a subpattern can be named in one of three ways: (?<name>...) or
1418 (?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing
1419 parentheses from other parts of the pattern, such as
1420 .\" HTML <a href="#backreferences">
1421 .\" </a>
1422 back references,
1423 .\"
1424 .\" HTML <a href="#recursion">
1425 .\" </a>
1426 recursion,
1427 .\"
1428 and
1429 .\" HTML <a href="#conditions">
1430 .\" </a>
1431 conditions,
1432 .\"
1433 can be made by name as well as by number.
1434 .P
1435 Names consist of up to 32 alphanumeric characters and underscores. Named
1436 capturing parentheses are still allocated numbers as well as names, exactly as
1437 if the names were not present. The PCRE API provides function calls for
1438 extracting the name-to-number translation table from a compiled pattern. There
1439 is also a convenience function for extracting a captured substring by name.
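.P
The Python form of the syntax can be tried directly in Python, where the
groupindex attribute plays a role similar to the name-to-number table
mentioned above. This shows Python's re module, not the PCRE API, and the
group names are made up for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2012-02")

# A named group is still numbered as if the name were not present.
print(m.group("year"), m.group(1), dict(date.groupindex))

# A reference by name from elsewhere in the pattern, in Python syntax.
doubled = re.fullmatch(r"(?P<w>ab)-(?P=w)", "ab-ab")
print(doubled is not None)
```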
1440 .P
1441 By default, a name must be unique within a pattern, but it is possible to relax
1442 this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
1443 names are also always permitted for subpatterns with the same number, set up as
1444 described in the previous section.) Duplicate names can be useful for patterns
1445 where only one instance of the named parentheses can match. Suppose you want to
1446 match the name of a weekday, either as a 3-letter abbreviation or as the full
1447 name, and in both cases you want to extract the abbreviation. This pattern
1448 (ignoring the line breaks) does the job:
1449 .sp
1450 (?<DN>Mon|Fri|Sun)(?:day)?|
1451 (?<DN>Tue)(?:sday)?|
1452 (?<DN>Wed)(?:nesday)?|
1453 (?<DN>Thu)(?:rsday)?|
1454 (?<DN>Sat)(?:urday)?
1455 .sp
1456 There are five capturing substrings, but only one is ever set after a match.
1457 (An alternative way of solving this problem is to use a "branch reset"
1458 subpattern, as described in the previous section.)
1459 .P
1460 The convenience function for extracting the data by name returns the substring
1461 for the first (and in this example, the only) subpattern of that name that
1462 matched. This saves searching to find which numbered subpattern it was.
1463 .P
1464 If you make a back reference to a non-unique named subpattern from elsewhere in
1465 the pattern, the one that corresponds to the first occurrence of the name is
1466 used. In the absence of duplicate numbers (see the previous section) this is
1467 the one with the lowest number. If you use a named reference in a condition
1468 test (see the
1469 .\"
1470 .\" HTML <a href="#conditions">
1471 .\" </a>
1472 section about conditions
1473 .\"
1474 below), either to check whether a subpattern has matched, or to check for
1475 recursion, all subpatterns with the same name are tested. If the condition is
1476 true for any one of them, the overall condition is true. This is the same
1477 behaviour as testing by number. For further details of the interfaces for
1478 handling named subpatterns, see the
1479 .\" HREF
1480 \fBpcreapi\fP
1481 .\"
1482 documentation.
1483 .P
1484 \fBWarning:\fP You cannot use different names to distinguish between two
1485 subpatterns with the same number because PCRE uses only the numbers when
1486 matching. For this reason, an error is given at compile time if different names
1487 are given to subpatterns with the same number. However, you can give the same
1488 name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
1489 .
1490 .
1491 .SH REPETITION
1492 .rs
1493 .sp
1494 Repetition is specified by quantifiers, which can follow any of the following
1495 items:
1496 .sp
1497 a literal data character
1498 the dot metacharacter
1499 the \eC escape sequence
1500 the \eX escape sequence
1501 the \eR escape sequence
1502 an escape such as \ed or \epL that matches a single character
1503 a character class
1504 a back reference (see next section)
1505 a parenthesized subpattern (including assertions)
1506 a subroutine call to a subpattern (recursive or otherwise)
1507 .sp
1508 The general repetition quantifier specifies a minimum and maximum number of
1509 permitted matches, by giving the two numbers in curly brackets (braces),
1510 separated by a comma. The numbers must be less than 65536, and the first must
1511 be less than or equal to the second. For example:
1512 .sp
1513 z{2,4}
1514 .sp
1515 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
1516 character. If the second number is omitted, but the comma is present, there is
1517 no upper limit; if the second number and the comma are both omitted, the
1518 quantifier specifies an exact number of required matches. Thus
1519 .sp
1520 [aeiou]{3,}
1521 .sp
1522 matches at least 3 successive vowels, but may match many more, while
1523 .sp
1524 \ed{8}
1525 .sp
1526 matches exactly 8 digits. An opening curly bracket that appears in a position
1527 where a quantifier is not allowed, or one that does not match the syntax of a
1528 quantifier, is taken as a literal character. For example, {,6} is not a
1529 quantifier, but a literal string of four characters.
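.P
The counting rules above can be checked with a small sketch. It uses Python's
standard re module rather than PCRE itself, on the assumption that the two
engines agree for these simple brace quantifiers; the helper name "full" and
the sample subjects are illustrative only. The {,6} case is deliberately left
out, because Python does not treat it the way PCRE does.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions of "z".
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
eight = full(r"\d{8}", "20120228")
seven = full(r"\d{8}", "2012022")

print(bounded, vowels, eight, seven)
```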
1530 .P
1531 In UTF modes, quantifiers apply to characters rather than to individual data
1532 units. Thus, for example, \ex{100}{2} matches two characters, each of
1533 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1534 \eX{3} matches three Unicode extended sequences, each of which may be several
1535 data units long (and they may be of different lengths).
1536 .P
1537 The quantifier {0} is permitted, causing the expression to behave as if the
1538 previous item and the quantifier were not present. This may be useful for
1539 subpatterns that are referenced as
1540 .\" HTML <a href="#subpatternsassubroutines">
1541 .\" </a>
1542 subroutines
1543 .\"
1544 from elsewhere in the pattern (but see also the section entitled
1545 .\" HTML <a href="#subdefine">
1546 .\" </a>
1547 "Defining subpatterns for use by reference only"
1548 .\"
1549 below). Items other than subpatterns that have a {0} quantifier are omitted
1550 from the compiled pattern.
1551 .P
1552 For convenience, the three most common quantifiers have single-character
1553 abbreviations:
1554 .sp
1555 * is equivalent to {0,}
1556 + is equivalent to {1,}
1557 ? is equivalent to {0,1}
1558 .sp
1559 It is possible to construct infinite loops by following a subpattern that can
1560 match no characters with a quantifier that has no upper limit, for example:
1561 .sp
1562 (a?)*
1563 .sp
1564 Earlier versions of Perl and PCRE used to give an error at compile time for
1565 such patterns. However, because there are cases where this can be useful, such
1566 patterns are now accepted, but if any repetition of the subpattern does in fact
1567 match no characters, the loop is forcibly broken.
1568 .P
1569 By default, the quantifiers are "greedy", that is, they match as much as
1570 possible (up to the maximum number of permitted times), without causing the
1571 rest of the pattern to fail. The classic example of where this gives problems
1572 is in trying to match comments in C programs. These appear between /* and */
1573 and within the comment, individual * and / characters may appear. An attempt to
1574 match C comments by applying the pattern
1575 .sp
1576 /\e*.*\e*/
1577 .sp
1578 to the string
1579 .sp
1580 /* first comment */ not comment /* second comment */
1581 .sp
1582 fails, because it matches the entire string owing to the greediness of the .*
1583 item.
1584 .P
1585 However, if a quantifier is followed by a question mark, it ceases to be
1586 greedy, and instead matches the minimum number of times possible, so the
1587 pattern
1588 .sp
1589 /\e*.*?\e*/
1590 .sp
1591 does the right thing with the C comments. The meaning of the various
1592 quantifiers is not otherwise changed, just the preferred number of matches.
1593 Do not confuse this use of question mark with its use as a quantifier in its
1594 own right. Because it has two uses, it can sometimes appear doubled, as in
1595 .sp
1596 \ed??\ed
1597 .sp
1598 which matches one digit by preference, but can match two if that is the only
1599 way the rest of the pattern matches.
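.P
As a rough illustration of the difference between greedy and minimizing
quantifiers, the following sketch runs the two comment patterns over the
subject string used above. It relies on Python's re module as a stand-in
engine (an assumption; it is not PCRE), and the variable names are
illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject, so there is one long match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ it can, so each comment is separate.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(preferred, forced)
```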
1600 .P
1601 If the PCRE_UNGREEDY option is set (an option that is not available in Perl),
1602 the quantifiers are not greedy by default, but individual ones can be made
1603 greedy by following them with a question mark. In other words, it inverts the
1604 default behaviour.
1605 .P
1606 When a parenthesized subpattern is quantified with a minimum repeat count that
1607 is greater than 1 or with a limited maximum, more memory is required for the
1608 compiled pattern, in proportion to the size of the minimum or maximum.
1609 .P
1610 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1611 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
1612 implicitly anchored, because whatever follows will be tried against every
1613 character position in the subject string, so there is no point in retrying the
1614 overall match at any position after the first. PCRE normally treats such a
1615 pattern as though it were preceded by \eA.
1616 .P
1617 In cases where it is known that the subject string contains no newlines, it is
1618 worth setting PCRE_DOTALL in order to obtain this optimization, or
1619 alternatively using ^ to indicate anchoring explicitly.
1620 .P
1621 However, there is one situation where the optimization cannot be used. When .*
1622 is inside capturing parentheses that are the subject of a back reference
1623 elsewhere in the pattern, a match at the start may fail where a later one
1624 succeeds. Consider, for example:
1625 .sp
1626 (.*)abc\e1
1627 .sp
1628 If the subject is "xyz123abc123" the match point is the fourth character. For
1629 this reason, such a pattern is not implicitly anchored.
1630 .P
1631 When a capturing subpattern is repeated, the value captured is the substring
1632 that matched the final iteration. For example, after
1633 .sp
1634 (tweedle[dume]{3}\es*)+
1635 .sp
1636 has matched "tweedledum tweedledee" the value of the captured substring is
1637 "tweedledee". However, if there are nested capturing subpatterns, the
1638 corresponding captured values may have been set in previous iterations. For
1639 example, after
1640 .sp
1641 /(a|(b))+/
1642 .sp
1643 matches "aba" the value of the second captured substring is "b".
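.P
The two capture examples above can be reproduced with the sketch below. It
uses Python's re module, which is assumed here to keep the values of nested
groups across iterations in the same way as described for PCRE; other engines
may reset them.

```python
import re

# The outer group is repeated; group 1 holds the text of the final iteration.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# Group 2 took part only in the second iteration, yet its value survives
# the third iteration, in which the "a" alternative matched instead.
n = re.fullmatch(r"(a|(b))+", "aba")
outer, nested = n.group(1), n.group(2)

print(last_iteration, outer, nested)
```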
1644 .
1645 .
1646 .\" HTML <a name="atomicgroup"></a>
1647 .SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
1648 .rs
1649 .sp
1650 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1651 repetition, failure of what follows normally causes the repeated item to be
1652 re-evaluated to see if a different number of repeats allows the rest of the
1653 pattern to match. Sometimes it is useful to prevent this, either to change the
1654 nature of the match, or to cause it to fail earlier than it otherwise might, when
1655 the author of the pattern knows there is no point in carrying on.
1656 .P
1657 Consider, for example, the pattern \ed+foo when applied to the subject line
1658 .sp
1659 123456bar
1660 .sp
1661 After matching all 6 digits and then failing to match "foo", the normal
1662 action of the matcher is to try again with only 5 digits matching the \ed+
1663 item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
1664 (a term taken from Jeffrey Friedl's book) provides the means for specifying
1665 that once a subpattern has matched, it is not to be re-evaluated in this way.
1666 .P
1667 If we use atomic grouping for the previous example, the matcher gives up
1668 immediately on failing to match "foo" the first time. The notation is a kind of
1669 special parenthesis, starting with (?> as in this example:
1670 .sp
1671 (?>\ed+)foo
1672 .sp
1673 This kind of parenthesis "locks up" the part of the pattern it contains once
1674 it has matched, and a failure further into the pattern is prevented from
1675 backtracking into it. Backtracking past it to previous items, however, works as
1676 normal.
1677 .P
1678 An alternative description is that a subpattern of this type matches the string
1679 of characters that an identical standalone pattern would match, if anchored at
1680 the current point in the subject string.
1681 .P
1682 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
1683 the above example can be thought of as a maximizing repeat that must swallow
1684 everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
1685 number of digits they match in order to make the rest of the pattern match,
1686 (?>\ed+) can only match an entire sequence of digits.
1687 .P
1688 Atomic groups in general can of course contain arbitrarily complicated
1689 subpatterns, and can be nested. However, when the subpattern for an atomic
1690 group is just a single repeated item, as in the example above, a simpler
1691 notation, called a "possessive quantifier" can be used. This consists of an
1692 additional + character following a quantifier. Using this notation, the
1693 previous example can be rewritten as
1694 .sp
1695 \ed++foo
1696 .sp
1697 Note that a possessive quantifier can be used with an entire group, for
1698 example:
1699 .sp
1700 (abc|xyz){2,3}+
1701 .sp
1702 Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
1703 option is ignored. They are a convenient notation for the simpler forms of
1704 atomic group. However, there is no difference in the meaning of a possessive
1705 quantifier and the equivalent atomic group, though there may be a performance
1706 difference; possessive quantifiers should be slightly faster.
1707 .P
1708 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
1709 Jeffrey Friedl originated the idea (and the name) in the first edition of his
1710 book. Mike McCloskey liked it, so implemented it when he built Sun's Java
1711 package, and PCRE copied it from there. It ultimately found its way into Perl
1712 at release 5.10.
1713 .P
1714 PCRE has an optimization that automatically "possessifies" certain simple
1715 pattern constructs. For example, the sequence A+B is treated as A++B because
1716 there is no point in backtracking into a sequence of A's when B must follow.
1717 .P
1718 When a pattern contains an unlimited repeat inside a subpattern that can itself
1719 be repeated an unlimited number of times, the use of an atomic group is the
1720 only way to avoid some failing matches taking a very long time indeed. The
1721 pattern
1722 .sp
1723 (\eD+|<\ed+>)*[!?]
1724 .sp
1725 matches an unlimited number of substrings that either consist of non-digits, or
1726 digits enclosed in <>, followed by either ! or ?. When it matches, it runs
1727 quickly. However, if it is applied to
1728 .sp
1729 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1730 .sp
1731 it takes a long time before reporting failure. This is because the string can
1732 be divided between the internal \eD+ repeat and the external * repeat in a
1733 large number of ways, and all have to be tried. (The example uses [!?] rather
1734 than a single character at the end, because both PCRE and Perl have an
1735 optimization that allows for fast failure when a single character is used. They
1736 remember the last single character that is required for a match, and fail early
1737 if it is not present in the string.) If the pattern is changed so that it uses
1738 an atomic group, like this:
1739 .sp
1740 ((?>\eD+)|<\ed+>)*[!?]
1741 .sp
1742 sequences of non-digits cannot be broken, and failure happens quickly.
1743 .
1744 .
1745 .\" HTML <a name="backreferences"></a>
1746 .SH "BACK REFERENCES"
1747 .rs
1748 .sp
1749 Outside a character class, a backslash followed by a digit greater than 0 (and
1750 possibly further digits) is a back reference to a capturing subpattern earlier
1751 (that is, to its left) in the pattern, provided there have been that many
1752 previous capturing left parentheses.
1753 .P
1754 However, if the decimal number following the backslash is less than 10, it is
1755 always taken as a back reference, and causes an error only if there are not
1756 that many capturing left parentheses in the entire pattern. In other words, the
1757 parentheses that are referenced need not be to the left of the reference for
1758 numbers less than 10. A "forward back reference" of this type can make sense
1759 when a repetition is involved and the subpattern to the right has participated
1760 in an earlier iteration.
1761 .P
1762 It is not possible to have a numerical "forward back reference" to a subpattern
1763 whose number is 10 or more using this syntax because a sequence such as \e50 is
1764 interpreted as a character defined in octal. See the subsection entitled
1765 "Non-printing characters"
1766 .\" HTML <a href="#digitsafterbackslash">
1767 .\" </a>
1768 above
1769 .\"
1770 for further details of the handling of digits following a backslash. There is
1771 no such problem when named parentheses are used. A back reference to any
1772 subpattern is possible using named parentheses (see below).
1773 .P
1774 Another way of avoiding the ambiguity inherent in the use of digits following a
1775 backslash is to use the \eg escape sequence. This escape must be followed by an
1776 unsigned number or a negative number, optionally enclosed in braces. These
1777 examples are all identical:
1778 .sp
1779 (ring), \e1
1780 (ring), \eg1
1781 (ring), \eg{1}
1782 .sp
1783 An unsigned number specifies an absolute reference without the ambiguity that
1784 is present in the older syntax. It is also useful when literal digits follow
1785 the reference. A negative number is a relative reference. Consider this
1786 example:
1787 .sp
1788 (abc(def)ghi)\eg{-1}
1789 .sp
1790 The sequence \eg{-1} is a reference to the most recently started capturing
1791 subpattern before \eg, that is, it is equivalent to \e2 in this example.
1792 Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
1793 can be helpful in long patterns, and also in patterns that are created by
1794 joining together fragments that contain references within themselves.
1795 .P
1796 A back reference matches whatever actually matched the capturing subpattern in
1797 the current subject string, rather than anything matching the subpattern
1798 itself (see
1799 .\" HTML <a href="#subpatternsassubroutines">
1800 .\" </a>
1801 "Subpatterns as subroutines"
1802 .\"
1803 below for a way of doing that). So the pattern
1804 .sp
1805 (sens|respons)e and \e1ibility
1806 .sp
1807 matches "sense and sensibility" and "response and responsibility", but not
1808 "sense and responsibility". If caseful matching is in force at the time of the
1809 back reference, the case of letters is relevant. For example,
1810 .sp
1811 ((?i)rah)\es+\e1
1812 .sp
1813 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1814 capturing subpattern is matched caselessly.
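.P
A minimal sketch of the first example, showing that a back reference repeats
the text that was actually captured rather than re-applying the subpattern.
Python's re module is used as a stand-in engine (an assumption; it is not
PCRE). The caseless "rah" example is not reproduced, because Python spells
inline options inside a group differently.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat exactly what group 1 captured in this particular match.
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(f"{subject!r}: {matched}")
```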
1815 .P
1816 There are several different ways of writing back references to named
1817 subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
1818 \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
1819 back reference syntax, in which \eg can be used for both numeric and named
1820 references, is also supported. We could rewrite the above example in any of
1821 the following ways:
1822 .sp
1823 (?<p1>(?i)rah)\es+\ek<p1>
1824 (?'p1'(?i)rah)\es+\ek{p1}
1825 (?P<p1>(?i)rah)\es+(?P=p1)
1826 (?<p1>(?i)rah)\es+\eg{p1}
1827 .sp
1828 A subpattern that is referenced by name may appear in the pattern before or
1829 after the reference.
1830 .P
1831 There may be more than one back reference to the same subpattern. If a
1832 subpattern has not actually been used in a particular match, any back
1833 references to it always fail by default. For example, the pattern
1834 .sp
1835 (a|(bc))\e2
1836 .sp
1837 always fails if it starts to match "a" rather than "bc". However, if the
1838 PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
1839 unset value matches an empty string.
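.P
The following sketch illustrates a named back reference in the Python syntax
and the default behaviour of a reference to an unset group. It runs on
Python's re module, which is assumed to agree with PCRE's default (that is,
non-JavaScript) behaviour here; the subject strings are illustrative.

```python
import re

# Python-style named group and named back reference.
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")
named_ok = named.fullmatch("rah rah") is not None
named_bad = named.fullmatch("rah RAH") is not None

# Group 2 is set only when the "bc" alternative is taken.
unset = re.compile(r"(a|(bc))\2")
took_a = unset.match("abc") is not None      # group 2 unset, so \2 fails
took_bc = unset.match("bcbc") is not None    # group 2 is "bc", so \2 matches

print(named_ok, named_bad, took_a, took_bc)
```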
1840 .P
1841 Because there may be many capturing parentheses in a pattern, all digits
1842 following a backslash are taken as part of a potential back reference number.
1843 If the pattern continues with a digit character, some delimiter must be used to
1844 terminate the back reference. If the PCRE_EXTENDED option is set, this can be
1845 whitespace. Otherwise, the \eg{ syntax or an empty comment (see
1846 .\" HTML <a href="#comments">
1847 .\" </a>
1848 "Comments"
1849 .\"
1850 below) can be used.
1851 .
1852 .SS "Recursive back references"
1853 .rs
1854 .sp
1855 A back reference that occurs inside the parentheses to which it refers fails
1856 when the subpattern is first used, so, for example, (a\e1) never matches.
1857 However, such references can be useful inside repeated subpatterns. For
1858 example, the pattern
1859 .sp
1860 (a|b\e1)+
1861 .sp
1862 matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1863 the subpattern, the back reference matches the character string corresponding
1864 to the previous iteration. In order for this to work, the pattern must be such
1865 that the first iteration does not need to match the back reference. This can be
1866 done using alternation, as in the example above, or by a quantifier with a
1867 minimum of zero.
1868 .P
1869 Back references of this type cause the group that they reference to be treated
1870 as an
1871 .\" HTML <a href="#atomicgroup">
1872 .\" </a>
1873 atomic group.
1874 .\"
1875 Once the whole group has been matched, a subsequent matching failure cannot
1876 cause backtracking into the middle of the group.
1877 .
1878 .
1879 .\" HTML <a name="bigassertions"></a>
1880 .SH ASSERTIONS
1881 .rs
1882 .sp
1883 An assertion is a test on the characters following or preceding the current
1884 matching point that does not actually consume any characters. The simple
1885 assertions coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
1886 .\" HTML <a href="#smallassertions">
1887 .\" </a>
1888 above.
1889 .\"
1890 .P
1891 More complicated assertions are coded as subpatterns. There are two kinds:
1892 those that look ahead of the current position in the subject string, and those
1893 that look behind it. An assertion subpattern is matched in the normal way,
1894 except that it does not cause the current matching position to be changed.
1895 .P
1896 Assertion subpatterns are not capturing subpatterns. If such an assertion
1897 contains capturing subpatterns within it, these are counted for the purposes of
1898 numbering the capturing subpatterns in the whole pattern. However, substring
1899 capturing is carried out only for positive assertions, because it does not make
1900 sense for negative assertions.
1901 .P
1902 For compatibility with Perl, assertion subpatterns may be repeated; though
1903 it makes no sense to assert the same thing several times, the side effect of
1904 capturing parentheses may occasionally be useful. In practice, there are only three
1905 cases:
1906 .sp
1907 (1) If the quantifier is {0}, the assertion is never obeyed during matching.
1908 However, it may contain internal capturing parenthesized groups that are called
1909 from elsewhere via the
1910 .\" HTML <a href="#subpatternsassubroutines">
1911 .\" </a>
1912 subroutine mechanism.
1913 .\"
1914 .sp
1915 (2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
1916 were {0,1}. At run time, the rest of the pattern match is tried with and
1917 without the assertion, the order depending on the greediness of the quantifier.
1918 .sp
1919 (3) If the minimum repetition is greater than zero, the quantifier is ignored.
1920 The assertion is obeyed just once when encountered during matching.
1921 .
1922 .
1923 .SS "Lookahead assertions"
1924 .rs
1925 .sp
1926 Lookahead assertions start with (?= for positive assertions and (?! for
1927 negative assertions. For example,
1928 .sp
1929 \ew+(?=;)
1930 .sp
1931 matches a word followed by a semicolon, but does not include the semicolon in
1932 the match, and
1933 .sp
1934 foo(?!bar)
1935 .sp
1936 matches any occurrence of "foo" that is not followed by "bar". Note that the
1937 apparently similar pattern
1938 .sp
1939 (?!foo)bar
1940 .sp
1941 does not find an occurrence of "bar" that is preceded by something other than
1942 "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
1943 (?!foo) is always true when the next three characters are "bar". A
1944 lookbehind assertion is needed to achieve the other effect.
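.P
The three lookahead patterns above behave as follows in a small sketch. It
uses Python's re module as a stand-in engine (an assumption; it is not PCRE),
and the subject strings are illustrative.

```python
import re

# \w+(?=;) matches a word followed by a semicolon without consuming it.
word = re.search(r"\w+(?=;)", "alpha; beta").group()

# foo(?!bar) finds "foo" only where "bar" does not follow.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar still matches the "bar" in "foobar": at the point where "bar"
# begins, the next three characters are "bar", so (?!foo) succeeds.
bar_match = re.search(r"(?!foo)bar", "foobar")
bar_start = bar_match.start()

print(word, foo_starts, bar_start)
```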
1945 .P
1946 If you want to force a matching failure at some point in a pattern, the most
1947 convenient way to do it is with (?!) because an empty string always matches, so
1948 an assertion that requires there not to be an empty string must always fail.
1949 The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
1950 .
1951 .
1952 .\" HTML <a name="lookbehind"></a>
1953 .SS "Lookbehind assertions"
1954 .rs
1955 .sp
1956 Lookbehind assertions start with (?<= for positive assertions and (?<! for
1957 negative assertions. For example,
1958 .sp
1959 (?<!foo)bar
1960 .sp
1961 does find an occurrence of "bar" that is not preceded by "foo". The contents of
1962 a lookbehind assertion are restricted such that all the strings it matches must
1963 have a fixed length. However, if there are several top-level alternatives, they
1964 do not all have to have the same fixed length. Thus
1965 .sp
1966 (?<=bullock|donkey)
1967 .sp
1968 is permitted, but
1969 .sp
1970 (?<!dogs?|cats?)
1971 .sp
1972 causes an error at compile time. Branches that match different length strings
1973 are permitted only at the top level of a lookbehind assertion. This is an
1974 extension compared with Perl, which requires all branches to match the same
1975 length of string. An assertion such as
1976 .sp
1977 (?<=ab(c|de))
1978 .sp
1979 is not permitted, because its single top-level branch can match two different
1980 lengths, but it is acceptable to PCRE if rewritten to use two top-level
1981 branches:
1982 .sp
1983 (?<=abc|abde)
1984 .sp
1985 In some cases, the escape sequence \eK
1986 .\" HTML <a href="#resetmatchstart">
1987 .\" </a>
1988 (see above)
1989 .\"
1990 can be used instead of a lookbehind assertion to get round the fixed-length
1991 restriction.
1992 .P
1993 The implementation of lookbehind assertions is, for each alternative, to
1994 temporarily move the current position back by the fixed length and then try to
1995 match. If there are insufficient characters before the current position, the
1996 assertion fails.
1997 .P
1998 In a UTF mode, PCRE does not allow the \eC escape (which matches a single data
1999 unit even in a UTF mode) to appear in lookbehind assertions, because it makes
2000 it impossible to calculate the length of the lookbehind. The \eX and \eR
2001 escapes, which can match different numbers of data units, are also not
2002 permitted.
2003 .P
2004 .\" HTML <a href="#subpatternsassubroutines">
2005 .\" </a>
2006 "Subroutine"
2007 .\"
2008 calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2009 as the subpattern matches a fixed-length string.
2010 .\" HTML <a href="#recursion">
2011 .\" </a>
2012 Recursion,
2013 .\"
2014 however, is not supported.
2015 .P
2016 Possessive quantifiers can be used in conjunction with lookbehind assertions to
2017 specify efficient matching of fixed-length strings at the end of subject
2018 strings. Consider a simple pattern such as
2019 .sp
2020 abcd$
2021 .sp
2022 when applied to a long string that does not match. Because matching proceeds
2023 from left to right, PCRE will look for each "a" in the subject and then see if
2024 what follows matches the rest of the pattern. If the pattern is specified as
2025 .sp
2026 ^.*abcd$
2027 .sp
2028 the initial .* matches the entire string at first, but when this fails (because
2029 there is no following "a"), it backtracks to match all but the last character,
2030 then all but the last two characters, and so on. Once again the search for "a"
2031 covers the entire string, from right to left, so we are no better off. However,
2032 if the pattern is written as
2033 .sp
2034 ^.*+(?<=abcd)
2035 .sp
2036 there can be no backtracking for the .*+ item; it can match only the entire
2037 string. The subsequent lookbehind assertion does a single test on the last four
2038 characters. If it fails, the match fails immediately. For long strings, this
2039 approach makes a significant difference to the processing time.
2040 .
2041 .
2042 .SS "Using multiple assertions"
2043 .rs
2044 .sp
2045 Several assertions (of any sort) may occur in succession. For example,
2046 .sp
2047 (?<=\ed{3})(?<!999)foo
2048 .sp
2049 matches "foo" preceded by three digits that are not "999". Notice that each of
2050 the assertions is applied independently at the same point in the subject
2051 string. First there is a check that the previous three characters are all
2052 digits, and then there is a check that the same three characters are not "999".
2053 This pattern does \fInot\fP match "foo" preceded by six characters, the first
2054 three of which are digits and the last three of which are not "999". For example, it
2055 doesn't match "123abcfoo". A pattern to do that is
2056 .sp
2057 (?<=\ed{3}...)(?<!999)foo
2058 .sp
2059 This time the first assertion looks at the preceding six characters, checking
2060 that the first three are digits, and then the second assertion checks that the
2061 preceding three characters are not "999".
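.P
The difference between the two patterns can be seen in the sketch below, which
uses Python's re module as a stand-in engine (an assumption; it is not PCRE).
The names "adjacent" and "offset" are illustrative.

```python
import re

# Both assertions inspect the same three characters just before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters; the second still
# inspects only the three characters just before "foo".
offset = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None

adjacent_results = [found(adjacent, s) for s in ("123foo", "999foo", "123abcfoo")]
offset_results = [found(offset, s) for s in ("123abcfoo", "123999foo", "123foo")]

print(adjacent_results, offset_results)
```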
2062 .P
2063 Assertions can be nested in any combination. For example,
2064 .sp
2065 (?<=(?<!foo)bar)baz
2066 .sp
2067 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
2068 preceded by "foo", while
2069 .sp
2070 (?<=\ed{3}(?!999)...)foo
2071 .sp
2072 is another pattern that matches "foo" preceded by three digits and any three
2073 characters that are not "999".
2074 .
2075 .
2076 .\" HTML <a name="conditions"></a>
2077 .SH "CONDITIONAL SUBPATTERNS"
2078 .rs
2079 .sp
2080 It is possible to cause the matching process to obey a subpattern
2081 conditionally or to choose between two alternative subpatterns, depending on
2082 the result of an assertion, or whether a specific capturing subpattern has
2083 already been matched. The two possible forms of conditional subpattern are:
2084 .sp
2085 (?(condition)yes-pattern)
2086 (?(condition)yes-pattern|no-pattern)
2087 .sp
2088 If the condition is satisfied, the yes-pattern is used; otherwise the
2089 no-pattern (if present) is used. If there are more than two alternatives in the
2090 subpattern, a compile-time error occurs. Each of the two alternatives may
2091 itself contain nested subpatterns of any form, including conditional
2092 subpatterns; the restriction to two alternatives applies only at the level of
2093 the condition. This pattern fragment is an example where the alternatives are
2094 complex:
2095 .sp
2096 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2097 .sp
2098 .P
2099 There are four kinds of condition: references to subpatterns, references to
2100 recursion, a pseudo-condition called DEFINE, and assertions.
2101 .
2102 .SS "Checking for a used subpattern by number"
2103 .rs
2104 .sp
2105 If the text between the parentheses consists of a sequence of digits, the
2106 condition is true if a capturing subpattern of that number has previously
2107 matched. If there is more than one capturing subpattern with the same number
2108 (see the earlier
2109 .\"
2110 .\" HTML <a href="#recursion">
2111 .\" </a>
2112 section about duplicate subpattern numbers),
2113 .\"
2114 the condition is true if any of them have matched. An alternative notation is
2115 to precede the digits with a plus or minus sign. In this case, the subpattern
2116 number is relative rather than absolute. The most recently opened parentheses
2117 can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
2118 loops it can also make sense to refer to subsequent groups. The next
2119 parentheses to be opened can be referenced as (?(+1), and so on. (The value
2120 zero in any of these forms is not used; it provokes a compile-time error.)
2121 .P
2122 Consider the following pattern, which contains non-significant white space to
2123 make it more readable (assume the PCRE_EXTENDED option) and to divide it into
2124 three parts for ease of discussion:
2125 .sp
2126 ( \e( )? [^()]+ (?(1) \e) )
2127 .sp
2128 The first part matches an optional opening parenthesis, and if that
2129 character is present, sets it as the first captured substring. The second part
2130 matches one or more characters that are not parentheses. The third part is a
2131 conditional subpattern that tests whether or not the first set of parentheses
2132 matched. If they did, that is, if the subject started with an opening parenthesis,
2133 the condition is true, and so the yes-pattern is executed and a closing
2134 parenthesis is required. Otherwise, since no-pattern is not present, the
2135 subpattern matches nothing. In other words, this pattern matches a sequence of
2136 non-parentheses, optionally enclosed in parentheses.
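.P
A minimal sketch of this conditional pattern, run on Python's re module as a
stand-in engine (an assumption; it is not PCRE). The re.VERBOSE flag plays the
role of PCRE_EXTENDED, and the whole subject is required to match so that an
unbalanced parenthesis shows up as a failure.

```python
import re

# Group 1 captures an optional "("; the condition (?(1) ...) demands a
# closing ")" only if group 1 took part in the match.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(f"{subject!r}: {matched}")
```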
2137 .P
2138 If you were embedding this pattern in a larger one, you could use a relative
2139 reference:
2140 .sp
2141 ...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ...
2142 .sp
2143 This makes the fragment independent of the parentheses in the larger pattern.
2144 .
2145 .SS "Checking for a used subpattern by name"
2146 .rs
2147 .sp
2148 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2149 subpattern by name. For compatibility with earlier versions of PCRE, which had
2150 this facility before Perl, the syntax (?(name)...) is also recognized. However,
2151 there is a possible ambiguity with this syntax, because subpattern names may
2152 consist entirely of digits. PCRE looks first for a named subpattern; if it
2153 cannot find one and the name consists entirely of digits, PCRE looks for a
2154 subpattern of that number, which must be greater than zero. Using subpattern
2155 names that consist entirely of digits is not recommended.
2156 .P
2157 Rewriting the above example to use a named subpattern gives this:
2158 .sp
2159 (?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
2160 .sp
2161 If the name used in a condition of this kind is a duplicate, the test is
2162 applied to all subpatterns of the same name, and is true if any one of them has
2163 matched.
2164 .
2165 .SS "Checking for pattern recursion"
2166 .rs
2167 .sp
2168 If the condition is the string (R), and there is no subpattern with the name R,
2169 the condition is true if a recursive call to the whole pattern or any
2170 subpattern has been made. If digits or a name preceded by ampersand follow the
2171 letter R, for example:
2172 .sp
2173 (?(R3)...) or (?(R&name)...)
2174 .sp
2175 the condition is true if the most recent recursion is into a subpattern whose
2176 number or name is given. This condition does not check the entire recursion
2177 stack. If the name used in a condition of this kind is a duplicate, the test is
2178 applied to all subpatterns of the same name, and is true if any one of them is
2179 the most recent recursion.
2180 .P
2181 At "top level", all these recursion test conditions are false.
2182 .\" HTML <a href="#recursion">
2183 .\" </a>
2184 The syntax for recursive patterns
2185 .\"
2186 is described below.
2187 .
2188 .\" HTML <a name="subdefine"></a>
2189 .SS "Defining subpatterns for use by reference only"
2190 .rs
2191 .sp
2192 If the condition is the string (DEFINE), and there is no subpattern with the
2193 name DEFINE, the condition is always false. In this case, there may be only one
2194 alternative in the subpattern. It is always skipped if control reaches this
2195 point in the pattern; the idea of DEFINE is that it can be used to define
2196 subroutines that can be referenced from elsewhere. (The use of
2197 .\" HTML <a href="#subpatternsassubroutines">
2198 .\" </a>
2199 subroutines
2200 .\"
2201 is described below.) For example, a pattern to match an IPv4 address such as
2202 "192.168.23.245" could be written like this (ignore whitespace and line
2203 breaks):
2204 .sp
2205 (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2206 \eb (?&byte) (\e.(?&byte)){3} \eb
2207 .sp
2208 The first part of the pattern is a DEFINE group inside which another group
2209 named "byte" is defined. This matches an individual component of an IPv4
2210 address (a number less than 256). When matching takes place, this part of the
2211 pattern is skipped because DEFINE acts like a false condition. The rest of the
2212 pattern uses references to the named group to match the four dot-separated
2213 components of an IPv4 address, insisting on a word boundary at each end.
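.P
Where an engine has no DEFINE groups, a similar effect can be approximated by building the pattern from a shared fragment. The sketch below uses Python's standard re module, which does not support (?(DEFINE)...) or (?&name); the names BYTE, IPV4 and is_ipv4 are hypothetical and exist only for this illustration.

```python
import re

# BYTE holds the text of the "byte" subpattern; inserting it twice stands in
# for the two (?&byte) subroutine calls, which Python's re does not offer.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if text as a whole is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

The DEFINE form keeps the definition inside a single pattern string, whereas this approach needs the host language to assemble the pattern.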
2214 .
2215 .SS "Assertion conditions"
2216 .rs
2217 .sp
2218 If the condition is not in any of the above formats, it must be an assertion.
2219 This may be a positive or negative lookahead or lookbehind assertion. Consider
2220 this pattern, again containing non-significant white space, and with the two
2221 alternatives on the second line:
2222 .sp
2223 (?(?=[^a-z]*[a-z])
2224 \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
2225 .sp
2226 The condition is a positive lookahead assertion that matches an optional
2227 sequence of non-letters followed by a letter. In other words, it tests for the
2228 presence of at least one letter in the subject. If a letter is found, the
2229 subject is matched against the first alternative; otherwise it is matched
2230 against the second. This pattern matches strings in one of the two forms
2231 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
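.P
An assertion condition can always be spelled out as two alternatives, one guarded by the assertion and the other by its negation. The following sketch does this in Python's standard re module, which does not accept an assertion as a condition; it is an equivalent rewrite for illustration, not PCRE syntax, and the helper name matches() is made up.

```python
import re

# (?(?=X) YES | NO) rewritten as (?: (?=X) YES | (?!X) NO ).
pattern = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2} )""",
    re.VERBOSE,
)


def matches(subject):
    """Return True if the whole subject has the form dd-aaa-dd or dd-dd-dd."""
    return pattern.fullmatch(subject) is not None
```

The conditional form evaluates the assertion once, while the rewrite may evaluate it twice when the first alternative fails.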
2232 .
2233 .
2234 .\" HTML <a name="comments"></a>
2235 .SH COMMENTS
2236 .rs
2237 .sp
2238 There are two ways of including comments in patterns that are processed by
2239 PCRE. In both cases, the start of the comment must not be in a character class,
2240 nor in the middle of any other sequence of related characters such as (?: or a
2241 subpattern name or number. The characters that make up a comment play no part
2242 in the pattern matching.
2243 .P
2244 The sequence (?# marks the start of a comment that continues up to the next
2245 closing parenthesis. Nested parentheses are not permitted. If the PCRE_EXTENDED
2246 option is set, an unescaped # character also introduces a comment, which in
2247 this case continues to immediately after the next newline character or
2248 character sequence in the pattern. Which characters are interpreted as newlines
2249 is controlled by the options passed to a compiling function or by a special
2250 sequence at the start of the pattern, as described in the section entitled
2251 .\" HTML <a href="#newlines">
2252 .\" </a>
2253 "Newline conventions"
2254 .\"
2255 above. Note that the end of this type of comment is a literal newline sequence
2256 in the pattern; escape sequences that happen to represent a newline do not
2257 count. For example, consider this pattern when PCRE_EXTENDED is set, and the
2258 default newline convention is in force:
2259 .sp
2260 abc #comment \en still comment
2261 .sp
2262 On encountering the # character, \fBpcre_compile()\fP skips along, looking for
2263 a newline in the pattern. The sequence \en is still literal at this stage, so
2264 it does not terminate the comment. Only an actual character with the code value
2265 0x0a (the default newline) does so.
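.P
Python's standard re module follows a comparable rule in its verbose mode, which makes the distinction easy to try out. The sketch below is not PCRE; it only shows that an escape sequence written in the pattern text does not end a # comment, while a real newline character does.

```python
import re

# Raw string: the pattern text holds a backslash followed by "n", not a
# newline character, so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" becomes a real newline (0x0a) in the pattern text,
# so the comment stops there and "still" is part of the pattern.
literal = re.compile("abc #comment \n still", re.VERBOSE)
```

The first pattern therefore matches just "abc", and the second matches "abcstill".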
2266 .
2267 .
2268 .\" HTML <a name="recursion"></a>
2269 .SH "RECURSIVE PATTERNS"
2270 .rs
2271 .sp
2272 Consider the problem of matching a string in parentheses, allowing for
2273 unlimited nested parentheses. Without the use of recursion, the best that can
2274 be done is to use a pattern that matches up to some fixed depth of nesting. It
2275 is not possible to handle an arbitrary nesting depth.
2276 .P
2277 For some time, Perl has provided a facility that allows regular expressions to
2278 recurse (amongst other things). It does this by interpolating Perl code in the
2279 expression at run time, and the code can refer to the expression itself. A Perl
2280 pattern using code interpolation to solve the parentheses problem can be
2281 created like this:
2282 .sp
2283 $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x;
2284 .sp
2285 The (?p{...}) item interpolates Perl code at run time, and in this case refers
2286 recursively to the pattern in which it appears.
2287 .P
2288 Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
2289 supports special syntax for recursion of the entire pattern, and also for
2290 individual subpattern recursion. After its introduction in PCRE and Python,
2291 this kind of recursion was subsequently introduced into Perl at release 5.10.
2292 .P
2293 A special item that consists of (? followed by a number greater than zero and a
2294 closing parenthesis is a recursive subroutine call of the subpattern of the
2295 given number, provided that it occurs inside that subpattern. (If not, it is a
2296 .\" HTML <a href="#subpatternsassubroutines">
2297 .\" </a>
2298 non-recursive subroutine
2299 .\"
2300 call, which is described in the next section.) The special item (?R) or (?0) is
2301 a recursive call of the entire regular expression.
2302 .P
2303 This PCRE pattern solves the nested parentheses problem (assume the
2304 PCRE_EXTENDED option is set so that white space is ignored):
2305 .sp
2306 \e( ( [^()]++ | (?R) )* \e)
2307 .sp
2308 First it matches an opening parenthesis. Then it matches any number of
2309 substrings which can either be a sequence of non-parentheses, or a recursive
2310 match of the pattern itself (that is, a correctly parenthesized substring).
2311 Finally there is a closing parenthesis. Note the use of a possessive quantifier
2312 to avoid backtracking into sequences of non-parentheses.
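.P
The behaviour of this pattern can be mirrored by a small hand-written recursive function. This is a sketch of what the pattern accepts, not of how PCRE implements recursion; the function names are hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the end offset of a balanced group starting at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                              # the opening \( must match
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                     # the closing \)
        if ch == "(":
            pos = match_parens(subject, pos)   # (?R): recurse on a nested group
            if pos < 0:
                return -1
        else:
            pos += 1                           # [^()]++ consumed one character at a time
    return -1                                  # subject ended before the closing \)


def is_balanced_group(subject):
    """Return True if the whole subject is one correctly parenthesized group."""
    return match_parens(subject) == len(subject)
```

Because the function never revisits characters it has consumed, it fails quickly on unbalanced input, which is the effect the possessive quantifier has in the pattern.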
2313 .P
2314 If this were part of a larger pattern, you would not want to recurse the entire
2315 pattern, so instead you could use this:
2316 .sp
2317 ( \e( ( [^()]++ | (?1) )* \e) )
2318 .sp
2319 We have put the pattern into parentheses, and caused the recursion to refer to
2320 them instead of the whole pattern.
2321 .P
2322 In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2323 is made easier by the use of relative references. Instead of (?1) in the
2324 pattern above you can write (?-2) to refer to the second most recently opened
2325 parentheses preceding the recursion. In other words, a negative number counts
2326 capturing parentheses leftwards from the point at which it is encountered.
2327 .P
2328 It is also possible to refer to subsequently opened parentheses, by writing
2329 references such as (?+2). However, these cannot be recursive because the
2330 reference is not inside the parentheses that are referenced. They are always
2331 .\" HTML <a href="#subpatternsassubroutines">
2332 .\" </a>
2333 non-recursive subroutine
2334 .\"
2335 calls, as described in the next section.
2336 .P
2337 An alternative approach is to use named parentheses instead. The Perl syntax
2338 for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
2339 could rewrite the above example as follows:
2340 .sp
2341 (?<pn> \e( ( [^()]++ | (?&pn) )* \e) )
2342 .sp
2343 If there is more than one subpattern with the same name, the earliest one is
2344 used.
2345 .P
2346 This particular example pattern that we have been looking at contains nested
2347 unlimited repeats, and so the use of a possessive quantifier for matching
2348 strings of non-parentheses is important when applying the pattern to strings
2349 that do not match. For example, when this pattern is applied to
2350 .sp
2351 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2352 .sp
2353 it yields "no match" quickly. However, if a possessive quantifier is not used,
2354 the match runs for a very long time indeed because there are so many different
2355 ways the + and * repeats can carve up the subject, and all have to be tested
2356 before failure can be reported.
2357 .P
2358 At the end of a match, the values of capturing parentheses are those from
2359 the outermost level. If you want to obtain intermediate values, a callout
2360 function can be used (see below and the
2361 .\" HREF
2362 \fBpcrecallout\fP
2363 .\"
2364 documentation). If the pattern above is matched against
2365 .sp
2366 (ab(cd)ef)
2367 .sp
2368 the value for the inner capturing parentheses (numbered 2) is "ef", which is
2369 the last value taken on at the top level. If a capturing subpattern is not
2370 matched at the top level, its final captured value is unset, even if it was
2371 (temporarily) set at a deeper level during the matching process.
2372 .P
2373 If there are more than 15 capturing parentheses in a pattern, PCRE has to
2374 obtain extra memory to store data during a recursion, which it does by using
2375 \fBpcre_malloc\fP, freeing it via \fBpcre_free\fP afterwards. If no memory can
2376 be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
2377 .P
2378 Do not confuse the (?R) item with the condition (R), which tests for recursion.
2379 Consider this pattern, which matches text in angle brackets, allowing for
2380 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
2381 recursing), whereas any characters are permitted at the outer level.
2382 .sp
2383 < (?: (?(R) \ed++ | [^<>]*+) | (?R)) * >
2384 .sp
2385 In this pattern, (?(R) is the start of a conditional subpattern, with two
2386 different alternatives for the recursive and non-recursive cases. The (?R) item
2387 is the actual recursive call.
2388 .
2389 .
2390 .\" HTML <a name="recursiondifference"></a>
2391 .SS "Differences in recursion processing between PCRE and Perl"
2392 .rs
2393 .sp
2394 Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2395 (like Python, but unlike Perl), a recursive subpattern call is always treated
2396 as an atomic group. That is, once it has matched some of the subject string, it
2397 is never re-entered, even if it contains untried alternatives and there is a
2398 subsequent matching failure. This can be illustrated by the following pattern,
2399 which purports to match a palindromic string that contains an odd number of
2400 characters (for example, "a", "aba", "abcba", "abcdcba"):
2401 .sp
2402 ^(.|(.)(?1)\e2)$
2403 .sp
2404 The idea is that it either matches a single character, or two identical
2405 characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2406 it does not if the subject string is longer than three characters. Consider the
2407 subject string "abcba":
2408 .P
2409 At the top level, the first character is matched, but as it is not at the end
2410 of the string, the first alternative fails; the second alternative is taken
2411 and the recursion kicks in. The recursive call to subpattern 1 successfully
2412 matches the next character ("b"). (Note that the beginning and end of line
2413 tests are not part of the recursion.)
2414 .P
2415 Back at the top level, the next character ("c") is compared with what
2416 subpattern 2 matched, which was "a". This fails. Because the recursion is
2417 treated as an atomic group, there are now no backtracking points, and so the
2418 entire match fails. (Perl is able, at this point, to re-enter the recursion and
2419 try the second alternative.) However, if the pattern is written with the
2420 alternatives in the other order, things are different:
2421 .sp
2422 ^((.)(?1)\e2|.)$
2423 .sp
2424 This time, the recursing alternative is tried first, and continues to recurse
2425 until it runs out of characters, at which point the recursion fails. But this
2426 time we do have another alternative to try at the higher level. That is the big
2427 difference: in the previous case the remaining alternative is at a deeper
2428 recursion level, which PCRE cannot use.
2429 .P
2430 To change the pattern so that it matches all palindromic strings, not just
2431 those with an odd number of characters, it is tempting to change the pattern to
2432 this:
2433 .sp
2434 ^((.)(?1)\e2|.?)$
2435 .sp
2436 Again, this works in Perl, but not in PCRE, and for the same reason. When a
2437 deeper recursion has matched a single character, it cannot be entered again in
2438 order to match an empty string. The solution is to separate the two cases, and
2439 write out the odd and even cases as alternatives at the higher level:
2440 .sp
2441 ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2442 .sp
2443 If you want to match typical palindromic phrases, the pattern has to ignore all
2444 non-word characters, which can be done like this:
2445 .sp
2446 ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
2447 .sp
2448 If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2449 man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2450 the use of the possessive quantifier *+ to avoid backtracking into sequences of
2451 non-word characters. Without this, PCRE takes a great deal longer (ten times or
2452 more) to match typical phrases, and Perl takes so long that you think it has
2453 gone into a loop.
2454 .P
2455 \fBWARNING\fP: The palindrome-matching patterns above work only if the subject
2456 string does not start with a palindrome that is shorter than the entire string.
2457 For example, although "abcba" is correctly matched, if the subject is "ababa",
2458 PCRE finds the palindrome "aba" at the start, then fails at top level because
2459 the end of the string does not follow. Once again, it cannot jump back into the
2460 recursion to try other alternatives, so the entire match fails.
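.P
The difference between an atomic call and a call that can be re-entered can be imitated by hand. In the sketch below, an atomic call returns only the first end offset it finds, whereas the Perl-style version yields every possible end offset. All function names are hypothetical; the code assumes subjects without newlines and treats $ as the very end of the subject.

```python
def perl_ends(s, pos):
    # Every end offset for group 1 of ^(.|(.)(?1)\2)$ with backtracking allowed.
    if pos < len(s):
        yield pos + 1                            # alternative 1: a single character
        for end in perl_ends(s, pos + 1):        # alternative 2: (.)(?1)\2
            if end < len(s) and s[end] == s[pos]:
                yield end + 1


def perl_style(s):
    return any(end == len(s) for end in perl_ends(s, 0))


def atomic_call_v1(s, pos):
    # Atomic call into (.|(.)(?1)\2): the first alternative always succeeds
    # when a character is available, and nothing else is ever tried.
    return pos + 1 if pos < len(s) else None


def atomic_call_v2(s, pos):
    # Atomic call into ((.)(?1)\2|.): the recursing alternative is tried first.
    if pos >= len(s):
        return None
    end = atomic_call_v2(s, pos + 1)
    if end is not None and end < len(s) and s[end] == s[pos]:
        return end + 1
    return pos + 1                               # fall back to a single character


def _top_level(s, call):
    # Top-level attempt at (.)(?1)\2 followed by the end of the subject.
    if not s:
        return False
    end = call(s, 1)
    return end is not None and end < len(s) and s[end] == s[0] and end + 1 == len(s)


def pcre_style_v1(s):
    # ^(.|(.)(?1)\2)$ : the top level can try both alternatives, a call cannot.
    return len(s) == 1 or _top_level(s, atomic_call_v1)


def pcre_style_v2(s):
    # ^((.)(?1)\2|.)$
    return _top_level(s, atomic_call_v2) or len(s) == 1
```

With these definitions, "abcba" is rejected by the first atomic form and accepted by the reordered one, while "ababa" is rejected by the reordered form because the call settles on "aba" and cannot be re-entered.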
2461 .P
2462 The second way in which PCRE and Perl differ in their recursion processing is
2463 in the handling of captured values. In Perl, when a subpattern is called
2464 recursively or as a subroutine (see the next section), it has no access to any
2465 values that were captured outside the recursion, whereas in PCRE these values
2466 can be referenced. Consider this pattern:
2467 .sp
2468 ^(.)(\e1|a(?2))
2469 .sp
2470 In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2471 then in the second group, when the back reference \e1 fails to match "a", the
2472 second alternative matches "a" and then recurses. In the recursion, \e1 does
2473 now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2474 match because inside the recursive call \e1 cannot access the externally set
2475 value.
2476 .
2477 .
2478 .\" HTML <a name="subpatternsassubroutines"></a>
2479 .SH "SUBPATTERNS AS SUBROUTINES"
2480 .rs
2481 .sp
2482 If the syntax for a recursive subpattern call (either by number or by
2483 name) is used outside the parentheses to which it refers, it operates like a
2484 subroutine in a programming language. The called subpattern may be defined
2485 before or after the reference. A numbered reference can be absolute or
2486 relative, as in these examples:
2487 .sp
2488 (...(absolute)...)...(?2)...
2489 (...(relative)...)...(?-1)...
2490 (...(?+1)...(relative)...
2491 .sp
2492 An earlier example pointed out that the pattern
2493 .sp
2494 (sens|respons)e and \e1ibility
2495 .sp
2496 matches "sense and sensibility" and "response and responsibility", but not
2497 "sense and responsibility". If instead the pattern
2498 .sp
2499 (sens|respons)e and (?1)ibility
2500 .sp
2501 is used, it does match "sense and responsibility" as well as the other two
2502 strings. Another example is given in the discussion of DEFINE above.
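.P
The contrast between a back reference and a subroutine call can be approximated in an engine that lacks (?1) by repeating the subpattern text. The sketch below uses Python's standard re module; STEM is a hypothetical name for the shared fragment, and the second pattern is an emulation rather than a real subroutine call.

```python
import re

STEM = r"(?:sens|respons)"   # text of the subpattern that (?1) would call

# Back reference: the second word must repeat what group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine-like reuse: the subpattern is matched afresh at the second position.
reuse = re.compile(rf"{STEM}e and {STEM}ibility")
```

Only the second pattern accepts "sense and responsibility", because it re-runs the alternation instead of requiring the same text again.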
2503 .P
2504 All subroutine calls, whether recursive or not, are always treated as atomic
2505 groups. That is, once a subroutine has matched some of the subject string, it
2506 is never re-entered, even if it contains untried alternatives and there is a
2507 subsequent matching failure. Any capturing parentheses that are set during the
2508 subroutine call revert to their previous values afterwards.
2509 .P
2510 Processing options such as case-independence are fixed when a subpattern is
2511 defined, so if it is used as a subroutine, such options cannot be changed for
2512 different calls. For example, consider this pattern:
2513 .sp
2514 (abc)(?i:(?-1))
2515 .sp
2516 It matches "abcabc". It does not match "abcABC" because the change of
2517 processing option does not affect the called subpattern.
2518 .
2519 .
2520 .\" HTML <a name="onigurumasubroutines"></a>
2521 .SH "ONIGURUMA SUBROUTINE SYNTAX"
2522 .rs
2523 .sp
2524 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
2525 a number enclosed either in angle brackets or single quotes, is an alternative
2526 syntax for referencing a subpattern as a subroutine, possibly recursively. Here
2527 are two of the examples used above, rewritten using this syntax:
2528 .sp
2529 (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
2530 (sens|respons)e and \eg'1'ibility
2531 .sp
2532 PCRE supports an extension to Oniguruma: if a number is preceded by a
2533 plus or a minus sign it is taken as a relative reference. For example:
2534 .sp
2535 (abc)(?i:\eg<-1>)
2536 .sp
2537 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
2538 synonymous. The former is a back reference; the latter is a subroutine call.
2539 .
2540 .
2541 .SH CALLOUTS
2542 .rs
2543 .sp
2544 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
2545 code to be obeyed in the middle of matching a regular expression. This makes it
2546 possible, amongst other things, to extract different substrings that match the
2547 same pair of parentheses when there is a repetition.
2548 .P
2549 PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
2550 code. The feature is called "callout". The caller of PCRE provides an external
2551 function by putting its entry point in the global variable \fIpcre_callout\fP
2552 (8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this
2553 variable contains NULL, which disables all calling out.
2554 .P
2555 Within a regular expression, (?C) indicates the points at which the external
2556 function is to be called. If you want to identify different callout points, you
2557 can put a number less than 256 after the letter C. The default value is zero.
2558 For example, this pattern has two callout points:
2559 .sp
2560 (?C1)abc(?C2)def
2561 .sp
2562 If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are
2563 automatically installed before each item in the pattern. They are all numbered
2564 255.
2565 .P
2566 During matching, when PCRE reaches a callout point, the external function is
2567 called. It is provided with the number of the callout, the position in the
2568 pattern, and, optionally, one item of data originally supplied by the caller of
2569 the matching function. The callout function may cause matching to proceed, to
2570 backtrack, or to fail altogether. A complete description of the interface to
2571 the callout function is given in the
2572 .\" HREF
2573 \fBpcrecallout\fP
2574 .\"
2575 documentation.
2576 .
2577 .
2578 .\" HTML <a name="backtrackcontrol"></a>
2579 .SH "BACKTRACKING CONTROL"
2580 .rs
2581 .sp
2582 Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
2583 are described in the Perl documentation as "experimental and subject to change
2584 or removal in a future version of Perl". It goes on to say: "Their usage in
2585 production code should be noted to avoid problems during upgrades." The same
2586 remarks apply to the PCRE features described in this section.
2587 .P
2588 Since these verbs are specifically related to backtracking, most of them can be
2589 used only when the pattern is to be matched using one of the traditional
2590 matching functions, which use a backtracking algorithm. With the exception of
2591 (*FAIL), which behaves like a failing negative assertion, they cause an error
2592 if encountered by a DFA matching function.
2593 .P
2594 If any of these verbs are used in an assertion or in a subpattern that is
2595 called as a subroutine (whether or not recursively), their effect is confined
2596 to that subpattern; it does not extend to the surrounding pattern, with one
2597 exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
2598 a successful positive assertion \fIis\fP passed back when a match succeeds
2599 (compare capturing parentheses in assertions). Note that such subpatterns are
2600 processed as anchored at the point where they are tested. Note also that Perl's
2601 treatment of subroutines is different in some cases.
2602 .P
2603 The new verbs make use of what was previously invalid syntax: an opening
2604 parenthesis followed by an asterisk. They are generally of the form
2605 (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2606 depending on whether or not an argument is present. A name is any sequence of
2607 characters that does not include a closing parenthesis. If the name is empty,
2608 that is, if the closing parenthesis immediately follows the colon, the effect
2609 is as if the colon were not there. Any number of these verbs may occur in a
2610 pattern.
2611 .
2612 .
2613 .\" HTML <a name="nooptimize"></a>
2614 .SS "Optimizations that affect backtracking verbs"
2615 .rs
2616 .sp
2617 PCRE contains some optimizations that are used to speed up matching by running
2618 some checks at the start of each match attempt. For example, it may know the
2619 minimum length of matching subject, or that a particular character must be
2620 present. When one of these optimizations suppresses the running of a match, any
2621 included backtracking verbs will not, of course, be processed. You can suppress
2622 the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
2623 when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
2624 pattern with (*NO_START_OPT). There is more discussion of this option in the
2625 section entitled
2626 .\" HTML <a href="pcreapi.html#execoptions">
2627 .\" </a>
2628 "Option bits for \fBpcre_exec()\fP"
2629 .\"
2630 in the
2631 .\" HREF
2632 \fBpcreapi\fP
2633 .\"
2634 documentation.
2635 .P
2636 Experiments with Perl suggest that it too has similar optimizations, sometimes
2637 leading to anomalous results.
2638 .
2639 .
2640 .SS "Verbs that act immediately"
2641 .rs
2642 .sp
2643 The following verbs act as soon as they are encountered. They may not be
2644 followed by a name.
2645 .sp
2646 (*ACCEPT)
2647 .sp
2648 This verb causes the match to end successfully, skipping the remainder of the
2649 pattern. However, when it is inside a subpattern that is called as a
2650 subroutine, only that subpattern is ended successfully. Matching then continues
2651 at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
2652 far is captured. For example:
2653 .sp
2654 A((?:A|B(*ACCEPT)|C)D)
2655 .sp
2656 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2657 the outer parentheses.
2658 .sp
2659 (*FAIL) or (*F)
2660 .sp
2661 This verb causes a matching failure, forcing backtracking to occur. It is
2662 equivalent to (?!) but easier to read. The Perl documentation notes that it is
2663 probably useful only when combined with (?{}) or (??{}). Those are, of course,
2664 Perl features that are not present in PCRE. The nearest equivalent is the
2665 callout feature, as for example in this pattern:
2666 .sp
2667 a+(?C)(*FAIL)
2668 .sp
2669 A match with the string "aaaa" always fails, but the callout is taken before
2670 each backtrack happens (in this example, 10 times).
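.P
The figure of 10 can be checked with a little arithmetic: at each starting position the callout is reached once for every length that a+ can take. The sketch below counts this directly, assuming a plain advance through the starting positions with no start-of-match optimizations; count_callouts is a hypothetical name.

```python
def count_callouts(subject):
    """Count how often (?C) is reached when a+(?C)(*FAIL) is run on subject."""
    calls = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # the callout is reached once for each length from run down to 1.
        calls += run
    return calls
```

For "aaaa" the runs at the four starting positions have lengths 4, 3, 2 and 1, giving 4 + 3 + 2 + 1 = 10 callouts.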
2671 .
2672 .
2673 .SS "Recording which path was taken"
2674 .rs
2675 .sp
2676 There is one verb whose main purpose is to track how a match was arrived at,
2677 though it also has a secondary use in conjunction with advancing the match
2678 starting point (see (*SKIP) below).
2679 .sp
2680 (*MARK:NAME) or (*:NAME)
2681 .sp
2682 A name is always required with this verb. There may be as many instances of
2683 (*MARK) as you like in a pattern, and their names do not have to be unique.
2684 .P
2685 When a match succeeds, the name of the last-encountered (*MARK) on the matching
2686 path is passed back to the caller as described in the section entitled
2687 .\" HTML <a href="pcreapi.html#extradata">
2688 .\" </a>
2689 "Extra data for \fBpcre_exec()\fP"
2690 .\"
2691 in the
2692 .\" HREF
2693 \fBpcreapi\fP
2694 .\"
2695 documentation. Here is an example of \fBpcretest\fP output, where the /K
2696 modifier requests the retrieval and outputting of (*MARK) data:
2697 .sp
2698 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2699 data> XY
2700 0: XY
2701 MK: A
2702 XZ
2703 0: XZ
2704 MK: B
2705 .sp
2706 The (*MARK) name is tagged with "MK:" in this output, and in this example it
2707 indicates which of the two alternatives matched. This is a more efficient way
2708 of obtaining this information than putting each alternative in its own
2709 capturing parentheses.
2710 .P
2711 If (*MARK) is encountered in a positive assertion, its name is recorded and
2712 passed back if it is the last-encountered. This does not happen for negative
2713 assertions.
2714 .P
2715 After a partial match or a failed match, the name of the last encountered
2716 (*MARK) in the entire match process is returned. For example:
2717 .sp
2718 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2719 data> XP
2720 No match, mark = B
2721 .sp
2722 Note that in this unanchored example the mark is retained from the match
2723 attempt that started at the letter "X" in the subject. Subsequent match
2724 attempts starting at "P" and then with an empty string do not get as far as the
2725 (*MARK) item, but nevertheless do not reset it.
2726 .P
2727 If you are interested in (*MARK) values after failed matches, you should
2728 probably set the PCRE_NO_START_OPTIMIZE option
2729 .\" HTML <a href="#nooptimize">
2730 .\" </a>
2731 (see above)
2732 .\"
2733 to ensure that the match is always attempted.
2734 .
2735 .
2736 .SS "Verbs that act after backtracking"
2737 .rs
2738 .sp
2739 The following verbs do nothing when they are encountered. Matching continues
2740 with what follows, but if there is no subsequent match, causing a backtrack to
2741 the verb, a failure is forced. That is, backtracking cannot pass to the left of
2742 the verb. However, when one of these verbs appears inside an atomic group, its
2743 effect is confined to that group, because once the group has been matched,
2744 there is never any backtracking into it. In this situation, backtracking can
2745 "jump back" to the left of the entire atomic group. (Remember also, as stated
2746 above, that this localization also applies in subroutine calls and assertions.)
2747 .P
2748 These verbs differ in exactly what kind of failure occurs when backtracking
2749 reaches them.
2750 .sp
2751 (*COMMIT)
2752 .sp
2753 This verb, which may not be followed by a name, causes the whole match to fail
2754 outright if the rest of the pattern does not match. Even if the pattern is
2755 unanchored, no further attempts to find a match by advancing the starting point
2756 take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to
2757 finding a match at the current starting point, or not at all. For example:
2758 .sp
2759 a+(*COMMIT)b
2760 .sp
2761 This matches "xxaab" but not "aacaab". It can be thought of as a kind of
2762 dynamic anchor, or "I've started, so I must finish." The name of the most
2763 recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
2764 match failure.
2765 .P
2766 Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
2767 unless PCRE's start-of-match optimizations are turned off, as shown in this
2768 \fBpcretest\fP example:
2769 .sp
2770 re> /(*COMMIT)abc/
2771 data> xyzabc
2772 0: abc
2773 xyzabc\eY
2774 No match
2775 .sp
2776 PCRE knows that any match must start with "a", so the optimization skips along
2777 the subject to "a" before running the first match attempt, which succeeds. When
2778 the optimization is disabled by the \eY escape in the second subject, the match
2779 starts at "x" and so the (*COMMIT) causes it to fail without trying any other
2780 starting points.
2781 .sp
2782 (*PRUNE) or (*PRUNE:NAME)
2783 .sp
2784 This verb causes the match to fail at the current starting position in the
2785 subject if the rest of the pattern does not match. If the pattern is
2786 unanchored, the normal "bumpalong" advance to the next starting character then
2787 happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
2788 reached, or when matching to the right of (*PRUNE), but if there is no match to
2789 the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2790 (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2791 but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2792 The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
2793 anchored pattern (*PRUNE) has the same effect as (*COMMIT).
2794 .sp
2795 (*SKIP)
2796 .sp
2797 This verb, when given without a name, is like (*PRUNE), except that if the
2798 pattern is unanchored, the "bumpalong" advance is not to the next character,
2799 but to the position in the subject where (*SKIP) was encountered. (*SKIP)
2800 signifies that whatever text was matched leading up to it cannot be part of a
2801 successful match. Consider:
2802 .sp
2803 a+(*SKIP)b
2804 .sp
2805 If the subject is "aaaac...", after the first match attempt fails (starting at
2806 the first character in the string), the starting point skips on to start the
2807 next attempt at "c". Note that a possessive quantifier does not have the same
2808 effect as this example; although it would suppress backtracking during the
2809 first match attempt, the second attempt would start at the second character
2810 instead of skipping on to "c".
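
The difference in starting positions can be illustrated with a small model. The
sketch below is not PCRE and does not call it; start_positions is a
hypothetical helper, hand-written for this one pattern, that records which
subject offsets are used as starting points. It assumes an unanchored pattern
and ignores PCRE's start-of-match optimizations.

```python
def start_positions(subject, use_skip):
    """Toy model of the bumpalong for the pattern a+b on `subject`.

    use_skip=True  models a+(*SKIP)b
    use_skip=False models a++b (possessive, no verb)
    Returns (offsets tried as starting points, match span or None).
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        pos = start
        # a+ : consume as many "a" characters as possible
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos > start:
            # a+ matched; now try the literal "b"
            if pos < len(subject) and subject[pos] == "b":
                return tried, (start, pos + 1)
            if use_skip:
                # (*SKIP) was passed at `pos`: restart there, not at start + 1
                start = pos
                continue
        # ordinary bumpalong: advance by one character
        start += 1
    return tried, None
```

In this model, the subject "aaaac" yields starting offsets 0 and 4 with
use_skip=True, and offsets 0, 1, 2, 3 and 4 with use_skip=False.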
2811 .sp
2812 (*SKIP:NAME)
2813 .sp
2814 When (*SKIP) has an associated name, its behaviour is modified. If the
2815 following pattern fails to match, the previous path through the pattern is
2816 searched for the most recent (*MARK) that has the same name. If one is found,
2817 the "bumpalong" advance is to the subject position that corresponds to that
2818 (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2819 matching name is found, the (*SKIP) is ignored.
2820 .sp
2821 (*THEN) or (*THEN:NAME)
2822 .sp
2823 This verb causes a skip to the next innermost alternative if the rest of the
2824 pattern does not match. That is, it cancels pending backtracking, but only
2825 within the current alternative. Its name comes from the observation that it can
2826 be used for a pattern-based if-then-else block:
2827 .sp
2828 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2829 .sp
2830 If the COND1 pattern matches, FOO is tried (and possibly further items after
2831 the end of the group if FOO succeeds); on failure, the matcher skips to the
2832 second alternative and tries COND2, without backtracking into COND1. The
2833 behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
2834 If (*THEN) is not inside an alternation, it acts like (*PRUNE).
2835 .P
2836 Note that a subpattern that does not contain a | character is just a part of
2837 the enclosing alternative; it is not a nested alternation with only one
2838 alternative. The effect of (*THEN) extends beyond such a subpattern to the
2839 enclosing alternative. Consider this pattern, where A, B, etc. are complex
2840 pattern fragments that do not contain any | characters at this level:
2841 .sp
2842 A (B(*THEN)C) | D
2843 .sp
2844 If A and B are matched, but there is a failure in C, matching does not
2845 backtrack into A; instead it moves to the next alternative, that is, D.
2846 However, if the subpattern containing (*THEN) is given an alternative, it
2847 behaves differently:
2848 .sp
2849 A (B(*THEN)C | (*FAIL)) | D
2850 .sp
2851 The effect of (*THEN) is now confined to the inner subpattern. After a failure
2852 in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2853 because there are no more alternatives to try. In this case, matching does now
2854 backtrack into A.
2855 .P
2856 Note also that a conditional subpattern is not considered as having two
2857 alternatives, because only one is ever used. In other words, the | character in
2858 a conditional subpattern has a different meaning. Ignoring white space,
2859 consider:
2860 .sp
2861 ^.*? (?(?=a) a | b(*THEN)c )
2862 .sp
2863 If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2864 it initially matches zero characters. The condition (?=a) then fails, the
2865 character "b" is matched, but "c" is not. At this point, matching does not
2866 backtrack to .*? as might perhaps be expected from the presence of the |
2867 character. The conditional subpattern is part of the single alternative that
2868 comprises the whole pattern, and so the match fails. (If there were a backtrack
2869 into .*?, allowing it to match "b", the match would succeed.)
2870 .P
2871 The verbs just described provide four different "strengths" of control when
2872 subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
2873 next alternative. (*PRUNE) comes next, failing the match at the current
2874 starting position, but allowing an advance to the next character (for an
2875 unanchored pattern). (*SKIP) is similar, except that the advance may be more
2876 than one character. (*COMMIT) is the strongest, causing the entire match to
2877 fail.
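
The four strengths can be summarized as a small decision function. This is only
a sketch of the descriptions above, assuming an unanchored pattern; the name
after_failure and its return values are hypothetical and are not part of any
PCRE API.

```python
def after_failure(verb, start, verb_pos):
    """Toy model: what happens when matching fails to the right of `verb`.

    start    -- subject offset where the current match attempt began
    verb_pos -- subject offset at which the verb was encountered
    Returns (action, offset); offset is None when the whole match fails.
    """
    if verb == "THEN":
        # weakest: stay at the same starting point, try the next alternative
        return ("next alternative", start)
    if verb == "PRUNE":
        # fail at this starting point, advance by one character
        return ("bumpalong", start + 1)
    if verb == "SKIP":
        # like PRUNE, but advance to where (*SKIP) was encountered;
        # the max() guard is an assumption of this sketch, so that the
        # model always makes progress
        return ("bumpalong", max(verb_pos, start + 1))
    if verb == "COMMIT":
        # strongest: no further starting points are tried
        return ("no match", None)
    raise ValueError("unknown verb: " + verb)
```

For a match attempt that began at offset 0 and passed the verb at offset 4, the
model moves the next attempt to offset 1 for PRUNE and to offset 4 for SKIP.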
2878 .P
2879 If more than one such verb is present in a pattern, the "strongest" one wins.
2880 For example, consider this pattern, where A, B, etc. are complex pattern
2881 fragments:
2882 .sp
2883 (A(*COMMIT)B(*THEN)C|D)
2884 .sp
2885 Once A has matched, PCRE is committed to this match, at the current starting
2886 position. If subsequently B matches, but C does not, the normal (*THEN) action
2887 of trying the next alternative (that is, D) does not happen because (*COMMIT)
2888 overrides.
2889 .
2890 .
2891 .SH "SEE ALSO"
2892 .rs
2893 .sp
2894 \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
2895 \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16\fP(3).
2896 .
2897 .
2898 .SH AUTHOR
2899 .rs
2900 .sp
2901 .nf
2902 Philip Hazel
2903 University Computing Service
2904 Cambridge CB2 3QH, England.
2905 .fi
2906 .
2907 .
2908 .SH REVISION
2909 .rs
2910 .sp
2911 .nf
2912 Last updated: 24 February 2012
2913 Copyright (c) 1997-2012 University of Cambridge.
2914 .fi
