Contents of /code/trunk/doc/html/pcrepattern.html

Revision 91
Sat Feb 24 21:41:34 2007 UTC (7 years, 6 months ago) by nigel
File MIME type: text/html
File size: 71296 byte(s)
Load pcre-6.7 into code/trunk.

1 nigel 63 <html>
2     <head>
3     <title>pcrepattern specification</title>
4     </head>
5     <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 nigel 75 <h1>pcrepattern man page</h1>
7     <p>
8     Return to the <a href="index.html">PCRE index page</a>.
9     </p>
10     <p>
11     This page is part of the PCRE HTML documentation. It was generated automatically
12     from the original man page. If there is any nonsense in it, please consult the
13     man page, in case the conversion went wrong.
14     <br>
15 nigel 63 <ul>
16     <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
17     <li><a name="TOC2" href="#SEC2">BACKSLASH</a>
18     <li><a name="TOC3" href="#SEC3">CIRCUMFLEX AND DOLLAR</a>
19     <li><a name="TOC4" href="#SEC4">FULL STOP (PERIOD, DOT)</a>
20     <li><a name="TOC5" href="#SEC5">MATCHING A SINGLE BYTE</a>
21 nigel 75 <li><a name="TOC6" href="#SEC6">SQUARE BRACKETS AND CHARACTER CLASSES</a>
22 nigel 63 <li><a name="TOC7" href="#SEC7">POSIX CHARACTER CLASSES</a>
23     <li><a name="TOC8" href="#SEC8">VERTICAL BAR</a>
24     <li><a name="TOC9" href="#SEC9">INTERNAL OPTION SETTING</a>
25     <li><a name="TOC10" href="#SEC10">SUBPATTERNS</a>
26     <li><a name="TOC11" href="#SEC11">NAMED SUBPATTERNS</a>
27     <li><a name="TOC12" href="#SEC12">REPETITION</a>
28     <li><a name="TOC13" href="#SEC13">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
29     <li><a name="TOC14" href="#SEC14">BACK REFERENCES</a>
30     <li><a name="TOC15" href="#SEC15">ASSERTIONS</a>
31     <li><a name="TOC16" href="#SEC16">CONDITIONAL SUBPATTERNS</a>
32     <li><a name="TOC17" href="#SEC17">COMMENTS</a>
33     <li><a name="TOC18" href="#SEC18">RECURSIVE PATTERNS</a>
34     <li><a name="TOC19" href="#SEC19">SUBPATTERNS AS SUBROUTINES</a>
35     <li><a name="TOC20" href="#SEC20">CALLOUTS</a>
36     </ul>
37     <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
38     <P>
39     The syntax and semantics of the regular expressions supported by PCRE are
40     described below. Regular expressions are also described in the Perl
41 nigel 75 documentation and in a number of books, some of which have copious examples.
42     Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers
43     regular expressions in great detail. This description of PCRE's regular
44     expressions is intended as reference material.
45 nigel 63 </P>
46     <P>
47 nigel 75 The original operation of PCRE was on strings of one-byte characters. However,
48     there is now also support for UTF-8 character strings. To use this, you must
49     build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
50     the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
51     places below. There is also a summary of UTF-8 features in the
52 nigel 63 <a href="pcre.html#utf8support">section on UTF-8 support</a>
53     in the main
54     <a href="pcre.html"><b>pcre</b></a>
55     page.
56     </P>
57     <P>
58 nigel 77 The remainder of this document discusses the patterns that are supported by
59     PCRE when its main matching function, <b>pcre_exec()</b>, is used.
60     From release 6.0, PCRE offers a second matching function,
61     <b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not
62     Perl-compatible. The advantages and disadvantages of the alternative function,
63     and how it differs from the normal function, are discussed in the
64     <a href="pcrematching.html"><b>pcrematching</b></a>
65     page.
66     </P>
67     <P>
68 nigel 63 A regular expression is a pattern that is matched against a subject string from
69     left to right. Most characters stand for themselves in a pattern, and match the
70     corresponding characters in the subject. As a trivial example, the pattern
71     <pre>
72     The quick brown fox
73 nigel 75 </pre>
74 nigel 77 matches a portion of a subject string that is identical to itself. When
75     caseless matching is specified (the PCRE_CASELESS option), letters are matched
76     independently of case. In UTF-8 mode, PCRE always understands the concept of
77     case for characters whose values are less than 128, so caseless matching is
78     always possible. For characters with higher values, the concept of case is
79     supported if PCRE is compiled with Unicode property support, but not otherwise.
80     If you want to use caseless matching for characters 128 and above, you must
81     ensure that PCRE is compiled with Unicode property support as well as with
82     UTF-8 support.
83     </P>
84     <P>
85     The power of regular expressions comes from the ability to include alternatives
86     and repetitions in the pattern. These are encoded in the pattern by the use of
87 nigel 75 <i>metacharacters</i>, which do not stand for themselves but instead are
88 nigel 63 interpreted in some special way.
89     </P>
90     <P>
91 nigel 75 There are two different sets of metacharacters: those that are recognized
92 nigel 63 anywhere in the pattern except within square brackets, and those that are
93 nigel 75 recognized in square brackets. Outside square brackets, the metacharacters are
94 nigel 63 as follows:
<pre>
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
</pre>
112 nigel 63 Part of a pattern that is in square brackets is called a "character class". In
113 nigel 75 a character class the only metacharacters are:
<pre>
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
</pre>
121     The following sections describe the use of each of the metacharacters.
122 nigel 63 </P>
123     <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>
124     <P>
125     The backslash character has several uses. Firstly, if it is followed by a
126 nigel 91 non-alphanumeric character, it takes away any special meaning that character
127     may have. This use of backslash as an escape character applies both inside and
128 nigel 63 outside character classes.
129     </P>
130     <P>
131     For example, if you want to match a * character, you write \* in the pattern.
132     This escaping action applies whether or not the following character would
133 nigel 75 otherwise be interpreted as a metacharacter, so it is always safe to precede a
134     non-alphanumeric with backslash to specify that it stands for itself. In
135 nigel 63 particular, if you want to match a backslash, you write \\.
136     </P>
137     <P>
138     If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
139     pattern (other than in a character class) and characters between a # outside
140 nigel 91 a character class and the next newline are ignored. An escaping backslash can
141     be used to include a whitespace or # character as part of the pattern.
142 nigel 63 </P>
143     <P>
144     If you want to remove the special meaning from a sequence of characters, you
145     can do so by putting them between \Q and \E. This is different from Perl in
146     that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
147     Perl, $ and @ cause variable interpolation. Note the following examples:
<pre>
  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
</pre>
155 nigel 63 The \Q...\E sequence is recognized both inside and outside character classes.
156 nigel 75 <a name="digitsafterbackslash"></a></P>
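<P>
The PCRE column of the table above can be reproduced with a small sketch.
Python is used only as a notation here; the helper <b>expand_quoted()</b> is
hypothetical and is not part of PCRE. It rewrites each \Q...\E span as a
backslash-escaped literal, so that $ and @ are never treated as variables. The
handling of a \Q that has no matching \E (quoting to the end of the pattern) is
an assumption of this sketch.
</P>

```python
import re


def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                # Assumption: an unterminated \Q quotes the rest of the pattern.
                end = len(pattern)
            # Everything between \Q and \E is literal, including $ and @.
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Any other backslash escape is copied through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

<P>
The expanded pattern is then an ordinary regular expression; for instance,
expanding \Qabc\E\$\Qxyz\E gives a pattern that matches the subject abc$xyz.
</P>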
157     <br><b>
158     Non-printing characters
159     </b><br>
160 nigel 63 <P>
161     A second use of backslash provides a way of encoding non-printing characters
162     in patterns in a visible manner. There is no restriction on the appearance of
163     non-printing characters, apart from the binary zero that terminates a pattern,
164     but when a pattern is being prepared by text editing, it is usually easier to
165     use one of the following escape sequences than the binary character it
166     represents:
<pre>
  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any character
  \e        escape (hex 1B)
  \f        formfeed (hex 0C)
  \n        newline (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \ddd      character with octal code ddd, or backreference
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh..
</pre>
179 nigel 63 The precise effect of \cx is as follows: if x is a lower case letter, it
180     is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
181     Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
182     7B.
183     </P>
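<P>
The \cx rule is a two-step computation, sketched below in Python. The function
name <b>control_char()</b> is an assumption made for illustration, and the
sketch only covers single-byte values of x.
</P>

```python
def control_char(x: str) -> int:
    """Return the character code produced by \\cx for a one-character x."""
    code = ord(x)
    if "a" <= x <= "z":
        # Step 1: lower case letters are converted to upper case.
        code = ord(x.upper())
    # Step 2: bit 6 (hex 40) is inverted.
    return code ^ 0x40
```

<P>
With this definition, control_char("z") is hex 1A, control_char("{") is hex 3B,
and control_char(";") is hex 7B, which agrees with the values quoted above.
</P>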
184     <P>
185     After \x, from zero to two hexadecimal digits are read (letters can be in
186 nigel 87 upper or lower case). Any number of hexadecimal digits may appear between \x{
187     and }, but the value of the character code must be less than 256 in non-UTF-8
188     mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
189     is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{
190     and }, or if there is no terminating }, this form of escape is not recognized.
191     Instead, the initial \x will be interpreted as a basic hexadecimal escape,
192     with no following digits, giving a character whose value is zero.
193 nigel 63 </P>
194     <P>
195     Characters whose value is less than 256 can be defined by either of the two
196 nigel 87 syntaxes for \x. There is no difference in the way they are handled. For
197     example, \xdc is exactly the same as \x{dc}.
198 nigel 63 </P>
199     <P>
200 nigel 91 After \0 up to two further octal digits are read. If there are fewer than two
201     digits, just those that are present are used. Thus the sequence \0\x\07
202     specifies two binary zeros followed by a BEL character (code value 7). Make
203     sure you supply two digits after the initial zero if the pattern character that
204     follows is itself an octal digit.
205 nigel 63 </P>
206     <P>
207     The handling of a backslash followed by a digit other than 0 is complicated.
208     Outside a character class, PCRE reads it and any following digits as a decimal
209     number. If the number is less than 10, or if there have been at least that many
210     previous capturing left parentheses in the expression, the entire sequence is
211     taken as a <i>back reference</i>. A description of how this works is given
212 nigel 75 <a href="#backreferences">later,</a>
213     following the discussion of
214     <a href="#subpattern">parenthesized subpatterns.</a>
215 nigel 63 </P>
216     <P>
217     Inside a character class, or if the decimal number is greater than 9 and there
218     have not been that many capturing subpatterns, PCRE re-reads up to three octal
digits following the backslash, and uses them to generate a data character. Any
220     subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
221     character specified in octal must be less than \400. In UTF-8 mode, values up
222     to \777 are permitted. For example:
<pre>
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40 previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the character with octal code 113
  \377   might be a back reference, otherwise the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"
</pre>
234 nigel 63 Note that octal values of 100 or greater must not be introduced by a leading
235     zero, because no more than three octal digits are ever read.
236     </P>
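<P>
The decision between a back reference and an octal character can be written
out as a short sketch. This is an illustration of the rules stated above, not
PCRE's own code: the function <b>interpret_digit_escape()</b> and its arguments
are hypothetical, <i>digits</i> is assumed to be the whole run of decimal
digits after the backslash, and the upper limits on octal values (\400 or
\777) are not checked.
</P>

```python
OCTAL_DIGITS = "01234567"


def interpret_digit_escape(digits, previous_groups, in_class=False):
    """Classify the decimal digits that follow a backslash.

    Returns (kind, value, rest), where kind is "backreference" or
    "character", value is the group number or character code, and rest
    holds any trailing digits that stand for themselves.
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        # A number below 10, or one that does not exceed the count of
        # previous capturing parentheses, is a back reference.
        if number < 10 or number <= previous_groups:
            return ("backreference", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in OCTAL_DIGITS:
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    return ("character", value, digits[len(octal):])
```

<P>
For example, interpret_digit_escape("0113", 0) yields a tab (code 9) with "3"
left over, and interpret_digit_escape("81", 0) yields a binary zero with "81"
left over.
</P>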
237     <P>
238 nigel 91 All the sequences that define a single character value can be used both inside
239     and outside character classes. In addition, inside a character class, the
240     sequence \b is interpreted as the backspace character (hex 08), and the
241     sequence \X is interpreted as the character "X". Outside a character class,
242     these sequences have different meanings
243 nigel 75 <a href="#uniextseq">(see below).</a>
244 nigel 63 </P>
245 nigel 75 <br><b>
246     Generic character types
247     </b><br>
248 nigel 63 <P>
249 nigel 75 The third use of backslash is for specifying generic character types. The
250     following are always recognized:
<pre>
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
</pre>
259 nigel 63 Each pair of escape sequences partitions the complete set of characters into
260     two disjoint sets. Any given character matches one, and only one, of each pair.
261     </P>
262     <P>
263 nigel 75 These character type sequences can appear both inside and outside character
264     classes. They each match one character of the appropriate type. If the current
265     matching point is at the end of the subject string, all of them fail, since
266     there is no character to match.
267 nigel 63 </P>
268     <P>
269     For compatibility with Perl, \s does not match the VT character (code 11).
This makes it different from the POSIX "space" class. The \s characters
271 nigel 91 are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
272     included in a Perl script, \s may match the VT character. In PCRE, it never
273     does.)
274 nigel 63 </P>
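<P>
The set of characters matched by \s can be stated directly. The sketch below
is a plain lookup written in Python; the names <b>PCRE_SPACE</b> and
<b>matches_backslash_s()</b> are assumptions made for illustration. Python's
own <b>re</b> module is deliberately not used, because its \s also accepts VT.
</P>

```python
# Character codes listed above for \s: HT, LF, FF, CR and space.
PCRE_SPACE = frozenset({9, 10, 12, 13, 32})


def matches_backslash_s(ch: str) -> bool:
    """Return True if the single character ch is matched by \\s."""
    return ord(ch) in PCRE_SPACE


def matches_backslash_S(ch: str) -> bool:
    """\\S is the exact complement of \\s."""
    return not matches_backslash_s(ch)
```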
275     <P>
276 nigel 75 A "word" character is an underscore or any character less than 256 that is a
277     letter or digit. The definition of letters and digits is controlled by PCRE's
278     low-valued character tables, and may vary if locale-specific matching is taking
279     place (see
280 nigel 63 <a href="pcreapi.html#localesupport">"Locale support"</a>
281     in the
282     <a href="pcreapi.html"><b>pcreapi</b></a>
283 nigel 75 page). For example, in the "fr_FR" (French) locale, some character codes
284     greater than 128 are used for accented letters, and these are matched by \w.
285 nigel 63 </P>
286     <P>
287 nigel 75 In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
288     \w, and always match \D, \S, and \W. This is true even when Unicode
289 nigel 87 character property support is available. The use of locales with Unicode is
290     discouraged.
291 nigel 75 <a name="uniextseq"></a></P>
292     <br><b>
293     Unicode character properties
294     </b><br>
295     <P>
296     When PCRE is built with Unicode character property support, three additional
297 nigel 87 escape sequences to match character properties are available when UTF-8 mode
298 nigel 75 is selected. They are:
<pre>
  \p{<i>xx</i>}   a character with the <i>xx</i> property
  \P{<i>xx</i>}   a character without the <i>xx</i> property
  \X       an extended Unicode sequence
</pre>
304 nigel 87 The property names represented by <i>xx</i> above are limited to the Unicode
305     script names, the general category properties, and "Any", which matches any
306     character (including newline). Other properties such as "InMusicalSymbols" are
307     not currently supported by PCRE. Note that \P{Any} does not match any
308     characters, so always causes a match failure.
309 nigel 63 </P>
310     <P>
311 nigel 87 Sets of Unicode characters are defined as belonging to certain scripts. A
312     character from one of these sets can be matched using a script name. For
313     example:
314 nigel 75 <pre>
315 nigel 87 \p{Greek}
316     \P{Han}
317     </pre>
318     Those that are not part of an identified script are lumped together as
319     "Common". The current list of scripts is:
320     </P>
321     <P>
322     Arabic,
323     Armenian,
324     Bengali,
325     Bopomofo,
326     Braille,
327     Buginese,
328     Buhid,
329     Canadian_Aboriginal,
330     Cherokee,
331     Common,
332     Coptic,
333     Cypriot,
334     Cyrillic,
335     Deseret,
336     Devanagari,
337     Ethiopic,
338     Georgian,
339     Glagolitic,
340     Gothic,
341     Greek,
342     Gujarati,
343     Gurmukhi,
344     Han,
345     Hangul,
346     Hanunoo,
347     Hebrew,
348     Hiragana,
349     Inherited,
350     Kannada,
351     Katakana,
352     Kharoshthi,
353     Khmer,
354     Lao,
355     Latin,
356     Limbu,
357     Linear_B,
358     Malayalam,
359     Mongolian,
360     Myanmar,
361     New_Tai_Lue,
362     Ogham,
363     Old_Italic,
364     Old_Persian,
365     Oriya,
366     Osmanya,
367     Runic,
368     Shavian,
369     Sinhala,
370     Syloti_Nagri,
371     Syriac,
372     Tagalog,
373     Tagbanwa,
374     Tai_Le,
375     Tamil,
376     Telugu,
377     Thaana,
378     Thai,
379     Tibetan,
380     Tifinagh,
381     Ugaritic,
382     Yi.
383     </P>
384     <P>
385     Each character has exactly one general category property, specified by a
386     two-letter abbreviation. For compatibility with Perl, negation can be specified
387     by including a circumflex between the opening brace and the property name. For
388     example, \p{^Lu} is the same as \P{Lu}.
389     </P>
390     <P>
391     If only one letter is specified with \p or \P, it includes all the general
392     category properties that start with that letter. In this case, in the absence
393     of negation, the curly brackets in the escape sequence are optional; these two
394     examples have the same effect:
395     <pre>
396 nigel 75 \p{L}
397     \pL
398     </pre>
399 nigel 87 The following general category property codes are supported:
<pre>
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
</pre>
445 nigel 87 The special property L& is also supported: it matches a character that has
446     the Lu, Ll, or Lt property, in other words, a letter that is not classified as
447     a modifier or "other".
448 nigel 75 </P>
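<P>
The general category rules above (two-letter codes, one-letter prefixes, the
circumflex negation, and L&) can be sketched with Python's standard
<b>unicodedata</b> module. The function <b>has_property()</b> is hypothetical.
Python's Unicode tables are newer than the ones built into this release of
PCRE, so individual characters may be classified differently, and script names
are not handled.
</P>

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Test ch against a general category property, as \\p{prop} would."""
    negated = prop.startswith("^")
    if negated:
        # \p{^Lu} is the same as \P{Lu}.
        prop = prop[1:]
    category = unicodedata.category(ch)
    if prop == "L&":
        # L& covers Lu, Ll and Lt only.
        result = category in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:
        # A single letter covers every category that starts with it.
        result = category.startswith(prop)
    else:
        result = category == prop
    return result != negated
```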
449     <P>
450 nigel 87 The long synonyms for these properties that Perl supports (such as \p{Letter})
451 nigel 91 are not supported by PCRE, nor is it permitted to prefix any of these
452 nigel 87 properties with "Is".
453     </P>
454     <P>
455     No character that is in the Unicode table has the Cn (unassigned) property.
456     Instead, this property is assumed for any code point that is not in the
457     Unicode table.
458     </P>
459     <P>
460 nigel 75 Specifying caseless matching does not affect these escape sequences. For
461     example, \p{Lu} always matches only upper case letters.
462     </P>
463     <P>
464     The \X escape matches any number of Unicode characters that form an extended
465     Unicode sequence. \X is equivalent to
466     <pre>
467     (?&#62;\PM\pM*)
468     </pre>
469     That is, it matches a character without the "mark" property, followed by zero
470     or more characters with the "mark" property, and treats the sequence as an
471     atomic group
472     <a href="#atomicgroup">(see below).</a>
473     Characters with the "mark" property are typically accents that affect the
474     preceding character.
475     </P>
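<P>
The behaviour of \X at a single position can be sketched as follows, again
using Python's <b>unicodedata</b> module as a stand-in for PCRE's property
table. The function <b>match_extended_sequence()</b> is an assumption made for
illustration; it returns the offset just past the sequence, or None when \X
cannot match at that position.
</P>

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for characters whose general category starts with M."""
    return unicodedata.category(ch).startswith("M")


def match_extended_sequence(subject: str, pos: int):
    """Emulate (?>\\PM\\pM*) at pos; return the end offset or None."""
    # \PM needs one character that does not have the "mark" property.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None
    end = pos + 1
    # \pM* then takes every following mark; the atomic group means
    # none of them is ever given back.
    while end < len(subject) and is_mark(subject[end]):
        end += 1
    return end
```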
476     <P>
477     Matching characters by Unicode property is not fast, because PCRE has to search
478     a structure that contains data for over fifteen thousand characters. That is
479     why the traditional escape sequences such as \d and \w do not use Unicode
480     properties in PCRE.
481     <a name="smallassertions"></a></P>
482     <br><b>
483     Simple assertions
484     </b><br>
485     <P>
486 nigel 63 The fourth use of backslash is for certain simple assertions. An assertion
487     specifies a condition that has to be met at a particular point in a match,
488     without consuming any characters from the subject string. The use of
489 nigel 75 subpatterns for more complicated assertions is described
490     <a href="#bigassertions">below.</a>
491 nigel 91 The backslashed assertions are:
<pre>
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject
</pre>
500 nigel 63 These assertions may not appear in character classes (but note that \b has a
501     different meaning, namely the backspace character, inside a character class).
502     </P>
503     <P>
504     A word boundary is a position in the subject string where the current character
505     and the previous character do not both match \w or \W (i.e. one matches
506     \w and the other matches \W), or the start or end of the string if the
507     first or last character matches \w, respectively.
508     </P>
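<P>
The definition of a word boundary translates into a short test on the
characters either side of a position. The sketch below assumes the default
(non-locale) definition of a "word" character, that is, an ASCII letter, a
digit, or an underscore; the function names are hypothetical.
</P>

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word_char(ch: str) -> bool:
    """True if ch would match \\w under the assumed default tables."""
    return ch in WORD_CHARS


def at_word_boundary(subject: str, pos: int) -> bool:
    """True if \\b would succeed between subject[pos-1] and subject[pos]."""
    # Outside the string there is no character, which counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # A boundary exists when exactly one side is a word character.
    return before != after
```

<P>
\B is simply the negation: it succeeds wherever at_word_boundary() returns
False.
</P>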
509     <P>
510     The \A, \Z, and \z assertions differ from the traditional circumflex and
511 nigel 75 dollar (described in the next section) in that they only ever match at the very
512     start and end of the subject string, whatever options are set. Thus, they are
513     independent of multiline mode. These three assertions are not affected by the
514     PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
515     circumflex and dollar metacharacters. However, if the <i>startoffset</i>
516     argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
517     at a point other than the beginning of the subject, \A can never match. The
518 nigel 91 difference between \Z and \z is that \Z matches before a newline at the end
519     of the string as well as at the very end, whereas \z matches only at the end.
520 nigel 63 </P>
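<P>
The three assertions can be expressed as tests on a position within the
subject. This sketch assumes that newline is the single character LF, and the
function names are hypothetical. Positions are offsets into the whole subject,
which is why the start-of-subject test fails whenever matching begins at a
non-zero <i>startoffset</i>.
</P>

```python
def at_subject_start(subject: str, pos: int) -> bool:
    """\\A: only the very start of the subject."""
    return pos == 0


def at_very_end(subject: str, pos: int) -> bool:
    """\\z: only the very end of the subject."""
    return pos == len(subject)


def at_end_or_final_newline(subject: str, pos: int) -> bool:
    """\\Z: the very end, or just before a newline that ends the subject."""
    if pos == len(subject):
        return True
    return pos == len(subject) - 1 and subject[pos] == "\n"
```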
521     <P>
522     The \G assertion is true only when the current matching position is at the
523     start point of the match, as specified by the <i>startoffset</i> argument of
524     <b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
525     non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
526     arguments, you can mimic Perl's /g option, and it is in this kind of
527     implementation where \G can be useful.
528     </P>
529     <P>
530     Note, however, that PCRE's interpretation of \G, as the start of the current
531     match, is subtly different from Perl's, which defines it as the end of the
532     previous match. In Perl, these can be different when the previously matched
533     string was empty. Because PCRE does just one match at a time, it cannot
534     reproduce this behaviour.
535     </P>
536     <P>
537     If all the alternatives of a pattern begin with \G, the expression is anchored
538     to the starting match position, and the "anchored" flag is set in the compiled
539     regular expression.
540     </P>
541     <br><a name="SEC3" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
542     <P>
543     Outside a character class, in the default matching mode, the circumflex
544 nigel 75 character is an assertion that is true only if the current matching point is
545 nigel 63 at the start of the subject string. If the <i>startoffset</i> argument of
546     <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
547     option is unset. Inside a character class, circumflex has an entirely different
548 nigel 75 meaning
549     <a href="#characterclass">(see below).</a>
550 nigel 63 </P>
551     <P>
552     Circumflex need not be the first character of the pattern if a number of
553     alternatives are involved, but it should be the first thing in each alternative
554     in which it appears if the pattern is ever to match that branch. If all
555     possible alternatives start with a circumflex, that is, if the pattern is
556     constrained to match only at the start of the subject, it is said to be an
557     "anchored" pattern. (There are also other constructs that can cause a pattern
558     to be anchored.)
559     </P>
560     <P>
561 nigel 75 A dollar character is an assertion that is true only if the current matching
562 nigel 63 point is at the end of the subject string, or immediately before a newline
563 nigel 91 at the end of the string (by default). Dollar need not be the last character of
564     the pattern if a number of alternatives are involved, but it should be the last
565     item in any branch in which it appears. Dollar has no special meaning in a
566     character class.
567 nigel 63 </P>
568     <P>
569     The meaning of dollar can be changed so that it matches only at the very end of
570     the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
571     does not affect the \Z assertion.
572     </P>
573     <P>
574     The meanings of the circumflex and dollar characters are changed if the
575 nigel 91 PCRE_MULTILINE option is set. When this is the case, a circumflex matches
576     immediately after internal newlines as well as at the start of the subject
577     string. It does not match after a newline that ends the string. A dollar
578     matches before any newlines in the string, as well as at the very end, when
579     PCRE_MULTILINE is set. When newline is specified as the two-character
580     sequence CRLF, isolated CR and LF characters do not indicate newlines.
581 nigel 63 </P>
582     <P>
583 nigel 91 For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
584     \n represents a newline) in multiline mode, but not otherwise. Consequently,
585     patterns that are anchored in single line mode because all branches start with
586     ^ are not anchored in multiline mode, and a match for circumflex is possible
587     when the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero. The
588     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
589     </P>
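<P>
The /^abc$/ example can be tried out with Python's <b>re</b> module, whose
re.MULTILINE flag plays the role of PCRE_MULTILINE. This is a different regex
engine, so the sketch shows only behaviour on which the two agree for LF
newlines; it does not cover CRLF newlines or PCRE_DOLLAR_ENDONLY.
</P>

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# In default mode, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")
```

<P>
Here single_line is None, while multi_line is a match that starts at offset 4.
</P>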
590     <P>
591 nigel 63 Note that the sequences \A, \Z, and \z can be used to match the start and
592     end of the subject in both modes, and if all branches of a pattern start with
593 nigel 91 \A it is always anchored, whether or not PCRE_MULTILINE is set.
594 nigel 63 </P>
595     <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
596     <P>
597     Outside a character class, a dot in the pattern matches any one character in
598 nigel 91 the subject string except (by default) a character that signifies the end of a
599     line. In UTF-8 mode, the matched character may be more than one byte long. When
600     a line ending is defined as a single character (CR or LF), dot never matches
601     that character; when the two-character sequence CRLF is used, dot does not
602     match CR if it is immediately followed by LF, but otherwise it matches all
603     characters (including isolated CRs and LFs).
604 nigel 63 </P>
605 nigel 91 <P>
606     The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
607     option is set, a dot matches any one character, without exception. If newline
608     is defined as the two-character sequence CRLF, it takes two dots to match it.
609     </P>
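<P>
The effect of PCRE_DOTALL can be illustrated with Python's <b>re</b> module,
where the corresponding flag is re.DOTALL. The sketch assumes LF as the newline
character; the CRLF behaviour described above is specific to PCRE and is not
shown.
</P>

```python
import re

# By default a dot does not match the newline character.
default_dot = re.search(r"a.c", "a\nc")

# With the "dot matches all" option, it does.
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)

# Any other character is matched in either mode.
ordinary = re.search(r"a.c", "abc")
```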
610     <P>
611     The handling of dot is entirely independent of the handling of circumflex and
612     dollar, the only relationship being that they both involve newlines. Dot has no
613     special meaning in a character class.
614     </P>
615 nigel 63 <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
616     <P>
617     Outside a character class, the escape sequence \C matches any one byte, both
618 nigel 91 in and out of UTF-8 mode. Unlike a dot, it always matches CR and LF. The
619     feature is provided in Perl in order to match individual bytes in UTF-8 mode.
620     Because it breaks up UTF-8 characters into individual bytes, what remains in
621     the string may be a malformed UTF-8 string. For this reason, the \C escape
622     sequence is best avoided.
623 nigel 63 </P>
624     <P>
625 nigel 75 PCRE does not allow \C to appear in lookbehind assertions
626     <a href="#lookbehind">(described below),</a>
627     because in UTF-8 mode this would make it impossible to calculate the length of
628     the lookbehind.
629     <a name="characterclass"></a></P>
630     <br><a name="SEC6" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
631 nigel 63 <P>
632     An opening square bracket introduces a character class, terminated by a closing
633     square bracket. A closing square bracket on its own is not special. If a
634     closing square bracket is required as a member of the class, it should be the
635     first data character in the class (after an initial circumflex, if present) or
636     escaped with a backslash.
637     </P>
638     <P>
639     A character class matches a single character in the subject. In UTF-8 mode, the
640     character may occupy more than one byte. A matched character must be in the set
641     of characters defined by the class, unless the first character in the class
642     definition is a circumflex, in which case the subject character must not be in
643     the set defined by the class. If a circumflex is actually required as a member
644     of the class, ensure it is not the first character, or escape it with a
645     backslash.
646     </P>
647     <P>
648     For example, the character class [aeiou] matches any lower case vowel, while
649     [^aeiou] matches any character that is not a lower case vowel. Note that a
650 nigel 75 circumflex is just a convenient notation for specifying the characters that
651     are in the class by enumerating those that are not. A class that starts with a
652     circumflex is not an assertion: it still consumes a character from the subject
653     string, and therefore it fails if the current pointer is at the end of the
654     string.
655 nigel 63 </P>
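<P>
A minimal sketch of the point that a negated class still consumes a character,
again using Python's <b>re</b> module as an assumed stand-in for PCRE. The
subject strings are made up for illustration.
</P>

```python
import re

non_vowel = re.compile(r"[^aeiou]")

# A negated class matches one character that is not in the set.
first_non_vowel = non_vowel.search("sky").group()

# It is not an assertion: with nothing left to consume, it fails.
at_end_of_string = re.fullmatch(r"a[^aeiou]", "a")
on_empty_subject = non_vowel.search("")

print(first_non_vowel, at_end_of_string, on_empty_subject)
```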
656     <P>
657     In UTF-8 mode, characters with values greater than 255 can be included in a
658     class as a literal string of bytes, or by using the \x{ escaping mechanism.
659     </P>
660     <P>
661     When caseless matching is set, any letters in a class represent both their
662     upper case and lower case versions, so for example, a caseless [aeiou] matches
663     "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
664 nigel 77 caseful version would. In UTF-8 mode, PCRE always understands the concept of
665     case for characters whose values are less than 128, so caseless matching is
666     always possible. For characters with higher values, the concept of case is
667     supported if PCRE is compiled with Unicode property support, but not otherwise.
668     If you want to use caseless matching for characters 128 and above, you must
669     ensure that PCRE is compiled with Unicode property support as well as with
670     UTF-8 support.
671 nigel 63 </P>
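<P>
The effect of caseless matching on plain and negated classes can be sketched as
follows. Python's <b>re</b> module and its IGNORECASE flag are used here as an
assumed analogue of PCRE with PCRE_CASELESS; only ASCII letters are involved, so
the Unicode property caveat above does not come into play.
</P>

```python
import re

# Caseless [aeiou] matches "A" as well as "a".
caseless_class = re.search(r"[aeiou]", "A", re.IGNORECASE)

# Caseless [^aeiou] does not match "A" ...
caseless_negated = re.search(r"[^aeiou]", "A", re.IGNORECASE)

# ... whereas the caseful version does.
caseful_negated = re.search(r"[^aeiou]", "A")

print(caseless_class, caseless_negated, caseful_negated)
```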
672     <P>
673 nigel 91 Characters that might indicate line breaks (CR and LF) are never treated in any
674     special way when matching character classes, whatever line-ending sequence is
675     in use, and whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is
676     used. A class such as [^a] always matches one of these characters.
677 nigel 63 </P>
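<P>
As a rough illustration, the sketch below shows a negated class matching CR and
LF while an unmodified dot does not match LF. It relies on Python's <b>re</b>
module, which is assumed to agree with PCRE on this point; it is not PCRE
itself.
</P>

```python
import re

# Line-break characters are ordinary data as far as a class is concerned.
class_matches_lf = re.match(r"[^a]", "\n") is not None
class_matches_cr = re.match(r"[^a]", "\r") is not None

# By contrast, dot without the "dotall" option does not match a newline.
dot_matches_lf = re.match(r".", "\n") is not None

print(class_matches_lf, class_matches_cr, dot_matches_lf)
```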
678     <P>
679     The minus (hyphen) character can be used to specify a range of characters in a
680     character class. For example, [d-m] matches any letter between d and m,
681     inclusive. If a minus character is required in a class, it must be escaped with
682     a backslash or appear in a position where it cannot be interpreted as
683     indicating a range, typically as the first or last character in the class.
684     </P>
685     <P>
686     It is not possible to have the literal character "]" as the end character of a
687     range. A pattern such as [W-]46] is interpreted as a class of two characters
688     ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
689     "-46]". However, if the "]" is escaped with a backslash it is interpreted as
690 nigel 75 the end of range, so [W-\]46] is interpreted as a class containing a range
691     followed by two other characters. The octal or hexadecimal representation of
692     "]" can also be used to end a range.
693 nigel 63 </P>
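<P>
The two readings of the bracket in the examples above can be checked with a
small sketch. Python's <b>re</b> module is used as a stand-in and is assumed to
parse these two classes as PCRE does; the test strings are illustrative.
</P>

```python
import re

# Unescaped "]" ends the class: [W-] is "W" or "-", then the literal "46]".
two_char_class = re.compile(r"[W-]46]")
matches_w = two_char_class.fullmatch("W46]") is not None
matches_hyphen = two_char_class.fullmatch("-46]") is not None
matches_x = two_char_class.fullmatch("X46]") is not None

# Escaped "]" ends a range instead: W..] plus the characters "4" and "6".
range_class = re.compile(r"[W-\]46]")
in_range = [c for c in "WXZ]46-" if range_class.fullmatch(c)]

print(matches_w, matches_hyphen, matches_x, in_range)
```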
694     <P>
695     Ranges operate in the collating sequence of character values. They can also be
696     used for characters specified numerically, for example [\000-\037]. In UTF-8
697     mode, ranges can include characters whose values are greater than 255, for
698     example [\x{100}-\x{2ff}].
699     </P>
700     <P>
701     If a range that includes letters is used when caseless matching is set, it
702     matches the letters in either case. For example, [W-c] is equivalent to
703 nigel 75 [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
704     tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches accented E
705     characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
706     characters with values of 128 or greater only when it is compiled with Unicode
707     property support.
708 nigel 63 </P>
709     <P>
710 nigel 75 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
711     in a character class, and add the characters that they match to the class. For
712 nigel 63 example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
713     conveniently be used with the upper case character types to specify a more
714     restricted set of characters than the matching lower case type. For example,
715     the class [^\W_] matches any letter or digit, but not underscore.
716     </P>
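<P>
A short sketch of character types used inside classes. It uses Python's
<b>re</b> module with the ASCII flag, an assumption made so that \d and \W are
restricted to ASCII; the sample strings are invented.
</P>

```python
import re

# \d adds the decimal digits to the explicitly listed letters.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
hex_hits = hex_digit.findall("0x1F g")

# [^\W_] is "a word character, but not underscore": letters and digits.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)
alnum_hits = letter_or_digit.findall("a_1-B")

print(hex_hits, alnum_hits)
```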
717     <P>
718 nigel 75 The only metacharacters that are recognized in character classes are backslash,
719     hyphen (only where it can be interpreted as specifying a range), circumflex
720     (only at the start), opening square bracket (only when it can be interpreted as
721     introducing a POSIX class name - see the next section), and the terminating
722     closing square bracket. However, escaping other non-alphanumeric characters
723     does no harm.
724 nigel 63 </P>
725     <br><a name="SEC7" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
726     <P>
727 nigel 75 Perl supports the POSIX notation for character classes. This uses names
728 nigel 63 enclosed by [: and :] within the enclosing square brackets. PCRE also supports
729     this notation. For example,
730     <pre>
731     [01[:alpha:]%]
732 nigel 75 </pre>
733 nigel 63 matches "0", "1", any alphabetic character, or "%". The supported class names
734     are
735     <pre>
736     alnum letters and digits
737     alpha letters
738     ascii character codes 0 - 127
739     blank space or tab only
740     cntrl control characters
741     digit decimal digits (same as \d)
742     graph printing characters, excluding space
743     lower lower case letters
744     print printing characters, including space
745     punct printing characters, excluding letters and digits
746     space white space (not quite the same as \s)
747     upper upper case letters
748     word "word" characters (same as \w)
749     xdigit hexadecimal digits
750 nigel 75 </pre>
751 nigel 63 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
752     space (32). Notice that this list includes the VT character (code 11). This
753     makes "space" different to \s, which does not include VT (for Perl
754     compatibility).
755     </P>
756     <P>
757     The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
758     5.8. Another Perl extension is negation, which is indicated by a ^ character
759     after the colon. For example,
760     <pre>
761     [12[:^digit:]]
762 nigel 75 </pre>
763 nigel 63 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
764     syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
765     supported, and an error is given if they are encountered.
766     </P>
767     <P>
768 nigel 75 In UTF-8 mode, characters with values greater than 128 do not match any of
769 nigel 63 the POSIX character classes.
770     </P>
771     <br><a name="SEC8" href="#TOC1">VERTICAL BAR</a><br>
772     <P>
773     Vertical bar characters are used to separate alternative patterns. For example,
774     the pattern
775     <pre>
776     gilbert|sullivan
777 nigel 75 </pre>
778 nigel 63 matches either "gilbert" or "sullivan". Any number of alternatives may appear,
779 nigel 91 and an empty alternative is permitted (matching the empty string). The matching
780     process tries each alternative in turn, from left to right, and the first one
781     that succeeds is used. If the alternatives are within a subpattern
782 nigel 75 <a href="#subpattern">(defined below),</a>
783     "succeeds" means matching the rest of the main pattern as well as the
784     alternative in the subpattern.
785 nigel 63 </P>
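<P>
The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can
be sketched like this. Python's <b>re</b> module is assumed to behave as PCRE
does here; the patterns other than gilbert|sullivan are made up for the
illustration.
</P>

```python
import re

# Either alternative can match.
either = [w for w in ("gilbert", "sullivan", "burnand")
          if re.fullmatch(r"gilbert|sullivan", w)]

# Alternatives are tried left to right; the first that succeeds is used,
# even though the second would have matched a longer string.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern: "a" alone
# cannot be followed by "c" here, so the second alternative is used.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

print(either, first_wins, needs_rest)
```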
786     <br><a name="SEC9" href="#TOC1">INTERNAL OPTION SETTING</a><br>
787     <P>
788     The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
789     PCRE_EXTENDED options can be changed from within the pattern by a sequence of
790     Perl option letters enclosed between "(?" and ")". The option letters are
791     <pre>
792     i for PCRE_CASELESS
793     m for PCRE_MULTILINE
794     s for PCRE_DOTALL
795     x for PCRE_EXTENDED
796 nigel 75 </pre>
797 nigel 63 For example, (?im) sets caseless, multiline matching. It is also possible to
798     unset these options by preceding the letter with a hyphen, and a combined
799     setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
800     PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
801     permitted. If a letter appears both before and after the hyphen, the option is
802     unset.
803     </P>
804     <P>
805     When an option change occurs at top level (that is, not inside subpattern
806     parentheses), the change applies to the remainder of the pattern that follows.
807     If the change is placed right at the start of a pattern, PCRE extracts it into
808     the global options (and it will therefore show up in data extracted by the
809     <b>pcre_fullinfo()</b> function).
810     </P>
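<P>
A loose analogue of an option change at the very start of a pattern, sketched
with Python's <b>re</b> module. The <b>flags</b> attribute of a compiled Python
pattern plays a role similar to the option bits reported by
<b>pcre_fullinfo()</b>, but this is an assumption-laden comparison and not the
PCRE API.
</P>

```python
import re

# Option letters at the very start apply to the whole pattern.
pattern = re.compile(r"(?im)^abc$")

# "i" lets "ABC" match; "m" lets ^ and $ match at internal line breaks.
found = pattern.search("x\nABC\ny")

# The inline letters are visible in the compiled object's flags.
has_caseless = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)

print(found.group(), has_caseless, has_multiline)
```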
811     <P>
812     An option change within a subpattern affects only that part of the current
813     pattern that follows it, so
814     <pre>
815     (a(?i)b)c
816 nigel 75 </pre>
817 nigel 63 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
818     By this means, options can be made to have different settings in different
819     parts of the pattern. Any changes made in one alternative do carry on
820     into subsequent branches within the same subpattern. For example,
821     <pre>
822     (a(?i)b|c)
823 nigel 75 </pre>
824 nigel 63 matches "ab", "aB", "c", and "C", even though when matching "C" the first
825     branch is abandoned before the option setting. This is because the effects of
826     option settings happen at compile time. There would be some very weird
827     behaviour otherwise.
828     </P>
829     <P>
830 nigel 91 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
831     changed in the same way as the Perl-compatible options by using the characters
832     J, U and X respectively.
833 nigel 75 <a name="subpattern"></a></P>
834 nigel 63 <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>
835     <P>
836     Subpatterns are delimited by parentheses (round brackets), which can be nested.
837 nigel 75 Turning part of a pattern into a subpattern does two things:
838     <br>
839     <br>
840 nigel 63 1. It localizes a set of alternatives. For example, the pattern
841     <pre>
842     cat(aract|erpillar|)
843 nigel 75 </pre>
844 nigel 63 matches one of the words "cat", "cataract", or "caterpillar". Without the
845     parentheses, it would match "cataract", "erpillar" or the empty string.
846 nigel 75 <br>
847     <br>
848     2. It sets up the subpattern as a capturing subpattern. This means that, when
849     the whole pattern matches, that portion of the subject string that matched the
850     subpattern is passed back to the caller via the <i>ovector</i> argument of
851 nigel 63 <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
852 nigel 75 from 1) to obtain numbers for the capturing subpatterns.
853 nigel 63 </P>
854     <P>
855     For example, if the string "the red king" is matched against the pattern
856     <pre>
857     the ((red|white) (king|queen))
858 nigel 75 </pre>
859 nigel 63 the captured substrings are "red king", "red", and "king", and are numbered 1,
860     2, and 3, respectively.
861     </P>
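<P>
The numbering of the captured substrings in this example can be reproduced with
a brief sketch. Python's <b>re</b> module stands in for <b>pcre_exec()</b> and
its <i>ovector</i>; group numbering by opening parenthesis is the same.
</P>

```python
import re

match = re.match(r"the ((red|white) (king|queen))", "the red king")

# Groups are numbered by their opening parenthesis, left to right, from 1.
captured = match.groups()
print(captured)
```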
862     <P>
863     The fact that plain parentheses fulfil two functions is not always helpful.
864     There are often times when a grouping subpattern is required without a
865     capturing requirement. If an opening parenthesis is followed by a question mark
866     and a colon, the subpattern does not do any capturing, and is not counted when
867     computing the number of any subsequent capturing subpatterns. For example, if
868     the string "the white queen" is matched against the pattern
869     <pre>
870     the ((?:red|white) (king|queen))
871 nigel 75 </pre>
872 nigel 63 the captured substrings are "white queen" and "queen", and are numbered 1 and
873     2. The maximum number of capturing subpatterns is 65535, and the maximum depth
874     of nesting of all subpatterns, both capturing and non-capturing, is 200.
875     </P>
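<P>
A matching sketch for the non-capturing form, again using Python's <b>re</b>
module as an assumed stand-in for PCRE. The (?: group is skipped when the
capturing subpatterns are counted.
</P>

```python
import re

match = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# Only two groups exist: the (?:red|white) group is not counted.
captured = match.groups()
group_count = match.re.groups
print(captured, group_count)
```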
876     <P>
877     As a convenient shorthand, if any option settings are required at the start of
878     a non-capturing subpattern, the option letters may appear between the "?" and
879     the ":". Thus the two patterns
880     <pre>
881     (?i:saturday|sunday)
882     (?:(?i)saturday|sunday)
883 nigel 75 </pre>
884 nigel 63 match exactly the same set of strings. Because alternative branches are tried
885     from left to right, and options are not reset until the end of the subpattern
886     is reached, an option setting in one branch does affect subsequent branches, so
887     the above patterns match "SUNDAY" as well as "Saturday".
888     </P>
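<P>
The shorthand form can be sketched with Python's <b>re</b> module, which accepts
(?i:...) in the same way. Only the first of the two patterns is shown, because
recent Python versions reject an option change such as (?i) that is not at the
start of the pattern; that restriction belongs to Python, not to PCRE.
</P>

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option covers every branch of the group, not just the first.
matched = [s for s in ("Saturday", "SUNDAY", "sunday", "Monday")
           if weekend.fullmatch(s)]
print(matched)
```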
889     <br><a name="SEC11" href="#TOC1">NAMED SUBPATTERNS</a><br>
890     <P>
891     Identifying capturing parentheses by number is simple, but it can be very hard
892     to keep track of the numbers in complicated regular expressions. Furthermore,
893 nigel 75 if an expression is modified, the numbers may change. To help with this
894 nigel 63 difficulty, PCRE supports the naming of subpatterns, something that Perl does
895 nigel 91 not provide. The Python syntax (?P&#60;name&#62;...) is used. References to capturing
896     parentheses from other parts of the pattern, such as
897     <a href="#backreferences">backreferences,</a>
898     <a href="#recursion">recursion,</a>
899     and
900     <a href="#conditions">conditions,</a>
901     can be made by name as well as by number.
902 nigel 63 </P>
903     <P>
904 nigel 91 Names consist of up to 32 alphanumeric characters and underscores. Named
905     capturing parentheses are still allocated numbers as well as names. The PCRE
906     API provides function calls for extracting the name-to-number translation table
907     from a compiled pattern. There is also a convenience function for extracting a
908     captured substring by name.
909     </P>
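<P>
Because the syntax is borrowed from Python, the idea can be sketched directly
with Python's <b>re</b> module. The <b>groupindex</b> mapping used below is
Python's counterpart of a name-to-number table; it is not the PCRE API, and the
group names and subject string are invented for the example.
</P>

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = pattern.match("2007-02")

# A named group can be fetched by name or by its allocated number.
by_name = match.group("year")
by_number = match.group(1)

# The name-to-number translation table of the compiled pattern.
name_table = dict(pattern.groupindex)
print(by_name, by_number, name_table)
```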
910     <P>
911     By default, a name must be unique within a pattern, but it is possible to relax
912     this constraint by setting the PCRE_DUPNAMES option at compile time. This can
913     be useful for patterns where only one instance of the named parentheses can
914     match. Suppose you want to match the name of a weekday, either as a 3-letter
915     abbreviation or as the full name, and in both cases you want to extract the
916     abbreviation. This pattern (ignoring the line breaks) does the job:
917     <pre>
918     (?P&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
919     (?P&#60;DN&#62;Tue)(?:sday)?|
920     (?P&#60;DN&#62;Wed)(?:nesday)?|
921     (?P&#60;DN&#62;Thu)(?:rsday)?|
922     (?P&#60;DN&#62;Sat)(?:urday)?
923     </pre>
924     There are five capturing substrings, but only one is ever set after a match.
925     The convenience function for extracting the data by name returns the substring
926     for the first, and in this example, the only, subpattern of that name that
927     matched. This saves searching to find which numbered subpattern it was. If you
928     make a reference to a non-unique named subpattern from elsewhere in the
929     pattern, the one that corresponds to the lowest number is used. For further
930     details of the interfaces for handling named subpatterns, see the
931 nigel 63 <a href="pcreapi.html"><b>pcreapi</b></a>
932     documentation.
933     </P>
934     <br><a name="SEC12" href="#TOC1">REPETITION</a><br>
935     <P>
936     Repetition is specified by quantifiers, which can follow any of the following
937     items:
938     <pre>
939     a literal data character
940     the . metacharacter
941     the \C escape sequence
942 nigel 75 the \X escape sequence (in UTF-8 mode with Unicode properties)
943     an escape such as \d that matches a single character
944 nigel 63 a character class
945     a back reference (see next section)
946     a parenthesized subpattern (unless it is an assertion)
947 nigel 75 </pre>
948 nigel 63 The general repetition quantifier specifies a minimum and maximum number of
949     permitted matches, by giving the two numbers in curly brackets (braces),
950     separated by a comma. The numbers must be less than 65536, and the first must
951     be less than or equal to the second. For example:
952     <pre>
953     z{2,4}
954 nigel 75 </pre>
955 nigel 63 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
956     character. If the second number is omitted, but the comma is present, there is
957     no upper limit; if the second number and the comma are both omitted, the
958     quantifier specifies an exact number of required matches. Thus
959     <pre>
960     [aeiou]{3,}
961 nigel 75 </pre>
962 nigel 63 matches at least 3 successive vowels, but may match many more, while
963     <pre>
964     \d{8}
965 nigel 75 </pre>
966 nigel 63 matches exactly 8 digits. An opening curly bracket that appears in a position
967     where a quantifier is not allowed, or one that does not match the syntax of a
968     quantifier, is taken as a literal character. For example, {,6} is not a
969     quantifier, but a literal string of four characters.
970     </P>
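<P>
The three quantifier examples above can be exercised with a small sketch using
Python's <b>re</b> module as a stand-in. The {,6} case is deliberately left out,
since Python treats it as a quantifier, which differs from the PCRE behaviour
described here.
</P>

```python
import re

# z{2,4}: between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
eight_ok = re.fullmatch(r"\d{8}", "20070224") is not None
seven_ok = re.fullmatch(r"\d{8}", "2007022") is not None

print(z_hits, vowel_run, eight_ok, seven_ok)
```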
971     <P>
972     In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
973     bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
974 nigel 75 which is represented by a two-byte sequence. Similarly, when Unicode property
975     support is available, \X{3} matches three Unicode extended sequences, each of
976     which may be several bytes long (and they may be of different lengths).
977 nigel 63 </P>
978     <P>
979     The quantifier {0} is permitted, causing the expression to behave as if the
980     previous item and the quantifier were not present.
981     </P>
982     <P>
983     For convenience (and historical compatibility) the three most common
984     quantifiers have single-character abbreviations:
985     <pre>
986     * is equivalent to {0,}
987     + is equivalent to {1,}
988     ? is equivalent to {0,1}
989 nigel 75 </pre>
990 nigel 63 It is possible to construct infinite loops by following a subpattern that can
991     match no characters with a quantifier that has no upper limit, for example:
992     <pre>
993     (a?)*
994 nigel 75 </pre>
995 nigel 63 Earlier versions of Perl and PCRE used to give an error at compile time for
996     such patterns. However, because there are cases where this can be useful, such
997     patterns are now accepted, but if any repetition of the subpattern does in fact
998     match no characters, the loop is forcibly broken.
999     </P>
1000     <P>
1001     By default, the quantifiers are "greedy", that is, they match as much as
1002     possible (up to the maximum number of permitted times), without causing the
1003     rest of the pattern to fail. The classic example of where this gives problems
1004 nigel 75 is in trying to match comments in C programs. These appear between /* and */
1005     and within the comment, individual * and / characters may appear. An attempt to
1006     match C comments by applying the pattern
1007 nigel 63 <pre>
1008     /\*.*\*/
1009 nigel 75 </pre>
1010 nigel 63 to the string
1011     <pre>
1012 nigel 75 /* first comment */ not comment /* second comment */
1013     </pre>
1014 nigel 63 fails, because it matches the entire string owing to the greediness of the .*
1015     item.
1016     </P>
1017     <P>
1018     However, if a quantifier is followed by a question mark, it ceases to be
1019     greedy, and instead matches the minimum number of times possible, so the
1020     pattern
1021     <pre>
1022     /\*.*?\*/
1023 nigel 75 </pre>
1024 nigel 63 does the right thing with the C comments. The meaning of the various
1025     quantifiers is not otherwise changed, just the preferred number of matches.
1026     Do not confuse this use of question mark with its use as a quantifier in its
1027     own right. Because it has two uses, it can sometimes appear doubled, as in
1028     <pre>
1029     \d??\d
1030 nigel 75 </pre>
1031 nigel 63 which matches one digit by preference, but can match two if that is the only
1032     way the rest of the pattern matches.
1033     </P>
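<P>
The difference between the greedy and minimizing forms on the C comment example
can be sketched as follows, using Python's <b>re</b> module as an assumed
stand-in for PCRE.
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/", giving the two comments.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two if the rest requires it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefers_one, forced_two)
```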
1034     <P>
1035     If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
1036     the quantifiers are not greedy by default, but individual ones can be made
1037     greedy by following them with a question mark. In other words, it inverts the
1038     default behaviour.
1039     </P>
1040     <P>
1041     When a parenthesized subpattern is quantified with a minimum repeat count that
1042 nigel 75 is greater than 1 or with a limited maximum, more memory is required for the
1043 nigel 63 compiled pattern, in proportion to the size of the minimum or maximum.
1044     </P>
1045     <P>
1046     If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1047     to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
1048     implicitly anchored, because whatever follows will be tried against every
1049     character position in the subject string, so there is no point in retrying the
1050     overall match at any position after the first. PCRE normally treats such a
1051     pattern as though it were preceded by \A.
1052     </P>
1053     <P>
1054     In cases where it is known that the subject string contains no newlines, it is
1055     worth setting PCRE_DOTALL in order to obtain this optimization, or
1056     alternatively using ^ to indicate anchoring explicitly.
1057     </P>
1058     <P>
1059     However, there is one situation where the optimization cannot be used. When .*
1060     is inside capturing parentheses that are the subject of a backreference
1061     elsewhere in the pattern, a match at the start may fail, and a later one
1062     succeed. Consider, for example:
1063     <pre>
1064     (.*)abc\1
1065 nigel 75 </pre>
1066 nigel 63 If the subject is "xyz123abc123" the match point is the fourth character. For
1067     this reason, such a pattern is not implicitly anchored.
1068     </P>
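<P>
The match point in this example can be confirmed with a short sketch using
Python's <b>re</b> module as a stand-in. Offsets in Python are zero-based, so
the fourth character is reported as offset 3.
</P>

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# No match is possible at offsets 0 to 2, because the back reference must
# repeat whatever (.*) captured; the match starts at the fourth character.
start = match.start()
captured = match.group(1)
print(start, captured, match.group())
```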
1069     <P>
1070     When a capturing subpattern is repeated, the value captured is the substring
1071     that matched the final iteration. For example, after
1072     <pre>
1073     (tweedle[dume]{3}\s*)+
1074 nigel 75 </pre>
1075 nigel 63 has matched "tweedledum tweedledee" the value of the captured substring is
1076     "tweedledee". However, if there are nested capturing subpatterns, the
1077     corresponding captured values may have been set in previous iterations. For
1078     example, after
1079     <pre>
1080     /(a|(b))+/
1081 nigel 75 </pre>
1082 nigel 63 matches "aba" the value of the second captured substring is "b".
1083 nigel 75 <a name="atomicgroup"></a></P>
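<P>
Both examples can be reproduced with a sketch in Python's <b>re</b> module,
which is assumed to follow the same rule as PCRE for values captured in earlier
iterations.
</P>

```python
import re

# The captured value is the substring matched by the final iteration.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = tweedle.group(1)

# A nested group keeps the value set in an earlier iteration: the final
# iteration matched "a", but group 2 was set when "b" was matched.
nested = re.fullmatch(r"(a|(b))+", "aba")
nested_groups = nested.groups()

print(last_iteration, nested_groups)
```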
1084 nigel 63 <br><a name="SEC13" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
1085     <P>
1086     With both maximizing and minimizing repetition, failure of what follows
1087     normally causes the repeated item to be re-evaluated to see if a different
1088     number of repeats allows the rest of the pattern to match. Sometimes it is
1089     useful to prevent this, either to change the nature of the match, or to cause
1090     it to fail earlier than it otherwise might, when the author of the pattern knows
1091     there is no point in carrying on.
1092     </P>
1093     <P>
1094     Consider, for example, the pattern \d+foo when applied to the subject line
1095     <pre>
1096     123456bar
1097 nigel 75 </pre>
1098 nigel 63 After matching all 6 digits and then failing to match "foo", the normal
1099     action of the matcher is to try again with only 5 digits matching the \d+
1100     item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
1101     (a term taken from Jeffrey Friedl's book) provides the means for specifying
1102     that once a subpattern has matched, it is not to be re-evaluated in this way.
1103     </P>
1104     <P>
1105     If we use atomic grouping for the previous example, the matcher would give up
1106     immediately on failing to match "foo" the first time. The notation is a kind of
1107     special parenthesis, starting with (?&#62; as in this example:
1108     <pre>
1109 nigel 73 (?&#62;\d+)foo
1110 nigel 75 </pre>
1111 nigel 63 This kind of parenthesis "locks up" the part of the pattern it contains once
1112     it has matched, and a failure further into the pattern is prevented from
1113     backtracking into it. Backtracking past it to previous items, however, works as
1114     normal.
1115     </P>
1116     <P>
1117     An alternative description is that a subpattern of this type matches the string
1118     of characters that an identical standalone pattern would match, if anchored at
1119     the current point in the subject string.
1120     </P>
1121     <P>
1122     Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
1123     the above example can be thought of as a maximizing repeat that must swallow
1124     everything it can. So, while both \d+ and \d+? are prepared to adjust the
1125     number of digits they match in order to make the rest of the pattern match,
1126     (?&#62;\d+) can only match an entire sequence of digits.
1127     </P>
1128     <P>
1129     Atomic groups in general can of course contain arbitrarily complicated
1130     subpatterns, and can be nested. However, when the subpattern for an atomic
1131     group is just a single repeated item, as in the example above, a simpler
1132     notation, called a "possessive quantifier" can be used. This consists of an
1133     additional + character following a quantifier. Using this notation, the
1134     previous example can be rewritten as
1135     <pre>
1136 nigel 75 \d++foo
1137     </pre>
1138 nigel 63 Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
1139     option is ignored. They are a convenient notation for the simpler forms of
1140     atomic group. However, there is no difference in the meaning or processing of a
1141     possessive quantifier and the equivalent atomic group.
1142     </P>
1143     <P>
1144 nigel 91 The possessive quantifier syntax is an extension to the Perl syntax. Jeffrey
1145     Friedl originated the idea (and the name) in the first edition of his book.
1146     Mike McCloskey liked it, so implemented it when he built Sun's Java package,
1147     and PCRE copied it from there.
1148 nigel 63 </P>
1149     <P>
1150     When a pattern contains an unlimited repeat inside a subpattern that can itself
1151     be repeated an unlimited number of times, the use of an atomic group is the
1152     only way to avoid some failing matches taking a very long time indeed. The
1153     pattern
1154     <pre>
1155     (\D+|&#60;\d+&#62;)*[!?]
1156 nigel 75 </pre>
1157 nigel 63 matches an unlimited number of substrings that either consist of non-digits, or
1158     digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
1159     quickly. However, if it is applied to
1160     <pre>
1161     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1162 nigel 75 </pre>
1163 nigel 63 it takes a long time before reporting failure. This is because the string can
1164 nigel 75 be divided between the internal \D+ repeat and the external * repeat in a
1165     large number of ways, and all have to be tried. (The example uses [!?] rather
1166     than a single character at the end, because both PCRE and Perl have an
1167     optimization that allows for fast failure when a single character is used. They
1168     remember the last single character that is required for a match, and fail early
1169     if it is not present in the string.) If the pattern is changed so that it uses
1170     an atomic group, like this:
1171 nigel 63 <pre>
1172     ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
1173 nigel 75 </pre>
1174 nigel 63 sequences of non-digits cannot be broken, and failure happens quickly.
1175 nigel 75 <a name="backreferences"></a></P>
1176 nigel 63 <br><a name="SEC14" href="#TOC1">BACK REFERENCES</a><br>
1177     <P>
1178     Outside a character class, a backslash followed by a digit greater than 0 (and
1179     possibly further digits) is a back reference to a capturing subpattern earlier
1180     (that is, to its left) in the pattern, provided there have been that many
1181     previous capturing left parentheses.
1182     </P>
1183     <P>
1184     However, if the decimal number following the backslash is less than 10, it is
1185     always taken as a back reference, and causes an error only if there are not
1186     that many capturing left parentheses in the entire pattern. In other words, the
1187     parentheses that are referenced need not be to the left of the reference for
1188 nigel 91 numbers less than 10. A "forward back reference" of this type can make sense
1189     when a repetition is involved and the subpattern to the right has participated
1190     in an earlier iteration.
1191     </P>
1192     <P>
1193     It is not possible to have a numerical "forward back reference" to a subpattern
1194     whose number is 10 or more. However, a back reference to any subpattern is
1195     possible using named parentheses (see below). See also the subsection entitled
1196     "Non-printing characters"
1197 nigel 75 <a href="#digitsafterbackslash">above</a>
1198     for further details of the handling of digits following a backslash.
1199 nigel 63 </P>
1200     <P>
1201     A back reference matches whatever actually matched the capturing subpattern in
1202     the current subject string, rather than anything matching the subpattern
1203     itself (see
1204     <a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
1205     below for a way of doing that). So the pattern
1206     <pre>
1207     (sens|respons)e and \1ibility
1208 nigel 75 </pre>
1209 nigel 63 matches "sense and sensibility" and "response and responsibility", but not
1210     "sense and responsibility". If caseful matching is in force at the time of the
1211     back reference, the case of letters is relevant. For example,
1212     <pre>
1213     ((?i)rah)\s+\1
1214 nigel 75 </pre>
1215 nigel 63 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1216     capturing subpattern is matched caselessly.
1217     </P>
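<P>
A minimal sketch of the first example, showing that a back reference matches
the text that was actually captured rather than the subpattern. Python's
<b>re</b> module is used as an assumed stand-in for PCRE.
</P>

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# \1 must repeat what group 1 captured in this particular match.
matched = [s for s in subjects if pattern.fullmatch(s)]
print(matched)
```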
1218     <P>
1219     Back references to named subpatterns use the Python syntax (?P=name). We could
1220     rewrite the above example as follows:
1221     <pre>
1222 nigel 91 (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
1223 nigel 75 </pre>
1224 nigel 91 A subpattern that is referenced by name may appear in the pattern before or
1225     after the reference.
1226     </P>
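<P>
The named form of a back reference can be tried in Python's <b>re</b> module,
from which the syntax is taken. The group name "stem" is invented for this
sketch, and the named group appears before the reference, which is the only
order Python accepts.
</P>

```python
import re

pattern = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

# (?P=stem) behaves like \1, but refers to the group by name.
good = pattern.fullmatch("response and responsibility")
bad = pattern.fullmatch("sense and responsibility")

print(good.group("stem"), bad)
```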
1227     <P>
1228 nigel 63 There may be more than one back reference to the same subpattern. If a
1229     subpattern has not actually been used in a particular match, any back
1230     references to it always fail. For example, the pattern
1231     <pre>
1232     (a|(bc))\2
1233 nigel 75 </pre>
1234 nigel 63 always fails if it starts to match "a" rather than "bc". Because there may be
1235     many capturing parentheses in a pattern, all digits following the backslash are
1236     taken as part of a potential back reference number. If the pattern continues
1237     with a digit character, some delimiter must be used to terminate the back
1238     reference. If the PCRE_EXTENDED option is set, this can be whitespace.
1239 nigel 75 Otherwise an empty comment (see
1240     <a href="#comments">"Comments"</a>
1241     below) can be used.
1242 nigel 63 </P>
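<P>
The rule that a reference to an unused subpattern always fails can be sketched
as follows. Python's <b>re</b> module is assumed to agree with PCRE here; the
subject strings are illustrative.
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# If the first alternative is taken, group 2 is unset and \2 cannot match.
via_a = pattern.match("a")
via_aa = pattern.match("aa")

# If "bc" is taken, group 2 is set and \2 must repeat it.
via_bc = pattern.match("bcbc")

print(via_a, via_aa, via_bc.group())
```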
<P>
A back reference that occurs inside the parentheses to which it refers fails
when the subpattern is first used, so, for example, (a\1) never matches.
However, such references can be useful inside repeated subpatterns. For
example, the pattern
<pre>
(a|b\1)+
</pre>
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a quantifier with a
minimum of zero.
<a name="bigassertions"></a></P>
<br><a name="SEC15" href="#TOC1">ASSERTIONS</a><br>
<P>
An assertion is a test on the characters following or preceding the current
matching point that does not actually consume any characters. The simple
assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<a href="#smallassertions">above.</a>
</P>
<P>
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
that look behind it. An assertion subpattern is matched in the normal way,
except that it does not cause the current matching position to be changed.
</P>
<P>
Assertion subpatterns are not capturing subpatterns, and may not be repeated,
because it makes no sense to assert the same thing several times. If any kind
of assertion contains capturing subpatterns within it, these are counted for
the purposes of numbering the capturing subpatterns in the whole pattern.
However, substring capturing is carried out only for positive assertions,
because it does not make sense for negative assertions.
</P>
<br><b>
Lookahead assertions
</b><br>
<P>
Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,
<pre>
\w+(?=;)
</pre>
matches a word followed by a semicolon, but does not include the semicolon in
the match, and
<pre>
foo(?!bar)
</pre>
matches any occurrence of "foo" that is not followed by "bar". Note that the
apparently similar pattern
<pre>
(?!foo)bar
</pre>
does not find an occurrence of "bar" that is preceded by something other than
"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
(?!foo) is always true when the next three characters are "bar". A
lookbehind assertion is needed to achieve the other effect.
</P>
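<P>
The three lookahead patterns above behave the same way in Python's <b>re</b>
module, so they can be tried there as a quick check. This is an illustrative
sketch only; the constant names and the subject strings are invented for the
example.
</P>

```python
import re

WORD_BEFORE_SEMICOLON = re.compile(r"\w+(?=;)")
FOO_NOT_BEFORE_BAR = re.compile(r"foo(?!bar)")
MISLEADING = re.compile(r"(?!foo)bar")

if __name__ == "__main__":
    # The semicolon is tested but is not part of the matched text.
    print(WORD_BEFORE_SEMICOLON.search("count;").group())

    # The first "foo" is followed by "bar", so the second one is found.
    print(FOO_NOT_BEFORE_BAR.search("foobar foobaz").start())

    # The assertion is tested where "bar" starts, so it cannot see "foo".
    print(MISLEADING.search("foobar").start())
```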
<P>
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!) because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
<a name="lookbehind"></a></P>
<br><b>
Lookbehind assertions
</b><br>
<P>
Lookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
negative assertions. For example,
<pre>
(?&#60;!foo)bar
</pre>
does find an occurrence of "bar" that is not preceded by "foo". The contents of
a lookbehind assertion are restricted such that all the strings it matches must
have a fixed length. However, if there are several top-level alternatives, they
do not all have to have the same fixed length. Thus
<pre>
(?&#60;=bullock|donkey)
</pre>
is permitted, but
<pre>
(?&#60;!dogs?|cats?)
</pre>
causes an error at compile time. Branches that match different length strings
are permitted only at the top level of a lookbehind assertion. This is an
extension compared with Perl (at least for 5.8), which requires all branches to
match the same length of string. An assertion such as
<pre>
(?&#60;=ab(c|de))
</pre>
is not permitted, because its single top-level branch can match two different
lengths, but it is acceptable if rewritten to use two top-level branches:
<pre>
(?&#60;=abc|abde)
</pre>
The implementation of lookbehind assertions is, for each alternative, to
temporarily move the current position back by the fixed width and then try to
match. If there are insufficient characters before the current position, the
match is deemed to fail.
</P>
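<P>
The implementation strategy just described can be sketched in a few lines of
plain Python. This is a simplified model and not PCRE's code: it assumes every
top-level alternative is a literal string, so its fixed width is simply its
length, and the function name <i>lookbehind</i> is invented for the example.
</P>

```python
def lookbehind(subject, pos, alternatives):
    """Model a positive lookbehind at offset pos.

    Each alternative is assumed to be a literal string. For each one, step
    back by its fixed width and compare; if there are too few characters
    before pos, that alternative fails.
    """
    for alt in alternatives:
        start = pos - len(alt)
        if start < 0:
            continue  # insufficient characters before the current position
        if subject[start:pos] == alt:
            return True
    return False


if __name__ == "__main__":
    # Models (?<=bullock|donkey) tested at the end of each subject.
    for subject in ("the donkey", "a bullock", "key"):
        print(subject, lookbehind(subject, len(subject), ["bullock", "donkey"]))
```

<P>
Because each alternative is tried with its own width, the alternatives need not
share one length, which mirrors the top-level rule described above.
</P>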
<P>
PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
to appear in lookbehind assertions, because it makes it impossible to calculate
the length of the lookbehind. The \X escape, which can match different numbers
of bytes, is also not permitted.
</P>
<P>
Atomic groups can be used in conjunction with lookbehind assertions to specify
efficient matching at the end of the subject string. Consider a simple pattern
such as
<pre>
abcd$
</pre>
when applied to a long string that does not match. Because matching proceeds
from left to right, PCRE will look for each "a" in the subject and then see if
what follows matches the rest of the pattern. If the pattern is specified as
<pre>
^.*abcd$
</pre>
the initial .* matches the entire string at first, but when this fails (because
there is no following "a"), it backtracks to match all but the last character,
then all but the last two characters, and so on. Once again the search for "a"
covers the entire string, from right to left, so we are no better off. However,
if the pattern is written as
<pre>
^(?&#62;.*)(?&#60;=abcd)
</pre>
or, equivalently, using the possessive quantifier syntax,
<pre>
^.*+(?&#60;=abcd)
</pre>
there can be no backtracking for the .* item; it can match only the entire
string. The subsequent lookbehind assertion does a single test on the last four
characters. If it fails, the match fails immediately. For long strings, this
approach makes a significant difference to the processing time.
</P>
<br><b>
Using multiple assertions
</b><br>
<P>
Several assertions (of any sort) may occur in succession. For example,
<pre>
(?&#60;=\d{3})(?&#60;!999)foo
</pre>
matches "foo" preceded by three digits that are not "999". Notice that each of
the assertions is applied independently at the same point in the subject
string. First there is a check that the previous three characters are all
digits, and then there is a check that the same three characters are not "999".
This pattern does <i>not</i> match "foo" preceded by six characters, the first
three of which are digits and the last three of which are not "999". For
example, it doesn't match "123abcfoo". A pattern to do that is
<pre>
(?&#60;=\d{3}...)(?&#60;!999)foo
</pre>
This time the first assertion looks at the preceding six characters, checking
that the first three are digits, and then the second assertion checks that the
preceding three characters are not "999".
</P>
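<P>
Both patterns use fixed-width lookbehinds, so they can also be tried in
Python's <b>re</b> module. The sketch below is only an illustration; the
constant names and the extra subject strings are invented for the example.
</P>

```python
import re

# Both assertions are tested at the same point, just before "foo".
SAME_POINT = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters instead of three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(
            subject,
            SAME_POINT.search(subject) is not None,
            SIX_BACK.search(subject) is not None,
        )
```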
<P>
Assertions can be nested in any combination. For example,
<pre>
(?&#60;=(?&#60;!foo)bar)baz
</pre>
matches an occurrence of "baz" that is preceded by "bar" which in turn is not
preceded by "foo", while
<pre>
(?&#60;=\d{3}(?!999)...)foo
</pre>
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="conditions"></a></P>
<br><a name="SEC16" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<P>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
the result of an assertion, or whether a previous capturing subpattern matched
or not. The two possible forms of conditional subpattern are
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
</pre>
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two alternatives in the
subpattern, a compile-time error occurs.
</P>
<P>
There are three kinds of condition. If the text between the parentheses
consists of a sequence of digits, or a sequence of alphanumeric characters and
underscores, the condition is satisfied if the capturing subpattern of that
number or name has previously matched. There is a possible ambiguity here,
because subpattern names may consist entirely of digits. PCRE looks first for a
named subpattern; if it cannot find one and the text consists entirely of
digits, it looks for a subpattern of that number, which must be greater than
zero. Using subpattern names that consist entirely of digits is not
recommended.
</P>
<P>
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
three parts for ease of discussion:
<pre>
( \( )? [^()]+ (?(1) \) )
</pre>
The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The second part
matches one or more characters that are not parentheses. The third part is a
conditional subpattern that tests whether the first set of parentheses matched
or not. If they did, that is, if the subject started with an opening
parenthesis, the condition is true, and so the yes-pattern is executed and a
closing parenthesis is required. Otherwise, since no-pattern is not present,
the subpattern matches nothing. In other words, this pattern matches a sequence
of non-parentheses, optionally enclosed in parentheses. Rewriting it to use a
named subpattern gives this:
<pre>
(?P&#60;OPEN&#62; \( )? [^()]+ (?(OPEN) \) )
</pre>
If the condition is the string (R), and there is no subpattern with the name R,
the condition is satisfied if a recursive call to the pattern or subpattern has
been made. At "top level", the condition is false. This is a PCRE extension.
Recursive patterns are described in the next section.
</P>
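<P>
The numbered and named forms of the parenthesis example can both be tried in
Python's <b>re</b> module, which accepts the same (?(1)...) and (?(name)...)
conditions; the (R) condition has no counterpart there and is not shown. In
this sketch re.VERBOSE is assumed to stand in for PCRE_EXTENDED, and the
constant names are invented for the example.
</P>

```python
import re

# Group 1 records whether an opening parenthesis was seen; the condition
# then demands a closing parenthesis only in that case.
BY_NUMBER = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
BY_NAME = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

if __name__ == "__main__":
    for subject in ("(abc)", "abc", "(abc", "abc)"):
        print(
            subject,
            BY_NUMBER.fullmatch(subject) is not None,
            BY_NAME.fullmatch(subject) is not None,
        )
```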
<P>
If the condition is not a sequence of digits, a subpattern name, or (R), it
must be an assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
<pre>
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
</pre>
The condition is a positive lookahead assertion that matches an optional
sequence of non-letters followed by a letter. In other words, it tests for the
presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
<a name="comments"></a></P>
<br><a name="SEC17" href="#TOC1">COMMENTS</a><br>
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. The characters
that make up a comment play no part in the pattern matching at all.
</P>
<P>
If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues to immediately after the
next newline in the pattern.
<a name="recursion"></a></P>
<br><a name="SEC18" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
be done is to use a pattern that matches up to some fixed depth of nesting. It
is not possible to handle an arbitrary nesting depth. Perl provides a facility
that allows regular expressions to recurse (amongst other things). It does this
by interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern to solve the parentheses problem
can be created like this:
<pre>
$re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
</pre>
The (?p{...}) item interpolates Perl code at run time, and in this case refers
recursively to the pattern in which it appears. Obviously, PCRE cannot support
the interpolation of Perl code. Instead, it supports some special syntax for
recursion of the entire pattern, and also for individual subpattern recursion.
</P>
<P>
The special item that consists of (? followed by a number greater than zero and
a closing parenthesis is a recursive call of the subpattern of the given
number, provided that it occurs inside that subpattern. (If not, it is a
"subroutine" call, which is described in the next section.) The special item
(?R) is a recursive call of the entire regular expression.
</P>
<P>
A recursive subpattern call is always treated as an atomic group. That is, once
it has matched some of the subject string, it is never re-entered, even if
it contains untried alternatives and there is a subsequent matching failure.
</P>
<P>
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
<pre>
\( ( (?&#62;[^()]+) | (?R) )* \)
</pre>
First it matches an opening parenthesis. Then it matches any number of
substrings, each of which can be either a sequence of non-parentheses or a
recursive match of the pattern itself (that is, a correctly parenthesized
substring). Finally there is a closing parenthesis.
</P>
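<P>
Python's <b>re</b> module has no (?R) item, so the sketch below models the
pattern above as a small hand-written recursive function instead of a regular
expression. It is an illustration of the matching logic under that assumption,
not PCRE's implementation, and the name <i>match_parens</i> is invented for the
example.
</P>

```python
def match_parens(subject, pos=0):
    """Model of  \\( ( (?>[^()]+) | (?R) )* \\)  anchored at pos.

    Returns the offset just past the matching closing parenthesis, or -1 if
    no balanced parenthesized substring starts at pos.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1  # the opening parenthesis
    while pos < len(subject):
        char = subject[pos]
        if char == ")":
            return pos + 1  # the closing parenthesis
        if char == "(":
            pos = match_parens(subject, pos)  # plays the role of (?R)
            if pos == -1:
                return -1
        else:
            # Plays the role of (?>[^()]+): take the whole run of
            # non-parentheses and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1  # ran out of subject before the closing parenthesis


if __name__ == "__main__":
    for subject in ("(ab(cd)ef)", "(ab(cd)ef", "()", "abc"):
        print(subject, match_parens(subject))
```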
<P>
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
<pre>
( \( ( (?&#62;[^()]+) | (?1) )* \) )
</pre>
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern. In a larger pattern, keeping track of
parenthesis numbers can be tricky. It may be more convenient to use named
parentheses instead. For this, PCRE uses (?P&#62;name), which is an extension to
the Python syntax that PCRE uses for named parentheses (Perl does not provide
named parentheses). We could rewrite the above example as follows:
<pre>
(?P&#60;pn&#62; \( ( (?&#62;[^()]+) | (?P&#62;pn) )* \) )
</pre>
This particular example pattern contains nested unlimited repeats, and so the
use of atomic grouping for matching strings of non-parentheses is important
when applying the pattern to strings that do not match. For example, when this
pattern is applied to
<pre>
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
</pre>
it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many different
ways the + and * repeats can carve up the subject, and all have to be tested
before failure can be reported.
</P>
<P>
At the end of a match, the values set for any capturing subpatterns are those
from the outermost level of the recursion at which the subpattern value is set.
If you want to obtain intermediate values, a callout function can be used (see
the next section and the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation). If the pattern above is matched against
<pre>
(ab(cd)ef)
</pre>
the value for the capturing parentheses is "ef", which is the last value taken
on at the top level. If additional parentheses are added, giving
<pre>
\( ( ( (?&#62;[^()]+) | (?R) )* ) \)
   ^                        ^
   ^                        ^
</pre>
the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
has to obtain extra memory to store data during a recursion, which it does by
using <b>pcre_malloc</b>, freeing it via <b>pcre_free</b> afterwards. If no
memory can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
</P>
<P>
Do not confuse the (?R) item with the condition (R), which tests for recursion.
Consider this pattern, which matches text in angle brackets, allowing for
arbitrary nesting. Only digits are allowed in nested brackets (that is, when
recursing), whereas any characters are permitted at the outer level.
<pre>
&#60; (?: (?(R) \d++ | [^&#60;&#62;]*+) | (?R)) * &#62;
</pre>
In this pattern, (?(R) is the start of a conditional subpattern, with two
different alternatives for the recursive and non-recursive cases. The (?R) item
is the actual recursive call.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC19" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive subpattern reference (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
subroutine in a programming language. An earlier example pointed out that the
pattern
<pre>
(sens|respons)e and \1ibility
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If instead the pattern
<pre>
(sens|respons)e and (?1)ibility
</pre>
is used, it does match "sense and responsibility" as well as the other two
strings. Such references, if given numerically, must follow the subpattern to
which they refer. However, named references can refer to later subpatterns.
</P>
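<P>
The difference between a back reference and a "subroutine" call can be
approximated in Python's <b>re</b> module. Python has no (?1) item, so this
sketch assumes that repeating the text of the subpattern in a non-capturing
group is an acceptable stand-in for the call; the constant names are invented
for the example.
</P>

```python
import re

# \1 must match the same text that group 1 captured.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again, so either
# alternative may match regardless of what group 1 captured.
SUBROUTINE_LIKE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

if __name__ == "__main__":
    for subject in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    ):
        print(
            subject,
            BACKREF.fullmatch(subject) is not None,
            SUBROUTINE_LIKE.fullmatch(subject) is not None,
        )
```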
<P>
Like recursive subpatterns, a "subroutine" call is always treated as an atomic
group. That is, once it has matched some of the subject string, it is never
re-entered, even if it contains untried alternatives and there is a subsequent
matching failure.
</P>
<br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
possible, amongst other things, to extract different substrings that match the
same pair of parentheses when there is a repetition.
</P>
<P>
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
code. The feature is called "callout". The caller of PCRE provides an external
function by putting its entry point in the global variable <i>pcre_callout</i>.
By default, this variable contains NULL, which disables all calling out.
</P>
<P>
Within a regular expression, (?C) indicates the points at which the external
function is to be called. If you want to identify different callout points, you
can put a number less than 256 after the letter C. The default value is zero.
For example, this pattern has two callout points:
<pre>
(?C1)\dabc(?C2)def
</pre>
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are
automatically installed before each item in the pattern. They are all numbered
255.
</P>
<P>
During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is
set), the external function is called. It is provided with the number of the
callout, the position in the pattern, and, optionally, one item of data
originally supplied by the caller of <b>pcre_exec()</b>. The callout function
may cause matching to proceed, to backtrack, or to fail altogether. A complete
description of the interface to the callout function is given in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
</P>
<P>
Last updated: 06 June 2006
<br>
Copyright &copy; 1997-2006 University of Cambridge.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
